Topic Modeling with MALLET

I was recently inspired by the Journal of Digital Humanities issue on topic modeling in the humanities to do some topic modeling of my own. Much of my research has deliberately avoided this technique because I have been concerned with other matters of discourse and, honestly, because I had more than enough natural language processing tasks to occupy me, not the least of which is hand-coding and compiling a corpus of rhetorical moves.

MALLET, however, provides an easy method to import documents and create topic models based on latent Dirichlet allocation.

For this experiment, I processed 77 scientific articles from journals specializing in human evolutionary biology, climate studies, poultry science, and plant biology. I used MALLET’s default stopword list and generated 20 topics. I should note here that the science article files could be cleaner. Some artifacts of previous processing and analysis were present; however, because this is only an exploratory experiment in topic modeling, my concern over these idiosyncrasies is minimal.
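For readers who want to try something similar, here is a minimal sketch of how the import and training steps might be scripted. It only builds the two command lines and does not run MALLET itself. The directory and file names (articles, out, articles.mallet, topic_keys.txt, doc_topics.txt) and the bin/mallet path are hypothetical, and these are not necessarily the exact commands I used; the flags are the standard ones from MALLET's import-dir and train-topics tools.

```python
import shlex
from pathlib import Path


def mallet_commands(corpus_dir, work_dir, num_topics=20, mallet_bin="bin/mallet"):
    """Build (but do not run) MALLET import and training commands.

    All paths are hypothetical placeholders; adjust them to your setup.
    """
    work = Path(work_dir)
    corpus_file = work / "articles.mallet"  # hypothetical output name

    # Step 1: import a directory of plain-text files, one document per file,
    # removing words on MALLET's default English stopword list.
    import_cmd = [
        mallet_bin, "import-dir",
        "--input", str(corpus_dir),
        "--output", str(corpus_file),
        "--keep-sequence",
        "--remove-stopwords",
    ]

    # Step 2: train an LDA topic model and write out the topic keys
    # and the per-document topic proportions.
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", str(corpus_file),
        "--num-topics", str(num_topics),
        "--output-topic-keys", str(work / "topic_keys.txt"),
        "--output-doc-topics", str(work / "doc_topics.txt"),
    ]
    return import_cmd, train_cmd


if __name__ == "__main__":
    # Print the commands so they can be inspected or pasted into a shell.
    for cmd in mallet_commands("articles", "out"):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without MALLET installed; passing each list to subprocess.run would execute it.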

Below, you will find a table that contains the topics and keys generated from these 77 scientific articles. Each row gives the topic number, the topic's weight, and its top key words.

0 0.03434 true kin fertility women living children residence age time marriage birth influence contraceptive child number journal significant virilocally effect
1 0.0308 model class cuii mhp female xala site ann models hypergyny probability homosexual values mlr females migration stratification binding societies
2 0.10804 al egg fed eggs diet hens diets breed higher age laying fatty feed birds meal observed acid kadaknath aseel
3 0.0375 local forest people land resources production households adaptation groups access livestock income policies areas gum drought trade collection government
4 0.04795 true litter al birds perfringens ice broiler production broilers treatment cake false flocks barrier flock chicks density pen content
5 0.01368 heels high female attractiveness walkers flat gait wearing shoes participants judgements females cv sd condition flexion attractive women walking
6 0.02839 reaction yield equiv table cl metal scheme temperature cs alcohol precipitation thumbnail catalyst mol image article product phosphine entry
7 0.08426 al immune response stress expression onac il cells corticosterone responses birds chickens innate dietary genes treatment rice cell system
8 0.04974 temperature hens gene embryonic incubation heat experiment al egg eggs early mortality feathered higher performance development dw ambient stress
9 0.06695 social support individual moralization learning trait individuals optimum mating payoff behavior artifact trial strategy friends eq population learner strategies
10 0.04691 cdm projects countries ldcs project seed supply coat cer cers nanoparticle protein demand cent energy poas potential eu scenarios
11 0.73786 effects high data important time increase conditions increased study results effect significant level studies similar higher factors table
12 0.0769 al bcn binding avt uv bacteria pituitary light birds chicken receptor cell bacterial crh neurohypophysis jejuni ct peptides genes
13 0.04168 women male men preferences faces masculinity cues wealth facial exposure high competition low sex female ratings images participants scents
14 0.0546 carbon environmental pes services climate disaster change energy development emissions adaptation service countries interventions drm local land reconstruction reduce
15 0.09328 adaptation climate change sustainable social development risk vulnerability state problematization policy knowledge poverty report practices neoliberal context discourse
16 0.02862 children reciprocity partner choice altruism games participants indirect previous cooperation age model public behavior contributions sex goods shared partners
17 0.08861 al propolis mc samples rev kg ml min study vaccine poultry mg virus concentration fowl performed reported pcr chicks
18 0.09039 animal welfare animals selection genetic al traits production breeding environment poultry species activity behavior ducks natural physiological genetics birds
19 0.03237 climate countries baseline finance al strains da salmonella developing funds resistance cent poultry level enteritidis sources additional oda global

Read thematically, the table shows that MALLET has grouped the key words into fairly cohesive topics. Terms associated with climate cluster together (Topics 14 and 15). Meanwhile, terms such as “egg” cluster with “incubation” and “hens” (Topic 8). That more topics are pulled toward poultry science seems to reflect the larger share of poultry science articles in the corpus.

The most notable finding, to me, is Topic 11. It has been judged the most frequent topic in the corpus by a fair margin (a weight of 0.73786, against 0.10804 for the next highest, Topic 2), and it involves what could be considered genre cues of scientific research: “data,” “results,” “significant,” “important,” “high.” These terms are generally, and generically, found in the abstract, introduction, and discussion sections of scientific articles, and they cut across disciplines.
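The comparison of topic weights can be checked mechanically. The following is a minimal sketch, assuming each row of the keys table has the form "topic number, weight, key words" separated by whitespace, as in the table above. The function names are my own, and the sample rows are copied from three of the twenty topics. As I understand MALLET's output, the weight column is the Dirichlet parameter for the topic, so treat the ranking as a rough indicator of prominence rather than an exact document count.

```python
def parse_topic_keys(lines):
    """Parse rows of the form '<id> <weight> <key words...>'.

    Rows that do not start with an integer topic id (for example,
    blank lines or wrapped fragments) are skipped.
    """
    topics = []
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit():
            continue
        topic_id, weight, keys = parts
        topics.append((int(topic_id), float(weight), keys.split()))
    return topics


def rank_by_weight(topics):
    """Return topics sorted from highest to lowest weight."""
    return sorted(topics, key=lambda t: t[1], reverse=True)


# Three rows copied from the table above (Topics 2, 11, and 15).
SAMPLE_ROWS = [
    "2 0.10804 al egg fed eggs diet hens diets breed higher age laying "
    "fatty feed birds meal observed acid kadaknath aseel",
    "11 0.73786 effects high data important time increase conditions "
    "increased study results effect significant level studies similar "
    "higher factors table",
    "15 0.09328 adaptation climate change sustainable social development "
    "risk vulnerability state problematization policy knowledge poverty "
    "report practices neoliberal context discourse",
]

if __name__ == "__main__":
    for topic_id, weight, keys in rank_by_weight(parse_topic_keys(SAMPLE_ROWS)):
        print(topic_id, weight, " ".join(keys[:5]))
```

Run on the full keys file rather than the three sample rows, the same ranking would list all 20 topics in order of weight.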