Finding Genre Signals in Academic Writing: Benchmarking Method

The following post functions as supplementary material for “Finding Genre Signals in Academic Writing,” published in the Journal of Writing Research. This post explains how we automatically processed 505 research articles from the Springer OpenAccess database to separate citational sentences from non-citational sentences. While the primary analysis of “Finding Genre Signals in Academic Writing” relies on hand-coded sentences, we developed this automated routine to test the viability of our citational coding scheme (which targets the lexical content of the sentence) and with an eye toward future citation analysis projects that may benefit from automated analysis.

To gather citation data with which to benchmark our coding scheme and surface-level parser, and to gain a global sense of how the Extraction, Grouping, and Author(s) as Actant(s) citational types operate within the larger field of academic research, we screen scraped 505 research articles from journals hosted by Springer OpenAccess. These journals are peer reviewed and write to the genre conventions of academic audiences, including the Introduction-Methods-Results-Discussion (IMRaD) format often used to structure scientific and social scientific journals (see Christensen and Kawakami, 2009; Hannick and Flanigan, 2013; Salager-Meyer, 1994). This screen scrape captured the metadata of each article (author names, date of publication, institutional affiliation, and digital object identifier), the full text of the article without images, and the works cited list. Only article types labeled as “Research Article” by the Springer OpenAccess filtering tool were used for this exploratory analysis.

We then tokenized each article at the sentence level. Using regular expression searches, we tagged all in-text citations and non-citations. For this study, an in-text citation denotes a sentence-token that attributes a source via author name, or author name and date of publication, in Harvard-style in-text citation formatting. Because citation style varied across journals due to vagaries in HTML markup presentations, we narrowed our selection to those journals that employed the following in-text citation patterns:

Author last name (Year of Publication)
2 author last names (Year of Publication)
First author last name, et al. (Year of Publication)
Author last name (Year of Publication + a-z index where different articles by the same authors appear)
2 author last names (Year of Publication + a-z index where different articles by the same authors appear)
First author last name, et al. (Year of Publication + a-z index where different articles by the same authors appear)
Author last name [Year of Publication]
2 author last names [Year of Publication]
First author last name, et al. [Year of Publication]
Author last name [Year of Publication + a-z index where different articles by the same authors appear]
2 author last names [Year of Publication + a-z index where different articles by the same authors appear]
First author last name, et al. [Year of Publication + a-z index where different articles by the same authors appear]
Author last name ([Year of Publication])
2 author last names ([Year of Publication])
First author last name, et al. ([Year of Publication])
Author last name ([Year of Publication + a-z index where different articles by the same authors appear])
2 author last names ([Year of Publication + a-z index where different articles by the same authors appear])
First author last name, et al. ([Year of Publication + a-z index where different articles by the same authors appear])
(Author last name Year of Publication)
(2 author last names [Year of Publication])
(First author last name, et al. [Year of Publication])
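As a rough illustration of this pattern matching (a simplified sketch, not the journal-specific expressions used in the study; `CITATION_RE` and `is_intext_citation` are names of my own devising), a single regular expression covering several of the variants above might look like:

```python
import re

# Simplified composite of the citation patterns listed above; the study
# used journal-specific expressions rather than one catch-all pattern.
CITATION_RE = re.compile(
    r"[A-Z][a-z]+"                                      # author last name
    r"(?:\s+(?:and|&)\s+[A-Z][a-z]+|,?\s+et al\.?)?"    # optional 2nd author or et al.
    r"\s*[\(\[]\s*"                                     # opening paren or bracket
    r"[\(\[]?"                                          # optional nested bracket, e.g. ([2008])
    r"(?:19|20)\d{2}[a-z]?"                             # year, with optional a-z index
    r"[\)\]]?"
    r"\s*[\)\]]"
)

def is_intext_citation(sentence):
    """Tag a sentence 1 if it matches a citation pattern, else 0."""
    return 1 if CITATION_RE.search(sentence) else 0
```

A sentence such as “Smith (2004a) argued…” or “Berkman et al ([2008]) found…” matches, while an attribution with no author-date marker does not.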

Works cited entries were excluded from this pattern matching. In addition, statements that might be considered citational in nature but did not contain explicit references to authors or dates of publication were also excluded. Consider, for example, the following sentence from Ogada, et al. (2014):

These authors concluded that initial adoption may be low due to imperfect information on management and profitability of the new technology but as this becomes clearer from the experiences of their neighbors and their own experience, adoption is scaled up.

While this sentence functions to synthesize the work of several authors previously cited in the article by Ogada, et al. (2014), it does not contain markers of author attribution or date of publication. Thus, interpretive sentences of this type were not included in the initial in-text citation search. Although we do see the potential contribution of tracking these rhetorical moves of extended synthesis, making judgments about the nature of such moves proved difficult for the lexical pattern matching routines.

For sentences that name authors but do not provide a date of publication, we configured the screen scraper program to parse the DOM tree of the article for its References section. The last name of the primary author of each cited publication is sequestered into a list. If an in-text citation has stumped the initial regular expression searching parameters and received a tag of “non-intext-citation”, the script then checks for the presence of a primary author’s last name in the sentence by comparing the extant words with the list of author last names compiled in the screen scrape. To reduce spurious matches, only names greater than two characters in length are retained. If there is a match between a first author’s last name and a word in the “non-intext-citation” sentence, the tag is changed to “intext-citation.” This update of the search protocol assumes that a correspondence between a capitalized word and an author name listed in the reference section of the article most likely indicates an in-text citation. In some cases, an author’s last name can also function as a content word (verb or noun), leading to a falsely assigned label. For a generic example, consider an author whose last name is “House.” The entry “House” would not match “house” because the latter lacks an initial capital letter; however, a sentence containing the collocation “White House” would lead to a false positive. Another false permutation that we encountered in the study occurs when an article discusses an organization and cites work produced by that organization or by other organizations that share a similar appellation. For example, in Rissler, et al. (2014), the authors write:

In the only nationwide survey of high school science teachers (n = 939), Berkman et al ([2008]) found that at least 17% of biology teachers are young-earth creationists, and about one in eight teach creationism or intelligent design in a positive light.

Only 23% of teachers strongly agreed that evolution is the unifying theme of biology, as accepted by the National Academy of Science and the National Research Council.

The first sentence from Rissler et al (2014) is tagged as a citation because of the reference to “Berkman et al ([2008]).” The information in the second sentence is sourced from the previous sentence. By our thin definition of what constitutes an in-text citation, the second sentence should not be tagged; however, because the word “Council” is present in the second sentence and “Council” appears in the article’s reference list in the position of a last name, the second sentence is classed as an in-text citation. We consider this cross-referencing step a contingency for articles whose copy-editing is inconsistent with the journal style guide. The bank of names harvested from the reference sections of articles is also reused in a subsequent processing step, which replaces instances of an author’s name in the text with the cognate tag “AUTHOR.”
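A minimal sketch of this cross-referencing fallback (illustrative only; `retag_by_author_name` is not the study’s code, and the simple punctuation stripping is an assumption):

```python
def retag_by_author_name(sentence, reference_surnames):
    """Cross-reference fallback: re-tag a 'non-intext-citation' sentence
    if it contains a first-author surname harvested from the References
    section. Names of two characters or fewer are dropped to limit
    spurious matches; matching is case-sensitive, so 'House' matches
    but 'house' does not."""
    surnames = {name for name in reference_surnames if len(name) > 2}
    words = sentence.replace(',', ' ').replace('.', ' ').split()
    if any(word in surnames for word in words):
        return "intext-citation"
    return "non-intext-citation"
```

Note that this reproduces the false positive described above: a sentence mentioning the “National Research Council” is re-tagged when “Council” happens to sit in the surname position of a reference entry.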

When one of the above citational conditions is met by the lexical content of a sentence, that sentence is tagged as an in-text citation (1). Sentences that do not match the lexical patterns above receive a non-in-text-citation (0) tag.

After initial processing by the screen scraping and the in-text citation/non-in-text citation tagging routines, we pass the marked sentences (now annotated with a 0 or 1) to a second processing module, whose goal is to reduce the syntactic complexity of the in-text citation to more general cognates. This second processing module makes the following substitutions:
publication years featured in in-text citations are replaced with the tag “PUBYEAR”
an author’s last name, if found in the list of names harvested from the reference section of the article, is replaced with the tag “AUTHOR”
parts of speech are tagged by a pre-trained POS tagger, which relies on the Penn Treebank part-of-speech tags; only those tags which indicate verbs, prepositions, and determiners are retained and inserted into the body of the sentence

As an example, we can consider the following sentence from Otten, et al. (2015):

Product Portfolio Management (PPM) is a dynamic decision process, whereby a business list of active (new) products (and R&D) projects is constantly updated and revised (Cooper, Edgett, & Kleinschmidt, [2001]).

Given the above processing step, that sentence would be transformed into:

Product Portfolio Management (PPM) VBZ is DT dynamic decision process, whereby DT business list IN active (new) products (and R&D) projects VBZ is constantly updated and VBN revised (AUTHOR, Edgett, & Kleinschmidt, PUBYEAR).
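This substitution step can be sketched as follows (a minimal illustration: `generalize`, `KEPT_TAGS`, and the year regex are my own names and assumptions, and the POS tags are passed in pre-computed rather than produced by the tagger, to keep the sketch self-contained):

```python
import re

# Penn Treebank tags retained by the module: verbs, prepositions, determiners
KEPT_TAGS = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'IN', 'DT'}

def generalize(tagged_tokens, author_surnames):
    """Apply the module's substitutions to a POS-tagged sentence:
    years become PUBYEAR, harvested surnames become AUTHOR, and
    retained POS tags are inserted before their tokens."""
    out = []
    for token, tag in tagged_tokens:
        if re.fullmatch(r'\[?\(?(?:19|20)\d{2}[a-z]?\)?\]?', token):
            out.append('PUBYEAR')       # publication year, bracketed or not
        elif token in author_surnames:
            out.append('AUTHOR')        # harvested first-author surname
        elif tag in KEPT_TAGS:
            out.append(tag + ' ' + token)
        else:
            out.append(token)
    return ' '.join(out)
```

Feeding in pairs such as `('is', 'VBZ')`, `('Cooper', 'NNP')`, and `('2001', 'CD')` yields the interleaved tag-and-token string shown above.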

After each sentence is tagged by selected parts of speech, AUTHOR, and PUBYEAR, the configuration and/or quantity of the tags are assessed in a third processing module. This third processing module applies the citational coding scheme discussed above as numerical tags (Extraction (1), Grouping (2), and Author(s) as Actant(s) (3)) by comparing the parts of speech, AUTHOR, and PUBYEAR tags to hard-coded lexical patterns fitted to each category of the coding scheme. If a sentence contains parts of speech, AUTHOR, and PUBYEAR tags that match the Extraction category, then that sentence will receive a 1, and so on. This processing module works through elimination:

ignore all sentences tagged as non-in-text citations
tag all in-text citations in which PUBYEAR appears more than 2x as Grouping (2)
compare remaining sentences (i.e., not Grouping) with AUTHOR, parts of speech, and PUBYEAR patterns and designate matches as Author(s) as Actant(s) (3)
tag all remaining in-text citations (i.e., not Grouping or Author(s) as Actant(s)) as Extraction (1)
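The elimination steps above can be sketched as follows (a minimal illustration; the Actant regex is a simplified stand-in for the study’s hard-coded patterns, and `classify_citation` is my own name):

```python
import re

def classify_citation(tagged_sentence):
    """Elimination routine over a generalized sentence string:
    Grouping (2) if three or more PUBYEAR tags appear, then a
    simplified Author(s)-as-Actant(s) (3) check, else Extraction (1)."""
    if tagged_sentence.count('PUBYEAR') >= 3:
        return 2  # Grouping: three or more sources in one sentence
    # A verb tag shortly before the AUTHOR tag approximates "action
    # attributed to a named author", e.g. "VBN discussed IN (AUTHOR ... PUBYEAR)"
    if re.search(r'VB[A-Z]?\s+\S+\s+(?:IN\s+)?\(?AUTHOR', tagged_sentence):
        return 3  # Author(s) as Actant(s)
    return 1      # Extraction
```

Because the checks run in this order, a sentence citing three sources is Grouping even if it also contains an Actant-style verb pattern, mirroring the most-deterministic-first logic described below.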

In one sense, the third processing module moves from the most deterministic coding category to the least deterministic. For our coding scheme, any in-text citation that refers to 3 or more sources within the boundary of the sentence is considered Grouping, regardless of the grammatical construction or the presence of other features that may match Author(s) as Actant(s) or Extraction. Consider an example sentence from Otten, et al. (2015):

A viable alternative for determining the product portfolio (product assortment) is the use of a data mining approach, called Association Rule Mining (ARM), which exposes the interrelationships between products by inspecting a transactional dataset (Brijs, Swinnen, Vanhoof, & Wets, [1999]; [2004]; Fayyad, Piatetsky-Shapiro, & Smyth [1996a], [1996b], [1996c]).

An annotated version of the above Grouping sentence appears as follows:

DT viable alternative IN determining DT product portfolio (product assortment) VBZ is DT use IN DT data mining approach, VBN called Association Rule Mining (ARM), which exposes DT interrelationships IN products IN inspecting DT transactional dataset (AUTHOR, Swinnen, Vanhoof, & Wets, PUBYEAR; PUBYEAR; Fayyad, Piatetsky-Shapiro, & PUBYEAR, PUBYEAR, PUBYEAR). (Otten et al. 2015)

The next most deterministic category is Author(s) as Actant(s), because this category demands that an author be named in the sentence and function as the subject or receiver of an action, and that references to other sources number fewer than 3. Because the Author(s) as Actant(s) category cannot contain more than 2 references, it is excluded from the Grouping category by default. It is excluded from the Extraction category because it contains a direct authorial attribution in which the named author performs the action of the sentence or is the object of the verb of the sentence. Take, for example, the following sentence from Correa Bahnsen, et al. (2015):

Moreover, as discussed in (Verbraken et al [2013]), if the average instead of the total profit is considered and the fixed cost A is discarded since is irrelevant for classifier selection, the profit can be expressed as: (2) Nevertheless, equations (1) and (2), assume that every customer has the same CLV and Co, whereas this is not true in practice.

The above sentence would be tagged in the following manner:

Moreover, IN VBN discussed IN (AUTHOR et al PUBYEAR), IN DT average instead IN DT total profit VBZ is VBN considered and DT VBN fixed cost DT VBZ is VBN discarded IN VBZ is irrelevant IN classifier selection, DT profit can VB be VBD expressed as: (2) Nevertheless, equations (1) and (2), VBP assume IN DT customer VBZ has DT same CLV and Co, whereas DT VBZ is not true IN practice.

In the above example, the key sequence is “IN VBN discussed IN (AUTHOR et al PUBYEAR).” The pattern of past-participle tag (VBN) + past-participle verb + preposition + AUTHOR tag + PUBYEAR tag corresponds to a pre-existing arrangement in the module 3 processor, which assumes that an AUTHOR tag immediately following a verb clause indicates that an action is being attributed to a named author.

All in-text citation sentences that have not received a Grouping (2) or Author(s) as Actant(s) (3) classification are then automatically tagged as Extraction (1). Programmatically, an Extraction (1) classification is any in-text citation that has fewer than three PUBYEAR tags and does not attribute action to a named author within the boundaries of the sentence by making the author the subject or object of an action in an independent or subordinate clause. An example from Correa Bahnsen, et al. (2015) would be:

This assumption does not hold in many real-world applications such as churn modeling, since when misidentifying a churner the financial losses are quite different than when misclassifying a non-churner as churner (Glady et al [2009]).

After processing, the above sentence would be tagged as:

DT assumption VBZ does not VBP hold IN many real-world applications such IN churn modeling, IN when misidentifying DT churner DT financial losses VBP are quite different IN when misclassifying DT non-churner IN churner (AUTHOR et al PUBYEAR).

As we noted at the beginning of the article, the ultimate aim of our work is to accomplish two tasks: (1) compare advisor and advisee texts and (2) output measures of comparison that inform a rhetorical reading of citational moves in academic writing. Doing so means converting raw advisor and advisee texts into computational objects and selecting features from those objects that offer relevant quantitative and qualitative information. In this first pass, the computational object was a “string.” In the next pass, we convert texts to another kind of computational object, a graph, for further analysis.

Python Recipe for NSF Proposals

In a recent Politico article, “No, the GOP is Not at War with Science,” Senator Rand Paul and Representative Lamar Smith addressed criticism of the GOP’s attempt to undermine the peer review process of the National Science Foundation’s and National Institutes of Health’s grant proposal systems.

In their article, Paul and Smith cite what they suggest are frivolous expenditures on research that do not advance national interests. I will not name the studies listed by Paul and Smith, but I do encourage people to read the original article to understand the jeers in context.

I will, however, cite one of Paul and Smith’s justifications for their argument:

Our national debt is more than $18 trillion, and the American taxpayer is hurting. If we, as a country, have decided to spend taxpayers’ hard-earned dollars on funding science and research, then we need to spend wisely. Every dollar spent by the federal government must be spent just as the typical family deals with spending decisions on car payments, child care, food purchases and housing needs.


In order to help researchers appeal to the financial and national scruples of the Republican-controlled Congress, I have written a short script that automatically generates a GOP-friendly NSF grant proposal description in Word.

The Python code (below) requires only the python-docx and lxml >= 2.3.2 packages, which can be downloaded through pip.

__author__ = 'Ryan Omizo'

#Requires python-docx package
#Requires lxml >=2.3.2

from docx import Document
import random

def guns_references():
    dates = [2001, 2007, 2012, 1998, 2002, 2003, 2010, 1985, 1997]
    references = ['Guns, Gun. Gun. (' + str(dates[i]) + '). Guns. Guns: Guns.' for i in range(len(dates))]

    return references

def guns_intext():
    dates = [2001, 2007, 2012, 1998, 2002, 2003, 2010, 1985, 1997]
    in_text = ['(Guns, ' + str(dates[i]) + ')' for i in range(len(dates))]

    return in_text

def guns_sentence():
    word_count = random.randrange(7, 13)

    word_list = []
    for i in range(word_count):
        word_list.insert(0, 'Guns')

    return ' '.join(map(str, word_list))

def guns_paragraph():
    sentence_count = random.randrange(7, 20)

    sentence_list = []
    for i in range(sentence_count):
        sentence_list.append(guns_sentence())

    # Splice a random in-text citation into the middle of a random sentence
    in_text = guns_intext()
    s = sentence_list[random.randrange(len(sentence_list))].split()
    s.insert(len(s) // 2, in_text[random.randrange(len(in_text))])

    t = ' '.join(map(str, s))
    sentence_list.insert(random.randrange(len(sentence_list)), t)

    return ' '.join(map(str, sentence_list))

def guns_section():
    paragraph_count = random.randrange(7, 15)

    paragraph_list = []
    for i in range(paragraph_count):
        paragraph_list.append(guns_paragraph())

    return paragraph_list

def guns_headings():
    r = ['I.', 'II.', 'III.', 'IV.', 'V.', 'VI.']

    headings = [r[i] + ' Guns' for i in range(len(r))]
    return headings

def guns_subheadings():
    r = range(4)[1:]
    subheadings = [str(r[i]) + '. ' + guns_paragraph()  for i in range(len(r))]
    return subheadings

document = Document()
document.add_heading('Guns', 0)

gh = guns_headings()

for i in range(len(gh)):
    # Add the Roman-numeral section heading
    document.add_heading(gh[i], level=1)
    if i == 1:
        gsh = guns_subheadings()
        gs1 = guns_section()
        for item in gsh:
            document.add_paragraph(item)
        for x in gs1:
            document.add_paragraph(x)
    elif i == 3:
        gsh = guns_subheadings()
        gs1 = guns_section()
        for item in gsh:
            document.add_paragraph(item)
        for x in gs1:
            document.add_paragraph(x)
    elif i == 4:
        recordset = [1.1, 1.2, 1.3, 2.1, 2.2, 2.3, 3.1, 3.2, 3.3]
        table = document.add_table(rows=1, cols=4)
        hdr_cells = table.rows[0].cells
        hdr_cells[0].text = 'Guns and Sub-machine guns'
        hdr_cells[1].text = 'Year 1'
        hdr_cells[2].text = 'Year 2'
        hdr_cells[3].text = 'Year 3'
        for j in range(len(recordset)):
            row_cells = table.add_row().cells
            row_cells[0].text = str(recordset[j])
            row_cells[1].text = 'Guns'
            row_cells[2].text = 'Guns'
            row_cells[3].text = 'Guns'
        gs2 = guns_section()
        for g in gs2:
            document.add_paragraph(g)

document.add_heading('References Cited')
citations = guns_references()
for citation in citations:
    document.add_paragraph(citation)

# Save the generated proposal as a Word file
document.save('guns.docx')

Running this code in your favorite Python environment or from the command line will output an editable Word document.



Text Mining the MLA Job Information List Part 2

In Text Mining the MLA Job Information List Part 1, I cobbled together a series of regular expression scripts and list comprehensions in Python to reduce the dimensionality of the October 2012 edition of the MLA Job Information List. This dimensionality reduction removed components such as punctuation, function words, email addresses, and URLs from the source text; it also involved correcting basic but routine OCR scanning errors. The end result was a corpus half the size of the original, unprocessed text in terms of token count.

In this post, I will convert the list of tokens returned by the text processing into bigrams and measure the significance of these associations to find word collocations.

When I refer to bigrams, I am describing tuples of two consecutive tokens. For example, the line, “this is a test sentence” would be converted to [('this', 'is'), ('is', 'a'), ('a', 'test'), ('test', 'sentence')] if we were extracting bigrams.

When I refer to collocations, I am describing those bigrams (or any ngram larger than a unigram) that combine to form unitary meanings due to grammar or convention. The most straightforward bigram collocation is a name that relies on a compound such as “New York” or “East Lansing.”

However, collocations also join other linguistic units such as verbs and nouns because convention dictates a constrained usage. For example, in the sentence, “She opened the gate,” we have the following bigrams (allowing for the removal of “the”): (‘she’, ‘opened’) and (‘opened’, ‘gate’). These tokens all work to create meaning in this sentence, as is the case with any sentence. However, the example sentence is also amenable to substitution. The tokens ‘she’ and ‘opened’ have a subject-verb relationship, but ‘opened’ does not strongly depend on ‘she’ to create meaning; ‘she’ could just as well be ‘jane’ or ‘he’ or ‘john.’ The case of ‘opened’ and ‘gate’ is more complex because conventions in English suggest that we open objects such as doors and gates. It is far less conventional to write, “she released the gate” or “she unstopped the gate.” At the same time, “she unlatched the gate” or “she unlocked the gate” might also work, suggesting that there is still some flexibility in the verb that modifies the object “gate.” A bigram collocation would feature a more rigid association between tokens. We might cite “St. Peter’s Gate” or the “pearly gates.” Both refer to a specific gate, and the replacement of either token would radically change the meaning of the term.

Tracking and measuring collocations in a natural language text is a common practice and can be applied to numerous information retrieval tasks and research questions. For example, finding stable collocations among tokens can identify compound terms such as “real estate” or “East Lansing” or “Rhode Island.” If such collocations occur at levels of significance greater than random, then a text mining routine can programmatically combine these words into a single token, thereby providing a more accurate representation of a text.
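The combining step described above can be sketched as follows (a minimal illustration; `merge_collocations` and the underscore-joining convention are my own, not from an established library):

```python
def merge_collocations(tokens, collocations):
    """Replace adjacent token pairs found in `collocations` with a single
    underscore-joined token, e.g. ('east', 'lansing') -> 'east_lansing'."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            merged.append(tokens[i] + '_' + tokens[i + 1])
            i += 2  # consume both members of the collocation
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

Given the tokens `['real', 'estate', 'in', 'east', 'lansing']` and those two collocations, the routine returns `['real_estate', 'in', 'east_lansing']`, giving downstream counts a more accurate unit of analysis.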

Given the MLA Job Information List, measuring bigram collocations might contribute to a cleaner dataset. For example, finding significant bigrams might help an analyst differentiate between calls for “rhetoric” and “rhetoric [and] composition.” At a more basic level, finding significant bigram collocations might help us screen expected compounds such as “english department.” This is not a trivial problem, and does merit consideration, especially if we can tune the program to recognize named entities.

My interest in collocations for these postings is broader, however, or, I should say, less granular. At a basic level, the relevance of ngram collocations is that the proximal relationships between words signify conceptual relationships, and these conceptual relationships influence the structure and meaning of a text. The idea behind this text mining endeavor is to computationally reveal probative insights into the rhetoric of the MLA Job Information List.

On the face of it, finding significant bigram collocations would only seem to highlight what we already know about the MLA Job Information List. We know that we will see numerous instances of “rhetoric, composition” and “english, department” or “technical, communication.” We would also expect to see collocations between verbs like “mail” and nouns like “cv” or “sent” and “application.” Such collocations all fit the genre of job advertisements in the field of English, rhetoric and composition, professional writing, and creative writing, which provide instructions for applicants and information about delivery of materials.

If we could imagine this computational experiment undertaken on a blind sample, extracting identifying information might be valuable for the purposes of classification. My primary interest in these posts, however, is to reveal patterns of rhetoric that might not be immediately obvious to conventional readings and to understand how the MLA Job Information List functions as a body of discourse, not just a collection of ads submitted by various institutions.

Materials Used

  • Python 2.6+
  • NLTK==2.0.4
  • Numpy==1.8.0

Bigram and Trigram Collocations

Creating bigram and trigram collocations is a common practice in natural language processing, and several Python libraries already have built-in modules to handle this task, including NLTK. Given that we already have a tokenized list of the October 2012 edition of the MLA Job Information List, we can write our own brief functions to turn that list into lists of bigrams and trigrams:

def bigram_finder(tokens):
    return [(tokens[i], tokens[i+1]) for i in range(len(tokens)-1)]

def trigram_finder(tokens):
    return [(tokens[i], tokens[i+1], tokens[i+2]) for i in range(len(tokens)-2)]

###If opting to use NLTK
#from nltk import bigrams, trigrams

The results of the bigram_finder and trigram_finder functions are comparable to the methods found in the NLTK library.

Hypothesis Testing

While we now have a method to extract collocated tokens from the MLA Job Information List, we have not arrived at a means to test whether or not the collocations are significant. A basic measure of significance is whether or not a particular collocation occurs at a rate greater than random.

Of course, because we are dealing with a natural language text governed by rules and conventions, no word/token is a product of pure chance. Consequently, what we are really trying to determine is whether the patterns of collocations in the MLA Job Information List are strongly or weakly motivated. In some cases, the associations between tokens are linguistic; however, other associations can point to concept formation and deployment, which can intimate how the rhetoric of the MLA Job Information List operates.

To test the strength of the ngram associations, I will use the Student T-Test for significance as outlined by Manning and Schutze (2000, pp. 163-166).

The Student T-Test for significance begins with the null hypothesis that the terms constituting a collocation are independent. In this case, independence means that the likelihood that one term would be collocated with another is no better than chance, which depends on the distribution of the term in the population of terms.

In order to compare the likelihood that a collocation results from random selection or from a more decisive cause, we calculate the t-value. If this t-value is less than a critical value given by a t-table, then we cannot reject the null hypothesis that the collocation exists more or less as a product of chance. If the t-value is greater than the critical value, then we can reject the null hypothesis.

For bigrams, the t-value is calculated thus:

t_value = (sample_likelihood - independence_likelihood)/(math.sqrt(sample_likelihood/population))

Let’s step through the variables:

The independence_likelihood is the likelihood that the two tokens in the bigram are collocated at a frequency no better than random. We calculate the independence likelihood of a bigram collocation thus:

#Let n_1 be the first token in the bigram
#Let n_2 be the second token in the bigram
#Let the population be the total number of bigrams in the sample

independence_likelihood = frequency of n_1/population * frequency of n_2/population 

In other words, the independence likelihood is the probability that n_1 and n_2 can occur in a sample given their relative frequency; it is the frequency that we would expect to see if chance were the only regulating factor.

The sample_likelihood is the actual distribution of the bigram in the sample:

#Let (n_1, n_2) represent the actual bigram

sample_likelihood = frequency of (n_1, n_2)/population

Once again, these variables are assembled in the following equation:

t_value = (sample_likelihood - independence_likelihood)/(math.sqrt(sample_likelihood/population))
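As a quick numeric check of the formula above, consider toy counts (not drawn from the Job Information List): a sample of 10,000 bigrams in which 'new' occurs 20 times, 'york' 15 times, and the bigram ('new', 'york') 8 times:

```python
import math

# Toy counts for illustration only
population = 10000
independence_likelihood = (20.0 / population) * (15.0 / population)  # 3e-06
sample_likelihood = 8.0 / population                                 # 0.0008
t_value = (sample_likelihood - independence_likelihood) / math.sqrt(sample_likelihood / population)
# t_value ≈ 2.82
```

With these toy counts, the t-value of roughly 2.82 would clear the common one-tailed 0.05 critical value (1.645) but not stricter thresholds.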

NLTK has prebuilt functions to calculate t-values for the Student T-test. However, solving for t-values is not terribly taxing and can be accomplished through the use of Python’s Counter and defaultdict.

Let’s first dispense with the necessary imports:

from __future__ import division
import math
from collections import Counter, defaultdict

Note: You must place from __future__ import division first for it to work.

def bigram_student_t(tokenlist):

    population = len(tokenlist)-1
    counts = Counter(tokenlist)
    bigrams = bigram_finder(tokenlist)

    independence_likelihood = defaultdict(list)
    for bigram in bigrams:
        independence_likelihood[bigram] = counts[bigram[0]]/population * counts[bigram[1]]/population

    sample_likelihood = Counter(bigrams)
    for k, v in sample_likelihood.items():
        sample_likelihood[k] = v/population

    tvalues = defaultdict(list)

    for bigram in bigrams:
        tvalues[bigram] = ((sample_likelihood[bigram] - independence_likelihood[bigram])/(math.sqrt(sample_likelihood[bigram]/population)), sample_likelihood[bigram] * population)

    return tvalues

Let me step through the code:

population = len(tokenlist)-1
counts = Counter(tokenlist)
bigrams = bigram_finder(tokenlist)

The above three lines of code set our working variables. Our population refers to the count of bigrams, which will always be 1 less than the length of the input list of tokens.

The counts variable returns a Counter object, which behaves like a Python dictionary. The keys in counts are the tokens. The values of counts are their frequencies.

bigrams calls our former bigram_finder() function and converts the list of tokens into a list of bigrams.

independence_likelihood = defaultdict(list)
    for bigram in bigrams:
        independence_likelihood[bigram] = counts[bigram[0]]/population * counts[bigram[1]]/population

    sample_likelihood = Counter(bigrams)
    for k, v in sample_likelihood.items():
        sample_likelihood[k] = v/population

    tvalues = defaultdict(list)

    for bigram in bigrams:
        tvalues[bigram] = ((sample_likelihood[bigram] - independence_likelihood[bigram])/(math.sqrt(sample_likelihood[bigram]/population)), sample_likelihood[bigram] * population)

The remaining code arranges our count information into two defaultdicts and a Counter. The sample_likelihood Counter stores the probability distribution of each bigram in the list of bigrams.

The independence_likelihood variable stores the probability values of each bigram based on the expected distribution of the bigrams given the independence of each item in the bigram.

The tvalues defaultdict holds our t-value solutions, paired with the original bigram counts.

You can obtain similar results by calling NLTK’s BigramAssocMeasures() and BigramCollocationFinder:

import nltk
from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()

#Create list of bigrams from tokens
finder = BigramCollocationFinder.from_words(tokens)

#Find t-values of all bigrams in the list using the Student T-Test from Manning and Schutze
scored = finder.score_ngrams(bigram_measures.student_t)
The results from the bigram_student_t function and NLTK’s BigramAssocMeasures() are comparable but not exact. The difference lies in how NLTK defines its population variable. NLTK takes the length of the token list as its population, whereas I have taken the length of the list of bigrams. For example:

#bigram_student_t t-value for ('english', 'department')

#NLTK BigramAssocMeasures t-value for 'english', 'department'

The difference is negligible when it comes to checking t-values against a t-table; however, I believe my implementation is more in keeping with what Manning and Schutze describe.

Using a t-table of critical values, we see that the critical value to reject the null hypothesis of independence for a sample size of 49,444 is 3.291 for a one-tailed test with a 0.9995 degree of confidence. Thus, all those bigrams with a t-value greater than 3.291 are reasonably expected to form collocations.
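That cutoff is easy to apply in code. In this sketch, the tvalues mapping and its numbers are purely illustrative stand-ins, not values drawn from the corpus:

```python
# hypothetical mapping: bigram -> (t-value, count)
tvalues = {
    ('assistant', 'professor'): (21.4, 460),
    ('the', 'of'): (0.7, 12),
}

CRITICAL_T = 3.291  # one-tailed test, 0.9995 confidence, large sample

# keep only bigrams whose t-value exceeds the critical value
collocations = [bg for bg, (t, count) in tvalues.items() if t > CRITICAL_T]
```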

You can download a list of the bigrams, their t-values, and their counts here as a zipped .csv.

One thing to note about the t-values: as Manning and Schutze point out, the Student T-test for independence should be considered a means to rank collocations in a text, not a way to simply declare a word pair a collocation or not. Consequently, you may find bigrams that are normally regarded as collocations lacking a t-value that rises above the critical value.


A perusal of the bigram list with its associated t-values and counts may seem a little underwhelming because the output hews so closely to expectation in terms of rhetoric and statistical analysis.

As is often the case when you tally words or word collocations, the counts form a heavy-tailed distribution. In this case, there are a few collocations that appear at high frequency, but most of the collocations appear once (hence, the heavy tail along the x-axis):

Bigram Distributions

Unsurprisingly, the top bigram collocations in terms of counts comprise the following:

  1. (‘assistant’, ‘professor’)
  2. (‘apply’, ‘position’)
  3. (‘department’, ‘english’)
  4. (‘job’, ‘information’)
  5. (‘information’, ‘list’)
  6. (‘mla’, ‘job’)
  7. (‘english’, ‘edition’)
  8. (‘list’, ‘english’)
  9. (‘letter’, ‘application’)
  10. (‘invite’, ‘application’)
  11. (‘candidate’, ‘will’)
  12. (‘writing’, ‘sample’)
  13. (‘creative’, ‘writing’)
  14. (‘three’, ‘letter’)

For those of us in the fields of English studies, rhetoric and composition, professional writing, and creative writing, we can easily interpolate the sentences these bigrams inform. And we are now back to the problem that I posed in the introduction to this post: what can an analysis of bigram collocations tell us about the MLA Job Information List that we don’t already know?

The answer, I think, is that it may not tell us a lot about the MLA Job Information List, but it can point us to ways in which we can use basic statistical information to track and tag larger units of discourse and to better understand how global meanings arise from more granular elements.

The above list of bigrams suggests what people in the field of English studies or rhetoric and composition might call boilerplate. Most of the bigrams, such as ('letter', 'application') and ('writing', 'sample'), are supplied so that candidates can fulfill the basic requirements of the application process. Through tradition, and legal and institutional norms, the process is relatively homogeneous. Thus, the call for applicants looks the same throughout. Call it institutional boilerplate.

If the top bigrams indicate boilerplate, then they also indicate a particular rhetorical move aimed at fulfilling genre conventions. If we can use the top bigram collocations to tag segments of boilerplate language using nothing but text normalization and the Student T-test for significance, then, I think, we have found a use for this experiment.

To these ends, I have written a script that will examine each sentence of the MLA Job Information list, test for the presence of the top-20 collocations, and tag that sentence in HTML with the mark tag.
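The script itself is not reproduced here, but its core move can be sketched as follows. This sketch assumes simple lowercasing and whitespace tokenization; the actual script’s normalization (stemming, punctuation stripping) likely differs, and the collocation set shown is an illustrative subset, not the real top-20 list:

```python
import html

# illustrative subset of the top collocations
top_collocations = {('assistant', 'professor'), ('writing', 'sample')}

def tag_sentence(sentence):
    """Wrap a sentence in a <mark> tag if it contains a top collocation."""
    tokens = sentence.lower().split()
    bigrams = set(zip(tokens, tokens[1:]))
    escaped = html.escape(sentence)
    if bigrams & top_collocations:
        return '<mark>{}</mark>'.format(escaped)
    return escaped
```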

You can download the tagged HTML file here.

Further Questions

Whether or not the use of significant bigram collocations can alert us to boilerplate material in the MLA October 2012 Job Information List is up for debate, but I think the results are tantalizing if not definitive.

Firstly, we see in the frequency distribution chart of the bigrams that the top-20 collocations comprise only a small part of the corpus. However, the deployment of these top-20 collocations has led to almost every sentence of the list being tagged (although there are processing errors in tokenizing sentences). There is no doubt that this is a blunt metric that can and should be refined; but the results suggest that the generic markers of a text can be minute in terms of the overall count of features yet exert a pervasive effect on the delivery of meaning, which makes intuitive sense if accepted theories of discourse on genre hold.

These results also call to mind Ridolfo and DeVoss’s article on rhetorical velocity, “Composing for Recomposition: Rhetorical Velocity and Delivery.” In this piece, Ridolfo and DeVoss examine how writers and designers strategically compose pieces for re-use by other authors and for speedy circulation. One example strategy is boilerplate writing. If we can tag texts for their use of boilerplate (as defined by a particular context of use), then might we also be able to mathematically gauge the rhetorical velocity of a text–at least in relation to existing forms?

ATTW 2014 Presentation

Download the presentation slides (PPTX).

Hedging and the Jonathan Martin Bullying Scandal Part 2


This post is a follow-up to my previous post on hedging, computational rhetoric, and the Wells Report, which details the results of a special investigation into the Jonathan Martin bullying scandal that has recently been the subject of media scrutiny. In that post, I attempted to analyze the Wells Report with a personally developed app called the Hedge-O-Matic. The Hedge-O-Matic uses Naive Bayes classification routines to tag sentences for their hedgey or not-so-hedgey rhetorical content. The Hedge-O-Matic is trained on sentences culled from 150+ academic science articles. When tested on like genres, the Hedge-O-Matic generally proves 78-82% accurate under 10-fold cross validation (90% of the training set is used to predict the hedge or non-hedge quality of the remaining, randomly selected 10% of test data). In this post, I weigh the Hedge-O-Matic’s results against a hand-coded version of the Wells Report.


  • After exporting the Hedge-O-Matic results to a .csv file, I created a separate column for my hand-coded input. Like the Hedge-O-Matic, I coded each sentence as a hedge or non_hedge.
  • I then imported this .csv into the data processing library Pandas.
  • I then compared the Hedge-O-Matic’s tags against my own. If the fields were the same, the row was tagged “True;” otherwise, “False.”
  • I then calculated accuracy, precision, and recall scores from the “True”/“False” totals.
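In Pandas, that comparison reduces to a few lines. The data and column names here are hypothetical stand-ins for whatever the exported .csv actually uses:

```python
import pandas as pd

# toy stand-in for the exported .csv (column names are hypothetical)
df = pd.DataFrame({
    'hom_tag':  ['hedge', 'hedge',     'non_hedge', 'non_hedge'],
    'hand_tag': ['hedge', 'non_hedge', 'non_hedge', 'non_hedge'],
})

# True where the Hedge-O-Matic agrees with the hand-coding
df['match'] = df['hom_tag'] == df['hand_tag']
accuracy = df['match'].mean()
```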

Results and Discussion

Hedge-O-Matic Accuracy = 0.580289456

This is a far cry from the 78-82% accuracy that I am accustomed to seeing from the Hedge-O-Matic. In a sense, the app is doing little more than spitballing at the sentences.

This result is also to be expected given the nature of the training and test sets. The Hedge-O-Matic is tuned for academic science articles. While formal, the Wells Report is written for a different audience and incorporates different conventions. Moreover, the Wells Report features numerous quotations of text messages, which is another genre by itself. At the present state of development, the Hedge-O-Matic has not been shown anything that resembles a text message, especially not the expletive-laden communications at the heart of the Martin harassment case.

I will also remind readers of a problem that I discussed in the previous post: quotation boundaries. My study relies on a classification output in which certain sentences escaped tokenization because sentence-final punctuation fell within a quotation. This is a remnant of a test done on a literary text, in which many instances of quotations did not end the sentence. As a result, many of the tagged sentences were not single sentences, and this could have shifted the results. When adjusted for the tokenization error, the output appears thus:

Hedge-O-Matic Original Length: 1,451 sentences
Hedge-O-Matic Adjusted Length: 1,564 sentences

At this time, I have not run the adjusted sentences; however, with a 7.7% loss of sentences, we can expect some degradation of accuracy.

To provide more clarity on these accuracy figures, here are the tagged distributions of hedge/non-hedge sentences and the correctness of their predictions:

Tag          #      # Correct
Hedge        708    132
Non-Hedge    743    710

These numbers translate to the following precision, recall scores, and F1 scores:

Tag          Precision      Recall          F1 Score
Hedge        0.186440678    0.8             0.3024054983263778
Non-Hedge    0.955585464    0.55209099533   0.6998452840307471

I can attribute the low precision of hedge finding in the Wells Report to a number of factors:

  • Borderline sentences with confidence scores around 0.5 are classed as hedges. In other words, when in doubt, the Hedge-O-Matic hedges its own bets by declaring a sentence a hedge.
  • The hedging moves made in a legalistic document such as the Wells Report are different from those made in a scientific article, the most notable being reported speech. In many instances, the authors of the Wells Report quote or recapitulate the sentiments of their interview subjects. Thus, while an interview subject may say something hedgey, the authors themselves are not hedging; they are describing with confidence what they have witnessed.

There are other more global limitations at work here as well. The most notable is that the Hedge-O-Matic’s training set does not contain enough linguistic variability to account for the rhetorical moves made in the Wells Report.

There is also a high degree of overfitting occurring because of the smallishness and regularity of the training set. Thus, words that are often markers of hedging in scientific articles (“however,” “can,” “believed”) are biasing the classifier toward predicting hedge sentences even when such words are bracketed within an instance of reported speech or paraphrase.

That said, I would contend that the low accuracy of the Hedge-O-Matic in the case of the Wells Report is actually a good result because it supports the pivotal assumption girding this computational rhetorics project–that different discourses feature different conventions that signal disciplinary and generic boundaries and that these boundaries can be traced by computers.

Part 3 of this study will focus on visualizing the results.

MCAA Presentation on Computational Rhetorics

On October 26, 2013, I was invited to speak at the Midwestern Conference of Asian Affairs on digital humanities and the work being done by MATRIX and WIDE. I described our methods of using graph and social network analytical metrics to extract rhetorical moves from discourse. I even got a chance to test our theories on a dataset that I used in my dissertation involving around 36,000 YouTube comments on a video by MriRian.

I hope to discuss my findings from this experiment in more detail in a future post. Here, I wanted to express my gratitude to Professor Ethan Segal and MATRIX for allowing me to present on our work in WIDE-MATRIX’s Computational Rhetoric group, and to thank the assembled audience of Asian Studies scholars. The group asked keen questions about the unruly, noisy nature of texts, which always reminds me of the rhetorical acts implicit in the normalization of data. The act of cleaning data is always a lossy activity; those of us who do text or data mining must not only be cognizant of the data that we are screening out but also be as transparent as possible about our biases and rationale for these protocols. Methods carry with them methodologies, and these methodologies are inflected by ideological orientations that will condition the data.

The term heuristic was also used a lot in our discussions, and to me, that is one of the most valuable and focusing glosses for the work that we are doing in the computational rhetorics group. We are creating and applying critical lenses to data in order to build theory. But these lenses still require human tuning to arrive at plausible interpretations.

CWCON Presentation Featured on MATRIX blog today

The kind folks at MATRIX, MSU’s Center for Humane Arts, Letters, and Social Sciences, have featured a recent talk on computational rhetoric I gave with Bill Hart-Davidson at Computers and Writing 2013.

Thanks to MATRIX for supporting our initiatives in computational rhetoric.

See the blog post here:

One of the most fundamental concerns when envisioning methodologies of computational rhetoric is this: can we turn a natural language text–with all its rhetorical intricacies–into a computational object?

If we can, then this obliges another question: what kind?

The issue can be framed in terms of data structures, namely, what kind of data structure does a natural language text offer? If a computer is to read it, then that text must offer some form of data structure.

So what is the best fit?

My first impulse is to say that a graph is the best data structure for a natural language text.

If this is true, then this theory will do a great deal to affirm post-structuralist theories of language as well as challenge existing models.