Text Mining the MLA Job Information List Part 2

In Text Mining the MLA Job Information List Part 1, I cobbled together a series of regular expression scripts and list comprehensions in Python to reduce the dimensionality of the October 2012 edition of the MLA Job Information List. This dimensionality reduction removed components such as punctuation, function words, email addresses, and URLs from the source text; it also involved correcting basic but routine OCR scanning errors. The end result was a corpus half the size of the original, unprocessed text in terms of token count.

In this post, I will convert the list of tokens returned by the text processing into bigrams and measure the significance of these associations to find word collocations.

When I refer to bigrams, I am describing tuples of two consecutive tokens. For example, the line, “this is a test sentence” would be converted to [('this', 'is'), ('is', 'a'), ('a', 'test'), ('test', 'sentence')] if we were extracting bigrams.

When I refer to collocations, I am describing those bigrams (or any ngram larger than a unigram) that combine to form unitary meanings due to grammar or convention. The most straightforward bigram collocation is a compound name such as “New York” or “East Lansing.”

However, collocations also join other linguistic units such as verbs and nouns because convention dictates a constrained usage. For example, in the sentence, “She opened the gate,” we have the following bigrams (allowing for the removal of “the”): (‘she’, ‘opened’) and (‘opened’, ‘gate’). These tokens all work to create meaning in this sentence, as is the case with any sentence. However, the example sentence is also amenable to substitution. The tokens ‘she’ and ‘opened’ have a subject-verb relationship, but ‘opened’ does not strongly depend on ‘she’ to create meaning; ‘she’ could just as well be ‘jane’ or ‘he’ or ‘john.’ The case of ‘opened’ and ‘gate’ is more complex because conventions in English suggest that we open objects such as doors and gates. It is far less conventional to write, “she released the gate” or “she unstopped the gate.” At the same time, “she unlatched the gate” or “she unlocked the gate” might also work, suggesting that there is still flexibility in the choice of verb that modifies the object “gate.” A bigram collocation would feature a more rigid association between tokens. We might cite “St. Peter’s Gate” or the “pearly gates.” Both refer to a specific gate, and the replacement of either token would radically change the meaning of the term.

Tracking and measuring collocations in a natural language text is a common practice and can be applied to numerous information retrieval tasks and research questions. For example, finding stable collocations among tokens can identify compound terms such as “real estate” or “East Lansing” or “Rhode Island.” If such collocations occur at levels of significance greater than random, then a text mining routine can programmatically combine these words into a single token, thereby providing a more accurate representation of a text.

Given the MLA Job Information List, measuring bigram collocations might contribute to a cleaner dataset. For example, finding significant bigrams might help an analyst differentiate between calls for “rhetoric” and “rhetoric [and] composition.” At a more basic level, finding significant bigram collocations might help us screen expected compounds such as “english department.” This is not a trivial problem, and does merit consideration, especially if we can tune the program to recognize named entities.
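
To make that idea concrete, here is a minimal sketch of how significant bigrams could be merged into single tokens; the merge_collocations function, the joiner character, and the example set of bigrams are my own illustrations, not part of the processing pipeline from Part 1:

def merge_collocations(tokens, significant_bigrams, joiner='_'):
    #Walk the token list, joining any adjacent pair found in significant_bigrams
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i+1]) in significant_bigrams:
            merged.append(tokens[i] + joiner + tokens[i+1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

#Example: merge_collocations(['east', 'lansing', 'campus'], {('east', 'lansing')})
#returns ['east_lansing', 'campus']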

My interest in collocations for these postings is broader, however, or, I should say, less granular. At a basic level, the relevance of ngram collocations is that the proximal relationships between words signify conceptual relationships, and these conceptual relationships influence the structure and meaning of a text. The idea behind this text mining endeavor is to computationally reveal probative insights into the rhetoric of the MLA Job Information List.

On the face of it, finding significant bigram collocations would only seem to highlight what we already know about the MLA Job Information List. We know that we will see numerous instances of “rhetoric, composition” and “english, department” or “technical, communication.” We would also expect to see collocations between verbs like “mail” and nouns like “cv” or “sent” and “application.” Such collocations all fit the genre of job advertisements in the field of English, rhetoric and composition, professional writing, and creative writing, which provide instructions for applicants and information about the delivery of materials.

If we could imagine this computational experiment undertaken on a blind sample, extracting identifying information might be valuable for the purposes of classification. My primary interest in these posts, however, is to reveal patterns of rhetoric that might not be immediately obvious to conventional readings and to understand how the MLA Job Information List functions as a body of discourse, not just a collection of ads submitted by various institutions.

Materials Used

  • Python 2.7+
  • NLTK==2.0.4
  • Numpy==1.8.0

Bigram and Trigram Collocations

Creating bigram and trigram collocations is a common practice in natural language processing, and several Python libraries already have built-in modules to handle this task, including NLTK. Given that we already have a tokenized list of the October 2012 edition of the MLA Job Information List, we can write our own brief functions that will turn that list into a list of bigrams and trigrams:

def bigram_finder(tokens):
    #Pair each token with the token that follows it
    return [(tokens[i], tokens[i+1]) for i in range(len(tokens)-1)]

def trigram_finder(tokens):
    #Group each token with the two tokens that follow it
    return [(tokens[i], tokens[i+1], tokens[i+2]) for i in range(len(tokens)-2)]

###If opting to use NLTK
#from nltk import bigrams, trigrams
#bigrams(tokens)
#trigrams(tokens)

The results of the bigram_finder and trigram_finder functions are comparable to the methods found in the NLTK library.
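
As a quick check, running these functions on the example sentence from earlier produces the expected tuples (example_tokens is my own illustrative variable):

example_tokens = ['this', 'is', 'a', 'test', 'sentence']

bigram_finder(example_tokens)
#[('this', 'is'), ('is', 'a'), ('a', 'test'), ('test', 'sentence')]

trigram_finder(example_tokens)
#[('this', 'is', 'a'), ('is', 'a', 'test'), ('a', 'test', 'sentence')]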

Hypothesis Testing

While we now have a method to extract collocated tokens from the MLA Job Information List, we have not arrived at a means to test whether or not the collocations are significant. A basic measure of significance is whether or not a particular collocation occurs at a rate greater than random.

Of course, because we are dealing with a natural language text governed by rules and conventions, no word/token is a product of chance. Consequently, what we are really trying to determine is whether the patterns of collocations in the MLA Job Information List are strongly or weakly motivated. In some cases, the associations between tokens are linguistic; however, other associations can point to concept formation and deployment, which can intimate, computationally, how the rhetoric of the MLA Job Information List operates.

To test the strength of the ngram associations, I will use the Student T-Test for significance as outlined by Manning and Schutze (2000, pp. 163-166).

The Student T-Test for significance begins with the null hypothesis that the terms constituting a collocation are independent. In this case, independence means that the likelihood that one term would be collocated with another is no better than chance, which depends on the distribution of the term in the population of terms.

In order to compare the likelihood that a collocation results from random selection or from a more decisive cause, we calculate the t-value. If this t-value is less than a critical value given by a t-table, then we cannot reject the null hypothesis that the collocation exists more or less as a product of chance. If the t-value is greater than the critical value, then we can reject the null hypothesis.

For bigrams, the t-value is calculated thus:

t_value = (sample_likelihood - independence_likelihood)/(math.sqrt(sample_likelihood/population))

Let’s step through the variables:

The independence_likelihood is the likelihood that the two tokens in the bigram are collocated at a frequency no better than random. We calculate the independence likelihood of a bigram collocation thus:

#Let n_1 be the first token in the bigram
#Let n_2 be the second token in the bigram
#Let the population be the total number of bigrams in the sample

independence_likelihood = frequency of n_1/population * frequency of n_2/population 

In other words, the independence likelihood is the probability that n_1 and n_2 can occur in a sample given their relative frequency; it is the frequency that we would expect to see if chance were the only regulating factor.

The sample_likelihood is the actual distribution of the bigram in the sample:

#Let (n_1, n_2) represent the actual bigram

sample_likelihood = frequency of (n_1, n_2)/population

Once again, these variables are assembled in the following equation:

t_value = (sample_likelihood - independence_likelihood)/(math.sqrt(sample_likelihood/population))
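
To make the arithmetic concrete, here is a small worked example with hypothetical counts; the numbers are invented for illustration and do not come from the Job Information List:

import math

#Suppose a sample of 10,000 bigrams in which 'english' occurs 200 times,
#'department' occurs 100 times, and the bigram ('english', 'department')
#occurs 20 times.
population = 10000.0

independence_likelihood = (200/population) * (100/population)    #0.0002
sample_likelihood = 20/population                                #0.002

t_value = (sample_likelihood - independence_likelihood)/(math.sqrt(sample_likelihood/population))
#(0.002 - 0.0002)/sqrt(0.002/10000) is roughly 4.02, which would exceed common critical values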

NLTK has prebuilt functions to calculate t-values for the Student T-test. However, solving for t-values is not terribly taxing and can be accomplished through the use of Python’s Counter and defaultdict.

Let’s first dispense with the necessary imports:

from __future__ import division
import math
from collections import Counter, defaultdict

Note: You must place from __future__ import division first for it to work.

def bigram_student_t(tokenlist):

    #The population is the number of bigrams, one less than the number of tokens
    population = len(tokenlist)-1
    counts = Counter(tokenlist)
    bigrams = bigram_finder(tokenlist)

    #Expected likelihood of each bigram if its two tokens were independent
    independence_likelihood = defaultdict(float)
    for bigram in bigrams:
        independence_likelihood[bigram] = counts[bigram[0]]/population * counts[bigram[1]]/population

    #Observed likelihood of each bigram in the sample
    sample_likelihood = Counter(bigrams)
    for k, v in sample_likelihood.items():
        sample_likelihood[k] = v/population

    #Store the t-value alongside the original bigram count
    tvalues = defaultdict(tuple)
    for bigram in bigrams:
        tvalues[bigram] = ((sample_likelihood[bigram] - independence_likelihood[bigram])/(math.sqrt(sample_likelihood[bigram]/population)), sample_likelihood[bigram] * population)

    return tvalues

Let me step through the code:

population = len(tokenlist)-1
counts = Counter(tokenlist)
bigrams = bigram_finder(tokenlist)

The above three lines of code set our working variables. Our population refers to the count of bigrams, which will always be 1 less than the length of the input list of tokens.

The counts variable returns a Counter object, which behaves like a Python dictionary. The keys in counts are the tokens; the values are their frequencies.

bigrams calls our earlier bigram_finder() function and converts the list of tokens into a list of bigrams.

independence_likelihood = defaultdict(float)
for bigram in bigrams:
    independence_likelihood[bigram] = counts[bigram[0]]/population * counts[bigram[1]]/population

sample_likelihood = Counter(bigrams)
for k, v in sample_likelihood.items():
    sample_likelihood[k] = v/population

tvalues = defaultdict(tuple)

for bigram in bigrams:
    tvalues[bigram] = ((sample_likelihood[bigram] - independence_likelihood[bigram])/(math.sqrt(sample_likelihood[bigram]/population)), sample_likelihood[bigram] * population)

The remaining code arranges our count information into a Counter and two defaultdicts. The sample_likelihood Counter stores the probability distribution of each bigram in the list of bigrams.

The independence_likelihood variable stores the probability values of each bigram based on the expected distribution of the bigrams given the independence of each item in the bigram.

The tvalues defaultdict holds our t-value solutions and includes the original bigram count.
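
As a quick illustration, assuming tokens is the processed token list from Part 1, we can rank the results by t-value to surface the strongest candidate collocations:

tvalues = bigram_student_t(tokens)

#Sort bigrams from highest to lowest t-value
ranked = sorted(tvalues.items(), key=lambda item: item[1][0], reverse=True)

#Print the twenty strongest candidate collocations with their counts
for bigram, (t, count) in ranked[:20]:
    print bigram, t, count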

You can obtain similar results by calling NLTK's BigramAssocMeasures() and BigramCollocationFinder:

import nltk
from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()

#Create list of bigrams from tokens
finder = BigramCollocationFinder.from_words(tokens)

#Find t values of all bigrams in the list using student t test from Manning and Schutze
finder.score_ngrams(bigram_measures.student_t)
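
If you only want the highest-scoring pairs rather than the full list of scores, the finder also exposes an nbest method:

#Return the 20 bigrams with the highest t-values
finder.nbest(bigram_measures.student_t, 20)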

The results from the bigram_student_t function and NLTK's BigramAssocMeasures() are comparable but not exact. The difference lies in how NLTK defines its population variable. NLTK takes the length of the token list as its population, whereas I have taken the length of the list of bigrams. For example:

#bigram_student_t t-value for ('english', 'department')
#6.98412279000951

#NLTK BigramAssocMeasures t-value for 'english', 'department'
#6.98414953692

The difference is negligible when it comes to checking t-values against a t-table; however, I believe my implementation is more in keeping with what Manning and Schutze describe.

Using a t-table of critical values, we see that the critical value to reject the null hypothesis of independence for a sample size of 49,444 is 3.291 for a one-tailed test with a 0.9995 degree of confidence. Thus, all those bigrams with a t-value greater than 3.291 are reasonably expected to form collocations.
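
As a rough sketch, again assuming tvalues holds the (t-value, count) pairs returned above, we can pull out the bigrams that clear this threshold:

CRITICAL_VALUE = 3.291

#Keep only the bigrams whose t-value exceeds the critical value
significant = [(bigram, t, count) for bigram, (t, count) in tvalues.items() if t > CRITICAL_VALUE]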

You can download a list of the bigrams, their t-values, and their counts here as a zipped .csv.

One thing to note about the t-values: as Manning and Schutze point out, the Student T-test for independence should be considered a means to rank collocations in a text, not simply to declare a word pair a collocation or not. Consequently, you may find bigrams that are normally regarded as collocations lacking a t-value that rises above the critical value.

Analysis

A perusal of the bigram list and their associated t-values and counts may seem a little underwhelming because the output hews so closely to expectation in terms of rhetoric and statistical analysis.

As is often the case when you tally words or word collocations, the counts form a heavy-tailed distribution. In this case, a few collocations appear at high frequency, but most of the collocations appear only once (hence, the heavy tail along the x-axis):

Bigram Distributions
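
A quick way to see this shape in the data is to count how often each bigram frequency occurs; this is a sketch, assuming tokens is the processed token list from Part 1:

bigram_counts = Counter(bigram_finder(tokens))

#Map each frequency to the number of bigrams that occur with that frequency
frequency_of_frequencies = Counter(bigram_counts.values())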

Unsurprisingly, the top bigram collocations in terms of counts comprise the following:

  1. (‘assistant’, ‘professor’)
  2. (‘apply’, ‘position’)
  3. (‘department’, ‘english’)
  4. (‘job’, ‘information’)
  5. (‘information’, ‘list’)
  6. (‘mla’, ‘job’)
  7. (‘english’, ‘edition’)
  8. (‘list’, ‘english’)
  9. (‘letter’, ‘application’)
  10. (‘invite’, ‘application’)
  11. (‘candidate’, ‘will’)
  12. (‘writing’, ‘sample’)
  13. (‘creative’, ‘writing’)
  14. (‘three’, ‘letter’)

For those of us in the field of English studies, rhetoric and composition, professional writing, and creative writing, we can easily interpolate the sentences these bigrams inform. And we are now back to the problem that I posed in the introduction to this post: what can an analysis of bigram collocations tell us about the MLA Job Information List that we don’t already know?

The answer, I think, is that it may not tell us a lot about the MLA Job Information List, but it can point us to ways in which we can use basic statistical information to track and tag larger units of discourse and to better understand how global meanings arise from more granular elements.

The above list of bigrams suggests what people in the field of English studies or rhetoric and composition might call boilerplate. Most of the bigrams, such as ('letter', 'application') and ('writing', 'sample'), are supplied so that candidates can fulfill the basic requirements of the application process. Through tradition and through legal and institutional norms, the process is relatively homogeneous; thus, the call for applicants looks the same throughout. Call it institutional boilerplate.

If the top bigrams indicate boilerplate, then they also indicate a particular rhetorical move aimed at fulfilling genre conventions. If we can use the top bigram collocations to tag segments of boilerplate language using nothing but text normalization and the Student T-test for significance, then, I think, we have found utility for this experiment.

To these ends, I have written a script that will examine each sentence of the MLA Job Information List, test for the presence of the top-20 collocations, and tag that sentence in HTML with the mark tag.

You can download the tagged HTML file here.
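
The script itself is not reproduced in this post, but the core move might look something like the following sketch; the sentence segmentation and the top_collocations set are my assumptions, and the actual script may differ:

def tag_sentences(sentences, top_collocations):
    #sentences is assumed to be a list of token lists, one list per sentence
    #top_collocations is assumed to be a set of the top-20 significant bigrams
    tagged = []
    for sentence in sentences:
        sentence_bigrams = set(bigram_finder(sentence))
        text = ' '.join(sentence)
        if sentence_bigrams & top_collocations:
            tagged.append('<mark>' + text + '</mark>')
        else:
            tagged.append(text)
    return tagged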

Further Questions

Whether or not the use of significant bigram collocations can alert us to boilerplate material in the MLA October 2012 Job Information List is up for debate, but I think the results are tantalizing if not definitive.

Firstly, we see in the frequency distribution chart of the bigrams that the top-20 collocations comprise only a small part of the corpus. However, the deployment of these top-20 collocations has led to almost every sentence of the list being tagged (although there are processing errors in tokenizing sentences). There is no doubt that this is a blunt metric that can and should be refined; but the results suggest that the generic markers of a text can be minute in terms of the overall count of features yet exert a pervasive effect on the delivery of meaning, which makes intuitive sense if accepted theories of discourse on genre hold.

These results also call to mind Ridolfo and DeVoss’s article on rhetorical velocity, “Composing for Recomposition: Rhetorical Velocity and Delivery.” In this piece, Ridolfo and DeVoss examine how writers and designers strategically compose pieces for re-use by other authors and for speedy circulation. One example strategy is boilerplate writing. If we can tag texts for their use of boilerplate (as defined by a particular context of use), then might we also be able to mathematically gauge the rhetorical velocity of a text, at least in relation to existing forms?