Text Mining the MLA Job Information List Part 1

Because it is that time of year, I decided to revisit the MLA Job Information List dataset (careful when clicking: this is a large download) generously provided by Jim Ridolfo and undertake some text mining exploits.

My goals for this mini-project are three-fold:

  • Develop some text processing scripts tailored to the MLA Job Information List
  • Measure the distribution of bigram collocations using Student's t-test
  • Convert the t-test results into a weighted network graph that may reveal something of rhetorical value

This post will focus on cleaning a segment of the MLA Job Information List for analysis using Python.

Materials Used

  • Python 2.6+
  • NLTK==2.0.4

Cleaning the Data

The MLA Job Information List dataset exists as a series of OCR-scanned pdf files. As you can imagine, the OCR process can introduce significant noise into the pdf, including spelling and whitespace errors. These errors can be compounded when converting pdf files into a plain text format, a process I described in an earlier post using the Mac Automator.

Because of the potential for high noise in the scans, I have narrowed the object of my study to the October issue of the 2012 MLA Job Information List. This file has far fewer noticeable typographical errors in the plain text (I assume because the quality of the original far exceeds that of the issues from the 60s).

You can download the plain text copy of the October 2012 MLA Job Information List here. This file is encoded in UTF-8. If you followed my previous tutorial on creating plain text files from pdfs using the Mac Automator, you may find that your encoding is different (e.g., UTF-16). You should open the file in Python and check the output before further processing.
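A minimal sketch of such a check (the filename JIL_october_2012.txt is a placeholder for wherever you saved the download):

f = open('JIL_october_2012.txt', 'r')
raw = f.read()
f.close()

# A UTF-16 file will begin with a byte-order mark ('\xff\xfe') and show null
# bytes between characters; a UTF-8 file reads as ordinary ASCII with occasional
# multi-byte sequences such as '\xe2\x80\xa2' (a bullet character).
print repr(raw[:200])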

Because I am primarily interested in the co-occurrence or proximal occurrence of words, the following items will be removed in the cleaning process:

  • Numerals
  • Names of Months
  • Function words such as prepositions and conjunctions (i.e., stopwords)
  • Pluralizations
  • Punctuation
  • URLs
  • Email addresses

The October 2012 MLA Job Information List file features a regular pattern of error involving hyphenation. For example:

Ask appropriate questions; explore areas such as educa-
tion
, experience, special interests or skills, familiar- ity with textbooks, teaching methods, professional organizations, and future expectations.

In the above case, additional whitespace separates the characters of the words “education” and “familiarity.”

Another common pattern of error/noise involves hyphenation and line breaks:

Evidence of effective leadership and prograin administrative skills beyond teaching classes, for example experi-
ence
with program administration or assessment

In addition to the hyphenation errors, there are numerous whitespace errors affecting the rendering of URLs. In some cases, the “http:” is split from the remainder of the URL:

English, MC 1030, 322 Wheeler Hall Berkeley CA 94720 http: //english. berkeley.edu/
Assistant or Associate Professor, Native American, Chicano/a, Latino/a Literatures 17408 Apply to this position at https://secure.interfolio.corn/apply/15304

In other cases, the “http” itself is split down the middle.

Candidates are asked to provide an application letter, at least three letters of recommendation, a CV, and a writing sample of 10,000 words by electronic submission:
ht tp://aprecruit.berkeley.edu/apply/ J PF00033

I have also noticed that the OCR scanning can sometimes convert “w” characters into other strings such as “vv” or “vi.” This is significant because this can render a URL such as www.example.com as vvvvvv.example.com. Even though we will be filtering out these URLs, we need to begin that process with an accurate representation.

Admittedly, many of these OCR errors are rare and may not significantly influence our results. However, because we can script a solution that does not require undue processing time, I have decided to correct these errors in the text using Python regular expressions.

The Code

I begin by importing the necessary Python modules:

import re, collections
from urlparse import urlparse
from nltk.stem.wordnet import WordNetLemmatizer

I then create a list of months in the .py file. Because the MLA Job Information List features both standard and all-caps spellings of the month names, I include both forms.

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July','August','September', 'October', 'November', 'December', 'JANUARY', 'FEBRUARY', 'MARCH', 'APRIL', 'MAY', 'JUNE', 'JULY', 'AUGUST', 'SEPTEMBER', 'OCTOBER', 'NOVEMBER', 'DECEMBER']

I then create a list of stopwords. This could also be stored in an external file that the program reads in.

stopwords = ['ot', 'i', 'im', 'we', '...', 'also', 'mr', 'mrs', 'when', 'me', 'my', 'myself', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they','that', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'doing', 'a', 'an', 'the', 'and', 'but', 'or', 'as', 'until', 'within', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'do', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', "'", '"', 'just', 'don', 'now', "they're", "'re", "you're", "we're", 're', 've', "'ve", "'s", 'em', 'dy', "'ve", 'th', 'us','wasnt', 'isnt', ')', '(', '..', 'and/or', 'i.e.']

I then create a list of replacement patterns, composed of tuples. The first item of each tuple is the unwanted string (expressed as a regular expression); the second item is its replacement. This replacement list will feed the RegexpReplacer class that implements the substitutions.


replacement_patterns = [(':andidates', 'candidates'),
                        (r'\n', ' '),
                        (r'\r', ' '),
                        (r'"', ' '),
                        (r'\'', ''),
                        (r',',''),
                        (r'- ', ''),
                        (r'(?<=http:)\s+', ''),
                        (r'(?<=ht)\s+(?=tp)', ''),
                        (r'\d+', ''),
                        (r'\)',''),
                        (r'\(', ''),
                        (r';', ''),
                        (r'\. ',''),
                        (r'\[', ''),
                        (r'\]', ''),
                        (r'>', ''),
                        (r'<', ''),
                        (r'vvvvvv', 'http://www.'),
                        (r'vvvvvv.', 'http://www'),
                        (r'vv', 'w'),
                        (r'\s+www.', 'http://www.'),
                        (r' @', '@'),
                        (r'e-mail', 'email'),
                        (r'\xe2\x80\xa2', ''),
                        (r'\xe2\x80\x94',''),
                        (r'ph\.din', 'ph.d in'),
                        (r'sainple', 'sample'),
                        (r'woinens', 'womens'),
                        (r'woinen', 'women'),
                        (r'yearlly', 'yearly'),
                        (r'aairmative', 'affirmative'),
                        (r'coinposition', 'composition'),
                        (r'!i', 's'),
                        (r"I'", 'p'),
                        (r"1'", 'p')]

The code for the RegexpReplacer has been adapted from Jacob Perkins' Python Text Processing with NLTK 2.0 Cookbook. The replacer object at the end is an instance of this class.

The patterns target basic punctuation and commonly misrepresented words in the corpus. For example, the replacement_patterns list addresses the split in "ht tp" with the lookbehind and lookahead assertions in (r'(?<=ht)\s+(?=tp)', ''), replacing the stray whitespace with an empty string ('').
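A quick check of that pattern on its own, using the split URL quoted above:

import re

# the whitespace between "ht" and "tp" is removed, rejoining the scheme
print re.sub(r'(?<=ht)\s+(?=tp)', '', 'ht tp://aprecruit.berkeley.edu/apply/')
# http://aprecruit.berkeley.edu/apply/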

You can add and delete patterns as you see fit.

class RegexpReplacer(object):
        def __init__(self, patterns=replacement_patterns):
                self.patterns = [(re.compile(regex), rep) for (regex, rep) in patterns]
        def replace(self, text):
                s = text
                for (pattern, rep) in self.patterns:
                        (s, count) = re.subn(pattern, rep, s)
                return s

replacer = RegexpReplacer()
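As a quick sanity check, the replacer can be applied to the hyphenation error quoted earlier (a minimal sketch; the expected result is traced by hand against the patterns above):

s = ('Evidence of effective leadership and prograin administrative skills '
     'beyond teaching classes, for example experi- ence with program '
     'administration or assessment')
print replacer.replace(s)
# "experi- ence" is rejoined as "experience" and the comma is stripped;
# the OCR error "prograin" remains because no pattern targets it.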

I now wrap the preceding code in a function that will filter the raw text.

def tokenize_text(raw):
    text = re.split('(literatures|courses|years|cultures|university|agenda|required|literature|course|culture|year)', raw)
    s = ' '.join(map(str, text))
    tokens = replacer.replace(s).split()

    #remove month names and stopwords from the token list
    delete_months = [token for token in tokens if token not in months]
    important_words = [word.lower() for word in delete_months if word.lower() not in stopwords]
    filtered_words = [word for word in important_words if len(word) >= 2]
    no_email = [word for word in filtered_words if '@' not in list(word)]

    #lemmatize tokens with the WordNet Lemmatizer
    lmtzr = WordNetLemmatizer()
    lemmas = [str(lmtzr.lemmatize(word)) for word in no_email]
    ftokens = [l for l in lemmas if not urlparse(l).scheme]

    return ftokens

I will step through each line of the function.

text = re.split('(literatures|courses|years|cultures|university|agenda|required|literature|course|culture|year)', raw)
s = ' '.join(map(str, text))

The above line of code uses Python's re.split() function to resolve some whitespace errors. Tokens such as 'literature' are sometimes merged with subsequent tokens (e.g., 'literaturethe'). The split breaks these merged tokens apart on the terms provided in the pattern argument (e.g., 'literaturethe' becomes 'literature', 'the').

The result of splitting the text on these terms is a list of several strings. The line s = ' '.join(map(str, text)) joins the list back into a single string for further processing.
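A quick illustration of this split-and-rejoin step on a merged token (the extra whitespace introduced by the empty string is discarded by the later call to split()):

import re

parts = re.split('(literature)', 'literaturethe')
print parts                       # ['', 'literature', 'the']
print ' '.join(map(str, parts))   # ' literature the'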

tokens = replacer.replace(s).split()

The above line of code calls the replacer object, which corrects the string according to the given replacement_patterns, and then splits the result into a list of tokens on whitespace.

delete_months = [token for token in tokens if token not in months]
important_words = [word.lower() for word in delete_months if word.lower() not in stopwords]
filtered_words = [word for word in important_words if len(word) >= 2]

The preceding list comprehensions remove the strings found in the months and stopwords lists. In addition, the filtered_words list comprehension removes tokens that are fewer than two characters long. In other circumstances, I would remove all tokens fewer than three characters long. However, because a token such as "cv" is likely to be significant, I have set the cutoff at two characters.

no_email = [word for word in filtered_words if '@' not in list(word)]

The code above removes email addresses. In essence, this line of code splits each token into its constituent characters and groups those characters in a list (i.e., list(word)). If the "@" character is in that list, the token is excluded, thereby removing the entire email address from the token list.

lmtzr = WordNetLemmatizer()
lemmas = [str(lmtzr.lemmatize(word)) for word in no_email]

The above code calls the WordNetLemmatizer class from NLTK. The lemmatizer is applied to each token and returns its lemma, essentially stripping inflectional affixes such as plural endings.
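For example, applied to plural tokens from the corpus (a quick check in the interpreter):

from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
print lmtzr.lemmatize('literatures')   # literature
print lmtzr.lemmatize('skills')        # skill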

ftokens = [l for l in lemmas if not urlparse(l).scheme]

Here, I am testing whether the token can be parsed as a URL with Python's urlparse. If urlparse detects a URL scheme (e.g., 'http') in the token, the token is removed from the final list of tokens (ftokens).
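A quick check of this test (urlparse only reports a scheme when the token looks like a full URL):

from urlparse import urlparse

print urlparse('http://english.berkeley.edu/').scheme   # http
print urlparse('professor').scheme                       # empty string, so the token is kept

The complete script follows.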

__author__ = 'ryanomizo'
import re, collections
from urlparse import urlparse
from nltk.stem.wordnet import WordNetLemmatizer

stopwords = ['i', 'im', 'we', '...', 'also', 'mr', 'mrs', 'when', 'me', 'my', 'myself', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they','that', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'doing', 'a', 'an', 'the', 'and', 'but', 'or', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'do', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', "'", '"', 'just', 'don', 'now', "they're", "'re", "you're", "we're", 're', 've', "'ve", "'s", 'em', 'dy', "'ve", 'th', 'us','wasnt', 'isnt', ')', '(', '..', 'and/or', 'i.e.']

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July','August',
             'September', 'October', 'November', 'December', 'JANUARY', 'FEBRUARY', 'MARCH',
             'APRIL', 'MAY', 'JUNE', 'JULY', 'AUGUST', 'SEPTEMBER', 'OCTOBER', 'NOVEMBER', 'DECEMBER']

replacement_patterns = [(':andidates', 'candidates'),
                        (r'\n', ' '),
                        (r'\r', ' '),
                        (r'"', ' '),
                        (r'\'', ''),
                        (r',',''),
                        (r'- ', ''),
                        (r'(?<=http:)\s+', ''),
                        (r'(?<=ht)\s+(?=tp)', ''),
                        (r'\d+', ''),
                        (r'\)',''),
                        (r'\(', ''),
                        (r';', ''),
                        (r'\. ',''),
                        (r'\[', ''),
                        (r'\]', ''),
                        (r'>', ''),
                        (r'<', ''),
                        (r'\?', ''),
                        (r'en\s+glish', 'english'),
                        (r'vvvvvv', 'http://www.'),
                        (r'vvvvvv.', 'http://www'),
                        (r'vv', 'w'),
                        (r'\s+www.', 'http://www.'),
                        (r' @', '@'),
                        (r'e-mail', 'email'),
                        (r'\xe2\x80\xa2', ''),
                        (r'\xe2\x80\x94',''),
                        (r'ph\.din', 'ph.d in'),
                        (r'sainple', 'sample'),
                        (r'woinens', 'womens'),
                        (r'woinen', 'women'),
                        (r'yearlly', 'yearly'),
                        (r'aairmative', 'affirmative'),
                        (r'coinposition', 'composition'),
                        (r'!i', 's'),
                        (r"I'", 'p'),
                        (r"1'", 'p')]


class RegexpReplacer(object):
        def __init__(self, patterns=replacement_patterns):
                self.patterns = [(re.compile(regex), rep) for (regex, rep) in patterns]
        def replace(self, text):
                s = text
                for (pattern, rep) in self.patterns:
                        (s, count) = re.subn(pattern, rep, s)
                return s

replacer = RegexpReplacer()

def tokenize_text(raw):

    text = re.split('(literatures|courses|years|cultures|university|agenda|required|literature|course|culture|year)', raw)

    s = ' '.join(map(str, text))
    tokens = replacer.replace(s).split()

    #remove month names and stopwords from the token list
    delete_months = [token for token in tokens if token not in months]
    important_words = [word.lower() for word in delete_months if word.lower() not in stopwords]
    filtered_words = [word for word in important_words if len(word) >= 2]
    no_email = [word for word in filtered_words if '@' not in list(word)]

    #lemmatize tokens with the WordNet Lemmatizer
    lmtzr = WordNetLemmatizer()
    lemmas = [str(lmtzr.lemmatize(word)) for word in no_email]
    ftokens = [l for l in lemmas if not urlparse(l).scheme]

    return ftokens
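To run the script, read the plain text file into a string and pass it to tokenize_text(). A minimal sketch (again, the filename is a placeholder for your local copy):

f = open('JIL_october_2012.txt', 'r')
raw = f.read()
f.close()

tokens = tokenize_text(raw)
print len(raw.split()), 'raw tokens'
print len(tokens), 'cleaned tokens'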

Conclusion/Results

With the code described in this post, we can both clean and reduce the dimensionality of an MLA Job Information List issue for future analysis.

After executing the code, the length of the MLA October 2012 Job Information List goes from 83,110 tokens to 47,342 tokens.

Here is the output in the text file JIL10-tokens3.

One thing you will notice about the output tokens in JIL10-tokens3.txt is that several regular OCR errors remain, including the character "m" rendered as "in." In several spots, "ec" is rendered as "cc." More tuning could help, but it would likely require a substantial expansion of the regular expression patterns and/or a spellchecker trained on a sizeable corpus. Peter Norvig has code for a lightweight spellchecker available on his site.

To make the spellchecker work, however, you would need to be selective about the errors you feed it because an acronym that carries meaning in the humanities such as "ADE" may be changed to "are"--a token already filtered out by the stopword list. The issue for me is whether the gains achieved through spelling correction would outweigh the noise introduced by the automated spellchecking program. At this time, I do not feel that a spellchecker is worth the trouble.

Readers should also note that the errors targeted by the replacement_patterns are specific to the errors found in the October 2012 MLA Job Information List. When moving on to other files in the MLA Job Information List corpus, you would likely need to expand the list of replacements.
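For example, the OCR error "prograin" quoted earlier is not covered by the current patterns; one hypothetical way to extend the list before instantiating the replacer:

# hypothetical additions for recurring, word-specific OCR errors
additional_patterns = [(r'prograin', 'program')]
replacement_patterns.extend(additional_patterns)
replacer = RegexpReplacer(patterns=replacement_patterns)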