Finding Genre Signals in Academic Writing: Benchmarking Method

The following post functions as supplementary material for “Finding Genre Signals in Academic Writing” for the Journal of Writing Research. This post explains how we automatically processed 505 research articles from the Springer OpenAccess database to separate citational sentences from non-citational sentences. While the primary analysis of “Finding Genre Signals in Academic Writing” relies on hand-coded sentences, we developed this automated routine to test the viability of our citational coding scheme (which targets the lexical content of the sentence) and with an eye toward future citation analysis projects that may benefit from automated analysis.

To gather citation data with which to benchmark our coding scheme and surface-level parser, and to gain a global sense of how the Extraction, Grouping, and Author(s) as Actant(s) citational types operated within the larger field of academic research, we screen scraped 505 research articles from journals hosted by Springer OpenAccess. These journals are peer reviewed and write to the genre conventions of academic audiences, including the Introduction-Methods-Results-Discussion (IMRaD) format often used to structure scientific and social scientific journals (see Christensen and Kawakami, 2009; Hannick and Flanigan, 2013; Salager-Meyer, 1994). This screen scrape captured the meta-data of the article (author names, date of publication, institutional affiliation, and digital object identifier), the full text of the article without images, and the works cited list. Only articles labeled as “Research Article” by the Springer OpenAccess filtering tool were used for this exploratory analysis.
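A minimal sketch of the scraping step might look like the following. The CSS selectors and the citation_doi meta tag are assumptions standing in for whatever markup Springer OpenAccess served at the time of the study, not a record of our production scraper.

import requests
from bs4 import BeautifulSoup

def scrape_article(url: str) -> dict:
    """Fetch one article page and pull out metadata, full text, and the
    works cited list. The CSS selectors are hypothetical placeholders."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    doi_meta = soup.find("meta", attrs={"name": "citation_doi"})  # assumed meta tag
    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "authors": [a.get_text(strip=True) for a in soup.select(".authors a")],
        "doi": doi_meta["content"] if doi_meta else "",
        "full_text": soup.get_text(" ", strip=True),  # text content only; images dropped
        "references": [r.get_text(" ", strip=True) for r in soup.select("li.citation")],
    }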

We then tokenized each article at the sentence level. Using regular expression searches, we tagged all in-text citations and non-citations. For this study, an in-text citation denotes a sentence-token that attributes a source via author name, or author name and date of publication, in Harvard-style in-text citation formatting. Because the citation style varied across journals due to vagaries in HTML markup presentations, we narrowed our selection to those journals that employed the following in-text citation patterns (an illustrative sketch of the matching routine follows the list below):

Author last name (Year of Publication)
2 author last names (Year of Publication)
First author last name, et al. (Year of Publication)
Author last name (Year of Publication + a-z index where different articles by the same authors appear)
2 author last names (Year of Publication + a-z index where different articles by the same authors appear)
First author last name, et al. (Year of Publication + a-z index where different articles by the same authors appear)
Author last name [Year of Publication]
2 author last names [Year of Publication]
First author last name, et al. [Year of Publication]
Author last name [Year of Publication + a-z index where different articles by the same authors appear]
2 author last names [Year of Publication + a-z index where different articles by the same authors appear]
First author last name, et al. [Year of Publication + a-z index where different articles by the same authors appear]
Author last name ([Year of Publication])
2 author last names ([Year of Publication])
First author last name, et al. ([Year of Publication])
Author last name ([Year of Publication + a-z index where different articles by the same authors appear])
2 author last names ([Year of Publication + a-z index where different articles by the same authors appear])
First author last name, et al. ([Year of Publication + a-z index where different articles by the same authors appear])
(Author last name Year of Publication)
(2 author last names [Year of Publication])
(First author last name, et al. [Year of Publication])
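As an illustrative sketch of the tokenizing and tagging steps, the routine below uses NLTK's sentence tokenizer and one compact regular expression. The study used a fuller battery of expressions, one per pattern above, so treat the names and the expression here as assumptions rather than the production code.

import re
import nltk  # assumes nltk.download("punkt") has been run

# One compact expression covering the narrative Harvard-style patterns above;
# purely parenthetical and three-plus-author forms would need further alternatives.
SURNAME = r"[A-Z][a-z]+"
NAMES = rf"{SURNAME}(?:\s+(?:and|&)\s+{SURNAME})?(?:,?\s+et al\.?)?"
YEAR = r"(?:19|20)\d{2}[a-z]?"  # publication year plus optional a-z index
CITATION = re.compile(rf"{NAMES}\s*(?:\(\[?{YEAR}\]?\)|\[{YEAR}\])")

def tag_sentences(article_text: str) -> list:
    """Tokenize at the sentence level and tag each sentence 1 (in-text
    citation) or 0 (non-in-text citation)."""
    return [(s, 1 if CITATION.search(s) else 0)
            for s in nltk.sent_tokenize(article_text)]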

Works cited entries were excluded from this pattern matching. In addition, statements that may be considered citational in nature but did not contain explicit references to authors or dates of publication were also excluded. For example, consider this sentence from Ogada, et al. (2014):

These authors concluded that initial adoption may be low due to imperfect information on management and profitability of the new technology but as this becomes clearer from the experiences of their neighbors and their own experience, adoption is scaled up.

While this sentence functions to synthesize the work of several authors previously cited in the article by Ogada, et al. (2014), it does not contain markers of author attribution or date of publication. Thus, interpretive sentences of this type were not included in the initial in-text citation search. Although we do see the potential contribution of tracking these rhetorical moves of extended synthesis, making judgments about the nature of such moves proved difficult for the lexical pattern matching routines.

For sentences that name authors but do not provide a date of publication, we configured the screen scraper program to parse the DOM tree of the article for its References section. The last name of the primary author of each cited publication is sequestered into a list. If an in-text citation has stumped the initial regular expression searching parameters and received a tag of “non-intext-citation”, then the script checks for the presence of a primary author’s last name in the sentence by comparing the extant words with the list of author last names compiled in the screen scrape. To limit spurious matches, only names greater than two characters in length are retained. If there is a match between a first author’s last name and a word in the “non-intext-citation” sentence, the tag is changed to “intext-citation.” This update of the search protocol assumes that a correspondence between a capitalized word and an author name listed in the reference section of the article most likely indicates an in-text citation. In some cases, an author’s last name can also function as a content word (verb or noun), leading to a falsely assigned label. For a generic example, consider an author whose last name is “House.” The entry “House” would not match “house” because the latter lacks an initial capital letter; however, a sentence containing the collocation “White House” would lead to a false positive. Another false positive that we encountered in the study occurs when an article discusses an organization and cites work produced by that organization or by other organizations that share a similar appellation. For example, in Rissler, et al. (2014), the authors write:

In the only nationwide survey of high school science teachers (n = 939), Berkman et al ([2008]) found that at least 17% of biology teachers are young-earth creationists, and about one in eight teach creationism or intelligent design in a positive light.

Only 23% of teachers strongly agreed that evolution is the unifying theme of biology, as accepted by the National Academy of Science and the National Research Council.

The first sentence from Rissler et al (2014) is tagged as a citation because of the reference to “Berkman et al ([2008]).” The second sentence draws its information from the sources named in the previous sentence. By our thin definition of what constitutes an in-text citation, the second sentence should not be tagged; however, because the word “Council” is present in the second sentence and “Council” appears in the article’s reference list in the position of a last name, the second sentence is classed as an in-text citation. We consider this cross-referencing step a contingency for articles that may present copy-editing inconsistent with the journal style guide. The bank of names harvested from the reference section of each article is also reused in a subsequent processing step, which replaces an instance of an author’s name in the text with the cognate tag “AUTHOR.”
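A minimal sketch of this fallback check, assuming the surnames have already been harvested into a set, might read as follows; the function and variable names are illustrative, not the study's actual identifiers.

def recheck_tag(sentence: str, tag: str, author_surnames: set) -> str:
    """Promote a "non-intext-citation" tag to "intext-citation" when a word in
    the sentence matches a surname harvested from the References section."""
    if tag != "non-intext-citation":
        return tag
    # Only surnames longer than two characters are retained, limiting
    # spurious matches against short capitalized words.
    surnames = {name for name in author_surnames if len(name) > 2}
    # Strip surrounding punctuation; the comparison stays case-sensitive,
    # so "House" matches while "house" does not.
    words = {w.strip(".,;:()[]") for w in sentence.split()}
    return "intext-citation" if words & surnames else tag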

When one of the above citational conditions is met by the lexical content of a sentence, that sentence is tagged as an in-text citation (1). Those sentences that do not match the lexical patterns above receive a non-in-text citation (0) tag.

After initial processing by the screen scraping and the in-text citation/non-in-text citation tagging routines, we then pass the marked sentences, now annotated with a 0 or 1, to a second processing module, whose goal is to reduce the syntactic complexity of the in-text citation to more general cognates. This second processing module makes the following substitutions, sketched in code after the list:
publication years featured in in-text citations are replaced with the tag “PUBYEAR”
an author’s last name, if found in the list of names harvested from the reference section of the article, is replaced with the tag “AUTHOR”
parts of speech are tagged by a pre-trained pos-tagger, which relies on the Penn Part of Speech tags (see https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html); only those part of speech tags which indicate verbs, prepositions, and determiners are retained and inserted into the body of the sentence
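One way to sketch this substitution module, assuming NLTK's pre-trained Penn treebank tagger and the surname list harvested in the previous step, is shown below. The exact treatment of each tag class is our best reading of the transformed examples that follow, not the study's published code.

import re
import nltk  # assumes the pre-trained averaged-perceptron tagger data is installed

def generalize(sentence: str, author_surnames: set) -> str:
    """Replace years with PUBYEAR and harvested surnames with AUTHOR; reduce
    determiners (DT) and prepositions (IN) to their Penn tags; prefix verbs
    with their Penn tags."""
    out = []
    for word, pos in nltk.pos_tag(nltk.word_tokenize(sentence)):
        if re.fullmatch(r"(?:19|20)\d{2}[a-z]?", word):
            out.append("PUBYEAR")        # publication year
        elif word in author_surnames:
            out.append("AUTHOR")         # primary author surname
        elif pos in {"DT", "IN"}:
            out.append(pos)              # determiner/preposition reduced to its tag
        elif pos.startswith("VB"):
            out.append(pos + " " + word) # verb keeps the word, tag prefixed
        else:
            out.append(word)
    return " ".join(out)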

As an example, we can consider the following sentence from Otten, et al. (2015):

Product Portfolio Management (PPM) is a dynamic decision process, whereby a business list of active (new) products (and R&D) projects is constantly updated and revised (Cooper, Edgett, & Kleinschmidt, [2001]).

Given the above processing step, that sentence would be transformed into:

Product Portfolio Management (PPM) VBZ is DT dynamic decision process, whereby DT business list IN active (new) products (and R&D) projects VBZ is constantly updated and VBN revised (AUTHOR, Edgett, & Kleinschmidt, PUBYEAR).

After each sentence is tagged by selected parts of speech, AUTHOR, and PUBYEAR, the configuration and/or quantity of the tags are assessed in a third processing module. This third processing module applies the citational coding scheme discussed above as numerical tags: Extraction (1), Grouping (2), and Author(s) as Actant(s) (3), by comparing the parts of speech, AUTHOR, and PUBYEAR tags to hard-coded lexical patterns fitted to each category of the coding scheme. If a sentence contains parts of speech, AUTHOR, and PUBYEAR tags that match the category of Extraction in-text citation, then that sentence will receive a 1, and so on. This processing module works through elimination (a sketch follows the list below):

ignore all sentences tagged as non-in-text citations
tag all in-text citations in which PUBYEAR appears more than 2x as Grouping (2)
compare remaining sentences (i.e., not Grouping) with AUTHOR, parts of speech, and PUBYEAR pattern and designate matches as Author(s) as Actant(s) (3)
tag all remaining in-text citations (i.e., not Grouping or Author(s) as Actant(s)) as Extraction (1)
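A minimal sketch of this elimination cascade, operating on generalized sentences like the examples below, might read as follows; the single ACTANT regular expression is one illustrative stand-in for the study's hard-coded patterns.

import re

# One illustrative Author(s)-as-Actant(s) pattern: a verb tag and its verb,
# an optional preposition, then AUTHOR followed later in the span by PUBYEAR.
ACTANT = re.compile(r"\bVB[DGNPZ]?\s+\w+\s+(?:IN\s+)?\(?AUTHOR\b.*?\bPUBYEAR\b")

def code_citation(tagged_sentence: str, is_citation: int) -> int:
    """Return 0 for non-citations, 2 for Grouping, 3 for Author(s) as
    Actant(s), and 1 for the residual Extraction category."""
    if not is_citation:
        return 0  # ignore all sentences tagged as non-in-text citations
    if tagged_sentence.count("PUBYEAR") > 2:
        return 2  # Grouping: three or more sources within one sentence
    if ACTANT.search(tagged_sentence):
        return 3  # Author(s) as Actant(s)
    return 1      # Extraction: everything that remains

Run against the Correa Bahnsen, et al. (2015) example further below, the sequence “IN VBN discussed IN (AUTHOR et al PUBYEAR)” satisfies the ACTANT pattern and receives a 3.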

In one sense, the third processing module moves from the most deterministic coding category to the least deterministic. For our coding scheme, any in-text citation that refers to 3 or more sources within the boundary of the sentence is considered Grouping, regardless of the grammatical construction or the presence of other features that may match Author(s) as Actant(s) or Extraction. We may consider an example sentence from Otten, et al. (2015):

A viable alternative for determining the product portfolio (product assortment) is the use of a data mining approach, called Association Rule Mining (ARM), which exposes the interrelationships between products by inspecting a transactional dataset (Brijs, Swinnen, Vanhoof, & Wets, [1999]; [2004]; Fayyad, Piatetsky-Shapiro, & Smyth [1996a], [1996b], [1996c]).

An annotated example of the above Grouping sentence would appear like the following:

DT viable alternative IN determining DT product portfolio (product assortment) VBZ is DT use IN DT data mining approach, VBN called Association Rule Mining (ARM), which exposes DT interrelationships IN products IN inspecting DT transactional dataset (AUTHOR, Swinnen, Vanhoof, & Wets, PUBYEAR; PUBYEAR; Fayyad, Piatetsky-Shapiro, & PUBYEAR, PUBYEAR, PUBYEAR). (Otten et al. 2015)

The next most deterministic category is Author(s) as Actant(s) because this category demands that an author be named in the sentence and function as the subject or as the receiver of an action, and that references to other sources number fewer than 3. Because the Author(s) as Actant(s) category cannot contain more than 2 references, it is excluded from the Grouping category by default. It is excluded from the Extraction category because it will contain a direct authorial attribution in which the named author is performing the action of the sentence or is the object of the verb of the sentence. Take for example the following sentence from Correa Bahnsen, et al. (2015):

Moreover, as discussed in (Verbraken et al [2013]), if the average instead of the total profit is considered and the fixed cost A is discarded since is irrelevant for classifier selection, the profit can be expressed as: (2) Nevertheless, equations (1) and (2), assume that every customer has the same CLV and Co, whereas this is not true in practice.

The above sentence would be tagged in the following manner:

Moreover, IN VBN discussed IN (AUTHOR et al PUBYEAR), IN DT average instead IN DT total profit VBZ is VBN considered and DT VBN fixed cost DT VBZ is VBN discarded IN VBZ is irrelevant IN classifier selection, DT profit can VB be VBD expressed as: (2) Nevertheless, equations (1) and (2), VBP assume IN DT customer VBZ has DT same CLV and Co, whereas DT VBZ is not true IN practice.

In the above example, the key sequence is “IN VBN discussed IN (AUTHOR et al PUBYEAR).” The pattern of past participle verb tag + past participle verb + preposition + AUTHOR tag + PUBYEAR tag corresponds to a pre-existing arrangement in the module 3 processor, which assumes that an AUTHOR tag immediately following a verb clause indicates that an action is being attributed to a named author.

All in-text citation sentences that have not received a Grouping (2) or Author(s) as Actant(s) (3) classification are then automatically tagged as Extraction (1). Programmatically, an Extraction (1) classification is any in-text citation that has fewer than three PUBYEAR tags and does not attribute action to a named author within the boundaries of the sentence by making the author the subject or object of an action in an independent or subordinate clause. An example from Correa Bahnsen, et al. (2015) would be:

This assumption does not hold in many real-world applications such as churn modeling, since when misidentifying a churner the financial losses are quite different than when misclassifying a non-churner as churner (Glady et al [2009]).

After processing, the above sentence would be tagged as:

DT assumption VBZ does not VBP hold IN many real-world applications such IN churn modeling, IN when misidentifying DT churner DT financial losses VBP are quite different IN when misclassifying DT non-churner IN churner (AUTHOR et al PUBYEAR).

As we noted in the beginning of the article, the ultimate aim of our work is to accomplish two tasks: (1) compare advisor and advisee texts and (2) output measures of comparison that inform a rhetorical reading of citational moves in academic writing. Doing so means converting raw advisor and advisee texts into computational objects and selecting features from those objects that offer relevant quantitative and qualitative information. In this first pass, the computational object was a “string.” In the next pass, we convert texts to another kind of computational object, a graph, for further analysis.