Sunday, 11 October 2015

Characterizing the Google Books Corpus: Mitigating the effect of putative influencers who have a low dissemination level.

Pechenick, Danforth & Dodds recently suggested [1] that treating frequency trends from the Google Books data sets as indicators of the "true" popularity of various words and phrases is problematic: a single, prolific author can noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. They argue that this calls into question the vast majority of existing claims drawn from the Google Books corpus.

I'd like to humbly suggest a new methodology: that such claims be weighted by the citation index of the book, which should mitigate the "unread influencer" syndrome. That is, each occurrence in the frequency count is multiplied by the citation index of the publication or book in which the phrase occurs. If a book is never cited, it contributes nothing. This should correct for the prolific but unread (or at least uncited) author.
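
To make the idea concrete, here is a minimal sketch of the weighting, not anything from the paper or from Google's tooling. The names are all hypothetical: `occurrences` maps each book to how often a phrase appears in it, and `citations` maps each book to a citation count obtained from some external source.

```python
def citation_weighted_count(occurrences, citations):
    """Sum phrase occurrences, weighting each book by its citation count.

    Books that are never cited contribute zero, so a prolific but
    uncited author cannot inflate the total.
    """
    total = 0
    for book, count in occurrences.items():
        total += count * citations.get(book, 0)
    return total


# Illustrative numbers only: one prolific, uncited author versus two cited books.
occurrences = {"uncited_author_vol_1": 500, "cited_book_a": 3, "cited_book_b": 7}
citations = {"cited_book_a": 120, "cited_book_b": 45}

print(citation_weighted_count(occurrences, citations))  # 3*120 + 7*45 = 675
```

In this toy example the 500 occurrences from the uncited author vanish entirely, while the modest counts from the cited books dominate the weighted total.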


Post Script: "Neuroskeptic" observes, quite rightly, that the "books dataset" as exposed in the Ngram Viewer does not contain this information. The primary data are, of course, the books themselves rather than the "books dataset", and, as is often the case, we have to go back to the primary data rather than a flawed subset of it. At the very least, the books dataset could be used alongside Google Scholar, hardly a great challenge for serious work.
