Time-series n-gram analysis

The arabica_freq method takes text data, enables standard cleaning operations, and provides n-gram (unigram, bigram, and trigram) frequencies aggregated by year, month, or day.

It automatically removes punctuation from the input (using cleantext). It can also apply all or a selected combination of the following cleaning operations:

  • Remove digits from the text

  • Remove standard list(s) of stop words (using NLTK)

  • Remove an additional specific list of words

Stop words are generally the most common words in a language with no significant meaning, such as “is”, “am”, “the”, “this”, “are”, etc. They are often filtered out because they bring low or zero information value. Arabica enables stopword removal for languages in the NLTK corpus.

To print all available languages:

from nltk.corpus import stopwords
print(stopwords.fileids())

Several sets of stop words can be removed at once by passing a list of languages:

stopwords = ['english', 'french']  # add further NLTK language names as needed

Coding example

Use case: Fake news in newspaper headlines during the Covid-19 pandemic

Data: Fake-Real News dataset, period: 2019-12-02 to 2020-06-19, source: Politifact.com, data licence: CC BY-SA 4.0.

Coding:

import pandas as pd
from arabica import arabica_freq

data = pd.read_csv('headlines.csv', encoding='utf8')

The data looks like this:

| headline | date |
|----------|------|
| Illinois “got into fiscal problems because of a Republican governor who was governor there | May 8, 2020 |
| Black cats in Vietnam are being killed and consumed as a COVID-19 cure | May 8, 2020 |
| Georgia Gov. Brian Kemp “mandates restaurants reopen | May 8, 2020 |
| Central Park hospital tents housed thousands of abused children released from underground captivity | May 8, 2020 |
| New autopsy reports suggest Jeffrey Epstein most likely died from COVID-19 complications | May 8, 2020 |

arabica_freq proceeds in this way:

  • additional stop words cleaning, if skip is not None

  • lowercasing: the text is made lowercase so that capital letters don’t affect n-gram calculations (e.g., “Tree” is not treated differently from “tree”), if lower_case = True

  • punctuation cleaning, performed automatically

  • stop words removal, if stopwords is not None

  • digits removal, if numbers = True

  • n-gram frequencies for each headline are calculated, summed, and aggregated by the specified frequency.
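The steps above can be sketched in plain Python using only the standard library. This is an illustration of the processing order, not Arabica's actual implementation: the helper names (clean, ngrams, ngram_freq), the tiny STOP_WORDS set, and the date pattern '%B %d, %Y' are assumptions made for this example, and the sketch aggregates by month only.

```python
import re
import string
from collections import Counter, defaultdict
from datetime import datetime

# Tiny illustrative stop-word set (an assumption; Arabica relies on NLTK's lists).
STOP_WORDS = {"a", "and", "are", "as", "from", "in", "is", "of", "the"}


def clean(text, skip=None, stop_words=None, numbers=True, lower_case=True):
    """Apply the cleaning steps in the order listed above; return a word list."""
    for token in skip or []:          # additional stop words
        text = text.replace(token, " ")
    if lower_case:                    # lowercasing
        text = text.lower()
    # Punctuation cleaning (ASCII punctuation only in this sketch).
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = text.split()
    if stop_words:                    # stop words removal
        words = [w for w in words if w.lower() not in stop_words]
    if numbers:                       # digits removal
        words = [re.sub(r"\d+", "", w) for w in words]
    return [w for w in words if w]    # drop words emptied by the cleaning


def ngrams(words, n):
    """Return comma-joined n-grams, e.g. ['a,b', 'b,c'] for n = 2."""
    return [",".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_freq(texts, dates, n=1, max_words=3, **clean_kwargs):
    """Count n-grams per month and keep the max_words most frequent ones."""
    counts = defaultdict(Counter)
    for text, date in zip(texts, dates):
        # Assumes dates written like 'May 8, 2020'.
        period = datetime.strptime(date, "%B %d, %Y").strftime("%Y-%m")
        counts[period].update(ngrams(clean(text, **clean_kwargs), n))
    return {p: c.most_common(max_words) for p, c in sorted(counts.items())}
```

In Arabica itself, all of these steps are handled by a single call: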

arabica_freq(text = data['headline'],
             time = data['date'],
             date_format = 'us',          # Use US-style date format to parse dates
             time_freq = 'M',             # Aggregation period: 'D' = daily, 'M' = monthly, 'Y' = yearly
             max_words = 3,               # Display the three most frequent n-grams for each period
             stopwords = ['english'],     # Remove English set of stopwords
             skip = ['<br />'],           # Remove additional stop words
             numbers = True,              # Remove numbers
             lower_case = True)           # Lowercase text

The output is a dataframe with n-grams in monthly frequency:

| period | unigram | bigram | trigram |
|--------|---------|--------|---------|
| 2019-12 | says: 48,trump: 12,president: 12 | says,photo: 6,donald,trump: 6,photo,shows: 5 | says,photo,shows: 5,president,donald,trump: 4,dirtier,dirtier,dirtier: 2 |
| 2020-01 | says: 78,shows: 20,us: 17 | video,shows: 8,says,photo: 7,kobe,bryant: 7 | says,video,shows: 6,says,photo,shows: 6,iranian,rockets,launched: 4 |
| 2020-02 | says: 77,trump: 20,president: 18 | bernie,sanders: 9,photo,shows: 8,nancy,pelosi: 8 | says,photo,shows: 5,says,bernie,sanders: 4,works,white,house: 4 |
| 2020-03 | says: 81,coronavirus: 76,people: 29 | joe,biden: 17,bernie,sanders: 12,donald,trump: 12 | says,joe,biden: 6,president,donald,trump: 5,video,shows,joe: 3 |
| 2020-04 | says: 66,covid: 39,coronavirus: 31 | new,york: 8,photo,shows: 5,feb,feb: 5 | new,york,city: 4,says,video,shows: 3,feb,feb,feb: 3 |
| 2020-05 | says: 38,covid: 33,coronavirus: 21 | joe,biden: 8,photo,shows: 8,donald,trump: 7 | president,donald,trump: 5,says,president,donald: 4,says,gov,tony: 3 |
| 2020-06 | says: 31,trump: 17,police: 16 | donald,trump: 11,last,year: 5,george,soros: 5 | require,years,training: 3,training,people,killed: 3,people,killed,since: 3 |
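Each cell of the output packs several n-grams and their counts into one string. For further analysis it can help to unpack a cell into a dictionary. The helper below is a hypothetical sketch (parse_cell is not part of Arabica) and assumes that cells are strings formatted exactly as in the table above, with words joined by commas and each count following ": ".

```python
import re


def parse_cell(cell):
    """Turn 'says,photo: 6,donald,trump: 6' into {'says,photo': 6, 'donald,trump': 6}."""
    # Each entry is "<n-gram>: <count>", optionally followed by a comma.
    pairs = re.findall(r"([^:]+): (\d+),?", cell)
    return {ngram: int(count) for ngram, count in pairs}


# Unigram cells copied from the table above.
unigrams = {
    "2020-03": "says: 81,coronavirus: 76,people: 29",
    "2020-04": "says: 66,covid: 39,coronavirus: 31",
}

# Follow one term across periods (0 if it is not among the top n-grams).
for period, cell in unigrams.items():
    print(period, parse_cell(cell).get("coronavirus", 0))
```

A term that is missing from a period is not necessarily absent from the headlines; it only fell outside the max_words most frequent n-grams for that period.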

The n-grams indicate that the US presidential election was the key topic in the headlines until the outbreak of Covid-19 in March 2020. In June 2020, George Soros and the George Floyd case dominated the fake news in public debate.

Download the jupyter notebook with the code and the data here.