Time-series n-gram analysis
The arabica_freq method takes text data, applies standard cleaning operations, and returns n-gram (unigram, bigram, and trigram) frequencies aggregated by year, month, or day.
It automatically removes punctuation from the input (using cleantext). It can also apply all or a selected combination of the following cleaning operations:
Remove digits from the text
Remove standard list(s) of stop words (using NLTK)
Remove an extended list of stop words (currently for English only)
Remove an additional specific list of strings
Stop words are generally the most common words in a language that carry no significant meaning, such as “is”, “am”, “the”, “this”, “are”, etc. They are often filtered out because they bring low or zero information value. Arabica supports stop word removal for the languages in the NLTK corpus and also offers an extended stop word list for further cleaning (currently provided lists: ‘english’).
To print all available languages:
from nltk.corpus import stopwords
print(stopwords.fileids())
Several sets of stop words can be removed at once by passing a list of languages:
stopwords = ['english', 'french', 'etc..']
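To illustrate what removing several stop word sets at once means, here is a minimal sketch in plain Python. It is not Arabica's implementation: `SAMPLE_STOPWORDS` and `remove_stopwords` are hypothetical names, and the word lists are tiny hand-written samples rather than the NLTK lists.

```python
# Tiny hand-written stop word samples, used here only for illustration.
# They are NOT the NLTK lists and NOT Arabica's internal data.
SAMPLE_STOPWORDS = {
    "english": {"is", "am", "the", "this", "are"},
    "french": {"le", "la", "les", "est", "et"},
}


def remove_stopwords(text, languages):
    """Drop every token that appears in any of the selected stop word lists."""
    combined = set()
    for language in languages:
        combined |= SAMPLE_STOPWORDS[language]  # union of all selected lists
    # Compare in lowercase so that "This" and "this" are treated alike.
    return " ".join(tok for tok in text.split() if tok.lower() not in combined)


cleaned = remove_stopwords("This is the tree et la maison", ["english", "french"])
print(cleaned)  # tree maison
```

Taking the union of the lists before filtering means each token is checked once, no matter how many languages are selected.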
Coding example
Use case: Fake news in newspaper headlines during the Covid-19 pandemic
Data: Fake-Real News dataset, period: 2019-12-02 to 2020-06-19, source: Politifact.com, data licence: CC BY-SA 4.0.
Coding:
import pandas as pd
from arabica import arabica_freq
data = pd.read_csv('headlines.csv', encoding='utf8')
The data looks like this:
| headline | date |
|---|---|
| Illinois “got into fiscal problems because of a Republican governor who was governor there | May 8, 2020 |
| Black cats in Vietnam are being killed and consumed as a COVID-19 cure | May 8, 2020 |
| Georgia Gov. Brian Kemp “mandates restaurants reopen | May 8, 2020 |
| Central Park hospital tents housed thousands of abused children released from underground captivity | May 8, 2020 |
| New autopsy reports suggest Jeffrey Epstein most likely died from COVID-19 complications | May 8, 2020 |
It proceeds in this way:
additional strings cleaning, if skip is not None
lowercasing: headlines are made lowercase so that capital letters don’t affect n-gram calculations (e.g., “Tree” is not treated differently from “tree”), if lower_case = True
punctuation cleaning, performed automatically
stop words removal, if stopwords is not None
extended stop words removal, if stopwords_ext is not None
digits removal, if numbers = True
n-gram frequencies for each headline are calculated, summed, and aggregated by a specified frequency.
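The steps above can be sketched with the Python standard library alone. This is an illustrative approximation, not Arabica's code: `STOPWORDS`, `clean`, `ngrams`, and `ngram_freq` are hypothetical names, the stop word set is a hand-written sample, `string.punctuation` covers only ASCII punctuation (unlike cleantext), and the date pattern `"%B %d, %Y"` is an assumption that matches sample dates such as "May 8, 2020".

```python
import re
import string
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical miniature stop word list, for illustration only.
STOPWORDS = {"the", "a", "of", "in", "are", "as", "and", "is"}


def clean(text, skip=None, lower_case=True, stopwords=None, numbers=True):
    """Apply the cleaning steps in the order listed above and return tokens."""
    for s in skip or []:                      # 1. additional strings
        text = text.replace(s, " ")
    if lower_case:                            # 2. lowercasing
        text = text.lower()
    # 3. punctuation (ASCII only in this sketch)
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if stopwords:                             # 4. stop words
        tokens = [t for t in tokens if t not in stopwords]
    if numbers:                               # 5. digits
        tokens = [re.sub(r"\d+", "", t) for t in tokens]
        tokens = [t for t in tokens if t]     # drop tokens that were only digits
    return tokens


def ngrams(tokens, n):
    """Return comma-joined n-grams, e.g. ['black,cats', 'cats,vietnam']."""
    return [",".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_freq(texts, dates, n=1, max_words=3, **clean_options):
    """Count n-grams per headline and sum them by month (6.)."""
    counts = defaultdict(Counter)
    for text, date in zip(texts, dates):
        period = datetime.strptime(date, "%B %d, %Y").strftime("%Y-%m")
        counts[period].update(ngrams(clean(text, **clean_options), n))
    return {p: c.most_common(max_words) for p, c in sorted(counts.items())}


headlines = [
    "Black cats in Vietnam are being killed and consumed as a COVID-19 cure",
    "COVID-19 cure found in black cats<br />",
]
dates = ["May 8, 2020", "May 9, 2020"]
result = ngram_freq(headlines, dates, n=2, max_words=2,
                    skip=["<br />"], stopwords=STOPWORDS)
print(result)
```

Because digits are stripped after punctuation, "COVID-19" ends up as the token "covid" in this sketch, which is consistent with "covid" appearing as a unigram in the output table.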
arabica_freq(text = data['headline'],
             time = data['date'],
             date_format = 'us',           # Use US-style date format to parse dates
             time_freq = 'M',              # Aggregation period: 'D' = daily, 'M' = monthly, 'Y' = yearly
             max_words = 3,                # Display the three most frequent n-grams for each period
             stopwords = ['english'],      # Remove the English set of stop words
             stopwords_ext = ['english'],  # Remove the extended list of English stop words
             skip = ['<br />'],            # Remove additional strings. Cuts the characters out without tokenization, useful for specific or rare characters. Be careful not to bias the dataset.
             numbers = True,               # Remove numbers
             lower_case = True)            # Lowercase text
The output is a dataframe with n-grams in monthly frequency:
| period | unigram | bigram | trigram |
|---|---|---|---|
| 2019-12 | says: 48,trump: 12,president: 12 | says,photo: 6,donald,trump: 6,photo,shows: 5 | says,photo,shows: 5,president,donald,trump: 4,dirtier,dirtier,dirtier: 2 |
| 2020-01 | says: 78,shows: 20,us: 17 | video,shows: 8,says,photo: 7,kobe,bryant: 7 | says,video,shows: 6,says,photo,shows: 6,iranian,rockets,launched: 4 |
| 2020-02 | says: 77,trump: 20,president: 18 | bernie,sanders: 9,photo,shows: 8,nancy,pelosi: 8 | says,photo,shows: 5,says,bernie,sanders: 4,works,white,house: 4 |
| 2020-03 | says: 81,coronavirus: 76,people: 29 | joe,biden: 17,bernie,sanders: 12,donald,trump: 12 | says,joe,biden: 6,president,donald,trump: 5,video,shows,joe: 3 |
| 2020-04 | says: 66,covid: 39,coronavirus: 31 | new,york: 8,photo,shows: 5,feb,feb: 5 | new,york,city: 4,says,video,shows: 3,feb,feb,feb: 3 |
| 2020-05 | says: 38,covid: 33,coronavirus: 21 | joe,biden: 8,photo,shows: 8,donald,trump: 7 | president,donald,trump: 5,says,president,donald: 4,says,gov,tony: 3 |
| 2020-06 | says: 31,trump: 17,police: 16 | donald,trump: 11,last,year: 5,george,soros: 5 | require,years,training: 3,training,people,killed: 3,people,killed,since: 3 |
The n-grams indicate that the key topic discussed in the headlines was the US presidential election until the outbreak of Covid-19 in March 2020. In June 2020, George Soros and the George Floyd case dominated the fake news in public debate.
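Each n-gram cell in the output packs several n-grams and their counts into one string, such as "says,photo: 6,donald,trump: 6". For further analysis it can help to split these into pairs. The sketch below assumes the cell format is exactly as shown in the table above; `parse_ngram_cell` is a hypothetical helper, not part of Arabica.

```python
import re


def parse_ngram_cell(cell):
    """Split a cell like 'says,photo: 6,donald,trump: 6' into (n-gram, count) pairs.

    Assumes each entry has the form '<word>[,<word>...]: <count>' and that
    entries are separated by a single comma, as in the table above.
    """
    pairs = re.findall(r"(.+?): (\d+)(?:,|$)", cell)
    return [(tuple(ngram.split(",")), int(count)) for ngram, count in pairs]


print(parse_ngram_cell("says: 48,trump: 12,president: 12"))
# [(('says',), 48), (('trump',), 12), (('president',), 12)]

print(parse_ngram_cell("new,york,city: 4,says,video,shows: 3,feb,feb,feb: 3"))
# [(('new', 'york', 'city'), 4), (('says', 'video', 'shows'), 3), (('feb', 'feb', 'feb'), 3)]
```

The helper anchors on the ": count" suffix rather than splitting on commas, because commas also separate the words inside a bigram or trigram.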
Download the Jupyter notebook with the code and the data here.