Time-series n-gram analysis
The arabica_freq method takes text data, enables standard cleaning operations, and provides n-gram (unigram, bigram, and trigram) frequencies aggregated by year, month, or day.
On input, it automatically strips punctuation from the data (using cleantext). It can also apply all or a selected combination of the following cleaning operations:
Remove digits from the text
Remove standard list(s) of stop words (using NLTK)
Remove an additional specific list of words
Stop words are generally the most common words in a language that carry little meaning on their own, such as “is”, “am”, “the”, “this”, “are”, etc. They are often filtered out because they carry little or no information value. Arabica enables stopword removal for the languages available in the NLTK stopwords corpus.
To print all available languages:
from nltk.corpus import stopwords
print(stopwords.fileids())
Several sets of stopwords can be removed at once by passing a list of languages:
stopwords = ['english', 'french', 'etc..']
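To illustrate what removing several stopword sets at once amounts to, here is a minimal sketch that unions the per-language sets and filters tokens against the result. It is not Arabica's internal code: `STOPWORD_LISTS`, `combined_stopwords`, and `remove_stopwords` are hypothetical names, and the tiny hand-written sets only stand in for the full lists that NLTK's `stopwords.words(language)` would return.

```python
# Tiny hand-written stand-ins for NLTK's stopword lists (hypothetical and
# far from complete); with NLTK installed, stopwords.words(language)
# would supply the real lists.
STOPWORD_LISTS = {
    "english": {"is", "am", "the", "this", "are"},
    "french": {"le", "la", "les", "et", "est"},
}


def combined_stopwords(languages):
    """Union the stopword sets of every requested language."""
    combined = set()
    for language in languages:
        combined |= STOPWORD_LISTS[language]
    return combined


def remove_stopwords(text, languages):
    """Drop every whitespace-separated token found in the combined set."""
    stop = combined_stopwords(languages)
    return " ".join(tok for tok in text.split() if tok.lower() not in stop)


cleaned = remove_stopwords("This is the maison et la voiture", ["english", "french"])
print(cleaned)
```

Using a set for the combined list keeps each membership check cheap, which matters once the full NLTK lists for several languages are merged.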
Coding example
Use case: Fake news in newspaper headlines during the Covid-19 pandemic
Data: Fake-Real News dataset, period: 2019-12-02 to 2020-06-19, source: Politifact.com, data licence: CC BY-SA 4.0.
Coding:
import pandas as pd
from arabica import arabica_freq

data = pd.read_csv('headlines.csv', encoding='utf8')
The data looks like this:
headline | date |
---|---|
Illinois “got into fiscal problems because of a Republican governor who was governor there | May 8, 2020 |
Black cats in Vietnam are being killed and consumed as a COVID-19 cure | May 8, 2020 |
Georgia Gov. Brian Kemp “mandates restaurants reopen | May 8, 2020 |
Central Park hospital tents housed thousands of abused children released from underground captivity | May 8, 2020 |
New autopsy reports suggest Jeffrey Epstein most likely died from COVID-19 complications | May 8, 2020 |
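The dates in this dataset are written out in US style ("May 8, 2020"), while the results are reported per period (for example 2020-05). As a rough illustration of that mapping, the sketch below parses such a string with the standard library and formats it as a daily, monthly, or yearly label. The helper `to_period` is hypothetical, and how Arabica parses dates internally is not shown here; the sketch also assumes English month names.

```python
from datetime import datetime

# Label format for each aggregation period (same letters as time_freq).
PERIOD_FORMATS = {"D": "%Y-%m-%d", "M": "%Y-%m", "Y": "%Y"}


def to_period(date_string, time_freq="M"):
    """Parse a date such as 'May 8, 2020' and return its period label."""
    parsed = datetime.strptime(date_string, "%B %d, %Y")
    return parsed.strftime(PERIOD_FORMATS[time_freq])


monthly = to_period("May 8, 2020")        # monthly label
daily = to_period("May 8, 2020", "D")     # daily label
yearly = to_period("May 8, 2020", "Y")    # yearly label
print(monthly, daily, yearly)
```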
arabica_freq proceeds in this way:
additional stop words cleaning, if skip is not None
lowercasing, if lower_case = True: text is made lowercase so that capital letters don’t affect n-gram calculations (e.g., “Tree” is not treated differently from “tree”)
punctuation cleaning, performed automatically
stop words removal, if stopwords is not None
digits removal, if numbers = True
n-gram frequencies for each headline are calculated, summed, and aggregated by a specified frequency.
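The steps above can be sketched with the standard library alone. This is an illustrative approximation, not Arabica's implementation: `clean`, `ngrams`, and `ngram_freq` are hypothetical helpers, the stopword set is passed in directly rather than looked up by language, punctuation is stripped with `string.punctuation` instead of cleantext, and the rows are assumed to carry a ready-made period label.

```python
import re
import string
from collections import Counter, defaultdict


def clean(text, skip=None, lower_case=True, stopwords=None, numbers=True):
    """Apply the cleaning steps in the order listed above; return tokens."""
    for phrase in skip or []:                      # additional stop words
        text = text.replace(phrase, " ")
    if lower_case:                                 # lowercasing
        text = text.lower()
    # punctuation cleaning always runs
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if stopwords:                                  # stop words removal
        tokens = [t for t in tokens if t not in stopwords]
    if numbers:                                    # digits removal
        tokens = [re.sub(r"\d+", "", t) for t in tokens]
        tokens = [t for t in tokens if t]          # drop emptied tokens
    return tokens


def ngrams(tokens, n):
    """Return consecutive n-token windows, joined with commas."""
    return [",".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_freq(rows, n, max_words=3, **cleaning):
    """Count n-grams per period and keep the max_words most frequent."""
    counts = defaultdict(Counter)
    for period, text in rows:
        counts[period].update(ngrams(clean(text, **cleaning), n))
    return {p: c.most_common(max_words) for p, c in counts.items()}


# Made-up rows for illustration: (period label, text)
rows = [
    ("2020-05", "Photo shows the 2 cats."),
    ("2020-05", "Video shows cats!"),
    ("2020-06", "Photo shows a dog"),
]
unigrams = ngram_freq(rows, n=1, max_words=2, stopwords={"the", "a"})
bigrams = ngram_freq(rows, n=2, max_words=1, stopwords={"the", "a"})
print(unigrams)
print(bigrams)
```

Counting per headline and summing into one `Counter` per period mirrors the last step: frequencies are computed for each text and then aggregated by the chosen period.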
arabica_freq(text = data['headline'],
             time = data['date'],
             date_format = 'us',       # Uses US-style date format to parse dates
             time_freq = 'M',          # Aggregation period: 'D' = daily, 'M' = monthly, 'Y' = yearly
             max_words = 3,            # Displays the three most frequent n-grams for each period
             stopwords = ['english'],  # Remove English set of stopwords
             skip = ['<br />'],        # Remove additional stop words
             numbers = True,           # Remove numbers
             lower_case = True)        # Lowercase text
The output is a dataframe with n-grams in monthly frequency:
period | unigram | bigram | trigram |
---|---|---|---|
2019-12 | says: 48,trump: 12,president: 12 | says,photo: 6,donald,trump: 6,photo,shows: 5 | says,photo,shows: 5,president,donald,trump: 4,dirtier,dirtier,dirtier: 2 |
2020-01 | says: 78,shows: 20,us: 17 | video,shows: 8,says,photo: 7,kobe,bryant: 7 | says,video,shows: 6,says,photo,shows: 6,iranian,rockets,launched: 4 |
2020-02 | says: 77,trump: 20,president: 18 | bernie,sanders: 9,photo,shows: 8,nancy,pelosi: 8 | says,photo,shows: 5,says,bernie,sanders: 4,works,white,house: 4 |
2020-03 | says: 81,coronavirus: 76,people: 29 | joe,biden: 17,bernie,sanders: 12,donald,trump: 12 | says,joe,biden: 6,president,donald,trump: 5,video,shows,joe: 3 |
2020-04 | says: 66,covid: 39,coronavirus: 31 | new,york: 8,photo,shows: 5,feb,feb: 5 | new,york,city: 4,says,video,shows: 3,feb,feb,feb: 3 |
2020-05 | says: 38,covid: 33,coronavirus: 21 | joe,biden: 8,photo,shows: 8,donald,trump: 7 | president,donald,trump: 5,says,president,donald: 4,says,gov,tony: 3 |
2020-06 | says: 31,trump: 17,police: 16 | donald,trump: 11,last,year: 5,george,soros: 5 | require,years,training: 3,training,people,killed: 3,people,killed,since: 3 |
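Each cell packs several n-grams and their counts into one string, with commas separating both the words of an n-gram and the entries themselves. For further processing it can help to split a cell back into (words, count) pairs. The sketch below does this with a regular expression; `parse_cell` is a hypothetical helper, and it assumes the cell format is exactly the one displayed in the table.

```python
import re

# An entry is "<comma-joined words>: <count>", followed by a comma or the
# end of the string. The lazy group stops at the first ": <digits>".
ENTRY = re.compile(r"(.+?): (\d+)(?:,|$)")


def parse_cell(cell):
    """Split one table cell into a list of (words tuple, count) pairs."""
    return [
        (tuple(match.group(1).split(",")), int(match.group(2)))
        for match in ENTRY.finditer(cell)
    ]


bigrams = parse_cell("says,photo: 6,donald,trump: 6,photo,shows: 5")
unigrams = parse_cell("says: 48,trump: 12,president: 12")
print(bigrams)
print(unigrams)
```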
The n-grams indicate that the US presidential election was the key topic in the headlines until the outbreak of Covid-19 in March 2020. In June 2020, George Soros and the George Floyd case dominated fake news in the public debate.
Download the Jupyter notebook with the code and the data here.