Heatmap

Heatmap displays n-grams over time. It plots n-gram frequencies by time period and colors each cell according to the n-gram's frequency in that period.

Heatmap is a suitable visualization for datasets with a large T (many time periods). The graph displays unigrams (single words) or bigrams (two-word sequences), aggregated by month or year.
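To make the idea concrete, here is a minimal sketch of the data structure behind such a plot: a grid with one row per n-gram and one column per period, where each frequency is scaled to a color intensity between 0 and 1. The counts and the `to_intensity_grid` helper are hypothetical illustrations and are not part of Arabica.

```python
# Hypothetical monthly unigram counts: {period: {ngram: frequency}}
counts = {
    "2003-02": {"police": 12, "fire": 7},
    "2003-03": {"police": 9, "war": 15},
}

def to_intensity_grid(counts):
    """Return (ngrams, periods, grid) with grid[i][j] scaled to [0, 1]."""
    periods = sorted(counts)
    ngrams = sorted({g for c in counts.values() for g in c})
    # Scale by the largest frequency so the darkest cell equals 1.0
    peak = max(f for c in counts.values() for f in c.values())
    grid = [[counts[p].get(g, 0) / peak for p in periods] for g in ngrams]
    return ngrams, periods, grid

ngrams, periods, grid = to_intensity_grid(counts)
for g, row in zip(ngrams, grid):
    print(f"{g:8s}", [round(v, 2) for v in row])
```

A plotting library would then map each intensity to a shade on a color scale. With many periods the grid simply gains columns, which is why the format scales well to a large T.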


Coding example:

Use case: Essential topics in newspaper headlines

Data: Million News Headlines dataset, source: Australian Broadcasting Corporation, data licence: CC0 1.0: Public Domain.

Coding:

import pandas as pd
from arabica import cappuccino

data = pd.read_csv('abcnews_data.csv', encoding='utf8')

The data looks like this:

headline                                              date
aba decides against community broadcasting licence    2003-2-19
act fire witnesses must be aware of defamation        2003-2-19

Processing proceeds as follows:

  • additional stop words cleaning, if skip is not None

  • lowercasing, if lower_case = True: the text is converted to lowercase so that capital letters don’t affect n-gram calculations (e.g., “Tree” is not treated differently from “tree”)

  • punctuation cleaning: performed automatically

  • stop words removal, if stopwords is not None

  • digits removal, if numbers = True

  • n-gram frequencies for each headline are calculated, aggregated by monthly frequency, and displayed in a heatmap.
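The steps above can be sketched with the Python standard library alone. This is an illustration of the general technique, not Arabica's actual implementation: the `monthly_ngrams` function, its exact step order, and the tiny `STOPWORDS` set are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Tiny hypothetical stop word list; a real one would be far longer.
STOPWORDS = {"a", "of", "the", "be", "must", "against"}

def monthly_ngrams(texts, dates, n=1, skip=(), lower_case=True, numbers=True):
    """Count n-grams per month after basic cleaning (illustrative only)."""
    per_month = defaultdict(Counter)
    for text, date in zip(texts, dates):
        for phrase in skip:                        # additional stop words
            text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
        if lower_case:                             # lowercasing
            text = text.lower()
        text = re.sub(r"[^\w\s]", " ", text)       # punctuation cleaning
        tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
        if numbers:                                # digits removal
            tokens = [re.sub(r"\d+", "", t) for t in tokens]
            tokens = [t for t in tokens if t]
        # Aggregate by month, e.g. '2003-2-19' -> '2003-02'
        month = datetime.strptime(date, "%Y-%m-%d").strftime("%Y-%m")
        grams = zip(*(tokens[i:] for i in range(n)))
        per_month[month].update(" ".join(g) for g in grams)
    return per_month

headlines = [
    "aba decides against community broadcasting licence",
    "act fire witnesses must be aware of defamation",
]
dates = ["2003-2-19", "2003-2-19"]

result = monthly_ngrams(headlines, dates)
print(result["2003-02"].most_common(3))
```

Passing n=2 to the same function counts bigrams instead. In Arabica, a single cappuccino call handles these steps internally: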

cappuccino(text = data['headline'],
           time = data['date'],
           date_format = 'us',               # Uses US-style date format to parse dates
           plot = 'heatmap',
           ngram = 1,                         # N-gram size, 1 = unigram, 2 = bigram
           time_freq = 'M',                   # Aggregation period, 'M' = monthly, 'Y' = yearly
           max_words = 10,                    # Displays 10 most frequent unigrams (words) for each period
           stopwords = ['english'],           # Remove English stopwords
           skip = ['covid','Donald Trump'],   # Remove additional stop words
           numbers = True,                    # Remove numbers
           lower_case = True)                 # Lowercase text

Here is the output:

[Figure: heatmap of the most frequent unigrams per month]

Download the Jupyter notebook with the code and the data here.