Time-series text visualization

Arabica provides n-gram visualization methods to describe the dataset and discover variability over time.

The cappuccino method performs standard cleaning operations and provides plots for descriptive (word cloud) and time-series (heatmap, line plot) visualization.

It automatically removes punctuation from the input data (using cleantext). It can also apply all or any combination of the following cleaning operations:

  • Remove digits from the text

  • Remove standard list(s) of stop words (using NLTK)

  • Remove an additional specific list of words
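The cleaning steps listed above can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not Arabica's implementation: the function name clean_text and its parameters (remove_digits, stop_words, skip) are hypothetical, and the stop words are passed in by hand rather than loaded from NLTK.

```python
import re
import string


def clean_text(text, remove_digits=True, stop_words=None, skip=None):
    """Illustrative cleaning pipeline (hypothetical helper, not Arabica's code)."""
    # Punctuation is always stripped, mirroring the automatic step.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Optionally drop every run of digits.
    if remove_digits:
        text = re.sub(r"\d+", "", text)
    # Merge the standard stop words and the extra words into one lookup set.
    blocked = {w.lower() for w in (stop_words or [])}
    blocked |= {w.lower() for w in (skip or [])}
    # Keep only the tokens that are not blocked (case-insensitive match).
    tokens = [t for t in text.split() if t.lower() not in blocked]
    return " ".join(tokens)


cleaned = clean_text(
    "In 2021, the markets rallied; the index is up 5%!",
    stop_words=["the", "is", "in"],  # stand-in for a standard stop word list
    skip=["index"],                  # additional specific words to remove
)
print(cleaned)  # markets rallied up
```

Each step is optional except punctuation removal, so any combination of the three operations can be switched on or off independently.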

Stop words are generally the most common words in a language that carry little meaning on their own, such as “is”, “am”, “the”, “this”, “are”, etc. They are often filtered out because they add little or no information value. Arabica supports stop word removal for the languages available in the NLTK stop words corpus.

To print all available languages:

# Requires the NLTK stop words corpus; run nltk.download('stopwords') once beforehand
from nltk.corpus import stopwords
print(stopwords.fileids())

Several sets of stop words can be removed at once by passing a list of languages:

stopwords = ['english', 'french', 'etc..']
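Conceptually, removing several sets at once amounts to taking the union of the per-language lists and filtering against it. The sketch below illustrates that with tiny hand-written sample lists; SAMPLE_STOP_WORDS and combine_stop_words are hypothetical names, and in practice the lists would come from NLTK (for example via stopwords.words('english')) rather than being typed in.

```python
# Tiny hand-written samples standing in for the full NLTK lists (assumption).
SAMPLE_STOP_WORDS = {
    "english": ["the", "is", "and"],
    "french": ["le", "la", "et"],
}


def combine_stop_words(languages, lists=SAMPLE_STOP_WORDS):
    """Return the union of the stop word lists for the requested languages."""
    combined = set()
    for lang in languages:
        combined.update(lists[lang])
    return combined


blocked = combine_stop_words(["english", "french"])
tokens = "the cat et le chien".split()
# Drop every token that appears in either language's list.
kept = [t for t in tokens if t not in blocked]
print(kept)  # ['cat', 'chien']
```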

A word cloud is a graphical representation of word importance (typically frequency) that gives greater prominence to words that appear more frequently in a source text.
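To make "greater prominence" concrete, the following sketch maps word frequencies to font sizes by linear scaling. It is an assumption about how such a mapping could work, not a description of the plotting code Arabica uses; font_sizes and its size limits are hypothetical.

```python
from collections import Counter


def font_sizes(text, min_size=10, max_size=40):
    """Map each word to a font size proportional to its frequency (sketch)."""
    counts = Counter(text.lower().split())
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    sizes = {}
    for word, n in counts.items():
        # Linear interpolation between min_size and max_size;
        # if all words are equally frequent, use the largest size.
        scale = (n - lo) / (hi - lo) if hi > lo else 1.0
        sizes[word] = round(min_size + scale * (max_size - min_size))
    return sizes


sizes = font_sizes("data data data model model chart")
print(sizes)  # {'data': 40, 'model': 25, 'chart': 10}
```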

A heatmap allows us to visualize n-grams through time. It divides the data into discrete cells (boxes), one per n-gram and time period, and assigns each cell a color based on the n-gram's value in that period.
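The data behind such a plot is a table of n-gram counts per time period. The sketch below builds that table for bigrams grouped by month using only the standard library; the helper names (ngrams, ngram_counts_by_period) and the sample records are made up for illustration and are not part of Arabica.

```python
from collections import Counter, defaultdict
from datetime import date


def ngrams(tokens, n):
    """Return the list of n-grams (joined with spaces) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts_by_period(records, n=2):
    """Map each 'YYYY-MM' period to a Counter of n-gram frequencies."""
    table = defaultdict(Counter)
    for day, text in records:
        period = day.strftime("%Y-%m")  # monthly aggregation
        table[period].update(ngrams(text.lower().split(), n))
    return dict(table)


# Made-up sample data: (date, text) pairs.
records = [
    (date(2023, 1, 5), "interest rates rise"),
    (date(2023, 1, 20), "interest rates fall"),
    (date(2023, 2, 3), "interest rates rise again"),
]
table = ngram_counts_by_period(records)
for period in sorted(table):
    print(period, table[period].most_common(1))
```

Each (period, n-gram) pair corresponds to one cell of the heatmap, with the count driving the color. Reading a single n-gram's count across the sorted periods gives the kind of series a line plot would draw.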

A line plot displays n-grams over time as a series of data points, called ‘markers’, connected by straight line segments. It is a basic chart type common in many fields.