Word cloud

A word cloud is a visual representation of n-grams that gives greater prominence to n-grams that appear more frequently in a source text. The bigger and bolder an n-gram appears, the more often it occurs in the text.

The graph can display unigrams (single words), bigrams, or trigrams for the source text.
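To make the n-gram terminology concrete, here is a minimal plain-Python sketch of how unigrams, bigrams, and trigrams can be extracted from a tokenized headline. The helper name `ngrams` is hypothetical and is not part of Arabica; it only illustrates the idea.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Sample text taken from one of the headlines in the dataset
tokens = "act fire witnesses must be aware".split()

unigrams = ngrams(tokens, 1)  # single words
bigrams = ngrams(tokens, 2)   # pairs of adjacent words
trigrams = ngrams(tokens, 3)  # triples of adjacent words

print(bigrams)
# ['act fire', 'fire witnesses', 'witnesses must', 'must be', 'be aware']
```

A headline with k tokens yields k - n + 1 n-grams, so longer n-grams are rarer and their frequencies are typically lower.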

Coding example:

Use case: Essential topics in newspaper headlines

Data: Million News Headlines dataset, source: Australian Broadcasting Corporation, data licence: CC0 1.0: Public Domain.

Coding:

import pandas as pd
from arabica import cappuccino

data = pd.read_csv('abcnews_data.csv', encoding='utf8')

The data looks like this:

headline	date
aba decides against community broadcasting licence	2003-2-19
act fire witnesses must be aware of defamation	2003-2-19

cappuccino processes the text in these steps:

removal of additional strings, if skip is not None
lowercasing: headlines are made lowercase so that capital letters don’t affect n-gram calculations (e.g., “Tree” is not treated differently from “tree”), if lower_case = True
punctuation cleaning, performed automatically
stop words removal, if stopwords is not None
extended stop words removal, if stopwords_ext is not None
digits removal, if numbers = True
n-gram frequencies for each headline are calculated, summed, and displayed in a word cloud.
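The steps above can be sketched in plain Python as follows. This is an illustration only, not Arabica's actual implementation: the function names `clean` and `ngram_counts` and the tiny `STOPWORDS` set are hypothetical, and the sketch assumes that stop words are dropped before n-grams are formed.

```python
import re
import string
from collections import Counter

# Hypothetical, very small stop word list used only for this illustration
STOPWORDS = {"a", "against", "be", "in", "must", "of", "the", "to"}


def clean(text, skip=None, lower_case=True, stopwords=None, numbers=True):
    """Apply the cleaning steps in the listed order and return tokens."""
    for s in skip or []:                     # additional strings removal
        text = text.replace(s, " ")
    if lower_case:                           # lowercasing
        text = text.lower()
    # punctuation cleaning (always applied)
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if stopwords:                            # stop words removal
        tokens = [t for t in tokens if t not in stopwords]
    if numbers:                              # digits removal
        tokens = [re.sub(r"\d+", "", t) for t in tokens]
    return [t for t in tokens if t]          # drop tokens emptied by cleaning


def ngram_counts(texts, n=2, **kwargs):
    """Sum n-gram frequencies over all texts."""
    counts = Counter()
    for text in texts:
        tokens = clean(text, **kwargs)
        counts.update(" ".join(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts


headlines = [
    "aba decides against community broadcasting licence",
    "act fire witnesses must be aware of defamation",
]
counts = ngram_counts(headlines, n=2, stopwords=STOPWORDS)
print(counts.most_common(3))
```

The resulting counts are what a word cloud renderer would map to font sizes: the higher the count, the larger the n-gram is drawn.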

cappuccino(text = data['headline'],
           time = data['date'],
           date_format = 'us',                 # Uses US-style date format to parse dates
           plot = 'wordcloud',
           ngram = 2,                          # N-gram size, 1 = unigram, 2 = bigram, 3 = trigram
           time_freq = 'ungroup',              # No period aggregation
           max_words = 150,                    # Displays 150 most frequent bigrams
           stopwords = ['english'],            # Remove English stopwords
           stopwords_ext = ['english'],        # Remove extended list of English stopwords
           skip = ['<br />'],                  # Remove additional strings. Cuts the characters out without tokenization, useful for specific or rare characters. Be careful not to bias the dataset.
           numbers = True,                     # Remove numbers
           lower_case = True)                  # Lowercase text

Here is the output:

Download the Jupyter notebook with the code and the data here.
