Descriptive n-gram analysis
The arabica_freq method takes text data, enables standard cleaning operations, and, with time_freq = 'ungroup', provides a descriptive analysis of the most frequent words, bigrams, and trigrams.
It automatically cleans the input data of punctuation (using cleantext). It can also apply all, or a selected combination, of the following cleaning operations:
Remove digits from the text
Remove standard list(s) of stop words (using NLTK)
Remove an additional specific list of words
Stop words are generally the most common words in a language with no significant meaning, such as “is”, “am”, “the”, “this”, “are”, etc. They are often filtered out because they carry little or no information value. Arabica enables stop-word removal for the languages in the NLTK corpus.
To print all available languages:
from nltk.corpus import stopwords
print(stopwords.fileids())
It is possible to remove several sets of stopwords at once by passing a list of languages:

stopwords = ['english', 'french','etc..']
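As a minimal sketch of what multi-language stop-word removal does, the snippet below uses tiny hand-written stop-word sets as illustrative stand-ins for NLTK's full corpora; remove_stopwords is a hypothetical helper, not an Arabica function:

```python
# Illustrative stop-word sets (stand-ins for NLTK's full corpora).
stop_sets = {
    'english': {'the', 'is', 'and', 'a'},
    'french': {'le', 'la', 'et', 'un'},
}

# Union of all requested languages, mirroring stopwords = ['english', 'french'].
combined = set().union(*(stop_sets[lang] for lang in ['english', 'french']))

def remove_stopwords(text):
    # Keep only the tokens that are in none of the selected stop-word sets.
    return ' '.join(w for w in text.split() if w not in combined)

print(remove_stopwords('the ginger is sweet et un peu spicy'))
# ginger sweet peu spicy
```

Arabica applies the same idea internally over the full NLTK stop-word lists for each language in the stopwords argument.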
Coding example
Use case: Customer perception of Amazon products
Data: Amazon Product Reviews dataset, source: Amazon.com, data licence: CC0: Public Domain.
Coding:
import pandas as pd
from arabica import arabica_freq

data = pd.read_csv('reviews_subset.csv', encoding='utf8')
By randomly picking a product from the reviews, a subset of 25 reviews looks like this:
| time | review |
|---|---|
| 08/19/2010 | You may find yourself trying to decide between comparable crystallized ginger offerings from Reeds and The Ginger People. Which one should you choose? I have now tried both, and here is how they compare.<br /><br />Reed’s has a lovely raw cane sugar flavor, and is sweeter and more mellow than The Ginger People’s.<br /><br />If you want something a little less sweet (still sweet though–it is crystallized ginger, after all) and a little spicier, go for The Ginger People. |
| 06/05/2009 | On the Reeds website, this same product is available for $16.00.<br /><br />“Reed’s Crystallized Ginger Candy 12 - 3.5 oz Bags” |
It proceeds in this way:
additional stop words cleaning, if skip is not None
lowercasing: reviews are made lowercase so that capital letters don’t affect n-gram calculations (e.g., “Tree” is not treated differently from “tree”), if lower_case = True
punctuation cleaning, performed automatically
stop words removal, if stopwords is not None
digits removal, if numbers = True
n-gram frequencies for each review are then calculated and summed for the whole dataset.
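The steps above can be sketched in plain Python. The regexes, the stop-word subset, and the function names below are illustrative assumptions, not Arabica internals:

```python
import re
from collections import Counter

STOPWORDS = {'the', 'is', 'a', 'and', 'of', 'it'}  # illustrative subset

def clean(text):
    text = text.lower()                    # lowercasing (lower_case = True)
    text = re.sub(r'[^\w\s]', ' ', text)   # punctuation cleaning (automatic)
    text = re.sub(r'\d+', ' ', text)       # digits removal (numbers = True)
    # stop words removal (stopwords = ['english'])
    return [w for w in text.split() if w not in STOPWORDS]

def ngram_counts(texts, n):
    """Count n-grams per text and sum the counts over the whole dataset."""
    counts = Counter()
    for t in texts:
        tokens = clean(t)
        counts.update(','.join(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts

reviews = ["The crystallized ginger is sweet!",
           "Love crystallized ginger candy."]
print(ngram_counts(reviews, 2).most_common(2))
```

With n = 1, 2, and 3 over the whole review column, this yields the unigram, bigram, and trigram frequency columns shown in the output below.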
arabica_freq(text = data['review'],
             time = data['time'],
             date_format = 'us',      # Use US-style date format to parse dates
             time_freq = 'ungroup',   # Calculate n-gram frequencies without period aggregation
             max_words = 10,          # Display the 10 most frequent unigrams, bigrams, and trigrams
             stopwords = ['english'], # Remove the English set of stopwords
             skip = ['<br />'],       # Remove additional stop words
             numbers = True,          # Remove digits
             lower_case = True)       # Lowercase text
The output is a dataframe with n-gram frequencies:
| unigram | unigram_freq | bigram | bigram_freq | trigram | trigram_freq |
|---|---|---|---|---|---|
| ginger | 75 | crystallized,ginger | 9 | health,food,store | 3 |
| one | 14 | ginger,candy | 8 | ginger,unique,taste | 2 |
| would | 13 | crystalized,ginger | 5 | ginger,candy,would | 2 |
| reeds | 13 | reeds,ginger | 5 | ginger,peoples,organic | 2 |
| candy | 11 | ginger,flavor | 4 | ginger,ale,love | 2 |
| crystallized | 11 | ginger,ale | 4 | know,ginger,candy | 2 |
| love | 11 | baby,ginger | 4 | charged,credit,card | 2 |
| taste | 10 | much,sugar | 4 | taste,could,make | 1 |
| flavor | 10 | health,food | 3 | half,sugar,much | 1 |
| much | 10 | strong,ginger | 3 | less,half,sugar | 1 |
The high frequency of “love” and “ginger,unique,taste”, together with the absence of n-grams with negative meanings, suggests that customers perceived the product positively. The reasons might be the lower sugar content and overall health effects (“health,food”, “much,sugar”, and “less,half,sugar”). A more detailed inspection should confirm this.
Download the Jupyter notebook with the code and the data here.