Descriptive n-gram analysis
The arabica_freq method takes text data, enables standard cleaning operations, and, with time_freq = 'ungroup', provides a descriptive analysis of the most frequent words, bigrams, and trigrams.
It automatically cleans the input data of punctuation (using cleantext). It can also apply all, or a selected combination, of the following cleaning operations:
Remove digits from the text
Remove standard list(s) of stop words (using NLTK)
Remove an additional specific list of words
Stop words are generally the most common words in a language with no significant meaning, such as “is”, “am”, “the”, “this”, “are”, etc. They are often filtered out because they carry little or no information value. Arabica enables stop-word removal for the languages in the NLTK corpus.
To print all available languages:
from nltk.corpus import stopwords
print(stopwords.fileids())
It is possible to remove several sets of stopwords at once by passing a list of languages:

stopwords = ['english', 'french','etc..']
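As a minimal sketch of what multi-language stop-word removal does, the snippet below uses tiny hand-written stop-word sets as illustrative stand-ins for NLTK's full corpora; remove_stopwords is a hypothetical helper, not an Arabica function:

```python
# Illustrative stop-word sets (stand-ins for NLTK's full corpora).
stop_sets = {
    'english': {'the', 'is', 'and', 'a'},
    'french': {'le', 'la', 'et', 'un'},
}

# Union of all requested languages, mirroring stopwords = ['english', 'french'].
combined = set().union(*(stop_sets[lang] for lang in ['english', 'french']))

def remove_stopwords(text):
    # Keep only the tokens that are in none of the selected stop-word sets.
    return ' '.join(w for w in text.split() if w not in combined)

print(remove_stopwords('the ginger is sweet et un peu spicy'))
# ginger sweet peu spicy
```

Arabica applies the same idea internally over the full NLTK stop-word lists for each language in the stopwords argument.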
Coding example
Use case: Customer perception of Amazon products
Data: Amazon Product Reviews dataset, source: Amazon.com, data licence: CC0: Public Domain.
Coding:
import pandas as pd
from arabica import arabica_freq

data = pd.read_csv('reviews_subset.csv', encoding='utf8')
By randomly picking a product from the reviews, a subset of 25 reviews looks like this:
| time | review |
|---|---|
| 08/19/2010 | You may find yourself trying to decide between comparable crystallized ginger offerings from Reeds and The Ginger People. Which one should you choose? I have now tried both, and here is how they compare.<br /><br />Reed’s has a lovely raw cane sugar flavor, and is sweeter and more mellow than The Ginger People’s.<br /><br />If you want something a little less sweet (still sweet though–it is crystallized ginger, after all) and a little spicier, go for The Ginger People. |
| 06/05/2009 | On the Reeds website, this same product is available for $16.00.<br /><br />“Reed’s Crystallized Ginger Candy 12 - 3.5 oz Bags” |
It proceeds in this way:
additional stop words cleaning, if skip is not None
lowercasing: reviews are made lowercase so that capital letters don’t affect n-gram calculations (e.g., “Tree” is not treated differently from “tree”), if lower_case = True
punctuation cleaning, performed automatically
stop words removal, if stopwords is not None
digits removal, if numbers = True
n-gram frequencies for each review are then calculated and summed for the whole dataset.
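The steps above can be sketched in plain Python. The regexes, the stop-word subset, and the function names below are illustrative assumptions, not Arabica internals:

```python
import re
from collections import Counter

STOPWORDS = {'the', 'is', 'a', 'and', 'of', 'it'}  # illustrative subset

def clean(text):
    text = text.lower()                    # lowercasing (lower_case = True)
    text = re.sub(r'[^\w\s]', ' ', text)   # punctuation cleaning (automatic)
    text = re.sub(r'\d+', ' ', text)       # digits removal (numbers = True)
    # stop words removal (stopwords = ['english'])
    return [w for w in text.split() if w not in STOPWORDS]

def ngram_counts(texts, n):
    """Count n-grams per text and sum the counts over the whole dataset."""
    counts = Counter()
    for t in texts:
        tokens = clean(t)
        counts.update(','.join(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts

reviews = ["The crystallized ginger is sweet!",
           "Love crystallized ginger candy."]
print(ngram_counts(reviews, 2).most_common(2))
```

With n = 1, 2, and 3 over the whole review column, this yields the unigram, bigram, and trigram frequency columns shown in the output below.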
arabica_freq(text = data['review'],
             time = data['time'],
             date_format = 'us',      # Use US-style date format to parse dates
             time_freq = 'ungroup',   # Calculate n-gram frequencies without period aggregation
             max_words = 10,          # Display the 10 most frequent unigrams, bigrams, and trigrams
             stopwords = ['english'], # Remove the English set of stopwords
             skip = ['<br />'],       # Remove additional stop words
             numbers = True,          # Remove digits
             lower_case = True)       # Lowercase text
The output is a dataframe with n-gram frequencies:
| unigram | unigram_freq | bigram | bigram_freq | trigram | trigram_freq |
|---|---|---|---|---|---|
| ginger | 75 | crystallized,ginger | 9 | health,food,store | 3 |
| one | 14 | ginger,candy | 8 | ginger,unique,taste | 2 |
| would | 13 | crystalized,ginger | 5 | ginger,candy,would | 2 |
| reeds | 13 | reeds,ginger | 5 | ginger,peoples,organic | 2 |
| candy | 11 | ginger,flavor | 4 | ginger,ale,love | 2 |
| crystallized | 11 | ginger,ale | 4 | know,ginger,candy | 2 |
| love | 11 | baby,ginger | 4 | charged,credit,card | 2 |
| taste | 10 | much,sugar | 4 | taste,could,make | 1 |
| flavor | 10 | health,food | 3 | half,sugar,much | 1 |
| much | 10 | strong,ginger | 3 | less,half,sugar | 1 |
The high frequency of “love” and “ginger,unique,taste”, together with the absence of n-grams with negative meanings, suggests that customers perceived the product positively. The reasons might be the lower sugar content and overall health effects (“health,food”, “much,sugar”, and “less,half,sugar”). A more detailed inspection should confirm this.
Download the Jupyter notebook with the code and the data here.