Descriptive n-gram analysis

The arabica_freq method takes text data, enables standard cleaning operations, and, with time_freq = 'ungroup', provides a descriptive analysis of the most frequent words, bigrams, and trigrams.

It automatically removes punctuation (using cleantext) on input. It can also apply all or a selected combination of the following cleaning operations:

  • Remove digits from the text

  • Remove standard list(s) of stop words (using NLTK)

  • Remove an additional specific list of words

Stop words are generally the most common words in a language with no significant meaning, such as “is”, “am”, “the”, “this”, “are”, etc. They are often filtered out because they bring low or zero information value. Arabica enables stopword removal for languages in the NLTK corpus.

To print all available languages:

from nltk.corpus import stopwords
print(stopwords.fileids())

It is possible to remove several sets of stopwords at once:

stopwords = ['english', 'french', 'etc..']
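Conceptually, removing several stopword sets just means filtering tokens against the union of the lists. A minimal sketch with two tiny hand-written lists (in practice the full lists come from NLTK's stopwords corpus):

```python
# Tiny illustrative stopword lists; the real ones come from
# nltk.corpus.stopwords.words('english') and .words('french').
english_stopwords = {"the", "is", "a", "this"}
french_stopwords = {"le", "la", "est", "ce"}

combined = english_stopwords | french_stopwords  # union of both sets

tokens = ["this", "ginger", "est", "the", "best"]
filtered = [t for t in tokens if t not in combined]
print(filtered)  # → ['ginger', 'best']
```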

Coding example

Use case: Customer perception of Amazon products

Data: Amazon Product Reviews dataset, source: Amazon.com, data licence: CC0: Public Domain.

Coding:

import pandas as pd
from arabica import arabica_freq

data = pd.read_csv('reviews_subset.csv', encoding='utf8')

A subset of 25 reviews for one randomly picked product looks like this:

| time | review |
| --- | --- |
| 08/19/2010 | You may find yourself trying to decide between comparable crystallized ginger offerings from Reeds and The Ginger People. Which one should you choose? I have now tried both, and here is how they compare.<br /><br />Reed's has a lovely raw cane sugar flavor, and is sweeter and more mellow than The Ginger People's.<br /><br />If you want something a little less sweet (still sweet though–it is crystallized ginger, after all) and a little spicier, go for The Ginger People. |
| 06/05/2009 | On the Reeds website, this same product is available for $16.00.<br /><br />"Reed's Crystallized Ginger Candy 12 - 3.5 oz Bags" |

It proceeds in this way:

  • additional stop word cleaning, if skip is not None

  • lowercasing: reviews are lowercased so that capital letters don’t affect n-gram calculations (e.g., “Tree” is not treated differently from “tree”), if lower_case = True

  • punctuation cleaning, performed automatically

  • stop word removal, if stopwords is not None

  • digit removal, if numbers = True

  • n-gram frequencies are calculated for each review and summed over the whole dataset.
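The steps above can be sketched in plain Python with the standard library. The function below is illustrative only: its parameters mirror the arabica_freq arguments, but it is not Arabica's actual implementation.

```python
import re
import string
from collections import Counter

def ngrams(tokens, n):
    """Return a list of comma-joined n-grams from a token list."""
    return [",".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_frequencies(texts, stopwords=None, skip=None, numbers=True,
                      lower_case=True, max_words=10):
    """Count unigram, bigram, and trigram frequencies over a list of texts."""
    stopwords = set(stopwords or [])
    counters = {1: Counter(), 2: Counter(), 3: Counter()}
    for text in texts:
        for token in (skip or []):            # additional stop word cleaning
            text = text.replace(token, " ")
        if lower_case:
            text = text.lower()               # "Tree" and "tree" count together
        text = text.translate(str.maketrans("", "", string.punctuation))
        if numbers:
            text = re.sub(r"\d+", " ", text)  # digit removal
        tokens = [t for t in text.split() if t not in stopwords]
        for n in (1, 2, 3):
            counters[n].update(ngrams(tokens, n))
    # keep only the max_words most frequent n-grams of each length
    return {n: c.most_common(max_words) for n, c in counters.items()}
```

Counter.most_common handles the max_words truncation; everything else is a straightforward pass over each review.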

arabica_freq(text = data['review'],
             time = data['time'],
             date_format = 'us',          # Use US-style date format to parse dates
             time_freq = 'ungroup',       # Calculate n-gram frequencies without period aggregation
             max_words = 10,              # Display 10 most frequent unigrams, bigrams, and trigrams
             stopwords = ['english'],     # Remove English stopword set
             skip = ['<br />'],           # Remove additional stop words
             numbers = True,              # Remove numbers
             lower_case = True)           # Lowercase text

The output is a dataframe with n-gram frequencies:

| unigram | unigram_freq | bigram | bigram_freq | trigram | trigram_freq |
| --- | --- | --- | --- | --- | --- |
| ginger | 75 | crystallized,ginger | 9 | health,food,store | 3 |
| one | 14 | ginger,candy | 8 | ginger,unique,taste | 2 |
| would | 13 | crystalized,ginger | 5 | ginger,candy,would | 2 |
| reeds | 13 | reeds,ginger | 5 | ginger,peoples,organic | 2 |
| candy | 11 | ginger,flavor | 4 | ginger,ale,love | 2 |
| crystallized | 11 | ginger,ale | 4 | know,ginger,candy | 2 |
| love | 11 | baby,ginger | 4 | charged,credit,card | 2 |
| taste | 10 | much,sugar | 4 | taste,could,make | 1 |
| flavor | 10 | health,food | 3 | half,sugar,much | 1 |
| much | 10 | strong,ginger | 3 | less,half,sugar | 1 |

The frequency of “love” and “ginger,unique,taste”, together with the absence of n-grams with negative meanings, suggests that customers perceived the product positively. The reasons might be lower sugar content and overall health benefits, as hinted by “health,food”, “much,sugar”, and “less,half,sugar”. A more detailed inspection would be needed to confirm this.

Download the Jupyter notebook with the code and the data here.