Sentiment analysis

coffee_break takes text data as input, optionally cleans it of digits and punctuation, and provides time-series sentiment analysis with a pre-trained sentiment classifier. It calculates the sentiment of each row of the dataset, aggregates it over a specified period, and returns a plot and a dataframe with the corresponding time series.

The implemented models are:

VADER is a lexicon- and rule-based sentiment classifier specifically attuned to sentiment expressed in social media. It works best with general-language texts.
FinVADER improves VADER’s classification accuracy by including two financial lexicons. It should be applied to texts in the financial and economic domain.

Coding example

Use case: Sentiment analysis of tweets about the Pfizer & BioNTech vaccine

Data: Pfizer Vaccine Tweets dataset, period: 15/07/2006 to 18/11/2021, source: Twitter API, data licence: CC0: Public Domain.

Coding:

import pandas as pd
from arabica import coffee_break

data = pd.read_csv('vaccination_tweets.csv', encoding='utf8')

The data looks like this:

text	date
Same folks said daikon paste could treat a cytokine storm #PfizerBioNTech https://t.co/xeHhIMg1kF	20/12/2020 06:06
While the world has been on the wrong side of history this year, hopefully, the biggest vaccination effort we’ve ev… https://t.co/dlCHrZjkhm	13/12/2020 16:27
#coronavirus #SputnikV #AstraZeneca #PfizerBioNTech #Moderna #Covid_19 Russian vaccine is created to last 2-4 years… https://t.co/ieYlCKBr8P	12/12/2020 20:33

coffee_break(text = data['text'],
             time = data['date'],
             date_format = 'eur',  # Read dates in European format
             model = 'vader',      # Use VADER classifier
             time_freq = 'Y',      # Yearly aggregation
             preprocess = True,    # Clean data - punctuation + numbers
             skip = ["brrrr",
                     "donald trump"],  # Remove additional stop words
             n_breaks = None)      # No structural break analysis

It proceeds in this way:

pre-processing: tweets are cleaned of numbers, punctuation, blank rows, and a list of additional stop words (“brrrr”, “donald trump”)
sentiment classification: the sentiment of each row is classified with the VADER sentiment classifier. Arabica uses VADER’s compound indicator for sentiment classification (the same applies to FinVADER). The compound score, and therefore the aggregate sentiment, ranges between -1 (most extreme negative) and 1 (most extreme positive).
period aggregation: sentiment is aggregated at the specified frequency (year or month) as follows: aggregate sentiment = \(\frac{sum(sentiment)_{t}}{count(rows)_{t}}\), where t is the aggregation period.
visualization: the aggregated sentiment time series is displayed in a line plot
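
The cleaning and aggregation steps can be illustrated with a minimal standard-library sketch. It is not Arabica's implementation: the helper names clean_text and aggregate_by_year are hypothetical, the stop-word removal is a naive substring match, and the sentiment scores and the third date are made-up placeholders rather than real VADER output. It only shows how dates in the European day-first format map to years and how the per-period mean from the formula above is computed.

```python
import re
import string
from collections import defaultdict
from datetime import datetime


def clean_text(text, skip=()):
    """Remove extra stop words, digits and punctuation (illustrative only)."""
    for phrase in skip:
        # Naive, case-insensitive removal of each additional stop word
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\d+", " ", text)  # drop digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())  # collapse repeated whitespace


def aggregate_by_year(dates, scores, fmt="%d/%m/%Y %H:%M"):
    """Mean score per year: sum(sentiment)_t / count(rows)_t."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for date, score in zip(dates, scores):
        year = datetime.strptime(date, fmt).year  # day-first parsing
        totals[year] += score
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}


# Hypothetical inputs: placeholder scores, not produced by VADER
dates = ["20/12/2020 06:06", "13/12/2020 16:27", "05/01/2021 09:15"]
scores = [0.5, -0.25, 0.25]

cleaned = clean_text("Brrrr, 2 doses of #PfizerBioNTech!", skip=["brrrr"])
yearly = aggregate_by_year(dates, scores)

print(cleaned)
print(yearly)
```

Because each period's value is a mean of scores that lie between -1 and 1, the aggregated series stays within the same range.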

Here is the output:

At the same time, Arabica returns a dataframe with the corresponding data. The table can be saved simply by:

# generate a dataframe
df = coffee_break(text = data['text'],
                  time = data['date'],
                  date_format = 'eur',
                  model = 'vader',
                  time_freq = 'Y',
                  preprocess = True,
                  skip = ["brrrr",
                          "donald trump"],
                  n_breaks = None)

# save it as a csv
df.to_csv('sentiment_data.csv')

The plot shows that sentiment dropped markedly after the Pfizer vaccines started to be used against Covid in 2021. A likely reason is the global pandemic and the generally negative mood in these years.

Download the Jupyter notebook with the code and the data here.
