Welcome to Arabica’s documentation!

Arabica is a Python library for exploratory data analysis of time-series text data. It reflects the reality that many text datasets are now collected as repeated observations over time (social media conversations, research metadata, product reviews, newspaper headlines, central bankers’ communication, etc.).

Arabica offers:

  • Descriptive n-gram analysis: n-gram frequencies

  • Time-series n-gram analysis: n-gram frequencies over a period

  • Text visualization: n-gram heatmap, line plot, word cloud

  • Sentiment analysis: VADER sentiment classifier

  • Financial sentiment analysis: with FinVADER

  • Structural break identification: Jenks Optimisation Method

N-grams are contiguous sequences of n words in a document. More generally, they are sequences of neighboring items in a text. Some examples include:

  • unigram: “dog”, bigram: “dog, goes”, trigram: “dog, goes, home”

  • unigram: “flower”, bigram: “flower, grows”, trigram: “flower, grows, here”
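
The examples above can be reproduced with a few lines of plain Python. This is a minimal sketch that illustrates the idea of n-grams and of counting them per period; it does not use Arabica's API. The helper names `ngrams` and `ngram_counts_by_period` and the toy headlines are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Return the contiguous n-word sequences in `text`, comma-joined."""
    # Naive whitespace tokenization; real text needs punctuation handling.
    tokens = text.lower().split()
    return [",".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts_by_period(records, n):
    """Count n-grams per period for an iterable of (period, text) pairs."""
    counts = defaultdict(Counter)
    for period, text in records:
        counts[period].update(ngrams(text, n))
    return dict(counts)


# Hypothetical toy data: (year, headline) pairs.
headlines = [
    ("2023", "dog goes home"),
    ("2023", "dog goes out"),
    ("2024", "flower grows here"),
]

print(ngrams("dog goes home", 1))  # ['dog', 'goes', 'home']
print(ngrams("dog goes home", 2))  # ['dog,goes', 'goes,home']
print(ngrams("dog goes home", 3))  # ['dog,goes,home']

by_year = ngram_counts_by_period(headlines, 2)
print(by_year["2023"].most_common(1))  # [('dog,goes', 2)]
```

Grouping counts by period is what turns a descriptive n-gram table into a time series of n-gram frequencies.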


I created this project in my free time, and I hope Arabica will save you some time. You can invite me for a coffee if Arabica helps you with your project, thesis, or research paper.