Tweet topics and sentiments relating to distance learning among Italian Twitter users

Data

Twitter was chosen as the data source. It is one of the leading social media platforms in the world, with 199 million active users as of April 20214, and is also a common text source for sentiment analysis23,24,25.

To collect distance learning related tweets, we used TrackMyHashtag https://www.trackmyhashtag.com/, a tracking tool to monitor hashtags in real time. Unlike the Twitter API, which does not provide tweets older than three weeks, TrackMyHashtag also provides historical data and filters selections by language and location.

For our study, we chose the Italian words for “distance learning” as the search term and chose March 3, 2020 to November 23, 2021 as the period of interest. Finally, we selected only Italian tweets. A total of 25,100 tweets were collected for this study.

Data preprocessing

To clean the data and prepare it for sentiment analysis, we applied the following processing steps using NLP techniques implemented in Python:

  1. removed mentions, URLs and hashtags;

  2. replaced HTML character entities with their Unicode equivalents (such as replacing ‘&amp;’ with ‘&’);

  3. stripped HTML tags (such as ‘<div>’, ‘<p>’, etc.);

  4. removed unnecessary line breaks;

  5. removed special characters and punctuation marks;

  6. removed words consisting only of numbers;

  7. translated the text of the Italian tweets into English using the ‘googletrans’ tool.
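The cleaning steps above can be sketched as a single Python function. The regular expressions and their ordering are illustrative rather than the exact rules used in the study, and the translation step (7) is omitted since it requires an external service:

```python
import html
import re

def clean_tweet(text: str) -> str:
    """Illustrative sketch of preprocessing steps 1-6."""
    text = re.sub(r"@\w+", " ", text)                   # 1. mentions
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # 1. URLs
    text = re.sub(r"#\w+", " ", text)                   # 1. hashtags
    text = html.unescape(text)                          # 2. HTML entities -> Unicode
    text = re.sub(r"<[^>]+>", " ", text)                # 3. HTML tags
    text = text.replace("\n", " ")                      # 4. line breaks
    text = re.sub(r"[^\w\s]", " ", text)                # 5. special chars, punctuation
    text = re.sub(r"\b\d+\b", " ", text)                # 6. numeric tokens
    return re.sub(r"\s+", " ", text).strip()            # collapse extra whitespace
```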

For the second part, the topic model requires a higher-quality database. Duplicate tweets were removed and only unique tweets were retained. In addition to the general data-cleaning methods, tokenization and lemmatization enable the model to achieve better performance, since different inflected forms of the same word can cause misclassification. Consequently, NLTK’s WordNet library26 was used to perform the lemmatization. Stemming algorithms, which aggressively reduce words to a common base even when those words actually have different meanings, were not considered here. Finally, we lowercased all text to ensure every word appeared in a consistent format and pruned the vocabulary, removing stop words and off-topic terms such as ‘how’, ‘from’ and ‘will’.
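As a minimal illustration of why lemmatization was preferred over aggressive stemming, the toy dictionary below stands in for NLTK’s WordNet lemmatizer; the lemma entries and stop-word list are placeholders for demonstration only:

```python
# Stand-in lemma dictionary: maps inflected forms to their lemma.
# The real pipeline uses NLTK's WordNetLemmatizer instead.
LEMMAS = {"universities": "university", "studies": "study", "learned": "learn"}
STOP_WORDS = {"how", "from", "will"}

def normalize(tokens):
    out = []
    for tok in tokens:
        tok = tok.lower()           # consistent casing
        tok = LEMMAS.get(tok, tok)  # map inflected form to lemma
        if tok not in STOP_WORDS:   # prune stop words
            out.append(tok)
    return out
```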

Analysis of feelings and emotions

Among the main algorithms used for text mining, and especially for sentiment analysis, we applied the Valence Aware Dictionary for Sentiment Reasoning (VADER) proposed by Hutto et al.27 to determine the polarity and intensity of tweets. VADER is a lexicon- and rule-based sentiment analysis tool derived from a wisdom-of-the-crowd approach. Built on extensive human annotation, it enables social media sentiment analysis to be completed quickly and with near-human accuracy. We used VADER to obtain sentiment scores for the preprocessed text of each tweet. Following the classification method recommended by its authors, we assigned each tweet to one of three categories: positive, negative or neutral (Fig. 1, step 1).
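In the actual pipeline the compound score comes from VADER itself; the sketch below encodes only the authors’ recommended classification rule (thresholds of ±0.05 on the compound score), with the score passed in as a plain number:

```python
def vader_label(compound: float) -> str:
    """Map a VADER compound score to a polarity class using the
    thresholds recommended by VADER's authors (+/- 0.05)."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
```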

Figure 1

The steps of the analysis of feelings and emotions.

Then, to detect emotions by category, we applied the NRC algorithm28, one of the methods included in the R package syuzhet29 for emotion analysis. Specifically, the NRC algorithm applies an emotion dictionary to score each tweet on two sentiments (positive and negative) and eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy and disgust). Emotion recognition aims to identify the emotions that a tweet carries. If a tweet is associated with a particular emotion or sentiment, it receives a score reflecting the degree of valence associated with that category; otherwise, it receives no points for that category. For example, if a tweet contains two words listed in the word list for the emotion ‘joy’, the score for that sentence in the joy category will be 2.
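The per-category scoring described above can be sketched with a toy lexicon; the real analysis uses the full NRC emotion lexicon, so the entries below are placeholders only:

```python
# Toy stand-in for the NRC emotion lexicon: each word maps to the
# emotion/sentiment categories it is listed under.
TOY_NRC = {
    "happy":  {"joy", "positive"},
    "smile":  {"joy", "positive"},
    "afraid": {"fear", "negative"},
}

def emotion_scores(tokens):
    """Each matching word adds one point to every category it belongs to."""
    scores = {}
    for tok in tokens:
        for category in TOY_NRC.get(tok, ()):
            scores[category] = scores.get(category, 0) + 1
    return scores
```

With two ‘joy’ words present, the joy score is 2, matching the example in the text.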

When the NRC lexicon is used, each tweet receives a score for each emotion category, rather than a single algebraic score driven by positive and negative words. However, this algorithm fails to properly account for negations. Moreover, it adopts a bag-of-words approach, in which the sentiment is based on the individual words occurring in the text, neglecting the role of syntax and grammar. Therefore, the VADER and NRC methods are not comparable in terms of number of tweets and polarity categories. The idea is thus to use VADER for sentiment analysis and then apply NRC only to detect the emotions of positive and negative tweets. The flow chart in Fig. 1 presents the two-step sentiment analysis. VADER’s neutral tweets are very useful for classification but not interesting for emotion analysis; therefore, we focused on tweets with positive and negative sentiment. VADER’s performance on social media text is excellent. Thanks to its complete rule set, VADER can perform sentiment analysis on various lexical features: punctuation, capitalization, degree modifiers, the contrastive conjunction “but”, and negation-handling trigrams.

Topic model

The topic model is an unsupervised machine learning method: a text mining procedure by which the topics or themes of documents can be identified from a large corpus30. The Latent Dirichlet Allocation (LDA) model is one of the most popular topic modeling methods; it is a probabilistic model that represents a corpus through a three-level hierarchical Bayesian model. The basic idea of LDA is that every document is a mixture of topics, and a topic can be defined as a distribution over words31. Specifically, in LDA models the generation of documents within a corpus follows this process:

  1. A mixture of k topics, \(\theta\), is drawn from a Dirichlet prior parameterized by \(\alpha\);

  2. A topic \(z_n\) is drawn from the multinomial distribution \(p(\theta \mid \alpha)\), which is the topic distribution of the document and models \(p(z_n=i\mid \theta )\);

  3. For a fixed number of topics \(k=1,\ldots ,K\), the word distribution of topic k is denoted by \(\phi\), which is also a multinomial distribution whose hyper-parameter \(\beta\) follows the Dirichlet distribution;

  4. Given the topic \(z_n\), a word \(w_n\) is then sampled via the multinomial distribution \(p(w \mid z_n;\beta )\).
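The four generative steps can be simulated directly. The hyperparameter values and corpus dimensions below are arbitrary illustrative choices, not those of the study:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 10, 20      # topics, vocabulary size, words per document
alpha, beta = 0.5, 0.1   # Dirichlet hyperparameters (illustrative values)

phi = rng.dirichlet(np.full(V, beta), size=K)         # step 3: per-topic word distributions
theta = rng.dirichlet(np.full(K, alpha))              # step 1: topic mixture of the document
z = rng.choice(K, size=N, p=theta)                    # step 2: a topic for each word position
doc = np.array([rng.choice(V, p=phi[t]) for t in z])  # step 4: sample each word from its topic
```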

In general, the probability of a document (or tweet, in our case) \(\mathbf{w}\) containing N words can be described as:

$$\begin{aligned} p(\mathbf{w})=\int _{\theta } p(\theta \mid \alpha )\left( \prod \limits _{n=1}^{N} \sum \limits _{z_n=1}^{k} p(w_n \mid z_n ;\beta )\,p(z_n \mid \theta ) \right) \mathrm{d}\theta \end{aligned}$$

(1)

Finally, the probability of a corpus of M documents \(D=\{\mathbf{w}_1,\ldots ,\mathbf{w}_M\}\) can be expressed as the product of the marginal probabilities of each single document \(D_m\), as shown in (2).

$$\begin{aligned} p(D) = \prod \limits _{m=1}^{M} \int _{\theta } p(\theta _m \mid \alpha )\left( \prod \limits _{n=1}^{N_m} \sum \limits _{z_n=1}^{k} p(w_{m,n} \mid z_{m,n} ;\beta )\,p(z_{m,n} \mid \theta _m ) \right) \mathrm{d}\theta _m \end{aligned}$$

(2)

In our analysis, which involves tweets over a 2-year period, we find that tweet content varies over time; the corpus of topics is therefore not static. We adopt the Dynamic LDA (DLDA) model, in which topics are clustered in time epochs and a state-space model handles topic transitions from one epoch to the next. A probabilistic Gaussian model is added as an additional dimension to obtain the posterior probabilities of the evolving topics along the timeline.

Figure 2

Dynamic topic model (for three time slices). Each set of topics in the data set evolves from the set of the previous slice. The model for each time slice corresponds to the original LDA process. Furthermore, the parameters of each topic evolve over time.

Figure 2 shows a graphical representation of the dynamic topic model (DTM)32. As a member of the probabilistic topic model class, the dynamic model can explain how different tweet topics evolve. The tweet dataset used here (March 3, 2020–November 23, 2021) covers 630 days, which is exactly seven quarters of a year. The dynamic topic model is accordingly applied with seven time steps corresponding to the seven quarters of the data set. These time slices are passed to the model implementation33.
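Assuming each tweet carries its posting date, the seven 90-day slices can be built as follows; the function and constant names are illustrative, and a DTM implementation typically expects the per-slice document counts produced here:

```python
from datetime import date

START = date(2020, 3, 3)  # first day of the collection period
QUARTER = 90              # days per slice; 630 days -> 7 slices

def time_slices(tweet_dates):
    """Count tweets per 90-day quarter for use as a dynamic topic
    model's time-slice input."""
    counts = [0] * 7
    for d in tweet_dates:
        # clamp to the last slice so the final day (2021-11-23) is included
        idx = min((d - START).days // QUARTER, 6)
        counts[idx] += 1
    return counts
```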

A fundamental challenge in DLDA (as in LDA) is determining an appropriate number of topics. Röder et al. proposed coherence scores to assess the quality of each topic model. Specifically, topic coherence is the measure used to assess the coherence between topics extracted from a model. As coherence measures we used \(C_v\) and \(C_{umass}\). The first is a sliding-window measure based on normalized pointwise mutual information (NPMI) and cosine similarity. \(C_{umass}\), instead, is based on document co-occurrence counts, a one-preceding segmentation, and a logarithmic conditional probability as a confirmation measure. These values are intended to mimic the relative score a human is likely to assign to a topic and indicate how much the topic’s words “make sense”; they reflect the cohesion between the ‘top’ words within a given topic. The topic distribution is also inspected with principal component analysis (PCA), which can visualize topic patterns in a two-dimensional spatial distribution. A uniform distribution is preferred, as it gives a high degree of independence to each topic. A good model is judged by higher coherence and an even distribution in the two-dimensional map displayed by pyLDAvis34.
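A toy version of the \(C_{umass}\)-style confirmation measure illustrates the intuition: it scores a topic’s ordered top words by how often they co-occur in documents. This simplified form omits the normalizations used by real implementations:

```python
from math import log

def umass_coherence(top_words, docs):
    """Toy UMass-style coherence: sum over ordered word pairs of
    log((D(w_i, w_j) + 1) / D(w_j)), where D counts documents
    containing the given words. `docs` is a list of token sets."""
    def df(*words):
        return sum(1 for doc in docs if all(w in doc for w in words))
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += log((df(top_words[i], top_words[j]) + 1) / df(top_words[j]))
    return score
```

Topics whose top words frequently appear together score higher than topics pairing unrelated words.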
