ABSTRACT
Nowadays, a vast amount of data is generated on the internet, and data analysis is used to extract useful information from it. For textual data, Natural Language Processing and machine learning algorithms are used to categorize the text and draw insights from it. One of the most valuable pieces of information hidden in textual data is human sentiment.
Experts have developed algorithms to derive human emotion from textual data. Humans tend to use distinct types of words when they feel certain emotions, so machine learning algorithms can be used to find these patterns and predict the emotion hidden in the words.
This blog discusses several machine learning models and compares the accuracy and precision of each. It is based on analyzing a dataset of e-commerce product reviews and training machine learning models on it.
The first part focuses on standard data analysis techniques. Later, different machine learning models are trained and tested: a Support Vector Machine, Logistic Regression, Naive Bayes, and a Random Forest model.
The dataset used for this blog contains about 71 thousand product reviews in text form. It is cleaned using different data manipulation methods and prepared for the machine learning models by removing unused attributes. The performance of each machine learning model is discussed, and a summary of each model is given at the end. Additionally, deep learning models have been shown to improve the accuracy of sentiment analysis, particularly for longer and more complex texts. Overall, the accuracy of a sentiment analysis model can be evaluated using a variety of metrics, such as precision, recall, and F1 score.
Introduction
Sentiments play a vital role in human life, and identifying human emotions helps us respond to them in the right way. Emotions can often be read from human actions, especially gestures, because people behave in specific ways when they feel different sentiments.
Every second, an enormous amount of data is produced: according to Cloud Tweaks, almost 2.5 quintillion bytes are generated per day. While we can read human sentiment when we see actions visually, we also need a method of analyzing text-based data to extract sentiment information from it.
Sentiment Analysis:
Sentiment analysis, also called opinion mining, is the study of the emotions, views, sentiments, and opinions expressed in text.
What is the meaning of sentiment analysis?
Sentiment analysis is a combination of two words: sentiment and analysis. A sentiment can express an emotion, an opinion, or a judgment. In computational linguistics the point of interest is on opinions and sentiments rather than emotions, and the words ‘sentiment’ and ‘opinion’ are used largely interchangeably in this blog.
“Facts are objective expressions about objects, entities, and their characteristics, while opinions are normally subjective expressions that describe people’s sentiments and feelings toward objects, features, events, and their characteristics.”
Sentiment analysis, also called sentiment classification, opinion mining, review mining, or in some situations polarity classification, deals with the machine handling of subjectivity, sentiment, and opinion in a sentence or text.
Why sentiment analysis is important
When customers buy a product online, they first read the reviews of that product and then make their decision. Without a classification of those reviews, making a decision is difficult for the customer. Sentiment analysis is therefore a powerful marketing tool through which a business owner can demonstrate the quality of the business.
The origin of sentiment analysis can be traced to the 1950s, when it was primarily applied to written paper documents. With later developments, new sentiment analysis algorithms were implemented, resulting in better prediction of human emotion.
Overall, sentiment analysis is important because it allows businesses and organizations to process enormous amounts of written or spoken language quickly and accurately, and to extract valuable insights and information from this data. This can help them make more informed decisions and improve their products and services.
Some algorithms were designed for text processing, which allows us to analyze text to find sentiments. One of them is the lexicon-based method. In the lexicon-based method we first tokenize every word, and every single word is used as a feature: if a sentence contains the word “good”, it is likely to be scored as a positive sentence. A minimal sketch of this idea is shown below.
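To make the idea concrete, here is a minimal, illustrative sketch in Python; the tiny lexicon and scores below are made up for demonstration and are not a real sentiment lexicon.

# Minimal lexicon-based scoring sketch: every word carries its own score.
# This tiny lexicon is made up for demonstration; it is not a real sentiment lexicon.
lexicon = {"good": 1, "great": 1, "love": 1, "bad": -1, "poor": -1, "terrible": -1}

def lexicon_score(sentence):
    # Tokenize on whitespace and sum the score of every known word.
    tokens = sentence.lower().split()
    return sum(lexicon.get(token, 0) for token in tokens)

print(lexicon_score("This is a good product"))   # 1  -> scored positive
print(lexicon_score("The quality is not bad"))   # -1 -> wrongly scored negative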
However, single-word features do not give accurate results in every case. For example, in a sentence such as “the product is not bad”, the machine reads the word “bad” and scores the sentence as negative even though the sentence is actually positive, so a single-word feature can assign the wrong class.
To increase the accuracy of the different machine learning models compared with single-word features, we can use n-grams, such as 1-gram and 2-gram features, for sentiment prediction. In this blog we compare 1-gram and 2-gram sentiment prediction; these are also called unigram and bigram features.
We used four different machine learning models. Logistic regression predicts the probability of a target variable belonging to the Positive or Negative class. The Support Vector Machine (SVM) classifies by finding the hyperplane that has the largest margin. The Random Forest model grows and combines multiple decision trees into a forest; it can be used for classification as well as regression, but here it is used to classify the Positive and Negative classes, and for classification tasks its output is the class selected by the majority of the trees. The fourth model, Naive Bayes, is described later.
Preprocessing
About the Dataset
In this blog we use the Grammar and Online Product Reviews dataset.
It is a list of over 71,045 reviews of 1,000 different products provided by Datafiniti’s Product Database. The dataset includes the text and title of each review, the name and manufacturer of the product, reviewer metadata, and more (data.world).
From this dataset we select only two columns and rename them: reviews.text becomes Reviews_text and reviews.rating becomes Sentiment. To train the models we divide the dataset into two parts, a training set and a test set: about 70 percent of the data is used for training and the remaining 30 percent for testing. A sketch of this step is shown below.
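A possible sketch of this step with pandas and scikit-learn is shown below; the CSV file name is an assumption, while the column names follow the dataset description above.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the reviews file (the file name here is an assumption).
df = pd.read_csv("GrammarandProductReviews.csv")

# Keep only the two columns we need and rename them.
selected_df = df[["reviews.text", "reviews.rating"]].rename(
    columns={"reviews.text": "Reviews_text", "reviews.rating": "Sentiment"}
)

# 70 percent of the rows for training, 30 percent for testing.
train_df, test_df = train_test_split(selected_df, test_size=0.3, random_state=42)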
Display data frame
Here we show the data frame of the given dataset; we refer to it as selected_df.
The Reviews_text column contains the review text, which includes punctuation, misspelled words, and shortcut spellings that a machine cannot understand. Some rows of the Reviews_text column are empty, and some reviews contain link-like data that is unrelated and ambiguous. For all of these we apply cleaning steps, which are discussed in detail later.
The number of reviews by score:
Score | Reviews |
5 | 46543 |
4 | 14598 |
3 | 4369 |
1 | 3701 |
2 | 1833 |
Reviews with scores of 4 and 5 are considered positive, and reviews with scores of 1 and 2 are considered negative. A sketch of this mapping is shown below.
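A minimal sketch of this mapping (assuming the selected_df DataFrame from the previous step, and assuming that reviews with a score of 3 are simply left out of the two classes) could look like this:

# Map star ratings to sentiment labels: 4 and 5 -> Positive, 1 and 2 -> Negative.
rating_to_sentiment = {5: "Positive", 4: "Positive", 2: "Negative", 1: "Negative"}

# Keep only rows whose rating maps to one of the two classes (3-star reviews are dropped here).
selected_df = selected_df[selected_df["Sentiment"].isin(list(rating_to_sentiment))].copy()
selected_df["Sentiment"] = selected_df["Sentiment"].map(rating_to_sentiment)

print(selected_df["Sentiment"].value_counts())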
The ratio of Positive and Negative reviews:
Data Cleaning
Data cleaning is a key step in preparing data for analysis. It increases overall productivity, gives more accurate results, leads to higher-quality decision making, and helps remove errors. In this step we remove unwanted punctuation, duplicate data, inaccurate data, links, and special characters, and filter out stop words from the dataset.
Preprocessing is a key step in the sentiment analysis process. It involves cleaning and preparing the text data so that it can be accurately analyzed by a machine learning model. This typically involves several different tasks, such as:
- Tokenization: This involves breaking the text down into individual words or phrases (called tokens) so that they can be more easily processed by the model.
- Stemming: This involves reducing words to their base form (called the stem) to normalize the text data. For example, “running”, “runs”, and “ran” would all be reduced to the stem “run”.
- Part-of-speech tagging: This involves labeling the part of speech (e.g., noun, verb, adjective) of each word in the text. This can be useful for determining the sentiment of a word, as certain parts of speech are more likely to convey a positive or negative sentiment.
- Stop word removal: This involves removing simple words (called stop words) such as “a”, “the”, and “and” that do not supply useful information for the sentiment analysis.
- Normalization: This involves converting all text to lowercase and removing punctuation or special characters.
Preprocessing the text data in this way can help to improve the accuracy of the sentiment analysis by ensuring that the model analyzes only the most relevant and useful information from the text.
To remove stop words we use the following library:
NLTK
This library removes all stop words, i.e., “the”, “a”, “an”, “in”, from the Reviews_text column:
from nltk.corpus import stopwords
rmv_stop_words = stopwords.words("english")
After removing stop words, all the text of the Reviews_text column is converted to lower case:
selected_df['Reviews_text'] = selected_df['Reviews_text'].str.lower()
After the text is converted to lower case, we extract the stem of every word in the Reviews_text column using the PorterStemmer, splitting each review into tokens first. A sketch of these steps is shown below.
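A minimal sketch of the stop-word removal, lower-casing, and stemming steps (assuming the selected_df DataFrame and Reviews_text column described above) might look like this:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")  # only needed once, to fetch the stop-word lists

rmv_stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_review(text):
    # Lower-case, split into tokens, drop stop words, and stem what is left.
    tokens = str(text).lower().split()
    stems = [stemmer.stem(tok) for tok in tokens if tok not in rmv_stop_words]
    return " ".join(stems)

selected_df["Reviews_text"] = selected_df["Reviews_text"].apply(clean_review)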
The SnowballStemmer library can be used to extract stems from words in other languages, such as French and Italian.
After all of the preprocessing, we reach the midway point of the analysis: extracting the terms to use as components of the document vectors and as input features to the SVM, Logistic Regression, Naive Bayes, and Random Forest classification models.
For every form of vector, such as unigram and bigram, we train the models in the Implementation part. To create the document vectors from the Reviews_text column, we first create a bag of words, as sketched below.
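As a sketch of the bag-of-words step (the vectorizer settings here are an assumption, and selected_df is the DataFrame prepared above), unigram document vectors can be built with scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

# Unigram bag of words: each column of X counts one word of the vocabulary.
cv = CountVectorizer(stop_words="english", ngram_range=(1, 1))
X = cv.fit_transform(selected_df["Reviews_text"])
y = selected_df["Sentiment"]

print(X.shape)  # (number of reviews, vocabulary size)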
Now the data is ready for training the models.
Machine Learning Algorithms
We discuss four types of machine learning algorithms: Support Vector Machine, Logistic Regression, Random Forest, and Naive Bayes.
Support Vector Machine (SVM)
A Support Vector Machine is a supervised machine learning algorithm used for classification, outlier detection, and regression. In this blog, the SVM is built for the Positive and Negative classification of reviews: it assigns new data to one of the given classes or categories.
In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik and colleagues (Boser et al., 1992; Guyon et al., 1993; Cortes and Vapnik, 1995; Vapnik et al., 1997), SVMs are one of the most robust prediction methods, being based on the statistical learning framework or VC theory proposed by Vapnik (1982, 1995) and Chervonenkis (1974) (Wikipedia, n.d.).
SVMs can be used for linear classification purposes. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using the kernel trick, which implicitly maps the inputs into high-dimensional feature spaces (Kaggle). Once the hyperplane has been drawn, the SVM can then be used to classify new, unseen text data as positive or negative based on which side of the hyperplane it falls on. In order to make this classification, the SVM uses several different mathematical techniques, such as kernel functions and support vectors, to find the optimal hyperplane.
Overall, SVM is a powerful and effective algorithm for sentiment analysis and has been shown to produce reliable results in many different applications.
In the above graph (a), the green circles are the Positive vectors and the orange circles are the Negative vectors. The higher the margin, the higher the accuracy. In this graph the data has a regular shape, so it is easy to classify; if the data points are widely dispersed, it is difficult to separate them with a linear hyperplane. In that situation the SVM uses the kernel trick, which transforms the data into a higher-dimensional space so that non-linearly separable classification problems become linearly separable.
The separating hyperplane is defined by

w · x + b = 0

and the margin is

margin = 2 / ‖w‖
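To connect these formulas to code, the sketch below fits a linear SVM on tiny, made-up two-dimensional data (not the review vectors) and computes the margin 2 / ‖w‖ from the learned weights:

import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable 2-D toy data: two points per class.
X = np.array([[0.0, 0.0], [0.5, 0.5], [3.0, 3.0], [3.5, 3.0]])
y = np.array(["Negative", "Negative", "Positive", "Positive"])

clf = SVC(kernel="linear", C=1e6)  # a large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                   # the separating hyperplane is w.x + b = 0
b = clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)   # margin = 2 / ||w||
print(w, b, margin)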
Logistic Regression:
Logistic regression is a statistical model used for predictive analysis and classification. It estimates the probability of an event (such as yes or no) occurring based on a given set of independent variables; the predicted probability of the dependent variable lies between 0 and 1.
It belongs to the supervised machine learning models. It distinguishes between classes; in this blog, it is used to distinguish between the Positive and Negative review classes.
P(Y = Positive | x) = exp(w · x + b) / (1 + exp(w · x + b))  (1)
P(Y = Negative | x) = 1 / (1 + exp(w · x + b))  (2)
Here x ∈ Rⁿ is the input feature vector, Y ∈ {Positive, Negative} is the label, w is the weight vector, b is the offset value, and w · x is the dot product. Logistic regression compares the two probability values and assigns x to the class with the higher probability.
In the above graph, 0 represents the Negative class and 1 represents the Positive class.
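A small numeric sketch of equations (1) and (2), with made-up weights and a single feature vector, shows how x is assigned to the class with the higher probability:

import numpy as np

# Made-up weights, offset, and input vector, for illustration only.
w = np.array([0.8, -0.4, 1.2])
b = -0.1
x = np.array([1.0, 0.5, 0.2])

z = np.dot(w, x) + b
p_positive = np.exp(z) / (1.0 + np.exp(z))   # equation (1)
p_negative = 1.0 / (1.0 + np.exp(z))         # equation (2)

# Assign x to the class with the higher probability.
label = "Positive" if p_positive > p_negative else "Negative"
print(p_positive, p_negative, label)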
 | correct | not correct |
selected | True Positive | False Positive |
not selected | False Negative | True Negative |
Confusion matrix fig (3)
Precision, Recall, F1-measure
In the statistical analysis of binary classification, the F1 score is a measure of a model's accuracy. In this blog we measure the accuracy of all the models with it.
Together, precision and recall can be used to evaluate the overall performance of a sentiment analysis model. A good model will have high precision and high recall, showing that it is both correct and thorough in its predictions.
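These metrics follow directly from the confusion-matrix entries in fig (3); the sketch below computes them, using made-up counts in the example call:

def precision_recall_f1(tp, fp, fn):
    # Precision: of everything the model selected as Positive, how much was correct.
    precision = tp / (tp + fp)
    # Recall: of everything that was actually Positive, how much the model selected.
    recall = tp / (tp + fn)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example call with made-up counts.
print(precision_recall_f1(tp=90, fp=10, fn=20))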
Random Forest
Random Forest is also a supervised machine learning algorithm, used for classification and regression problems. The algorithm builds decision trees on different samples and combines their votes to improve the predictive accuracy on the dataset. The more decision trees there are, the more robust the model becomes and the higher the accuracy.
Steps for RF algorithm
- Select random samples from a given training set.
- The algorithm constructs a decision tree for every sample.
- Voting then takes place across the decision trees (averaging for regression, majority vote for classification).
Finally, the most voted prediction result is selected as the final prediction (simplilearn).
Hyperparameters of Random Forest model:
- n_estimators: Number of trees the algorithm builds before averaging the predictions.
- max_features: Maximum number of features Random Forest considers when splitting a node.
- min_samples_leaf: Minimum number of samples required to be at a leaf node.
Hyperparameters to increase the speed of the model:
- n_jobs: Tells the engine how many processors it is allowed to use. If the value is 1, it can use only one processor, but if the value is -1, there is no limit.
- random_state: Controls randomness of the sample. The model will always produce the same results if it has a definite value of random state and if it has been given the same hyperparameters and the same training data.
- oob_score: OOB (Out of the Bag) is a Random Forest cross-validation method. In this, one-third of the sample is not used to train the data but to evaluate its performance
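As a sketch of how these hyperparameters appear in scikit-learn's RandomForestClassifier (the specific values below are illustrative assumptions, not tuned settings):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees built before averaging the predictions
    max_features="sqrt",   # maximum features considered when splitting a node
    min_samples_leaf=1,    # minimum samples required at a leaf node
    n_jobs=-1,             # -1: use all available processors
    random_state=42,       # fixed seed so results are reproducible
    oob_score=True,        # evaluate with out-of-bag samples
)
# rf.fit(X_train, y_train) would train the forest on the prepared document vectors.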
Leaf Node
- A leaf node is a node that carries the classification or the decision.
Decision Node
- A node that has two or more branches.
Root Node
- The root node is the topmost decision node, which is where you have all of your data (simplilearn).
Naive Bayes (NB)
The fourth model is the Naive Bayes (NB) classifier. Specifically, we use the MultinomialNB classifier.
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable (Wikipedia).
Bayes' theorem states that P(c|x) = P(x|c) · P(c) / P(x). Here,
- P(c|x): the posterior probability of class c (the target) given predictor x (the attributes). This is the probability of c being true, given that x is true.
- P(c): the prior probability of the class. This is the observed probability of the class out of all the observations.
- P(x|c): the likelihood, which is the probability of the predictor given the class. This is the probability of x being true, given that c is true.
- P(x): the prior probability of the predictor. This is the observed probability of the predictor out of all the observations.
With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial distribution (p₁, …, pₙ), where pᵢ is the probability that event i occurs (or K such multinomials in the multiclass case).
If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero, because the probability estimate is directly proportional to the number of occurrences of a feature's value. This is problematic because it will wipe out all information in the other probabilities when they are multiplied; in practice this is usually handled by smoothing the counts, for example with Laplace (add-one) smoothing (wikipedia.org).
Model Implementation
Naive Bayes
Naive Bayes accuracy for unigrams is 93.81%.
For bigrams:
cv = CountVectorizer(stop_words='english', ngram_range=(2, 2), tokenizer=token.tokenize)
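A self-contained sketch of this step might look like the following; the RegexpTokenizer used for token is an assumption, since its definition is not shown above, and selected_df is the prepared DataFrame:

from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

token = RegexpTokenizer(r"[a-zA-Z0-9]+")  # assumed tokenizer: keeps only letters and digits

# Bigram bag of words over the cleaned review text.
cv = CountVectorizer(stop_words="english", ngram_range=(2, 2), tokenizer=token.tokenize)
X = cv.fit_transform(selected_df["Reviews_text"])
y = selected_df["Sentiment"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

nb = MultinomialNB()
nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))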
Unigram confusion matrix:
True Positive: 14673
False Positive: 586
False Negative: 446
True Negative: 964

Bigram confusion matrix:
True Positive: 13875
False Positive: 1384
False Negative: 361
True Negative: 1049
Because of its naive independence assumptions, the Naive Bayes classifier can be trained very quickly and is well suited to large datasets; however, those same assumptions can sometimes lead to misclassifications, which can reduce its overall accuracy.
Why we compare n-gram accuracies
The reason is that the bag-of-words model is a quite simple approach to word vectorization and has certain limitations. Mainly, the words are not ordered as they are collected, so a lot of the context of the text is lost.
If the context contains a negation such as “not bad” or “not good”, where the single words suggest the opposite meaning, using bigrams or larger n-grams captures the meaning of the context better.
So, using n-grams can help preserve some word order and therefore context. In the results below, the bigram features score about one percent higher than the unigram features for the SVM and Random Forest models, although the unigram features score higher for Naive Bayes and Logistic Regression.
Logistic regression model
We train the logistic regression model on unigram features; a sketch of this step is shown below.
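A sketch of this training step (assuming the unigram document vectors X and labels y built in the bag-of-words sketch earlier) could be:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

lr = LogisticRegression(max_iter=1000)  # a higher iteration cap helps convergence on sparse text features
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred, pos_label="Positive"))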
Results of Logistic Regression:
Classifier | N-gram | Accuracy |
Logistic regression | Unigram | 0.961 |
Logistic regression | Bigram | 0.954 |
Why is the unigram accuracy greater than the bigram accuracy?
Here the accuracy of unigrams is higher than that of bigrams because of data sparsity: each user rates only a small set of items, and this problem worsens as the dimensionality of the data increases.
Random Forest
True Positive: 18308
False Positive: 25
False Negative: 949
True Negative: 721
Applying SVM Model
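A sketch of this step (again assuming document vectors X and labels y prepared earlier; LinearSVC is used here as one reasonable choice for large, sparse text data):

from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

svm = LinearSVC()  # a linear-kernel SVM suited to sparse bag-of-words features
svm.fit(X_train, y_train)

y_pred = svm.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred, pos_label="Positive"))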
True Positive: 18248
False Positive: 85
False Negative: 549
True Negative: 1121
The F1 score is a method for calculating the performance of a classification model, and it is computed from the counts above. The hyperplane with the largest margin, i.e., the maximum distance between the nearest points of the different classes, gives the higher accuracy.
Conclusion
We have summarized sentiment analysis and compared the accuracy of different machine learning models. This blog covers four supervised machine learning models: Support Vector Machine, Logistic Regression, Random Forest, and Naive Bayes. The approach is an automated model for evaluating sentiment that depends on the weight of each word in place of its raw term frequency, and it classifies sentiment strength into two sentiment polarity classes.
To test the models' performance we evaluate predictions on the training and test sets. This is a practical and reliable way to deal with over- and under-fitting. To measure the learning performance of the methods we use the accuracy and F1-score metrics.
Below we demonstrate the results obtained for all four models.
Here we show the results of Naive Bayes:
Classifier | N-gram | Accuracy |
Naive Bayes | Unigram | 93.81% |
Naive Bayes | Bigram | 89.53% |
Table 1
Table 1 shows the accuracy of Naive Bayes, which is the lowest compared with the other models. The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. Because of this, the classifier can be trained very quickly and is well suited to large datasets. However, the naive independence assumptions that the classifier makes can sometimes lead to misclassifications, which can reduce its overall accuracy. There are a few ways to improve the accuracy of a Naive Bayes classifier, such as using a more sophisticated model that can account for dependencies between features, or using a different classification algorithm altogether.
Logistic Regression results:
Classifier | N-gram | F1 score |
Logistic Regression | Unigram | 0.961 |
Logistic Regression | Bigram | 0.954 |
Table 2
Table 2 shows the F1 score of Logistic Regression for the unigram and bigram bags of words; its unigram score is slightly higher than that of the SVM.
Classifier | N-gram | F1 score |
Random Forest | Unigram | 0.951 |
Random Forest | Bigram | 0.962 |
Table 3
In Table 3, the unigram score of Random Forest is lower than the unigram score of Logistic Regression.
Classifier | N-gram | F1 score |
Support Vector Machine | Unigram | 0.96 |
Support Vector Machine | Bigram | 0.97 |
Table 4
Table 4 shows the F1 score of the Support Vector Machine: 0.96 for unigrams and 0.97 for bigrams, which is the highest score among all the models.
Here the accuracy is higher than that of all the other models because the SVM finds the hyperplane in a high-dimensional space that maximally separates the classes in the training data. The training points closest to this hyperplane are called the "support vectors," and the goal of the SVM is to find the hyperplane that has the largest margin, i.e., the maximum distance between the nearest points of the different classes. Because SVMs find the hyperplane with the largest margin, they tend to have good generalization performance and can achieve high accuracy on a variety of classification tasks.
Based on the results, one can conclude that the model is somewhat over-fitted. This is not a good sign, but it can be addressed by fine-tuning the hyperparameters of the model and using cross-validation techniques. Overall, the score on the test set is considerably high. Another thing to note is that training the logistic regression takes only a small amount of time, which makes it a good starting model for testing how the data preparation stages influence model performance. Random Forest training took much more time compared with the other models, and its accuracy is better than that of Naive Bayes.
Ibad Khan
Intern / Data Analytics