With the growing interaction of people in cyberspace, sentiment analysis has become a key area of machine learning. We will use scikit-learn to predict bad comments in the given data set.
Here’s the flow chart of the approach that we are going to take.

Now that we have defined the approach, let's get our hands dirty with the code. I have written a Python notebook explaining each step. We have tried to predict bad comments using four well-known classifiers: SVC, MultinomialNB, LogisticRegression, and SGDClassifier.
Click here to download the dataset
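To make the flow concrete, here is a minimal sketch of how the four classifiers can be trained on TF-IDF features with scikit-learn. The file name comments.csv and the column names comment and is_bad are assumptions for illustration only; adapt them to the actual dataset and notebook.

```python
# Minimal sketch: train the four classifiers on TF-IDF features.
# Assumptions: a CSV named "comments.csv" with a text column "comment"
# and a binary label column "is_bad" (hypothetical names).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, SGDClassifier

df = pd.read_csv("comments.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["comment"], df["is_bad"], test_size=0.2, random_state=42
)

classifiers = {
    "SVC": SVC(),
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(),
    "SGDClassifier": SGDClassifier(),
}

for name, clf in classifiers.items():
    # TF-IDF turns raw comments into numeric features the classifier can use.
    pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    pipeline.fit(X_train, y_train)
    print(name, pipeline.score(X_test, y_test))
```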
Final Result:
- SVC: 66%
- MultinomialNB: 11%
- LogisticRegression: 59%
- SGDClassifier: 47%
So, Naive Bayes gives a very poor result: it correctly identifies only 11% of bad comments. SGDClassifier predicted 47% of bad comments correctly, a considerable improvement over Naive Bayes. Logistic Regression, despite having "regression" in its name, is a classifier, and at 59% it improves further on SGDClassifier.
SVC comes out as the winner, correctly predicting 66% of bad comments.
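These percentages read like the share of bad comments each model catches, i.e. recall on the bad class. A sketch of how such a number can be computed, assuming one of the fitted pipelines and the test split from the earlier snippet, and assuming label 1 means "bad comment":

```python
# Sketch: measure what fraction of bad comments a model catches (recall).
# Assumes `pipeline`, `X_test`, and `y_test` from the earlier sketch,
# with label 1 meaning "bad comment" (an assumption about the dataset).
from sklearn.metrics import recall_score, classification_report

y_pred = pipeline.predict(X_test)
print("Recall on bad comments:", recall_score(y_test, y_pred, pos_label=1))
print(classification_report(y_test, y_pred))
```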
As you can see, each classifier has many different parameters.
For example, MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True) exposes alpha, class_prior, and fit_prior.
In this post, we have run each classifier with its default settings. In the next post, we will look at performance tuning by changing these parameters.
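For reference, here is a quick sketch of how defaults can be inspected and overridden in scikit-learn; the alpha value shown is just an illustrative choice, not a tuned one.

```python
# Sketch: inspect and override a classifier's default parameters.
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
print(nb.get_params())               # shows alpha, class_prior, fit_prior defaults

nb_tuned = MultinomialNB(alpha=0.5)  # illustrative value, not a tuned one
nb.set_params(alpha=0.5)             # parameters can also be changed on an existing estimator
```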
Troubleshooter:
- unicode(message, 'utf8').lower() may throw an error in Python 3; replace it with str(message).lower(). This is because Python 3 removed the unicode() builtin: str is already Unicode text (see the sketch after this list).
- You may have to separately download additional corpora such as 'wordnet' or 'punkt' for language processing, using the nltk.download() command in your IDE or Python shell (also shown in the sketch below).
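A small sketch combining both fixes, using a hypothetical message variable standing in for a comment from the dataset:

```python
# Sketch of the two troubleshooter fixes for Python 3.
import nltk

# One-time downloads of the NLTK data used for tokenization/lemmatization.
nltk.download('punkt')
nltk.download('wordnet')

message = "Some RAW comment text"   # hypothetical example message
# Python 2: unicode(message, 'utf8').lower()
# Python 3: str is already Unicode, so this is enough:
lowered = str(message).lower()
print(lowered)
```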
Read More about classification
Productionalise your scikit model
If you liked this article and would like one such blog to land in your inbox every week, consider subscribing to our newsletter: https://skillcaptain.substack.com