About

Sentiment Analysis is a field of natural language processing that seeks to use machine learning techniques to determine sentiment scores for a body of text. The idea is to determine the polarity of a phrase or sentence as negative, neutral or positive. In some cases, the goal is also to determine whether a statement is objective or subjective.

Our dataset is from the Kaggle website; "it contains 1.6 million tweets extracted using the twitter api". Since this is a very large dataset, we will be using pandas_profiling for quick data analysis. We also expect longer runtimes for some of the code cells.

The polarity of the tweets is annotated as 0 = negative and 4 = positive.

To perform the sentiment analysis, the vaderSentiment library will be used. It was created specifically for analyzing sentiments expressed in social media. VADER stands for Valence Aware Dictionary and Sentiment Reasoner.

With the vaderSentiment library, we are going to reclassify polarity as negative, neutral and positive sentiment, based on the compound score. Typical thresholds from the vaderSentiment GitHub page are as follows:

Sentiment    Compound score
Positive     >= 0.05
Neutral      > -0.05 and < 0.05
Negative     <= -0.05
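
As a minimal sketch of how these thresholds translate into a label, the function below maps a single compound score to a sentiment name. The function name `classify_compound` and its `threshold` parameter are illustrative assumptions, not part of vaderSentiment.

```python
def classify_compound(score, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label.

    Hypothetical helper: the name and the threshold argument are
    illustrative; only the 0.05 cut-offs come from the table above.
    """
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"  # strictly between -threshold and +threshold


# Try a few scores, including the two boundary values.
for s in (0.4939, 0.05, -0.0173, -0.05, -0.75):
    print(s, classify_compound(s))
```

Note that both boundary values belong to the outer classes: 0.05 counts as positive and -0.05 as negative.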

Install required dependencies

!pip install vaderSentiment
import sys
!{sys.executable} -m pip install -U pandas-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension

Import required Libraries

import pandas as pd 
import numpy as np 
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

Upload dataset and create dataframe

from google.colab import files
files.upload()
Saving training.1600000.processed.noemoticon.csv to training.1600000.processed.noemoticon (1).csv
df = pd.read_csv('training.1600000.processed.noemoticon.csv', header=None, encoding='latin-1')  # the file has no header row; the date column is kept as text
df
0 1 2 3 4 5
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t...
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton is upset that he can't update his Facebook by ...
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus @Kenichan I dived many times for the ball. Man...
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF my whole body feels itchy and like its on fire
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli @nationwideclass no, it's not behaving at all....
... ... ... ... ... ... ...
1599995 4 2193601966 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY AmandaMarie1028 Just woke up. Having no school is the best fee...
1599996 4 2193601969 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY TheWDBoards TheWDB.com - Very cool to hear old Walt interv...
1599997 4 2193601991 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY bpbabe Are you ready for your MoJo Makeover? Ask me f...
1599998 4 2193602064 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY tinydiamondz Happy 38th Birthday to my boo of alll time!!! ...
1599999 4 2193602129 Tue Jun 16 08:40:50 PDT 2009 NO_QUERY RyanTrevMorris happy #charitytuesday @theNSPCC @SparksCharity...

1600000 rows × 6 columns

df.columns = ['Polarity', 'tweet_id', 'date', 'flag', 'user', 'text']

df.head(5)
Polarity tweet_id date flag user text
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t...
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton is upset that he can't update his Facebook by ...
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus @Kenichan I dived many times for the ball. Man...
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF my whole body feels itchy and like its on fire
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli @nationwideclass no, it's not behaving at all....
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   Polarity  1600000 non-null  int64 
 1   tweet_id  1600000 non-null  int64 
 2   date      1600000 non-null  object
 3   flag      1600000 non-null  object
 4   user      1600000 non-null  object
 5   text      1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB
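
The column names can also be supplied while the file is being read, which avoids the separate renaming step. The sketch below assumes two made-up rows in the same six-field layout (the tweet ids, user names and texts are invented for illustration) and reads them from an in-memory buffer instead of the Kaggle file.

```python
import io

import pandas as pd

# Two invented rows in the same six-field, fully quoted layout as the dataset.
sample_csv = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","first example tweet"\n'
    '"4","2","Tue Jun 16 08:40:49 PDT 2009","NO_QUERY","user_b","second example tweet"\n'
)

column_names = ['Polarity', 'tweet_id', 'date', 'flag', 'user', 'text']

# header=None: the file has no header row; names= assigns the labels at load time.
sample_df = pd.read_csv(sample_csv, header=None, names=column_names)

print(sample_df.dtypes)
```

With the real file, the same `names=` argument would be combined with `encoding='latin-1'` as in the cell above.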

Perform Exploratory Data Analysis using Pandas_Profiling

This is a very large dataset with 1.6 million rows, so we will use pandas_profiling to help us explore the data better and faster.

from pandas_profiling import ProfileReport

# generate report 

profile = ProfileReport(df, title = 'Pandas Profiling Report', explorative=True)

# to view it in Google Colab 

profile.to_notebook_iframe()
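
If profiling the full 1.6 million rows takes too long, one option is to profile a random sample instead. This is a sketch on a toy DataFrame (the name `toy_df` and its contents are made up); the sampled frame would then be passed to `ProfileReport` in place of `df`.

```python
import pandas as pd

# Toy frame standing in for the full tweet DataFrame.
toy_df = pd.DataFrame({
    'Polarity': [0, 4] * 50,
    'text': ['example tweet'] * 100,
})

# Draw a reproducible 10% random sample; random_state fixes the draw.
sample_df = toy_df.sample(frac=0.1, random_state=42)

print(len(sample_df), 'rows sampled out of', len(toy_df))
```

A sample gives only an approximate picture of the data, so rare values may be missing from the report.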


df['text'].astype('string', copy=True) # returns a string-dtype copy for display; df['text'] itself stays object because the result is not assigned
0          @switchfoot http://twitpic.com/2y1zl - Awww, t...
1          is upset that he can't update his Facebook by ...
2          @Kenichan I dived many times for the ball. Man...
3            my whole body feels itchy and like its on fire 
4          @nationwideclass no, it's not behaving at all....
                                 ...                        
1599995    Just woke up. Having no school is the best fee...
1599996    TheWDB.com - Very cool to hear old Walt interv...
1599997    Are you ready for your MoJo Makeover? Ask me f...
1599998    Happy 38th Birthday to my boo of alll time!!! ...
1599999    happy #charitytuesday @theNSPCC @SparksCharity...
Name: text, Length: 1600000, dtype: string
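
Because `astype` returns a new Series, the conversion only persists when the result is assigned back. A small sketch on a made-up Series illustrates the difference:

```python
import pandas as pd

# Made-up example Series stored with the generic object dtype.
s = pd.Series(['a tweet', 'another tweet'], dtype=object)

# astype returns a new Series; s itself is left untouched.
converted = s.astype('string')

print(s.dtype)
print(converted.dtype)
```

To keep the conversion in the DataFrame, the result would be assigned back, for example `df['text'] = df['text'].astype('string')`.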
df['flag'].unique()
array(['NO_QUERY'], dtype=object)
df['Polarity'].unique()
array([0, 4])

Apply Vader Sentiment Analysis function

sia = SentimentIntensityAnalyzer()

sia_t = lambda x: sia.polarity_scores(x)  # this function will return a dictionary of values

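# note: this creates a single column labelled with the tuple ('pos', 'compound', 'neu', 'neg');
# each cell holds the whole score dictionary, which is expanded into separate columns in a later step
df['pos','compound', 'neu', 'neg' ] = df['text'].apply(sia_t)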

print(df.head())
   Polarity  ...                          (pos, compound, neu, neg)
0         0  ...  {'neg': 0.117, 'neu': 0.768, 'pos': 0.114, 'co...
1         0  ...  {'neg': 0.291, 'neu': 0.709, 'pos': 0.0, 'comp...
2         0  ...  {'neg': 0.0, 'neu': 0.842, 'pos': 0.158, 'comp...
3         0  ...  {'neg': 0.321, 'neu': 0.5, 'pos': 0.179, 'comp...
4         0  ...  {'neg': 0.138, 'neu': 0.862, 'pos': 0.0, 'comp...

[5 rows x 7 columns]
df_sia = pd.json_normalize(df[('pos', 'compound', 'neu', 'neg')], max_level=0)

df_sia.head()
neg neu pos compound
0 0.117 0.768 0.114 -0.0173
1 0.291 0.709 0.000 -0.7500
2 0.000 0.842 0.158 0.4939
3 0.321 0.500 0.179 -0.2500
4 0.138 0.862 0.000 -0.4939
df_sia.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 4 columns):
 #   Column    Non-Null Count    Dtype  
---  ------    --------------    -----  
 0   neg       1600000 non-null  float64
 1   neu       1600000 non-null  float64
 2   pos       1600000 non-null  float64
 3   compound  1600000 non-null  float64
dtypes: float64(4)
memory usage: 48.8 MB
new_df = df.join(df_sia, how='left')

new_df.head()
Polarity tweet_id date flag user text (pos, compound, neu, neg) neg neu pos compound
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t... {'neg': 0.117, 'neu': 0.768, 'pos': 0.114, 'co... 0.117 0.768 0.114 -0.0173
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton is upset that he can't update his Facebook by ... {'neg': 0.291, 'neu': 0.709, 'pos': 0.0, 'comp... 0.291 0.709 0.000 -0.7500
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus @Kenichan I dived many times for the ball. Man... {'neg': 0.0, 'neu': 0.842, 'pos': 0.158, 'comp... 0.000 0.842 0.158 0.4939
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF my whole body feels itchy and like its on fire {'neg': 0.321, 'neu': 0.5, 'pos': 0.179, 'comp... 0.321 0.500 0.179 -0.2500
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli @nationwideclass no, it's not behaving at all.... {'neg': 0.138, 'neu': 0.862, 'pos': 0.0, 'comp... 0.138 0.862 0.000 -0.4939
new_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 11 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   Polarity                   1600000 non-null  int64  
 1   tweet_id                   1600000 non-null  int64  
 2   date                       1600000 non-null  object 
 3   flag                       1600000 non-null  object 
 4   user                       1600000 non-null  object 
 5   text                       1600000 non-null  object 
 6   (pos, compound, neu, neg)  1600000 non-null  object 
 7   neg                        1600000 non-null  float64
 8   neu                        1600000 non-null  float64
 9   pos                        1600000 non-null  float64
 10  compound                   1600000 non-null  float64
dtypes: float64(4), int64(2), object(5)
memory usage: 134.3+ MB
new_df.drop(columns=('pos', 'compound', 'neu', 'neg'), inplace=True) # drop the dictionary column
new_df.columns  # check to see dictionary dropped
Index(['Polarity', 'tweet_id', 'date', 'flag', 'user', 'text', 'neg', 'neu',
       'pos', 'compound'],
      dtype='object')
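
The store, normalize, join and drop steps above can also be collapsed into one step by building the score columns directly from the list of dictionaries. The sketch below uses `fake_scores`, a hypothetical stand-in for `sia.polarity_scores` that returns a dictionary with the same four keys, so that it runs without vaderSentiment; its values are invented.

```python
import pandas as pd


def fake_scores(text):
    """Hypothetical stand-in for sia.polarity_scores (same keys, invented values)."""
    happy = 'happy' in text.lower()
    return {
        'neg': 0.0,
        'neu': 0.5 if happy else 1.0,
        'pos': 0.5 if happy else 0.0,
        'compound': 0.5 if happy else 0.0,
    }


toy_df = pd.DataFrame({'text': ['happy day', 'need a hug']})

# One dictionary per tweet -> one row per tweet, one column per key.
scores = pd.DataFrame(toy_df['text'].apply(fake_scores).tolist(), index=toy_df.index)

# Attach the four score columns; no intermediate dictionary column is created.
result = toy_df.join(scores)
print(result)
```

Passing `index=toy_df.index` keeps the score rows aligned with the tweets even when the DataFrame does not have a default integer index.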
import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline

sns.displot(x=new_df['compound'])
plt.show()

Create new column for Sentiment Type

## conditions follow the thresholds listed in the About section

conditions = [
              (new_df['compound'] >= 0.05),
              (new_df['compound'] > -0.05) & (new_df['compound'] < 0.05),
              (new_df['compound'] <= -0.05)
              ]

## create list of values to assign to the conditions

values = ['positive', 'neutral', 'negative']

## create a new column 

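new_df['Sentiment'] = np.select(conditions, values, default='unknown')  # a string default keeps the dtype consistent; recent NumPy versions reject mixing the integer default 0 with string choices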

new_df.head()
Polarity tweet_id date flag user text neg neu pos compound Sentiment
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t... 0.117 0.768 0.114 -0.0173 neutral
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton is upset that he can't update his Facebook by ... 0.291 0.709 0.000 -0.7500 negative
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus @Kenichan I dived many times for the ball. Man... 0.000 0.842 0.158 0.4939 positive
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF my whole body feels itchy and like its on fire 0.321 0.500 0.179 -0.2500 negative
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli @nationwideclass no, it's not behaving at all.... 0.138 0.862 0.000 -0.4939 negative
values1 = [4, 2, 0]

new_df['Polarity_new'] = np.select(conditions, values1)
new_df.head()
Polarity tweet_id date flag user text neg neu pos compound Sentiment Polarity_new
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t... 0.117 0.768 0.114 -0.0173 neutral 2
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton is upset that he can't update his Facebook by ... 0.291 0.709 0.000 -0.7500 negative 0
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus @Kenichan I dived many times for the ball. Man... 0.000 0.842 0.158 0.4939 positive 4
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF my whole body feels itchy and like its on fire 0.321 0.500 0.179 -0.2500 negative 0
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli @nationwideclass no, it's not behaving at all.... 0.138 0.862 0.000 -0.4939 negative 0
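
To see how `np.select` behaves at the threshold boundaries, here is a small sketch on a handful of compound scores (three taken from the rows above, plus the two boundary values). The `default='unknown'` label is an assumption added for illustration; it would only be used for values that match none of the conditions, such as NaN.

```python
import numpy as np

compound = np.array([0.4939, 0.05, -0.0173, -0.05, -0.75])

# Conditions are evaluated in order; the first one that is True decides the label.
conditions = [
    compound >= 0.05,
    (compound > -0.05) & (compound < 0.05),
    compound <= -0.05,
]

labels = np.select(conditions, ['positive', 'neutral', 'negative'], default='unknown')
print(labels)
```

Because the three conditions do not overlap and together cover every finite score, each value receives exactly one label.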
import seaborn as sns
%matplotlib inline

sns.displot(new_df['Polarity_new'])
plt.show()
sns.displot(new_df['Polarity'])
plt.show()
# tweets annotated as negative (0) in the dataset but scored as positive (4) by VADER
df_neg_pos = new_df[(new_df['Polarity'] == 0) & (new_df['Polarity_new'] == 4)]

print(df_neg_pos['text'].head(20))
2     @Kenichan I dived many times for the ball. Man...
6                                           Need a hug 
7     @LOLTrish hey  long time no see! Yes.. Rains a...
14    @smarrison i would've been the first, but i di...
15    @iamjazzyfizzle I wish I got to watch it with ...
18    @LettyA ahh ive always wanted to see rent  lov...
19    @FakerPattyPattz Oh dear. Were you drinking ou...
21    one of my friend called me, and asked to meet ...
23               this week is not going as i had hoped 
28    ooooh.... LOL  that leslie.... and ok I won't ...
33    @julieebaby awe i love you too!!!! 1 am here  ...
38    @fleurylis I don't either. Its depressing. I d...
41    He's the reason for the teardrops on my guitar...
43    @JonathanRKnight Awww I soo wish I was there t...
44    Falling asleep. Just heard about that Tracy gi...
45    @Viennah Yay! I'm happy for you with your job!...
46    Just checked my user timeline on my blackberry...
47    Oh man...was ironing @jeancjumbe's fave top to...
51    @localtweeps Wow, tons of replies from you, ma...
54                                        I need a hug 
Name: text, dtype: object
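
Disagreements like these can be summarised with a cross-tabulation of the original labels against the VADER labels. The sketch below uses only the five rows shown in the `new_df.head()` output above as toy data, so the numbers say nothing about the full dataset.

```python
import pandas as pd

# The five head() rows: original Polarity and VADER-derived Polarity_new.
toy = pd.DataFrame({
    'Polarity':     [0, 0, 0, 0, 0],
    'Polarity_new': [2, 0, 4, 0, 0],
})

# Rows: original label, columns: VADER label.
table = pd.crosstab(toy['Polarity'], toy['Polarity_new'])
print(table)

# The dataset has no neutral class, so compare only the non-neutral VADER labels.
non_neutral = toy[toy['Polarity_new'] != 2]
agreement = (non_neutral['Polarity'] == non_neutral['Polarity_new']).mean()
print(agreement)
```

The same two statements applied to `new_df` would give the agreement between the dataset annotations and VADER across all tweets.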

Create a Word Cloud using the Positive Words

from wordcloud import WordCloud

# join with a space so the last word of one tweet is not fused with the first word of the next
wc = WordCloud(max_words = 2000, background_color='yellow', width = 1600, height = 1600).generate(' '.join(new_df[new_df.Polarity_new == 4].text))

plt.figure(figsize = (16,16), facecolor=None)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
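
One detail worth checking when building the text for a word cloud is the separator used to concatenate the tweets. The sketch below uses three made-up tweets and the standard-library `Counter` to show how an empty separator fuses words across tweet boundaries, while a space keeps them apart.

```python
from collections import Counter

# Three made-up tweets.
tweets = ['Need a hug', 'happy birthday', 'so happy']

fused = ''.join(tweets)    # no separator: words at tweet boundaries run together
spaced = ' '.join(tweets)  # a space keeps every word intact

fused_counts = Counter(fused.lower().split())
spaced_counts = Counter(spaced.lower().split())

print(fused_counts.most_common(3))
print(spaced_counts.most_common(3))
```

With the empty separator, tokens such as "hughappy" appear and the count for "happy" is too low, which would distort the word sizes in the cloud.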

Create a Word Cloud using the Neutral words

from wordcloud import WordCloud

# join with a space so the last word of one tweet is not fused with the first word of the next
wc = WordCloud(max_words = 2000, background_color='white', width = 1600, height = 1600).generate(' '.join(new_df[new_df.Polarity_new == 2].text))

plt.figure(figsize = (16,16), facecolor=None)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

Create a Word Cloud using the Negative Words

from wordcloud import WordCloud

# join with a space so the last word of one tweet is not fused with the first word of the next
wc = WordCloud(max_words = 2000, width = 1600, height = 1600).generate(' '.join(new_df[new_df.Polarity_new == 0].text))

plt.figure(figsize = (16,16), facecolor=None)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

References

  1. vaderSentiment Python Library, accessed 18 October 2020.
  2. Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
  3. Kaggle Sentiment Analysis Dataset, accessed 18 October 2020.
  4. Pandas Profiling, accessed 20 October 2020.