Implementing Naive Bayes Classifier using Python
Explore how to implement the Naive Bayes Classifier in Python using a real-life dataset
About
Explore how to implement the Naive Bayes Classifier in Python using a dataset from the UCI Machine Learning Repository.
Dataset info
"The target attribute for classification is Category (blood donors vs. Hepatitis C (including its progress ('just' Hepatitis C, Fibrosis, Cirrhosis)"
DataSet Source:
Creators: Ralf Lichtinghagen, Frank Klawonn, Georg Hoffmann
Donor: Ralf Lichtinghagen: Institute of Clinical Chemistry; Medical University Hannover (MHH); Hannover, Germany; lichtinghagen.ralf '@' mh-hannover.de
Donor: Frank Klawonn; Helmholtz Centre for Infection Research; Braunschweig, Germany; frank.klawonn '@' helmholtz-hzi.de
Donor: Georg Hoffmann; Trillium GmbH; Grafrath, Germany; georg.hoffmann '@' trillium.de
What is the Naive Bayes classifier?
Naive Bayes is a family of supervised learning algorithms known as 'probabilistic classifiers'. They apply Bayes' theorem under the strong ("naive") assumption that the features are independent of one another given the class.
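To make the independence assumption concrete, here is a minimal sketch that computes a class posterior by hand for a made-up problem with two classes and two features. The priors and per-feature likelihoods are invented for illustration and are not taken from the HCV dataset.

```python
import numpy as np

# Hypothetical numbers for illustration only (not from the HCV dataset).
priors = np.array([0.9, 0.1])  # P(class) for [donor, patient]

# P(feature_j = observed value | class); one row per class, one column per feature.
likelihoods = np.array([
    [0.2, 0.3],  # donor
    [0.7, 0.8],  # patient
])

# Naive assumption: features are independent given the class, so the joint
# likelihood is simply the product of the per-feature likelihoods.
joint = priors * likelihoods.prod(axis=1)

# Bayes' theorem: divide by the evidence (the sum over classes) to get the posterior.
posterior = joint / joint.sum()
print(posterior)
```

Even though the "patient" class has a much smaller prior here, its higher likelihoods are enough to give it the slightly larger posterior, which is the kind of trade-off the classifier makes for every prediction.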
Why and when to use the Naive Bayes classifier
Naive Bayes is simple to use and fast to train and predict compared with many other classification algorithms.
These classifiers have worked well in applications such as multiclass prediction, text classification and spam filtering. The scikit-learn library provides several Naive Bayes algorithms:
Gaussian Naive Bayes: Assumes the features follow a Gaussian distribution within each class.
Multinomial Naive Bayes: Assumes multinomially distributed data, and is typically used for text classification.
Complement Naive Bayes: An adaptation of the standard multinomial Naive Bayes algorithm that often outperforms it on text classification tasks.
Bernoulli Naive Bayes: Assumes the features are binary-valued variables.
Categorical Naive Bayes: Assumes each feature has its own categorical distribution.
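As a rough sketch of how the variants listed above differ in the input they expect, the snippet below fits each one on small synthetic data of the matching kind. The arrays (`X_real`, `X_counts`, `X_binary`, `X_cat`) are randomly generated stand-ins, not the HCV data, and the training accuracies it prints are only a smoke test, not a comparison of the models.

```python
import numpy as np
from sklearn.naive_bayes import (
    BernoulliNB,
    CategoricalNB,
    ComplementNB,
    GaussianNB,
    MultinomialNB,
)

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=100)  # toy binary labels

# Synthetic features of the kind each variant expects (assumed for illustration).
X_real = rng.normal(loc=y_toy[:, None], size=(100, 3))              # continuous values
X_counts = rng.poisson(lam=1 + 2 * y_toy[:, None], size=(100, 3))   # non-negative counts
X_binary = (X_counts > 2).astype(int)                               # 0/1 indicators
X_cat = np.minimum(X_counts, 3)                                     # category codes 0..3

models = {
    "GaussianNB": (GaussianNB(), X_real),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "ComplementNB": (ComplementNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
    "CategoricalNB": (CategoricalNB(), X_cat),
}

scores = {}
for name, (model, X_toy) in models.items():
    # Fit and score on the same toy data, just to show the shared fit/score API.
    scores[name] = model.fit(X_toy, y_toy).score(X_toy, y_toy)
    print(f"{name}: training accuracy = {scores[name]:.2f}")
```

All five classes share the same `fit`/`predict`/`score` interface, so switching variants is mostly a question of whether the features match the distribution the variant assumes.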
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import seaborn as sns
import numpy as np
from google.colab import files
uploaded = files.upload()
import io
hcv = pd.read_csv(io.BytesIO(uploaded['hcvdat0.csv'])) # uploaded files are bytes, so wrap them in BytesIO rather than StringIO
hcv
hcv.info()
hcv['Category'] = hcv['Category'].astype('category') # change the objects to category type data
hcv['Sex'] = hcv["Sex"].astype('category')
hcv['Category'].unique()
sns.countplot(y='Category', data = hcv)
plt.show()
sns.countplot(y='Sex', data=hcv)
suspect = hcv.loc[hcv['Category'] == '0s=suspect Blood Donor'] # check the rows labelled 0s=suspect Blood Donor
suspect
hcv[hcv['ALP'].isnull()]
from sklearn.preprocessing import LabelEncoder # use label encoder for male vs female
label = LabelEncoder() #initialize
hcv['Gender'] = label.fit_transform(hcv['Sex'])
hcv
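It helps to know which integer `LabelEncoder` assigns to which label. The sketch below uses a toy Series; the values 'm' and 'f' are an assumption about how the `Sex` column is coded, so check `label.classes_` on the real data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for the Sex column (assumed coding, not the real data).
sex = pd.Series(["m", "f", "f", "m"])

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# classes_ holds the distinct labels in sorted order; a label's position is its code.
mapping = {cls: int(code) for code, cls in enumerate(encoder.classes_)}
print(mapping)         # {'f': 0, 'm': 1}
print(codes.tolist())  # [1, 0, 0, 1]
```

Because the classes are sorted, the codes depend only on the label names and not on the order in which they appear in the column.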
drop_index = hcv.loc[hcv['Category'] == '0s=suspect Blood Donor'].index # drop rows labelled 0s=suspect Blood Donor
hcv.drop(drop_index, inplace=True)
hcv_dict = {'0=Blood Donor': 0, '1=Hepatitis' : 1, '2=Fibrosis' : 2, '3=Cirrhosis': 3}
hcv["New Category"] = hcv['Category'].map(hcv_dict).astype('int32') # create new column for Category to remap values
hcv
hcv.dropna(how='any', inplace=True) # drop rows with null values
## split into features and label/target
X = hcv.iloc[:, 2:-1].drop(columns = 'Sex').to_numpy() # features
y = hcv.iloc[:, -1].to_numpy() # label/target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
from sklearn.naive_bayes import GaussianNB #Import Gaussian Naive Bayes model
clf = GaussianNB() # instantiate the Gaussian classifier
clf.fit(X_train,y_train) # Train the model using the training sets
y_pred = clf.predict(X_test) # perform prediction
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
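When one class dominates, as the count plot of `Category` suggests may be the case here, a single accuracy figure can hide poor performance on the smaller classes. As a hedged sketch, the snippet below shows how a confusion matrix and a majority-class baseline can be read alongside accuracy. It runs on synthetic, imbalanced stand-in data (`X_demo`, `y_demo` are invented), not on the HCV dataset, and the `stratify` argument is an addition that the original split does not use.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)

# Synthetic, imbalanced stand-in data: 270 rows of class 0 and 30 rows of class 1.
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(270, 4)),
    rng.normal(2.0, 1.0, size=(30, 4)),
])
y_demo = np.array([0] * 270 + [1] * 30)

# stratify keeps the class proportions the same in the train and test splits.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

demo_clf = GaussianNB().fit(Xd_train, yd_train)
yd_pred = demo_clf.predict(Xd_test)

acc = accuracy_score(yd_test, yd_pred)
cm = confusion_matrix(yd_test, yd_pred)  # rows: true class, columns: predicted class
baseline = np.mean(yd_test == 0)         # accuracy of always predicting the majority class

print(f"accuracy = {acc:.3f}, majority-class baseline = {baseline:.3f}")
print(cm)
```

The same two calls, `confusion_matrix(y_test, y_pred)` and a comparison against the majority-class rate, can be applied to the notebook's own `y_test` and `y_pred`.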
from sklearn.naive_bayes import MultinomialNB # import the Multinomial Naive Bayes model
clf = MultinomialNB() # instantiate the Multinomial classifier (requires non-negative feature values)
clf.fit(X_train,y_train) # train the model using the training set
y_pred = clf.predict(X_test) # perform prediction
accuracy_score(y_test, y_pred) # accuracy_score was already imported above
References
UCI Machine Learning Repository - HCV dataset. Accessed 15-Nov-2020.
Sklearn Naive_Bayes. Accessed 15-Nov-2020.
Sklearn Metrics - accuracy_score. Accessed 15-Nov-2020.
Sklearn train_test_split. Accessed 15-Nov-2020.
Naive_Bayes_Classifier. Accessed 15-Nov-2020.