Implementing Support Vector Machines (SVM) Classifier using Python
Explore how to implement the Support Vector Machine Algorithm in Python using a real-life dataset
What is a Support Vector Machine?
A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used to separate a dataset into two classes with a line. This line is called the maximal-margin hyperplane, because it is chosen so that the margin on each side, between the line and the nearest points, is as big as possible.
See example below.
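As a minimal illustration of the idea (on two synthetic, linearly separable point clouds, not the dataset used below), a linear SVM recovers exactly such a maximal-margin line, and the nearest points that pin down the margin are the support vectors:

```python
import numpy as np
from sklearn import svm

# Two linearly separable clusters in 2-D (toy data for illustration only)
X = np.array([[1, 1], [2, 1], [1, 2],    # class 0
              [5, 5], [6, 5], [5, 6]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = svm.SVC(kernel='linear')
clf.fit(X, y)

# The separating line is w . x + b = 0; the support vectors are the
# nearest points on each side that define the margin.
w, b = clf.coef_[0], clf.intercept_[0]
print('weights:', w, 'bias:', b)
print('support vectors:', clf.support_vectors_)
```

Points far from the margin do not affect the fitted line at all; only the support vectors do.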
Why and When to use Support Vector Machines?
Support Vector Machines can be used for supervised machine learning problems such as classification, regression, and outlier detection. They are effective in high-dimensional spaces and are versatile.
They work best where there is a clear separation between classes. SVMs are not ideal for large datasets with lots of noise or overlapping classes, and if you have many features and a large dataset, training an SVM can be slow.
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import seaborn as sns
import numpy as np
from google.colab import files
uploaded = files.upload()
import io
hcv = pd.read_csv(io.BytesIO(uploaded['hcvdat0.csv'])) ##used BytesIO instead of StringIO
hcv
hcv['Category'] = hcv['Category'].astype('category') # change the objects to category type data
hcv['Sex'] = hcv["Sex"].astype('category')
from sklearn.preprocessing import LabelEncoder # use label encoder for male vs female
label = LabelEncoder() #initialize
hcv['Gender'] = label.fit_transform(hcv['Sex'])
hcv
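To see what LabelEncoder is doing here, a standalone sketch (with made-up labels, separate from the dataframe above) helps: it sorts the distinct values and assigns each an integer.

```python
from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()
encoded = label.fit_transform(['m', 'f', 'f', 'm'])
print(encoded)         # classes are sorted alphabetically, so 'f' -> 0, 'm' -> 1
print(label.classes_)  # the learned mapping, in order
```

The same encoder can later reverse the mapping with `label.inverse_transform`.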
drop_index = hcv.loc[hcv['Category'] == '0s=suspect Blood Donor'].index # drop rows with category '0s=suspect Blood Donor'
hcv.drop(drop_index, inplace=True)
hcv_dict = {'0=Blood Donor': 0, '1=Hepatitis' : 1, '2=Fibrosis' : 2, '3=Cirrhosis': 3}
hcv["New Category"] = hcv['Category'].map(hcv_dict).astype('int32') # create new column for Category to remap values
hcv
hcv.dropna(how='any', inplace=True) # drop rows with null values
## split into features and label/target
X = hcv.iloc[:, 2:-1].drop(columns = 'Sex').to_numpy() # features
y = hcv.iloc[:, -1].to_numpy() # label/target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
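One point worth noting before training: SVMs are sensitive to feature scale, and the scikit-learn documentation recommends standardizing the inputs. A sketch of how that could look, using synthetic data as a stand-in (with the real data you would fit the pipeline on the X_train/y_train produced above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# The pipeline fits the scaler on the training fold only,
# so no information from the test set leaks into training.
model = make_pipeline(StandardScaler(), SVC())
model.fit(Xtr, ytr)
print('test accuracy:', model.score(Xte, yte))
```

Whether scaling improves the scores on this particular dataset is worth checking empirically.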
from sklearn import svm
clf = svm.SVC() # Initiate svm.SVC (support vector classifier)
clf.fit(X_train, y_train) # train model
y_pred = clf.predict(X_test) # perform prediction
Time the SVM classifier
import time
start = time.time()
from sklearn import svm
clf = svm.SVC() # Initiate svm.SVC (support vector classifier)
clf.fit(X_train, y_train) # train model
stop = time.time()
print ('training time: ', round(stop - start, 3), 's')
start = time.time()
y_pred = clf.predict(X_test) # perform prediction
stop = time.time()
print ('prediction time: ', round(stop - start, 3), 's')
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
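Because the blood-donor class dominates this dataset, a high accuracy score can hide poor recall on the smaller disease classes. A per-class breakdown with classification_report and confusion_matrix makes this visible; the labels below are illustrative only (with the real data you would pass y_test and y_pred):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative labels only, mimicking an imbalanced 4-class problem
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 2, 3, 3]
y_pred_demo = [0, 0, 0, 0, 0, 2, 1, 2, 0, 3]

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true_demo, y_pred_demo))
# Precision, recall and F1 per class
print(classification_report(y_true_demo, y_pred_demo))
```

The confusion matrix shows exactly which classes are being mistaken for which, which accuracy alone cannot.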
Try different kernels to see if performance improves.
There are several kernels that can be used with svm.SVC: {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}. The default is 'rbf'.
The non-linear kernels are used where the relationship between the features and the target may not be linear: they allow the decision boundary to be non-linear as well as linear.
clf_linear = svm.SVC(kernel='linear') # Initiate SVM classifier with linear kernel
clf_linear.fit(X_train, y_train) # train model
y_pred_linear= clf_linear.predict(X_test)
clf_rbf = svm.SVC(kernel='rbf') # Initiate SVM classifier with rbf kernel
clf_rbf.fit(X_train, y_train) # train model
y_pred_rbf= clf_rbf.predict(X_test)
clf_poly = svm.SVC(kernel='poly') # Initiate SVM classifier with poly kernel
clf_poly.fit(X_train, y_train) # train model
y_pred_poly= clf_poly.predict(X_test)
model_eval = pd.DataFrame(columns=['Score'])
model_eval.loc['Linear','Score'] = accuracy_score(y_pred_linear, y_test)
model_eval.loc['RBF','Score'] = accuracy_score(y_pred_rbf, y_test)
model_eval.loc['Polynomial','Score'] = accuracy_score(y_pred_poly, y_test)
model_eval
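Scores from a single train/test split can be noisy, so the kernel comparison above could also be run with cross-validation for a more robust estimate. A sketch using synthetic data as a stand-in (with the real data you would pass the X and y built earlier):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# 5-fold cross-validated accuracy for each kernel
for kernel in ['linear', 'rbf', 'poly']:
    scores = cross_val_score(SVC(kernel=kernel), X_demo, y_demo, cv=5)
    print(f'{kernel}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

Reporting the mean and standard deviation across folds shows whether one kernel is genuinely better or just got a lucky split.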
Other parameters you can pass to your classifier when initializing it
C: float, default=1.0 is a regularization parameter that trades off correct classification of the training set against maximization of the margin of the decision boundary. With a larger value of C, the model fits the training set more closely, at the cost of a more complex decision boundary.
gamma: {'scale', 'auto'} or float, default='scale' is typically only used with the non-linear kernels. When the gamma value is very low, the model is unable to capture the complexity of the data and behaves more like a linear model. With a higher gamma value, there is a greater chance of overfitting.
There are many other parameters you can pass; see the scikit-learn documentation for sklearn.svm.SVC (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
clf_rbf2 = svm.SVC(kernel='rbf', C=10) # Initiate SVM classifier with rbf kernel and C=10
clf_rbf2.fit(X_train, y_train) # train model
y_pred_rbf2 = clf_rbf2.predict(X_test)
accuracy_score(y_pred_rbf2, y_test)
clf_rbf2 = svm.SVC(kernel='rbf', C=1) # Initiate SVM classifier with rbf kernel and the default C=1
clf_rbf2.fit(X_train, y_train) # train model
y_pred_rbf2 = clf_rbf2.predict(X_test)
accuracy_score(y_pred_rbf2, y_test)
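Rather than trying C values one at a time, C and gamma are usually tuned together with a grid search over cross-validation folds. A sketch using synthetic data as a stand-in (with the real data you would fit the search on X_train and y_train; the grid values below are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# Search a small grid of C and gamma with 5-fold cross-validation
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(Xtr, ytr)

print('best params:', search.best_params_)
print('test accuracy:', search.score(Xte, yte))
```

The held-out test set is only scored once, after the search has picked its parameters, so the reported accuracy is not biased by the tuning.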
#X_train, y_train = X_train[1:10], y_train[1:10] # optionally shrink the training set (slicing X and y together) to see the effect on training time
import time
start = time.time()
from sklearn import svm
clf = svm.SVC() # Initiate svm.SVC (support vector classifier)
clf.fit(X_train, y_train) # train model
stop = time.time()
print ('training time: ', round(stop - start, 3), 's')
y_pred = clf.predict(X_test) # perform prediction
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
References
scikit-learn - Support Vector Machines (https://scikit-learn.org/stable/modules/svm.html)
Support Vector Regression (SVR) using linear and non-linear kernels (https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html#sphx-glr-auto-examples-svm-plot-svm-regression-py)
sklearn.svm.SVC (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
RBF SVM parameters (https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html)
UCI Machine Learning Repository - HCV dataset. Accessed 15-Nov-2020.