In this hands-on we continue with the topic of Supervised Learning; this time we will use a Naive Bayes classifier to decide which of two classes each observation in a set belongs to. As usual, the first step is to import the required packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
For this exercise we will use the Breast Cancer dataset, which is also included in Sklearn's datasets package:
from sklearn.datasets import load_breast_cancer
breast_cancer_ds = load_breast_cancer()
Let's take a look at the description of the dataset so we can get a better idea of what's in it:
print(breast_cancer_ds.DESCR)
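We can also run a couple of quick checks on the size of the data and the names of the two classes, using the standard attributes of the dataset object:
print(breast_cancer_ds.data.shape)           # (number of samples, number of features)
print(breast_cancer_ds.target_names)         # the two classes: malignant and benign
print(np.bincount(breast_cancer_ds.target))  # number of samples per class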
This dataset has 30 features, which is way more than in any of our previous examples! We will therefore have to perform some feature selection to avoid ending up with an overly complicated model. If we take a look at the description we can see that there are really only 10 distinct attributes, with 3 statistics (mean, worst case and standard error) computed for each one. To simplify our problem we will check whether we can get reasonable results by taking only the worst-case values. Let's then put the data into a pandas DataFrame and select the features we are interested in:
breast_cancer_df_raw = pd.DataFrame(data=breast_cancer_ds.data, columns=breast_cancer_ds.feature_names)
worst_cols = [col for col in breast_cancer_ds.feature_names if col.startswith("worst")]
breast_cancer_df = breast_cancer_df_raw[worst_cols].copy()
breast_cancer_df.head()
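As a quick sanity check, we can verify that we kept all the samples and exactly the ten "worst" columns:
breast_cancer_df.shape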
Now that we have selected our data, let's add the labels to the DataFrame to make it easier to handle. Note that in this dataset the raw target is encoded as 0 for malignant and 1 for benign, so we invert it so that our malignant column is 1 when the tumour is malignant:
# target is 0 for malignant and 1 for benign, so we invert it to get 1 = malignant
breast_cancer_df['malignant'] = 1 - breast_cancer_ds.target
breast_cancer_df.head()
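It is also worth taking a quick look at how balanced the two classes are, since a strong imbalance would change how we interpret the accuracy scores later on:
breast_cancer_df['malignant'].value_counts()  # number of samples in each class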
We can now take a look at the correlation matrix to identify the most relevant features:
plt.figure(figsize=(14, 12))
sns.heatmap(breast_cancer_df.corr(), annot=True);
We can see that most of the features we selected have a fairly high correlation with the label. For this example, and for visualization purposes, we will keep only the ones with the highest correlation (whether negative or positive), say above 0.65 in absolute value:
most_relevant_features = breast_cancer_df.corr()[abs(breast_cancer_df.corr()["malignant"]) > 0.65].index.values
most_relevant_features
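Three of the selected features (the worst radius, perimeter and area) are all measures of size, so let's also check how correlated they are with each other:
breast_cancer_df[["worst radius", "worst perimeter", "worst area"]].corr()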
The first three features selected are the radius, perimeter and area, which, as we have just seen, are highly correlated with each other. Since using all three of them would not add much more information than using just one, we will keep only the radius so that the number of features stays to a minimum:
most_relevant_features = ["worst radius", "worst concavity", "worst concave points", "malignant"]
breast_cancer_short_df = breast_cancer_df[most_relevant_features]
breast_cancer_short_df.head()
Now that we have identified the most relevant features, we can check whether there is a clear separation of values between the two classes for each of them using boxplots:
import math
plt.figure(figsize=(15, 10))
# one subplot per feature, arranged in two columns
rows = math.ceil(len(breast_cancer_short_df.columns[:-1]) / 2)
for i, feature in enumerate(breast_cancer_short_df.columns[:-1]):
    plt.subplot(rows, 2, i + 1)
    sns.boxplot(x="malignant", y=feature, data=breast_cancer_short_df)
plt.tight_layout()
plt.show()
There is indeed a significant value separation between classes, so we can expect to get a reasonably good classification with a relatively simple classifier. Let's now plot the scatter matrix of the selected features:
sns.pairplot(breast_cancer_short_df,vars=breast_cancer_short_df.columns[:-1], hue="malignant");
Although there is some overlap, the classes still show a fairly clear separation.
Before starting with our model definition, let's split the data in two for training and validation purposes:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(breast_cancer_short_df, test_size=0.3, random_state=0)
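We can quickly check the size of both splits:
print("Training samples: {}".format(len(train_df)))
print("Test samples: {}".format(len(test_df)))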
And store the names of the feature columns and the label column for convenience:
feats = breast_cancer_short_df.columns[:-1]
label = breast_cancer_short_df.columns[-1]
On this occasion we will use a Gaussian Naive Bayes classifier, provided by Sklearn. It models each feature within each class with a Gaussian distribution and combines them through Bayes' rule, assuming the features are conditionally independent given the class. As with all Sklearn models, we can train it with the fit function, providing the training data and labels:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X=train_df[feats], y=train_df[label]);
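Under the hood, fitting the model simply means estimating, for each class, its prior probability and the mean and variance of every feature; prediction then applies Bayes' rule assuming the features are independent within each class. We can peek at these fitted parameters and compare them with the per-class statistics computed directly from the training data (attribute names as in recent Sklearn versions, where the variances are stored in var_; older versions used sigma_ instead):
print(gnb.class_prior_)                       # estimated prior probability of each class
print(gnb.theta_)                             # per-class mean of each feature
print(gnb.var_)                               # per-class variance of each feature
print(train_df.groupby(label)[feats].mean())  # should match theta_ row by row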
And get the accuracy on both the training and the test data:
train_score = gnb.score(train_df[feats], train_df[label])
print("Training accuracy: {}".format(train_score))
print("Test accuracy: {}".format(gnb.score(test_df[feats], test_df[label])))
Once the model is trained, we can use it to make predictions using the predict method:
y_pred = gnb.predict(test_df[feats])
And plot the confusion matrix:
from sklearn.metrics import confusion_matrix
plt.figure(figsize = (10,8))
sns.heatmap(confusion_matrix(test_df[label], y_pred), annot=True);
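Accuracy alone can hide which kind of errors the model makes, so as an optional extra check we can also print the per-class precision and recall using Sklearn's classification_report:
from sklearn.metrics import classification_report
print(classification_report(test_df[label], y_pred))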
Copyright © Barcelona Supercomputing Center, 2019-2020 - All Rights Reserved - AI in DataCenters