For this example we will generate a simple dataset to train our first full-fledged BigDL neural network. Let's start by importing the packages we need for numerical work and visualization, and by setting a seed for the RNG:
import numpy as np
import numpy.random as rd
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Set up a seed for the RNG
seed = 2019
rd.seed(seed)
We have already worked with logistic regression, so this time we would like to see how a neural network performs on data that is not linearly separable:
# Define data parameters
num_features = 2
num_classes = 2
num_samples = 2000
mean = 0.0
# rd.normal takes the standard deviation (scale), not the variance
std = 0.26
threshold = 0.25
x = rd.normal(mean, std, num_samples)
y = rd.normal(mean, std, num_samples)
labels = np.array([np.linalg.norm(s) > threshold for s in zip(x,y)])
colors = ['r', 'b']
plt.figure(figsize=(6,6))
plt.scatter(x,y, c=[colors[int(l)] for l in labels], s=0.2);
plt.xlim(-1,1); plt.ylim(-1,1);
This data cannot be separated by a logistic regression, because the boundary between the two classes is a circle rather than a straight line (at least not without transforming the feature vectors x and y).
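To make the remark about transforming the features concrete, here is a minimal sketch using only NumPy. It regenerates data shaped like the dataset above (the seed and the generator are arbitrary choices for this sketch) and adds a hypothetical engineered feature, the squared distance to the origin. In that extended feature space the two classes are split by a plane, which is a linear boundary.

```python
import numpy as np

# Regenerate data shaped like the dataset above (the seed is arbitrary).
rng = np.random.RandomState(0)
x = rng.normal(0.0, 0.26, 2000)
y = rng.normal(0.0, 0.26, 2000)
threshold = 0.25
labels = np.hypot(x, y) > threshold

# Hypothetical engineered feature: squared distance to the origin.
r2 = x**2 + y**2

# In (x, y, r2) space the classes are split by the plane r2 = threshold**2,
# a linear decision boundary that a logistic regression could represent.
linear_rule = r2 > threshold**2
agreement = np.mean(linear_rule == labels)
print("Agreement with the true labels: {:.3f}".format(agreement))
```

The neural network defined later has to learn an equivalent non-linear transformation by itself, from the raw x and y coordinates.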
from bigdl.util.common import *
from pyspark import SparkContext
# Get the spark context
sc = SparkContext.getOrCreate(conf=create_spark_conf().setMaster("local[4]").set("spark.driver.memory","10g"))
samples = [Sample.from_ndarray(np.array([xx,yy]), np.array(l)+1) for xx,yy,l in zip(x,y,labels)]
samples_rdd = sc.parallelize(samples)
train_test_rate = 0.7
train_rdd, test_rdd = samples_rdd.randomSplit([train_test_rate, 1-train_test_rate], seed)
print("Train observations: {}".format(train_rdd.count()))
print("Test observations: {}".format(test_rdd.count()))
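The `np.array(l)+1` in the Sample construction shifts the boolean labels to the 1-based class indices that BigDL expects. A small NumPy sketch of that encoding and of its inverse, using a made-up label vector:

```python
import numpy as np

# Illustrative boolean labels (not taken from the dataset above).
labels = np.array([False, True, True, False])

# BigDL class labels start at 1, so False -> 1 and True -> 2.
bigdl_labels = labels.astype(int) + 1

# Subtracting 1 recovers the zero-based classes used by NumPy and scikit-learn.
recovered = bigdl_labels - 1
print(bigdl_labels, recovered)
```

We will need the inverse shift again when comparing predictions against the true labels.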
As usual, we need to initialize the BigDL engine before doing any BigDL work:
init_engine()
We will now define a very basic feed-forward neural network to classify the dataset. It will consist of two layers: a first Linear layer with ReLU activation and a second Linear layer with LogSoftMax activation. For illustration purposes we will show how to define it using both the Sequential and the Functional APIs:
from bigdl.nn.layer import *
num_hidden = 10
ff_seq = Sequential()
ff_seq.add(Linear(num_features, num_hidden))
ff_seq.add(ReLU())
ff_seq.add(Linear(num_hidden, num_classes))
ff_seq.add(LogSoftMax());
print(ff_seq)
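# Functional API: each layer is called on the node it takes as input
lin1 = Linear(num_features, num_hidden)()
relu = ReLU()(lin1)
lin2 = Linear(num_hidden, num_classes)(relu)
# Use LogSoftMax as the output activation so that both models match
log_softmax = LogSoftMax()(lin2)
ff_fun = Model(lin1, log_softmax)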
for l in ff_fun.layers:
    print(l)
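To see what this architecture computes, here is a minimal NumPy sketch of the forward pass (Linear, ReLU, Linear, LogSoftMax). It is not BigDL code: the weights are random placeholders rather than trained parameters, and the names `W1`, `b1`, `W2`, `b2` and `forward` are introduced only for this illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
num_features, num_hidden, num_classes = 2, 10, 2

# Random placeholder weights stand in for the trained parameters.
W1 = rng.normal(size=(num_features, num_hidden))
b1 = np.zeros(num_hidden)
W2 = rng.normal(size=(num_hidden, num_classes))
b2 = np.zeros(num_classes)

def forward(batch):
    hidden = np.maximum(batch @ W1 + b1, 0.0)  # Linear followed by ReLU
    logits = hidden @ W2 + b2                  # second Linear layer
    # LogSoftMax, subtracting the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

batch = rng.normal(0.0, 0.26, size=(5, num_features))
log_probs = forward(batch)
print(np.exp(log_probs).sum(axis=1))  # each row of probabilities sums to 1
```

The output holds one log-probability per class for every input row, which is the input format a negative log-likelihood criterion works with.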
As we saw in the logistic regression notebook, for classification problems the recommended optimization criterion is ClassNLLCriterion. We create the Optimizer as usual:
from bigdl.optim.optimizer import *
from bigdl.nn.criterion import *
batch_size = 128
epochs = 200
learning_rate = 0.4
optimizer = Optimizer(
    model=ff_seq,
    training_rdd=train_rdd,
    criterion=ClassNLLCriterion(),
    end_trigger=MaxEpoch(epochs),
    optim_method=SGD(learningrate=learning_rate),
    batch_size=batch_size)
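As a reminder of what a negative log-likelihood criterion measures, here is a minimal NumPy sketch. It assumes log-probabilities as input (the output of a LogSoftMax layer), 1-based class labels as used by BigDL, and a loss averaged over the batch. The function `class_nll` is a hypothetical helper written for this illustration, not the BigDL implementation.

```python
import numpy as np

def class_nll(log_probs, labels_1_based):
    """Mean negative log-likelihood of the true classes (labels start at 1)."""
    rows = np.arange(len(labels_1_based))
    picked = log_probs[rows, labels_1_based - 1]  # log-probability of the true class
    return -picked.mean()

# Two made-up predictions: the first favours class 1, the second class 2.
log_probs = np.log(np.array([[0.9, 0.1],
                             [0.2, 0.8]]))
labels = np.array([1, 2])
loss = class_nll(log_probs, labels)
print("NLL loss: {:.4f}".format(loss))
```

The loss approaches zero as the probability assigned to the true class approaches one, and grows without bound as that probability approaches zero.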
To be able to trace the training progress we add validation and enable logging:
# Define a validator for the Optimizer
optimizer.set_validation(
    batch_size=batch_size,
    val_rdd=test_rdd,
    trigger=EveryEpoch(),
    val_method=[Loss()]
)
# Define the Logs and pass them to the Optimizer
import datetime as dt
log_dir = '/tmp/bigdl_summaries'
app_name = 'nn-ff-' + dt.datetime.now().strftime("%Y%m%d-%H%M%S")
# Create the train and validation summaries
train_summary = TrainSummary(log_dir=log_dir, app_name=app_name)
val_summary = ValidationSummary(log_dir=log_dir, app_name=app_name)
# Pass them to the optimizer
optimizer.set_train_summary(train_summary)
optimizer.set_val_summary(val_summary)
print("Logs saved to: {}/{}".format(log_dir, app_name))
And finally train the model:
optimizer.optimize();
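Under the hood, every training iteration moves the parameters a small step against the gradient of the loss. The sketch below shows that plain update rule in NumPy on a toy quadratic objective. It is only an illustration of the idea: the objective, the `target` vector and the `gradient` function are made up for this example, and options such as momentum or learning rate decay are left out.

```python
import numpy as np

learning_rate = 0.4
target = np.array([1.0, -2.0])  # minimiser of the toy objective
w = np.zeros(2)                 # initial parameters

def gradient(w):
    # Gradient of the toy objective f(w) = ||w - target||^2
    return 2.0 * (w - target)

for step in range(20):
    # Plain SGD update: move against the gradient, scaled by the learning rate
    w = w - learning_rate * gradient(w)

print(w)
```

In the real training loop the gradient is computed on a mini-batch of `batch_size` samples by backpropagation, so each step is a noisy estimate of the full gradient.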
Let's plot the loss information recorded in the training and validation logs:
import matplotlib.pyplot as plt
loss_train = np.array(train_summary.read_scalar("Loss"))
loss_test = np.array(val_summary.read_scalar("Loss"))
plt.figure(figsize = (12,12))
plt.subplot(2,1,1)
plt.plot(loss_train[:,0],loss_train[:,1],label='Training loss')
plt.xlim(0,loss_train.shape[0]+10)
plt.grid(True)
plt.title("Training loss")
plt.subplot(2,1,2)
plt.plot(loss_test[:,0],loss_test[:,1],label='Test loss')
plt.xlim(0,loss_train.shape[0]+10)
plt.title("Test Loss")
plt.grid(True)
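The training loss is recorded per iteration, so the raw curve can be noisy. One common way to make the trend easier to read is a moving average. The sketch below assumes records laid out like the arrays used above (column 0 is the iteration, column 1 is the loss) and uses a synthetic noisy curve instead of real training logs. The helper `moving_average` and the window size are choices made for this illustration.

```python
import numpy as np

def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Synthetic stand-in for the recorded training loss (not real results).
rng = np.random.RandomState(0)
iterations = np.arange(1, 201)
noisy_loss = np.exp(-iterations / 50.0) + rng.normal(0.0, 0.05, iterations.size)
records = np.column_stack([iterations, noisy_loss])

window = 10
smoothed = moving_average(records[:, 1], window)
# Align each average with the last iteration of its window
smoothed_iterations = records[window - 1:, 0]
print(smoothed.shape, smoothed_iterations.shape)
```

The smoothed curve can be plotted with `plt.plot(smoothed_iterations, smoothed)` on top of the raw one.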
Finally, we compute the prediction accuracy on the test set and plot the confusion matrix:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import pandas as pd
# Remember that labels are 1-indexed in BigDL:
y_pred = np.array(ff_seq.predict_class(test_rdd).collect())-1
y_label = np.array([s.label.to_ndarray()[0] - 1 for s in test_rdd.collect()])
acc = accuracy_score(y_label, y_pred)
print("The prediction accuracy is %.2f%%"%(acc*100))
cm = confusion_matrix(y_label, y_pred)
df_cm = pd.DataFrame(cm)
plt.figure(figsize = (10,8))
sns.heatmap(df_cm, annot=True,fmt='d');
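To read the heatmap it helps to know how the matrix is built. The following NumPy sketch counts the entries by hand on a small made-up example, with rows indexed by the true class and columns by the predicted class (the convention scikit-learn uses). The function `confusion_counts` is a hypothetical helper for this illustration.

```python
import numpy as np

def confusion_counts(y_true, y_pred, num_classes=2):
    """Count samples per (true class, predicted class) pair."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up zero-based labels and predictions.
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

cm = confusion_counts(y_true, y_pred)
# Correct predictions lie on the diagonal, so accuracy is trace / total.
accuracy = np.trace(cm) / cm.sum()
print(cm)
print("Accuracy: {:.2f}".format(accuracy))
```

Off-diagonal entries show which class is mistaken for which, something a single accuracy figure hides.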
Copyright © Barcelona Supercomputing Center, 2019-2020 - All Rights Reserved - AI in DataCenters