AutoML for Beginners with H2O

This is the famous "Titanic - Machine Learning from Disaster" competition from Kaggle. If you are new to machine learning, this is one of the best first challenges to dive into. The task is to use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. We are going to do this using H2O AutoML.

H2O is a fully open-source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical & machine learning algorithms including gradient boosted machines, generalized linear models, deep learning and more. H2O also has an industry-leading AutoML functionality that automatically runs through all the algorithms and their hyperparameters to produce a leaderboard of the best models.
H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time limit.
H2O has made it easy for non-experts to experiment with machine learning through AutoML, but a fair bit of data science knowledge is still required to produce high-performing models. AutoML is also useful for advanced users: it frees up their time to focus on other parts of the data science pipeline, such as data preprocessing, feature engineering and model deployment.

RMS Titanic

"RMS Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean on 15 April 1912, after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking at the time one of the deadliest of a single ship and the deadliest peacetime sinking of a superliner or cruise ship to date." - Wikipedia

Prerequisites

We are going to tackle this challenge using H2O in Python. The prerequisites are as follows:
  1. Operating Systems

    • Windows 7 or later
    • OS X 10.9 or later
    • Ubuntu 12.04 or later
    • RHEL/CentOS 6 or later
  2. Language

    • Python 2.7.x, 3.5.x, 3.6.x
You need Java installed and the JAVA_HOME environment variable set on your system to run H2O. Download and install Java (Amazon Corretto) from here; the installer will set the JAVA_HOME environment variable for you automatically. Amazon Corretto is a no-cost, multiplatform, production-ready distribution of the Open Java Development Kit (OpenJDK). Corretto comes with long-term support that includes performance enhancements and security fixes. You can download Python 3.6 from here: Windows (exe) / Linux (source)
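Assuming Python 3.5 or later, you can quickly verify from Python that Java is reachable and JAVA_HOME is set before starting H2O. This is just a convenience check, not part of the H2O API:

import os
import subprocess

# Check that JAVA_HOME is set and the java binary is on the PATH
print("JAVA_HOME =", os.environ.get("JAVA_HOME"))
subprocess.run(["java", "-version"])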

Installing H2O

First you need to install the dependencies (prepending with `sudo` if needed):

#!/bin/sh
pip install requests
pip install tabulate
pip install "colorama>=0.3.8"
pip install future

The following command removes any existing installation of the H2O module from Python.

#!/bin/sh
pip uninstall h2o

Next, use pip to install this version of the H2O Python module.

#!/bin/sh
pip install http://h2o-release.s3.amazonaws.com/h2o/rel-zipf/1/Python/h2o-3.32.1.1-py2.py3-none-any.whl

Alternatively, to install the package with conda, run:

#!/bin/sh
conda install -c h2oai h2o

Start H2O

Import the H2O Python module and `H2OAutoML` class and initialize a local H2O cluster.

import h2o
from h2o.automl import H2OAutoML
h2o.init()
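h2o.init() starts a local cluster with default resources (or connects to one that is already running). If you want explicit control over memory and threads, it also accepts resource arguments; the values below are just illustrative:

# Optional: pin the cluster's resources explicitly
h2o.init(nthreads = -1, max_mem_size = "4G")  # -1 = use all available cores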

Load Data

Download the dataset and copy it to a subfolder named "data" in the same directory as your Python script/Jupyter notebook. The dataset contains two files: train.csv and test.csv.
We will use the training set(train.csv) to build our machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Our model will be based on “features” like passengers’ gender and class.
The test set is to see how well our model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is our job to predict these outcomes. For each passenger in the test set, we will use the model that we trained to predict whether or not they survived the sinking of the Titanic.

Data Dictionary

| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

Source: Kaggle


Now we will load the data into H2O.

# Use local data file
data_path = "./data/train.csv"

# Load data into H2O
df = h2o.import_file(data_path)
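Let's inspect the parsed frame. describe() prints the column types, summary statistics and, importantly for the next step, the number of missing values per column:

df.describe()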

Handling Missing Data and Adding New Features

Next, we have to deal with missing values. If you look at the "missing" row in the df.describe() output above, there are 687 missing values in the Cabin column, 177 in Age, and 2 in Embarked.
A cabin number looks like ‘A123’ and the letter refers to the deck. Therefore we’re going to extract these and create a new feature called Deck.

df["Deck"] = df["Cabin"].strsplit("([0-9]*)")[0]
df.describe()
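To spot-check the extraction, we can compare a few Cabin values with the derived Deck letter:

# First few rows of Cabin alongside the extracted Deck letter
df[["Cabin", "Deck"]].head(rows=10)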

Next, we will fill in the missing ages. I am going to replace the missing values with the mean age, grouped by the Pclass column.

df.impute("Age", method="mean", by=["Pclass"])
df.describe()

Since the Embarked column has only 2 missing values, we will just replace them with the most common value (the mode).

df.impute("Embarked", method="mode")
df.describe()
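To confirm the imputations took effect, you can check the per-column missing counts; nacnt() returns one NA count per column:

# Remaining missing values per column (Age and Embarked should now be 0)
dict(zip(df.columns, df.nacnt()))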

Next, I am going to add some more new features by combining values from the existing features.

AgePclass

df["AgePclass"] = df["Age"]*df["Pclass"]

NoOfFamilyMembers

df["NoOfFamilyMembers"] = (df["Parch"]*df["SibSp"])

IsAlone

df["IsAlone"] = df["NoOfFamilyMembers"]/df["NoOfFamilyMembers"]
df[df["IsAlone"].isna(), "IsAlone"] = 0

FarePerPerson

df["FarePerPerson"] = df["Fare"]/(df["NoOfFamilyMembers"]+1)

Now I am going to fill the missing values of the Deck feature with a new value. First, let's see which deck letters currently appear in this field.

df["Deck"].table()

We'll update empty Deck values with "Z".

df["Deck"] = df["Deck"].ascharacter()
df[df["Deck"].isna(), "Deck"] = "Z"

For classification, the Survived column should be encoded as categorical (aka. "factor" or "enum"). Let's take a look.

df.describe()

By default, Survived was parsed as an integer column. Pclass should also be an enum, since it represents the socio-economic status of passengers rather than a numeric quantity, and Deck and IsAlone are categorical by nature. We can convert these by executing the following lines:

df["Survived"]= df["Survived"].asfactor()
df["Pclass"]= df["Pclass"].asfactor()
df["IsAlone"]= df["IsAlone"].asfactor()
df["Deck"] = df["Deck"].asfactor()

Now let's see what happened.

df.describe()

As you can see, the data types of the Survived, Pclass, IsAlone and Deck columns are now enum. We'll let H2O handle the remaining categorical features automatically.

Generate x and y Values

Now, let's identify the response & predictor columns by saving them as x and y. The PassengerId, Name and Ticket columns are unique identifiers so we'll remove those from the set of our predictors along with the Cabin feature.

y = "Survived"
x = df.columns
x.remove(y)
x.remove("PassengerId")
x.remove("Name")
x.remove("Ticket")
x.remove("Cabin")

Run AutoML

Run AutoML, stopping after 15 models. The max_models argument specifies the number of individual (or "base") models and does not include the two ensemble models that are trained at the end. We set a seed for reproducibility, and since H2O Deep Learning models are not reproducible by default for performance reasons, we add "DeepLearning" to the exclude_algos parameter. We also raise the number of cross-validation folds to 15 via nfolds. The exploitation_ratio parameter specifies the budget ratio (between 0 and 1) dedicated to the exploitation (vs. exploration) phase. By default, the exploitation phase is disabled (exploitation_ratio=0) as this is still experimental; to activate it, a ratio around 0.1 is recommended. Learn more about exploitation and exploration from here.

aml = H2OAutoML(max_models = 15, seed = 1, nfolds=15, exclude_algos = ["DeepLearning"], exploitation_ratio = 0.15)
aml.train(x = x, y = y, training_frame = df)
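If you would rather cap the run by wall-clock time than by model count, H2OAutoML also accepts a max_runtime_secs argument; a minimal sketch (the 300-second budget is just an example):

# Alternative: limit the AutoML run to 5 minutes of total training time
aml_timed = H2OAutoML(max_runtime_secs = 300, seed = 1, exclude_algos = ["DeepLearning"])
aml_timed.train(x = x, y = y, training_frame = df)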

Leaderboard

Next, we will view the AutoML Leaderboard. Since we did not specify a leaderboard_frame in the H2OAutoML.train() method for scoring and ranking the models, the AutoML leaderboard uses cross-validation metrics to rank the models.
A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally, and the leaderboard is sorted by that metric. For binary classification, the default ranking metric is Area Under the ROC Curve (AUC). You can choose a different ranking metric via the sort_metric parameter of H2OAutoML.
The leader model is stored at aml.leader and the leaderboard is stored at aml.leaderboard.

lb = aml.leaderboard

Now we will view a snapshot of the top models. Here we should see the two Stacked Ensembles at or near the top of the leaderboard. Stacked Ensembles can almost always outperform a single model.

lb.head()

To view the entire leaderboard, specify the `rows` argument of the `head()` method as the total number of rows:

lb.head(rows=lb.nrows)
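Depending on your H2O version (3.28 and later), you can also pull an extended leaderboard with extra per-model columns such as training time:

from h2o.automl import get_leaderboard

# Extended leaderboard with additional columns (e.g. training_time_ms)
lb_ext = get_leaderboard(aml, extra_columns="ALL")
lb_ext.head(rows=lb_ext.nrows)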

As we can see, the leader in this example is a Stacked Ensemble with an AUC of 0.88102.
AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0. Learn more about AUC from here.
You can also look up the other leaderboard metrics and how they relate to model quality.
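If you want the leader's cross-validated AUC directly, the model object exposes metric accessors:

# Cross-validated AUC of the leading model
print(aml.leader.auc(xval=True))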

Save Leader Model

There are two ways to save the leader model -- binary format and MOJO format. If you're taking your leader model to production, then I'd suggest the MOJO format since it's optimized for production use.

# Save the leader model in binary format; save_model returns the full path
bin_path = h2o.save_model(aml.leader, path = "./out/titanic_survivability_model_bin")

# Save the leader model in MOJO format; download_mojo writes a zip file
# named after the model id into the given folder and returns its path
import os
mojo_out_folder = "./out/titanic_survivability_model_mojo/"
os.makedirs(mojo_out_folder, exist_ok=True)
mojo_path = aml.leader.download_mojo(path = mojo_out_folder)
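In a later session, you can reload the binary model with h2o.load_model, using the path returned by save_model above:

# Reload the saved binary model (bin_path was returned by h2o.save_model)
loaded_model = h2o.load_model(bin_path)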

Prediction

In this step, we will load the test.csv file into H2O and predict the survivability of each passenger.

# Use local data file
test_data_path = "./data/test.csv"

# Load data into H2O
test_df = h2o.import_file(test_data_path)

test_df.describe()

We'll add the newly created features and apply the same categorical conversions as we did for the training dataset. You can also see that one row is missing its Fare value; I am going to fill it with the mean fare, grouped by Pclass.

test_df["Deck"] = test_df["Cabin"].strsplit("([0-9]*)")[0]
test_df["AgePclass"] = test_df["Age"]*test_df["Pclass"]
test_df["NoOfFamilyMembers"] = (test_df["Parch"]*test_df["SibSp"])
test_df["IsAlone"] = test_df["NoOfFamilyMembers"]/test_df["NoOfFamilyMembers"]
test_df[test_df["IsAlone"].isna(), "IsAlone"] = 0
test_df["FarePerPerson"] = test_df["Fare"]/(test_df["NoOfFamilyMembers"]+1)

test_df["Pclass"]= test_df["Pclass"].asfactor()
test_df.impute("Fare", method="mean", by=["Pclass"]) 
test_df.describe()

We will now remove the unique identifier columns from the dataset.

test_df = test_df[:, ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Deck", "AgePclass", "NoOfFamilyMembers", "IsAlone", "FarePerPerson"]]
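As a sanity check, the columns of the trimmed test frame should now match exactly the predictor list we trained on:

# The test frame must expose exactly the columns in x
assert sorted(test_df.columns) == sorted(x)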

Using the predict() function with AutoML generates predictions on the leader model from the run. The order of the rows in the results is the same as the order in which the data was loaded, even if some rows fail (for example, due to missing values or unseen factor levels).

# To generate predictions on a test set, you can call predict()
# directly on the "H2OAutoML" object or on the leader model object
preds = aml.predict(test_df)

# or equivalently:
# preds = aml.leader.predict(test_df)
preds

p0 is the probability (between 0 and 1) that class 0 is chosen, and p1 is the probability that class 1 is chosen.


You can use this frame for further analysis.
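For instance, to build a Kaggle-style submission file, you can pull the predicted labels into pandas and pair them with the PassengerId column from the raw test file. This sketch assumes pandas is installed; the output path ./out/submission.csv is just an example:

import pandas as pd

# Recover PassengerId from the raw file (it was dropped from test_df above)
submission = pd.read_csv("./data/test.csv")[["PassengerId"]]
submission["Survived"] = preds["predict"].as_data_frame()["predict"]
submission.to_csv("./out/submission.csv", index=False)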

Conclusion

Hooray! We have covered almost everything you need to get started with AutoML. AutoML automates the time-consuming, iterative tasks of machine learning model development. It allows data scientists, analysts, and developers of any experience level to build ML models at scale and with high efficiency and productivity, all while sustaining model quality.

We started with the installation of H2O, loaded the data into H2O, checked for missing data, imputed those missing values and created new features. Then we trained 15 different models with H2O AutoML and found that a Stacked Ensemble topped the leaderboard for this dataset. We also learned how to save the model and use it for making predictions later.

There are many things you can still do to improve this piece of code: more feature engineering, standardizing the numerical values, examining the correlation between the dependent and independent variables, and so on.

All the code of this article is available over on GitHub. This is a Python project, so it should be easy to import and run as it is.
