Data science taken to the extreme

Jonathan Alexander, Head of AI @ BuiltOn.dev

Extreme data science is all about saving time and being able to present complete solutions quickly. In other words: how can you extract value from data fast?

Some of the significant challenges with conventional data science are the time-consuming exploration, cleaning and feature engineering that are specific to each dataset. Then there are almost always hours of manually setting values and of human decision making. On top of that, every manipulation of the dataset needs to be logged and redone in the inference step, in a couple of different ways depending on each transformation. The inference step also has the same standard challenges as any other server application: maintenance, scaling, fault tolerance, availability and so on.

At BuiltOn, we deal with many different datasets and use cases, all of which end up with actionable APIs. We encounter all of the hurdles above, along with many others, on a regular basis.

To overcome these issues, we have developed our own extreme data science library which empowers our data scientists to solve the whole problem from exploration to production with minimal code and time. The four pillars of our technology are:

  • One-liners - using our BuiltonAI library.
  • Auto pipeline for inference.
  • Serverless deployment.
  • Easy scaling for big data.

The Builton AI machine learning library

  • We developed our own “XFrame” (the X is for eXtreme data science), which is very similar to the pandas and sklearn APIs, for easy on-boarding of the technology.
  • Out-of-core, lazy evaluation.
  • Every manipulation is a transformer, which is saved in the background — for auto-pipeline generation.
  • Every algorithm is wrapped to handle the most common issues and use cases.
  • When the exploration is done, the pipeline is fully ready for production, either for inference or retraining with more data.

Let’s start with a simple example by using the Titanic dataset, where we can try to predict who survived the Titanic. The data is pretty dirty, with missing and heterogeneous values, and has a variety of input types. We will see how to deal with these issues later.

In a single line, we can explore the dataset, regardless of size, since we use out-of-core technology.

from builtonai import load_xframe
titanic = load_xframe('datasets/titanic')
titanic 

Columns:
	PassengerId	int
	Survived	int
	Pclass	int
	Name	str
	Sex	str
	Age	float
	SibSp	int
	Parch	int
	Ticket	str
	Fare	float
	Cabin	str
	Embarked	str

Rows: 891

Data:
+-------------+----------+--------+--------------------------------+--------+
| PassengerId | Survived | Pclass |              Name              |  Sex   |
+-------------+----------+--------+--------------------------------+--------+
|      1      |    0     |   3    |    Braund, Mr. Owen Harris     |  male  |
|      2      |    1     |   1    | Cumings, Mrs. John Bradley...  | female |
|      3      |    1     |   3    |     Heikkinen, Miss. Laina     | female |
|      4      |    1     |   1    | Futrelle, Mrs. Jacques Hea...  | female |
|      5      |    0     |   3    |    Allen, Mr. William Henry    |  male  |
|      6      |    0     |   3    |        Moran, Mr. James        |  male  |
|      7      |    0     |   1    |    McCarthy, Mr. Timothy J     |  male  |
|      8      |    0     |   3    | Palsson, Master. Gosta Leonard |  male  |
|      9      |    1     |   3    | Johnson, Mrs. Oscar W (Eli...  | female |
|      10     |    1     |   2    | Nasser, Mrs. Nicholas (Ade...  | female |
+-------------+----------+--------+--------------------------------+--------+
+------+-------+-------+------------------+---------+-------+----------+
| Age  | SibSp | Parch |      Ticket      |   Fare  | Cabin | Embarked |
+------+-------+-------+------------------+---------+-------+----------+
| 22.0 |   1   |   0   |    A/5 21171     |   7.25  |       |    S     |
| 38.0 |   1   |   0   |     PC 17599     | 71.2833 |  C85  |    C     |
| 26.0 |   0   |   0   | STON/O2. 3101282 |  7.925  |       |    S     |
| 35.0 |   1   |   0   |      113803      |   53.1  |  C123 |    S     |
| 35.0 |   0   |   0   |      373450      |   8.05  |       |    S     |
| None |   0   |   0   |      330877      |  8.4583 |       |    Q     |
| 54.0 |   0   |   0   |      17463       | 51.8625 |  E46  |    S     |
| 2.0  |   3   |   1   |      349909      |  21.075 |       |    S     |
| 27.0 |   0   |   2   |      347742      | 11.1333 |       |    S     |
| 14.0 |   1   |   0   |      237736      | 30.0708 |       |    C     |
+------+-------+-------+------------------+---------+-------+----------+
[891 rows x 12 columns]

Note that the Titanic dataset is small, but our library would run the exact same way on much larger datasets.

In the code above, we import the dataset and print it to the screen. There we can see the columns, their type, and a sample of the values. It is clear that we have missing values in Age, together with heterogeneous values in Ticket and in the structure of Name, along with other issues.

In just one line, we can use our variation of XGBoost to add a column of predictions.

from builtonai import XGBoostTransformer
train, test = titanic.random_split(0.8)
model = XGBoostTransformer(target='Survived',output_column='predictions').fit(train)
results = model.predict(test)
results

Columns:
	PassengerId	int
	Survived	int
	Pclass	int
	Name	str
	Sex	str
	Age	float
	SibSp	int
	Parch	int
	Ticket	str
	Fare	float
	Cabin	str
	Embarked	str
	predictions	int

Rows: 174

Data:
+-------------+----------+--------+--------------------------------+--------+
| PassengerId | Survived | Pclass |              Name              |  Sex   |
+-------------+----------+--------+--------------------------------+--------+
|      3      |    1     |   3    |     Heikkinen, Miss. Laina     | female |
|      6      |    0     |   3    |        Moran, Mr. James        |  male  |
|      11     |    1     |   3    | Sandstrom, Miss. Marguerit...  | female |
|      14     |    0     |   3    |  Andersson, Mr. Anders Johan   |  male  |
|      16     |    1     |   2    | Hewlett, Mrs. (Mary D King...  | female |
|      18     |    1     |   2    |  Williams, Mr. Charles Eugene  |  male  |
|      28     |    0     |   1    | Fortune, Mr. Charles Alexander |  male  |
|      29     |    1     |   3    | O'Dwyer, Miss. Ellen "Nellie"  | female |
|      31     |    0     |   1    |    Uruchurtu, Don. Manuel E    |  male  |
|      43     |    0     |   3    |      Kraeff, Mr. Theodor       |  male  |
+-------------+----------+--------+--------------------------------+--------+
+------+-------+-------+------------------+---------+-------------+----------+-------------+
| Age  | SibSp | Parch |      Ticket      |   Fare  |    Cabin    | Embarked | predictions |
+------+-------+-------+------------------+---------+-------------+----------+-------------+
| 26.0 |   0   |   0   | STON/O2. 3101282 |  7.925  |             |    S     |      1      |
| None |   0   |   0   |      330877      |  8.4583 |             |    Q     |      0      |
| 4.0  |   1   |   1   |     PP 9549      |   16.7  |      G6     |    S     |      0      |
| 39.0 |   1   |   5   |      347082      |  31.275 |             |    S     |      0      |
| 55.0 |   0   |   0   |      248706      |   16.0  |             |    S     |      1      |
| None |   0   |   0   |      244373      |   13.0  |             |    S     |      0      |
| 19.0 |   3   |   2   |      19950       |  263.0  | C23 C25 C27 |    S     |      0      |
| None |   0   |   0   |      330959      |  7.8792 |             |    Q     |      1      |
| 40.0 |   0   |   0   |     PC 17601     | 27.7208 |             |    C     |      0      |
| None |   0   |   0   |      349253      |  7.8958 |             |    C     |      0      |
+------+-------+-------+------------------+---------+-------------+----------+-------------+
[174 rows x 13 columns]

In general, we handle each manipulation by adding a column, which makes simple feature engineering and model stacking straightforward.

Evaluation is also a one-liner, as you would expect:

evaluation = model.evaluate(test)
print(f"Accuracy {evaluation['accuracy']} AUC: {evaluation['auc']}")
      
Accuracy 0.7627118644067796 AUC: 0.8281746031746037

Now let’s take it to the extreme

The first challenge we set ourselves is to add a column of images to the tabular data. By doing so, we will demonstrate how easy it is to do substantial data wrangling and modelling, and eventually to produce an easy-to-consume output with the final results, all with minimal code.

We add the new column of images to the Titanic dataset by searching for random pictures of men and women and matching them to the gender in the dataset. This is obviously nonsense that won’t improve the results, but we added this step to show how easily we can handle more complex cases. We can now model a dataset with tabular and image data combined!

from builtonai import load_xframe
train, test = load_xframe('datasets/titanic_images').random_split(0.8, seed=0)
print(train[0])
train[0]['image'].show()

{'PassengerId': 1, 'Survived': 0, 'Pclass': 3, 'Name': 'Braund, Mr. Owen Harris', 'Sex': 'male', 'Age': 22.0, 'SibSp': 1, 'Parch': 0, 'Ticket': 'A/5 21171', 'Fare': 7.25, 'Cabin': '', 'Embarked': 'S', 'path': './downloads/man/1. 40-something-man-2-1.jpg', 'image': Height: 375px Width: 667px Channels: 3 }

[Image: the photo attached to the first row of the augmented Titanic dataset]

Next, we are going to do some data wrangling to demonstrate how easily it can be done. Then we are going to pack a few columns for inference, and finally, we will prep it for API consumption.

Let’s look at some basic data wrangling.

# Text manipulation
train['surname'] = train['Name'].apply(lambda x: x.split(',')[0]) # calculate on the fly
train['forename'] = train['Name'].apply(lambda x: x.split('.')[-1].strip()) # calculate on the fly
train['m'] = train['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip()) # calculate on the fly
# numeric manipulation 
train['normalize_fare'] = train['Fare'].normalize() # create a Normalize Transformer
# mean() and std() are calculated here, and the values are also saved for inference
train['standardize_age'] = (train['Age'] - train['Age'].mean())/train['Age'].std()
# logical
train['logical'] = train['SibSp'] > train['Pclass'] # calculate on the fly
# filter for cleaning
# Filters are saved in the pipeline for retraining, but are not applied at inference
train = train[train['SibSp'] != 5] # re-filtered on fit(), but not on transform()

Every line is another column, which makes it easy to debug and verify that the transformations make sense, unlike, for example, writing functions and generators to run over files. This is very similar to how you would do it with Pandas.

Now for some more advanced data wrangling:

from builtonai.ml.supervised import XGBoostTransformer
from builtonai.dimension_reduction import RandomProjection, FeatureHasher
from builtonai.feature_engineering import TFIDF, Imputer, UnpackTransformer, \
 ImageToFeaturesTransformer, TransformerChain, FeatureBinner, CountTransformer, LabelEncoder, \
 QuadraticFeatures, RenameTransformer, PackTransformer
 
# Impute missing values; numeric and categorical columns are handled differently by default, but you can specify how to handle each column or column type.
train = Imputer().fit_transform(train) 

# Text mining method usually used for text analysis and modeling 
train = TFIDF(features=['Name'], output_column_prefix='tfidf').fit_transform(train)

# Encode categorical as integers
train = LabelEncoder('Cabin').fit_transform(train)

# Binning
train = FeatureBinner(['standardize_age'], output_column_prefix='bin').fit_transform(train)

# Create a combination of a number of features
train = QuadraticFeatures(['SibSp','Parch']).fit_transform(train)

# Groupby-count-join
train = CountTransformer(by=['Survived'],count_columns=['PassengerId']).fit_transform(train) 

# Transfer learning from resnet50 turning each image to vector 
train = ImageToFeaturesTransformer(image_column='image', 
 output_column='image_features').transform(train)

# A dimensionality reduction method which scales nicely which we use on the image vectors
train = RandomProjection(features=['image_features'], 
 output_column='projection').fit_transform(train)

# Hashes an input feature space to an n-bit feature space
train = FeatureHasher(features=['path']).fit_transform(train)

# Using XGBoost leaves as features for model stacking
train = XGBoostTransformer(target='Survived',
 output_column='xgb_features',
 output_type='features').fit_transform(train)

# Build a classification model with topk predictions
train = XGBoostTransformer(target='Survived', 
 output_column='topk',
 output_type='topk', k=2).fit_transform(train)

As an example of more advanced data wrangling, we use TF-IDF, a common text-mining technique that gives every word a weight based on how informative it is in a document relative to the entire corpus. We apply it to the “Name” column, which, again, is not very informative, just to showcase how easy it is to use.
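
For readers who have not used TF-IDF before, here is a rough standalone sketch with scikit-learn (not our library); the names are just a few values from the dataset and the exact weights will differ from ours.

from sklearn.feature_extraction.text import TfidfVectorizer

names = ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley', 'Heikkinen, Miss. Laina']
vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(names)  # sparse matrix of per-word TF-IDF weights
print(dict(zip(vectorizer.get_feature_names_out(), weights.toarray()[0])))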

LabelEncoder and one-hot encoding are classic methods of turning each string into an int, or a list of ints, so it can be consumed by machine learning algorithms that cannot handle string inputs, among other use cases. We demonstrate the LabelEncoder here, but other options are available.
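
As an illustration of the idea only, a plain scikit-learn LabelEncoder (rather than our transformer) maps each distinct string to an integer:

from sklearn.preprocessing import LabelEncoder

cabins = ['C85', '', 'C123', '', 'E46']
encoded = LabelEncoder().fit_transform(cabins)  # e.g. array([2, 0, 1, 0, 3]): one integer per distinct value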

FeatureBinner puts numeric values into bins. For example, it can be used to categorise 0–18 year olds as children, 19–50 as adults, and 50+ as old (I know, 50 is not old…).
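
The same idea in plain pandas, as a hypothetical illustration of what a binner does under the hood:

import pandas as pd

ages = pd.Series([4.0, 22.0, 35.0, 54.0, 71.0])
# bucket the ages into three named bins
age_bins = pd.cut(ages, bins=[0, 18, 50, 120], labels=['child', 'adult', 'old'])
print(age_bins.tolist())  # ['child', 'adult', 'adult', 'old', 'old']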

QuadraticFeatures takes pairs of features and creates a new feature from their combinations. If you have a weather column and a day-of-the-week column, it will create a column for the combination of both, e.g. Sunny+Monday.
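
A minimal pandas sketch of the weather-and-day example (illustrative only, not our transformer):

import pandas as pd

df = pd.DataFrame({'weather': ['Sunny', 'Rainy', 'Sunny'],
                   'day': ['Monday', 'Monday', 'Friday']})
# a new feature for each combination of the two columns, e.g. "Sunny+Monday"
df['weather+day'] = df['weather'] + '+' + df['day']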

CountTransformer runs a group-by, count and join, which is very helpful for event data.
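
In plain pandas, the equivalent group-by-count-join looks roughly like this (made-up event data, not the Titanic columns used above):

import pandas as pd

events = pd.DataFrame({'user_id': [1, 1, 2, 2, 2],
                       'event': ['click', 'buy', 'click', 'click', 'buy']})
# count the events per user and join that count back onto every row
events['events_per_user'] = events.groupby('user_id')['event'].transform('count')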

Our ImageToFeaturesTransformer uses transfer learning, based on the ResNet50 deep learning model, to turn each image into a vector of features that we can consume like any other column. This is very helpful for standard out-of-the-box image learning.
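
A rough sketch of that idea with torchvision (assuming a recent torchvision and a hypothetical image path; this is not the BuiltonAI implementation):

import torch
from torchvision import models, transforms
from PIL import Image

backbone = models.resnet50(weights='IMAGENET1K_V1')
backbone.fc = torch.nn.Identity()  # drop the classification head, keep the 2048-dim features
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open('downloads/man/example.jpg').convert('RGB')  # hypothetical path
with torch.no_grad():
    features = backbone(preprocess(img).unsqueeze(0)).squeeze(0)  # vector of 2048 floats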

RandomProjection is a dimensionality reduction method, and one way to combat the curse of dimensionality.
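
For example, with scikit-learn on stand-in data (not our wrapper), projecting 2048-dimensional image vectors down to 64 dimensions:

import numpy as np
from sklearn.random_projection import GaussianRandomProjection

image_features = np.random.rand(100, 2048)  # stand-in for the ResNet50 feature vectors
projection = GaussianRandomProjection(n_components=64).fit_transform(image_features)
print(projection.shape)  # (100, 64)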

The FeatureHasher hashes every value into a fixed-size feature space, which helps with very high-cardinality categorical features, a.k.a. the hashing trick.
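
The scikit-learn FeatureHasher shows the same trick on a couple of made-up path values (illustrative only):

from sklearn.feature_extraction import FeatureHasher

rows = [{'path': './downloads/man/1.jpg'}, {'path': './downloads/woman/2.jpg'}]
hashed = FeatureHasher(n_features=16, input_type='dict').fit_transform(rows)
print(hashed.shape)  # (2, 16): every path hashed into a fixed 16-dimensional space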

Finally, we use XGBoost twice, first to generate features by taking the leaves of the deepest level, and second for predictions, using all the features we created.
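
With the plain xgboost scikit-learn API on random stand-in data (not our wrapper), the leaf-feature trick and the probability output look roughly like this:

import numpy as np
from xgboost import XGBClassifier

X, y = np.random.rand(200, 5), np.random.randint(0, 2, 200)
model = XGBClassifier(n_estimators=50).fit(X, y)
leaf_features = model.apply(X)   # shape (200, 50): the leaf index each row falls into, per tree
topk = model.predict_proba(X)    # class probabilities, similar to the 'topk' output above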

We can continue processing the dataset after modelling and prep it for API consumption, work a backend would normally do; but since the data scientist knows the data best, it is easier to do it here. In SageMaker, by contrast, you would need another Lambda or backend service to adapt each algorithm’s output to the way you want to consume it.

train['result'] = train['topk'].apply(lambda x: 'Survived' if x[1]>x[0] else "Died")
train = RenameTransformer({'result': 'class'}).fit_transform(train)

Let’s have a look at some of the manipulation and modelling:

results_columns = ['topk', 'class', 'Cabin', 'quadratic_features', 'tfidf.Name', 'standardize_age',
                    'image_features', 'projection', 'xgb_features']
train[results_columns]


Columns:
	topk	dict
	class	str
	Cabin	int
	quadratic_features	dict
	tfidf.Name	dict
	standardize_age	float
	image_features	array
	projection	array
	xgb_features	array

Rows: 700

Data:
+-------------------------------+----------+-------+
|              topk             |  class   | Cabin |
+-------------------------------+----------+-------+
| {0: 0.9755354523658752, 1:... |   Died   |   1   |
| {0: 0.9755354523658752, 1:... |   Died   |   1   |
| {0: 0.9755354523658752, 1:... |   Died   |   1   |
| {0: 0.9755354523658752, 1:... |   Died   |   96  |
| {0: 0.9755354523658752, 1:... |   Died   |   1   |
| {0: 0.9755354523658752, 1:... |   Died   |   1   |
| {0: 0.9755354523658752, 1:... |   Died   |   1   |
| {0: 0.9755354523658752, 1:... |   Died   |   1   |
| {0: 0.02549266815185547, 1... | Survived |   1   |
| {0: 0.9755354523658752, 1:... |   Died   |   1   |
+-------------------------------+----------+-------+
+-------------------------------+-------------------------------+
|       quadratic_features      |           tfidf.Name          |
+-------------------------------+-------------------------------+
| {'Parch, Parch': 0, 'Parch... | {'harris': 6.5510803350434... |
| {'Parch, Parch': 0, 'Parch... | {'william': 2.679879324135... |
| {'Parch, Parch': 0, 'Parch... | {'james': 3.60664135587696... |
| {'Parch, Parch': 0, 'Parch... | {'timothy': 5.857933154483... |
| {'Parch, Parch': 1, 'Parch... | {'leonard': 4.759320865815... |
| {'Parch, Parch': 0, 'Parch... | {'william': 2.679879324135... |
| {'Parch, Parch': 25, 'Parc... | {'johan': 4.35385575770718... |
| {'Parch, Parch': 1, 'Parch... | {'eugene': 5.4524680463752... |
| {'Parch, Parch': 0, 'Parch... | {'charles': 3.606641355876... |
| {'Parch, Parch': 0, 'Parch... | {'mr.': 0.5447271754416719... |
+-------------------------------+-------------------------------+
+---------------------+-------------------------------+
|   standardize_age   |         image_features        |
+---------------------+-------------------------------+
| -0.5184364485131501 | [0.03379005938768387, 0.0,... |
|  0.3639773854672754 | [0.25268638134002686, 0.0,... |
|         None        | [0.19756250083446503, 0.0,... |
|  1.6536591428232819 | [0.27158886194229126, 0.0,... |
| -1.8759961930984201 | [0.057356078177690506, 0.0... |
| -0.6541924229716771 | [0.011503922753036022, 0.0... |
|  0.6354893343843294 | [0.13228459656238556, 0.0,... |
| -1.8759961930984201 | [0.008415664546191692, 0.0... |
|         None        | [0.11948581039905548, 0.0,... |
|  0.3639773854672754 | [0.10456686466932297, 0.0,... |
+---------------------+-------------------------------+
+-------------------------------+-------------------------------+
|           projection          |          xgb_features         |
+-------------------------------+-------------------------------+
| [15.033728846016446, 3.567... | [2.0, 2.0, 2.0, 2.0, 2.0, ... |
| [17.175222251997713, 5.293... | [2.0, 2.0, 2.0, 2.0, 2.0, ... |
| [-0.17154641212455646, 16.... | [2.0, 2.0, 2.0, 2.0, 2.0, ... |
| [14.854158839305425, 22.79... | [2.0, 2.0, 2.0, 2.0, 2.0, ... |
| [5.9382615960285845, 12.68... | [2.0, 2.0, 2.0, 2.0, 2.0, ... |
| [-5.3924663425478485, 3.41... | [2.0, 2.0, 2.0, 2.0, 2.0, ... |
| [1.9954488560044368, 18.75... | [2.0, 2.0, 2.0, 2.0, 2.0, ... |
| [2.94227206642811, 0.70623... | [2.0, 2.0, 2.0, 2.0, 2.0, ... |
| [3.934261165505184, 7.8759... | [1.0, 1.0, 1.0, 1.0, 1.0, ... |
| [7.149745773427712, 4.2162... | [2.0, 2.0, 2.0, 2.0, 2.0, ... |
+-------------------------------+-------------------------------+
[700 rows x 9 columns]

We can pack a few columns into a single response for the inference phase to make it easier to consume on the client side.

train = PackTransformer(['topk', 'class'], output_column='response', column_type=dict).fit_transform(train)
train['response'][0]

{'topk': {0: 0.975540816783905, 1: 0.024459194391965866}, 'class': 'Died'}

For most machine learning projects, cleaning and feature engineering are where most of the time is spent. This is why having something like our Imputer that can figure out missing values, and transformers which can be run dynamically for testing and debugging, is pure magic.

When serving machine learning through APIs, on the other hand, most of the time is spent building pipelines: many of the transformations need to be redone correctly before inference, and training on new data means rerunning the cleaning procedures.

Auto-pipelines

How long would it take to create a pipeline that can provide you with predictions for the Titanic case above? You guessed it: no time at all, as we only need one line of code!

pipeline = train.pipeline

# For retraining
pipeline.fit(new_data)

# For inference
pipeline.predict(test)

The pipeline, in turn, can be saved and loaded on a server to retrain with more data or to run inference on any reasonable input format, like JSON, NumPy, Pandas, and our own XFrame.

Note that the predict() function knows when to use values saved from training, as in the ‘standardize_age’ case, when to calculate on the fly, as in the ‘forename’ case, and how to skip filtering and cleaning at prediction time, as in the ‘SibSp’ case.
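
As a minimal sketch of that distinction (this is not the actual BuiltonAI internals), a stateful transformer stores its statistics at fit time and reuses them at prediction time:

class StandardizeColumn:
    def __init__(self, column):
        self.column = column
        self.mean = None
        self.std = None

    def fit(self, frame):
        # statistics are computed once, on the training data, and stored
        self.mean = frame[self.column].mean()
        self.std = frame[self.column].std()
        return self

    def transform(self, frame):
        # at prediction time the stored training statistics are reused
        frame['standardize_' + self.column.lower()] = (frame[self.column] - self.mean) / self.std
        return frame

A filter step, by contrast, would drop rows inside fit() and leave transform() as a no-op, which is why the ‘SibSp’ filter never touches incoming prediction requests.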

This allows us to quickly train and deploy all kinds of pipelines for classification, regression, clustering and ranking with just a few lines, and we can deploy the pipelines on the same infrastructure. Smart, right?

Deployment:

AWSHandler(<aws-params>).deploy(model_id, pipeline)

Serverless

As a start-up that focuses on bootstrapping fellow startups, we don’t want to maintain an auto-scaling cluster for each company before they have a significant amount of data and volume. In addition, we can’t afford to provide servers for every company that tests our system for free. Our solution? Serverless!

Serverless is an infrastructure model where the servers are maintained by a cloud provider like AWS, and the software engineer just needs to write a function, not a server. Most importantly, you pay for what you use, which is perfect for small-volume inference and practically free.

We set up our pipelines to be deployed automatically to AWS Lambda, saving us costs and time spent on operational management. It scales with no action on our side and gives us strong fault tolerance and availability.
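
For illustration only (this is not our deployment code, and load_pipeline is a hypothetical helper), a Lambda handler that serves a saved pipeline behind an API can be as small as this:

import json

pipeline = load_pipeline('/opt/model/pipeline')  # hypothetical loader, runs once per warm container

def lambda_handler(event, context):
    rows = json.loads(event['body'])             # e.g. a list of JSON records to score
    predictions = pipeline.predict(rows)
    return {'statusCode': 200, 'body': json.dumps({'predictions': list(predictions)})}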

Only when a client needs a very high volume of predictions do we move them to an auto-scaling server cluster using Docker, with no code changes between projects, datasets, business solutions and pipelines.

Scale

As already mentioned, the package is out-of-core, which means we can simply run on a bigger single instance without needing distributed computing (although we parallelize very well on that single instance). Because the pipeline is easy to save, load and retrain, AWS Batch fits perfectly for running our training. As a result, if you have more data, we just change the instance type, with no other changes needed.

With the high variety of instance options, we have yet to come across a challenging dataset size for our solution.

Summary

The challenges of extreme data science are similar to those of traditional data science. You always want to reduce the time and resources needed for exploring, cleaning and manipulating data, and for training and deploying models at scale. The solutions must be extremely efficient and robust.

We addressed these issues by creating a unique machine learning library which puts the data scientist in the driver’s seat: write less code, prevent common mistakes, apply best practices and build pipelines behind the scenes to deliver complete solutions quickly.

The limitations associated with the traditional approach to data science led us to build our own AI platform. It allows us and our users to train and deploy dozens of different pipelines used in our APIs. Our goal at BuiltOn is to democratize AI and level the playing field, giving developers easy access to ready-made building blocks and powerful AI-ready APIs for e-commerce.

If you want to learn more, follow us on Twitter, LinkedIn and GitHub, have a look at our website, eat healthy and exercise. Maybe pick up a new hobby; after all, you might save a few hours by going extreme.
