Getting Started

d3m-interface integrates D3M AutoML systems with Jupyter Notebooks. The notebooks provide an interactive computing environment in which you can generate models with the D3M AutoML systems and explore them with PipelineProfiler, an interactive visualization of end-to-end machine learning pipelines. d3m-interface has two main components: model generation and model exploration.

Model Generation

The model generation component provides methods to search, train, test, and score pipelines.

First, import the class AutoML from d3m-interface:

[1]:
from d3m_interface import AutoML

Then, search for pipelines using the imported class. AutoML receives the output path and the name of the AutoML engine to use. In this example, we use AlphaD3M, developed by NYU (the cell below passes only the output path, and the log shows that AlphaD3M is initialized). d3m-interface automatically downloads the Docker image for AlphaD3M and sets up the AutoML system.

To search for pipelines, call search_pipelines. It takes the following parameters: the training dataset; the maximum running time in minutes (time_bound); the variable to predict (target); the metric to use (metric); and the keywords that describe the task to be solved (task_keywords). The time_bound caps how long the search can run and therefore how much computation it uses. Note that longer running times may lead to more accurate solutions, since the system has more time to try and evaluate candidate solutions for the problem.

The 185_baseball_MIN_METADATA dataset in CSV format is used for this example. This dataset contains information about baseball players and play statistics, including Games_played, At_bats, Runs, Hits, Doubles, Triples, Home_runs, RBIs, Walks, Strikeouts, Batting_average, On_base_pct, Slugging_pct and Fielding_ave.

Note

If you installed AlphaD3M from PyPI, please use the 'pypi' parameter in the constructor, like this: automl = AutoML(output_path, 'AlphaD3M', 'pypi')

[2]:
output_path = '/Users/rlopez/D3M/examples/tmp/'
train_dataset = '/Users/rlopez/D3M/examples/datasets/185_baseball_MIN_METADATA/train_data.csv'
test_dataset = '/Users/rlopez/D3M/examples/datasets/185_baseball_MIN_METADATA/test_data.csv'

automl = AutoML(output_path)
automl.search_pipelines(train_dataset, time_bound=10, target='Hall_of_Fame', metric='f1Macro', task_keywords=['classification', 'multiClass', 'tabular'])
INFO: Initializing AlphaD3M AutoML...
INFO: AlphaD3M AutoML initialized!
INFO: Found pipeline id=e9043b5b-a27c-4a64-a154-8e061dfed680, time=0:00:19.895961, scoring...
INFO: Scored pipeline id=e9043b5b-a27c-4a64-a154-8e061dfed680, f1_macro=0.64214
INFO: Found pipeline id=69fb4020-c8d7-421a-a216-ab66e27508a7, time=0:00:38.073745, scoring...
INFO: Scored pipeline id=69fb4020-c8d7-421a-a216-ab66e27508a7, f1_macro=0.61677
INFO: Found pipeline id=5fdacf6b-7c4e-4a79-a8fb-3dbcb09841a3, time=0:00:56.313074, scoring...
INFO: Scored pipeline id=5fdacf6b-7c4e-4a79-a8fb-3dbcb09841a3, f1_macro=0.71535
INFO: Found pipeline id=c2ce4516-3aa4-489a-b860-15631568707d, time=0:01:11.570050, scoring...
INFO: Scored pipeline id=c2ce4516-3aa4-489a-b860-15631568707d, f1_macro=0.44115
INFO: Found pipeline id=e469f0a3-af16-4ab3-b826-09188a06c9a3, time=0:01:26.809299, scoring...
INFO: Scored pipeline id=e469f0a3-af16-4ab3-b826-09188a06c9a3, f1_macro=0.47316
INFO: Found pipeline id=80851e2a-ae8e-4dd7-8764-9beb025a3185, time=0:01:48.029030, scoring...
INFO: Scored pipeline id=80851e2a-ae8e-4dd7-8764-9beb025a3185, f1_macro=0.60765
INFO: Found pipeline id=498ea7df-ff69-482f-b1f3-e7ef8e06f261, time=0:02:03.248226, scoring...
INFO: Scored pipeline id=498ea7df-ff69-482f-b1f3-e7ef8e06f261, f1_macro=0.60765
INFO: Found pipeline id=9195f3a1-0433-4fb7-8c7e-db93fbacdda7, time=0:02:21.458944, scoring...
INFO: Scored pipeline id=9195f3a1-0433-4fb7-8c7e-db93fbacdda7, f1_macro=0.60765
INFO: Found pipeline id=7e1a0713-2785-442b-bf0a-c80495747680, time=0:02:36.621582, scoring...
INFO: Scored pipeline id=7e1a0713-2785-442b-bf0a-c80495747680, f1_macro=0.62492
INFO: Found pipeline id=bc7bcbb4-86e5-45d8-a914-b61b7dcdfaa6, time=0:02:54.767454, scoring...
INFO: Scored pipeline id=bc7bcbb4-86e5-45d8-a914-b61b7dcdfaa6, f1_macro=0.62492
INFO: Found pipeline id=9568f649-471e-42fd-9e52-1520b378b4df, time=0:03:18.970813, scoring...
INFO: Found pipeline id=61bad066-82d2-46a3-b30e-a650ec4abe00, time=0:03:34.186458, scoring...
INFO: Scored pipeline id=9568f649-471e-42fd-9e52-1520b378b4df, f1_macro=0.62492
INFO: Scored pipeline id=61bad066-82d2-46a3-b30e-a650ec4abe00, f1_macro=0.62492
INFO: Found pipeline id=4ad83c37-8a32-460f-beda-02d142d35459, time=0:04:01.392814, scoring...
INFO: Scored pipeline id=4ad83c37-8a32-460f-beda-02d142d35459, f1_macro=0.4929
INFO: Found pipeline id=f8386c24-bdb8-4ec3-b7df-a1095b61453b, time=0:04:19.554698, scoring...
INFO: Scored pipeline id=f8386c24-bdb8-4ec3-b7df-a1095b61453b, f1_macro=0.41461
INFO: Found pipeline id=6e0db6d0-d402-43ac-9b0b-bc15adaafa0f, time=0:05:25.937965, scoring...
INFO: Found pipeline id=ffed0620-68c8-4267-848a-5f351f1c6909, time=0:05:26.384093, scoring...
INFO: Found pipeline id=661a0416-b65f-4233-a5c9-af82616159d3, time=0:05:35.799391, scoring...
INFO: Scored pipeline id=ffed0620-68c8-4267-848a-5f351f1c6909, f1_macro=0.62492
INFO: Scored pipeline id=6e0db6d0-d402-43ac-9b0b-bc15adaafa0f, f1_macro=0.62492
INFO: Found pipeline id=c1337970-cd54-4387-ac69-e663bee43ebf, time=0:05:42.173183, scoring...
INFO: Search completed, still scoring some pending pipelines...
INFO: Scored pipeline id=661a0416-b65f-4233-a5c9-af82616159d3, f1_macro=0.62492
INFO: Scored pipeline id=c1337970-cd54-4387-ac69-e663bee43ebf, f1_macro=0.62492
INFO: Scoring completed for all pipelines!
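
The log above follows a regular pattern, so the scores can be recovered from a saved copy of the output. The following is a minimal sketch using only the standard library. The function name parse_scores and the regular expression are illustrative assumptions, not part of d3m-interface, and the log format is inferred from the lines shown above rather than from a documented specification.

```python
import re

# Pattern inferred from the "Scored pipeline" lines shown above
# (an assumption, not a documented log format).
SCORED = re.compile(
    r"INFO: Scored pipeline id=(?P<id>[0-9a-f-]+), (?P<metric>\w+)=(?P<score>[0-9.]+)"
)


def parse_scores(log_text):
    """Return (pipeline_id, score) pairs sorted by score, best first."""
    scores = []
    for line in log_text.splitlines():
        match = SCORED.search(line)
        if match:
            scores.append((match.group("id"), float(match.group("score"))))
    # sorted() is stable, so pipelines with equal scores keep their log order.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# A few lines copied from the log above.
sample_log = """\
INFO: Found pipeline id=e9043b5b-a27c-4a64-a154-8e061dfed680, time=0:00:19.895961, scoring...
INFO: Scored pipeline id=e9043b5b-a27c-4a64-a154-8e061dfed680, f1_macro=0.64214
INFO: Scored pipeline id=69fb4020-c8d7-421a-a216-ab66e27508a7, f1_macro=0.61677
INFO: Scored pipeline id=5fdacf6b-7c4e-4a79-a8fb-3dbcb09841a3, f1_macro=0.71535
"""

ranked = parse_scores(sample_log)
for pipeline_id, score in ranked:
    print(pipeline_id, score)
```

Lines that report a found pipeline are skipped because they do not match the pattern; only scored pipelines are ranked.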

After the pipeline search is complete, we can display the leaderboard:

[3]:
automl.plot_leaderboard()
[3]:
ranking | id | summary | f1_macro
1 | 5fdacf6b-7c4e-4a79-a8fb-3dbcb09841a3 | imputer.sklearn, encoder.dsbox, gradient_boosting.sklearn | 0.715350
2 | e9043b5b-a27c-4a64-a154-8e061dfed680 | imputer.sklearn, encoder.dsbox, random_forest.sklearn | 0.642140
3 | 7e1a0713-2785-442b-bf0a-c80495747680 | imputer.sklearn, one_hot_encoder.distilonehotencoder, quantile_transformer.sklearn, pca_features.pcafeatures, xgboost_gbtree.common | 0.624920
4 | bc7bcbb4-86e5-45d8-a914-b61b7dcdfaa6 | imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, pca_features.pcafeatures, xgboost_gbtree.common | 0.624920
5 | 9568f649-471e-42fd-9e52-1520b378b4df | imputer.sklearn, one_hot_encoder.sklearn, quantile_transformer.sklearn, pca_features.pcafeatures, xgboost_gbtree.common | 0.624920
6 | 61bad066-82d2-46a3-b30e-a650ec4abe00 | imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, xgboost_gbtree.common | 0.624920
7 | ffed0620-68c8-4267-848a-5f351f1c6909 | imputer.sklearn, one_hot_encoder.distilonehotencoder, quantile_transformer.sklearn, pca_features.pcafeatures, xgboost_gbtree.common | 0.624920
8 | 6e0db6d0-d402-43ac-9b0b-bc15adaafa0f | imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, xgboost_gbtree.common | 0.624920
9 | 661a0416-b65f-4233-a5c9-af82616159d3 | imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, pca_features.pcafeatures, xgboost_gbtree.common | 0.624920
10 | c1337970-cd54-4387-ac69-e663bee43ebf | imputer.sklearn, one_hot_encoder.sklearn, quantile_transformer.sklearn, pca_features.pcafeatures, xgboost_gbtree.common | 0.624920
11 | 69fb4020-c8d7-421a-a216-ab66e27508a7 | imputer.sklearn, encoder.dsbox, extra_trees.sklearn | 0.616770
12 | 80851e2a-ae8e-4dd7-8764-9beb025a3185 | imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, select_fwe.sklearn, xgboost_gbtree.common | 0.607650
13 | 498ea7df-ff69-482f-b1f3-e7ef8e06f261 | imputer.sklearn, one_hot_encoder.distilonehotencoder, quantile_transformer.sklearn, select_fwe.sklearn, xgboost_gbtree.common | 0.607650
14 | 9195f3a1-0433-4fb7-8c7e-db93fbacdda7 | imputer.sklearn, one_hot_encoder.sklearn, quantile_transformer.sklearn, select_fwe.sklearn, xgboost_gbtree.common | 0.607650
15 | 4ad83c37-8a32-460f-beda-02d142d35459 | imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, generic_univariate_select.sklearn, xgboost_gbtree.common | 0.492900
16 | e469f0a3-af16-4ab3-b826-09188a06c9a3 | imputer.sklearn, encoder.dsbox, sgd.sklearn | 0.473160
17 | c2ce4516-3aa4-489a-b860-15631568707d | imputer.sklearn, encoder.dsbox, linear_svc.sklearn | 0.441150
18 | f8386c24-bdb8-4ec3-b7df-a1095b61453b | imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, score_based_markov_blanket.rpi, xgboost_gbtree.common | 0.414610

Before a pipeline can be used for predictions, it must be trained on the full training data. To train the best pipeline, call:

[4]:
best_pipeline_id = automl.get_best_pipeline_id() # Getting the id of the best pipeline
model_id = automl.train(best_pipeline_id)
INFO: Training model...
INFO: Training finished!

To get the trained model's predictions on the test dataset, call:

[5]:
predictions = automl.test(model_id, test_dataset)
INFO: Testing model...
INFO: Testing finished!
[6]:
predictions
[6]:
d3mIndex Hall_of_Fame
0 2 0
1 9 0
2 14 0
3 15 0
4 21 0
... ... ...
262 1327 0
263 1328 0
264 1329 0
265 1335 0
266 1338 0

267 rows × 2 columns

The pipeline can be evaluated against a held-out dataset with the function call:

[7]:
automl.score(best_pipeline_id, test_dataset)
[7]:
('f1_macro', 0.64322)
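
The f1_macro value is, as the metric name suggests, a macro-averaged F1 score: F1 is computed separately for each class and the per-class values are averaged with equal weight, so rare classes count as much as common ones. The sketch below works this out in plain Python. The function macro_f1 and the toy labels are illustrative assumptions; this is not the scoring code used by D3M or AlphaD3M.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    # Every label appears in y_true or y_pred, so each denominator below is > 0.
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(f1_scores) / len(f1_scores)


# Toy labels, made up for illustration (not taken from the baseball dataset).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 2]

# Per-class F1 is 0.75, 0.5 and 1.0, so the macro average is 0.75.
print(macro_f1(y_true, y_pred))
```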

Model Exploration

In order to explore the produced pipelines, we can use PipelineProfiler. PipelineProfiler is a visualization that enables users to compare and explore the pipelines generated by the AutoML systems.

After the pipeline search process is completed, we can use PipelineProfiler with:

Note

You can partially interact with this visualization. Try it in Jupyter Notebook to get full access to all features.

[8]:
automl.plot_comparison_pipelines()
INFO: Inputs for PipelineProfiler created!

PipelineProfiler shows the produced pipelines as a matrix, where the pipelines are represented as rows, and primitives as columns.

PipelineProfiler matrix view
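
To make the matrix layout concrete, the sketch below builds a small pipeline-by-primitive matrix from three summaries copied from the leaderboard above. It only illustrates the data layout; the helper build_matrix and the shortened pipeline ids are assumptions made for this example, and PipelineProfiler's own data structures may differ.

```python
# Summaries copied from three leaderboard rows; ids shortened for readability.
pipelines = {
    "5fdacf6b": "imputer.sklearn, encoder.dsbox, gradient_boosting.sklearn",
    "e9043b5b": "imputer.sklearn, encoder.dsbox, random_forest.sklearn",
    "61bad066": "imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, xgboost_gbtree.common",
}


def build_matrix(summaries):
    """Return (columns, rows): one column per primitive, one 0/1 row per pipeline."""
    parsed = {
        pid: [name.strip() for name in text.split(",")]
        for pid, text in summaries.items()
    }
    columns = sorted({name for names in parsed.values() for name in names})
    rows = {
        pid: [1 if column in names else 0 for column in columns]
        for pid, names in parsed.items()
    }
    return columns, rows


columns, matrix = build_matrix(pipelines)
print(columns)
for pid, row in matrix.items():
    print(pid, row)
```

A 1 in a cell means the pipeline in that row uses the primitive in that column.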

The score view displays performance metrics (e.g. accuracy, F1) of the evaluated pipelines. It can also visualize the training time of each of the pipelines.

PipelineProfiler performance view

The Primitive Contribution view shows the correlation between primitive usage and the pipeline scores.

PipelineProfiler primitive contribution
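
One way to read "correlation between primitive usage and pipeline scores" is as a Pearson correlation between a 0/1 usage indicator and the score. The sketch below computes that for xgboost_gbtree.common over six pipelines from the leaderboard above. The choice of Pearson correlation and the helper pearson are assumptions made for illustration; this document does not state which statistic PipelineProfiler actually uses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Convention chosen for this sketch: no variation means no correlation.
        return 0.0
    return covariance / math.sqrt(var_x * var_y)


# Leaderboard ranks 1, 2, 3, 11, 15 and 18: f1_macro and whether the
# pipeline summary contains xgboost_gbtree.common (1) or not (0).
scores = [0.71535, 0.64214, 0.62492, 0.61677, 0.4929, 0.41461]
uses_xgboost = [0, 0, 1, 0, 1, 1]

contribution = pearson(uses_xgboost, scores)
print(round(contribution, 3))
```

In this hand-picked subset the value is negative, because the pipelines using the primitive score lower on average. A correlation over so few pipelines is only descriptive and does not show that the primitive causes the lower scores.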

The Pipeline Comparison view highlights the differences between selected pipelines. It presents a node-link representation of the selected pipelines. Multiple pipelines can be selected by shift-clicking the matrix rows.

PipelineProfiler graph comparison

For more information about how to use PipelineProfiler, see the PipelineProfiler documentation. A video demo is also available.

After the analysis is complete, end the session to stop the Docker container and clean up temporary files:

[10]:
automl.end_session()
INFO: Ending session...
INFO: Session ended!
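
Because the session holds a running Docker container, it can help to make sure end_session is called even when an earlier step raises an exception. The sketch below wraps the calls used in this example in try/finally. The helper run_and_close and the FakeAutoML stand-in are hypothetical names introduced here so the example runs without Docker; only the method names come from the cells above, and whether end_session succeeds after a failed step is an assumption.

```python
def run_and_close(automl, train_dataset, test_dataset, **search_args):
    """Search, train the best pipeline, predict, and always end the session."""
    try:
        automl.search_pipelines(train_dataset, **search_args)
        best_pipeline_id = automl.get_best_pipeline_id()
        model_id = automl.train(best_pipeline_id)
        return automl.test(model_id, test_dataset)
    finally:
        # Runs on success and on failure, so the session is not left open.
        automl.end_session()


class FakeAutoML:
    """Stand-in that records calls; used only to exercise the helper."""

    def __init__(self):
        self.calls = []

    def search_pipelines(self, dataset, **kwargs):
        self.calls.append("search_pipelines")

    def get_best_pipeline_id(self):
        self.calls.append("get_best_pipeline_id")
        return "pipeline-1"

    def train(self, pipeline_id):
        self.calls.append("train")
        return "model-1"

    def test(self, model_id, dataset):
        self.calls.append("test")
        raise RuntimeError("simulated failure while testing")

    def end_session(self):
        self.calls.append("end_session")


fake = FakeAutoML()
try:
    run_and_close(fake, "train_data.csv", "test_data.csv",
                  time_bound=1, target="Hall_of_Fame")
except RuntimeError:
    pass
print(fake.calls)
```

With the real class, the same helper would be called with the automl object created earlier in place of FakeAutoML.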

Download this example as a Jupyter notebook file (.ipynb).