Getting Started

d3m-interface integrates D3M AutoML systems with Jupyter notebooks. Notebooks provide an interactive computing environment in which you can generate models with the D3M AutoML systems and explore them with PipelineProfiler, an interactive visualization of end-to-end machine learning pipelines. d3m-interface has two main components: model generation and model exploration.

Model Generation

The model generation component provides methods to search, train, test, and score pipelines.

First, import the AutoML class from d3m-interface:

[1]:

from d3m_interface import AutoML

Then, search for pipelines using the imported class. AutoML receives the output path and the name of the AutoML engine to be used. In this example, we use AlphaD3M, developed by NYU. d3m-interface automatically downloads the Docker image for AlphaD3M and sets up the AutoML system.

To search for pipelines, call search_pipelines. It takes the following parameters: the training dataset; the maximum running time in minutes (time_bound); the variable to predict (target); the metric to use (metric); and the keywords that describe the task to be solved (task_keywords). time_bound limits how long the search can run and therefore how much computation it uses. Longer running times may lead to more accurate solutions, since the system has more time to generate and evaluate candidate pipelines.
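To make the role of time_bound concrete, here is a toy sketch of a time-bounded search loop. It is not AlphaD3M's search algorithm: the function bounded_search, the candidate names, the scores, and the injected fake clock are all hypothetical. They only illustrate that a larger budget lets the loop evaluate more candidates, so the best score found can stay the same or improve.

```python
import time


def bounded_search(candidates, evaluate, time_bound_minutes, clock=time.monotonic):
    """Evaluate candidates in order until the time budget is spent.

    Hypothetical illustration only; not how AlphaD3M searches.
    Returns (best_candidate, best_score, number_evaluated).
    """
    deadline = clock() + time_bound_minutes * 60
    best, best_score, evaluated = None, float("-inf"), 0
    for candidate in candidates:
        if clock() >= deadline:
            break  # budget exhausted: stop trying new candidates
        score = evaluate(candidate)
        evaluated += 1
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score, evaluated


def make_fake_clock(step_seconds=60):
    """Clock that advances by step_seconds on every reading (no real waiting)."""
    now = {"t": 0}

    def clock():
        now["t"] += step_seconds
        return now["t"]

    return clock


toy_scores = {"a": 0.44, "b": 0.62, "c": 0.71}  # made-up candidates and scores
short = bounded_search(toy_scores, toy_scores.get, 2, clock=make_fake_clock())
longer = bounded_search(toy_scores, toy_scores.get, 10, clock=make_fake_clock())
print(short)
print(longer)
```

With the 2-minute budget the loop only reaches the first candidate; with 10 minutes it evaluates all three and keeps the highest-scoring one.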

The 185_baseball_MIN_METADATA dataset in CSV format is used for this example. This dataset contains information about baseball players and play statistics, including Games_played, At_bats, Runs, Hits, Doubles, Triples, Home_runs, RBIs, Walks, Strikeouts, Batting_average, On_base_pct, Slugging_pct and Fielding_ave.
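Before starting a search, it can help to confirm that the target column is actually present in the training CSV. The sketch below uses only the standard library. The helper check_target is hypothetical (it is not part of d3m-interface), and the in-memory sample stands in for train_data.csv: its header uses a few of the column names mentioned above, and its two data rows are made up.

```python
import csv
import io


def check_target(csv_file, target):
    """Return the column names, raising ValueError if target is missing."""
    header = next(csv.reader(csv_file))
    if target not in header:
        raise ValueError(f"target {target!r} not found in columns {header}")
    return header


# In-memory stand-in for train_data.csv; the two data rows are made up.
sample = io.StringIO(
    "d3mIndex,Games_played,At_bats,Hits,Hall_of_Fame\n"
    "0,100,350,90,0\n"
    "1,150,600,200,1\n"
)
columns = check_target(sample, "Hall_of_Fame")
print(columns)
```

For a real file, the same helper can be called on an open file handle, for example inside `with open(train_dataset, newline="") as f:`.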

Note

If you installed AlphaD3M from PyPI, please use the 'pypi' parameter in the constructor, like this: automl = AutoML(output_path, 'AlphaD3M', 'pypi')

[2]:

output_path = '/Users/rlopez/D3M/examples/tmp/'
train_dataset = '/Users/rlopez/D3M/examples/datasets/185_baseball_MIN_METADATA/train_data.csv'
test_dataset = '/Users/rlopez/D3M/examples/datasets/185_baseball_MIN_METADATA/test_data.csv'

automl = AutoML(output_path)
automl.search_pipelines(train_dataset, time_bound=10, target='Hall_of_Fame', metric='f1Macro', task_keywords=['classification', 'multiClass', 'tabular'])

INFO: Initializing AlphaD3M AutoML...
INFO: AlphaD3M AutoML initialized!
INFO: Found pipeline id=e9043b5b-a27c-4a64-a154-8e061dfed680, time=0:00:19.895961, scoring...
INFO: Scored pipeline id=e9043b5b-a27c-4a64-a154-8e061dfed680, f1_macro=0.64214
INFO: Found pipeline id=69fb4020-c8d7-421a-a216-ab66e27508a7, time=0:00:38.073745, scoring...
INFO: Scored pipeline id=69fb4020-c8d7-421a-a216-ab66e27508a7, f1_macro=0.61677
INFO: Found pipeline id=5fdacf6b-7c4e-4a79-a8fb-3dbcb09841a3, time=0:00:56.313074, scoring...
INFO: Scored pipeline id=5fdacf6b-7c4e-4a79-a8fb-3dbcb09841a3, f1_macro=0.71535
INFO: Found pipeline id=c2ce4516-3aa4-489a-b860-15631568707d, time=0:01:11.570050, scoring...
INFO: Scored pipeline id=c2ce4516-3aa4-489a-b860-15631568707d, f1_macro=0.44115
INFO: Found pipeline id=e469f0a3-af16-4ab3-b826-09188a06c9a3, time=0:01:26.809299, scoring...
INFO: Scored pipeline id=e469f0a3-af16-4ab3-b826-09188a06c9a3, f1_macro=0.47316
INFO: Found pipeline id=80851e2a-ae8e-4dd7-8764-9beb025a3185, time=0:01:48.029030, scoring...
INFO: Scored pipeline id=80851e2a-ae8e-4dd7-8764-9beb025a3185, f1_macro=0.60765
INFO: Found pipeline id=498ea7df-ff69-482f-b1f3-e7ef8e06f261, time=0:02:03.248226, scoring...
INFO: Scored pipeline id=498ea7df-ff69-482f-b1f3-e7ef8e06f261, f1_macro=0.60765
INFO: Found pipeline id=9195f3a1-0433-4fb7-8c7e-db93fbacdda7, time=0:02:21.458944, scoring...
INFO: Scored pipeline id=9195f3a1-0433-4fb7-8c7e-db93fbacdda7, f1_macro=0.60765
INFO: Found pipeline id=7e1a0713-2785-442b-bf0a-c80495747680, time=0:02:36.621582, scoring...
INFO: Scored pipeline id=7e1a0713-2785-442b-bf0a-c80495747680, f1_macro=0.62492
INFO: Found pipeline id=bc7bcbb4-86e5-45d8-a914-b61b7dcdfaa6, time=0:02:54.767454, scoring...
INFO: Scored pipeline id=bc7bcbb4-86e5-45d8-a914-b61b7dcdfaa6, f1_macro=0.62492
INFO: Found pipeline id=9568f649-471e-42fd-9e52-1520b378b4df, time=0:03:18.970813, scoring...
INFO: Found pipeline id=61bad066-82d2-46a3-b30e-a650ec4abe00, time=0:03:34.186458, scoring...
INFO: Scored pipeline id=9568f649-471e-42fd-9e52-1520b378b4df, f1_macro=0.62492
INFO: Scored pipeline id=61bad066-82d2-46a3-b30e-a650ec4abe00, f1_macro=0.62492
INFO: Found pipeline id=4ad83c37-8a32-460f-beda-02d142d35459, time=0:04:01.392814, scoring...
INFO: Scored pipeline id=4ad83c37-8a32-460f-beda-02d142d35459, f1_macro=0.4929
INFO: Found pipeline id=f8386c24-bdb8-4ec3-b7df-a1095b61453b, time=0:04:19.554698, scoring...
INFO: Scored pipeline id=f8386c24-bdb8-4ec3-b7df-a1095b61453b, f1_macro=0.41461
INFO: Found pipeline id=6e0db6d0-d402-43ac-9b0b-bc15adaafa0f, time=0:05:25.937965, scoring...
INFO: Found pipeline id=ffed0620-68c8-4267-848a-5f351f1c6909, time=0:05:26.384093, scoring...
INFO: Found pipeline id=661a0416-b65f-4233-a5c9-af82616159d3, time=0:05:35.799391, scoring...
INFO: Scored pipeline id=ffed0620-68c8-4267-848a-5f351f1c6909, f1_macro=0.62492
INFO: Scored pipeline id=6e0db6d0-d402-43ac-9b0b-bc15adaafa0f, f1_macro=0.62492
INFO: Found pipeline id=c1337970-cd54-4387-ac69-e663bee43ebf, time=0:05:42.173183, scoring...
INFO: Search completed, still scoring some pending pipelines...
INFO: Scored pipeline id=661a0416-b65f-4233-a5c9-af82616159d3, f1_macro=0.62492
INFO: Scored pipeline id=c1337970-cd54-4387-ac69-e663bee43ebf, f1_macro=0.62492
INFO: Scoring completed for all pipelines!
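If you capture this log as text, the scores can be extracted with a regular expression. The sketch below assumes the "Scored pipeline" line format shown in the output above; that format is not a documented interface and may change between versions. The helper parse_scores is hypothetical, and the sample contains the first six lines of the log.

```python
import re

# Matches lines such as:
# INFO: Scored pipeline id=<uuid>, f1_macro=0.64214
SCORED = re.compile(
    r"INFO: Scored pipeline id=(?P<id>[0-9a-f-]+), (?P<metric>\w+)=(?P<score>[0-9.]+)"
)


def parse_scores(log_text):
    """Map pipeline id -> score for every 'Scored pipeline' line."""
    return {m["id"]: float(m["score"]) for m in SCORED.finditer(log_text)}


log_text = """\
INFO: Found pipeline id=e9043b5b-a27c-4a64-a154-8e061dfed680, time=0:00:19.895961, scoring...
INFO: Scored pipeline id=e9043b5b-a27c-4a64-a154-8e061dfed680, f1_macro=0.64214
INFO: Found pipeline id=69fb4020-c8d7-421a-a216-ab66e27508a7, time=0:00:38.073745, scoring...
INFO: Scored pipeline id=69fb4020-c8d7-421a-a216-ab66e27508a7, f1_macro=0.61677
INFO: Found pipeline id=5fdacf6b-7c4e-4a79-a8fb-3dbcb09841a3, time=0:00:56.313074, scoring...
INFO: Scored pipeline id=5fdacf6b-7c4e-4a79-a8fb-3dbcb09841a3, f1_macro=0.71535
"""

scores = parse_scores(log_text)
best_id = max(scores, key=scores.get)  # id with the highest score
print(best_id, scores[best_id])
```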

After the pipeline search is complete, we can display the leaderboard:

[3]:

automl.plot_leaderboard()

[3]:

ranking	id	summary	f1_macro
1	5fdacf6b-7c4e-4a79-a8fb-3dbcb09841a3	imputer.sklearn, encoder.dsbox, gradient_boosting.sklearn	0.715350
2	e9043b5b-a27c-4a64-a154-8e061dfed680	imputer.sklearn, encoder.dsbox, random_forest.sklearn	0.642140
3	7e1a0713-2785-442b-bf0a-c80495747680	imputer.sklearn, one_hot_encoder.distilonehotencoder, quantile_transformer.sklearn, pca_features.pcafeatures, xgboost_gbtree.common	0.624920
4	bc7bcbb4-86e5-45d8-a914-b61b7dcdfaa6	imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, pca_features.pcafeatures, xgboost_gbtree.common	0.624920
5	9568f649-471e-42fd-9e52-1520b378b4df	imputer.sklearn, one_hot_encoder.sklearn, quantile_transformer.sklearn, pca_features.pcafeatures, xgboost_gbtree.common	0.624920
6	61bad066-82d2-46a3-b30e-a650ec4abe00	imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, xgboost_gbtree.common	0.624920
7	ffed0620-68c8-4267-848a-5f351f1c6909	imputer.sklearn, one_hot_encoder.distilonehotencoder, quantile_transformer.sklearn, pca_features.pcafeatures, xgboost_gbtree.common	0.624920
8	6e0db6d0-d402-43ac-9b0b-bc15adaafa0f	imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, xgboost_gbtree.common	0.624920
9	661a0416-b65f-4233-a5c9-af82616159d3	imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, pca_features.pcafeatures, xgboost_gbtree.common	0.624920
10	c1337970-cd54-4387-ac69-e663bee43ebf	imputer.sklearn, one_hot_encoder.sklearn, quantile_transformer.sklearn, pca_features.pcafeatures, xgboost_gbtree.common	0.624920
11	69fb4020-c8d7-421a-a216-ab66e27508a7	imputer.sklearn, encoder.dsbox, extra_trees.sklearn	0.616770
12	80851e2a-ae8e-4dd7-8764-9beb025a3185	imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, select_fwe.sklearn, xgboost_gbtree.common	0.607650
13	498ea7df-ff69-482f-b1f3-e7ef8e06f261	imputer.sklearn, one_hot_encoder.distilonehotencoder, quantile_transformer.sklearn, select_fwe.sklearn, xgboost_gbtree.common	0.607650
14	9195f3a1-0433-4fb7-8c7e-db93fbacdda7	imputer.sklearn, one_hot_encoder.sklearn, quantile_transformer.sklearn, select_fwe.sklearn, xgboost_gbtree.common	0.607650
15	4ad83c37-8a32-460f-beda-02d142d35459	imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, generic_univariate_select.sklearn, xgboost_gbtree.common	0.492900
16	e469f0a3-af16-4ab3-b826-09188a06c9a3	imputer.sklearn, encoder.dsbox, sgd.sklearn	0.473160
17	c2ce4516-3aa4-489a-b860-15631568707d	imputer.sklearn, encoder.dsbox, linear_svc.sklearn	0.441150
18	f8386c24-bdb8-4ec3-b7df-a1095b61453b	imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, score_based_markov_blanket.rpi, xgboost_gbtree.common	0.414610

A pipeline found by the search must be trained on the full training data before it can make predictions. Train the best pipeline with:

[4]:

best_pipeline_id = automl.get_best_pipeline_id() # Getting the id of the best pipeline
model_id = automl.train(best_pipeline_id)

INFO: Training model...
INFO: Training finished!

To get the trained model's predictions on the test dataset, call test:

[5]:

predictions = automl.test(model_id, test_dataset)

INFO: Testing model...
INFO: Testing finished!

[6]:

predictions

[6]:

	d3mIndex	Hall_of_Fame
0	2	0
1	9	0
2	14	0
3	15	0
4	21	0
...	...	...
262	1327	0
263	1328	0
264	1329	0
265	1335	0
266	1338	0

267 rows × 2 columns

The pipeline can be evaluated against a held-out dataset with score:

[7]:

automl.score(best_pipeline_id, test_dataset)

[7]:

('f1_macro', 0.64322)
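The metric used here, macro-averaged F1, is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how many rows it has. The sketch below implements that standard definition in plain Python. It is not the scoring code used by the AutoML system, which may treat edge cases differently, and the labels are made up for illustration.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Made-up labels for illustration (not taken from the baseball dataset).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 2]
score = f1_macro(y_true, y_pred)
print(round(score, 5))
```

In this toy example the per-class F1 scores are 0.75, 0.5, and 1.0, so the macro average is 0.75.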

Model Exploration

To explore the produced pipelines, we can use PipelineProfiler, a visualization tool for comparing and exploring the pipelines generated by AutoML systems.

Once the pipeline search has completed, launch PipelineProfiler with:

Note

You can partially interact with this visualization. Try it in Jupyter Notebook to get full access to all features.

[8]:

automl.plot_comparison_pipelines()

INFO: Inputs for PipelineProfiler created!

PipelineProfiler shows the produced pipelines as a matrix, where the pipelines are represented as rows, and primitives as columns.

PipelineProfiler matrix view
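The idea behind the matrix view can be sketched in a few lines: each pipeline becomes a row of 0/1 flags, one per primitive. This is only an illustration built from the summary strings in the leaderboard above, not PipelineProfiler's internal data structure; the helper primitive_matrix is hypothetical, and the pipeline ids are shortened to their first block for readability.

```python
def primitive_matrix(summaries):
    """Build a binary pipeline-by-primitive matrix from summary strings."""
    pipelines = [[p.strip() for p in s.split(",")] for s in summaries.values()]
    primitives = sorted({p for steps in pipelines for p in steps})
    rows = {
        pid: [int(p in steps) for p in primitives]
        for pid, steps in zip(summaries, pipelines)
    }
    return primitives, rows


# Three summaries copied from the leaderboard (ids shortened to their first block).
summaries = {
    "5fdacf6b": "imputer.sklearn, encoder.dsbox, gradient_boosting.sklearn",
    "e9043b5b": "imputer.sklearn, encoder.dsbox, random_forest.sklearn",
    "61bad066": "imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, xgboost_gbtree.common",
}
primitives, rows = primitive_matrix(summaries)
print(primitives)
for pid, flags in rows.items():
    print(pid, flags)
```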

The score view displays performance metrics (e.g., accuracy, F1) of the evaluated pipelines. It can also show the training time of each pipeline.

PipelineProfiler performance view

The Primitive Contribution view shows the correlation between primitive usage and the pipeline scores.

PipelineProfiler primitive contribution
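One way to quantify such a relationship is the Pearson correlation between a 0/1 "pipeline uses this primitive" indicator and the pipeline score. The sketch below assumes that statistic; this page does not state exactly how PipelineProfiler computes its contribution values, so treat it as an illustration rather than a reimplementation. The summaries and scores are copied from six leaderboard rows above, and the helper names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


# (summary, f1_macro) pairs copied from six leaderboard rows.
rows = [
    ("imputer.sklearn, encoder.dsbox, gradient_boosting.sklearn", 0.71535),
    ("imputer.sklearn, encoder.dsbox, random_forest.sklearn", 0.64214),
    ("imputer.sklearn, encoder.dsbox, quantile_transformer.sklearn, xgboost_gbtree.common", 0.62492),
    ("imputer.sklearn, encoder.dsbox, extra_trees.sklearn", 0.61677),
    ("imputer.sklearn, encoder.dsbox, sgd.sklearn", 0.47316),
    ("imputer.sklearn, encoder.dsbox, linear_svc.sklearn", 0.44115),
]
scores = [score for _, score in rows]


def contribution(primitive):
    """Correlate primitive usage (0/1 per pipeline) with the pipeline scores."""
    used = [int(primitive in summary.split(", ")) for summary, _ in rows]
    return pearson(used, scores)


gb = contribution("gradient_boosting.sklearn")  # used only by the top pipeline
svc = contribution("linear_svc.sklearn")        # used only by the lowest of these six
imputer = contribution("imputer.sklearn")       # used by every pipeline: no signal
print(gb, svc, imputer)
```

A primitive used only by high-scoring pipelines gets a positive value, one used only by low-scoring pipelines gets a negative value, and a primitive present in every pipeline carries no information.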

The Pipeline Comparison view highlights the differences between selected pipelines. It presents a node-link representation of the selected pipelines. Multiple pipelines can be selected by shift-clicking the matrix rows.

PipelineProfiler graph comparison
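A rough text-only analogue of this comparison is to split two pipeline summaries into primitives and list what they share and where they differ. The sketch below does that for the pipelines ranked 3 and 4 on the leaderboard above. It treats each pipeline as a flat list rather than a graph, so it is a simplification of the node-link view, and the helper compare is hypothetical.

```python
def compare(summary_a, summary_b):
    """Return (shared, only_in_a, only_in_b) primitive lists, in pipeline order."""
    a, b = summary_a.split(", "), summary_b.split(", ")
    shared = [p for p in a if p in b]
    only_a = [p for p in a if p not in b]
    only_b = [p for p in b if p not in a]
    return shared, only_a, only_b


# Ranks 3 and 4 on the leaderboard: same score, different encoder.
rank3 = (
    "imputer.sklearn, one_hot_encoder.distilonehotencoder, "
    "quantile_transformer.sklearn, pca_features.pcafeatures, xgboost_gbtree.common"
)
rank4 = (
    "imputer.sklearn, encoder.dsbox, "
    "quantile_transformer.sklearn, pca_features.pcafeatures, xgboost_gbtree.common"
)
shared, only_3, only_4 = compare(rank3, rank4)
print("shared:", shared)
print("only in rank 3:", only_3)
print("only in rank 4:", only_4)
```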

For more information about how to use PipelineProfiler, see its documentation. A video demo is also available.

After the analysis is complete, end the session to stop the Docker container and clean up temporary files:

[10]:

automl.end_session()

INFO: Ending session...
INFO: Session ended!
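If a cell raises an error before end_session is reached, the session is left open. One way to guard against that is a small context manager that always calls end_session on exit. The wrapper automl_session below is hypothetical and not part of d3m-interface; it relies only on the end_session method shown above. A stand-in class is used so the sketch runs without Docker or d3m-interface installed.

```python
from contextlib import contextmanager


@contextmanager
def automl_session(automl):
    """Yield automl and always call its end_session() afterwards."""
    try:
        yield automl
    finally:
        automl.end_session()


class FakeAutoML:
    """Stand-in for an AutoML instance; records whether the session ended."""

    def __init__(self):
        self.ended = False

    def end_session(self):
        self.ended = True


fake = FakeAutoML()
try:
    with automl_session(fake) as session:
        raise RuntimeError("simulated failure during search")
except RuntimeError:
    pass

print(fake.ended)
```

With a real instance, the same pattern would wrap the calls to search_pipelines, train, test, and score inside the with block.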

Download this example as a Jupyter notebook file (.ipynb).