API

class AutoML(output_folder, automl_id='AlphaD3M', container_runtime='docker', resource_folder=None, grpc_port=None, verbose=False)

Create/instantiate an AutoML object

Parameters

output_folder – Path to the output directory
automl_id – Name of the AutoML system to use. Available systems are ‘AlphaD3M’ and ‘AutonML’. Currently only AlphaD3M is available with container_runtime=‘pypi’
resource_folder – Path to the directory where the resources are stored. This is needed only for primitives that use pre-trained models, databases, etc.
container_runtime – The container runtime to use: ‘docker’, ‘singularity’, ‘pypi’, or ‘local’
grpc_port – Port to be used by gRPC
verbose – Whether or not to show all the logs from AutoML systems
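The constraint on container_runtime=‘pypi’ described above can be checked before the object is built. The following is a minimal sketch, not part of the library: the helper name check_automl_options is hypothetical, and the allowed values are copied from the parameter descriptions.

```python
# Runtimes listed in the container_runtime description above.
CONTAINER_RUNTIMES = {"docker", "singularity", "pypi", "local"}


def check_automl_options(automl_id="AlphaD3M", container_runtime="docker"):
    """Return constructor keyword arguments, or raise ValueError.

    Hypothetical helper: it only encodes the rules stated in the
    parameter descriptions and does not call the library.
    """
    if container_runtime not in CONTAINER_RUNTIMES:
        raise ValueError(f"unknown container_runtime: {container_runtime!r}")
    # The documentation says only AlphaD3M is available with 'pypi'.
    if container_runtime == "pypi" and automl_id != "AlphaD3M":
        raise ValueError("only 'AlphaD3M' is available with container_runtime='pypi'")
    return {"automl_id": automl_id, "container_runtime": container_runtime}
```

Assuming the class is importable as `from d3m_interface import AutoML` (an assumption, since the import path is not given on this page), the result could be passed as `AutoML(output_folder, **check_automl_options('AlphaD3M', 'pypi'))`.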

search_pipelines(dataset, time_bound, time_bound_run=5, target=None, metric=None, task_keywords=None, method='holdout', stratified=False, shuffle=True, folds=10, train_ratio=0.7, random_seed=0, exclude_primitives=None, include_primitives=None, **kwargs)

Search for pipelines

Parameters

dataset – Path to the dataset. Supports CSV files, D3M datasets, and OpenML and Sklearn datasets
time_bound – Time limit in minutes for the search
time_bound_run – Time limit in minutes for scoring a pipeline
target – Column name of the potential target variable for a problem
metric – Metric used to score the pipelines. The available metrics are: hammingLoss, accuracy, objectDetectionAP, rocAucMicro, f1Macro, meanSquaredError, f1, jaccardSimilarityScore, normalizedMutualInformation, rocAuc, f1Micro, hitsAtK, meanAbsoluteError, rocAucMacro, rSquared, recall, meanReciprocalRank, precision, precisionAtTopK, rootMeanSquaredError
task_keywords – A list of keywords that capture the nature of the machine learning task. The keywords that can be combined to describe the task are the following: tabular, nested, multiLabel, video, linkPrediction, multivariate, graphMatching, forecasting, classification, graph, semiSupervised, text, timeSeries, clustering, collaborativeFiltering, univariate, missingMetadata, remoteSensing, multiClass, regression, multiGraph, lupi, relational, audio, grouped, objectDetection, vertexNomination, communityDetection, geospatial, image, overlapping, nonOverlapping, speech, vertexClassification, binary
method – Method used to score the pipelines: ‘holdout’ or ‘cross_validation’
stratified – Whether or not to split the data using a stratified strategy
shuffle – Whether or not to shuffle the data before splitting
folds – Number of folds to use when method is ‘cross_validation’
train_ratio – Proportion of the dataset to include in the train split
random_seed – Seed used by the random number generator
exclude_primitives – List of primitive names to exclude from the search space. If None, all the primitives will be used in the search
include_primitives – List of primitive names to include in the search space. If None, all the primitives will be used in the search
kwargs – Additional arguments for the problem settings (e.g. pos_label for binary problems using F1)
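As an illustration of how these parameters fit together, the sketch below wraps a cross-validated search for a tabular multi-class problem. The helper name search_classification and its default values are hypothetical; automl is assumed to be an already constructed AutoML object, and only parameters from the signature above are passed.

```python
def search_classification(automl, dataset, target, minutes=10):
    """Run a stratified 5-fold cross-validated search (illustrative defaults)."""
    automl.search_pipelines(
        dataset,
        time_bound=minutes,            # minutes for the whole search
        target=target,                 # name of the target column
        metric="accuracy",             # one of the metrics listed above
        task_keywords=["classification", "multiClass", "tabular"],
        method="cross_validation",
        folds=5,
        stratified=True,
        random_seed=0,
    )
```

A call might look like `search_classification(automl, 'train.csv', target='label')`, where the file name and column name are placeholders.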

train(pipeline_id, expose_outputs=None)

Train a model using a specific ML pipeline

Parameters

pipeline_id – Pipeline id
expose_outputs – The outputs of the pipeline steps to expose. If None, no step output is exposed. If a str, it should be ‘all’ to show the output of each step in the pipeline. If a list, it should contain the ids of the steps, e.g. ‘steps.2.produce’

Returns

The id of the fitted pipeline, with or without the pipeline step outputs
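The list form of expose_outputs expects step ids such as ‘steps.2.produce’. A small sketch for building such a list from step indices follows; the helper name step_output_ids is hypothetical, and it assumes every id follows the ‘steps.N.produce’ pattern shown in the example above.

```python
def step_output_ids(step_indices):
    """Build ids like 'steps.2.produce' from integer step indices."""
    return [f"steps.{index}.produce" for index in step_indices]
```

The result could then be passed as `automl.train(pipeline_id, expose_outputs=step_output_ids([0, 2]))`.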

test(pipeline_id, test_dataset, expose_outputs=None, calculate_confidence=False)

Test a model

Parameters

pipeline_id – The id of a fitted pipeline
test_dataset – Path to the dataset. Supports D3M datasets and CSV files
expose_outputs – The outputs of the pipeline steps to expose. If None, no step output is exposed. If a str, it should be ‘all’ to show the output of each step in the pipeline. If a list, it should contain the ids of the steps, e.g. ‘steps.2.produce’
calculate_confidence – Whether or not to return the confidence instead of the predictions

Returns

A dataframe that contains the predictions, with or without the pipeline step outputs

score(pipeline_id, test_dataset)

Compute a proper score of the model

Parameters

pipeline_id – The id of a pipeline or a Pipeline object
test_dataset – Path to the dataset. Supports D3M datasets and CSV files

Returns

A tuple holding metric name and score value
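The train, test, and score methods are typically chained after a search. The sketch below shows one possible ordering, under these assumptions: search_pipelines has already been run on automl, train called with expose_outputs=None returns just the fitted pipeline id, and score accepts the id returned by get_best_pipeline_id (documented further down). The helper name evaluate_best is hypothetical.

```python
def evaluate_best(automl, test_dataset):
    """Train the best pipeline, predict on test_dataset, and score it."""
    pipeline_id = automl.get_best_pipeline_id()
    # train() returns the id of the fitted pipeline, which test() expects.
    fitted_id = automl.train(pipeline_id)
    predictions = automl.test(fitted_id, test_dataset)
    # score() returns a tuple holding the metric name and the score value.
    metric_name, score_value = automl.score(pipeline_id, test_dataset)
    return predictions, metric_name, score_value
```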

save_pipeline(pipeline_id, output_folder)

Save a pipeline on disk

Parameters

pipeline_id – The id of the pipeline to be saved
output_folder – Path to the folder where the pipeline will be saved

load_pipeline(pipeline_path)

Load a previously saved pipeline

Parameters: pipeline_path – Path to the folder where the pipeline is saved

get_best_pipeline_id()

Get the id of the best pipeline

Returns: The id of the best pipeline

list_primitives()

Get a list of primitives used by the AutoML system

Returns: List of primitives used by the AutoML system

create_pipelineprofiler_inputs(test_dataset=None, source_name=None)

Create an input in the format supported by PipelineProfiler, based on the pipelines generated by an AutoML system

Parameters

test_dataset – Path to the dataset. If None, the search scores are used; otherwise the pipelines are scored on the given dataset
source_name – Name of the pipeline source. If None, the AutoML id is used

Returns

List of pipelines in the PipelineProfiler input format

create_textanalizer_inputs(dataset, text_column, label_column, positive_label=1, negative_label=0)

Create an input in the format supported by VisualTextAnalyzer

Parameters

dataset – Path to the dataset. Supports D3M datasets and CSV files
text_column – Name of the column that contains the texts
label_column – Name of the column that contains the classes
positive_label – Label for the positive class
negative_label – Label for the negative class

export_pipeline_code(pipeline_id, ipython_cell=True)

Convert a pipeline description to an executable Python script

Parameters

pipeline_id – Pipeline id
ipython_cell – Whether or not to show the Python code in a Jupyter Notebook cell

end_session(): Safely end the session in the D3M Interface

plot_leaderboard(): Plot the pipelines’ leaderboard

plot_summary_dataset(dataset, text_column=None)

Plot histograms of the dataset

Parameters

dataset – Path to the dataset. Supports D3M datasets and CSV files
text_column – Name of the column that contains the texts. Only needed for D3M datasets that have collections

plot_comparison_pipelines(test_dataset=None, source_name=None, precomputed_pipelines=None)

Plot PipelineProfiler visualization

Parameters

test_dataset – Path to the dataset. If None, the search scores are used; otherwise the pipelines are scored on the given dataset
source_name – Name of the pipeline source. If None, the AutoML id is used
precomputed_pipelines – If not None, it loads pipelines previously computed
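The precomputed_pipelines parameter makes it possible to avoid rescoring the pipelines on every plot. The sketch below caches the PipelineProfiler inputs per dataset. It assumes that precomputed_pipelines accepts the list returned by create_pipelineprofiler_inputs, which this page suggests but does not state outright; the helper name plot_comparison_cached and the cache dictionary are hypothetical.

```python
def plot_comparison_cached(automl, cache, test_dataset=None):
    """Plot the comparison, computing the profiler inputs once per dataset.

    cache is a plain dict owned by the caller, keyed by test_dataset
    (None stands for the search scores).
    """
    if test_dataset not in cache:
        cache[test_dataset] = automl.create_pipelineprofiler_inputs(
            test_dataset=test_dataset
        )
    automl.plot_comparison_pipelines(precomputed_pipelines=cache[test_dataset])
```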

plot_text_analysis(dataset=None, text_column=None, label_column=None, positive_label=1, negative_label=0, precomputed_data=None)

Plot a visualization for text datasets

Parameters

dataset – Path to the dataset. Supports D3M datasets and CSV files
text_column – Name of the column that contains the texts
label_column – Name of the column that contains the classes
positive_label – Label for the positive class
negative_label – Label for the negative class
precomputed_data – If not None, it loads words/named entities previously computed

plot_text_explanation(model_id, instance_text, text_column, label_column, num_features=5, top_labels=1)

Plot a LIME visualization for model explanation

Parameters

model_id – Model id
instance_text – Text to be explained
text_column – Name of the column that contains the texts
label_column – Name of the column that contains the classes
num_features – Maximum number of features present in the explanation
top_labels – Number of labels with highest prediction probabilities to use in the explanations

static add_new_automl(automl_id, docker_image_url)

Add a new AutoML system that is not already defined in the D3M Interface. It can also be a different version of an existing AutoML system, but it must be added under a different name

Parameters

automl_id – An id to identify the new AutoML
docker_image_url – The Docker image URL of the new AutoML
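Because add_new_automl is a static method, it can be called on the class before any object exists. The following sketch registers a system and then creates an object that uses it. The helper name register_and_create is hypothetical, automl_cls stands for the AutoML class, and the choice of container_runtime=‘docker’ is an assumption based on the system being supplied as a Docker image.

```python
def register_and_create(automl_cls, output_folder, automl_id, docker_image_url):
    """Register a new AutoML system, then instantiate it by its new id."""
    # Static method: called on the class, no instance needed.
    automl_cls.add_new_automl(automl_id, docker_image_url)
    return automl_cls(output_folder, automl_id=automl_id, container_runtime="docker")
```

A call might look like `register_and_create(AutoML, '/tmp/output', 'MyAutoML', 'registry.example.com/my-automl:latest')`, where the id, folder, and image URL are placeholders.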