API
-
class
AutoML
(output_folder, automl_id='AlphaD3M', container_runtime='docker', resource_folder=None, grpc_port=None, verbose=False) Create/instantiate an AutoML object
- Parameters
output_folder – Path to the output directory
automl_id – Name of the AutoML system to use. The available systems are ‘AlphaD3M’ and ‘AutonML’. Currently, only AlphaD3M is available with the container_runtime=‘pypi’ option
resource_folder – Path to the directory where the resources are stored. This is needed only for some primitives that use pre-trained models, databases, etc.
container_runtime – The container runtime to use. It can be ‘docker’, ‘singularity’, ‘pypi’, or ‘local’
grpc_port – Port to be used by gRPC
verbose – Whether or not to show all the logs from AutoML systems
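The constraints listed above can be checked before any container is started. The following is a minimal sketch, not part of the library: `build_automl_kwargs` is a hypothetical helper that encodes only the documented runtime values and the ‘pypi’ restriction, and the output path is a placeholder. The resulting dictionary could then be passed to the constructor as `AutoML(**kwargs)`.

```python
# Hypothetical helper (not part of the library): checks the documented
# constructor values and returns keyword arguments for AutoML(...).
CONTAINER_RUNTIMES = {"docker", "singularity", "pypi", "local"}


def build_automl_kwargs(output_folder, automl_id="AlphaD3M",
                        container_runtime="docker", **extra):
    if container_runtime not in CONTAINER_RUNTIMES:
        raise ValueError(f"unknown container_runtime: {container_runtime!r}")
    # Per the parameter list above, only AlphaD3M supports the 'pypi' runtime.
    if container_runtime == "pypi" and automl_id != "AlphaD3M":
        raise ValueError("container_runtime='pypi' requires automl_id='AlphaD3M'")
    return {"output_folder": output_folder, "automl_id": automl_id,
            "container_runtime": container_runtime, **extra}


# "/tmp/automl_output" is a placeholder path chosen for illustration.
kwargs = build_automl_kwargs("/tmp/automl_output", container_runtime="pypi")
# automl = AutoML(**kwargs)  # requires the library and the chosen runtime
```

The helper does not restrict automl_id to the two built-in names, because add_new_automl allows further systems to be registered.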
-
search_pipelines
(dataset, time_bound, time_bound_run=5, target=None, metric=None, task_keywords=None, method='holdout', stratified=False, shuffle=True, folds=10, train_ratio=0.7, random_seed=0, exclude_primitives=None, include_primitives=None, **kwargs) Perform the search of pipelines
- Parameters
dataset – Path to the dataset. CSV files, D3M datasets, and OpenML and Sklearn datasets are supported
time_bound – Time limit, in minutes, for the search
time_bound_run – Time limit, in minutes, for scoring a single pipeline
target – Name of the target column for the problem
metric – Metric used to score the pipelines. The available metrics are the following: hammingLoss, accuracy, objectDetectionAP, rocAucMicro, f1Macro, meanSquaredError, f1, jaccardSimilarityScore, normalizedMutualInformation, rocAuc, f1Micro, hitsAtK, meanAbsoluteError, rocAucMacro, rSquared, recall, meanReciprocalRank, precision, precisionAtTopK, rootMeanSquaredError
task_keywords – A list of keywords that capture the nature of the machine learning task. The keywords that can be combined to describe the task are the following: tabular, nested, multiLabel, video, linkPrediction, multivariate, graphMatching, forecasting, classification, graph, semiSupervised, text, timeSeries, clustering, collaborativeFiltering, univariate, missingMetadata, remoteSensing, multiClass, regression, multiGraph, lupi, relational, audio, grouped, objectDetection, vertexNomination, communityDetection, geospatial, image, overlapping, nonOverlapping, speech, vertexClassification, binary
method – Method to score the pipeline: holdout, cross_validation
stratified – Whether or not to split the data using a stratified strategy
shuffle – Whether or not to shuffle the data before splitting
folds – Number of folds to use when method is cross_validation
train_ratio – Proportion of the dataset to include in the train split
random_seed – Seed used by the random number generator
exclude_primitives – List of primitive names to exclude from the search space. If None, all the primitives will be used in the search
include_primitives – List of primitive names to include in the search space. If None, all the primitives will be used in the search
kwargs – Different arguments for problem’s settings (e.g. pos_label for binary problems using F1)
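The options above can be collected and sanity-checked before the search is launched. The sketch below is an illustration only: `build_search_kwargs` is a hypothetical helper, the dataset path and target column are placeholders, and the range checks on train_ratio and folds are assumptions about sensible values rather than documented library behavior.

```python
# Hypothetical helper (not part of the library): validates a few documented
# search_pipelines options and returns them as keyword arguments.
SCORING_METHODS = {"holdout", "cross_validation"}


def build_search_kwargs(dataset, time_bound, target, metric, task_keywords,
                        method="holdout", folds=10, train_ratio=0.7,
                        random_seed=0):
    if method not in SCORING_METHODS:
        raise ValueError(f"method must be one of {sorted(SCORING_METHODS)}")
    # Assumption: a train split proportion only makes sense strictly inside (0, 1).
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    # Assumption: cross-validation needs at least two folds.
    if method == "cross_validation" and folds < 2:
        raise ValueError("cross_validation needs at least 2 folds")
    return {"dataset": dataset, "time_bound": time_bound, "target": target,
            "metric": metric, "task_keywords": task_keywords, "method": method,
            "folds": folds, "train_ratio": train_ratio,
            "random_seed": random_seed}


search_kwargs = build_search_kwargs(
    dataset="train.csv",   # placeholder path
    time_bound=10,         # minutes for the whole search
    target="label",        # placeholder column name
    metric="f1Macro",
    task_keywords=["classification", "multiClass", "tabular"],
    method="cross_validation",
    folds=5,
)
# automl.search_pipelines(**search_kwargs)  # automl is an AutoML instance
```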
-
train
(pipeline_id, expose_outputs=None) Train a model using a specific ML pipeline
- Parameters
pipeline_id – Pipeline id
expose_outputs – The outputs of the pipeline steps to expose. If None, no step output is exposed. If a str, it should be ‘all’ to show the output of each step in the pipeline. If a list, it should contain the ids of the steps, e.g. ‘steps.2.produce’
- Returns
An id of the fitted pipeline with/without the pipeline step outputs
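The three accepted forms of expose_outputs can be produced from step indices. This is a minimal sketch under one assumption: that every step id follows the ‘steps.N.produce’ pattern of the single example given above. `make_expose_outputs` is a hypothetical helper, not part of the library.

```python
def make_expose_outputs(steps=None):
    """Return None, 'all', or a list of ids such as 'steps.2.produce'.

    Hypothetical helper: assumes each step id follows the pattern
    'steps.<index>.produce' shown in the parameter description.
    """
    if steps is None or steps == "all":
        return steps
    return [f"steps.{index}.produce" for index in steps]


selected = make_expose_outputs([0, 2])
# automl.train(pipeline_id, expose_outputs=selected)
```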
-
test
(pipeline_id, test_dataset, expose_outputs=None, calculate_confidence=False) Test a model
- Parameters
pipeline_id – The id of a fitted pipeline
test_dataset – Path to the dataset. D3M datasets and CSV files are supported
expose_outputs – The outputs of the pipeline steps to expose. If None, no step output is exposed. If a str, it should be ‘all’ to show the output of each step in the pipeline. If a list, it should contain the ids of the steps, e.g. ‘steps.2.produce’
calculate_confidence – Whether or not to return the confidence instead of the predictions
- Returns
A dataframe that contains the predictions with/without the pipeline step outputs
-
score
(pipeline_id, test_dataset) Compute a proper score of the model
- Parameters
pipeline_id – The id of a pipeline or a Pipeline object
test_dataset – Path to the dataset. D3M datasets and CSV files are supported
- Returns
A tuple holding metric name and score value
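The train, test, and score methods are typically chained once a search has finished. The sketch below shows one possible ordering, using only the method names and return values documented in this section. It assumes that train returns just the fitted pipeline id when expose_outputs is None. `FakeAutoML` is a stand-in with made-up return values so that the example runs without the library or a container runtime; in practice an AutoML instance would be passed instead.

```python
def evaluate_best_pipeline(automl, test_dataset):
    """Train, test, and score the best pipeline from a finished search."""
    best_id = automl.get_best_pipeline_id()
    # Assumption: with expose_outputs left as None, train returns only the id.
    fitted_id = automl.train(best_id)
    predictions = automl.test(fitted_id, test_dataset)
    metric_name, score_value = automl.score(best_id, test_dataset)
    return predictions, metric_name, score_value


class FakeAutoML:
    """Stand-in with made-up return values; not the real AutoML class."""

    def get_best_pipeline_id(self):
        return "pipeline-1"

    def train(self, pipeline_id, expose_outputs=None):
        return f"fitted-{pipeline_id}"

    def test(self, pipeline_id, test_dataset, expose_outputs=None,
             calculate_confidence=False):
        return [0, 1, 1]  # the real method returns a dataframe

    def score(self, pipeline_id, test_dataset):
        return ("accuracy", 0.9)


predictions, metric_name, score_value = evaluate_best_pipeline(
    FakeAutoML(), "test.csv")  # placeholder dataset path
```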
-
save_pipeline
(pipeline_id, output_folder) Save a pipeline on disk
- Parameters
pipeline_id – The id of the pipeline to be saved
output_folder – Path to the folder where the pipeline will be saved
-
load_pipeline
(pipeline_path) Load a previously saved pipeline
- Parameters
pipeline_path – Path to the folder where the pipeline is saved
-
get_best_pipeline_id
() Get the id of the best pipeline
- Returns
The id of the best pipeline
-
list_primitives
() Get a list of primitives used by the AutoML system
- Returns
List of primitives used by the AutoML system
-
create_pipelineprofiler_inputs
(test_dataset=None, source_name=None) Create a proper input supported by PipelineProfiler based on the pipelines generated by an AutoML system
- Parameters
test_dataset – Path to the dataset. If None, the search scores are used; otherwise, the pipelines are scored on the given dataset
source_name – Name of the pipeline source. If None, the AutoML id is used
- Returns
List of pipelines in the PipelineProfiler input format
-
create_textanalizer_inputs
(dataset, text_column, label_column, positive_label=1, negative_label=0) Create a proper input supported by VisualTextAnalyzer
- Parameters
dataset – Path to the dataset. D3M datasets and CSV files are supported
text_column – Name of the column that contains the texts
label_column – Name of the column that contains the classes
positive_label – Label for the positive class
negative_label – Label for the negative class
-
export_pipeline_code
(pipeline_id, ipython_cell=True) Convert a pipeline description to an executable Python script
- Parameters
pipeline_id – Pipeline id
ipython_cell – Whether or not to show the Python code in a Jupyter Notebook cell
-
end_session
() Safely end the session in the D3M interface
-
plot_leaderboard
() Plot pipelines’ leaderboard
-
plot_summary_dataset
(dataset, text_column=None) Plot histograms of the dataset
- Parameters
dataset – Path to the dataset. D3M datasets and CSV files are supported
text_column – Name of the column that contains the texts. Only needed for D3M datasets that have collections
-
plot_comparison_pipelines
(test_dataset=None, source_name=None, precomputed_pipelines=None) Plot PipelineProfiler visualization
- Parameters
test_dataset – Path to the dataset. If None, the search scores are used; otherwise, the pipelines are scored on the given dataset
source_name – Name of the pipeline source. If None, the AutoML id is used
precomputed_pipelines – If not None, it loads pipelines previously computed
-
plot_text_analysis
(dataset=None, text_column=None, label_column=None, positive_label=1, negative_label=0, precomputed_data=None) Plot a visualization for text datasets
- Parameters
dataset – Path to the dataset. D3M datasets and CSV files are supported
text_column – Name of the column that contains the texts
label_column – Name of the column that contains the classes
positive_label – Label for the positive class
negative_label – Label for the negative class
precomputed_data – If not None, it loads words/named entities previously computed
-
plot_text_explanation
(model_id, instance_text, text_column, label_column, num_features=5, top_labels=1) Plot a LIME visualization for model explanation
- Parameters
model_id – Model id
instance_text – Text to be explained
text_column – Name of the column that contains the texts
label_column – Name of the column that contains the classes
num_features – Maximum number of features present in the explanation
top_labels – Number of labels with highest prediction probabilities to use in the explanations
-
static
add_new_automl
(automl_id, docker_image_url) Add a new AutoML system that is not already defined in the D3M Interface. It can also be a different version of a pre-existing AutoML system, but it must then be added under a different name
- Parameters
automl_id – An id to identify the new AutoML
docker_image_url – The docker image url of the new AutoML