How To

Problem Specification

In order to build a predictive model, AlphaD3M needs a problem specification that describes a prediction problem. A problem specification includes:

  • A target variable, i.e., what should be predicted by the predictive model. In the d3m-interface environment, the target is an attribute from the dataset.

  • A task_keywords variable, which specifies the kind of prediction task and, therefore, the kind of technique that should be used to solve the prediction problem. In the d3m-interface environment, the task_keywords parameter must be defined as a list of keywords that capture the nature of the machine learning task. Supported keywords include tabular, nested, multiLabel, video, linkPrediction, multivariate, graphMatching, forecasting, classification, graph, semiSupervised, text, timeSeries, clustering, collaborativeFiltering, univariate, missingMetadata, remoteSensing, multiClass, regression, multiGraph, lupi, relational, audio, grouped, objectDetection, vertexNomination, communityDetection, geospatial, image, overlapping, nonOverlapping, speech, vertexClassification, and binary. See the complete list in our API documentation.

  • A metric variable, which specifies the performance metric (see Evaluation Metrics) that you are interested in optimizing. Supported metrics include hammingLoss, accuracy, objectDetectionAP, rocAucMicro, f1Macro, meanSquaredError, f1, jaccardSimilarityScore, normalizedMutualInformation, rocAuc, f1Micro, hitsAtK, meanAbsoluteError, rocAucMacro, rSquared, recall, meanReciprocalRank, precision, precisionAtTopK, and rootMeanSquaredError. See the complete list in our API documentation.

More information about the problem schemas and related documentation is available in the data-supply repository.
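In d3m-interface, these three pieces of a problem specification are simply arguments passed to search_pipelines, as shown later in this example. A minimal sketch of the specification as a plain dictionary (the target, keywords, and metric values are the ones used in this example; the dictionary itself is just an illustration, not a d3m-interface data structure):

```python
# A problem specification: the target column, the task keywords,
# and the evaluation metric. These become the keyword arguments
# of AutoML.search_pipelines.
problem = {
    'target': 'Loan Status',                            # column to predict
    'task_keywords': ['classification', 'multiClass'],  # nature of the task
    'metric': 'accuracy',                               # performance metric
}

print(problem)
```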

First, import the class AutoML from d3m-interface:

[1]:
from d3m_interface import AutoML

In this example, we are generating pipelines for CSV datasets. The Credit dataset is used for this example. This dataset consists of loan records; the goal is to predict whether a loan applicant will fully repay or default on a loan.

Note

If you installed AlphaD3M from PyPI, please use the 'pypi' parameter in the constructor, like this: automl = AutoML(output_path, 'AlphaD3M', 'pypi')

[2]:
output_path = '/Users/rlopez/D3M/examples/tmp/'
train_dataset = '/Users/rlopez/D3M/examples/datasets/Credit/train_data.csv'
test_dataset = '/Users/rlopez/D3M/examples/datasets/Credit/test_data.csv'

In this example, we use AlphaD3M, developed by NYU, as the AutoML system.

[3]:
automl = AutoML(output_path, 'AlphaD3M')

Then, we specify the problem by setting up the target, the task_keywords, and the metric. Here, we are defining a multi-class classification problem, where the goal is to predict the ‘Loan Status’. In this problem, we will use accuracy as the performance metric.

[4]:
automl.search_pipelines(train_dataset, target='Loan Status', task_keywords=['classification', 'multiClass'], metric='accuracy', time_bound=5)
INFO: Reiceving a raw dataset, converting to D3M format
INFO: Initializing AlphaD3M AutoML...
INFO: AlphaD3M AutoML initialized!
INFO: Found pipeline id=a3d6bee1-cc24-4d67-9157-d3e1879e1fee, time=0:00:26.382475, scoring...
INFO: Scored pipeline id=a3d6bee1-cc24-4d67-9157-d3e1879e1fee, accuracy=0.77833
INFO: Found pipeline id=d82e5a47-76f6-48eb-b603-56f23f695fc2, time=0:00:44.600652, scoring...
INFO: Scored pipeline id=d82e5a47-76f6-48eb-b603-56f23f695fc2, accuracy=0.77917
INFO: Found pipeline id=6acb579c-d7ec-4125-b2ce-21770e39b5b2, time=0:01:02.807630, scoring...
INFO: Scored pipeline id=6acb579c-d7ec-4125-b2ce-21770e39b5b2, accuracy=0.8125
INFO: Found pipeline id=76f92c12-c034-4498-8561-65f94b27402a, time=0:01:21.103988, scoring...
INFO: Scored pipeline id=76f92c12-c034-4498-8561-65f94b27402a, accuracy=0.77083
INFO: Found pipeline id=3a6364e3-ec91-4839-98fb-2ba0a3e49207, time=0:01:39.402115, scoring...
INFO: Scored pipeline id=3a6364e3-ec91-4839-98fb-2ba0a3e49207, accuracy=0.76583
INFO: Found pipeline id=9426f447-72d2-4dd5-9fd6-13d25630487b, time=0:05:24.959725, scoring...
INFO: Found pipeline id=362ece87-b213-49eb-aace-3ebedad890d7, time=0:05:28.202371, scoring...
INFO: Search completed, still scoring some pending pipelines...
INFO: Scored pipeline id=9426f447-72d2-4dd5-9fd6-13d25630487b, accuracy=0.76167
INFO: Scored pipeline id=362ece87-b213-49eb-aace-3ebedad890d7, accuracy=0.22917
INFO: Scoring completed for all pipelines!
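The log above reports one accuracy score per discovered pipeline. As a self-contained sketch of how such results can be ranked, the snippet below selects the best-scoring pipeline from the (id, accuracy) pairs copied from the log (the ranking code is illustrative and not part of the d3m-interface API):

```python
# (pipeline_id, accuracy) pairs copied from the search log above.
scores = [
    ('a3d6bee1-cc24-4d67-9157-d3e1879e1fee', 0.77833),
    ('d82e5a47-76f6-48eb-b603-56f23f695fc2', 0.77917),
    ('6acb579c-d7ec-4125-b2ce-21770e39b5b2', 0.8125),
    ('76f92c12-c034-4498-8561-65f94b27402a', 0.77083),
    ('3a6364e3-ec91-4839-98fb-2ba0a3e49207', 0.76583),
    ('9426f447-72d2-4dd5-9fd6-13d25630487b', 0.76167),
    ('362ece87-b213-49eb-aace-3ebedad890d7', 0.22917),
]

# Pick the pipeline with the highest accuracy.
best_id, best_score = max(scores, key=lambda pair: pair[1])
print(best_id, best_score)  # the pipeline with accuracy 0.8125
```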

Exploring Tabular Datasets

plot_summary_dataset displays different views (compact, detail, and column views) to allow users to explore tabular datasets. It summarizes the column data using histograms. The column types are inferred using the datamart-profiler library. Additional column metadata, such as the mean, standard deviation, and number of unique values, is also shown in the column view. Use the tabs above the table to switch between the different views of the dataset.

Note

You can partially interact with this visualization. Try it in Jupyter Notebook to get full access to all features.

[10]:
automl.plot_summary_dataset(train_dataset)
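As a rough illustration of the per-column statistics shown in the column view, the sketch below computes the mean, standard deviation, and number of unique values for a toy numeric column using only the standard library (the data is invented; the real visualization is produced by plot_summary_dataset with types inferred by datamart-profiler):

```python
import statistics

# Toy numeric column standing in for one column of a tabular dataset.
column = [35000.0, 42000.0, 35000.0, 58000.0]

# The same kinds of metadata shown in the column view.
summary = {
    'mean': statistics.mean(column),
    'std': statistics.stdev(column),       # sample standard deviation
    'unique_values': len(set(column)),
}
print(summary)
```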

Exploring Text Datasets

plot_text_analysis displays a visualization to allow users to explore and analyze text data. It includes word frequency analysis and named entity recognition, which help users explore the fundamental characteristics of the text data. We use bar charts to create the visualizations, integrated with the Jupyter Notebook environment.

Word frequency analysis is a common task in text analytics: it measures the most frequently occurring words in a given text. Common stop words like 'to', 'in', and 'for' are removed before counting. Named entity recognition is an information extraction method that classifies the entities present in the text into predefined types like 'Person', 'Organization', and 'City'. With it, users can gain insight into the kinds of entities present in a textual dataset.
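Word frequency analysis with stop-word removal can be sketched in a few lines of standard-library Python (the sample sentence and stop-word list below are illustrative; plot_text_analysis performs this kind of counting internally and renders the result as bar charts):

```python
import re
from collections import Counter

text = "The attack in the city was reported to the police in the morning."
stop_words = {'the', 'in', 'to', 'was', 'for'}

# Tokenize, lowercase, and drop stop words before counting.
words = [w for w in re.findall(r'[a-z]+', text.lower()) if w not in stop_words]
frequencies = Counter(words)
print(frequencies.most_common(3))
```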

First, import the class AutoML from d3m-interface:

[11]:
from d3m_interface import AutoML

The IED attacks dataset in D3M format is used for this example.

[12]:
train_dataset_path = '/Users/rlopez/D3M/examples/JIDO_SOHR_Articles_1061/TRAIN'
test_dataset_path = '/Users/rlopez/D3M/examples/JIDO_SOHR_Articles_1061/TEST'
score_dataset_path = '/Users/rlopez/D3M/examples/JIDO_SOHR_Articles_1061/SCORE'
output_path = '/Users/rlopez/D3M/examples/tmp/'

In this example, we use AlphaD3M, developed by NYU.

[13]:
automl = AutoML(output_path, 'AlphaD3M')

plot_text_analysis requires three parameters:

  • dataset_path – Path to the dataset. Datasets in D3M format are supported

  • label_column – Name of the column that contains the categories

  • text_column – Name of the column that contains the texts

Note

You can partially interact with this visualization. Try it in Jupyter Notebook to get full access to all features.

[14]:
automl.plot_text_analysis(train_dataset_path, label_column='articleofinterest', text_column='article')
Word Frequency:
Analyzing 7779 documents (positive category)
Analyzing 12004 documents (negative category)
Named Entity Recognition:
Analyzing 7779 documents (positive category)
Analyzing 12004 documents (negative category)

Download this example as a Jupyter notebook file (.ipynb).

You can also find other Jupyter notebook examples showing how to use d3m-interface with text datasets here.