TRAIN command

Uses automated machine learning to create an optimum predictive model using a training data set.

Note

The TRAIN command is not supported if you are running Analytics on a 32-bit computer. The computation required by the command is processor-intensive and better suited to 64-bit computers.

Syntax

TRAIN {CLASSIFIER|REGRESSOR} <ON> key_field <...n> TARGET labeled_field SCORER {ACCURACY|AUC|F1|LOGLOSS|PRECISION|RECALL|MAE|MSE|R2} SEARCHTIME minutes MAXEVALTIME minutes MODEL model_name TO table_name <IF test> <WHILE test> <FIRST range|NEXT range> FOLDS number_of_folds <SEED seed_value> <LINEAR> <NOFP>

Note

The maximum supported size of the data set used with the TRAIN command is 1 GB.

Parameters

Name Description
CLASSIFIER | REGRESSOR

The prediction type to use when training a predictive model:

  • CLASSIFIER: use classification algorithms to train a model

    Use classification if you want to predict which class or category records belong to.

  • REGRESSOR: use regression algorithms to train a model

    Use regression if you want to predict numeric values associated with records.

ON key_field <...n>

One or more training input fields.

Fields can be character, numeric, or logical. Multiple fields must be separated by spaces.

Note

Character fields must be "categorical". That is, they must identify categories or classes, and contain a maximum number of unique values.

The maximum is specified by the Maximum Categories option (Tools > Options > Command).
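
Before training, it can help to check outside Analytics whether a character field is likely to qualify as categorical. The Python sketch below is not part of Analytics: the function name is hypothetical, max_categories stands in for whatever value you have set in the Maximum Categories option, and treating the limit as inclusive is an assumption.

```python
def is_categorical(values, max_categories):
    """Return True if the field has no more than max_categories distinct values.

    max_categories mirrors the Maximum Categories option; treating the
    limit as inclusive is an assumption made for this sketch.
    """
    return len(set(values)) <= max_categories


job_category = ["Clerical", "Manager", "Clerical", "Technical", "Manager"]
print(is_categorical(job_category, max_categories=3))  # True: 3 distinct values
print(is_categorical(job_category, max_categories=2))  # False: exceeds the limit
```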

TARGET labeled_field

The field that the model is being trained to predict based on the training input fields.

The different prediction types (classification or regression) work with different field data types:

Valid with CLASSIFIER: a character or logical target field
Valid with REGRESSOR: a numeric target field

SCORER ACCURACY | AUC | F1 | LOGLOSS | PRECISION | RECALL | MAE | MSE | R2

The metric to use when scoring (tuning and ranking) the generated models.

The generated model with the best value for this metric is kept, and the rest of the models are discarded.

A different subset of metrics is valid depending on the prediction type you are using (classification or regression):

Valid with CLASSIFIER: ACCURACY | AUC | F1 | LOGLOSS | PRECISION | RECALL
Valid with REGRESSOR: MAE | MSE | R2

Note

The classification metric AUC is only valid when labeled_field contains binary data – that is, two classes, such as Yes/No, or True/False.
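
The scorer rules above can be expressed as a small pre-flight check, for example in a tool that generates Analytics scripts. This Python sketch is not part of Analytics. The function name and the target_classes argument (the number of distinct values in the target field) are hypothetical.

```python
CLASSIFIER_SCORERS = {"ACCURACY", "AUC", "F1", "LOGLOSS", "PRECISION", "RECALL"}
REGRESSOR_SCORERS = {"MAE", "MSE", "R2"}


def check_scorer(prediction_type, scorer, target_classes=None):
    """Raise ValueError if the scorer does not fit the prediction type.

    target_classes is the number of distinct values in the target field.
    It is only consulted when the scorer is AUC.
    """
    valid = {"CLASSIFIER": CLASSIFIER_SCORERS, "REGRESSOR": REGRESSOR_SCORERS}
    if prediction_type not in valid:
        raise ValueError(f"Unknown prediction type: {prediction_type}")
    if scorer not in valid[prediction_type]:
        raise ValueError(f"{scorer} is not valid with {prediction_type}")
    # AUC is only valid for a binary target, such as Yes/No or True/False.
    if scorer == "AUC" and target_classes != 2:
        raise ValueError("AUC requires a target field with exactly two classes")
    return True


check_scorer("CLASSIFIER", "LOGLOSS")
check_scorer("CLASSIFIER", "AUC", target_classes=2)
check_scorer("REGRESSOR", "MSE")
```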

SEARCHTIME minutes

The total time in minutes to spend training and optimizing a predictive model.

Training and optimizing involves searching across different pipeline configurations (different model, preprocessor, and hyperparameter combinations).

Note

Total runtime of the TRAIN command is SEARCHTIME plus up to twice MAXEVALTIME.

Tip

Specify a SEARCHTIME that is at least 10 times the MAXEVALTIME.

This time allotment strikes a reasonable balance between processing time and allowing a variety of model types to be evaluated.

MAXEVALTIME minutes

Maximum runtime in minutes per model evaluation.

Tip

Allot 45 minutes for every 100 MB of training data.

This time allotment strikes a reasonable balance between processing time and allowing a variety of model types to be evaluated.
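
The two tips and the runtime note for SEARCHTIME and MAXEVALTIME combine into simple arithmetic. The Python sketch below is not part of Analytics. The function name is hypothetical, and rounding up to whole minutes with a floor of 1 minute is an assumption made for this sketch.

```python
import math


def suggest_time_budget(data_size_mb):
    """Return (maxevaltime, searchtime, worst_case_runtime) in minutes."""
    # Tip: allot 45 minutes of MAXEVALTIME for every 100 MB of training data.
    # Rounding up to a whole minute (minimum 1) is an assumption.
    maxevaltime = max(1, math.ceil(45 * data_size_mb / 100))
    # Tip: SEARCHTIME should be at least 10 times MAXEVALTIME.
    searchtime = 10 * maxevaltime
    # Note: total runtime is SEARCHTIME plus up to twice MAXEVALTIME.
    worst_case = searchtime + 2 * maxevaltime
    return maxevaltime, searchtime, worst_case


print(suggest_time_budget(200))  # (90, 900, 1080)
```

The returned SEARCHTIME is the minimum suggested by the tip. A larger value, such as the SEARCHTIME 960 with MAXEVALTIME 90 used in the examples in this topic, gives a worst-case runtime of 960 + 2 × 90 = 1140 minutes.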

MODEL model_name

The name of the model file output by the training process.

The model file contains the model best fitted to the training data set. You will input the model to the PREDICT command to generate predictions about a new, unseen data set.

Specify model_name as a quoted string. For example: MODEL "Loan_default_prediction"

You can specify the *.model file extension, or let Analytics automatically specify it.

By default, the model file is saved to the folder containing the Analytics project.

Use either an absolute or relative file path to save the model file to a different, existing folder:

  • MODEL "C:\Loan_default_prediction"
  • MODEL "ML Train output\Loan_default_prediction.model"

TO table_name

The name of the model evaluation table output by the training process.

The model evaluation table contains two distinct types of information:

  • Scorer/Metric: for the classification or regression metrics, quantitative estimates of the predictive performance of the model file output by the training process

    Different metrics provide different types of estimates. Scorer identifies the metric you specified with SCORER. Metric identifies the metrics you did not specify.

  • Importance/Coefficient: in descending order, values indicating how much each feature (predictor) contributes to the predictions made by the model

Specify table_name as a quoted string with a .FIL file extension. For example: TO "Model_evaluation.FIL"

By default, the table data file (.FIL) is saved to the folder containing the Analytics project.

Use either an absolute or relative file path to save the data file to a different, existing folder:

  • TO "C:\Model_evaluation.FIL"
  • TO "ML Train output\Model_evaluation.FIL"

Note

Table names are limited to 64 alphanumeric characters, not including the .FIL extension. The name can include the underscore character ( _ ), but no other special characters, or any spaces. The name cannot start with a number.
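
The naming rules in this note can be checked with a regular expression before a script is run. The Python sketch below is not part of Analytics. The function name is hypothetical, and restricting "alphanumeric" to ASCII letters and digits is an assumption.

```python
import re

# Letters, digits, and underscores only; must not start with a digit;
# 1 to 64 characters. ASCII-only matching is an assumption of this sketch.
_TABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,63}")


def is_valid_table_name(name):
    """Check a table name given without its folder path or .FIL extension."""
    return _TABLE_NAME.fullmatch(name) is not None


print(is_valid_table_name("Model_evaluation"))   # True
print(is_valid_table_name("Model evaluation"))   # False: contains a space
print(is_valid_table_name("2024_evaluation"))    # False: starts with a number
```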

IF test

optional

A conditional expression that must be true in order to process each record. The command is executed on only those records that satisfy the condition.

Note

The IF parameter is evaluated against only the records remaining in a table after any scope parameters have been applied (WHILE, FIRST, NEXT).

WHILE test

optional

A conditional expression that must be true in order to process each record. The command is executed until the condition evaluates as false, or the end of the table is reached.

Note

If you use WHILE in conjunction with FIRST or NEXT, record processing stops as soon as one limit is reached.

FIRST range | NEXT range

optional

The number of records to process:

  • FIRST: start processing from the first record until the specified number of records is reached
  • NEXT: start processing from the currently selected record until the specified number of records is reached

Use range to specify the number of records to process.

If you omit FIRST and NEXT, all records are processed by default.
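
The interaction of IF, WHILE, FIRST, and NEXT described above can be illustrated with a short simulation. The Python sketch below is not part of Analytics and is not how Analytics is implemented. It only models the documented behavior: scope parameters (WHILE, FIRST, NEXT) limit which records are read, and IF then filters the records that remain. All names are hypothetical.

```python
def select_records(records, start=0, limit=None, while_test=None, if_test=None):
    """Return the records a command would process.

    start       index of the currently selected record (0 models FIRST)
    limit       the range value for FIRST or NEXT; None means no limit
    while_test  predicate; processing stops at the first record that fails it
    if_test     predicate; records that fail it are skipped, processing continues
    """
    selected = []
    processed = 0
    for record in records[start:]:
        if limit is not None and processed >= limit:
            break  # FIRST/NEXT limit reached
        if while_test is not None and not while_test(record):
            break  # WHILE condition evaluated as false
        processed += 1
        # IF is evaluated only against records that remain in scope.
        if if_test is None or if_test(record):
            selected.append(record)
    return selected


amounts = [50, 120, 80, 300, 40, 500, 20]
# Models: FIRST 5, WHILE amount < 400, IF amount >= 80
print(select_records(amounts, limit=5,
                     while_test=lambda a: a < 400,
                     if_test=lambda a: a >= 80))  # [120, 80, 300]
```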

FOLDS number_of_folds

The number of cross-validation folds to use when evaluating and optimizing the model.

Folds are subdivisions of the training data set, and are used in a cross-validation process.

Typically, using from 5 to 10 folds yields good results when training a model. The minimum number of folds allowed is 2, and the maximum number is 10.

Tip

Increasing the number of folds can produce a better estimate of the predictive performance of a model, but it also increases overall runtime.
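
To make the idea of folds concrete, the Python sketch below shows a generic k-fold split: each fold serves once as the validation subset while the remaining folds are used for training. It is an illustration of the general technique, not the internal implementation in Analytics, which may shuffle or stratify records differently. The function name is hypothetical.

```python
def kfold_indices(n_records, n_folds):
    """Yield (train_indices, validation_indices) once per fold."""
    # TRAIN accepts between 2 and 10 folds.
    if not 2 <= n_folds <= 10:
        raise ValueError("FOLDS must be between 2 and 10")
    if n_records < n_folds:
        raise ValueError("Need at least one record per fold")
    # Distribute records as evenly as possible across the folds.
    base, extra = divmod(n_records, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_records))
        yield train, validation
        start += size


for train, validation in kfold_indices(10, 5):
    print(len(train), validation)
```

Each additional fold means one more round of training and validation per candidate model, which is why a higher FOLDS value increases overall runtime.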

SEED seed_value

optional

The seed value to use to initialize the random number generator in Analytics.

If you omit SEED, Analytics randomly selects the seed value.

Explicitly specify a seed value, and record it, if you want to replicate the training process with the same data set in the future.
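
The effect of a fixed seed can be demonstrated with any seeded random number generator. The Python sketch below uses Python's own generator, not the one in Analytics, so it only illustrates the principle: the same seed applied to the same data produces the same random sequence. The function name is hypothetical.

```python
import random


def shuffled_indices(n_records, seed):
    """Return record indices in a random order determined by seed."""
    rng = random.Random(seed)  # independent generator initialized with the seed
    indices = list(range(n_records))
    rng.shuffle(indices)
    return indices


first_run = shuffled_indices(8, seed=42)
second_run = shuffled_indices(8, seed=42)
print(first_run == second_run)  # True: same seed, same order
```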

LINEAR

optional

Train and score only linear models.

If LINEAR is omitted, all model types relevant to classification or regression are evaluated.

Note

With larger data sets, the training process typically completes more quickly if you include only linear models.

Including only linear models guarantees coefficients in the output.

NOFP

optional

Exclude feature selection and data preprocessing from the training process.

Feature selection is the automated selection of the fields in the training data set that are the most useful in optimizing the predictive model. Automated selection can improve predictive performance, and reduce the amount of data involved in model optimization.

Data preprocessing performs transformations such as scaling and standardizing on the training data set to make it better suited for the training algorithms.

Caution

Exclude feature selection and data preprocessing only if you have a specific reason for doing so.

Examples

Train a classification model

You want to train a classification model that you can use in a subsequent process to predict which loan applicants will default.

You train the model using a set of historical loan data with a known outcome for each loan, including whether the client defaulted.

In the subsequent prediction process, you will use the model produced by the TRAIN command to process current loan applicant data.

OPEN "Loan_applicants_historical"
TRAIN CLASSIFIER ON Age Job_Category Salary Account_Balance Loan_Amount Loan_Period Refinanced Credit_Score TARGET Default SCORER LOGLOSS SEARCHTIME 960 MAXEVALTIME 90 MODEL "Loan_default_prediction.model" TO "Model_evaluation.FIL" FOLDS 5

Train a regression model

You want to train a regression model that you can use in a subsequent process to predict the future sale price of houses.

You train the model using a set of recent house sales data, including the sale price.

In the subsequent prediction process, you will use the model produced by the TRAIN command to generate house price evaluations.

OPEN "House_sales"
TRAIN REGRESSOR ON Lot_Size Bedrooms Bathrooms Stories Driveway Recroom Full_Basement Gas_HW Air_conditioning Garage_Places Preferred_Area TARGET Price SCORER MSE SEARCHTIME 960 MAXEVALTIME 90 MODEL "House_price_prediction.model" TO "Model_evaluation.FIL" FOLDS 5
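
If you generate Analytics scripts from another tool, the command lines in the examples above can be assembled from their parts. The Python sketch below is not part of Analytics. The function and argument names are hypothetical, and it covers only the parameters shown in the Syntax section, without the optional IF, WHILE, FIRST, and NEXT parameters.

```python
def build_train_command(prediction_type, key_fields, target, scorer,
                        searchtime, maxevaltime, model_name, table_name,
                        folds, seed=None, linear=False, nofp=False):
    """Return a TRAIN command line following the parameter order in Syntax."""
    parts = [
        "TRAIN", prediction_type,
        "ON", " ".join(key_fields),      # multiple fields separated by spaces
        "TARGET", target,
        "SCORER", scorer,
        "SEARCHTIME", str(searchtime),
        "MAXEVALTIME", str(maxevaltime),
        "MODEL", f'"{model_name}"',      # quoted string
        "TO", f'"{table_name}"',         # quoted string with .FIL extension
        "FOLDS", str(folds),
    ]
    if seed is not None:
        parts += ["SEED", str(seed)]
    if linear:
        parts.append("LINEAR")
    if nofp:
        parts.append("NOFP")
    return " ".join(parts)


print(build_train_command(
    "REGRESSOR", ["Lot_Size", "Bedrooms", "Bathrooms"], "Price", "MSE",
    960, 90, "House_price_prediction.model", "Model_evaluation.FIL", 5))
```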

Remarks

For more information about how this command works, see Predicting classes and numeric values.