Sequifier enables training and inference of many-to-many sequence autoregression models using the decoder-only transformer architecture. It also enables preprocessing of the data from a parsimonious representation into the one consumed by the training step, while model inference outputs data in the same representation as was fed into the preprocessing step. Each of these steps is configurable in many ways, and this document lays out what the configuration parameters are, what purpose they serve, and any relevant considerations in setting them.
project_path
: the path in which the sequifier project is configured and executed. To set up a sequifier project, run sequifier make PROJECT_NAME, and then PROJECT_NAME is the root directory for that sequifier project. Usually, you would execute the subsequent sequifier commands from that directory, and then the project_path is '.'
data_path
: the path to the file that contains the input data to the preprocessing step
read_format
: the file type of the input data; currently 'csv' and 'parquet' are accepted. Default is 'csv'
write_format
: the file type of the preprocessed data, i.e. the input data to the training step. Default is 'parquet'
selected_columns
: a list of input columns that should be preprocessed into the training data format. Especially when the input data has many columns that will not be used in modelling, it is worth explicitly setting the columns to be included in the training data. If this value isn't set, all columns are included.
group_proportions
: in training and evaluating models, it is always necessary to split the data into training, validation and test datasets. This parameter configures the relative sizes of these groups. The standard approach is to use the group created from the first value as the training dataset, the second as the validation dataset and the third as the test dataset. If fewer or more splits are desired, this is also possible by passing the corresponding number of proportion values. The data is split per sequence, so that, for example, the first 80% of each sequence is used for training, the subsequent 10% for validation and the last 10% for testing. If the training, validation and test sets should each be composed of different sequences, the value '1.0' has to be passed here, and the preprocessing output has to be manually split into these datasets afterwards.
seq_length
: the sequence length of the preprocessed data, which is usually identical to the sequence length of the model input in the training step. If a range of model input sequence lengths needs to be evaluated, the largest of these values should be passed here, so that preprocessing doesn't have to be repeated subsequently
seq_step_size
: gives the relative index of each subsequent subsequence. For example, if it is set to 1, each subsequence starts one row after the previous one. If set to seq_length, there is no or minimal overlap between subsequences. The overlap is non-zero when the overall sequence length is not divisible by seq_length, as the subsequences are then arranged so that no values of the sequence are ignored. The default value of seq_step_size is seq_length
max_rows
: the maximum number of input rows to process, useful for very large data when only a subset is needed, for example to validate a preprocessing configuration. Default is 'None', in which case all rows are processed
seed
: random seed used with numpy
n_cores
: the number of cores used in parallel to preprocess the data. If left empty, it uses the number of available CPU cores. Occasionally, parallel processing leads to bugs, in which case it is worth setting n_cores to 1, even if it takes longer.
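To make the shape of a preprocessing config concrete, here is an illustrative sketch. The parameter names are the ones documented above; the file location, column names and all concrete values are assumptions for illustration, not authoritative defaults.

```yaml
# Hypothetical preprocessing config, e.g. configs/preprocess.yaml;
# all concrete values below are illustrative
project_path: .
data_path: data/events.csv            # assumed input file
read_format: csv
write_format: parquet
selected_columns: [itemId, price]     # assumed column names
group_proportions: [0.8, 0.1, 0.1]    # train / validation / test
seq_length: 48
seq_step_size: 48                     # defaults to seq_length
max_rows: null                        # process all rows
seed: 101
n_cores: 4
```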
The following parameters configure the training step.
project_path
: same as in preprocessing, sets the path to the project directory relative to the execution location in the filesystem. Should usually be '.'
model_name
: name of the model being trained, used in various file paths (e.g. logging, model checkpoints, model weight outputs)
read_format
: file format of the input data; 'csv' and 'parquet' are allowed, default is 'parquet'
ddconfig_path
: 'ddconfig' stands for 'data driven config', a json file created by the preprocessing step that contains various metadata. The path to the ddconfig is typically 'PROJECT_FOLDER/configs/ddconfig/DATASET_NAME.json'
selected_columns
: input columns to train the model on. Note that these are 'columns' in the input data to the preprocessing step; in the input data to the training step they are values of the 'inputCol' column
target_columns
: target columns for model training. Note that these are 'columns' in the input data to the preprocessing step; in the input data to the training step they are values of the 'inputCol' column
target_column_types
: the column types of the target columns. Each target column needs to be mapped to either 'categorical' or 'real', as these are the two options
seq_length
: the sequence length of the input sequences to the model
inference_batch_size
: the batch size used by the model after export. This is mainly relevant for the onnx model format
seed
: seed set for numpy and pytorch
export_onnx
: export the model in ONNX format, defaults to True
export_pt
: export the pytorch model with torch.save, defaults to False
export_with_dropout
: export the model with dropout enabled; applies only to the torch.save export, defaults to False
model_spec
: the specification of the transformer model architecture, beyond the input and target columns
training_spec
: the specification of the training run configuration
The model_spec consists of the following parameters:
d_model
: the number of expected features in the input (unless d_model is smaller than nhead, in which case nhead is used)
d_model_by_column
: the embedding dimensions for each input column. They must sum to d_model. Defaults to 'None', in which case the embedding space is automatically distributed equally between the input columns.
nhead
: the number of heads in the multi-head attention models
d_hid
: the dimension of the feedforward network model
nlayers
: the number of layers of the transformer model
The training_spec consists of the following parameters:
device
: the torch.device to train the model on: 'cuda', 'cpu', or 'mps' for apple silicon devices
epochs
: number of training epochs
iter_save
: the interval in epochs for checkpointing
batch_size
: training batch size
log_interval
: the interval in batches for logging
class_share_log_columns
: a list of column names for which the class shares of the predictions on the validation set should be logged, defaults to all categorical columns
lr
: learning rate
accumulation_steps
: if gradient accumulation should be used, set this to a value higher than 1
early_stopping_epochs
: the number of epochs to wait for a validation loss that improves on the existing minimum. If an improvement occurs, counting restarts from 0; if it doesn't, training stops. Defaults to 'None', in which case no early stopping is used.
dropout
: the dropout value of the transformer model
criterion
: the map from each target column to the loss function to be applied. All pytorch loss functions can be used, but the loss function must be compatible with the column type it is applied to (i.e. categorical or real)
optimizer
: the optimizer itself is specified with the 'name' attribute; other optimizer hyperparameters should be specified after it
scheduler
: the learning rate scheduler. As with the optimizer, the scheduler itself is specified with the 'name' attribute, with other hyperparameters coming after it
continue_training
: load the last checkpoint with the same model name and continue training from there
loss_weights
: a map from columns to loss weights, to weight the loss of each column separately. This can be necessary when using differently calibrated loss functions for different variables
The following parameters are typically read from the data driven config file created through the preprocessing step, with the path to that config file passed in the training config. For transparency, and in case the preprocessing step isn't used, they are listed here; a sketch of a complete training config follows the list.
training_data_path
: path to the training data, optional if ddconfig_path is passed, in which case the first of the preprocessing output data files is used if no value is specified in the config file
validation_data_path
: path to the validation data, optional if ddconfig_path is passed, in which case the second-to-last of the preprocessing output data files is used if no value is specified in the config file
column_types
: a map from each column to a numeric type, either 'int64' or 'float64'
n_classes
: the number of classes for each categorical column
id_maps
: for each categorical column, a map from the distinct values of that column to their index + 1, for modelling purposes
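Putting the training parameters together, a training config might look like the following sketch. The representation of model_spec and training_spec as nested maps, and all file paths, column names and values, are illustrative assumptions based on the descriptions above; the data driven parameters are omitted here because they are supplied via ddconfig_path.

```yaml
# Hypothetical training config, e.g. configs/train.yaml;
# structure and values are illustrative
project_path: .
model_name: sequifier-default
read_format: parquet
ddconfig_path: configs/ddconfig/events.json   # supplies the data driven parameters
selected_columns: [itemId, price]
target_columns: [itemId]
target_column_types: {itemId: categorical}
seq_length: 48
inference_batch_size: 10
seed: 101
export_onnx: true
export_pt: false
model_spec:
  d_model: 128
  nhead: 8
  d_hid: 256
  nlayers: 2
training_spec:
  device: cuda
  epochs: 100
  iter_save: 10
  batch_size: 1000
  log_interval: 10
  lr: 0.001
  dropout: 0.1
  criterion: {itemId: CrossEntropyLoss}       # must match the column type
  optimizer: {name: Adam}                     # 'name' selects the optimizer
  scheduler: {name: StepLR, step_size: 1, gamma: 0.99}
  continue_training: false
```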
The following parameters configure the inference step.
project_path
: same as in preprocessing, sets the path to the project directory relative to the execution location in the filesystem. Should usually be '.'
ddconfig_path
: same as in training: the path to the data driven config json file created by the preprocessing step, typically 'PROJECT_FOLDER/configs/ddconfig/DATASET_NAME.json'
model_path
: path to the model output by the training step
data_path
: path to the data to infer on, typically the last dataset from the preprocessing step output
read_format
: the file type of the input data; currently 'csv' and 'parquet' are accepted. Default is 'parquet'
write_format
: the file type of the inference output, defaults to 'csv'
selected_columns
: model input columns, must be identical to the training config
target_columns
: model output columns, must be identical to the training config
target_column_types
: the column types of the target columns. Each target column needs to be mapped to either 'categorical' or 'real', as these are the two options. Must be identical to the training config
output_probabilities
: output the probability distributions across categorical values for categorical target columns, defaults to False
map_to_id
: map categorical output values back into the space of input values to the preprocessing step, defaults to True
device
: the torch.device to run inference on: 'cuda', 'cpu', or 'mps' for apple silicon devices. For onnx models, values other than 'cpu' depend on GPU tooling for onnx
seq_length
: the input sequence length to the model, must be the same as in the training config
inference_batch_size
: batch size for inference; must be the same as in the training config for onnx models, but can be different for torch.save models
autoregression
: infer every step after the first from predicted values, rather than ground truth values, defaults to True
autoregression_additional_steps
: infer not only for the steps in each sequence, but add N additional steps. Only works with autoregressive inference
training_config_path
: path to the training config, must be passed to infer with torch.save models
infer_with_dropout
: keep dropout active during inference; works only with torch.save models, defaults to False
There are additional parameters to the inference step that are typically read from the data driven config, as listed above. If the data driven config path isn't passed, these need to be set manually for inference as well.
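Finally, an illustrative inference config under the same assumptions; all paths and concrete values are hypothetical.

```yaml
# Hypothetical inference config, e.g. configs/infer.yaml;
# structure and values are illustrative
project_path: .
ddconfig_path: configs/ddconfig/events.json
model_path: models/sequifier-default.onnx    # assumed export path
data_path: data/events-split2.parquet        # assumed test split from preprocessing
read_format: parquet
write_format: csv
selected_columns: [itemId, price]            # identical to the training config
target_columns: [itemId]
target_column_types: {itemId: categorical}
output_probabilities: false
map_to_id: true
device: cpu
seq_length: 48
inference_batch_size: 10
autoregression: true
infer_with_dropout: false
```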