Sequifier enables training and inference of many-to-many sequence autoregression models using the decoder-only transformer architecture. It also enables preprocessing of the data from a parsimonious representation into the one consumed by the training step, while model inference outputs data in the same representation as was fed into the preprocessing step. Each of these steps is configurable in many ways, and this document lays out what the configuration parameters are, what purpose they serve, and any relevant considerations in setting them.
project_path
: the path in which the sequifier project is configured and executed. To set up a sequifier project, run sequifier make PROJECT_NAME, and then PROJECT_NAME is the root directory for that sequifier project. Usually, you would execute the subsequent sequifier commands from that directory, and then the project_path is '.'
data_path
: the path to the file that contains the input data to the preprocessing step
read_format
: the file type of the input data; currently 'csv' and 'parquet' are accepted. Default is 'csv'
write_format
: the file type of the preprocessed data, i.e. the input data to the training step. Default is 'parquet'
selected_columns
: a list of input columns that should be preprocessed into the training data format. Especially when the input data has many columns that will not be used in modelling, it is worth explicitly setting the columns to be included in the training data. If this value isn't set, all columns are included.
group_proportions
: in training and evaluating models, it is always necessary to split the data into training, validation and test datasets. This parameter configures the relative sizes of these groups. The standard approach is to use the group created from the first value as the training dataset, the second as the validation dataset and the third as the test dataset. If fewer or more splits are desired, this is also possible by passing the corresponding number of proportion values. The data is split per sequence, so that, for example, the first 80% of each sequence is used for training, the subsequent 10% for validation and the last 10% for testing. If the training, validation and test sets should each be composed of different sequences, the value '1.0' has to be passed here, and the preprocessing output has to be manually split into these datasets afterwards.
seq_length
: the sequence length of the preprocessed data, which is usually identical to the sequence length of the model input in the training step. If a range of model input sequence lengths needs to be evaluated, the largest of these values should be passed here, so that preprocessing doesn't have to be repeated subsequently
seq_step_size
: gives the relative index of each subsequent subsequence. For example, if it is set to 1, each subsequence starts one row after the previous one. If set to seq_length, there is no or minimal overlap between subsequences. The overlap is non-zero when the overall sequence length is not divisible by seq_length, as the subsequences are then arranged so that no values of the sequence are ignored. The default value of seq_step_size is seq_length
max_rows
: the maximum number of input rows to process, useful for very large data when only a subset is needed, for example to validate a preprocessing configuration. Default is 'None', in which case all rows are processed
seed
: random seed used with numpy
n_cores
: the number of cores used in parallel to preprocess the data. If left empty, it uses the number of available CPU cores. Occasionally, parallel processing leads to bugs, in which case it is worth setting n_cores to 1, even if it takes longer.
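To make the shape of a preprocessing config concrete, here is an illustrative sketch. The parameter names are the ones documented above; the file location, column names and all concrete values are assumptions for illustration, not authoritative defaults.

```yaml
# Hypothetical preprocessing config, e.g. configs/preprocess.yaml;
# all concrete values below are illustrative
project_path: .
data_path: data/events.csv            # assumed input file
read_format: csv
write_format: parquet
selected_columns: [itemId, price]     # assumed column names
group_proportions: [0.8, 0.1, 0.1]    # train / validation / test
seq_length: 48
seq_step_size: 48                     # defaults to seq_length
max_rows: null                        # process all rows
seed: 101
n_cores: 4
```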
The following parameters configure the training step.
project_path
: same as in preprocessing, sets the path to the project directory relative to the execution location in the filesystem. Should usually be '.'
model_name
: name of the model being trained, used in various file paths (e.g. logging, model checkpoints, model weight outputs)
read_format
: file format of the input data; 'csv' and 'parquet' are allowed, default is 'parquet'
ddconfig_path
: 'ddconfig' stands for 'data driven config', a json file created by the preprocessing step that contains various metadata. The path to the ddconfig is typically 'PROJECT_FOLDER/configs/ddconfig/DATASET_NAME.json'
selected_columns
: input columns to train the model on. Note that these are 'columns' in the input data to the preprocessing step; in the input data to the training step they are values of the 'inputCol' column
target_columns
: target columns for model training. Note that these are 'columns' in the input data to the preprocessing step; in the input data to the training step they are values of the 'inputCol' column
target_column_types
: the column types of the target columns. Each target column needs to be mapped to either 'categorical' or 'real', as these are the two options
seq_length
: the sequence length of the input sequences to the model
inference_batch_size
: the batch size used by the model after export. This is mainly relevant for the onnx model format
seed
: seed set for numpy and pytorch
export_onnx
: export the model in ONNX format, defaults to True
export_pt
: export the pytorch model with torch.save, defaults to False
export_with_dropout
: export the model with dropout enabled; applies only to the torch.save export, defaults to False
model_spec
: the specification of the transformer model architecture, beyond the input and target columns
training_spec
: the specification of the training run configuration
The model_spec consists of the following parameters:
d_model
: the number of expected features in the input (unless d_model is smaller than nhead, in which case nhead is used)
d_model_by_column
: the embedding dimensions for each input column. They must sum to d_model. Defaults to 'None', in which case the embedding space is automatically distributed equally between the input columns.
nhead
: the number of heads in the multi-head attention models
d_hid
: the dimension of the feedforward network model
nlayers
: the number of layers of the transformer model
The training_spec consists of the following parameters:
device
: the torch.device to train the model on: 'cuda', 'cpu', or 'mps' for apple silicon devices
epochs
: number of training epochs
iter_save
: the interval in epochs for checkpointing
batch_size
: training batch size
log_interval
: the interval in batches for logging
class_share_log_columns
: a list of column names for which the class shares of the predictions on the validation set should be logged, defaults to all categorical columns
lr
: learning rate
accumulation_steps
: if gradient accumulation should be used, set this to a value higher than 1
early_stopping_epochs
: the number of epochs to wait for a validation loss that improves on the existing minimum. If an improvement occurs, counting restarts from 0; if it doesn't, training stops. Defaults to 'None', in which case no early stopping is used.
dropout
: the dropout value of the transformer model
criterion
: the map from each target column to the loss function to be applied. All pytorch loss functions can be used, but the loss function must be compatible with the column type it is applied to (i.e. categorical or real)
optimizer
: the optimizer itself is specified with the 'name' attribute; other optimizer hyperparameters should be specified after it
scheduler
: the learning rate scheduler. As with the optimizer, the scheduler itself is specified with the 'name' attribute, with other hyperparameters coming after it
continue_training
: load the last checkpoint with the same model name and continue training from there
loss_weights
: a map from columns to loss weights, to weight the loss of each column separately. This can be necessary when using differently calibrated loss functions for different variables
The following parameters are typically read from the data driven config file created through the preprocessing step, with the path to that config file passed in the training config. For transparency, and in case the preprocessing step isn't used, they are listed here; a sketch of a complete training config follows the list.
training_data_path
: path to the training data, optional if ddconfig_path is passed, in which case the first of the preprocessing output data files is used if no value is specified in the config file
validation_data_path
: path to the validation data, optional if ddconfig_path is passed, in which case the second-to-last of the preprocessing output data files is used if no value is specified in the config file
column_types
: a map from each column to a numeric type, either 'int64' or 'float64'
n_classes
: the number of classes for each categorical column
id_maps
: for each categorical column, a map from the distinct values of that column to their index + 1, for modelling purposes
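Putting the training parameters together, a training config might look like the following sketch. The representation of model_spec and training_spec as nested maps, and all file paths, column names and values, are illustrative assumptions based on the descriptions above; the data driven parameters are omitted here because they are supplied via ddconfig_path.

```yaml
# Hypothetical training config, e.g. configs/train.yaml;
# structure and values are illustrative
project_path: .
model_name: sequifier-default
read_format: parquet
ddconfig_path: configs/ddconfig/events.json   # supplies the data driven parameters
selected_columns: [itemId, price]
target_columns: [itemId]
target_column_types: {itemId: categorical}
seq_length: 48
inference_batch_size: 10
seed: 101
export_onnx: true
export_pt: false
model_spec:
  d_model: 128
  nhead: 8
  d_hid: 256
  nlayers: 2
training_spec:
  device: cuda
  epochs: 100
  iter_save: 10
  batch_size: 1000
  log_interval: 10
  lr: 0.001
  dropout: 0.1
  criterion: {itemId: CrossEntropyLoss}       # must match the column type
  optimizer: {name: Adam}                     # 'name' selects the optimizer
  scheduler: {name: StepLR, step_size: 1, gamma: 0.99}
  continue_training: false
```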
The following parameters configure the inference step.
project_path
: same as in preprocessing, sets the path to the project directory relative to the execution location in the filesystem. Should usually be '.'
ddconfig_path
: same as in training: the path to the data driven config json file created by the preprocessing step, typically 'PROJECT_FOLDER/configs/ddconfig/DATASET_NAME.json'
model_path
: path to the model output by the training step
data_path
: path to the data to infer on, typically the last dataset from the preprocessing step output
read_format
: the file type of the input data; currently 'csv' and 'parquet' are accepted. Default is 'parquet'
write_format
: the file type of the inference output, defaults to 'csv'
selected_columns
: model input columns, must be identical to the training config
target_columns
: model output columns, must be identical to the training config
target_column_types
: the column types of the target columns. Each target column needs to be mapped to either 'categorical' or 'real', as these are the two options. Must be identical to the training config
output_probabilities
: output the probability distributions across categorical values for categorical target columns, defaults to False
map_to_id
: map categorical output values back into the space of input values to the preprocessing step, defaults to True
device
: the torch.device to run inference on: 'cuda', 'cpu', or 'mps' for apple silicon devices. For onnx models, values other than 'cpu' depend on GPU tooling for onnx
seq_length
: the input sequence length to the model, must be the same as in the training config
inference_batch_size
: batch size for inference; must be the same as in the training config for onnx models, but can be different for torch.save models
autoregression
: infer every step after the first from predicted values, rather than ground truth values, defaults to True
autoregression_additional_steps
: infer not only for the steps in each sequence, but add N additional steps. Only works with autoregressive inference
training_config_path
: path to the training config, must be passed to infer with torch.save models
infer_with_dropout
: keep dropout active during inference; works only with torch.save models, defaults to False
There are additional parameters to the inference step that are typically read from the data driven config, as listed above. If the data driven config path isn't passed, these need to be set manually for inference as well.
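Finally, an illustrative inference config under the same assumptions; all paths and concrete values are hypothetical.

```yaml
# Hypothetical inference config, e.g. configs/infer.yaml;
# structure and values are illustrative
project_path: .
ddconfig_path: configs/ddconfig/events.json
model_path: models/sequifier-default.onnx    # assumed export path
data_path: data/events-split2.parquet        # assumed test split from preprocessing
read_format: parquet
write_format: csv
selected_columns: [itemId, price]            # identical to the training config
target_columns: [itemId]
target_column_types: {itemId: categorical}
output_probabilities: false
map_to_id: true
device: cpu
seq_length: 48
inference_batch_size: 10
autoregression: true
infer_with_dropout: false
```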