Introducing Sequifier

Autoregressive transformer models have proven exceedingly effective at natural language processing, yielding the large language models that power products such as ChatGPT and Claude. Their record in time series modelling is more mixed, but in other application domains, such as biological sequences, the 'standard' architecture has also shown promising results. There remain a large number of problems to which these models have not yet been applied, and we will not know how useful they are there until we try. Sequifier aims to make this process faster, cheaper, and more easily reproducible.

The key to achieving these goals is to implement the architecture and the processing steps around model training and inference once, with sufficient flexibility to adapt to different input types, and to make everything configurable via standardized config files. This makes training runs with different configurations on the same data directly comparable, and enables us to draw lessons on sequence modelling that might generalize from one dataset, and even one domain, to others.

Beyond speed and reproducibility, I also hope that Sequifier will make transformer modelling accessible to a wider audience of researchers and scientists. Currently, developing custom deep learning models requires considerable knowledge of the deep learning ecosystem and of how to organize medium-sized Python codebases. These are skills that many in the specialist sciences, such as the biological, social, and earth sciences, are not routinely taught. Yet these are the people with the expertise to specify problems in ways that could yield useful models. Sequifier lowers the barrier so that anyone who can set up a Python environment and preprocess their data into the requisite format can develop a custom transformer model, expanding the space of problems that can be modelled with transformers.

(The package has not yet been audited by a third party; until that happens, it is available only as an alpha release.)

Workflow

The workflow consists of creating a Sequifier project and then adapting the config files for the three steps: preprocessing, training, and inference with the transformer model. The exact commands are documented in the README of the project itself.
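To make the config-driven workflow concrete, here is a rough sketch of what a preprocessing config could look like. All field names and values are assumptions for illustration and may not match the current release; the README and the config templates generated with a new project are authoritative.

```yaml
# Hypothetical preprocessing config (field names are illustrative assumptions)
project_path: .
data_path: data/observations.csv    # one row per event, ordered within each sequence
seq_length: 48                      # number of preceding steps available to the model
group_proportions: [0.8, 0.1, 0.1]  # train / validation / test split
```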

The main lever for model development is the training config: it sets the number of layers, attention heads, and embedding dimensions; specifies a loss function for each feature variable; and configures the optimizer, learning rate, and learning rate schedule. The ability to model a large number of variables with their sequential interdependencies, combined with these architecture settings, opens up a very large design space of possible models.
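The following sketch illustrates how these levers might be laid out in a training config. As above, the field names are assumptions (loosely following the naming conventions of PyTorch transformer modules) rather than the exact schema of the package.

```yaml
# Hypothetical training config (field names are illustrative assumptions)
model_name: example-model
training_data_path: data/example-train.parquet
validation_data_path: data/example-validation.parquet
target_columns: [event_type]
model_spec:
  nlayers: 4     # number of transformer layers
  nhead: 8       # attention heads per layer
  d_model: 128   # embedding dimension
training_spec:
  epochs: 100
  batch_size: 512
  lr: 0.001
  optimizer: Adam
  scheduler: StepLR
  loss:
    event_type: CrossEntropyLoss   # one loss per target variable
```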

The main lever for model analysis is running inference with these models on hold-out or test data and applying standard classification and regression evaluation methods. Additional options include outputting class probabilities, which might enable more sophisticated analyses of the learned distributions, and randomization during inference, which could enable measuring the variance of real-valued target variables. It is also possible to generate outputs autoregressively.
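An inference config might, in the same illustrative spirit, expose these options roughly as follows; the actual option names may differ, so the README remains the reference:

```yaml
# Hypothetical inference config (field names are illustrative assumptions)
model_path: models/example-model.pt
data_path: data/example-test.parquet
output_probabilities: true  # write class probabilities alongside the predictions
autoregression: true        # feed predictions back in to generate whole sequences
randomize: false            # enable to sample outputs instead of taking the argmax
```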

Validated models could then be used for forecasting or prediction tasks, using the inference functionality that Sequifier provides.

An Early Example

Sequifier enabled Morgan Rivers and me to develop two separate generative models of sperm whale "language" (aka Whale-GPT) in less than a day each. Sperm whales emit "clicks" in recognizable patterns called codas. The precise function of these codas has not yet been identified, but for our purposes, we considered them analogous to either syllables or words. Sperm whales often communicate in groups, sometimes vocalizing simultaneously. This yields two ways these conversations can be represented for transformer modelling.

Firstly, we can take the perspective of an individual whale, recording the codas emitted by that "first-person perspective" whale, along with other associated properties, in one set of columns, and the click emissions of all other whales in a second set of columns. This models whale communication as a call-and-response structure between the individual and the group, as in the sketch below.
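A few hypothetical rows in this representation might look as follows; the column names and coda labels are invented for illustration:

```
position  self_coda  other_coda
0         1+1+3      -
1         -          5R1
2         1+1+3      -
```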

Secondly, we can put the emitted codas of all whales into the same set of columns, add an indicator column identifying which whale emits each coda, and an additional "synchrony" column indicating whether a given coda is emitted in synchrony with a coda from another whale. This representation has more of a "third-person" character and is designated in the codebase with the "script" suffix; a sketch follows.
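In the same invented notation, a "script" representation might look roughly like this:

```
position  whale_id  coda   synchrony
0         A         1+1+3  false
1         B         5R1    false
2         A         4R2    true
3         B         4R2    true
```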

For each of these data representations, we developed an autoregressive transformer model that learns to predict codas and other associated features from the prior "conversation". The call/response model could enable the generation of responses to recorded sperm whale communication, albeit without, on its own, improving our understanding of what is being "said". External validity remains elusive, but it was a fun exercise and an illustration of how models of animal language could be developed, ideally with more data and features.

The repository contains the config files used to train these models and run inference with them, which may be helpful in understanding how these files can be customized for a specific modelling task.