Sequence Data Configuration

Complete configuration guide for sequential data including text, DNA sequences, and time series.

Overview

Sequence inputs in EIR cover data such as:

  • Text data - Natural language, documents, reviews

  • Biological sequences - DNA, RNA, protein sequences

  • Time series - Sequential measurements over time

Quick Example

input_info:
  input_source: "my/sequence/data/"
  input_name: "dna_sequence"
  input_type: "sequence"
input_type_info:
  vocab_file: "dna_vocab.json"
  max_length: 1024
model_config:
  model_type: "sequence-default"
  embedding_dim: 128
  model_init_config:
    num_heads: 8
    num_layers: 6

Input Data Configuration

Base Configuration

class eir.setup.schemas.SequenceInputDataConfig(
vocab_file: None | str = None,
max_length: int | Literal['max', 'average'] = 'average',
sampling_strategy_if_longer: Literal['from_start', 'uniform'] = 'uniform',
min_freq: int = 10,
split_on: str | None = ' ',
tokenizer: Literal['basic_english'] | Literal['basic'] | Literal['spacy'] | Literal['moses'] | Literal['toktok'] | Literal['revtok'] | Literal['subword'] | Literal['bpe'] | str | None = None,
tokenizer_language: str | None = None,
adaptive_tokenizer_max_vocab_size: int | None = None,
mixing_subtype: Literal['mixup'] = 'mixup',
modality_dropout_rate: float = 0.0,
)
Parameters:
  • vocab_file

    An optional text file containing a pre-defined vocabulary to use for training. If this is not passed in, the framework will automatically build the vocabulary from the training data. Passing in a vocabulary file is therefore useful if (a) you want to manually specify / limit the vocabulary used and/or (b) you want to save time by pre-computing the vocabulary.

    Two formats are supported:

    • A .json file containing a dictionary with the vocabulary as keys and the corresponding token IDs as values. For example: {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

    • A .json file with the results of training and saving the vocabulary of a Huggingface BPE tokenizer. This is the file created by calling hf_tokenizer.save(). This is only valid when using the bpe tokenizer.

  • max_length – Maximum length to truncate/pad sequences to. This can be an integer or the values ‘max’ or ‘average’. The ‘max’ keyword will use the maximum sequence length found in the training data, while the ‘average’ will use the average length across all training samples.

  • sampling_strategy_if_longer – Controls how sequences are truncated if they are longer than the specified max_length parameter. Using ‘from_start’ always takes the slice starting at the beginning of the sequence, ensuring that a given sample is truncated the same way every time during training. Setting this parameter to ‘uniform’ will uniformly sample a slice of a given sample sequence during training. Note that for consistency, the validation/test set samples always use the ‘from_start’ setting when truncating.

  • min_freq – Minimum number of times a token must appear in the total training data to be included in the vocabulary. Note that this setting will not do anything if passing in vocab_file.

  • split_on – Which token to split the sequence on to generate separate tokens for the vocabulary. Setting this to None will split on every character in the sequence.

  • tokenizer – Which tokenizer to use. Mainly relevant when modeling natural language, and less so for other arbitrary sequences.

  • tokenizer_language – Which language rules the tokenizer should apply when tokenizing the raw data. Only relevant when using a tokenizer that supports language-specific rules, such as spacy (which you have to install separately).

  • adaptive_tokenizer_max_vocab_size – If using an adaptive tokenizer ("bpe"), this parameter controls the maximum size of the vocabulary.

  • mixing_subtype – Which type of mixing to use on the sequence data given that mixing_alpha is set >0.0 in the global configuration.

  • modality_dropout_rate – Dropout rate to apply to the modality, e.g., 0.2 means that 20% of the time, this modality will be dropped out during training.
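
The vocabulary, max_length and sampling_strategy_if_longer options above can be sketched in plain Python. This is an illustration of the described behaviour, not EIR's implementation: the function names build_vocab, resolve_max_length and truncate are hypothetical, the rounding used for ‘average’ is an assumption, and any special tokens the framework may add to the vocabulary are left out.

```python
import json
import random
from collections import Counter
from statistics import mean


def build_vocab(sequences, split_on=" ", min_freq=10):
    """Map each token seen at least `min_freq` times to an integer ID."""
    counts = Counter()
    for seq in sequences:
        # split_on=None means one token per character.
        tokens = list(seq) if split_on is None else seq.split(split_on)
        counts.update(tokens)
    kept = sorted(tok for tok, n in counts.items() if n >= min_freq)
    return {tok: idx for idx, tok in enumerate(kept)}


def resolve_max_length(token_lists, max_length="average"):
    """Turn 'max' / 'average' into a concrete integer length."""
    lengths = [len(toks) for toks in token_lists]
    if max_length == "max":
        return max(lengths)
    if max_length == "average":
        # Assumption: round down; the source does not state the rounding.
        return int(mean(lengths))
    return int(max_length)


def truncate(tokens, max_length, strategy="uniform", rng=random):
    """Cut a token list down to `max_length` tokens if it is longer."""
    if len(tokens) <= max_length:
        return tokens
    if strategy == "from_start":
        start = 0
    else:
        # "uniform": every valid window start is equally likely.
        start = rng.randint(0, len(tokens) - max_length)
    return tokens[start:start + max_length]


dna = ["ACGTAC", "ACGT", "AC"]
vocab = build_vocab(dna, split_on=None, min_freq=2)
print(json.dumps(vocab))  # JSON text in the first vocab_file format

limit = resolve_max_length([list(s) for s in dna], "average")
print(limit, truncate(list(dna[0]), limit, "from_start"))
```

Writing the printed JSON to a file gives something in the shape of the first vocab_file format described above. Because min_freq is applied while building the dictionary, it has no further effect once such a file is passed in.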

Model Selection

class eir.models.input.sequence.transformer_models.SequenceModelConfig(
model_init_config: BasicTransformerFeatureExtractorModelConfig | dict,
model_type: Literal['sequence-default'] | str = 'sequence-default',
embedding_dim: int = 64,
position: Literal['encode', 'embed'] = 'encode',
position_dropout: float = 0.1,
window_size: int = 0,
pool: Literal['avg'] | Literal['max'] | None = None,
masked: bool = False,
pretrained_model: bool = False,
freeze_pretrained_model: bool = False,
)
Parameters:
  • model_init_config – Configuration / arguments used to initialise model.

  • model_type – Which type of sequence model to use.

  • embedding_dim – Which dimension to use for the embeddings. If None, will automatically set this value based on the number of tokens and attention heads.

  • position – Whether to encode the token position or use learnable position embeddings.

  • position_dropout – Dropout for the positional encoding / embedding.

  • window_size – If set to more than 0, will apply a sliding window of feature extraction over the input, meaning the model (e.g. transformer) will only see a part of the input at a time. This can be useful to avoid the O(n²) complexity of transformers, as the cost becomes O(window_size² * n_windows) instead.

  • pool – Whether and how to pool (max / avg) the final feature maps before they are passed to the final fusion module / predictor. Pooling is done over the sequence (i.e. time) dimension, so the resulting dimension is embedding_dim instead of sequence_length * embedding_dim. If using windowed / conv transformers, this becomes embedding_dim * number_of_chunks.

  • masked – Whether to use a causal mask in the transformer encoder. Note that for sequence outputs, this is automatically applied to both the input transformer module (i.e., ‘encoder’) and to the output transformer module (i.e., ‘decoder’). It is nevertheless configurable in case you need to enable masking manually, e.g. for cases other than sequence generation, or for masking extra inputs linked to the sequence generation modules.

  • pretrained_model – Specify whether the model type is assumed to be a pretrained external sequence model.

  • freeze_pretrained_model – Whether to freeze the pretrained model weights.
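
The window_size and pool descriptions above come down to simple arithmetic, which the following sketch works through. The helper names attention_pairs and output_features are hypothetical, and the sketch assumes non-overlapping windows with the last window padded to full size; how EIR actually chunks the input may differ.

```python
import math


def attention_pairs(seq_len, window_size=0):
    """Count query-key pairs scored by self-attention for one sequence."""
    if window_size <= 0:
        # Full attention: every position attends to every position.
        return seq_len ** 2
    # Assumption: non-overlapping windows, last one padded to full size.
    n_windows = math.ceil(seq_len / window_size)
    return window_size ** 2 * n_windows


def output_features(seq_len, embedding_dim, pool=None, window_size=0):
    """Size of the flattened feature vector passed on to fusion."""
    n_chunks = math.ceil(seq_len / window_size) if window_size > 0 else 1
    if pool in ("avg", "max"):
        # Pooling collapses the sequence dimension within each chunk.
        return embedding_dim * n_chunks
    return seq_len * embedding_dim


print(attention_pairs(1024))                   # full attention
print(attention_pairs(1024, window_size=128))  # windowed attention
print(output_features(1024, 64))               # no pooling
print(output_features(1024, 64, pool="avg"))   # pooled, single chunk
print(output_features(1024, 64, pool="avg", window_size=128))
```

For a sequence of 1024 tokens, a window of 128 gives 8 windows and reduces the pair count from 1024² to 128² * 8, an eightfold saving under these assumptions.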

Available Feature Extractors

Built-in Sequence Models

class eir.models.input.sequence.transformer_models.BasicTransformerFeatureExtractorModelConfig(
num_heads: int = 8,
num_layers: int = 2,
dim_feedforward: int | Literal['auto'] = 'auto',
dropout: float = 0.1,
)
Parameters:
  • num_heads – The number of heads in the multi-head attention layers.

  • num_layers – The number of encoder blocks in the transformer model.

  • dim_feedforward – The dimension of the feedforward layers in the transformer model.

  • dropout – Dropout value to use in the encoder layers.
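
When choosing num_heads together with embedding_dim, it can help to check that the two are compatible. Standard multi-head attention splits the embedding evenly across heads, so embedding_dim needs to be divisible by num_heads. The sketch below assumes the built-in model follows that convention; head_dim is a hypothetical helper, not part of EIR.

```python
def head_dim(embedding_dim: int, num_heads: int) -> int:
    """Return the per-head dimension, or fail if the split is uneven."""
    if embedding_dim % num_heads != 0:
        raise ValueError(
            f"embedding_dim={embedding_dim} is not divisible by "
            f"num_heads={num_heads}"
        )
    return embedding_dim // num_heads


# Defaults from the signatures above: embedding_dim=64, num_heads=8.
print(head_dim(64, 8))
# Values from the quick example: embedding_dim=128, num_heads=8.
print(head_dim(128, 8))
```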

External Sequence Models

For pre-trained language models (BERT, GPT, etc.), please refer to Sequence Models for detailed configuration options.

Interpretation Support

class eir.setup.schemas.BasicInterpretationConfig(
interpretation_sampling_strategy: Literal['first_n', 'random_sample'] = 'first_n',
num_samples_to_interpret: int = 10,
manual_samples_to_interpret: Sequence[str] | None = None,
)
Parameters:
  • interpretation_sampling_strategy – How to sample sequences for attribution analysis. first_n always grabs the same first n values from the beginning of the dataset to interpret, while random_sample will sample uniformly from the whole dataset without replacement.

  • num_samples_to_interpret – How many samples to interpret.

  • manual_samples_to_interpret – IDs of samples to always interpret, irrespective of interpretation_sampling_strategy and num_samples_to_interpret. A caveat here is that they must be present in the dataset that is being interpreted (e.g., validation / test dataset), meaning that adding IDs here that happen to be in the training dataset will not work.
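
How the three options interact can be sketched as follows. This illustrates the description above rather than EIR's code: select_ids_to_interpret is a hypothetical name, the seed argument is added only to make the example repeatable, and both raising an error for unknown IDs and adding manual IDs on top of num_samples_to_interpret are assumptions.

```python
import random


def select_ids_to_interpret(dataset_ids, strategy="first_n",
                            num_samples=10, manual_ids=None, seed=0):
    """Pick sample IDs for attribution analysis."""
    ids = list(dataset_ids)
    n = min(num_samples, len(ids))
    if strategy == "first_n":
        # Always the same first n IDs of the dataset.
        chosen = ids[:n]
    elif strategy == "random_sample":
        # Uniform sampling without replacement.
        chosen = random.Random(seed).sample(ids, n)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    present = set(ids)
    for sample_id in manual_ids or []:
        if sample_id not in present:
            # Assumption: IDs outside the interpreted dataset are an error.
            raise ValueError(f"{sample_id} is not in the dataset")
        if sample_id not in chosen:
            chosen.append(sample_id)
    return chosen


valid_ids = [f"s{i}" for i in range(20)]
print(select_ids_to_interpret(valid_ids, "first_n", 3, manual_ids=["s7"]))
print(select_ids_to_interpret(valid_ids, "random_sample", 3))
```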