Sequence Data Configuration
Complete configuration guide for sequential data including text, DNA sequences, and time series.
Overview
Sequence inputs in EIR cover data such as:
- Text data - Natural language, documents, reviews
- Biological sequences - DNA, RNA, protein sequences
- Time series - Sequential measurements over time
Quick Example
input_info:
  input_source: "my/sequence/data/"
  input_name: "dna_sequence"
  input_type: "sequence"
input_type_info:
  vocab_file: "dna_vocab.json"
  max_length: 1024
model_config:
  model_type: "sequence-default"
  embedding_dim: 128
  model_init_config:
    num_heads: 8
    num_layers: 6
Input Data Configuration
Base Configuration
class eir.setup.schemas.SequenceInputDataConfig(
    vocab_file: None | str = None,
    max_length: int | Literal['max', 'average'] = 'average',
    sampling_strategy_if_longer: Literal['from_start', 'uniform'] = 'uniform',
    min_freq: int = 10,
    split_on: str | None = ' ',
    tokenizer: Literal['basic_english'] | Literal['basic'] | Literal['spacy'] | Literal['moses'] | Literal['toktok'] | Literal['revtok'] | Literal['subword'] | Literal['bpe'] | str | None = None,
    tokenizer_language: str | None = None,
    adaptive_tokenizer_max_vocab_size: int | None = None,
    mixing_subtype: Literal['mixup'] = 'mixup',
    modality_dropout_rate: float = 0.0,
)
Parameters:
vocab_file – An optional text file containing a pre-defined vocabulary to use for training. If this is not passed in, the framework will automatically build the vocabulary from the training data. Passing in a vocabulary file is therefore useful if (a) you want to manually specify / limit the vocabulary used and/or (b) you want to save time by pre-computing the vocabulary.
Two formats are supported:
- A .json file containing a dictionary with the vocabulary as keys and the corresponding token IDs as values. For example:
  {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
- A .json file with the results of training and saving the vocabulary of a Hugging Face BPE tokenizer. This is the file created by calling hf_tokenizer.save(). This is only valid when using the bpe tokenizer.
max_length – Maximum length to truncate/pad sequences to. This can be an integer or the values ‘max’ or ‘average’. The ‘max’ keyword will use the maximum sequence length found in the training data, while ‘average’ will use the average length across all training samples.
sampling_strategy_if_longer – Controls how sequences are truncated if they are longer than the specified max_length parameter. Using ‘from_start’ will always truncate from the beginning of the sequence, ensuring that the samples will always be the same during training. Setting this parameter to ‘uniform’ will uniformly sample a slice of a given sample sequence during training. Note that for consistency, the validation/test set samples always use the ‘from_start’ setting when truncating.
min_freq – Minimum number of times a token must appear in the total training data to be included in the vocabulary. Note that this setting has no effect if a vocab_file is passed in.
split_on – Which token to split the sequence on to generate separate tokens for the vocabulary. Setting this to None will split on every character in the sequence.
tokenizer – Which tokenizer to use. Mostly relevant when modeling natural language, and less so for other arbitrary sequences.
tokenizer_language – Which language rules the tokenizer should apply when tokenizing the raw data. Only relevant when using a tokenizer that supports language-specific rules, such as spacy (which you have to install separately).
adaptive_tokenizer_max_vocab_size – If using an adaptive tokenizer ("bpe"), this parameter controls the maximum size of the vocabulary.
mixing_subtype – Which type of mixing to use on the sequence data, given that mixing_alpha is set >0.0 in the global configuration.
modality_dropout_rate – Dropout rate to apply to the modality, e.g., 0.2 means that 20% of the time, this modality will be dropped out during training.
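To make the interplay of split_on, min_freq, max_length and sampling_strategy_if_longer more concrete, the following is a minimal Python sketch of the behaviour described above. It is not EIR's implementation: the function names (tokenize, build_vocab, resolve_max_length, truncate) are hypothetical, the ID assignment order and the integer division used for ‘average’ are assumptions, and ‘from_start’ is read here as "keep the first max_length tokens". Whether EIR expects special tokens (e.g., for padding or unknown tokens) in a vocabulary file is not covered.

```python
from __future__ import annotations

import json
import random
from collections import Counter


def tokenize(sequence: str, split_on: str | None) -> list[str]:
    """Split on a delimiter, or into single characters when split_on is None."""
    if split_on is None:
        return list(sequence)
    return sequence.split(split_on)


def build_vocab(sequences: list[str], split_on: str | None, min_freq: int) -> dict[str, int]:
    """Keep tokens seen at least min_freq times and give each an integer ID."""
    counts = Counter(tok for seq in sequences for tok in tokenize(seq, split_on))
    kept = sorted(tok for tok, n in counts.items() if n >= min_freq)
    return {tok: idx for idx, tok in enumerate(kept)}  # ID order is an assumption


def resolve_max_length(lengths: list[int], max_length: int | str) -> int:
    """Turn 'max' / 'average' into a concrete length from training lengths."""
    if max_length == "max":
        return max(lengths)
    if max_length == "average":
        return sum(lengths) // len(lengths)  # integer division is an assumption
    return int(max_length)


def truncate(tokens: list[str], max_length: int, strategy: str, rng: random.Random) -> list[str]:
    """Cut a too-long sequence down to a contiguous slice of max_length tokens."""
    if len(tokens) <= max_length:
        return tokens
    if strategy == "from_start":
        start = 0  # same slice every time
    elif strategy == "uniform":
        start = rng.randint(0, len(tokens) - max_length)  # random slice start
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return tokens[start : start + max_length]


# Character-level DNA example: 'N' appears only once and is dropped by min_freq=2.
train = ["ACGT", "ACGA", "ACNT"]
vocab = build_vocab(train, split_on=None, min_freq=2)
vocab_json = json.dumps(vocab)  # {"A": 0, "C": 1, "G": 2, "T": 3}

rng = random.Random(0)
tokens = tokenize("ACGTACGTAC", split_on=None)
head = truncate(tokens, 4, "from_start", rng)  # ['A', 'C', 'G', 'T']
window = truncate(tokens, 4, "uniform", rng)   # some contiguous 4-token slice
```

The JSON string in vocab_json has the same shape as the first vocab_file format, so writing it to a file and pointing vocab_file at it would skip the vocabulary-building step; in that case min_freq is ignored, as noted above.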
Model Selection
class eir.models.input.sequence.transformer_models.SequenceModelConfig(
    model_init_config: BasicTransformerFeatureExtractorModelConfig | dict,
    model_type: Literal['sequence-default'] | str = 'sequence-default',
    embedding_dim: int = 64,
    position: Literal['encode', 'embed'] = 'encode',
    position_dropout: float = 0.1,
    window_size: int = 0,
    pool: Literal['avg'] | Literal['max'] | None = None,
    masked: bool = False,
    pretrained_model: bool = False,
    freeze_pretrained_model: bool = False,
)
Parameters:
model_init_config – Configuration / arguments used to initialise model.
model_type – Which type of sequence model to use.
embedding_dim – Which dimension to use for the embeddings. If None, will automatically set this value based on the number of tokens and attention heads.
position – Whether to encode the token position or use learnable position embeddings.
position_dropout – Dropout for the positional encoding / embedding.
window_size – If set to more than 0, will apply a sliding window of feature extraction over the input, meaning the model (e.g. transformer) will only see a part of the input at a time. This can be useful for avoiding the O(n²) complexity of transformers, as the cost becomes O(window_size² * n_windows) instead.
pool – Whether and how to pool (max / avg) the final feature maps before they are passed to the final fusion module / predictor. Pooling is done over the sequence (i.e., time) dimension, so the resulting dimension is embedding_dim instead of sequence_length * embedding_dim. If using windowed / conv transformers, this becomes embedding_dim * number_of_chunks.
masked – Whether to use a causal mask in the transformer encoder. Note that for sequence outputs, this is automatically applied to both the input transformer module (i.e., ‘encoder’) and the output transformer module (i.e., ‘decoder’). It is nonetheless configurable in case you need to manually enable masking, e.g., for cases other than sequence generation, or for masking extra inputs linked to the sequence generation modules.
pretrained_model – Specify whether the model given by model_type is assumed to be a pretrained external model.
freeze_pretrained_model – Whether to freeze the pretrained model weights.
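The window_size and pool descriptions above boil down to simple arithmetic, which the following Python sketch works through. The helper names are hypothetical, constant factors are ignored, and counting windows with a ceiling (i.e., a padded last window) is an assumption rather than something stated here. The unpooled, windowed case is also assumed to keep sequence_length * embedding_dim features.

```python
import math
from typing import Optional


def n_windows(seq_len: int, window_size: int) -> int:
    """Number of windows; window_size <= 0 means a single full-length pass."""
    if window_size <= 0:
        return 1
    return math.ceil(seq_len / window_size)  # ceiling is an assumption


def attention_cost(seq_len: int, window_size: int = 0) -> int:
    """Number of pairwise attention scores, ignoring constant factors."""
    if window_size <= 0:
        return seq_len**2
    return window_size**2 * n_windows(seq_len, window_size)


def n_output_features(
    seq_len: int,
    embedding_dim: int,
    window_size: int = 0,
    pool: Optional[str] = None,
) -> int:
    """Flattened feature count handed to the fusion module / predictor."""
    if pool is None:
        return seq_len * embedding_dim
    # Pooling collapses the sequence dimension within each window (chunk).
    return embedding_dim * n_windows(seq_len, window_size)


full = attention_cost(1024)                      # 1048576
windowed = attention_cost(1024, window_size=128)  # 131072, i.e. 8x fewer scores

unpooled = n_output_features(1024, 64)                                  # 65536
pooled = n_output_features(1024, 64, pool="avg")                        # 64
pooled_windowed = n_output_features(1024, 64, window_size=128, pool="avg")  # 512
```

For long inputs, combining a window with pooling therefore shrinks both the attention cost and the number of features passed downstream.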
Available Feature Extractors
Built-in Sequence Models
class eir.models.input.sequence.transformer_models.BasicTransformerFeatureExtractorModelConfig(
    num_heads: int = 8,
    num_layers: int = 2,
    dim_feedforward: int | Literal['auto'] = 'auto',
    dropout: float = 0.1,
)
Parameters:
num_heads – The number of heads in the multi-head attention layers.
num_layers – The number of encoder blocks in the transformer model.
dim_feedforward – The dimension of the feedforward layers in the transformer model.
dropout – Dropout value to use in the encoder layers.
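When choosing num_heads together with embedding_dim, keep in mind that standard multi-head attention splits the embedding evenly across heads, so embedding_dim normally has to be divisible by num_heads. The Python sketch below checks this for a pair of values; per_head_dim is a hypothetical helper, and whether EIR validates or adjusts these values itself is not covered here.

```python
def per_head_dim(embedding_dim: int, num_heads: int) -> int:
    """Return the width of each attention head, or fail if the split is uneven."""
    if embedding_dim % num_heads != 0:
        raise ValueError(
            f"embedding_dim={embedding_dim} is not divisible by num_heads={num_heads}"
        )
    return embedding_dim // num_heads


default_head_dim = per_head_dim(embedding_dim=64, num_heads=8)   # 8
example_head_dim = per_head_dim(embedding_dim=128, num_heads=8)  # 16
```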
External Sequence Models
For pre-trained language models (BERT, GPT, etc.), please refer to Sequence Models for detailed configuration options.
Interpretation Support
class eir.setup.schemas.BasicInterpretationConfig(
    interpretation_sampling_strategy: Literal['first_n', 'random_sample'] = 'first_n',
    num_samples_to_interpret: int = 10,
    manual_samples_to_interpret: Sequence[str] | None = None,
)
Parameters:
interpretation_sampling_strategy – How to sample sequences for attribution analysis. first_n always grabs the same first n values from the beginning of the dataset to interpret, while random_sample will sample uniformly from the whole dataset without replacement.
num_samples_to_interpret – How many samples to interpret.
manual_samples_to_interpret – IDs of samples to always interpret, irrespective of interpretation_sampling_strategy and num_samples_to_interpret. A caveat here is that they must be present in the dataset that is being interpreted (e.g., validation / test dataset), meaning that adding IDs here that happen to be in the training dataset will not work.
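The three interpretation settings can be pictured as a small selection routine. The Python sketch below illustrates the semantics described above and is not EIR's code: select_ids_to_interpret and its seed argument are hypothetical, and raising an error for manual IDs that are missing from the dataset is an assumption about what "will not work" means in practice.

```python
from __future__ import annotations

import random


def select_ids_to_interpret(
    dataset_ids: list[str],
    strategy: str,
    num_samples: int,
    manual_ids: list[str] | None = None,
    seed: int = 0,
) -> list[str]:
    """Pick sample IDs for attribution analysis from the interpreted dataset."""
    manual = list(manual_ids or [])
    available = set(dataset_ids)
    missing = [sample_id for sample_id in manual if sample_id not in available]
    if missing:
        # Assumption: IDs outside the interpreted dataset are treated as an error.
        raise ValueError(f"IDs not in the interpreted dataset: {missing}")

    if strategy == "first_n":
        sampled = dataset_ids[:num_samples]  # always the same first n samples
    elif strategy == "random_sample":
        k = min(num_samples, len(dataset_ids))
        sampled = random.Random(seed).sample(dataset_ids, k)  # without replacement
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    # Manual IDs are always included, on top of the sampled ones.
    return manual + [sample_id for sample_id in sampled if sample_id not in manual]


ids = [f"sample_{i}" for i in range(20)]
first = select_ids_to_interpret(ids, "first_n", 3)
# ['sample_0', 'sample_1', 'sample_2']
with_manual = select_ids_to_interpret(ids, "first_n", 3, manual_ids=["sample_9"])
# ['sample_9', 'sample_0', 'sample_1', 'sample_2']
randomized = select_ids_to_interpret(ids, "random_sample", 5)
```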