Sequence Output Configuration
Complete configuration guide for sequence generation and prediction tasks.
Overview
Sequence outputs handle sequential data generation including:
Text generation - Natural language generation, summarization
Sequence-to-sequence - Translation, transformation tasks
DNA generation - Biological sequence synthesis
Time series forecasting - Future value prediction
Quick Example
output_info:
  output_source: "my_sequence_output_folder/"
  output_name: "generated_text"
  output_type: "sequence"
output_type_info:
  vocab_file: "output_vocab.json"
  max_length: 512
model_config:
  model_type: "sequence"
  model_init_config:
    embedding_dim: 256
    num_heads: 8
Output Type Configuration
class eir.setup.schema_modules.output_schemas_sequence.SequenceOutputTypeConfig(
    vocab_file: None | str = None,
    max_length: al_max_sequence_length = 'average',
    sampling_strategy_if_longer: Literal['from_start', 'uniform'] = 'uniform',
    min_freq: int = 10,
    split_on: str | None = ' ',
    tokenizer: al_tokenizer_choices = None,
    tokenizer_language: str | None = None,
    adaptive_tokenizer_max_vocab_size: int | None = None,
    sequence_operation: Literal['autoregressive', 'mlm'] = 'autoregressive',
)
Parameters:
vocab_file – An optional text file containing a pre-defined vocabulary to use for training. If it is not passed in, the framework automatically builds the vocabulary from the training data. Passing in a vocabulary file is therefore useful if (a) you want to manually specify or limit the vocabulary and/or (b) you want to save time by pre-computing the vocabulary.
Two formats are supported:
A .json file containing a dictionary with the vocabulary tokens as keys and the corresponding token IDs as values. Each token may appear only once as a key. For example:
{"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
A .json file with the results of training and saving the vocabulary of a Huggingface BPE tokenizer. This is the file created by calling hf_tokenizer.save(). This is only valid when using the bpe tokenizer.
max_length – Maximum length to truncate/pad sequences to. This can be an integer or one of the values ‘max’ or ‘average’. The ‘max’ keyword uses the maximum sequence length found in the training data, while ‘average’ uses the average length across all training samples.
sampling_strategy_if_longer – Controls how sequences are truncated if they are longer than the specified max_length parameter. Using ‘from_start’ always truncates starting from the beginning of the sequence, ensuring that a given sample is always the same during training. Setting this parameter to ‘uniform’ uniformly samples a slice of a given sample sequence during training. Note that for consistency, the validation/test set samples always use the from_start setting when truncating.
min_freq – Minimum number of times a token must appear in the total training data to be included in the vocabulary. Note that this setting has no effect if vocab_file is passed in.
split_on – Which token to split the sequence on to generate separate tokens for the vocabulary.
tokenizer – Which tokenizer to use. This is mainly relevant when modelling natural language, and less so for other arbitrary sequences.
tokenizer_language – Which language rules the tokenizer should apply when tokenizing the raw data.
adaptive_tokenizer_max_vocab_size – If using an adaptive tokenizer ("bpe"), this parameter controls the maximum size of the vocabulary.
sequence_operation – Which operation to perform on the sequence. Currently only autoregressive is supported, which means that the model will be trained to predict the next token in the sequence given the previous tokens.
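To make the interaction between split_on, min_freq, max_length and sampling_strategy_if_longer more concrete, here is a minimal Python sketch. It is not the framework's implementation: the helper names build_vocab and truncate are hypothetical, treating split_on=None as character-level splitting is an assumption, and so is reading ‘from_start’ as "keep the first max_length tokens".

```python
import json
import random
from collections import Counter


def build_vocab(samples, split_on=" ", min_freq=10):
    """Hypothetical helper: keep tokens seen at least min_freq times."""
    counts = Counter()
    for text in samples:
        # Assumption: split_on=None means one token per character.
        tokens = list(text) if split_on is None else text.split(split_on)
        counts.update(tok for tok in tokens if tok)
    kept = sorted(tok for tok, n in counts.items() if n >= min_freq)
    # Token -> ID mapping, in the same shape as the first vocab_file format.
    return {tok: idx for idx, tok in enumerate(kept)}


def truncate(tokens, max_length, strategy="uniform", rng=random):
    """Hypothetical helper: cut a token list down to max_length tokens."""
    if len(tokens) <= max_length:
        return list(tokens)
    if strategy == "from_start":
        start = 0  # always the same window, so the sample is deterministic
    elif strategy == "uniform":
        # Any contiguous window of max_length tokens is equally likely.
        start = rng.randint(0, len(tokens) - max_length)
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return list(tokens[start : start + max_length])


samples = ["the cat sat on the mat", "the dog sat on the rug"]
vocab = build_vocab(samples, split_on=" ", min_freq=2)
vocab_json = json.dumps(vocab)  # could be written to disk and used as vocab_file
```

With min_freq=2, only "the", "sat" and "on" survive in this toy corpus, which shows why a high min_freq on a small dataset can shrink the vocabulary considerably.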
Output Module Configuration
class eir.models.output.sequence.sequence_output_modules.SequenceOutputModuleConfig(
    model_init_config: TransformerSequenceOutputModuleConfig,
    model_type: Literal['sequence'] = 'sequence',
    embedding_dim: int = 64,
    position: Literal['encode', 'embed'] = 'encode',
    position_dropout: float = 0.1,
)
Parameters:
model_init_config – Configuration / arguments used to initialise model.
model_type – Which type of sequence output model to use.
embedding_dim – Which dimension to use for the embeddings. If None, will automatically set this value based on the number of tokens and attention heads.
position – Whether to encode the token position or use learnable position embeddings.
position_dropout – Dropout for the positional encoding / embedding.
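As an illustration of the difference between the two position settings: a fixed encoding is computed from the position index alone and has no trainable parameters, whereas a learnable embedding is a lookup table updated during training. The sketch below shows the classic sinusoidal scheme in plain Python. Whether ‘encode’ uses exactly this formula is an assumption, and sinusoidal_positions is a hypothetical name.

```python
import math


def sinusoidal_positions(max_length, embedding_dim):
    """Hypothetical helper: fixed sinusoidal position table.

    Returns max_length rows of embedding_dim values, where even columns
    hold sin(pos / 10000^(i / d)) and odd columns hold the matching cos.
    """
    if embedding_dim % 2 != 0:
        raise ValueError("embedding_dim must be even for sin/cos pairs")
    table = []
    for pos in range(max_length):
        row = []
        for i in range(0, embedding_dim, 2):
            angle = pos / (10000 ** (i / embedding_dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


# One row per position; each row would be added to that token's embedding.
table = sinusoidal_positions(max_length=8, embedding_dim=4)
```

A learnable alternative would replace this table with trainable weights of the same shape, which is what the ‘embed’ option describes.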
Output Sampling Configuration
class eir.setup.schema_modules.output_schemas_sequence.SequenceOutputSamplingConfig(
    manual_inputs: Sequence[dict[str, str]] = (),
    n_eval_inputs: int = 10,
    generated_sequence_length: int = 64,
    repetition_penalty: float = 1.1,
    repetition_penalty_max_window: int = 64,
    frequency_penalty: float = 0.1,
    frequency_penalty_max_window: int = 128,
    temperature: float = 0.7,
    top_k: int = 20,
    top_p: float = 0.9,
    tau: float = 0.95,
)
Parameters:
manual_inputs – Manually specified inputs to use for sequence generation. This is useful if you want to generate sequences based on a specific input. Depending on the input type, different formats are expected:
sequence: A string written directly in the .yaml file.
omics: A file path to a NumPy array of shape (3, n_SNPs) on disk.
image: An image file path on disk.
tabular: A mapping of (column key: value) written directly in the .yaml file.
array: A file path to a NumPy array on disk.
bytes: A file path to a file on disk.
n_eval_inputs – The number of inputs automatically sampled from the validation set for sequence generation.
generated_sequence_length – The length of the output sequences that are generated.
temperature – Controls the randomness of predictions by scaling the logits before applying softmax. A higher temperature results in more random predictions, while a lower temperature results in more deterministic predictions.
repetition_penalty – Discourages repetition by reducing the probability of tokens that have already appeared in the generated text. Values greater than 1.0 apply the penalty, with higher values (1.2-1.5) reducing repetition more aggressively. A value of 1.0 disables this feature.
repetition_penalty_max_window – The maximum number of most recent tokens to consider when applying the repetition penalty. A smaller window focuses on preventing local repetition, while a larger window prevents repetition across the entire sequence.
frequency_penalty – Reduces the probability of tokens proportional to how frequently they’ve appeared in the generated text. Unlike repetition penalty, this scales with usage count. Positive values (0.1-0.3) increase diversity, with higher values producing more varied vocabulary.
frequency_penalty_max_window – The maximum number of most recent tokens to track when calculating token frequencies for the frequency penalty. Larger windows maintain longer-term memory of word usage patterns.
top_k – The number of top candidates to consider when sampling the next token in an output sequence. By default, the model considers the top 20 candidates.
top_p – The cumulative probability of the top candidates to consider when sampling the next token in an output sequence. For example, if top_p is 0.9, the model stops adding candidates once the cumulative probability of the most likely candidates reaches 0.9, and samples only from those.
tau – Controls locally typical sampling by filtering tokens based on how close their probabilities are to the expected distribution. Values range from 0.0 to 1.0, where 1.0 disables the filter. Lower values produce more consistent text by removing outlier tokens.
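The following plain-Python sketch shows one common way the penalties, temperature, top_k and top_p can be combined when picking the next token. It is an illustration under stated assumptions rather than the framework's code: all function names are hypothetical, the repetition penalty follows the widely used divide-or-multiply rule, the frequency penalty subtracts penalty times count from the logit, and the tau filter is left out.

```python
import math
import random
from collections import Counter


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_repetition_penalty(logits, recent, penalty=1.1, max_window=64):
    """Assumed rule: shrink positive logits, push negative ones lower."""
    out = list(logits)
    for tok in set(recent[-max_window:]):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def apply_frequency_penalty(logits, recent, penalty=0.1, max_window=128):
    """Assumed rule: subtract penalty once per occurrence in the window."""
    out = list(logits)
    for tok, count in Counter(recent[-max_window:]).items():
        out[tok] -= penalty * count
    return out


def filter_top_k_top_p(logits, temperature=0.7, top_k=20, top_p=0.9):
    """Return {token_id: probability} over the candidates that survive."""
    probs = softmax([x / temperature for x in logits])
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked[:top_k]:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # nucleus reached, drop the remaining tail
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_next(logits, recent, rng=random, **filter_kwargs):
    logits = apply_repetition_penalty(logits, recent)
    logits = apply_frequency_penalty(logits, recent)
    dist = filter_top_k_top_p(logits, **filter_kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


example_logits = [2.0, 1.0, 0.5, -1.0]
next_token = sample_next(example_logits, recent=[0, 0, 1], top_k=3, top_p=0.9)
```

Lowering temperature or top_p concentrates the distribution on the most likely tokens, while raising the penalties shifts probability away from tokens that were generated recently.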