Sequence Data Guide
Ready-to-use configurations for sequence data analysis using Transformer-based models in EIR.
Supported data types: Text (NLP), protein/peptide sequences, DNA/RNA, time series, and other discrete token sequences
Data format: A folder with .txt files (filename is the ID) or a .csv file with columns "ID" and "Sequence"
Models: Built-in transformer (sequence-default), external pretrained models (BERT, RoBERTa, etc., see Sequence Models).
Note
First step: Copy the Configuration Guides global configuration as your globals.yaml
Quick Start
Use cases: Sequence classification (sentiment, protein function), regression (binding affinity), or generation
Data requirements: Sequence data in text files or CSV format, labels for supervised tasks
Files needed:
input_info:
  input_source: data/protein_sequences/  # Path to folder with .txt files or .csv file
  input_name: sequence
  input_type: sequence
input_type_info:
  max_length: 512                        # Sequence length (int, 'max', or 'average')
  # Split on characters for proteins/DNA
  # ("" for char-level, " " for words,
  # null for no splitting e.g. when using BPE tokenizer)
  split_on: ""
  tokenizer: null                        # No tokenizer (see advanced options below)
  min_freq: 2                            # Minimum token frequency for vocabulary
model_config:
  model_type: sequence-default           # Built-in transformer for sequences
  model_init_config:
    embedding_dim: 128                   # Token embedding dimension
    num_layers: 4                        # Number of transformer layers
    num_heads: 8                         # Number of attention heads per layer
    dropout: 0.10                        # Dropout rate
Note
The input_source can be:
A directory of .txt files, where the filename (without extension) is the sample ID
A .csv file with columns "ID" and "Sequence"
For protein/DNA sequences, use split_on: "" for character-level tokenization.
For natural language, use split_on: " " for word-level tokenization.
Alternatively, set split_on: null for no splitting, and use the BPE tokenizer (tokenizer: "bpe") for an adaptive vocabulary.
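As a rough illustration of the two input layouts described in the note, the following Python sketch writes the same samples once as a folder of .txt files and once as a single .csv file. The sample IDs, the sequences, and the helper names write_txt_folder and write_csv are made up for this example and are not part of EIR.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical example data: sample ID -> sequence.
sequences = {
    "sample_001": "MKTAYIAKQR",
    "sample_002": "GAVLIPFMW",
}


def write_txt_folder(seqs: dict, folder: Path) -> None:
    """Write one .txt file per sample; the file stem is the sample ID."""
    folder.mkdir(parents=True, exist_ok=True)
    for sample_id, seq in seqs.items():
        (folder / f"{sample_id}.txt").write_text(seq)


def write_csv(seqs: dict, path: Path) -> None:
    """Write a single .csv file with "ID" and "Sequence" columns."""
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ID", "Sequence"])
        writer.writerows(seqs.items())


# Write to a temporary directory so the sketch does not touch real data.
out_dir = Path(tempfile.mkdtemp())
write_txt_folder(sequences, out_dir / "protein_sequences")
write_csv(sequences, out_dir / "protein_sequences.csv")
```

Either the resulting folder or the resulting .csv file could then be given as input_source.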
output_info:
  output_name: sequence_label
  output_source: data/labels.csv         # Must contain "ID" column + targets
  output_type: tabular
output_type_info:
  target_cat_columns:
    - Function_Class                     # Categorical target (e.g., protein function)
  target_con_columns:
    - Binding_Affinity                   # Continuous target (optional)
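Because the label file is matched to the sequences through the "ID" column, it can help to check that both sides agree before training. The sketch below is one possible pre-flight check in Python, assuming the folder-of-.txt-files input layout; the function name find_id_mismatches and the demo data are hypothetical, and how EIR itself treats unmatched IDs is not covered here.

```python
import csv
import tempfile
from pathlib import Path


def find_id_mismatches(seq_dir: Path, labels_csv: Path):
    """Return (IDs with a sequence but no label, IDs with a label but no sequence)."""
    seq_ids = {p.stem for p in seq_dir.glob("*.txt")}
    with labels_csv.open(newline="") as handle:
        label_ids = {row["ID"] for row in csv.DictReader(handle)}
    return seq_ids - label_ids, label_ids - seq_ids


# Tiny self-contained demo with made-up data.
tmp = Path(tempfile.mkdtemp())
seq_dir = tmp / "protein_sequences"
seq_dir.mkdir()
for sample_id in ("a", "b", "c"):
    (seq_dir / f"{sample_id}.txt").write_text("MKT")

labels_csv = tmp / "labels.csv"
labels_csv.write_text(
    "ID,Function_Class,Binding_Affinity\n"
    "a,kinase,0.5\n"
    "b,ligase,1.2\n"
    "d,kinase,0.1\n"
)

missing_labels, missing_sequences = find_id_mismatches(seq_dir, labels_csv)
print("No label for:", missing_labels)
print("No sequence for:", missing_sequences)
```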
Run command:
eirtrain --global_configs globals.yaml \
--input_configs inputs.yaml \
--output_configs outputs.yaml
About Sequence Models
Full model configuration with all available parameters:
model_config:
  model_type: sequence-default
  model_init_config:
    # Architecture parameters
    embedding_dim: 128        # Dimension of token embeddings
    num_layers: 6             # Number of transformer layers
    num_heads: 8              # Number of attention heads
    dropout: 0.10             # Dropout rate in transformer layers
    # Advanced architecture options
    dim_feedforward: 512      # Feedforward network dimension
    # Attention mechanisms
    window_size: null         # Local attention window (null = full attention)
As always, please refer to the API documentation Sequence Data Configuration for the full list of available parameters and more in-depth explanations.
Common Use Cases
Natural Language Processing
For text classification, sentiment analysis, or document classification:
input_type_info:
  max_length: 512
  split_on: " "                # Split on whitespace for words
  tokenizer: "basic_english"   # English text normalization
  min_freq: 5                  # Filter rare words
Biological Sequences
For protein, peptide, or DNA sequence analysis:
input_type_info:
  max_length: 1024   # Typical protein length
  split_on: ""       # Character-level tokenization
  tokenizer: null    # No additional tokenization
  min_freq: 1        # Keep all amino acids/nucleotides
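To make the effect of split_on and min_freq more concrete, here is a small conceptual sketch in Python. It is not EIR's implementation: the functions split_sequence and build_vocab are hypothetical, and it assumes that min_freq is compared against how often a token occurs across all sequences.

```python
from collections import Counter
from typing import List, Optional


def split_sequence(sequence: str, split_on: Optional[str]) -> List[str]:
    """Mimic the three split_on modes: None (no split), "" (characters), or a delimiter."""
    if split_on is None:
        return [sequence]
    if split_on == "":
        return list(sequence)
    return sequence.split(split_on)


def build_vocab(sequences: List[str], split_on: Optional[str], min_freq: int) -> List[str]:
    """Keep tokens that occur at least min_freq times across all sequences."""
    counts = Counter(tok for seq in sequences for tok in split_sequence(seq, split_on))
    return sorted(tok for tok, n in counts.items() if n >= min_freq)


# Made-up peptide fragments; "X" stands in for a rare/unknown residue.
proteins = ["MKTAYIAK", "MKVLAAGX"]
vocab_all = build_vocab(proteins, split_on="", min_freq=1)
vocab_filtered = build_vocab(proteins, split_on="", min_freq=2)

print(vocab_all)       # every residue seen at least once
print(vocab_filtered)  # only residues seen at least twice
```

With min_freq: 1 every residue is kept, which is usually what you want for a small, fixed alphabet such as amino acids or nucleotides. A higher value mainly makes sense for word-level vocabularies, where rare tokens are common.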
Time Series Data
For sequential numeric data represented as text (this assumes the values have already been binned or discretized into tokens):
input_type_info:
  max_length: "average"                    # Use average sequence length
  split_on: ","                            # Split on delimiter
  tokenizer: null                          # No tokenization
  sampling_strategy_if_longer: "uniform"   # Random sampling for long sequences
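The binning step mentioned above happens before the data reaches EIR. One simple way to do it is equal-width binning, sketched below in Python; the function name discretize, the value range, and the number of bins are assumptions chosen for illustration, and other schemes (e.g. quantile bins) may suit your data better.

```python
from typing import List


def discretize(values: List[float], low: float, high: float, n_bins: int) -> str:
    """Map each value to an equal-width bin index and join the indices with commas."""
    width = (high - low) / n_bins
    tokens = []
    for value in values:
        idx = int((value - low) / width)
        # Clamp out-of-range values (and value == high) into the valid bin range.
        idx = min(max(idx, 0), n_bins - 1)
        tokens.append(str(idx))
    return ",".join(tokens)


# Made-up measurements, expected to lie roughly in [0, 1].
series = [0.05, 0.42, 0.97, 1.0, -0.2]
line = discretize(series, low=0.0, high=1.0, n_bins=10)
print(line)  # 0,4,9,9,0
```

Each resulting line can be stored as one sample's sequence, and split_on: "," then recovers the bin indices as tokens.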
Advanced Tokenization
BPE (Byte Pair Encoding) Tokenization:
For subword tokenization, particularly useful for handling out-of-vocabulary words:
input_type_info:
  tokenizer: "bpe"
  adaptive_tokenizer_max_vocab_size: 10000   # Maximum vocabulary size
  vocab_file: null                           # Will be trained on your data
  split_on: null                             # BPE handles splitting internally
Custom Vocabulary:
Using a pre-defined vocabulary file:
input_type_info:
  vocab_file: "data/custom_vocab.json"   # JSON file with token->id mapping
Note
The vocab file is an optional text file containing a pre-defined vocabulary to use for training. If it is not passed in, the framework automatically builds the vocabulary from the training data. Passing in a vocabulary file is therefore useful if (a) you want to manually specify or limit the vocabulary used and/or (b) you want to save time by pre-computing the vocabulary.
Two formats are supported:
A .json file containing a dictionary with the vocabulary tokens as keys and the corresponding token IDs as values. For example: {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
A .json file with the results of training and saving the vocabulary of a Hugging Face BPE tokenizer. This is the file created by calling hf_tokenizer.save(). This format is only valid when using the bpe tokenizer.
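For the first format (a plain token-to-ID dictionary), a vocabulary file could be pre-computed along the following lines. This is a minimal Python sketch with a made-up corpus and a frequency-based ordering chosen purely for illustration. Whether special tokens (such as padding or unknown-token entries) need to be present in the file is not stated here, so check the API documentation before relying on it.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Made-up corpus; in practice this would be your training sequences.
corpus = ["the cat sat on the mat", "the dog sat"]

counts = Counter(word for line in corpus for word in line.split(" "))

# Most frequent tokens first; ties broken alphabetically so the file is reproducible.
ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))

# Each token appears exactly once as a key and gets a unique integer ID.
vocab = {token: idx for idx, token in enumerate(ordered)}

vocab_path = Path(tempfile.mkdtemp()) / "custom_vocab.json"
vocab_path.write_text(json.dumps(vocab, indent=2))
print(vocab)
```

Note that JSON object keys must be unique, so each token can map to only one ID.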
Sequence Length Strategies
Dynamic Length Calculation:
input_type_info:
  max_length: "max"       # Use longest sequence in dataset
  # OR
  max_length: "average"   # Use average length
  # OR
  max_length: 512         # Fixed length
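When deciding between these options, it can be useful to look at the token-length distribution of your own data first. The Python sketch below computes the values that "max" and "average" would roughly correspond to. The function length_summary is hypothetical, and the rounding of the average is an assumption, so the exact number EIR derives may differ.

```python
from statistics import mean
from typing import Dict, List


def length_summary(sequences: List[str], split_on: str) -> Dict[str, int]:
    """Token counts per sequence, summarized as the longest and the (rounded) mean length."""
    lengths = [
        len(seq) if split_on == "" else len(seq.split(split_on))
        for seq in sequences
    ]
    return {"max": max(lengths), "average": round(mean(lengths))}


# Made-up word-level sequences with 3, 5, and 1 tokens.
samples = ["a b c", "a b c d e", "a"]
summary = length_summary(samples, split_on=" ")
print(summary)  # {'max': 5, 'average': 3}
```

If the maximum is far above the average, a fixed integer between the two is often a reasonable compromise between memory use and truncation.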
Handling Long Sequences:
input_type_info:
  sampling_strategy_if_longer: "uniform"      # Randomly sample a slice of the sequence during training
  # OR
  sampling_strategy_if_longer: "from_start"   # Always keep the first max_length tokens
Note
Validation and test sets always use "from_start" for consistency,
regardless of the training strategy.
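The difference between the two strategies can be sketched in a few lines of Python. This is a conceptual illustration rather than EIR's code: the truncate function is hypothetical, and it assumes that "uniform" picks a contiguous window with a uniformly random start position.

```python
import random
from typing import List, Optional


def truncate(
    tokens: List[str],
    max_length: int,
    strategy: str,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Shorten a token list to at most max_length tokens using the given strategy."""
    if len(tokens) <= max_length:
        return list(tokens)
    if strategy == "from_start":
        # Deterministic: always the first max_length tokens.
        return list(tokens[:max_length])
    if strategy == "uniform":
        # Assumed behaviour: a contiguous window starting at a random offset.
        rng = rng or random.Random()
        start = rng.randint(0, len(tokens) - max_length)
        return list(tokens[start:start + max_length])
    raise ValueError(f"Unknown strategy: {strategy!r}")


tokens = list("ABCDEFGHIJ")
head = truncate(tokens, 4, "from_start")
window = truncate(tokens, 4, "uniform", rng=random.Random(0))
print(head, window)
```

Random windows act as a mild form of data augmentation during training, while the deterministic variant keeps evaluation results comparable between runs.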
External Pretrained Models
For leveraging pretrained language models:
model_config:
  model_type: "bert-base-uncased"   # Hugging Face model name
  pretrained_model: true            # Use pretrained weights
  model_init_config:
    num_labels: 2                   # Number of output classes
See Sequence Models for the full list of supported models.
Attribution Analysis
Enable feature importance analysis to understand which parts of sequences contribute most to predictions:
attribution_analysis:
  compute_attributions: true
  max_attributions_per_class: 100       # Samples per class to analyze
  attributions_every_sample_factor: 4   # Compute every 4th evaluation
This uses Integrated Gradients to compute token-level importance scores, helping you understand model decisions.
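For intuition about what Integrated Gradients computes, the following Python sketch applies the method to a tiny hand-written function instead of a neural network. Everything in it (the toy function f, its analytic gradient grad_f, the all-zero baseline, and the midpoint Riemann sum) is an assumption made for illustration. In a real sequence model, the gradients would be taken with respect to continuous token representations and computed by the framework.

```python
from typing import Callable, List


def integrated_gradients(
    grad_fn: Callable[[List[float]], List[float]],
    x: List[float],
    baseline: List[float],
    steps: int = 50,
) -> List[float]:
    """Approximate IG with a midpoint Riemann sum along the straight path baseline -> x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = grad_fn(point)
        total = [t + g for t, g in zip(total, grads)]
    # Average gradient along the path, scaled by the input difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def f(v: List[float]) -> float:
    """Toy 'model': quadratic in the first feature, linear in the second."""
    return v[0] ** 2 + 3.0 * v[1]


def grad_f(v: List[float]) -> List[float]:
    """Analytic gradient of f."""
    return [2.0 * v[0], 3.0]


x = [2.0, 1.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
print(attributions)
```

The attributions sum to f(x) - f(baseline), which is the completeness property that makes the scores interpretable as each feature's share of the prediction.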
Complete Configuration Examples
Protein Function Prediction:
# inputs.yaml
input_info:
  input_source: data/protein_sequences/
  input_name: protein_seq
  input_type: sequence
input_type_info:
  max_length: 1024
  split_on: ""   # Character-level for amino acids
  min_freq: 1    # Keep all amino acids
model_config:
  model_type: sequence-default
  model_init_config:
    embedding_dim: 128
    num_layers: 4
    num_heads: 8
    dropout: 0.10
# outputs.yaml
output_info:
  output_name: protein_function
  output_source: data/protein_labels.csv
  output_type: tabular
output_type_info:
  target_cat_columns:
    - Enzyme_Class
    - Subcellular_Location
Sentiment Analysis:
# inputs.yaml
input_info:
  input_source: data/reviews.csv   # CSV with ID and Sequence columns
  input_name: review_text
  input_type: sequence
input_type_info:
  max_length: 512
  split_on: " "                    # Word-level tokenization
  tokenizer: "basic_english"       # Text normalization
  min_freq: 5                      # Filter rare words
model_config:
  model_type: sequence-default
  model_init_config:
    embedding_dim: 256
    num_layers: 6
    num_heads: 8
    dropout: 0.10
# outputs.yaml
output_info:
  output_name: sentiment
  output_source: data/sentiment_labels.csv
  output_type: tabular
output_type_info:
  target_cat_columns:
    - Sentiment                    # Positive/Negative