Sequence Data Guide

Ready-to-use configurations for sequence data analysis using Transformer-based models in EIR.

Supported data types: Text (NLP), protein/peptide sequences, DNA/RNA, time series, and other discrete token sequences
Data format: A folder with .txt files (filename is the ID) or a .csv file with columns "ID" and "Sequence"
Models: Built-in transformer (sequence-default) and external pretrained models (BERT, RoBERTa, etc.; see Sequence Models).

Note

First step: Copy the global configuration from the Configuration Guides and save it as your globals.yaml.

Quick Start 

Use cases: Sequence classification (sentiment, protein function), regression (binding affinity), or generation
Data requirements: Sequence data in text files or CSV format, labels for supervised tasks

Files needed:

inputs.yaml

input_info:
  input_source: data/protein_sequences/      # Path to folder with .txt files or .csv file
  input_name: sequence
  input_type: sequence

input_type_info:
  max_length: 512                           # Sequence length (int, 'max', or 'average')

  # Split on characters for proteins/DNA
  # ("" for char-level, " " for words,
  # null for no splitting e.g. when using BPE tokenizer)
  split_on: ""

  tokenizer: null                           # No tokenizer (see advanced options below)
  min_freq: 2                               # Minimum token frequency for vocabulary

model_config:
  model_type: sequence-default              # Built-in transformer for sequences
  model_init_config:
    embedding_dim: 128                      # Token embedding dimension
    num_layers: 4                           # Number of transformer layers
    num_heads: 8                            # Number of attention heads per layer
    dropout: 0.10                           # Dropout rate

Note

The input_source can be:

A directory of .txt files where the filename (without extension) is the sample ID
A .csv file with columns "ID" and "Sequence"
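The two layouts above can be produced with a few lines of standard-library Python. This is a minimal sketch: the helper names write_txt_folder and write_sequence_csv and the sample IDs are hypothetical, chosen only for illustration.

```python
import csv
from pathlib import Path


def write_txt_folder(sequences: dict[str, str], folder: Path) -> None:
    """Write one <ID>.txt file per sample; the file stem is the sample ID."""
    folder.mkdir(parents=True, exist_ok=True)
    for sample_id, sequence in sequences.items():
        (folder / f"{sample_id}.txt").write_text(sequence, encoding="utf-8")


def write_sequence_csv(sequences: dict[str, str], csv_path: Path) -> None:
    """Write a single CSV with the two expected columns, "ID" and "Sequence"."""
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ID", "Sequence"])
        for sample_id, sequence in sequences.items():
            writer.writerow([sample_id, sequence])
```

Either output path can then be used as input_source; only one of the two layouts is needed.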

For protein/DNA sequences, use split_on: "" for character-level tokenization. For natural language, use split_on: " " for word-level tokenization.

Alternatively, set split_on: null for no splitting, and use the BPE tokenizer (tokenizer: "bpe") for an adaptive vocabulary.
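To make the three split_on modes concrete, the sketch below mimics them in plain Python. The function split_sequence is hypothetical and is not EIR's implementation; it only illustrates what each setting means for the resulting tokens.

```python
from typing import Optional


def split_sequence(text: str, split_on: Optional[str]) -> list[str]:
    """Illustrate the three split_on modes (not EIR's actual implementation)."""
    if split_on is None:
        # No splitting: the raw string is left whole for a tokenizer such as BPE.
        return [text]
    if split_on == "":
        # Character-level: one token per character.
        return list(text)
    # Delimiter-based: " " gives word-level tokens.
    return text.split(split_on)


print(split_sequence("MKTAYIAK", ""))       # ['M', 'K', 'T', 'A', 'Y', 'I', 'A', 'K']
print(split_sequence("the cat sat", " "))   # ['the', 'cat', 'sat']
print(split_sequence("the cat sat", None))  # ['the cat sat']
```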

outputs.yaml

output_info:
  output_name: sequence_label
  output_source: data/labels.csv           # Must contain "ID" column + targets
  output_type: tabular

output_type_info:
  target_cat_columns:
    - Function_Class                        # Categorical target (e.g., protein function)
  target_con_columns:
    - Binding_Affinity                      # Continuous target (optional)
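Because samples are matched through the "ID" column, it can help to compare the label file against the sequence folder before training. The following is a small sketch using only the standard library; the function name find_id_mismatches is hypothetical, and how EIR itself treats unmatched IDs is not covered in this guide.

```python
import csv
from pathlib import Path


def find_id_mismatches(
    sequence_dir: Path, labels_csv: Path
) -> tuple[set[str], set[str]]:
    """Return (IDs with a sequence but no label, IDs with a label but no sequence)."""
    # The file name without its extension is the sample ID.
    sequence_ids = {path.stem for path in sequence_dir.glob("*.txt")}
    with labels_csv.open(newline="", encoding="utf-8") as handle:
        label_ids = {row["ID"] for row in csv.DictReader(handle)}
    return sequence_ids - label_ids, label_ids - sequence_ids
```

Two empty sets mean every sequence has a label row and vice versa.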

Run command:

eirtrain --global_configs globals.yaml \
         --input_configs inputs.yaml \
         --output_configs outputs.yaml

About Sequence Models 

A fuller model configuration, showing the main architecture parameters:

Advanced sequence configuration

model_config:
  model_type: sequence-default
  model_init_config:
    # Architecture parameters
    embedding_dim: 128                      # Dimension of token embeddings
    num_layers: 6                           # Number of transformer layers
    num_heads: 8                            # Number of attention heads
    dropout: 0.10                           # Dropout rate in transformer layers

    # Advanced architecture options
    dim_feedforward: 512                    # Feedforward network dimension

    # Attention mechanisms
    window_size: null                       # Local attention window (null = full attention)
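In standard multi-head attention, the model width is split evenly across heads, so it must be divisible by the number of heads. Assuming embedding_dim serves as the transformer width here (an assumption, not something this guide states), a quick sanity check could look like the sketch below; check_attention_dims is a hypothetical helper.

```python
def check_attention_dims(embedding_dim: int, num_heads: int) -> int:
    """Return the per-head dimension, or raise if the split is uneven."""
    if embedding_dim % num_heads != 0:
        raise ValueError(
            f"embedding_dim={embedding_dim} is not divisible by num_heads={num_heads}"
        )
    return embedding_dim // num_heads


print(check_attention_dims(128, 8))  # 16
```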

As always, refer to the Sequence Data Configuration API documentation for the full list of available parameters and more in-depth explanations.

Common Use Cases 

Natural Language Processing 

For text classification, sentiment analysis, or document classification:

Text classification setup

input_type_info:
  max_length: 512
  split_on: " "                             # Split on whitespace for words
  tokenizer: "basic_english"                # English text normalization
  min_freq: 5                               # Filter rare words

Biological Sequences 

For protein, peptide, or DNA sequence analysis:

Protein sequence setup

input_type_info:
  max_length: 1024                          # Typical protein length
  split_on: ""                              # Character-level tokenization
  tokenizer: null                           # No additional tokenization
  min_freq: 1                               # Keep all amino acids/nucleotides

Time Series Data 

For sequential numeric data represented as text. This assumes the values have already been discretized (for example, binned) into tokens:

Time series setup

input_type_info:
  max_length: "average"                     # Use average sequence length
  split_on: ","                             # Split on delimiter
  tokenizer: null                           # No tokenization
  sampling_strategy_if_longer: "uniform"    # Random sampling for long sequences
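The discretization step mentioned above happens outside EIR. One possible approach, sketched here with equal-width bins, turns raw values into a comma-separated token string that matches split_on: ",". The function name discretize, the bin token format, and the bin edges are all illustrative assumptions.

```python
def discretize(values: list[float], low: float, high: float, n_bins: int) -> str:
    """Map each value to an equal-width bin token and join tokens with commas."""
    width = (high - low) / n_bins
    tokens = []
    for value in values:
        index = int((value - low) / width)
        # Clamp so out-of-range values fall into the first or last bin.
        index = max(0, min(n_bins - 1, index))
        tokens.append(f"bin{index}")
    return ",".join(tokens)


print(discretize([0.1, 0.5, 0.95, 1.2], low=0.0, high=1.0, n_bins=4))
# bin0,bin2,bin3,bin3
```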

The vocab file is an optional text file containing a pre-defined vocabulary to use for training. If it is not passed in, the framework automatically builds the vocabulary from the training data. Passing in a vocabulary file is therefore useful if (a) you want to manually specify or limit the vocabulary used and/or (b) you want to save time by pre-computing the vocabulary.

Two formats are supported:

A .json file containing a dictionary with the vocabulary tokens as keys and the corresponding token IDs as values. For example: {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
A .json file with the results of training and saving the vocabulary of a Hugging Face BPE tokenizer. This is the file created by calling hf_tokenizer.save(). This is only valid when using the bpe tokenizer.
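A vocabulary in the first format (token to ID dictionary) can be pre-computed with the standard library. The sketch below is only one way to do it; build_vocab and save_vocab are hypothetical helpers, and whether special tokens such as padding or unknown markers need to be present in the file is not covered here.

```python
import json
from collections import Counter
from pathlib import Path


def build_vocab(
    sequences: list[str], split_on: str = " ", min_freq: int = 1
) -> dict[str, int]:
    """Count tokens, drop those below min_freq, and assign consecutive unique IDs."""
    counts: Counter = Counter()
    for sequence in sequences:
        tokens = list(sequence) if split_on == "" else sequence.split(split_on)
        counts.update(tokens)
    kept = sorted(token for token, count in counts.items() if count >= min_freq)
    return {token: index for index, token in enumerate(kept)}


def save_vocab(vocab: dict[str, int], path: Path) -> None:
    """Write the vocabulary as a JSON dictionary of token -> ID."""
    path.write_text(json.dumps(vocab, indent=2), encoding="utf-8")


print(build_vocab(["the cat sat on the mat"]))
# {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
```

Each token appears once as a key, so repeated words in the corpus do not produce duplicate entries.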

Sequence Length Strategies 

Dynamic Length Calculation:

Dynamic length options

input_type_info:
  max_length: "max"                         # Use longest sequence in dataset
  # OR
  # max_length: "average"                   # Use average length
  # OR
  # max_length: 512                         # Fixed length
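To get a feel for what "max" and "average" would resolve to on a given dataset, token lengths can be inspected up front. This is a rough sketch: length_summary is a hypothetical helper, and the exact rounding EIR applies to the average is not specified in this guide.

```python
def length_summary(sequences: list[str], split_on: str = " ") -> dict[str, float]:
    """Compute the longest and the mean sequence length, measured in tokens."""
    lengths = [
        len(list(seq) if split_on == "" else seq.split(split_on))
        for seq in sequences
    ]
    return {"max": max(lengths), "average": sum(lengths) / len(lengths)}


print(length_summary(["MKTAYIAK", "GSHMLE", "ACDEFGHIKL"], split_on=""))
# {'max': 10, 'average': 8.0}
```

If the maximum is far above the average, a fixed or average-based max_length combined with a sampling strategy may be the more economical choice.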

Handling Long Sequences:

Long sequence handling

input_type_info:
  sampling_strategy_if_longer: "uniform"      # Random sampling for training
  # OR
  # sampling_strategy_if_longer: "from_start" # Always keep the start and truncate the end

Note

Validation and test sets always use "from_start" for consistency, regardless of the training strategy.
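The difference between the two strategies can be sketched as follows. This is an illustration rather than EIR's code: crop_tokens is a hypothetical function, and it assumes that "uniform" draws a random contiguous window while "from_start" keeps the first max_length tokens.

```python
import random
from typing import Optional


def crop_tokens(
    tokens: list[str],
    max_length: int,
    strategy: str,
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Crop a token list to max_length using the given strategy."""
    if len(tokens) <= max_length:
        # Short sequences are returned unchanged (padding is out of scope here).
        return list(tokens)
    if strategy == "from_start":
        start = 0
    elif strategy == "uniform":
        rng = rng or random.Random()
        # Any start offset that leaves room for a full window is equally likely.
        start = rng.randint(0, len(tokens) - max_length)
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return list(tokens[start : start + max_length])


tokens = list("MKTAYIAKQR")
print(crop_tokens(tokens, 4, "from_start"))  # ['M', 'K', 'T', 'A']
```

Under these assumptions, "from_start" is deterministic, which fits its use for validation and test sets, while "uniform" exposes different parts of long sequences across training epochs.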

External Pretrained Models 

For leveraging pretrained language models:

Using pretrained BERT

model_config:
  model_type: "bert-base-uncased"           # Hugging Face model name
  pretrained_model: true                    # Use pretrained weights
  model_init_config:
    num_labels: 2                           # Number of output classes

See Sequence Models for the full list of supported models.

Attribution Analysis 

Enable feature importance analysis to understand which parts of sequences contribute most to predictions:

Attribution analysis setup (in globals.yaml)

attribution_analysis:
  compute_attributions: true
  max_attributions_per_class: 100          # Samples per class to analyze
  attributions_every_sample_factor: 4      # Compute every 4th evaluation

This uses Integrated Gradients to compute token-level importance scores, helping you understand model decisions.
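For intuition, Integrated Gradients attributes a prediction by averaging gradients along a straight path from a baseline input to the actual input, then scaling by the input difference. The toy sketch below applies this to a hand-written function with an analytic gradient; the model, weights, and function names are illustrative assumptions and have no connection to EIR's internals.

```python
from typing import Callable


def integrated_gradients(
    grad_fn: Callable[[list[float]], list[float]],
    x: list[float],
    baseline: list[float],
    steps: int = 50,
) -> list[float]:
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line between the baseline and the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


# Toy model: f(x) = sum_i w_i * x_i^2, with gradient 2 * w_i * x_i.
weights = [1.0, -2.0, 0.5]


def f(x: list[float]) -> float:
    return sum(w * v * v for w, v in zip(weights, x))


def grad_f(x: list[float]) -> list[float]:
    return [2.0 * w * v for w, v in zip(weights, x)]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
# The attributions sum (approximately) to f(x) - f(baseline).
print(sum(attributions), f(x) - f(baseline))
```

The final line illustrates the completeness property: the per-feature scores add up to the change in model output relative to the baseline.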

Complete Configuration Examples 

Protein Function Prediction:

Complete protein classification setup

# inputs.yaml
input_info:
  input_source: data/protein_sequences/
  input_name: protein_seq
  input_type: sequence
input_type_info:
  max_length: 1024
  split_on: ""                              # Character-level for amino acids
  min_freq: 1                               # Keep all amino acids
model_config:
  model_type: sequence-default
  model_init_config:
    embedding_dim: 128
    num_layers: 4
    num_heads: 8
    dropout: 0.10

# outputs.yaml
output_info:
  output_name: protein_function
  output_source: data/protein_labels.csv
  output_type: tabular
output_type_info:
  target_cat_columns:
    - Enzyme_Class
    - Subcellular_Location

Sentiment Analysis:

Complete sentiment analysis setup

# inputs.yaml
input_info:
  input_source: data/reviews.csv           # CSV with ID and Sequence columns
  input_name: review_text
  input_type: sequence
input_type_info:
  max_length: 512
  split_on: " "                             # Word-level tokenization
  tokenizer: "basic_english"                # Text normalization
  min_freq: 5                               # Filter rare words
model_config:
  model_type: sequence-default
  model_init_config:
    embedding_dim: 256
    num_layers: 6
    num_heads: 8
    dropout: 0.10

# outputs.yaml
output_info:
  output_name: sentiment
  output_source: data/sentiment_labels.csv
  output_type: tabular
output_type_info:
  target_cat_columns:
    - Sentiment                             # Positive/Negative