Sequence Data Guide

Ready-to-use configurations for sequence data analysis using Transformer-based models in EIR.

  • Supported data types: Text (NLP), protein/peptide sequences, DNA/RNA, time series, and other discrete token sequences

  • Data format: A folder with .txt files (filename is the ID) or a .csv file with columns "ID" and "Sequence"

  • Models: Built-in transformer (sequence-default), external pretrained models (BERT, RoBERTa, etc., see Sequence Models).

Note

First step: Copy the global configuration from the Configuration Guides and save it as globals.yaml.

Quick Start

  • Use cases: Sequence classification (sentiment, protein function), regression (binding affinity), or generation

  • Data requirements: Sequence data in text files or CSV format, labels for supervised tasks

Files needed:

inputs.yaml
input_info:
  input_source: data/protein_sequences/      # Path to folder with .txt files or .csv file
  input_name: sequence
  input_type: sequence

input_type_info:
  max_length: 512                           # Sequence length (int, 'max', or 'average')

  # Split on characters for proteins/DNA
  # ("" for char-level, " " for words,
  # null for no splitting e.g. when using BPE tokenizer)
  split_on: ""

  tokenizer: null                           # No tokenizer (see advanced options below)
  min_freq: 2                               # Minimum token frequency for vocabulary

model_config:
  model_type: sequence-default              # Built-in transformer for sequences
  model_init_config:
    embedding_dim: 128                      # Token embedding dimension
    num_layers: 4                           # Number of transformer layers
    num_heads: 8                            # Number of attention heads per layer
    dropout: 0.10                           # Dropout rate

Note

The input_source can be:

  • A directory of .txt files where the filename (without extension) is the sample ID

  • A .csv file with columns "ID" and "Sequence"

For protein/DNA sequences, use split_on: "" for character-level tokenization. For natural language, use split_on: " " for word-level tokenization.

Alternatively, set split_on: null for no splitting, and use the BPE tokenizer (tokenizer: "bpe") for an adaptive vocabulary.
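To make the split_on options concrete, here is a minimal Python sketch of what each setting does to a raw sequence. The helper split_sequence is hypothetical and only mimics the behaviour described in this note; it is not EIR's internal implementation.

```python
from typing import List, Optional


def split_sequence(sequence: str, split_on: Optional[str]) -> List[str]:
    """Hypothetical helper that mimics the effect of the split_on option."""
    if split_on is None:
        # No splitting: the whole string is handed to the tokenizer (e.g. BPE).
        return [sequence]
    if split_on == "":
        # Character-level: one token per character (amino acids, nucleotides).
        return list(sequence)
    # Otherwise split on the given delimiter (e.g. " " for words).
    return sequence.split(split_on)


protein_tokens = split_sequence("MKTAYIAK", "")
text_tokens = split_sequence("the cat sat", " ")
unsplit = split_sequence("the cat sat", None)

print(protein_tokens)  # ['M', 'K', 'T', 'A', 'Y', 'I', 'A', 'K']
print(text_tokens)     # ['the', 'cat', 'sat']
print(unsplit)         # ['the cat sat']
```

Character-level splitting keeps the vocabulary tiny (about 20 amino acids or 4 nucleotides), which is why min_freq can usually stay low for biological sequences.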

outputs.yaml
output_info:
  output_name: sequence_label
  output_source: data/labels.csv           # Must contain "ID" column + targets
  output_type: tabular

output_type_info:
  target_cat_columns:
    - Function_Class                        # Categorical target (e.g., protein function)
  target_con_columns:
    - Binding_Affinity                      # Continuous target (optional)
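The link between inputs and outputs is the sample ID. The following Python sketch writes a few toy sequences to a folder of .txt files and a matching labels CSV, then checks that the IDs line up. The sample IDs, sequences, and label values are invented for illustration; only the "ID" column and the target column names come from the configuration above.

```python
import csv
import tempfile
from pathlib import Path

# Toy data, invented for illustration only.
samples = {
    "protein_001": {"seq": "MKTAYIAKQR", "Function_Class": "kinase", "Binding_Affinity": 7.2},
    "protein_002": {"seq": "GSHMLEDPVA", "Function_Class": "protease", "Binding_Affinity": 5.9},
}

root = Path(tempfile.mkdtemp())
seq_dir = root / "protein_sequences"
seq_dir.mkdir()

# One .txt file per sample; the filename (without extension) is the sample ID.
for sample_id, record in samples.items():
    (seq_dir / f"{sample_id}.txt").write_text(record["seq"])

# Labels file: an "ID" column plus one column per target.
labels_path = root / "labels.csv"
fieldnames = ["ID", "Function_Class", "Binding_Affinity"]
with labels_path.open("w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    for sample_id, record in samples.items():
        row = {"ID": sample_id}
        row.update({name: record[name] for name in fieldnames[1:]})
        writer.writerow(row)

# Sanity check: the IDs in the folder and in the labels file should agree.
file_ids = {path.stem for path in seq_dir.glob("*.txt")}
with labels_path.open(newline="") as handle:
    label_ids = {row["ID"] for row in csv.DictReader(handle)}
ids_match = file_ids == label_ids
print(ids_match)
```

Running a check like this before training is a cheap way to catch mismatched or misspelled IDs.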

Run command:

eirtrain --global_configs globals.yaml \
         --input_configs inputs.yaml \
         --output_configs outputs.yaml

About Sequence Models

An extended model configuration showing commonly tuned parameters:

Advanced sequence configuration
model_config:
  model_type: sequence-default
  model_init_config:
    # Architecture parameters
    embedding_dim: 128                      # Dimension of token embeddings
    num_layers: 6                           # Number of transformer layers
    num_heads: 8                            # Number of attention heads
    dropout: 0.10                           # Dropout rate in transformer layers

    # Advanced architecture options
    dim_feedforward: 512                    # Feedforward network dimension

    # Attention mechanisms
    window_size: null                       # Local attention window (null = full attention)

As always, please refer to the API documentation Sequence Data Configuration for the full list of available parameters and more in-depth explanations.

Common Use Cases

Natural Language Processing

For text classification, sentiment analysis, or document classification:

Text classification setup
input_type_info:
  max_length: 512
  split_on: " "                             # Split on whitespace for words
  tokenizer: "basic_english"                # English text normalization
  min_freq: 5                               # Filter rare words

Biological Sequences

For protein, peptide, or DNA sequence analysis:

Protein sequence setup
input_type_info:
  max_length: 1024                          # Long enough for most proteins
  split_on: ""                              # Character-level tokenization
  tokenizer: null                           # No additional tokenization
  min_freq: 1                               # Keep all amino acids/nucleotides

Time Series Data

For sequential numeric data represented as text (this assumes the values have already been binned or discretized into tokens):

Time series setup
input_type_info:
  max_length: "average"                     # Use average sequence length
  split_on: ","                             # Split on delimiter
  tokenizer: null                           # No tokenization
  sampling_strategy_if_longer: "uniform"    # Random sampling for long sequences
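As one way to prepare such input, the sketch below bins numeric values into equal-width bins and joins the resulting tokens with commas, matching split_on: ",". The function name, the "b<index>" token naming, and the choice of equal-width bins are all assumptions made for this example; any discretization scheme that yields discrete tokens would work.

```python
from typing import List


def discretize(values: List[float], n_bins: int, low: float, high: float) -> str:
    """Map each value to an equal-width bin and join the bin tokens with commas."""
    width = (high - low) / n_bins
    tokens = []
    for value in values:
        index = int((value - low) / width)
        # Clamp so values at or beyond the range edges land in the first or last bin.
        index = min(max(index, 0), n_bins - 1)
        tokens.append(f"b{index}")
    return ",".join(tokens)


line = discretize([0.05, 0.42, 0.97, 1.0], n_bins=10, low=0.0, high=1.0)
print(line)  # b0,b4,b9,b9
```

Each resulting line can then be stored as one sample, either in its own .txt file or in the "Sequence" column of a CSV.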

Advanced Tokenization

BPE (Byte Pair Encoding) Tokenization:

For subword tokenization, particularly useful for handling out-of-vocabulary words:

BPE tokenizer configuration
input_type_info:
  tokenizer: "bpe"
  adaptive_tokenizer_max_vocab_size: 10000  # Maximum vocabulary size
  vocab_file: null                          # Will be trained on your data
  split_on: null                            # BPE handles splitting internally

Custom Vocabulary:

Using a pre-defined vocabulary file:

Custom vocabulary setup
input_type_info:
  vocab_file: "data/custom_vocab.json"     # JSON file with token->id mapping

Note

The vocab file is an optional file containing a pre-defined vocabulary to use for training. If it is not passed in, the framework automatically builds the vocabulary from the training data. Passing in a vocabulary file is therefore useful if (a) you want to manually specify or limit the vocabulary used and/or (b) you want to save time by pre-computing the vocabulary.

Two formats are supported:

  • A .json file containing a dictionary with the vocabulary tokens as keys and the corresponding token IDs as values. Each token may appear only once. For example: {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

  • A .json file with the results of training and saving the vocabulary of a Hugging Face BPE tokenizer. This is the file created by calling hf_tokenizer.save(). This is only valid when using the bpe tokenizer.
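The first format (a plain token-to-ID dictionary) can be produced with a few lines of Python. The sketch below counts tokens in a toy corpus, drops those below a frequency threshold, and writes the mapping to JSON. The build_vocab helper and the corpus are hypothetical, and whether special tokens (for example padding or unknown tokens) should be included in the file is not covered here, so check the API documentation before relying on this.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable, List


def build_vocab(token_lists: Iterable[List[str]], min_freq: int = 1) -> Dict[str, int]:
    """Assign consecutive IDs to tokens that occur at least min_freq times."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    kept = sorted(token for token, count in counts.items() if count >= min_freq)
    return {token: index for index, token in enumerate(kept)}


# Toy corpus, already split on whitespace.
corpus = [
    "the cat sat on the mat".split(" "),
    "the dog sat".split(" "),
]
vocab = build_vocab(corpus, min_freq=2)

# Write the mapping to a JSON file that vocab_file could point to.
vocab_path = Path(tempfile.mkdtemp()) / "custom_vocab.json"
vocab_path.write_text(json.dumps(vocab, indent=2))
print(vocab)  # {'sat': 0, 'the': 1}
```

Because a dictionary cannot hold the same key twice, building the mapping this way also guarantees that every token gets exactly one ID.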

Sequence Length Strategies

Dynamic Length Calculation:

Dynamic length options
input_type_info:
  max_length: "max"                         # Use longest sequence in dataset
  # OR
  # max_length: "average"                   # Use average length
  # OR
  # max_length: 512                         # Fixed length
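To see what the dynamic options would resolve to for a given dataset, the lengths can be computed directly. This is a rough sketch with a hypothetical resolve_max_length helper; in particular, rounding the average down is an assumption of this example and may differ from what the framework does.

```python
from statistics import mean
from typing import List, Union


def resolve_max_length(token_lists: List[List[str]], max_length: Union[int, str]) -> int:
    """Hypothetical helper: turn 'max', 'average', or an int into a concrete length."""
    lengths = [len(tokens) for tokens in token_lists]
    if max_length == "max":
        return max(lengths)
    if max_length == "average":
        # Rounding down is an assumption made for this sketch.
        return int(mean(lengths))
    return int(max_length)


# Three toy character-level sequences with lengths 8, 4, and 12.
sequences = [list("MKTAYIAK"), list("GSHM"), list("MLEDPVAAKQRT")]

print(resolve_max_length(sequences, "max"))      # 12
print(resolve_max_length(sequences, "average"))  # 8
print(resolve_max_length(sequences, 512))        # 512
```

With "max" no sequence is cut but every sample occupies the longest length, while "average" trades some truncation for lower memory use.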

Handling Long Sequences:

Long sequence handling
input_type_info:
  sampling_strategy_if_longer: "uniform"      # Random sampling for training
  # OR
  # sampling_strategy_if_longer: "from_start" # Always keep the start of the sequence, cut the rest

Note

Validation and test sets always use "from_start" for consistency, regardless of the training strategy.
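The difference between the two strategies can be sketched in Python as follows. The crop_tokens helper is hypothetical, and treating "uniform" as picking a random contiguous window of max_length tokens is an assumption of this example rather than a statement about the exact implementation.

```python
import random
from typing import List


def crop_tokens(
    tokens: List[str], max_length: int, strategy: str, rng: random.Random
) -> List[str]:
    """Hypothetical helper showing how a too-long sequence could be shortened."""
    if len(tokens) <= max_length:
        return tokens
    if strategy == "from_start":
        # Deterministic: keep the first max_length tokens.
        return tokens[:max_length]
    if strategy == "uniform":
        # Assumed behaviour: a contiguous window starting at a random position.
        start = rng.randint(0, len(tokens) - max_length)
        return tokens[start:start + max_length]
    raise ValueError(f"Unknown strategy: {strategy}")


tokens = list("MKTAYIAKQR")
rng = random.Random(0)

head = crop_tokens(tokens, 4, "from_start", rng)
window = crop_tokens(tokens, 4, "uniform", rng)

print(head)         # ['M', 'K', 'T', 'A']
print(len(window))  # 4
```

Random windows expose the model to different parts of long sequences across epochs, while the deterministic option keeps evaluation results reproducible.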

External Pretrained Models

For leveraging pretrained language models:

Using pretrained BERT
model_config:
  model_type: "bert-base-uncased"           # Hugging Face model name
  pretrained_model: true                    # Use pretrained weights
  model_init_config:
    num_labels: 2                           # Number of output classes

See Sequence Models for the full list of supported models.

Attribution Analysis

Enable feature importance analysis to understand which parts of sequences contribute most to predictions:

Attribution analysis setup (in globals.yaml)
attribution_analysis:
  compute_attributions: true
  max_attributions_per_class: 100          # Samples per class to analyze
  attributions_every_sample_factor: 4      # Compute every 4th evaluation

This uses Integrated Gradients to compute token-level importance scores, helping you understand model decisions.
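For intuition about the method, here is a toy, dependency-free sketch of Integrated Gradients on a small logistic model. It is not how EIR computes attributions internally; the model, weights, and inputs are invented, and the path integral is approximated with a simple midpoint sum.

```python
import math
from typing import List


def model(x: List[float], weights: List[float]) -> float:
    """Toy logistic model: sigmoid of a weighted sum."""
    z = sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x: List[float], weights: List[float]) -> List[float]:
    """Analytic gradient of the toy model with respect to its inputs."""
    p = model(x, weights)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(
    x: List[float], baseline: List[float], weights: List[float], steps: int = 200
) -> List[float]:
    """Average the gradient along the straight path from baseline to x."""
    totals = [0.0] * len(x)
    for step in range(steps):
        alpha = (step + 0.5) / steps  # midpoint of each sub-interval
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        grads = gradient(point, weights)
        totals = [t + g for t, g in zip(totals, grads)]
    # Scale the averaged gradient by the distance from the baseline.
    return [(v - b) * t / steps for v, b, t in zip(x, baseline, totals)]


weights = [0.5, -1.0, 0.25]
x = [1.0, 0.0, 2.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, weights)
output_change = model(x, weights) - model(baseline, weights)
print(attributions)
print(sum(attributions), output_change)
```

The attributions sum (approximately) to the difference between the model output at the input and at the baseline, which is the completeness property that makes the scores interpretable as contributions.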

Complete Configuration Examples

Protein Function Prediction:

Complete protein classification setup
# inputs.yaml
input_info:
  input_source: data/protein_sequences/
  input_name: protein_seq
  input_type: sequence
input_type_info:
  max_length: 1024
  split_on: ""                              # Character-level for amino acids
  min_freq: 1                               # Keep all amino acids
model_config:
  model_type: sequence-default
  model_init_config:
    embedding_dim: 128
    num_layers: 4
    num_heads: 8
    dropout: 0.10

# outputs.yaml
output_info:
  output_name: protein_function
  output_source: data/protein_labels.csv
  output_type: tabular
output_type_info:
  target_cat_columns:
    - Enzyme_Class
    - Subcellular_Location

Sentiment Analysis:

Complete sentiment analysis setup
# inputs.yaml
input_info:
  input_source: data/reviews.csv           # CSV with ID and Sequence columns
  input_name: review_text
  input_type: sequence
input_type_info:
  max_length: 512
  split_on: " "                             # Word-level tokenization
  tokenizer: "basic_english"                # Text normalization
  min_freq: 5                               # Filter rare words
model_config:
  model_type: sequence-default
  model_init_config:
    embedding_dim: 256
    num_layers: 6
    num_heads: 8
    dropout: 0.10

# outputs.yaml
output_info:
  output_name: sentiment
  output_source: data/sentiment_labels.csv
  output_type: tabular
output_type_info:
  target_cat_columns:
    - Sentiment                             # Positive/Negative