Sequence Data Guide =================== Ready-to-use configurations for sequence data analysis using Transformer-based models in EIR. - **Supported data types:** Text (NLP), protein/peptide sequences, DNA/RNA, time series, and other discrete token sequences - **Data format:** A folder with ``.txt`` files (filename is the ID) or a ``.csv`` file with columns ``"ID"`` and ``"Sequence"`` - **Models:** Built-in transformer (``sequence-default``), external pretrained models (BERT, RoBERTa, etc., see :ref:`external-sequence-models`). .. note:: **First step:** Copy the :doc:`../guides_index` global configuration as your ``globals.yaml`` .. contents:: :local: :depth: 2 Quick Start ----------- - **Use cases:** Sequence classification (sentiment, protein function), regression (binding affinity), or generation - **Data requirements:** Sequence data in text files or CSV format, labels for supervised tasks **Files needed:** .. code-block:: yaml :caption: inputs.yaml input_info: input_source: data/protein_sequences/ # Path to folder with .txt files or .csv file input_name: sequence input_type: sequence input_type_info: max_length: 512 # Sequence length (int, 'max', or 'average') # Split on characters for proteins/DNA # ("" for char-level, " " for words, # null for no splitting e.g. when using BPE tokenizer) split_on: "" tokenizer: null # No tokenizer (see advanced options below) min_freq: 2 # Minimum token frequency for vocabulary model_config: model_type: sequence-default # Built-in transformer for sequences model_init_config: embedding_dim: 128 # Token embedding dimension num_layers: 4 # Number of transformer layers num_heads: 8 # Number of attention heads per layer dropout: 0.10 # Dropout rate .. note:: The ``input_source`` can be: - A directory of ``.txt`` files where the filename (without extension) is the sample ID - A ``.csv`` file with columns ``"ID"`` and ``"Sequence"`` For protein/DNA sequences, use ``split_on: ""`` for character-level tokenization. For natural language, use ``split_on: " "`` for word-level tokenization. Alternatively, set ``split_on: null`` for no splitting, and use the `BPE tokenizer `_ (``tokenizer: "bpe"``) for an adaptive vocabulary. .. code-block:: yaml :caption: outputs.yaml output_info: output_name: sequence_label output_source: data/labels.csv # Must contain "ID" column + targets output_type: tabular output_type_info: target_cat_columns: - Function_Class # Categorical target (e.g., protein function) target_con_columns: - Binding_Affinity # Continuous target (optional) **Run command:** .. code-block:: bash eirtrain --global_configs globals.yaml \ --input_configs inputs.yaml \ --output_configs outputs.yaml About Sequence Models --------------------- **Full model configuration with all available parameters:** .. code-block:: yaml :caption: Advanced sequence configuration model_config: model_type: sequence-default model_init_config: # Architecture parameters embedding_dim: 128 # Dimension of token embeddings num_layers: 6 # Number of transformer layers num_heads: 8 # Number of attention heads dropout: 0.10 # Dropout rate in transformer layers # Advanced architecture options dim_feedforward: 512 # Feedforward network dimension # Attention mechanisms window_size: null # Local attention window (null = full attention) As always, please refer to the API documentation :ref:`sequence-configurations` for the full list of available parameters and more in-depth explanations. Common Use Cases ---------------- Natural Language Processing ^^^^^^^^^^^^^^^^^^^^^^^^^^^ For text classification, sentiment analysis, or document classification: .. code-block:: yaml :caption: Text classification setup input_type_info: max_length: 512 split_on: " " # Split on whitespace for words tokenizer: "basic_english" # English text normalization min_freq: 5 # Filter rare words Biological Sequences ^^^^^^^^^^^^^^^^^^^^ For protein, peptide, or DNA sequence analysis: .. code-block:: yaml :caption: Protein sequence setup input_type_info: max_length: 1024 # Typical protein length split_on: "" # Character-level tokenization tokenizer: null # No additional tokenization min_freq: 1 # Keep all amino acids/nucleotides Time Series Data ^^^^^^^^^^^^^^^^ For sequential numeric data represented as text (assumes they have e.g. been binned/discretized beforehand): .. code-block:: yaml :caption: Time series setup input_type_info: max_length: "average" # Use average sequence length split_on: "," # Split on delimiter tokenizer: null # No tokenization sampling_strategy_if_longer: "uniform" # Random sampling for long sequences Advanced Tokenization --------------------- **BPE (Byte Pair Encoding) Tokenization:** For subword tokenization, particularly useful for handling out-of-vocabulary words: .. code-block:: yaml :caption: BPE tokenizer configuration input_type_info: tokenizer: "bpe" adaptive_tokenizer_max_vocab_size: 10000 # Maximum vocabulary size vocab_file: null # Will be trained on your data split_on: null # BPE handles splitting internally **Custom Vocabulary:** Using a pre-defined vocabulary file: .. code-block:: yaml :caption: Custom vocabulary setup input_type_info: vocab_file: "data/custom_vocab.json" # JSON file with token->id mapping .. note:: The vocab file is a optional text file containing pre-defined vocabulary to use for the training. If this is not passed in, the framework will automatically build the vocabulary from the training data. Passing in a vocabulary file is therefore useful if (a) you want to manually specify / limit the vocabulary used and/or (b) you want to save time by pre-computing the vocabulary. Here, there are two formats supported: - A ``.json`` file containing a dictionary with the vocabulary as keys and the corresponding token IDs as values. For example: ``{"the": 0, "cat": 1, "sat": 2, "on": 3, "the": 4, "mat": 5}`` - A ``.json`` file with the results of training and saving the vocabulary of a Huggingface BPE tokenizer. This is the file create by calling ``hf_tokenizer.save()``. This is only valid when using the ``bpe`` tokenizer. Sequence Length Strategies -------------------------- **Dynamic Length Calculation:** .. code-block:: yaml :caption: Dynamic length options input_type_info: max_length: "max" # Use longest sequence in dataset # OR max_length: "average" # Use average length # OR max_length: 512 # Fixed length **Handling Long Sequences:** .. code-block:: yaml :caption: Long sequence handling input_type_info: sampling_strategy_if_longer: "uniform" # Random sampling for training # OR sampling_strategy_if_longer: "from_start" # Always truncate from beginning .. note:: Validation and test sets always use ``"from_start"`` for consistency, regardless of the training strategy. External Pretrained Models -------------------------- For leveraging pretrained language models: .. code-block:: yaml :caption: Using pretrained BERT model_config: model_type: "bert-base-uncased" # Hugging Face model name pretrained_model: true # Use pretrained weights model_init_config: num_labels: 2 # Number of output classes See :ref:`external-sequence-models` for the full list of supported models. Attribution Analysis -------------------- Enable feature importance analysis to understand which parts of sequences contribute most to predictions: .. code-block:: yaml :caption: Attribution analysis setup (in globals.yaml) attribution_analysis: compute_attributions: true max_attributions_per_class: 100 # Samples per class to analyze attributions_every_sample_factor: 4 # Compute every 4th evaluation This uses `Integrated Gradients `_ to compute token-level importance scores, helping you understand model decisions. Complete Configuration Examples ------------------------------- **Protein Function Prediction:** .. code-block:: yaml :caption: Complete protein classification setup # inputs.yaml input_info: input_source: data/protein_sequences/ input_name: protein_seq input_type: sequence input_type_info: max_length: 1024 split_on: "" # Character-level for amino acids min_freq: 1 # Keep all amino acids model_config: model_type: sequence-default model_init_config: embedding_dim: 128 num_layers: 4 num_heads: 8 dropout: 0.10 # outputs.yaml output_info: output_name: protein_function output_source: data/protein_labels.csv output_type: tabular output_type_info: target_cat_columns: - Enzyme_Class - Subcellular_Location **Sentiment Analysis:** .. code-block:: yaml :caption: Complete sentiment analysis setup # inputs.yaml input_info: input_source: data/reviews.csv # CSV with ID and Sequence columns input_name: review_text input_type: sequence input_type_info: max_length: 512 split_on: " " # Word-level tokenization tokenizer: "basic_english" # Text normalization min_freq: 5 # Filter rare words model_config: model_type: sequence-default model_init_config: embedding_dim: 256 num_layers: 6 num_heads: 8 dropout: 0.10 # outputs.yaml output_info: output_name: sentiment output_source: data/sentiment_labels.csv output_type: tabular output_type_info: target_cat_columns: - Sentiment # Positive/Negative