Frequently Asked Questions

This guide addresses common questions and issues users encounter when working with EIR, based on real user experiences.

Table of Contents

  1. Attribution Analysis

  2. Model Performance

  3. Missing Data Handling

  4. Model Architecture

  5. Prediction and Configuration

  6. Data Handling

  7. Validation and Testing

  8. Technical Issues and Performance


Attribution Analysis

Q: How do I enable attribution analysis during training?

A: Add the following to your global configuration file:

attribution_analysis:
  compute_attributions: true
  max_attributions_per_class: 100  # Samples per class to analyze
  attributions_every_sample_factor: 4  # Compute every 4th evaluation

Note: Attribution calculations are computationally expensive, especially with many output targets. Consider:

  • Using higher attributions_every_sample_factor values (e.g., 4 or 8) to reduce computation

  • Running attributions only on your best model after training

  • Allocating more computational resources when using attributions

Q: What do the attribution values mean?

A: Attribution values represent the average influence of each feature on the model’s raw output.

  • Values are not normalized to sum to 1 by default

  • They are computed with the Integrated Gradients method

  • Higher absolute values indicate stronger influence

  • They can be positive (the feature increases the output) or negative (the feature decreases the output)
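To make the sign and scale of these values more concrete, here is a minimal sketch of the general Integrated Gradients idea on a toy function. It is not EIR's implementation; the function names, the toy model, and the hand-written gradient are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], Sequence[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> List[float]:
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            totals[i] += g
    # Average gradient along the path, scaled by the input-baseline difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


# Hypothetical toy model f(x) = x0**2 + 3*x1 and its gradient.
def toy_model(x: Sequence[float]) -> float:
    return x[0] ** 2 + 3 * x[1]


def toy_grad(x: Sequence[float]) -> List[float]:
    return [2 * x[0], 3.0]


attributions = integrated_gradients(toy_grad, x=[2.0, 1.0], baseline=[0.0, 0.0])
print(attributions)
```

In this toy case the attributions add up to the difference between the model output at the input and at the baseline, which is why the raw values do not sum to 1.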

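To convert attributions to percentage contributions, you could, for example:

  1. Check the feature_importance.csv file in the attributions folder

  2. Calculate the mean attribution for each feature (use absolute values if only the magnitude matters, since positive and negative values can cancel out)

  3. Normalize the means so they sum to 1, which gives the relative importance of each feature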
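The steps above can be sketched in a few lines of Python. This is a rough illustration only: the column layout of feature_importance.csv is not described here, so the sketch works on an in-memory dictionary with made-up feature names, and it assumes you want absolute values so that opposite signs do not cancel.

```python
from typing import Dict, List


def relative_importance(attributions: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean absolute attribution per feature, rescaled so the values sum to 1."""
    mean_abs = {
        name: sum(abs(v) for v in values) / len(values)
        for name, values in attributions.items()
    }
    total = sum(mean_abs.values())
    if total == 0:
        raise ValueError("All attributions are zero; nothing to normalize.")
    return {name: value / total for name, value in mean_abs.items()}


# Hypothetical per-sample attributions for three features.
example = {
    "age": [0.30, 0.50, 0.40],
    "bmi": [-0.10, -0.30, -0.20],
    "sex": [0.05, 0.15, 0.10],
}
shares = relative_importance(example)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

If the direction of the effect matters for your analysis, keep the signed means alongside these shares rather than replacing them.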


Model Performance

Q: My model starts overfitting very quickly. What can I do?

A: Try these strategies, organized by where they are configured:

  1. Reduce batch size (in global configuration):

    # In your global configuration file
    basic_experiment:
      batch_size: 32  # Reduce from default 64
    
  2. Add regularization via mixing (in global configuration):

    # In your global configuration file
    training_control:
      mixing_alpha: 0.2  # Mixup augmentation (0.0-1.0)
    
  3. Adjust learning rate and weight decay (in global configuration):

    # In your global configuration file
    optimization:
      lr: 0.0001  # Reduce from default 0.0003
      wd: 0.001   # Increase weight decay from default 0.0001
    
  4. Enable early stopping (in global configuration):

    # In your global configuration file
    training_control:
      early_stopping_patience: 10
      early_stopping_buffer: 2000  # Optional: wait before checking
    
  5. Increase dropout in fusion module (in fusion configuration):

    # In your fusion configuration file
    model_type: mlp-residual
    model_config:
      rb_do: 0.25          # Residual block dropout (default 0.1)
      fc_do: 0.25          # Final layer dropout (default 0.1)
      stochastic_depth_p: 0.2  # Stochastic depth (default 0.1)
    
  6. Add dropout to output models (in output configuration):

    # In your output configuration file (for tabular outputs)
    output_type_info:
      # ... other settings ...
    model_config:
      model_init_config:
        rb_do: 0.25
        fc_do: 0.25
        stochastic_depth_p: 0.2
    
  7. For classification tasks, use label smoothing (in output configuration):

    # In your output configuration file
    output_type_info:
      target_cat_columns:
        - target_column
      cat_label_smoothing: 0.1  # Smooths one-hot labels
    
  8. For tabular inputs, add L1 regularization (in input configuration):

    # In your tabular input configuration file
    model_config:
      model_type: tabular
      model_init_config:
        l1: 0.0001  # L1 penalty on embeddings
    
  9. For image inputs, use data augmentation (in input configuration):

    # In your image input configuration file
    input_type_info:
      mixing_subtype: "cutmix"  # or "mixup"
      # Also uses standard augmentations by default
    
  10. Reduce model complexity by using fewer layers or smaller hidden dimensions (fusion example):

    # In your fusion configuration file
    model_config:
      layers: [1]  # Reduce from default [2]
      fc_task_dim: 128  # Reduce from default 256
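Two of the regularizers listed above, label smoothing and mixup, are easier to reason about with the arithmetic written out. The following is a minimal sketch of the textbook formulations, not EIR's internal code; the function names are hypothetical, and EIR may differ in details such as how the mixing weight is sampled.

```python
import random
from typing import List, Tuple


def smooth_one_hot(class_index: int, num_classes: int, smoothing: float) -> List[float]:
    """Standard label smoothing: (1 - s) * one_hot + s / num_classes."""
    off_value = smoothing / num_classes
    target = [off_value] * num_classes
    target[class_index] += 1.0 - smoothing
    return target


def mixup(
    x_a: List[float], x_b: List[float], alpha: float, rng: random.Random
) -> Tuple[List[float], float]:
    """Blend two inputs with a weight drawn from Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    return mixed, lam


# With smoothing 0.1 and 4 classes, the true class keeps most of the mass.
smoothed = smooth_one_hot(class_index=1, num_classes=4, smoothing=0.1)
print(smoothed)

# With a small alpha (e.g. 0.2), lam tends to be close to 0 or 1.
mixed, lam = mixup([1.0, 0.0], [0.0, 1.0], alpha=0.2, rng=random.Random(0))
print(mixed, lam)
```

Both techniques soften the training signal, which is why they tend to slow down memorization of the training set.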

Q: When should I stop training? How do I know my model is “good enough”?

A: Consider these indicators:

  1. Check validation curves: Look for plateau or degradation in validation performance

  2. Performance gap: Large gap between train/validation indicates overfitting

  3. Task requirements: Compare performance to your domain-specific needs

  4. Convergence: If performance is still improving at max epochs, increase n_epochs

  5. Multiple metrics: Check ROC-AUC, MCC, and confusion matrices, not just loss

Look at files like training_curve_ROC-AUC-MACRO.pdf and training_curve_PERF-AVERAGE.pdf in your results folder.

Q: What is this “average performance” metric?

A: EIR uses a single “average performance” metric to track overall model performance across potentially multiple tasks and output types. This metric is crucial as it controls:

  • Early stopping: Training stops when this metric doesn’t improve

  • Model checkpointing: Models are saved based on this metric (visible in filenames like model_1000_perf-average=0.8547.pt)

  • Learning rate scheduling: LR reduction on plateau uses this metric

How it’s calculated:

  1. For categorical outputs (classification):

    • Default: Average of MCC, ROC-AUC-MACRO, and AP-MACRO

    • Each metric contributes equally to the average

    • Higher is better (usually between 0 and 1, although MCC itself ranges from -1 to 1)

  2. For continuous outputs (regression):

    • Default: Average of (1.0 - LOSS), PCC, and R²

    • Note: Loss is inverted so higher is better

    • Each metric contributes equally to the average

  3. For other outputs (sequence/image generation):

    • Uses 1.0 - LOSS by default

    • Higher is better

  4. For multi-task learning:

    • Averages across all tasks

    • Each task contributes equally regardless of output type
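The calculation described above can be sketched as follows. This only mirrors the description in this FAQ and is not EIR's actual code; the function names and the example metric values are hypothetical, while the metric keys follow the names used in the configuration.

```python
from statistics import mean
from typing import Dict, List


def categorical_score(metrics: Dict[str, float]) -> float:
    """Default for classification: equal-weight mean of MCC, ROC-AUC and AP."""
    return mean([metrics["mcc"], metrics["roc-auc-macro"], metrics["ap-macro"]])


def continuous_score(metrics: Dict[str, float]) -> float:
    """Default for regression: loss is inverted so that higher is better."""
    return mean([1.0 - metrics["loss"], metrics["pcc"], metrics["r2"]])


def average_performance(task_scores: List[float]) -> float:
    """Every task contributes equally, regardless of its output type."""
    return mean(task_scores)


# Hypothetical validation metrics for one classification and one regression task.
cat_score = categorical_score({"mcc": 0.60, "roc-auc-macro": 0.90, "ap-macro": 0.75})
con_score = continuous_score({"loss": 0.20, "pcc": 0.70, "r2": 0.50})
overall = average_performance([cat_score, con_score])
print(f"{cat_score:.4f} {con_score:.4f} {overall:.4f}")
```

Note that under this scheme a loss above 1.0 makes the (1.0 - LOSS) term negative, so the scale of your targets can affect the average.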

Customizing the metric:

You can choose which metrics to include in the average:

# In your global configuration file
metrics:
  cat_averaging_metrics:
    - mcc
    - roc-auc-macro
    # Omit 'ap-macro' to exclude it
  con_averaging_metrics:
    - r2
    - pcc
    # Omit 'loss' to exclude it

Example interpretation:

  • model_950_perf-average=0.3114.pt: Model at iteration 950 with average performance of 0.3114

  • model_2000_perf-average=0.8547.pt: Model at iteration 2000 with average performance of 0.8547 (better)

Important notes:

  • This metric is computed on the validation set

  • A higher value always indicates better performance

  • For imbalanced datasets, this averaging might mask poor performance on rare classes


Missing Data Handling

Q: How does EIR handle missing data?

A: EIR handles several kinds of missing data, in both inputs and outputs:

Input Data - Missing Values Within a Tabular Modality:

For partially missing data within a modality (e.g., some NaN values in tabular columns):

  • Continuous columns: Imputed with the mean from the training set (this is 0 if the data was already mean-centered before being passed to EIR)

  • Categorical columns: Encoded as a special __NULL__ category

  • No manual imputation needed - EIR handles this automatically, but you can preprocess if desired
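For intuition, the within-modality strategy can be sketched in plain Python. This is an illustration of the idea only, not EIR's implementation; the function names are hypothetical, and only the __NULL__ category name is taken from the description above.

```python
import math
from typing import List, Optional

NULL_CATEGORY = "__NULL__"  # category name as described above


def impute_continuous(train_values: List[float], values: List[float]) -> List[float]:
    """Replace NaN entries with the mean of the observed training values."""
    observed = [v for v in train_values if not math.isnan(v)]
    train_mean = sum(observed) / len(observed)
    return [train_mean if math.isnan(v) else v for v in values]


def encode_categorical(values: List[Optional[str]]) -> List[str]:
    """Map missing categorical entries to a dedicated null category."""
    return [NULL_CATEGORY if v is None else v for v in values]


# Hypothetical columns with missing entries.
train_column = [1.5, 2.3, float("nan")]
imputed = impute_continuous(train_column, train_column)
encoded = encode_categorical(["A", "B", None])
print(imputed, encoded)
```

The key point is that the mean comes from the training set, so validation and test samples are imputed with the same value.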

Input Data - Completely Missing Modalities:

When an entire modality is missing for a sample (e.g., no image for a specific ID):

  • Tabular: Uses the within-modality strategy above

  • Images: Filled with random noise (Gaussian distribution)

  • Sequences/Text: Filled with padding tokens

  • Omics: Filled with zeros (0 values for the one-hot encoding)

  • Arrays: Filled with random noise (Gaussian distribution)

Output Data - Missing Target Values:

  • EIR excludes NaN values from loss computation - they don’t contribute to backpropagation

  • Supports partial outputs: Can have some target columns missing for specific samples

  • No imputation needed: The model learns only from available labels
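Excluding missing targets from the loss can be pictured as a masked average. The sketch below uses mean squared error purely as an example; the function name is hypothetical, and the loss function and reduction EIR actually uses may differ.

```python
import math
from typing import List


def masked_mse(predictions: List[float], targets: List[float]) -> float:
    """Mean squared error over the samples whose target is not NaN."""
    pairs = [(p, t) for p, t in zip(predictions, targets) if not math.isnan(t)]
    if not pairs:
        return 0.0  # no labels available, so nothing to learn from
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)


# The second sample has no label and is ignored.
loss = masked_mse([1.0, 2.0, 3.0], [1.5, float("nan"), 2.0])
print(loss)  # 0.625
```

Because the masked sample never enters the sum, it produces no gradient for that target.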

Best Practices:

  1. Preprocessing: You may still want to filter features/samples with excessive missing values

Example: Multi-modal with Missing Data

Tabular Data

ID       Feature1  Feature2  Feature3
sample1  1.5       NaN       A
sample2  2.3       0.8       B
sample3  NaN       1.2       NaN

# images folder, note that sample2 is missing
sample1.jpg
sample3.jpg

EIR will automatically handle the NaN values in the tabular features (Feature1, Feature2, and Feature3) and the missing image for sample2.

Model Architecture

Q: What’s the difference between mlp-residual and regular MLP?

A: The mlp-residual model uses residual blocks with:

  • Skip connections

  • Layer normalization via RMSNorm

  • GELU activation

  • Stochastic depth option

  • LayerScale for better training stability

Q: How do I interpret the model architecture?

A: Check model_info.txt in your experiment folder.


Prediction and Configuration

Q: Why does eirpredict require the global_configs file?

A: The global configuration contains settings needed for prediction:

  • Batch size (might want to increase for faster inference)

  • Attribution settings (if computing on test set)

  • Dataloader workers

  • Other runtime parameters

These aren’t just training parameters - they affect how predictions are computed.

Q: How do I predict on data without labels?

A: Set output_source: null in your output configuration:

output_info:
  output_name: my_output
  output_source: null  # Instead of path to labels
  output_type: tabular
output_type_info:
  target_cat_columns:
    - target_column

Q: Which model checkpoint should I use for predictions?

A: Generally, use the model with the best validation performance:

  • Check the filename: model_950_perf-average=0.3114.pt

  • The number (950) is the iteration

  • perf-average shows the validation performance

  • Higher is better for most metrics
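If a run has produced many checkpoints, the selection can be scripted. Below is a small sketch that relies only on the filename pattern shown above; the helper name and the regular expression are assumptions, and where the checkpoints are stored is left to you.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches names such as model_950_perf-average=0.3114.pt
CHECKPOINT_PATTERN = re.compile(r"model_(\d+)_perf-average=(-?\d+(?:\.\d+)?)\.pt$")


def best_checkpoint(filenames: Iterable[str]) -> Optional[Tuple[str, int, float]]:
    """Return (filename, iteration, score) for the highest perf-average."""
    best = None
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match is None:
            continue  # skip files that are not checkpoints
        iteration, score = int(match.group(1)), float(match.group(2))
        if best is None or score > best[2]:
            best = (name, iteration, score)
    return best


names = [
    "model_950_perf-average=0.3114.pt",
    "model_2000_perf-average=0.8547.pt",
    "model_info.txt",
]
selected = best_checkpoint(names)
print(selected)  # ('model_2000_perf-average=0.8547.pt', 2000, 0.8547)
```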


Data Handling

Q: How should I format time series data?

A: Time series data can, for example, be formatted as sequences:

Sequence Data

ID       Sequence
sample1  val1 val2 val3 val4 val5
sample2  val1 val2 val3 val4 val5

Configuration example:

input_type_info:
  max_length: 48
  split_on: " "
  sampling_strategy_if_longer: "from_start"

Note that you can also store the sequences as separate .txt files, where the filename is the sample ID and the file content is the sequence values.
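As a rough sketch, numeric time series held in memory could be turned into the tabular layout shown earlier like this. It assumes a CSV file with ID and Sequence columns and a single space as separator (matching split_on above); the helper name and the example values are hypothetical.

```python
import csv
import io
from typing import Dict, List


def series_to_rows(series: Dict[str, List[float]], sep: str = " ") -> List[List[str]]:
    """Build one row per sample, joining the values with the separator."""
    rows = [["ID", "Sequence"]]
    for sample_id, values in series.items():
        rows.append([sample_id, sep.join(str(v) for v in values)])
    return rows


# Hypothetical measurements for two samples.
data = {
    "sample1": [0.1, 0.4, 0.35],
    "sample2": [1.2, 1.1, 0.9],
}
rows = series_to_rows(data)

# Written to an in-memory buffer here; use open(path, "w", newline="") for a file.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

The separator passed here should be the same string you set as split_on in the input configuration.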


Validation and Testing

Q: How do I ensemble multiple model runs?

A: For better stability, train multiple models with different seeds:

EIR_SEED=0 eirtrain ...
EIR_SEED=1 eirtrain ...
EIR_SEED=2 eirtrain ...

Then average predictions across models.
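Averaging can be done in a few lines once the predictions from each run are loaded. The sketch below assumes you have already read each run's predictions into a dictionary keyed by sample ID; the format of the prediction files is not covered here, and the function name and example values are hypothetical.

```python
from typing import Dict, List


def average_predictions(runs: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Average per-sample prediction vectors over runs, using shared sample IDs."""
    shared_ids = set(runs[0]).intersection(*runs[1:])
    averaged = {}
    for sample_id in sorted(shared_ids):
        vectors = [run[sample_id] for run in runs]
        # zip(*vectors) groups the i-th value of every run together.
        averaged[sample_id] = [sum(col) / len(col) for col in zip(*vectors)]
    return averaged


# Hypothetical class probabilities from three runs with different seeds.
runs = [
    {"sample1": [0.8, 0.2], "sample2": [0.3, 0.7]},
    {"sample1": [0.6, 0.4], "sample2": [0.5, 0.5]},
    {"sample1": [0.7, 0.3], "sample2": [0.4, 0.6]},
]
ensemble = average_predictions(runs)
print(ensemble)
```

For classification, averaging probabilities is usually preferable to averaging predicted class labels, since it keeps the information about each model's confidence.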


Technical Issues and Performance

Q: Attribution analysis makes training very slow. What can I do?

A: Several strategies:

  1. Increase sampling interval:

    attributions_every_sample_factor: 8  # or higher
    
  2. Reduce samples analyzed:

    max_attributions_per_class: 50  # instead of 100+
    
  3. Run post-training: Train without attributions, then run eirpredict with attributions enabled

  4. Allocate more resources: Increase CPU/RAM allocation on your cluster

Q: How do I reduce training time?

A: Try these optimizations:

  1. Enable model compilation on GPU/CUDA devices:

    # In your global configuration file
    model:
      compile_model: true
    
  2. Use mixed precision training (especially on modern GPUs):

    # In your global configuration file
    accelerator:
      precision: "16-mixed"  # or "bf16-mixed" for newer GPUs
    
  3. Load data into memory (if you have enough RAM):

    # In your global configuration file
    basic_experiment:
      memory_dataset: true
    
  4. Increase dataloader workers (for CPU-bound data loading):

    # In your global configuration file
    basic_experiment:
      dataloader_workers: 8  # Adjust based on CPU cores
    
  5. Use gradient accumulation (simulate larger batches without more memory):

    # In your global configuration file
    optimization:
      gradient_accumulation_steps: 4  # Effective batch = batch_size * 4
    
  6. Reduce evaluation frequency:

    # In your global configuration file
    evaluation_checkpoint:
      sample_interval: 500  # instead of 200
      checkpoint_interval: 500
    
  7. Feature selection: Use fewer input features based on prior knowledge or attributions

  8. Smaller models: Reduce layers or hidden dimensions in fusion/output configs

  9. Early stopping: Stop when validation performance plateaus

Quick wins for GPU training:

  • Set compile_model: true and precision: "16-mixed"

  • Use memory_dataset: true if your dataset fits in RAM

  • Increase dataloader_workers to 2-4

Note: Model compilation may not work with all architectures. Mixed precision can slightly affect model accuracy but usually provides significant speedup with minimal impact.


Need More Help?

  • Check the official documentation

  • Review tutorials for specific use cases

  • For genomics-specific tasks, consider EIR-auto-GP

  • Examine the generated model_info.txt for architecture details