.. _genomics-guide:

Genomics Guide
==============

Ready-to-use configurations for genetic data analysis using the Genome Local Network (GLN) model.

- **Data format:** NumPy arrays + .bim variant files (from PLINK processing)
- **Model:** GLN - specifically designed for large-scale genomics data

.. note::
   **First step:** Copy the :doc:`../guides_index` global configuration as your ``globals.yaml``

.. contents::
   :local:
   :depth: 2

Quick Start
-----------

- **Use case:** Genomic prediction (for example T2D disease risk, or continuous traits like blood metabolites) from SNP data
- **Data requirements:** Individual-level genotype data, phenotype labels

**Files needed:**

.. code-block:: yaml
   :caption: inputs.yaml

   input_info:
     input_source: data/genotype_arrays/  # Path to your .npy genotype files
     input_name: genotype
     input_type: omics

   input_type_info:
     snp_file: data/variants.bim  # PLINK .bim file with variant information

   model_config:
     model_type: genome-local-net  # GLN model designed for genomics

.. note::
   The ``input_source`` should contain NumPy arrays of shape ``(3, n_SNPs)``,
   where each SNP is represented by 3 values (one-hot encoded). Missing genotypes
   are represented as all-zeros across the 3 channels. You can convert your
   ``.bed/.bim/.fam`` files to EIR format using the `plink pipelines `_ tool.

.. code-block:: yaml
   :caption: outputs.yaml

   output_info:
     output_name: disease_risk
     output_source: data/phenotypes.csv  # Must contain "ID" column + target
     output_type: tabular

   # Note: You don't have to include both target_cat_columns and
   # target_con_columns. List either one or both, and each may
   # contain one or more target columns.
   output_type_info:
     target_cat_columns:
       - Disease_Status  # Categorical target column
     target_con_columns:
       - BMI  # Continuous target column (optional)

**Run command:**

.. code-block:: bash

   eirtrain --global_configs globals.yaml \
            --input_configs inputs.yaml \
            --output_configs outputs.yaml

About the GLN Model
-------------------

**Full model configuration with all available parameters:**

.. code-block:: yaml
   :caption: Advanced GLN configuration

   model_config:
     model_type: genome-local-net
     model_init_config:
       # Architecture control
       layers: null                      # Auto-determine layers based on cutoff
       cutoff: 1024                      # Feature dimension where auto setup stops
       direction: "down"                 # "down" (compress) or "up" (expand)

       # Kernel configuration
       kernel_width: 12                  # Width of locally connected kernels (4 SNPs × 3 channels)
       first_kernel_expansion: -2        # Shrink first kernel (negative = divide, positive = multiply)
       num_lcl_chunks: null              # Alternative: split input into N chunks

       # Kernel/feature configuration
       channel_exp_base: 2               # Power of 2 for number of channels/weights applied to each local patch (2^2 = 4 channels)
       first_channel_expansion: 1        # Channel multiplier for first layer

       # Regularization
       rb_do: 0.10                       # Dropout in residual blocks
       stochastic_depth_p: 0.00          # Probability of dropping entire layers
       l1: 0.00                          # L1 regularization on first layer

       # Advanced features
       attention_inclusion_cutoff: null  # Add attention when feature length > cutoff

Click the figure below for more detail on the GLN model architecture and how the parameters above affect it:

.. figure:: static/img/gln.svg
   :width: 100%
   :align: center

|

Large-Scale Cohorts (UK Biobank Scale)
--------------------------------------

**Use case:** Analysis on 100K+ samples with 500K+ variants

**Challenge:** Optimal parameter selection becomes more important at this scale

For large-scale genomics, parameter tuning becomes more dataset-dependent.
That's one of the reasons we created the `EIR-auto-GP `_ project, which performs
automated parameter selection based on data characteristics:

**Automated parameter selection examples:**

- **Learning rate** scales with SNP count: 1e-3 (< 1K SNPs) → 1e-5 (> 2M SNPs)
- **GLN kernel expansion** adapts to data size: -4 (small) → +8 (larger datasets)
- **Memory management** automatically detects available RAM and dataset size
- **Batch size & validation** dynamically sized based on sample count
- **Early stopping buffer** scales with iterations per epoch

**Manual Parameter Selection Guide**

If configuring manually, these are some of the criteria we found useful (they are also implemented in EIR-auto-GP):

**Learning Rate Selection:**

.. code-block:: yaml
   :caption: Choose learning rate based on SNP count

   # Alternatives: keep only the block that matches your SNP count.

   # < 1,000 SNPs
   optimization:
     lr: 0.001

   # 1K - 10K SNPs
   optimization:
     lr: 0.0005

   # 10K - 100K SNPs
   optimization:
     lr: 0.0002

   # 100K - 500K SNPs
   optimization:
     lr: 0.0001

   # 500K - 2M SNPs
   optimization:
     lr: 0.00005

   # > 2M SNPs
   optimization:
     lr: 0.00001

**GLN Kernel Parameters:**

.. code-block:: yaml
   :caption: Kernel expansion scales with data complexity

   # Alternatives: keep only the block that matches your SNP count.
   # With 3 channels per SNP, an effective kernel width of 3 covers 1 SNP.

   # < 1K SNPs: Smaller kernels for limited data
   model_init_config:
     kernel_width: 12
     first_kernel_expansion: -4  # 12/4 = 3 (covers 1 SNP)

   # 1K - 10K SNPs
   model_init_config:
     kernel_width: 12
     first_kernel_expansion: -2  # 12/2 = 6 (covers 2 SNPs)

   # 10K - 100K SNPs
   model_init_config:
     kernel_width: 12
     first_kernel_expansion: 1   # 12*1 = 12 (covers 4 SNPs)

   # 100K - 500K SNPs
   model_init_config:
     kernel_width: 12
     first_kernel_expansion: 2   # 12*2 = 24 (covers 8 SNPs)

   # > 500K SNPs: Higher context to reduce feature size more aggressively
   model_init_config:
     kernel_width: 12
     first_kernel_expansion: 4   # 12*4 = 48 (covers 16 SNPs)

**Memory and Performance:**

.. code-block:: yaml
   :caption: Resource management based on dataset size

   basic_experiment:
     # Memory dataset decision: load into RAM only if dataset_size < 60% of available RAM
     # Formula: (n_snps × n_samples × 4 bytes) < (0.6 × RAM)
     memory_dataset: false      # Keep false for large datasets that exceed this threshold

     # Batch size: balance memory usage with training stability
     batch_size: 64             # Standard for most genomics datasets
     # batch_size: 32           # Reduce if GPU memory limited

     # Workers: scale with CPU cores and memory usage
     dataloader_workers: 8      # ~80% of available cores for disk loading
     # dataloader_workers: 0    # Use when memory_dataset: true

   training_control:
     # Early stopping buffer: min(5000, iterations_per_epoch × 5)
     early_stopping_buffer: 2000  # Large datasets need more burn-in time

     # Sample interval: min(1000, iterations_per_epoch)
     sample_interval: 1000        # Less frequent evaluation for efficiency

**Complete large-scale configuration:**

.. code-block:: yaml
   :caption: globals.yaml

   basic_experiment:
     output_folder: "results/ukb_analysis"
     n_epochs: 50
     batch_size: 64
     memory_dataset: false
     dataloader_workers: 8

   optimization:
     lr: 0.0001  # For 100K-500K SNPs

   training_control:
     early_stopping_patience: 15
     early_stopping_buffer: 2000

   attribution_analysis:
     compute_attributions: true
     max_attributions_per_class: 1024

.. code-block:: yaml
   :caption: inputs.yaml

   input_info:
     input_source: data/ukb_genotypes/
     input_name: genotype
     input_type: omics

   input_type_info:
     snp_file: data/ukb_variants.bim

   model_config:
     model_type: genome-local-net
     model_init_config:
       kernel_width: 12
       first_kernel_expansion: 2  # For 100K-500K SNPs

.. note::
   **Recommended approach:** Use `EIR-auto-GP `_ for automatic parameter
   optimization on large-scale data. It handles the complexity of parameter
   selection based on your specific dataset characteristics.
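
**Sketch: the manual selection rules as code**

If you would rather compute the manual settings than look them up, the thresholds and formulas from the manual parameter selection guide in this section can be written as a few small Python helpers. This is only an illustrative sketch, not code from EIR or EIR-auto-GP: the function names are hypothetical, and the handling of exact boundary values (for example, whether exactly 1,000 SNPs counts as "< 1K" or "1K - 10K") is an assumption, since the guide does not specify it.

```python
from bisect import bisect_right

# SNP-count thresholds and values copied from the manual selection guide.
# Assumption: a count exactly on a threshold falls into the higher bracket.
_LR_THRESHOLDS = [1_000, 10_000, 100_000, 500_000, 2_000_000]
_LR_VALUES = [1e-3, 5e-4, 2e-4, 1e-4, 5e-5, 1e-5]

_KERNEL_THRESHOLDS = [1_000, 10_000, 100_000, 500_000]
_KERNEL_EXPANSIONS = [-4, -2, 1, 2, 4]


def select_learning_rate(n_snps: int) -> float:
    """Return the learning rate bracket for a given SNP count."""
    return _LR_VALUES[bisect_right(_LR_THRESHOLDS, n_snps)]


def select_first_kernel_expansion(n_snps: int) -> int:
    """Return the first_kernel_expansion bracket for a given SNP count."""
    return _KERNEL_EXPANSIONS[bisect_right(_KERNEL_THRESHOLDS, n_snps)]


def fits_in_memory(n_snps: int, n_samples: int, available_ram_bytes: int) -> bool:
    """Apply the guide's rule: n_snps * n_samples * 4 bytes < 0.6 * RAM."""
    dataset_bytes = n_snps * n_samples * 4
    return dataset_bytes < 0.6 * available_ram_bytes


def early_stopping_buffer(iterations_per_epoch: int) -> int:
    """min(5000, iterations_per_epoch * 5), as stated in the guide."""
    return min(5000, iterations_per_epoch * 5)


def sample_interval(iterations_per_epoch: int) -> int:
    """min(1000, iterations_per_epoch), as stated in the guide."""
    return min(1000, iterations_per_epoch)


if __name__ == "__main__":
    n_snps, n_samples = 300_000, 100_000
    ram_bytes = 64 * 1024**3  # hypothetical machine with 64 GiB of RAM
    print("lr:", select_learning_rate(n_snps))
    print("first_kernel_expansion:", select_first_kernel_expansion(n_snps))
    print("memory_dataset:", fits_in_memory(n_snps, n_samples, ram_bytes))
```

The returned values can then be copied into ``globals.yaml`` and ``inputs.yaml`` by hand. EIR-auto-GP remains the recommended route when you want these decisions made automatically.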