01 – Genotype Tutorial: Ancestry Prediction
A - Setup
In this tutorial, we will be using genotype data to train deep learning models for ancestry prediction.
Note
This tutorial goes into some detail about how EIR works, and how to use it. If you are more interested in quickly training the deep learning models for genomic prediction, the EIR-auto-GP project might be of use to you.
To start, please download processed sample data (or process your own .bed, .bim, .fam files with e.g. plink pipelines). The sample data we are using here for predicting ancestry is the public Human Origins dataset, but the same approach can just as well be used for e.g. disease predictions in other cohorts (for example the UK Biobank).
Examining the sample data, we can see the following structure:
processed_sample_data
├── arrays # Genotype data as NumPy arrays
├── data_final_gen.bim # Variant information file accompanying the genotype arrays
└── human_origins_labels.csv # Contains the target labels (what we want to predict from the genotype data)
Important
The label file ID column must be called “ID” (uppercase).
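A quick pre-flight check can catch a misnamed ID column before training starts. The sketch below uses only the standard library; the helper name `has_id_column` is hypothetical (it is not part of EIR), and it assumes the label file is comma-separated with a header row.

```python
import csv


def has_id_column(label_file: str) -> bool:
    """Return True if the CSV header contains a column named exactly 'ID'."""
    with open(label_file, newline="") as handle:
        # Only the first row (the header) is needed for this check.
        header = next(csv.reader(handle), [])
    return "ID" in header
```

For example, `has_id_column("processed_sample_data/human_origins_labels.csv")` would be expected to return `True` for the sample data, while a file whose column is named `id` or `Id` would fail the check.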
For this tutorial, we are going to use the data above to train models to predict ancestry, of which there are 6 classes (Asia, Eastern Asia, Europe, Latin America and the Caribbean, Middle East and Sub-Saharan Africa). Before diving into the model training, we first have to configure our experiments.
To configure the experiments we want to run, we will use .yaml configurations. Running eirtrain --help, we can see the configurations needed:
usage: eirtrain [-h] --global_configs GLOBAL_CONFIGS [GLOBAL_CONFIGS ...]
                [--input_configs [INPUT_CONFIGS ...]]
                [--fusion_configs [FUSION_CONFIGS ...]] --output_configs
                OUTPUT_CONFIGS [OUTPUT_CONFIGS ...]

options:
  -h, --help            show this help message and exit
  --global_configs GLOBAL_CONFIGS [GLOBAL_CONFIGS ...]
                        Global .yaml configurations for the experiment.
  --input_configs [INPUT_CONFIGS ...]
                        Input feature extraction .yaml configurations. Each
                        configuration represents one input.
  --fusion_configs [FUSION_CONFIGS ...]
                        Fusion .yaml configurations.
  --output_configs OUTPUT_CONFIGS [OUTPUT_CONFIGS ...]
                        Output .yaml configurations.
Above we can see that there are four types of configurations we can use: global, inputs, fusion and outputs. To see more details about what should be in these configuration files, we can check the Configuration API reference.
Note
Instead of having to type out the configuration files below manually, you can download them from the docs/tutorials/tutorial_files/01_basic_tutorial directory in the project repository.
While the global configuration has a lot of options, the only ones we really need to fill in now are output_folder and the evaluation interval (sample_interval, in batch iterations), so we have the following tutorial_01_globals.yaml file:
output_folder: eir_tutorials/tutorial_runs/a_using_eir/tutorial_01_run
checkpoint_interval: 200
sample_interval: 200
We also need to tell the framework where to load inputs from, along with some information about the input. For that we use an input .yaml configuration called tutorial_01_input.yaml:
input_info:
  input_source: eir_tutorials/a_using_eir/01_basic_tutorial/data/processed_sample_data/arrays
  input_name: genotype
  input_type: omics
input_type_info:
  snp_file: eir_tutorials/a_using_eir/01_basic_tutorial/data/processed_sample_data/data_final_gen.bim
model_config:
  model_type: genome-local-net
Above we can see that the input needs 3 fields: input_info, input_type_info and model_config. The input_info contains basic information about the input. The input_type_info contains information specific to the input type (in this case omics). Finally, the model_config contains configuration for the model that should be trained with the input data. For more information about the configurations, e.g. which parameters are relevant for the chosen models and what they do, head over to the Configuration API reference.
Finally, we need to specify what outputs to predict during training. For that we will use the tutorial_01_outputs.yaml file with the following content:
output_info:
  output_name: ancestry_output
  output_source: eir_tutorials/a_using_eir/01_basic_tutorial/data/processed_sample_data/human_origins_labels.csv
  output_type: tabular
output_type_info:
  target_cat_columns:
    - Origin
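If you would rather script the creation of the three configuration files than type them out, something like the following sketch could be used. It only writes the file contents shown above as plain text; the helper name `write_tutorial_configs` and the `CONFIGS` mapping are hypothetical, and the two-space nesting reflects the assumption that each top-level field groups the keys listed under it.

```python
from pathlib import Path
from textwrap import dedent

DATA_DIR = "eir_tutorials/a_using_eir/01_basic_tutorial/data/processed_sample_data"

# File name -> YAML body, mirroring the three tutorial configurations.
CONFIGS = {
    "tutorial_01_globals.yaml": """
        output_folder: eir_tutorials/tutorial_runs/a_using_eir/tutorial_01_run
        checkpoint_interval: 200
        sample_interval: 200
    """,
    "tutorial_01_input.yaml": f"""
        input_info:
          input_source: {DATA_DIR}/arrays
          input_name: genotype
          input_type: omics
        input_type_info:
          snp_file: {DATA_DIR}/data_final_gen.bim
        model_config:
          model_type: genome-local-net
    """,
    "tutorial_01_outputs.yaml": f"""
        output_info:
          output_name: ancestry_output
          output_source: {DATA_DIR}/human_origins_labels.csv
          output_type: tabular
        output_type_info:
          target_cat_columns:
            - Origin
    """,
}


def write_tutorial_configs(conf_dir: str) -> list[Path]:
    """Write the three tutorial configs into conf_dir and return their paths."""
    target = Path(conf_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, body in CONFIGS.items():
        path = target / filename
        # dedent removes the Python-side indentation; lstrip drops the leading newline.
        path.write_text(dedent(body).lstrip())
        written.append(path)
    return written
```

Calling `write_tutorial_configs("eir_tutorials/a_using_eir/01_basic_tutorial/conf")` would place the files where the training commands in this tutorial expect them.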
Note
You might notice that we have not written any fusion config so far. The fusion configuration controls how different modalities (i.e. input data types, for example genotype and clinical data) are combined using a neural network. While we can indeed configure the fusion, we will use the defaults for now. The default fusion model is a fully connected neural network.
With all this, we should have our project directory looking something like this:
eir_tutorials/a_using_eir/01_basic_tutorial/
├── conf
│   ├── large_scale_fusion.yaml
│   ├── large_scale_globals.yaml
│   ├── large_scale_input_gln.yaml
│   ├── large_scale_input_tabular.yaml
│   ├── large_scale_output.yaml
│   ├── tutorial_01_globals.yaml
│   ├── tutorial_01_input.yaml
│   ├── tutorial_01_outputs.yaml
│   └── tutorial_01_outputs_unknown.yaml
└── data
    ├── processed_sample_data
    │   ├── arrays
    │   ├── data_final_gen.bim
    │   └── human_origins_labels.csv
    └── processed_sample_data.zip
B - Training
Training a GLN model
Now that we have our configurations set up, training is simply a matter of passing them to the framework, like so:
eirtrain \
--global_configs eir_tutorials/a_using_eir/01_basic_tutorial/conf/tutorial_01_globals.yaml \
--input_configs eir_tutorials/a_using_eir/01_basic_tutorial/conf/tutorial_01_input.yaml \
--output_configs eir_tutorials/a_using_eir/01_basic_tutorial/conf/tutorial_01_outputs.yaml
This will generate a folder in the current directory called eir_tutorials, and eir_tutorials/tutorial_runs/a_using_eir/tutorial_01_run (note that this run folder comes from the output_folder value we set in the global config before) will contain the results from our experiment.
Tip
You might try running the command above again after it partially/completely finishes, and most likely you will encounter a FileExistsError. This is to avoid accidentally overwriting previous experiments. When performing another run, we will have to delete/rename the existing experiment folder, or change output_folder in the configuration (see below).
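One way to sidestep the FileExistsError when scripting repeated runs is to pick an unused folder name up front. The following is a minimal sketch of that idea; `next_free_run_folder` is a hypothetical helper, not something EIR provides, and the `_2`, `_3`, ... suffix scheme is just one possible convention.

```python
from pathlib import Path


def next_free_run_folder(base: str) -> str:
    """Return base if it does not exist yet, else the first unused base_2, base_3, ..."""
    if not Path(base).exists():
        return base
    suffix = 2
    while Path(f"{base}_{suffix}").exists():
        suffix += 1
    return f"{base}_{suffix}"
```

The returned path could then be used as the output_folder value for the next run.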
Examining the directory, we see the following structure (some files have been excluded here for brevity):
eir_tutorials/tutorial_runs/a_using_eir/tutorial_01_run/
├── configs
├── meta
│   └── eir_version.txt
├── model_info.txt
├── results
│   └── ancestry_output
│       └── Origin
│           ├── samples
│           │   ├── 200
│           │   │   ├── confusion_matrix.pdf
│           │   │   ├── mc_pr_curve.pdf
│           │   │   ├── mc_roc_curve.pdf
│           │   │   └── predictions.csv
│           │   ├── 400
│           │   │   ├── confusion_matrix.pdf
│           │   │   ├── mc_pr_curve.pdf
│           │   │   ├── mc_roc_curve.pdf
│           │   │   └── predictions.csv
│           │   └── 600
│           │       ├── confusion_matrix.pdf
│           │       ├── mc_pr_curve.pdf
│           │       ├── mc_roc_curve.pdf
│           │       └── predictions.csv
│           ├── training_curve_ACC.pdf
│           ├── training_curve_AP-MACRO.pdf
│           ├── training_curve_LOSS.pdf
│           ├── training_curve_MCC.pdf
│           └── training_curve_ROC-AUC-MACRO.pdf
├── saved_models
├── test_predictions
│   ├── known_outputs
│   │   ├── ancestry_output
│   │   │   └── Origin
│   │   │       ├── confusion_matrix.pdf
│   │   │       ├── mc_pr_curve.pdf
│   │   │       ├── mc_roc_curve.pdf
│   │   │       └── predictions.csv
│   │   └── calculated_metrics.json
│   └── unknown_outputs
│       └── ancestry_output
│           └── Origin
│               └── predictions.csv
├── training_curve_LOSS-AVERAGE.pdf
└── training_curve_PERF-AVERAGE.pdf
In the results folder for a given output, the [200, 400, 600] folders contain our validation results, produced according to the sample_interval configuration in the global config.
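To make the link between sample_interval and the folder names concrete, here is a small sketch. It assumes that validation runs whenever the iteration count is a multiple of sample_interval, and the total of 600 iterations is inferred from the last folder name rather than stated by the framework; `evaluation_iterations` is a hypothetical helper.

```python
def evaluation_iterations(total_iterations: int, sample_interval: int) -> list[int]:
    """List the iterations at which validation results would be written."""
    return list(range(sample_interval, total_iterations + 1, sample_interval))


# With sample_interval: 200 and a run that lasts 600 iterations:
print(evaluation_iterations(600, 200))  # [200, 400, 600]
```

Under the same assumption, a smaller sample_interval gives more frequent validation results at the cost of extra evaluation time.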
We can examine how our model did with respect to accuracy (let’s assume our targets are fairly balanced in this case) by checking the training_curve_ACC.pdf file:
To examine the actual predictions and how they matched the target labels, we can look at the confusion matrix in one of the evaluation folders of results/ancestry_output/Origin/samples. When I ran this, I got the following at iteration 600:
In the training curve above, we can see that our model barely got going before the run finished! Let’s try another experiment. We can change the output_folder value in 01_basic_tutorial/conf/tutorial_01_globals.yaml, but the framework also supports rudimentary injection of values from the command line. Let’s try that, setting a new run name, increasing the number of epochs and changing the learning rate:
eirtrain \
--global_configs eir_tutorials/a_using_eir/01_basic_tutorial/conf/tutorial_01_globals.yaml \
--input_configs eir_tutorials/a_using_eir/01_basic_tutorial/conf/tutorial_01_input.yaml \
--output_configs eir_tutorials/a_using_eir/01_basic_tutorial/conf/tutorial_01_outputs.yaml \
--tutorial_01_globals.output_folder=eir_tutorials/tutorial_runs/a_using_eir/tutorial_01_run_lr-0.002_epochs-20 \
--tutorial_01_globals.lr=0.002 \
--tutorial_01_globals.n_epochs=20
Note
The injected values are addressed by configuration filename (without the .yaml extension), followed by the key to override, e.g. --tutorial_01_globals.lr=0.002.
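When launching several runs from a script, the injection flags can be assembled programmatically. The sketch below derives the prefix from the configuration file name, matching the pattern seen in the command above; `injection_flags` is a hypothetical helper, and it only builds strings without calling eirtrain.

```python
from pathlib import Path


def injection_flags(config_path: str, overrides: dict) -> list[str]:
    """Build --<config file stem>.<key>=<value> flags for one config file."""
    stem = Path(config_path).stem  # e.g. "tutorial_01_globals"
    return [f"--{stem}.{key}={value}" for key, value in overrides.items()]


flags = injection_flags(
    "eir_tutorials/a_using_eir/01_basic_tutorial/conf/tutorial_01_globals.yaml",
    {"lr": 0.002, "n_epochs": 20},
)
print(" ".join(flags))
```

The resulting list can be appended to the eirtrain argument list, for instance when passing the command to `subprocess.run`.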
Looking at the training curve from that run, we can see we did a bit better:
We also notice that there is a gap between the training and evaluation performances, indicating that the model is starting to overfit on the training data. There are a bunch of regularisation settings we could try, such as increasing dropout in the input, fusion and output modules. Check the Configuration API reference for a full overview.
C - Predicting on external samples
Predicting on samples with known labels
To predict on external samples, we run eirpredict. As we can see when running eirpredict --help, it looks quite similar to eirtrain:
usage: eirpredict [-h] [--global_configs [GLOBAL_CONFIGS ...]]
                  [--input_configs [INPUT_CONFIGS ...]]
                  [--fusion_configs [FUSION_CONFIGS ...]]
                  [--output_configs [OUTPUT_CONFIGS ...]] --model_path
                  MODEL_PATH [--evaluate] --output_folder OUTPUT_FOLDER
                  [--attribution_background_source {train,predict}]

options:
  -h, --help            show this help message and exit
  --global_configs [GLOBAL_CONFIGS ...]
                        Global .yaml configurations for the experiment.
  --input_configs [INPUT_CONFIGS ...]
                        Input feature extraction .yaml configurations. Each
                        configuration represents one input.
  --fusion_configs [FUSION_CONFIGS ...]
                        Fusion .yaml configurations.
  --output_configs [OUTPUT_CONFIGS ...]
                        Output .yaml configurations.
  --model_path MODEL_PATH
                        Path to model to use for predictions.
  --evaluate
  --output_folder OUTPUT_FOLDER
                        Where to save prediction results.
  --attribution_background_source {train,predict}
                        For attribution analysis, whether to load backgrounds
                        from the data used for training or to use the current
                        data passed to the predict module.
Generally we do not change much of the configs when predicting, with the exception of the input configs (mainly setting input_source, i.e. where to load our samples to predict/test on from) and perhaps the global config (e.g. we might not compute attributions during training, but compute them on our test set by activating compute_attributions in the global config when predicting). Specific to eirpredict, we have to choose a saved model (--model_path), whether we want to evaluate the performance on the test set (--evaluate, which means that the respective labels must be present in the --output_configs) and where to save the prediction results (--output_folder).
For the sake of this tutorial, we take one of the saved models from our previous training run and use it for inference with the eirpredict module. Here, we will simply use it to predict on the same data as before.
Warning
We are only predicting on the same data we trained on in this tutorial to show how to use the eirpredict module. Always take care to separate the data you use for training from the data you use to evaluate the generalization performance of your models!
To test, we can run the following command, making sure to add the correct path of your saved model to the --model_path argument:
eirpredict \
--global_configs eir_tutorials/a_using_eir/01_basic_tutorial/conf/tutorial_01_globals.yaml \
--input_configs eir_tutorials/a_using_eir/01_basic_tutorial/conf/tutorial_01_input.yaml \
--output_configs eir_tutorials/a_using_eir/01_basic_tutorial/conf/tutorial_01_outputs.yaml \
--model_path eir_tutorials/tutorial_runs/a_using_eir/tutorial_01_run/saved_models/tutorial_01_run_model_600_perf-average=0.8764.pt \
--evaluate \
--output_folder eir_tutorials/tutorial_runs/a_using_eir/tutorial_01_run/test_predictions/known_outputs
This will generate a file called calculated_metrics.json in the supplied output_folder, as well as a folder for each output (in this case called ancestry_output) containing the actual predictions and plots. Of course the metrics are quite nonsensical here, as we are predicting on the same data we trained on.
One of the generated files contains the actual predictions, found in the predictions.csv file:
ID | True Label Untransformed | True Label | Asia | Eastern_Asia | Europe | Latin_America_and_the_Caribbean | Middle_East | Sub-Saharan_Africa |
---|---|---|---|---|---|---|---|---|
MAL-005 | Sub-Saharan_Africa | 5 | -1.61 | -1.88 | -2.82 | 0.81 | -0.37 | 4.79 |
MAL-009 | Sub-Saharan_Africa | 5 | -1.92 | -2.09 | -2.75 | 0.75 | 0.03 | 4.65 |
MAL-011 | Sub-Saharan_Africa | 5 | -1.91 | -2.12 | -2.75 | 1.32 | -0.56 | 4.68 |
MAL-012 | Sub-Saharan_Africa | 5 | -1.53 | -1.95 | -2.84 | 0.56 | -0.20 | 4.67 |
MAL-014 | Sub-Saharan_Africa | 5 | -1.56 | -2.02 | -2.85 | 0.73 | -0.27 | 4.80 |
The True Label Untransformed column contains the actual labels, as they were in the raw data. The True Label column contains the labels after they have been numerically encoded / normalized in EIR. The other columns represent the raw network outputs for each of the classes.
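Since the class columns hold raw network outputs rather than probabilities, a softmax can be applied to make them easier to read. The sketch below does this for the MAL-005 row from the table above using only the standard library; it assumes a standard softmax over the six class outputs, which is the usual choice for multi-class classification but is not confirmed here as EIR's exact post-processing.

```python
import math

CLASSES = [
    "Asia",
    "Eastern_Asia",
    "Europe",
    "Latin_America_and_the_Caribbean",
    "Middle_East",
    "Sub-Saharan_Africa",
]


def softmax(logits: list[float]) -> list[float]:
    """Convert raw outputs to probabilities (shifted by the max for stability)."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Raw outputs for sample MAL-005, in the column order of predictions.csv.
row = [-1.61, -1.88, -2.82, 0.81, -0.37, 4.79]
probabilities = softmax(row)
predicted = CLASSES[probabilities.index(max(probabilities))]
print(predicted)  # Sub-Saharan_Africa
```

The predicted class sits at index 5 of the column order, which agrees with the True Label value 5 for this row if the numerical encoding follows that order (an assumption).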
Predicting on samples with unknown labels
Notice that when running the command above, we knew the labels of the samples we were predicting on. In practice, we often predict on samples whose labels we do not know. In this case, we can again use eirpredict with slightly modified arguments:
eirpredict \
--global_configs eir_tutorials/a_using_eir/01_basic_tutorial/conf/tutorial_01_globals.yaml \
--input_configs eir_tutorials/a_using_eir/01_basic_tutorial/conf/tutorial_01_input.yaml \
--output_configs eir_tutorials/a_using_eir/01_basic_tutorial/conf/tutorial_01_outputs_unknown.yaml \
--model_path eir_tutorials/tutorial_runs/a_using_eir/tutorial_01_run/saved_models/tutorial_01_run_model_600_perf-average=0.8764.pt \
--output_folder eir_tutorials/tutorial_runs/a_using_eir/tutorial_01_run/test_predictions/unknown_outputs
We can notice a couple of changes here compared to the previous command:
- We have removed the --evaluate flag, as we do not have the labels for the samples we are predicting on.
- We have a different output configuration file, tutorial_01_outputs_unknown.yaml.
- We have a different output folder, unknown_outputs.
If we take a look at the tutorial_01_outputs_unknown.yaml file, we can see that it contains the following:
output_info:
  output_name: ancestry_output
  output_source: null
  output_type: tabular
output_type_info:
  target_cat_columns:
    - Origin
Notice that everything is the same as before, but for output_source we have null instead of the .csv label file we had before.
Taking a look at the produced predictions.csv file, we can see that we only have the actual predictions, and no true labels:
ID | Asia | Eastern_Asia | Europe | Latin_America_and_the_Caribbean | Middle_East | Sub-Saharan_Africa |
---|---|---|---|---|---|---|
HGDP01358 | -0.48 | -2.05 | 4.80 | -0.62 | -0.21 | -2.98 |
HG04182.SG | 4.23 | 0.50 | -1.97 | -0.91 | -2.86 | -1.24 |
HG02870.SG | -1.68 | -2.06 | -2.74 | 1.93 | -1.36 | 4.48 |
HGDP00720 | 2.44 | 3.63 | -0.12 | -2.71 | -2.34 | -2.55 |
S_Kinh-1.DG | 3.01 | 3.51 | -1.06 | -2.78 | -2.39 | -2.51 |
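Without true labels, the predicted class per sample has to be read off as the column with the highest raw output. A minimal sketch of doing this with the standard library is shown below; it assumes predictions.csv is comma-separated with the header shown above, and the embedded `SAMPLE` text simply copies three rows from the table. `predicted_classes` is a hypothetical helper.

```python
import csv
import io

SAMPLE = """\
ID,Asia,Eastern_Asia,Europe,Latin_America_and_the_Caribbean,Middle_East,Sub-Saharan_Africa
HGDP01358,-0.48,-2.05,4.80,-0.62,-0.21,-2.98
HG04182.SG,4.23,0.50,-1.97,-0.91,-2.86,-1.24
HGDP00720,2.44,3.63,-0.12,-2.71,-2.34,-2.55
"""


def predicted_classes(csv_handle) -> dict[str, str]:
    """Map each sample ID to the class column with the highest raw output."""
    predictions = {}
    for row in csv.DictReader(csv_handle):
        sample_id = row.pop("ID")
        # After removing ID, every remaining column is a class score.
        predictions[sample_id] = max(row, key=lambda name: float(row[name]))
    return predictions


result = predicted_classes(io.StringIO(SAMPLE))
print(result)
```

For a real run, the same function could be given `open(".../unknown_outputs/ancestry_output/Origin/predictions.csv", newline="")` instead of the in-memory sample.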
D - Applying to your own data (e.g. UK Biobank)
Thank you for reading this far! Hopefully this tutorial introduced you well enough to the framework so you can apply it to your own data. For that, you will have to process it first (see: plink pipelines). Then you will have to set the relevant paths for the inputs (e.g. input_source, snp_file) and outputs (e.g. output_source, target_cat_columns or target_con_columns if you have continuous targets).
Important
If you are interested in quickly training deep learning models for genomic prediction, the EIR-auto-GP project might be of use to you.
When moving to large scale data such as the UK Biobank, the configurations we used on the ancestry toy data in this tutorial will likely not be sufficient. For example, the learning rate is likely too high. For this, here are some baseline configurations that we have found to work well as a starting point:
# large_scale_globals.yaml
output_folder: "FILL"
sample_interval: 500
checkpoint_interval: 500
batch_size: "FILL"
lr: 0.0002
lr_plateau_patience: 5
gradient_clipping: 1.0
valid_size: "FILL"
n_epochs: 50
dataloader_workers: "FILL"
device: "FILL"
early_stopping_buffer: 2000
early_stopping_patience: 10
mixing_alpha: 0.2
optimizer: "adabelief"
weighted_sampling_columns: # for categorical targets, remove if only doing regression
  - "all"
log_level: "debug"

# large_scale_input_gln.yaml
input_info:
  input_source: "FILL"
  input_name: "genotype"
  input_type: "omics"
input_type_info:
  mixing_subtype: "cutmix-block"
  na_augment_alpha: 1.0
  na_augment_beta: 2.0
  snp_file: "FILL" # can delete if not computing attributions
model_config:
  model_type: "genome-local-net"
  model_init_config:
    rb_do: 0.1
    channel_exp_base: 2
    kernel_width: 16
    first_kernel_expansion: -4
    l1: 0.0
    cutoff: 4096

# large_scale_input_tabular.yaml
input_info:
  input_source: "FILL"
  input_name: "tabular_input"
  input_type: "tabular"
input_type_info:
  input_cat_columns:
    - "FILL"
  input_con_columns:
    - "FILL"
model_config:
  model_type: "tabular"
  model_init_config:
    fc_layer: true

# large_scale_fusion.yaml
model_config:
  fc_do: 0.1
  fc_task_dim: 512
  layers:
    - 2
  rb_do: 0.1
  stochastic_depth_p: 0.1
model_type: "default"

# large_scale_output.yaml
output_info:
  output_name: "FILL"
  output_source: "FILL"
  output_type: "tabular"
output_type_info:
  target_con_columns:
    - "FILL"
  target_cat_columns:
    - "FILL"
model_config:
  model_type: "mlp_residual"
  model_init_config:
    rb_do: 0.2
    fc_do: 0.2
    fc_task_dim: 512
    layers:
      - 2
    stochastic_depth_p: 0.2
    final_layer_type: "linear"
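Before launching a large run, it can help to confirm that no "FILL" placeholder was left behind in these templates. The following sketch scans a configuration file's text line by line; `unfilled_placeholders` is a hypothetical helper (not part of EIR) and it uses a plain string search rather than a YAML parser, so it is only a rough check.

```python
def unfilled_placeholders(config_text: str) -> list[int]:
    """Return 1-based line numbers whose value is still the "FILL" placeholder."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        content = line.split("#", 1)[0]  # ignore trailing YAML comments
        if '"FILL"' in content:
            hits.append(number)
    return hits


example = 'output_folder: "FILL"\nsample_interval: 500\nbatch_size: "FILL"\n'
print(unfilled_placeholders(example))  # [1, 3]
```

An empty list means every placeholder in that file has been replaced.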
E - Serving
In this final section, we demonstrate serving our trained model as a web service and interacting with it using HTTP requests.
Starting the Web Service
To serve the model, use the following command:
eirserve --model-path [MODEL_PATH]
Replace [MODEL_PATH] with the actual path to your trained model. This command initiates a web service that listens for incoming requests.
Here is an example of the command:
eirserve \
--model-path eir_tutorials/tutorial_runs/a_using_eir/tutorial_01_run/saved_models/tutorial_01_run_model_600_perf-average=0.8764.pt
Sending Requests
With the server running, we can now send requests. The requests are prepared by loading numpy array data, converting it to base64 encoded strings, and then constructing a JSON payload.
Here’s an example Python function demonstrating this process:
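import base64

import numpy as np
import requests


def encode_numpy_array(file_path: str) -> str:
    # Load the array and base64-encode its raw bytes so it can travel inside JSON.
    array = np.load(file_path)
    encoded = base64.b64encode(array.tobytes()).decode("utf-8")
    return encoded


def send_request(url: str, payload: dict) -> dict:
    # POST the JSON payload to the server and return the decoded JSON response.
    response = requests.post(url, json=payload)
    return response.json()


# The payload key ("genotype") matches the input_name from the input configuration.
encoded_data = encode_numpy_array("path_to_your_numpy_array.npy")
response = send_request("http://localhost:8000/predict", {"genotype": encoded_data})
print(response)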
Analyzing Responses
Here are some examples of responses from the server for a set of inputs:
[
    {
        "request": {
            "genotype": "eir_tutorials/a_using_eir/01_basic_tutorial/data/processed_sample_data/arrays/A374.npy"
        },
        "response": {
            "result": {
                "ancestry_output": {
                    "Origin": {
                        "Asia": 0.010410779155790806,
                        "Eastern_Asia": 0.0011356589384377003,
                        "Europe": 0.854654848575592,
                        "Latin_America_and_the_Caribbean": 0.008827924728393555,
                        "Middle_East": 0.1237422451376915,
                        "Sub-Saharan_Africa": 0.00122847652528435
                    }
                }
            }
        }
    },
    {
        "request": {
            "genotype": "eir_tutorials/a_using_eir/01_basic_tutorial/data/processed_sample_data/arrays/Ayodo_468C.npy"
        },
        "response": {
            "result": {
                "ancestry_output": {
                    "Origin": {
                        "Asia": 0.0017986423335969448,
                        "Eastern_Asia": 0.0030721763614565134,
                        "Europe": 0.0034481489565223455,
                        "Latin_America_and_the_Caribbean": 0.026503251865506172,
                        "Middle_East": 0.1034306138753891,
                        "Sub-Saharan_Africa": 0.861747145652771
                    }
                }
            }
        }
    },
    {
        "request": {
            "genotype": "eir_tutorials/a_using_eir/01_basic_tutorial/data/processed_sample_data/arrays/NOR146.npy"
        },
        "response": {
            "result": {
                "ancestry_output": {
                    "Origin": {
                        "Asia": 0.008015047758817673,
                        "Eastern_Asia": 0.0006639149505645037,
                        "Europe": 0.9414306282997131,
                        "Latin_America_and_the_Caribbean": 0.03938961401581764,
                        "Middle_East": 0.009529098868370056,
                        "Sub-Saharan_Africa": 0.0009716283529996872
                    }
                }
            }
        }
    }
]
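To turn such a response into a single predicted label, one can pick the class with the highest value. The sketch below does this for the first response shown above (sample A374); `top_prediction` is a hypothetical helper, and it assumes the response layout of result, then output name, then target column, exactly as in these examples.

```python
def top_prediction(response: dict, output_name: str, target: str) -> tuple[str, float]:
    """Return the highest-scoring class and its value from one server response."""
    scores = response["result"][output_name][target]
    best = max(scores, key=scores.get)
    return best, scores[best]


# The "response" part of the first example above.
example_response = {
    "result": {
        "ancestry_output": {
            "Origin": {
                "Asia": 0.010410779155790806,
                "Eastern_Asia": 0.0011356589384377003,
                "Europe": 0.854654848575592,
                "Latin_America_and_the_Caribbean": 0.008827924728393555,
                "Middle_East": 0.1237422451376915,
                "Sub-Saharan_Africa": 0.00122847652528435,
            }
        }
    }
}

label, score = top_prediction(example_response, "ancestry_output", "Origin")
print(label, score)
```

Note that, unlike the raw outputs in predictions.csv, the values in these example responses sum to approximately 1 per sample, so they can be read as class probabilities.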