.. _01-genotype-tutorial:

Genotype Tutorial: Ancestry Prediction
======================================

A - Setup
^^^^^^^^^

In this tutorial, we will be using genotype data to train deep learning models for ancestry prediction.

.. note::
    This tutorial goes into some detail about how ``EIR`` works, and how to use it. If you are more interested in quickly training deep learning models for genomic prediction, the `EIR-auto-GP`_ project might be of use to you.

.. _EIR-auto-GP: https://github.com/arnor-sigurdsson/EIR-auto-GP

To start, please download `processed sample data`_ (or process your own ``.bed``, ``.bim``, ``.fam`` files with e.g. `plink pipelines`_). The sample data we are using here for predicting ancestry is the public `Human Origins`_ dataset, but the same approach can just as well be used for e.g. disease predictions in other cohorts (for example the `UK Biobank`_).

.. _processed sample data: https://drive.google.com/file/d/1MELauhv7zFwxM8nonnj3iu_SmS69MuNi
.. _plink pipelines: https://github.com/arnor-sigurdsson/plink_pipelines
.. _Human Origins: https://www.nature.com/articles/nature13673
.. _UK Biobank: https://www.nature.com/articles/s41586-018-0579-z

Examining the sample data, we can see the following structure:

.. code-block:: console

    processed_sample_data
    ├── arrays                      # Genotype data as NumPy arrays
    ├── data_final_gen.bim          # Variant information file accompanying the genotype arrays
    └── human_origins_labels.csv    # Contains the target labels (what we want to predict from the genotype data)

.. important::
    The label file ID column must be called "ID" (uppercase).

For this tutorial, we are going to use the data above to train models to predict ancestry, of which there are 6 classes (Asia, Eastern Asia, Europe, Latin America and the Caribbean, Middle East and Sub-Saharan Africa).

Before diving into the model training, we first have to configure our experiments.

To configure the experiments we want to run, we will use ``.yaml`` configurations.
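
Before writing any configuration, it can be worth sanity-checking the label file described above. The following is a minimal sketch and not part of ``EIR``: ``summarize_labels`` is a hypothetical helper that verifies the required ``ID`` column is present and counts the samples per class. The target column name ``Origin`` and the example rows are assumptions made for illustration (the name is inferred from the ``results/Origin`` folder referenced later in this tutorial), so adjust them to match your own file.

.. code-block:: python

    import csv
    import io
    from collections import Counter


    def summarize_labels(label_file, target_column):
        """Check for the required "ID" column and count samples per class."""
        reader = csv.DictReader(label_file)
        fieldnames = reader.fieldnames or []
        if "ID" not in fieldnames:
            raise ValueError(f'Expected an "ID" column, found: {fieldnames}')
        if target_column not in fieldnames:
            raise ValueError(f"Target column {target_column!r} not in: {fieldnames}")
        return Counter(row[target_column] for row in reader)


    # Tiny in-memory stand-in for human_origins_labels.csv (made-up rows).
    example_csv = io.StringIO(
        "ID,Origin\n"
        "sample_1,Europe\n"
        "sample_2,Asia\n"
        "sample_3,Europe\n"
    )
    class_counts = summarize_labels(example_csv, target_column="Origin")
    print(class_counts.most_common())

For the real label file, you would pass an open file handle instead of the in-memory example. Roughly even counts across the classes would support reading accuracy at face value when evaluating the models.
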
Running ``eirtrain --help``, we can see the configurations needed:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/eirtrain_help.txt
    :language: console
    :lines: 2-

Above we can see that there are four types of configurations we can use: *global*, *inputs*, *fusion* and *outputs*. To see more details about what should be in these configuration files, we can check the :ref:`api-reference` reference.

.. note::
    Instead of having to type out the configuration files below manually, you can download them from the ``docs/tutorials/tutorial_files/01_basic_tutorial`` directory in the project repository.

While the **global** configuration has a lot of options, the only ones we really need to fill in now are ``output_folder`` and the evaluation interval (in batch iterations), so we have the following ``tutorial_01_globals.yaml`` file:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/tutorial_01_globals.yaml
    :language: yaml
    :caption: tutorial_01_globals.yaml

We also need to tell the framework where to load **inputs** from, along with some information about the input. For that we use an input ``.yaml`` configuration called ``tutorial_01_input.yaml``:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/tutorial_01_input.yaml
    :language: yaml
    :caption: tutorial_01_input.yaml

Above we can see that the input needs 3 fields: ``input_info``, ``input_type_info`` and ``model_config``. The ``input_info`` contains basic information about the input. The ``input_type_info`` contains information specific to the input type (in this case ``omics``). Finally, the ``model_config`` contains configuration for the model that should be trained with the input data. For more information about the configurations, e.g. which parameters are relevant for the chosen models and what they do, head over to the :ref:`api-reference` reference.

Finally, we need to specify what **outputs** to predict during training.
For that we will use the ``tutorial_01_outputs.yaml`` file with the following content:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/tutorial_01_outputs.yaml
    :language: yaml
    :caption: tutorial_01_outputs.yaml

.. note::
    You might notice that we have not written any fusion config so far. The fusion configuration controls how different modalities (i.e. input data types, for example genotype and clinical data) are combined using a neural network. While we indeed *can* configure the fusion, we will use the defaults for now. The default fusion model is a fully connected neural network.

With all this, we should have our project directory looking something like this:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/tutorial_folder.txt
    :language: console

B - Training
^^^^^^^^^^^^

Training a GLN model
""""""""""""""""""""

Now that we have our configurations set up, training is simply a matter of passing them to the framework, like so:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/GLN_1.txt
    :language: console

This will generate a folder in the current directory called ``eir_tutorials``, and ``eir_tutorials/tutorial_runs/a_using_eir/tutorial_01_run`` (note that the inner run name comes from the value in ``global_config`` we set before) will contain the results from our experiment.

.. tip::
    You might try running the command above again after it partially/completely finishes, and most likely you will encounter a ``FileExistsError``. This is to avoid accidentally overwriting previous experiments. When performing another run, we will have to delete/rename the experiment folder, or change the ``output_folder`` in the configuration (see below).

Examining the directory, we see the following structure (some files have been excluded here for brevity):

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/experiment_01_folder.txt
    :language: console

In the results folder for a given output, the [200, 400, 600] folders contain our validation results according to our ``sample_interval`` configuration in the global config.

We can examine how our model did with respect to accuracy (let's assume our targets are fairly balanced in this case) by checking the ``training_curve_ACC.png`` file:

.. image:: ../tutorial_files/a_using_eir/01_basic_tutorial/figures/tutorial_01_training_curve_ACC_gln_1.png

To examine the actual predictions and how they matched the target labels, we can look at the confusion matrix in one of the evaluation folders of ``results/Origin/samples``. When I ran this, I got the following at iteration 600:

.. image:: ../tutorial_files/a_using_eir/01_basic_tutorial/figures/tutorial_01_confusion_matrix_gln_1.png

In the training curve above, we can see that our model barely got going before the run finished! Let's try another experiment. We can change the ``output_folder`` value in ``01_basic_tutorial/tutorial_01_globals.yaml``, but the framework also supports rudimentary injection of values from the command line. Let's try that, setting a new run name, increasing the number of epochs and changing the learning rate:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/GLN_2.txt
    :language: console

.. note::
    Injected values are addressed by the name of the configuration file they belong to.

Looking at the training curve from that run, we can see we did a bit better:

.. image:: ../tutorial_files/a_using_eir/01_basic_tutorial/figures/tutorial_01_training_curve_ACC_gln_2.png

We also notice that there is a gap between the training and evaluation performances, indicating that the model is starting to overfit on the training data. There are a bunch of regularisation settings we could try, such as increasing dropout in the input, fusion and output modules.
Check the :ref:`api-reference` reference for a full overview.

C - Predicting on external samples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Predicting on samples with known labels
"""""""""""""""""""""""""""""""""""""""

To predict on external samples, we run ``eirpredict``. As we can see when running ``eirpredict --help``, it looks quite similar to ``eirtrain``:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/eirpredict_help.txt
    :language: console
    :lines: 2-

Generally we do not change much of the configs when predicting, with the exception of the input configs (and then mainly setting the ``input_source``, i.e. where to load the samples we want to predict/test on) and perhaps the global config (e.g. we might not compute attributions during training, but compute them on our test set by activating ``compute_attributions`` in the global config when predicting). Specific to ``eirpredict``, we have to choose a saved model (``--model_path``), whether we want to evaluate the performance on the test set (``--evaluate``; this means that the respective labels must be present in the ``--output_configs``) and where to save the prediction results (``--output_folder``).

For the sake of this tutorial, we take one of the saved models from our previous training run and use it for inference with the ``eirpredict`` module. Here, we will simply use it to predict on the same data as before.

.. warning::
    We are only predicting on the same data we trained on in this tutorial to show how to use the ``eirpredict`` module. Always take care to separate the data you use for training from the data you use to evaluate the generalization performance of your models!

To test, run the command below, making sure you add the correct path of a saved model to the ``--model_path`` argument.

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/GLN_1_PREDICT.txt
    :language: console

This will generate a file called ``calculated_metrics.json`` in the supplied ``output_folder`` as well as a folder for each output (in this case called ``ancestry_output``) containing the actual predictions and plots. Of course the metrics are quite nonsensical here, as we are predicting on the same data we trained on.

One of the generated files contains the actual predictions, found in the ``predictions.csv`` file:

.. raw:: html
    :file: ../tutorial_files/a_using_eir/01_basic_tutorial/csv_preview.html

The ``True Label Untransformed`` column contains the actual labels, as they were in the raw data. The ``True Label`` column contains the labels after they have been numerically encoded / normalized in ``EIR``. The other columns represent the raw network outputs for each of the classes.

Predicting on samples with unknown labels
"""""""""""""""""""""""""""""""""""""""""

Notice that when running the command above, we knew the labels of the samples we were predicting on. In practice, we are often predicting on samples whose labels we do not know. In this case, we can again use ``eirpredict`` with slightly modified arguments:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/GLN_1_PREDICT_UNKNOWN.txt
    :language: console
    :emphasize-lines: 4,6

We can notice a couple of changes here compared to the previous command:

1. We have removed the ``--evaluate`` flag, as we do not have the labels for the samples we are predicting on.
2. We have a different output configuration file, ``tutorial_01_outputs_unknown.yaml``.
3. We have a different output folder, ``tutorial_01_unknown``.

If we take a look at the ``tutorial_01_outputs_unknown.yaml`` file, we can see that it contains the following:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/tutorial_01_outputs_unknown.yaml
    :language: yaml
    :caption: tutorial_01_outputs_unknown.yaml
    :emphasize-lines: 3

Notice that everything is the same as before, but for ``output_source`` we have ``null`` instead of the ``.csv`` label file we had before.

Taking a look at the produced ``predictions.csv`` file, we can see that we only have the actual predictions, and no true labels:

.. raw:: html
    :file: ../tutorial_files/a_using_eir/01_basic_tutorial/csv_preview_unknown.html

D - Serving
^^^^^^^^^^^

In this final section, we demonstrate serving our trained model as a web service and interacting with it using HTTP requests.

Starting the Web Service
""""""""""""""""""""""""

To serve the model, use the following command:

.. code-block:: shell

    eirserve --model-path [MODEL_PATH]

Replace ``[MODEL_PATH]`` with the actual path to your trained model. This command initiates a web service that listens for incoming requests.

Here is an example of the command:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/GLN_1_DEPLOY.txt
    :language: console

.. note::
    After serving the model, you can access the automatically generated OpenAPI documentation at ``http://localhost:8000/docs`` to explore and interact with the API endpoints. Additionally, ``http://localhost:8000/info`` provides some basic information about input/output data specifications. There is also a UI available at ``http://localhost:8000/redoc`` for an alternative view of the API documentation.

Sending Requests
""""""""""""""""

With the server running, we can now send requests. The requests are prepared by loading NumPy array data, converting it to base64 encoded strings, and then constructing a JSON payload.

Here's an example Python function demonstrating this process:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/request_example/request_example_module.py
    :language: python
    :caption: request_example_module.py

When running this, we get the following output:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/request_example/request_example.json
    :language: json
    :caption: request_example.json

Analyzing Responses
"""""""""""""""""""

Here are some examples of responses from the server for a set of inputs:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/serve_results/predictions.json
    :language: json
    :caption: predictions.json

E - Applying to your own data (e.g. UK Biobank)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Thank you for reading this far! Hopefully this tutorial introduced you well enough to the framework so you can apply it to your own genetic data. For that, you will have to process it first (see: `plink pipelines`_). Then you will have to set the relevant paths for the inputs (e.g. ``input_source``, ``snp_file``) and outputs (e.g. ``output_source``, ``target_cat_columns`` or ``target_con_columns`` if you have continuous targets).

However, when moving to large scale data such as the UK Biobank, the configurations we used on the ancestry toy data in this tutorial will likely not be sufficient. For example, the learning rate is likely too high.

For this, we specifically designed the `EIR-auto-GP`_ project, which focuses on allowing you to quickly train deep learning models for genomic prediction.

Additionally, you can have a look at the :ref:`genomics-guide` for some more information on the parameters that are relevant for genomic data and how to adapt the configurations to your own data.
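
When pointing the output configuration at your own label file, one practical question is whether a target belongs under ``target_cat_columns`` or ``target_con_columns``. The sketch below is a rough heuristic and not part of ``EIR``: ``guess_target_kind`` is a hypothetical helper, and the ``Height`` column, the example rows and the ``max_classes`` threshold are all assumptions made for illustration. Treat the result as a hint to verify by hand, not as a definitive answer.

.. code-block:: python

    import csv
    import io


    def guess_target_kind(label_file, target_column, max_classes=20):
        """Heuristically label a column as "categorical" or "continuous"."""
        rows = csv.DictReader(label_file)
        values = [row[target_column] for row in rows if row[target_column] != ""]
        try:
            numeric = [float(value) for value in values]
        except ValueError:
            # At least one non-numeric value: treat the column as class labels.
            return "categorical"
        # Numeric columns with few distinct values may still be class codes.
        return "categorical" if len(set(numeric)) <= max_classes else "continuous"


    # Made-up label file with one categorical and one continuous target.
    example_csv = (
        "ID,Origin,Height\n"
        "s1,Europe,171.5\n"
        "s2,Asia,160.2\n"
        "s3,Europe,182.0\n"
        "s4,Asia,175.3\n"
    )
    for column in ("Origin", "Height"):
        kind = guess_target_kind(io.StringIO(example_csv), column, max_classes=3)
        print(column, "->", kind)

Note that textual missing-value markers (for example ``NA``) make a numeric column look categorical under this heuristic, so it is worth checking how missing values are encoded in your file first.
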