04 - Tabular to Sequence: Protein Sequence Generation
In this tutorial, we’ll employ EIR for sequence generation conditioned on tabular data. Specifically, we will be generating protein sequences conditioned on their classification.
A - Data
The dataset for this tutorial can be downloaded from here.
This dataset is processed from a Kaggle dataset available here. The original data, in turn, originates from the RCSB Protein Data Bank.
After downloading the data, your folder structure should look something like this (we will add the configuration files as we progress):
eir_tutorials/c_sequence_output/04_protein_sequence_generation
├── conf
│   ├── fusion.yaml
│   ├── globals.yaml
│   ├── inputs_tabular.yaml
│   ├── inputs_tabular_test.yaml
│   ├── output.yaml
│   ├── output_conditioned.yaml
│   └── output_conditioned_test.yaml
└── data
    ├── test_protein_sequences.csv
    ├── test_tabular_info.csv
    ├── train_protein_sequences.csv
    └── train_tabular_info.csv
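Before training, it can be useful to check how many proteins fall into each class. The sketch below counts the values in one column of the tabular file. It assumes the CSV has a header row with a column named classification (the column name used later in the input configuration); the helper name count_classes is made up for this example.

```python
import csv
from collections import Counter
from pathlib import Path


def count_classes(csv_path: str, column: str = "classification") -> Counter:
    """Count how often each value appears in one column of a CSV file."""
    counts: Counter = Counter()
    with Path(csv_path).open(newline="") as handle:
        for row in csv.DictReader(handle):
            counts[row[column]] += 1
    return counts


if __name__ == "__main__":
    data_dir = Path("eir_tutorials/c_sequence_output/04_protein_sequence_generation/data")
    train_file = data_dir / "train_tabular_info.csv"
    # Only runs if the tutorial data has been downloaded to the expected place.
    if train_file.exists():
        for label, n in count_classes(str(train_file)).most_common(10):
            print(f"{label}: {n}")
```

A heavily skewed class distribution would be worth keeping in mind when reading the generated sequences for rarer classes.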
B - Unconditional Protein Sequence Generation
Training will be similar to what we did in a previous tutorial, 01 – Sequence Generation: Generating Movie Reviews. First, we establish a baseline by training a model on the protein sequences only. Below are the relevant configurations:
# conf/globals.yaml
output_folder: eir_tutorials/tutorial_runs/c_sequence_output/04_protein_sequences
valid_size: 512
n_saved_models: 1
checkpoint_interval: 500
sample_interval: 500
memory_dataset: false
n_epochs: 20
batch_size: 256
lr: 0.0005
optimizer: "adabelief"
device: "mps"
latent_sampling:
  layers_to_sample:
    - "output_modules.protein_sequence.output_transformer.layers.1"

# conf/fusion.yaml
model_type: "pass-through"

# conf/output.yaml
output_info:
  output_source: eir_tutorials/c_sequence_output/04_protein_sequence_generation/data/train_protein_sequences.csv
  output_name: protein_sequence
  output_type: sequence
output_type_info:
  max_length: 128
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  min_freq: 1
model_config:
  embedding_dim: 64
sampling_config:
  generated_sequence_length: 128
  n_eval_inputs: 10
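Two of the output settings above are easy to misread. As we understand them, split_on: "" splits each sequence into single characters (one token per residue), and sampling_strategy_if_longer: "uniform" picks a randomly placed window of max_length tokens from sequences that are too long. The sketch below only illustrates that idea; it is not EIR's implementation, and the function names are made up.

```python
import random


def tokenize_chars(sequence: str) -> list[str]:
    """Split a sequence into single-character tokens (one per residue)."""
    return list(sequence)


def crop_uniform(tokens: list[str], max_length: int, rng: random.Random) -> list[str]:
    """Return the tokens unchanged if short enough, else a random contiguous window."""
    if len(tokens) <= max_length:
        return tokens
    start = rng.randint(0, len(tokens) - max_length)  # inclusive on both ends
    return tokens[start : start + max_length]


rng = random.Random(0)
tokens = tokenize_chars("MKT" * 100)  # 300-character toy sequence, not real data
window = crop_uniform(tokens, max_length=128, rng=rng)
print(len(tokens), len(window))  # prints: 300 128
```

With a window drawn at a different position each time, a long protein can contribute different 128-residue stretches over the course of training.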
Training the model:
eirtrain \
--global_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/globals.yaml \
--fusion_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/fusion.yaml \
--output_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/output.yaml \
--globals.output_folder=eir_tutorials/tutorial_runs/c_sequence_output/04_protein_sequence_generation_sequence_only
Executing the command above resulted in the following training curve:
You might have noticed the latent_sampling parameter in the global configuration, which allows us to extract a representation from a specified layer of the model. In addition to saving the validation set representations, we also get a couple of visualizations. For example, here is a t-SNE plot of the validation set representations at iteration 5000:
C - Conditional Protein Sequence Generation
Next, we’ll train a model that incorporates both the tabular data, which contains the protein type classification, and the protein sequences.
For this, we add the input configuration containing the tabular data:
# conf/inputs_tabular.yaml
input_info:
  input_source: eir_tutorials/c_sequence_output/04_protein_sequence_generation/data/train_tabular_info.csv
  input_name: proteins_tabular
  input_type: tabular
input_type_info:
  input_cat_columns:
    - classification
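Listing classification under input_cat_columns marks it as a categorical column. As a generic illustration of what handling a categorical column usually involves (this is not EIR's code, and the helper names are hypothetical), each distinct label is mapped to an integer index that a model can then look up in an embedding table:

```python
def build_label_index(labels: list[str]) -> dict[str, int]:
    """Assign a stable integer index to each distinct label (sorted order)."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


def encode(labels: list[str], index: dict[str, int]) -> list[int]:
    """Replace each label with its integer index."""
    return [index[label] for label in labels]


column = ["HYDROLASE", "TRANSFERASE", "HYDROLASE", "OXIDOREDUCTASE"]
index = build_label_index(column)
print(index)                  # {'HYDROLASE': 0, 'OXIDOREDUCTASE': 1, 'TRANSFERASE': 2}
print(encode(column, index))  # [0, 2, 0, 1]
```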
Additionally, we can update our output configuration to generate sequences based on manually specified tabular input values:
# conf/output_conditioned.yaml
output_info:
  output_source: eir_tutorials/c_sequence_output/04_protein_sequence_generation/data/train_protein_sequences.csv
  output_name: protein_sequence
  output_type: sequence
output_type_info:
  max_length: 128
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  min_freq: 1
model_config:
  embedding_dim: 64
sampling_config:
  generated_sequence_length: 128
  n_eval_inputs: 0
  manual_inputs:
    - proteins_tabular:
        classification: "HYDROLASE"
      protein_sequence: ""
    - proteins_tabular:
        classification: "TRANSFERASE"
      protein_sequence: ""
    - proteins_tabular:
        classification: "OXIDOREDUCTASE"
      protein_sequence: ""
Note
While not shown here, you can view the generated sequences in the samples/<iteration>/manual folder during/after training.
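The manual_inputs entries in the conditioned output configuration all follow the same pattern: one tabular classification plus an (optionally empty) starting sequence. If you want to try many classes, a small helper can generate the entries. The sketch below is only a convenience under that assumption; build_manual_inputs is a made-up name and not part of EIR.

```python
import json


def build_manual_inputs(classifications: list[str], prefix: str = "") -> list[dict]:
    """Create one manual input entry per classification label."""
    return [
        {
            "proteins_tabular": {"classification": label},
            "protein_sequence": prefix,
        }
        for label in classifications
    ]


entries = build_manual_inputs(["HYDROLASE", "TRANSFERASE", "OXIDOREDUCTASE"])
# Printed as JSON, which most YAML parsers also accept.
print(json.dumps({"manual_inputs": entries}, indent=2))
```

The printed structure mirrors the manual_inputs list under sampling_config, so it can be adapted by hand into the YAML file.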
Training the conditional model:
eirtrain \
--global_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/globals.yaml \
--input_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/inputs_tabular.yaml \
--fusion_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/fusion.yaml \
--output_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/output_conditioned.yaml \
--globals.output_folder=eir_tutorials/tutorial_runs/c_sequence_output/04_protein_sequence_generation_tabular
When executing the above command, the following training curve was obtained:
The validation loss is lower than for the sequence-only model, albeit only slightly, which suggests that the model makes some use of the tabular data when generating sequences.
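To put loss values in context, it can help to convert them to perplexity and compare against a uniform-guess baseline. The sketch below assumes the reported loss is a mean per-token cross-entropy in nats, which is our assumption and not something stated in this tutorial. It also uses 20 standard amino acids as the vocabulary size, while the real vocabulary may contain additional symbols.

```python
import math


def perplexity(cross_entropy_nats: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(cross_entropy_nats)


def uniform_baseline(vocab_size: int) -> float:
    """Cross-entropy of guessing uniformly among vocab_size tokens."""
    return math.log(vocab_size)


# 20 standard amino acids; the actual vocabulary may be larger.
baseline = uniform_baseline(20)
print(f"uniform baseline: {baseline:.3f} nats, perplexity {perplexity(baseline):.1f}")
```

Under these assumptions, a loss below roughly 3.0 nats means the model does better than guessing residues uniformly at random.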
Similarly to before, we can visualize the validation set representations at iteration 5000, now for the conditional model:
The separation does seem to be slightly better than before, which could be because the model is given additional information from the tabular data.
D - Generating New Sequences of a Specific Protein Type
Finally, we will take a quick look at how we can use a trained model to generate new sequences of a specific protein type. For this, we will use configuration files similar to the ones used for training, but now pointing to the test set data:
# conf/inputs_tabular_test.yaml
input_info:
  input_source: eir_tutorials/c_sequence_output/04_protein_sequence_generation/data/test_tabular_info.csv
  input_name: proteins_tabular
  input_type: tabular
input_type_info:
  input_cat_columns:
    - classification
# conf/output_conditioned_test.yaml
output_info:
  output_source: eir_tutorials/c_sequence_output/04_protein_sequence_generation/data/test_protein_sequences.csv
  output_name: protein_sequence
  output_type: sequence
output_type_info:
  max_length: 128
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  min_freq: 1
model_config:
  embedding_dim: 64
sampling_config:
  generated_sequence_length: 512
  n_eval_inputs: 0
  manual_inputs:
    - proteins_tabular:
        classification: "HYDROLASE"
      protein_sequence: ""
    - proteins_tabular:
        classification: "TRANSFERASE"
      protein_sequence: ""
    - proteins_tabular:
        classification: "OXIDOREDUCTASE"
      protein_sequence: ""
Now, we can use the eirpredict command as follows:
eirpredict \
--global_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/globals.yaml \
--input_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/inputs_tabular_test.yaml \
--fusion_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/fusion.yaml \
--output_configs eir_tutorials/c_sequence_output/04_protein_sequence_generation/conf/output_conditioned_test.yaml \
--model_path eir_tutorials/tutorial_runs/c_sequence_output/04_protein_sequence_generation_tabular/saved_models/04_protein_sequence_generation_tabular_model_5500_perf-average=-1.7293.pt \
--output_folder eir_tutorials/tutorial_runs/c_sequence_output/04_protein_sequence_generation_tabular/test_results \
--evaluate
This will save the results in the specified --output_folder. While we do evaluate the loss, it’s perhaps more interesting to look at the generated sequences as well as the latent sampling, available in the results and latents folders, respectively.
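The exact file layout inside the results and latents folders is not described here, so a safe first step is simply to list what was written. The sketch below walks each subfolder and prints the first few files. The helper name list_outputs is made up, and the folder path is the --output_folder from the command above.

```python
from pathlib import Path


def list_outputs(output_folder: str, subfolder: str) -> list[Path]:
    """Return all files below output_folder/subfolder, sorted by path."""
    root = Path(output_folder) / subfolder
    if not root.is_dir():
        return []
    return sorted(path for path in root.rglob("*") if path.is_file())


if __name__ == "__main__":
    folder = (
        "eir_tutorials/tutorial_runs/c_sequence_output/"
        "04_protein_sequence_generation_tabular/test_results"
    )
    for name in ("results", "latents"):
        files = list_outputs(folder, name)
        print(f"{name}: {len(files)} file(s)")
        for path in files[:5]:
            print("  ", path)
```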
E - Serving
In this final section, we demonstrate serving our trained model for protein sequence generation with tabular inputs as a web service and interacting with it using HTTP requests.
Starting the Web Service
To serve the model, use the following command:
eirserve --model-path [MODEL_PATH]
Replace [MODEL_PATH] with the actual path to your trained model. This command initiates a web service that listens for incoming requests.
Here is an example of the command:
eirserve \
--model-path eir_tutorials/tutorial_runs/c_sequence_output/04_protein_sequence_generation_tabular/saved_models/04_protein_sequence_generation_tabular_model_5500_perf-average=-1.7293.pt
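The saved model file name embeds the iteration and a performance value (5500 and -1.7293 in the path above), so it will differ between training runs. The sketch below picks the checkpoint with the highest iteration from a saved_models folder. It assumes file names contain _model_<iteration>_ as in the example path; find_latest_model is a hypothetical helper, not an EIR function.

```python
import re
from pathlib import Path
from typing import Optional


def find_latest_model(saved_models_dir: str) -> Optional[Path]:
    """Return the .pt file with the highest iteration number, if any."""
    pattern = re.compile(r"_model_(\d+)_")
    best: Optional[Path] = None
    best_iteration = -1
    for path in Path(saved_models_dir).glob("*.pt"):
        match = pattern.search(path.name)
        if match and int(match.group(1)) > best_iteration:
            best, best_iteration = path, int(match.group(1))
    return best
```

The returned path can then be passed on the command line in place of the hard-coded file name. Note that the highest iteration is not necessarily the best-performing checkpoint.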
Sending Requests
With the server running, we can now send requests that include tabular data to generate protein sequences.
Here’s an example Python function demonstrating this process:
import requests


def send_request(url: str, payload: dict) -> dict:
    # POST the payload as JSON and return the decoded JSON response.
    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json()


example_requests = [
    {"proteins_tabular": {"classification": "HYDROLASE"}, "protein_sequence": ""},
    {"proteins_tabular": {"classification": "TRANSFERASE"}, "protein_sequence": ""},
]

for payload in example_requests:
    response = send_request("http://localhost:8000/predict", payload)
    print(f"Classification: {payload['proteins_tabular']['classification']}")
    # The generated sequence is nested under "result" in the responses shown below.
    print(f"Generated protein sequence: {response['result']['protein_sequence']}\n")
Additionally, you can send requests using bash:
curl -X 'POST' \
  'http://localhost:8000/predict' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "proteins_tabular": {"classification": "HYDROLASE"},
    "protein_sequence": ""
  }'
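If the requests package is not available, the same POST can be made with the Python standard library. The sketch below mirrors the curl call above, sending the same headers and JSON body, and adds a timeout. The function names build_request and post_json are made up for this example, and the URL is the one used in the examples above.

```python
import json
import urllib.request


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Prepare a JSON POST request equivalent to the curl call."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def post_json(url: str, payload: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(url, payload), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, post_json("http://localhost:8000/predict", {"proteins_tabular": {"classification": "HYDROLASE"}, "protein_sequence": ""}) should return the decoded response as a dictionary.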
Analyzing Responses
After sending requests to the served model, we can inspect the responses. Below are example request/response pairs; each response holds the generated sequence under result. In the third example, the request supplies the prefix "AAA", and the generated sequence begins with it.
[
  {
    "request": {
      "proteins_tabular": {
        "classification": "HYDROLASE"
      },
      "protein_sequence": ""
    },
    "response": {
      "result": {
        "protein_sequence": "EILYEGKLLSGGVDAVFLPVRRDIKSVSALGYQSVDEDRILQSGDTIIVRDGPKIIGGLRAHAVHESIGLTLEGPAEFGVGSPEARFDETVRRTGVLVDHLDVAPVTARRRGVLVKGRLEFAIGLVIA"
      }
    }
  },
  {
    "request": {
      "proteins_tabular": {
        "classification": "TRANSFERASE"
      },
      "protein_sequence": ""
    },
    "response": {
      "result": {
        "protein_sequence": "KEIYLNGAVNKYIYNVTNLSSGKEATKDIKKASKVTGQAAIREVKGDKIIKAYARKEDKLSKDPIIKDNLIVGIKELISFEYVTGNPDFVSLRLKGVLGGYTFEFVKPNKDEFFVAIPYFKTVEEKID"
      }
    }
  },
  {
    "request": {
      "proteins_tabular": {
        "classification": "OXIDOREDUCTASE"
      },
      "protein_sequence": "AAA"
    },
    "response": {
      "result": {
        "protein_sequence": "AAALLKLKKAVVLTGSQAILALGAVGAGASLRGGSADFQPVVAPGTASGIPTASVTFVKEAAQVLAENAATAVFGRDGDALRLTVTDAELDRTVETRVSPPLEKAVILALASAEDEEATRGVIVATGA"
      }
    }
  }
]
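A quick way to check request/response pairs like the ones above is to summarize each one: which class was requested, how long the generated sequence is, and whether it starts with the supplied prefix. The sketch below assumes records shaped like the JSON above. The summarize helper is made up, and the example record uses a shortened placeholder sequence, not full model output.

```python
def summarize(records: list[dict]) -> list[dict]:
    """Summarize request/response pairs shaped like the JSON above."""
    rows = []
    for record in records:
        prefix = record["request"]["protein_sequence"]
        generated = record["response"]["result"]["protein_sequence"]
        rows.append(
            {
                "classification": record["request"]["proteins_tabular"]["classification"],
                "length": len(generated),
                "keeps_prefix": generated.startswith(prefix),
            }
        )
    return rows


# Shortened placeholder sequence for illustration only.
example = [
    {
        "request": {
            "proteins_tabular": {"classification": "OXIDOREDUCTASE"},
            "protein_sequence": "AAA",
        },
        "response": {"result": {"protein_sequence": "AAALLK"}},
    }
]
print(summarize(example))
```

For requests with an empty starting sequence, keeps_prefix is always true, since every string starts with the empty string.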
If you made it this far, I want to thank you for reading!