03 – Sequence Tutorial: Movie Reviews and Peptides
In this tutorial, we will train models that take discrete sequences as inputs, covering two tasks. First, we train a model to classify positive vs. negative sentiment in the IMDB reviews dataset. Second, we train another model to detect anticancer properties in peptides using the anticancer peptides dataset.
Note that this tutorial assumes that you are already familiar with the basic functionality of the framework (see 01 – Genotype Tutorial: Ancestry Prediction).
A - IMDB Reviews
A1 - IMDB Setup
For this first task, we will tackle a relatively classic NLP problem: training a model to predict sentiment from IMDB reviews. See here for more information about the data. To download the data and configurations for this part of the tutorial, use this link.
Here we can see an example of one review from the dataset.
$ cat IMDB/IMDB_Reviews/3314_1.txt
Reading through all these positive reviews I find myself baffled.
How is it that so many enjoyed what I consider to be a woefully bad adaptation
of my second favourite Jane Austen novel? There are many problems with the film,
already mentioned in a few reviews; simply put it is a hammed-up, over-acted,
chintzy mess from opening credits to butchered ending.<br /><br />While many
characters are mis-cast and neither Ewan McGregor nor Toni Collette puts in a
performance that is worthy of them, the worst by far is Paltrow. \
I have very much enjoyed her performance in some roles, but here she is
abominable - she is self-conscious, nasal, slouching and entirely disconnected
from her characters and those around her. An extremely disappointing effort -
though even a perfect Emma could not have saved this film.
Whatever movie this review is from, the reviewer certainly did not enjoy it! This is fairly obvious for us to see; the question now is whether we can train a model to do the same.
As in previous tutorials, we will start by defining our configurations.
output_folder: eir_tutorials/tutorial_runs/a_using_eir/tutorial_03_imdb_run
valid_size: 0.10
n_saved_models: 1
checkpoint_interval: 500
sample_interval: 500
memory_dataset: true
n_epochs: 25
compute_attributions: true
max_attributions_per_class: 512
attributions_every_sample_factor: 4
Note
You might notice that the global configuration in this tutorial has a couple of new parameters, namely compute_attributions, max_attributions_per_class and attributions_every_sample_factor. These settings relate to computing attributions, so we can interpret/explain how our inputs influence the model outputs. For more information, check out the Configuration API reference.
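As a rough illustration of how sample_interval and attributions_every_sample_factor interact, the sketch below lists the iterations at which attributions would be computed, assuming the factor simply multiplies the sampling interval. The function name and the n_iterations argument are made up for this example and are not part of the framework.

```python
def attribution_iterations(
    sample_interval: int, every_sample_factor: int, n_iterations: int
) -> list[int]:
    """Iterations at which attributions are computed (assumed behaviour)."""
    # Assumption: attributions run at every `every_sample_factor`-th sampling step.
    step = sample_interval * every_sample_factor
    return list(range(step, n_iterations + 1, step))


# Values taken from the global configuration above; 6000 iterations is arbitrary.
print(attribution_iterations(sample_interval=500, every_sample_factor=4, n_iterations=6000))
# [2000, 4000, 6000]
```

Under this assumption, the settings above yield attributions every 2000 iterations, while validation samples are still evaluated every 500.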
input_info:
  input_source: eir_tutorials/a_using_eir/03_sequence_tutorial/data/IMDB/IMDB_Reviews
  input_name: imdb_reviews
  input_type: sequence

input_type_info:
  sampling_strategy_if_longer: "uniform"
  max_length: 64
  split_on: " "
  min_freq: 10
  tokenizer: "basic_english"
  tokenizer_language: "en"

model_config:
  model_type: sequence-default
  embedding_dim: 32
  position: embed
  pool: avg
  model_init_config:
    num_heads: 2
    dropout: 0.2
output_info:
  output_source: eir_tutorials/a_using_eir/03_sequence_tutorial/data/IMDB/imdb_labels.csv
  output_name: imdb_output
  output_type: tabular

output_type_info:
  target_cat_columns:
    - Sentiment
Tip
There are a lot of new configuration options here; head over to the Configuration API reference for more details.
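To build some intuition for options such as split_on, min_freq, max_length and sampling_strategy_if_longer, here is a minimal sketch in plain Python. It is not the framework's implementation: the helper names, the <pad>/<unk> tokens and the reading of "uniform" as "sample a random contiguous window" are all assumptions made for illustration.

```python
import random
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def tokenize(text: str, split_on: str) -> list[str]:
    # str.split("") raises ValueError, so an empty separator is treated
    # as "one token per character" (e.g. for peptide strings).
    return list(text) if split_on == "" else text.split(split_on)


def build_vocab(texts: list[str], split_on: str, min_freq: int) -> dict[str, int]:
    counts = Counter(tok for text in texts for tok in tokenize(text, split_on))
    vocab = {PAD: 0, UNK: 1}
    for token in sorted(counts):
        if counts[token] >= min_freq:  # rarer tokens will map to UNK
            vocab[token] = len(vocab)
    return vocab


def encode(text: str, vocab: dict[str, int], split_on: str,
           max_length: int, rng: random.Random) -> list[int]:
    tokens = tokenize(text, split_on)
    if len(tokens) > max_length:
        # Assumed meaning of "uniform": a random window of max_length tokens.
        start = rng.randint(0, len(tokens) - max_length)
        tokens = tokens[start:start + max_length]
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens]
    return ids + [vocab[PAD]] * (max_length - len(ids))  # pad short sequences


reviews = ["a great great movie", "a bad movie", "great fun"]
vocab = build_vocab(reviews, split_on=" ", min_freq=2)
print(vocab)  # {'<pad>': 0, '<unk>': 1, 'a': 2, 'great': 3, 'movie': 4}
print(encode("a great film", vocab, split_on=" ", max_length=5, rng=random.Random(0)))
# [2, 3, 1, 0, 0]
```

Note how "film" never reaches min_freq and is therefore mapped to the unknown token, and how the short review is padded up to max_length.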
Now with the configurations set up, our folder structure should look like this:
eir_tutorials/a_using_eir/03_sequence_tutorial/
├── a_IMDB
│   └── conf
│       ├── 03a_imdb_globals.yaml
│       ├── 03a_imdb_input.yaml
│       └── 03a_imdb_output.yaml
└── data
    └── IMDB
        ├── IMDB_Reviews
        ├── conf
        ├── imdb.vocab
        └── imdb_labels.csv
A2 - IMDB Training
As before, we can train a model using eirtrain:
eirtrain \
--global_configs eir_tutorials/a_using_eir/03_sequence_tutorial/a_IMDB/conf/03a_imdb_globals.yaml \
--input_configs eir_tutorials/a_using_eir/03_sequence_tutorial/a_IMDB/conf/03a_imdb_input.yaml \
--output_configs eir_tutorials/a_using_eir/03_sequence_tutorial/a_IMDB/conf/03a_imdb_output.yaml
This took around 20 minutes to run on my laptop, so this is a good chance to take a nap or do something else for a while!
Looking at the accuracy, I got the following training/validation results:
Perhaps not great, but not too bad either! Especially since we are using a relatively short sequence length.
Note
Here we are using a transformer-based neural network for the training; however, do not underestimate the power of classical, more established methods. In fact, simpler, non-neural-network methods have attained better accuracy than what we see above! If you have some time to kill, try playing with the hyperparameters a bit to see how they affect the performance.
A3 - IMDB Interpretation
Now, remember those new flags we used in the global configuration, compute_attributions and friends? Setting those instructs the framework to compute and analyze how the inputs influence the model towards a certain output. In this case, the attributions can be found in the imdb_sentiment/results/Sentiment/samples/<every_2000_iterations>/attributions folders. Behind the scenes, the framework uses integrated gradients, implemented in the fantastic Captum library, to compute the attributions.
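For readers unfamiliar with integrated gradients, here is a minimal, self-contained sketch of the idea on a toy logistic model. It is neither Captum's nor the framework's code: the model, weights and all-zeros baseline are made up, and the path integral is approximated with a simple midpoint rule.

```python
import math


def model(x: list[float], w: list[float]) -> float:
    """Toy model: sigmoid of a weighted sum."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x: list[float], w: list[float]) -> list[float]:
    """Analytic gradient of the toy model with respect to its inputs."""
    p = model(x, w)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, steps=200):
    """Average gradients along the straight path baseline -> x, scaled by (x - baseline)."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        totals = [t + g for t, g in zip(totals, gradient(point, w))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


w = [2.0, -1.0, 0.5]        # hypothetical weights
x = [1.0, 1.0, 1.0]         # input to explain
baseline = [0.0, 0.0, 0.0]  # reference input

attributions = integrated_gradients(x, baseline, w)
print(attributions)
# Completeness: attributions sum (approximately) to model(x) - model(baseline).
print(sum(attributions), model(x, w) - model(baseline, w))
```

The sign of each attribution indicates whether that input pushed the output up or down relative to the baseline, which is what the token-level plots below visualise for sequences.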
Firstly, let’s have a look at the words that had the biggest influence towards a Positive and Negative sentiment.
Note
Which tokens are included in this plot and how they are sorted is based both on the average and 95% confidence interval of the attribution. The raw values are also stored, in case you want to do your own analysis. The CIs represent the 95% confidence interval after 1,000 bootstrap samples.
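If you want to redo this kind of analysis on the stored raw values, a bootstrap confidence interval for the mean attribution of a token can be sketched as follows. This assumes a plain percentile bootstrap, which may differ from what the framework does internally; the function name and the example values are made up.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_bootstrap=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_bootstrap)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_bootstrap)]
    upper = means[int((1.0 - tail) * n_bootstrap) - 1]
    return lower, upper


# Made-up attribution values for a single token across validation samples.
token_attributions = [0.10, 0.40, 0.35, 0.80, 0.20, 0.60, 0.50, 0.30]
low, high = bootstrap_mean_ci(token_attributions)
print(f"mean={statistics.fmean(token_attributions):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Tokens that appear in only a few samples will tend to get wide intervals, which is worth keeping in mind when reading the plots.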
Fortunately, it seems that our model did indeed learn some relevant things! When training on sequences, the framework will also by default save attributions towards the relevant label for 10 single samples. Here is one such example output, where we look at the attributions towards a positive sentiment.
ID | True Label | Attribution Score | Token Importance |
---|---|---|---|
this movie didn ' t really surprise me , as such , it just got better and better . i thought paul | |||
i have to say that this miniseries was the best interpretation of the beloved novel jane eyre . both dalton and clarke are very believable as rochester and jane . i ' ve seen other versions , but none compare to this one . the best one for me . i could never imagine anyone else playing these characters ever again . the last | |||
at first glance , it would seem natural to compare where the sidewalk ends with laura . both have noirish qualities , both were directed by otto preminger , and both star dana andrews and gene tierney . but that ' s where most of the comparisons end . laura dealt with posh , sophisticated people with means who just happen to find themselves | |||
north africa in the 1930 ' s . to a small arab town on the edge of the | |||
this short is a | |||
house of games has a strong story where obsession and illusion play a big part . a psychologist offers to help a patient with his gambling debts and gets caught at the game . have you ever felt fascination for something that was both dangerous and wrong ? watch what happens if you pursue this urge and go all the way . sit on | |||
this movie seems on the surface to be a run of mill kids movie that parents can | |||
if , like me , you like your films to be unique , and unlike the majority of other movies , then i wholly recommend that you check out the beast . the film is a grotesque , erotic , fantasy fairytale that centres around a mythological ' beast ' that is | |||
i would like to comment the series as a great effort . the story line although requiring a few improvements was pretty well , especially in season 1 . season 2 however became more of a freak show , and lost da ' s original charm . season one story line was more interesting , a light side to the life at jam pony | |||
one night i stumbled upon this on the satellite station bravo . initially out of curiosity i decided to watch it . to be perfectly honest i wasn ' t disappointed . the main character is beautiful and her body is shown off well . you would think her talents would be wasted as a | |||
That concludes the NLP-specific part of this tutorial. Next, we will apply the same approach to biological data!
B - Anticancer Peptides
B1 - Anticancer Peptides Setup
Modelling on language like we did above is both fun and relatable, but now we try something a bit more niche. For this second part of the tutorial, we will use the framework to predict anti breast cancer properties of peptides (a peptide is basically a short protein sequence). See here for more information about the dataset. To download the data and configurations for this part of the tutorial, use this link.
Again, let’s take a quick look at one of the samples we are going to be modelling on:
$ cat Anticancer_Peptides/breast_cancer_train/1.txt
AAWKWAWAKKWAKAKKWAKAA
So immediately we can see that this is fairly different from our movie reviews; let’s see how it goes with the modelling part.
As always, we start with the configurations. You might notice a new option in the global configuration, weighted_sampling_columns. This setting controls which target column to use for weighted sampling, and the special keyword all will take an average across all target columns. In this case we have only one (“class”), so it just accounts for that one. This can be useful for this dataset as it is quite imbalanced w.r.t. target labels, as you will see momentarily.
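To make the idea of weighted sampling concrete, the sketch below computes inverse-class-frequency sample weights in plain Python. This is only an illustration of the general technique; the function name and the label values are made up, and the framework's exact weighting scheme may differ.

```python
from collections import Counter


def inverse_frequency_weights(labels: list[str]) -> list[float]:
    """Give each sample a weight inversely proportional to its class frequency."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Hypothetical, imbalanced label column.
labels = ["inactive"] * 8 + ["active"] * 2
weights = inverse_frequency_weights(labels)

# Each class now carries the same total sampling weight.
per_class = Counter()
for label, weight in zip(labels, weights):
    per_class[label] += weight
print(dict(per_class))
```

Weights like these are what one would typically hand to a weighted sampler (for example PyTorch's WeightedRandomSampler), so that rare classes are drawn about as often as common ones during training.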
output_folder: eir_tutorials/tutorial_runs/a_using_eir/tutorial_03_anti_breast_cancer_peptides_run
valid_size: 0.25
n_saved_models: 1
checkpoint_interval: 200
sample_interval: 200
n_epochs: 500
memory_dataset: True
batch_size: 32
early_stopping_buffer: 2000
compute_attributions: True
attributions_every_sample_factor: 3
max_attributions_per_class: 512
weighted_sampling_columns:
- all
Note
You might notice that we use a large validation set here. This is a similar situation as in 02 – Tabular Tutorial: Nonlinear Poker Hands, where we used a manual validation set to ensure that we have all classes present in the validation set. Here, we take the lazier approach and just make the validation set larger. Currently the framework does not handle having a mismatch in which classes are present in the training and validation sets.
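If you want to check this kind of mismatch on your own labels before training, a small helper along the following lines can be used. It is a hypothetical sketch that performs its own random split, which will not match the framework's split; it only illustrates why a larger validation fraction makes missing classes less likely.

```python
import random


def classes_in_only_one_split(labels: list[str], valid_size: float, seed: int = 0) -> set[str]:
    """Randomly split labels and return classes missing from either split."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    n_valid = int(len(labels) * valid_size)
    valid = {labels[i] for i in indices[:n_valid]}
    train = {labels[i] for i in indices[n_valid:]}
    return train ^ valid  # symmetric difference: present in only one split


# Made-up, imbalanced labels: the rare class is easy to miss with a small split.
labels = ["inactive"] * 95 + ["active"] * 5
for valid_size in (0.05, 0.25, 0.50):
    print(valid_size, classes_in_only_one_split(labels, valid_size))
```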
Notice that the input configuration is slightly different. For example, as we are not dealing with natural language, we no longer split on whitespace, but rather on "" (the empty string), so that each character (amino acid) becomes its own token.
input_info:
  input_source: eir_tutorials/a_using_eir/03_sequence_tutorial/data/Anticancer_Peptides/breast_cancer_train
  input_name: peptide_sequences
  input_type: sequence

input_type_info:
  max_length: "max"
  split_on: ""
  min_freq: 1

model_config:
  model_type: sequence-default
  position: embed
  embedding_dim: 32
  pool: avg
  model_init_config:
    num_heads: 8
    dropout: 0.2

interpretation_config:
  num_samples_to_interpret: 30
  interpretation_sampling_strategy: random_sample
output_info:
  output_source: eir_tutorials/a_using_eir/03_sequence_tutorial/data/Anticancer_Peptides/breast_cancer_labels.csv
  output_name: peptides_output
  output_type: tabular

output_type_info:
  target_cat_columns:
    - class
B2 - Anticancer Peptides Training
For the peptide data, the folder structure should look something like this:
eir_tutorials/a_using_eir/03_sequence_tutorial/
├── b_Anticancer_peptides
│   └── conf
│       ├── 03b_peptides_globals.yaml
│       ├── 03b_peptides_input.yaml
│       └── 03b_peptides_output.yaml
└── data
    └── Anticancer_Peptides
        ├── breast_cancer_labels.csv
        └── breast_cancer_train
As before, we run:
eirtrain \
--global_configs eir_tutorials/a_using_eir/03_sequence_tutorial/b_Anticancer_peptides/conf/03b_peptides_globals.yaml \
--input_configs eir_tutorials/a_using_eir/03_sequence_tutorial/b_Anticancer_peptides/conf/03b_peptides_input.yaml \
--output_configs eir_tutorials/a_using_eir/03_sequence_tutorial/b_Anticancer_peptides/conf/03b_peptides_output.yaml
As the data is imbalanced, we will look at the MCC training curve:
Checking the confusion matrix at iteration 2000, we see:
Looking at the training curve, we see that we are definitely overfitting quite a bit! We could probably squeeze out a better performance by playing with the hyperparameters a bit, but for now we will keep going!
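As a side note on why MCC is preferred over accuracy for imbalanced data, here is a small pure-Python sketch of the multi-class Matthews correlation coefficient. It follows the standard multi-class formulation rather than the framework's own code, and the labels are made up.

```python
import math
from collections import Counter


def matthews_corrcoef(y_true: list, y_pred: list) -> float:
    """Multi-class MCC; returns 0.0 when it is undefined (zero denominator)."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    classes = set(true_counts) | set(pred_counts)
    cov_tp = correct * n - sum(true_counts[k] * pred_counts[k] for k in classes)
    cov_tt = n * n - sum(true_counts[k] ** 2 for k in classes)
    cov_pp = n * n - sum(pred_counts[k] ** 2 for k in classes)
    denominator = math.sqrt(cov_tt * cov_pp)
    return cov_tp / denominator if denominator else 0.0


# A classifier that always predicts the majority class on made-up labels:
y_true = ["inactive"] * 9 + ["active"]
y_pred = ["inactive"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                           # 0.9
print(matthews_corrcoef(y_true, y_pred))  # 0.0
```

The majority-class predictor looks good by accuracy but gets an MCC of zero, which is why MCC gives a more honest picture here.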
As before, let’s have a look at the attributions. In this case we will check attributions towards the moderately active class:
In this case, it seems that there is a high degree of uncertainty in the attributions, as the confidence intervals are quite large. This is likely due to the fact that the dataset is quite imbalanced, and there are few samples of moderately active peptides in the validation set.
Looking at an example of a single moderately active sample and how its inputs influence the model towards a prediction of the moderately active class, we see:
ID | True Label | Attribution Score | Token Importance |
---|---|---|---|
T S A Q T K V V V D A | |||
L D A Y I N L G N V L K E A | |||
K K A E A V A T V V A A V D Q A R V R | |||
K K K F P W W W P F K K K | |||
E E V K K H G T T V L T A L G R I L K Q | |||
N A A G W D L L L T L Y R S A | |||
E D N L L R Q L A Q K V | |||
S E E R I R S G V K R L S K S R Q | |||
K W K L F K K I L K F L H L A K K F | |||
M Y S N R M R S Y K Q E M G K L E T D F K R S R I | |||
E V K R S V N R D F A K W F L I V F I | |||
T D Q Q K V S E I F Q S S K E K L Q G D A K V V S D A F K | |||
L P H F Y E L F S L W A | |||
D D Y L K E Q V L H M K Q Y V S D N | |||
S Y K D L F L E L Y G K I K D | |||
P D I K A Q Y Q Q R W L | |||
V G V L L Q L L V Q A | |||
N S N H Q M L L V Q Q A E D K I K E L L N T | |||
E D L P K W S G Y F E K L L K K N | |||
H G L V K A G H P L K R K L G H | |||
G L F D I A K K V I G V I G S L | |||
N Y E E I Y I L N H I L R | |||
T Q S D V Y A M V G Y I H E L W | |||
G A Q Y I Q A A G V A L G L K M R | |||
E P G Q R K I V M H K | |||
N P A R A L Y Q T V R E L I E N S L D A | |||
D Q E D V A Q T I R D Y D | |||
V N F L V A D A L K Q H R H R R D D V I V M L S A R | |||
A Q T S R W A A M Q I G M S F I S A Y | |||
A S L P E A I E A L T K G | |||
Warning
Remember that this does not necessarily tell us anything about actual biological causality!
C - Serving
In this final section, we demonstrate serving our trained model as a web service and interacting with it using HTTP requests.
Starting the Web Service
To serve the model, use the following command:
eirserve --model-path [MODEL_PATH]
Replace [MODEL_PATH] with the actual path to your trained model. This command initiates a web service that listens for incoming requests.
Here is an example of the command:
eirserve \
--model-path eir_tutorials/tutorial_runs/a_using_eir/tutorial_03_imdb_run/saved_models/tutorial_03_imdb_run_model_3500_perf-average=0.7969.pt
Sending Requests
With the server running, we can now send requests. For sequence data like IMDb reviews, we send the payload as a simple JSON object.
Here’s an example Python function demonstrating this process:
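import requests


def send_request(url: str, payload: dict) -> dict:
    """POST the payload as JSON and return the decoded JSON response."""
    response = requests.post(url, json=payload)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()


# The key matches input_name in the input configuration.
payload = {
    "imdb_reviews": "This movie was great! I loved it!"
}

response = send_request("http://localhost:8000/predict", payload)
print(response)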
Additionally, you can send requests using bash:
curl -X 'POST' \
  'http://localhost:8000/predict' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "imdb_reviews": "This movie was great! I loved it!"
  }'
Analyzing Responses
After sending requests to the served model, the responses can be analyzed. These responses provide insights into the model’s predictions based on the input data.
[
{
"request": {
"imdb_reviews": "This move was great! I loved it!"
},
"response": {
"result": {
"imdb_output": {
"Sentiment": {
"Negative": 0.10254316031932831,
"Positive": 0.8974568247795105
}
}
}
}
},
{
"request": {
"imdb_reviews": "This move was terrible! I hated it!"
},
"response": {
"result": {
"imdb_output": {
"Sentiment": {
"Negative": 0.862532913684845,
"Positive": 0.13746710121631622
}
}
}
}
},
{
"request": {
"imdb_reviews": "You'll have to have your wits about you and your brain fully switched on watching Oppenheimer as it could easily get away from a nonattentive viewer. This is intelligent filmmaking which shows it's audience great respect. It fires dialogue packed with information at a relentless pace and jumps to very different times in Oppenheimer's life continuously through it's 3 hour runtime. There are visual clues to guide the viewer through these times but again you'll have to get to grips with these quite quickly. This relentlessness helps to express the urgency with which the US attacked it's chase for the atomic bomb before Germany could do the same. An absolute career best performance from (the consistenly brilliant) Cillian Murphy anchors the film. "
},
"response": {
"result": {
"imdb_output": {
"Sentiment": {
"Negative": 0.017507053911685944,
"Positive": 0.9824929237365723
}
}
}
}
}
]
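To turn such a response into a single predicted label, one can pick the class with the highest probability. The helper below is a hypothetical sketch that assumes the response has the "result" → output name → target column structure shown above; the default argument values come from the configurations used in this tutorial.

```python
def predicted_label(
    response: dict, output_name: str = "imdb_output", target: str = "Sentiment"
) -> tuple[str, float]:
    """Return the most probable class and its probability."""
    probabilities = response["result"][output_name][target]
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]


# The "response" part of the first entry shown above.
example_response = {
    "result": {
        "imdb_output": {
            "Sentiment": {
                "Negative": 0.10254316031932831,
                "Positive": 0.8974568247795105,
            }
        }
    }
}
print(predicted_label(example_response))  # ('Positive', 0.8974568247795105)
```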
This concludes the sequence tutorial; thank you for making it this far. I hope you enjoyed it and that it was useful to you. Feel free to try this out on your own data, I would love to hear about it!