Sequence Generation: Generating Movie Reviews
In this tutorial, we will look into the built-in support in EIR for sequence generation tasks (similar to what GPT does). Sequences can represent various types of data such as time series, sentences, genetic information, and more. This technique allows us to generate new, meaningful sequences based on patterns learned from the training data.
We will be using the same dataset as in the Sequence Tutorial: Movie Reviews and Peptides, namely the IMDB reviews dataset. However, instead of classifying the reviews, our goal this time will be to generate new movie reviews.
Note
This tutorial assumes you are familiar with the basics of EIR. Going through the Genotype Tutorial: Ancestry Prediction and the Sequence Tutorial: Movie Reviews and Peptides first is not required, but recommended.
A - Data
As in the Sequence Tutorial: Movie Reviews and Peptides, we will be using the IMDB reviews dataset. See here for more information about the data. To download the data, use this link.
After downloading the data, the folder structure should look like this (we will look at the configs in a bit):
eir_tutorials/c_sequence_output/01_sequence_generation
├── conf
│   ├── fusion.yaml
│   ├── globals.yaml
│   ├── output_bpe.yaml
│   ├── output_test.yaml
│   └── output.yaml
└── data
    └── IMDB
        └── IMDB_Reviews
B - Training
Training is almost the same as when doing supervised learning, with a couple of changes in our configurations. The biggest difference is perhaps that when doing pure sequence generation tasks (i.e., there are no auxiliary inputs), we do not need to specify an input configuration. We only have a global, fusion and output config.
The global config does not introduce any new parameters:
basic_experiment:
  batch_size: 256
  memory_dataset: true
  n_epochs: 100
  output_folder: eir_tutorials/tutorial_runs/c_sequence_output/01_sequence_generation
  valid_size: 500
evaluation_checkpoint:
  checkpoint_interval: 500
  n_saved_models: 1
  sample_interval: 500
optimization:
  lr: 0.001
Note
I am using the mps device for training, which is available on some
Macs. If you are using a different device, you can change it to cpu or, e.g.,
cuda:0.
When sequence generation is the only task, the only fusion module currently supported is “pass-through”, because each sequence generation head performs its own fusion. Customizing the fusion module with settings we have seen before (e.g., setting the model type to “mlp-residual”) would therefore have no effect. If sequence generation is one of multiple tasks, and at least one of them is a supervised prediction task, you can customize the fusion module, but it will only be used for the supervised task. The sequence generation task will still use the “pass-through” fusion, which is added automatically.
model_type: "pass-through"
Now for the output, the structure is very similar to what we have seen before,
but with a couple of changes. The first difference is the output_type, here
instead of tabular, we set it to sequence. The other difference is that
we now have a sampling_config, specific to sequence generation. This allows
us to configure various parameters related to the sampling process during training,
where sequences are generated every sample_interval iterations.
Another thing of note is that here we are training a character-level model,
as split_on is set to "".
output_info:
  output_source: eir_tutorials/c_sequence_output/01_sequence_generation/data/IMDB/IMDB_Reviews
  output_name: imdb_output
  output_type: sequence
output_type_info:
  max_length: 64
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  min_freq: 1
model_config:
  embedding_dim: 64
  model_init_config:
    num_layers: 6
sampling_config:
  generated_sequence_length: 128
  n_eval_inputs: 1
  manual_inputs:
    - imdb_output: "This movie is the most"
    - imdb_output: "Steven"
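To make the character-level setup more concrete, here is a minimal sketch of what splitting on "" and capping sequences at max_length could look like. It is not EIR's implementation: the helper names char_tokenize and sample_window are hypothetical, the review text is a made-up placeholder, and reading "uniform" as "pick a random contiguous window" is an assumption.

```python
import random


def char_tokenize(text: str) -> list[str]:
    """Character-level tokenization: every character is its own token."""
    return list(text)


def sample_window(tokens: list[str], max_length: int, rng: random.Random) -> list[str]:
    """Return the tokens unchanged if they fit, else a random contiguous window.

    Assumption: this mimics sampling_strategy_if_longer="uniform" by choosing
    the window start uniformly at random.
    """
    if len(tokens) <= max_length:
        return tokens
    start = rng.randrange(len(tokens) - max_length + 1)
    return tokens[start : start + max_length]


# Placeholder text standing in for one review (not taken from the dataset).
review = "This movie is the most fun I have had at the cinema in years. " * 3
tokens = char_tokenize(review)
window = sample_window(tokens, max_length=64, rng=random.Random(0))

print(len(tokens), len(window))
print("".join(window))
```

With max_length set to 64, the model only ever sees 64 characters at a time, which is one reason the character-level samples lose coherence over longer spans.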
After setting up the configs, training is similar to what we have seen before:
eirtrain \
--global_configs eir_tutorials/c_sequence_output/01_sequence_generation/conf/globals.yaml \
--fusion_configs eir_tutorials/c_sequence_output/01_sequence_generation/conf/fusion.yaml \
--output_configs eir_tutorials/c_sequence_output/01_sequence_generation/conf/output.yaml
I got the following results:
However, the most interesting part is not the training curve, but the
generated sequences. If we look in the familiar samples folder,
we can see the generated sequences. At iteration 500, they are mostly
gibberish:
r anlse me s t lls r t pnnrt te oodooacoe tr ans woaee t iten aafo nuse o nh lre wlt f n mil and na til iis threager honaan att
This movie is the mosts rapan sden l ae iliteenear as ted t wrsiriteene mns trohei con ne ae sin arilotetas olhreeg aatusuimree
However, at iteration 6000, we can see that the model is starting to generate more meaningful sequences:
s of the writing is a strike in the plays me interested in a movie was displaying for his face to stuff the climax, as made bet
This movie is the most scary, but in a depiction of it doesn't work of his own who doesn't say that she didn't be the opening a
C - Prediction: Creating new sequences with a trained model
Now that we have trained our model,
we can use it to generate new sequences.
Similarly to the process when we are doing supervised prediction,
we use the eirpredict command, with a couple of minor changes
now that we are doing sequence generation.
The first change can be seen in the output configuration. Here we have a file
called output_test.yaml, which is similar to the output.yaml we used
for training, but notice the change in output_source:
output_info:
  output_source: null
  output_name: imdb_output
  output_type: sequence
output_type_info:
  max_length: 64
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  min_freq: 1
model_config:
  embedding_dim: 64
  model_init_config:
    num_layers: 6
sampling_config:
  generated_sequence_length: 64
  n_eval_inputs: 10
  manual_inputs:
    - imdb_output: "This movie is the most"
    - imdb_output: "Steven"
Here we have null for the output_source, because we do not have
any concrete inputs for the sequence generation task. To control the
sequence generation at prediction time, we use the sampling_config
in the configuration above. It allows us to specify the generated sequence length,
how many sequences to generate from an empty prompt (n_eval_inputs), and
which custom prompts to generate sequences from (manual_inputs).
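The interplay of these three settings can be illustrated with a toy autoregressive sampler. This is only a sketch: a character bigram table stands in for the trained model, the corpus is made up, and the function names are hypothetical. It also assumes that generated_sequence_length counts the prompt, which is consistent with the lengths of the samples shown in this tutorial.

```python
import random
from collections import Counter, defaultdict


def fit_bigram_counts(texts: list[str]) -> dict[str, Counter]:
    """Count which character follows which (a stand-in for a trained model)."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for text in texts:
        for current, following in zip(text, text[1:]):
            counts[current][following] += 1
    return counts


def generate(
    prompt: str, counts: dict[str, Counter], total_length: int, rng: random.Random
) -> str:
    """Extend the prompt one character at a time until total_length is reached."""
    vocab = sorted(counts)
    out = list(prompt)
    while len(out) < total_length:
        options = counts.get(out[-1]) if out else None
        if options:
            # Sample the next character in proportion to how often it followed.
            chars, weights = zip(*sorted(options.items()))
            out.append(rng.choices(chars, weights=weights, k=1)[0])
        else:
            # Empty prompt or unseen character: fall back to a random known one.
            out.append(rng.choice(vocab))
    return "".join(out)


# Values mirroring sampling_config in output_test.yaml.
generated_sequence_length = 64
n_eval_inputs = 10
manual_inputs = ["This movie is the most", "Steven"]

# Tiny made-up corpus, only here so the sketch runs end to end.
toy_corpus = [
    "This movie is the most fun I have had in a long time.",
    "Steven did a great job with this movie and the cast.",
]
counts = fit_bigram_counts(toy_corpus)
rng = random.Random(0)

# n_eval_inputs empty prompts, followed by the manual prompts.
prompts = [""] * n_eval_inputs + manual_inputs
samples = [generate(p, counts, generated_sequence_length, rng) for p in prompts]
for sample in samples:
    print(repr(sample))
```

The first ten samples start from nothing, while the last two continue the given prompts, mirroring the split between auto-generated and manually prompted sequences.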
Now we execute our eirpredict command:
eirpredict \
--global_configs eir_tutorials/c_sequence_output/01_sequence_generation/conf/globals.yaml \
--fusion_configs eir_tutorials/c_sequence_output/01_sequence_generation/conf/fusion.yaml \
--output_configs eir_tutorials/c_sequence_output/01_sequence_generation/conf/output_test.yaml \
--model_path eir_tutorials/tutorial_runs/c_sequence_output/01_sequence_generation/saved_models/01_sequence_generation_checkpoint_6000_perf-average=-0.3792.pt \
--output_folder eir_tutorials/tutorial_runs/c_sequence_output/01_sequence_generation/test_results
This will save our results in the folder
given by the --output_folder argument,
containing both the auto-generated and manually generated sequences.
Here is an example of an auto-generated sequence:
to see how things. I mean, but this is a clip-trands and caree
And here are the manually generated sequences with our custom prompts:
This movie is the most left and enjoyable the film are given in
Steven scenes are worst complexity.<br /><br />The only comes a
While our generated reviews are far from realistic, they do show that the model is learning to generate sequences that are somewhat meaningful.
E - Sequence Generation with BPE Tokenization
Now that we have seen how to do sequence generation with a character-level model, let’s see how we can do it with a token-level model. We will again use the IMDB dataset, but this time with an implementation of BPE (Byte Pair Encoding) tokenization.
BPE, as detailed in this paper, is a sub-word tokenization method that progressively learns the most common sequences of characters (or bytes) to form an efficient set of tokens.
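As a rough illustration of the idea, the sketch below learns BPE merges on a toy corpus by repeatedly merging the most frequent adjacent pair of symbols. It is a simplified textbook version, not the tokenizer EIR uses: it splits on whitespace first, works on characters rather than bytes, and all function names are hypothetical.

```python
from collections import Counter


def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe_merges(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Greedily learn up to `num_merges` merges from whitespace-split words."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the new merged symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[tuple(merge_pair(list(symbols), best))] += freq
        words = merged_words
    return merges


def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = learn_bpe_merges("low low low lower lowest", num_merges=2)
print(merges)
print(apply_merges("lowest", merges))
```

After two merges, the frequent string "low" has become a single token, so "lowest" is encoded with four tokens instead of six characters. The same effect lets a fixed number of tokens cover more text than the same number of characters.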
As we’ll see, using BPE tokenization allows us to generate longer sequences than with the character model.
To use it, a couple of changes are needed in the output configuration:
output_info:
  output_source: eir_tutorials/c_sequence_output/01_sequence_generation/data/IMDB/IMDB_Reviews
  output_name: imdb_output
  output_type: sequence
output_type_info:
  max_length: 32
  split_on: null
  tokenizer: "bpe"
  adaptive_tokenizer_max_vocab_size: 1024
  sampling_strategy_if_longer: "uniform"
  min_freq: 1
model_config:
  embedding_dim: 64
  model_init_config:
    num_layers: 2
sampling_config:
  generated_sequence_length: 64
  n_eval_inputs: 1
  manual_inputs:
    - imdb_output: "This movie is the most"
    - imdb_output: "Steven"
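Since the tokenizer can operate on the raw text, we set split_on to null,
and we can control the maximum vocabulary size with the
adaptive_tokenizer_max_vocab_size parameter. Training uses the same eirtrain
command as for the character-level model, with --output_configs pointing to
output_bpe.yaml and --globals.basic_experiment.output_folder set to a separate
run folder (here eir_tutorials/tutorial_runs/c_sequence_output/01_sequence_generation_bpe).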
Here is the training curve I got for this model:
Here are the auto-generated and manually generated sequences at iteration 500:
tee s , and pnlecestyand is e d tto s a ed a ing and i fa nand ing bining is ttthe enctdand ing nis e , tsthe a ls ssis s
This movie is the most, ydes a pc, e sithe ito , ed n, tts eand s linerclnreing bthe cic ed a and ndntis a a is hand , is it dr
And as before, at iteration 6000, we can see that the model is starting to generate more meaningful sequences:
n the movie I don't have to say that the story is basically a half and it does not be the camera that I have ever seen. I had never been released for it for it in the movie.<br /><br />One of those two films are not so nearly the
This movie is the mostly criminal longer of "Berman" in "Pross is very funny) but that they went to read the fight of Morgan Bergan and Green (Tonna Carry
Hopefully this has given you a good overview of how to use
the sequence generation functionality in EIR.
F - Serving
In this final section, we demonstrate serving our trained model for sequence generation as a web service and interacting with it using HTTP requests.
Starting the Web Service
To serve the model, use the following command:
eirserve --model-path [MODEL_PATH]
Replace [MODEL_PATH] with the actual path to your trained model. This command initiates a web service that listens for incoming requests.
Here is an example of the command:
eirserve \
--model-path eir_tutorials/tutorial_runs/c_sequence_output/01_sequence_generation/saved_models/01_sequence_generation_checkpoint_6000_perf-average=-0.3792.pt
Sending Requests
With the server running, we can now send requests for generating sequences based on initial text prompts.
Here’s an example Python function demonstrating this process:
import requests


def send_request(url: str, payload: list[dict]) -> dict:
    # POST the prompts as JSON and fail loudly on HTTP errors.
    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json()


# One dict per prompt, keyed by the output name from the config.
payload = [
    {"imdb_output": "This movie was silly, I have to say "},
]

response = send_request(url="http://localhost:8000/predict", payload=payload)
print(response)
When running this, we get the following output:
{
  "result": [
    {
      "imdb_output": "This movie was silly, I have to say the grate his film. It's not even incredible, he would think this film all the end of the b"
    }
  ]
}
Additionally, you can send requests using bash:
curl -X POST \
"http://localhost:8000/predict" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d '[{"imdb_output": "This movie was great! I loved "}]'
When running this, we get the following output:
{
  "result": [
    {
      "imdb_output": "This movie was great! I loved the complexity should be enjoyed me. The theater that they could like to be an actually was so an"
    }
  ]
}
Analyzing Responses
After sending requests to the served model, the responses can be analyzed. These responses demonstrate the model’s ability to generate text sequences based on the provided prompts.
[
  {
    "request": [
      {
        "imdb_output": "This movie was great, I have to say "
      },
      {
        "imdb_output": "This movie was terrible, I "
      },
      {
        "imdb_output": "This movie was so "
      },
      {
        "imdb_output": "This movi"
      },
      {
        "imdb_output": "Toda"
      },
      {
        "imdb_output": ""
      }
    ],
    "response": {
      "result": [
        {
          "imdb_output": "This movie was great, I have to say the movie for the scenes are place the other in my place of film for his father, and the fi"
        },
        {
          "imdb_output": "This movie was terrible, I think the way the worst story, mostly characters are fighting as the other and back of the case, as "
        },
        {
          "imdb_output": "This movie was so finding at the movie, with a bit the original. The movie I've ever seen that can expect as an effective the w"
        },
        {
          "imdb_output": "This movie is bad anything player with an acting between the cast of suspense. It's also a supporting of the director and havin"
        },
        {
          "imdb_output": "Today, sorry since the budget complete streets in a structure to see it that. I've ever seen a lot of any situation of the film"
        },
        {
          "imdb_output": " but they would decide to be seen this film.<br /><br />A large being with the same student movies which is interesting to watc"
        }
      ]
    }
  }
]
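When inspecting such responses, it can help to line up each prompt with the text the model produced for it. The helper below is a small sketch of one way to do that. The function name pair_prompts_with_outputs is hypothetical, and it assumes the response has the {"result": [...]} shape shown above, with results in the same order as the request.

```python
def pair_prompts_with_outputs(
    request: list[dict], response: dict, output_name: str = "imdb_output"
) -> list[dict]:
    """Match each prompt with its generated text and isolate the continuation."""
    results = response["result"]
    if len(results) != len(request):
        raise ValueError("Expected one result per request item.")

    pairs = []
    for req, res in zip(request, results):
        prompt = req[output_name]
        generated = res[output_name]
        # The continuation is only well defined if the output starts with the prompt.
        continuation = generated[len(prompt):] if generated.startswith(prompt) else None
        pairs.append(
            {"prompt": prompt, "generated": generated, "continuation": continuation}
        )
    return pairs


# Two of the request/response pairs shown above.
request = [{"imdb_output": "Toda"}, {"imdb_output": ""}]
response = {
    "result": [
        {
            "imdb_output": "Today, sorry since the budget complete streets in a "
            "structure to see it that. I've ever seen a lot of any situation of the film"
        },
        {
            "imdb_output": " but they would decide to be seen this film.<br /><br />A "
            "large being with the same student movies which is interesting to watc"
        },
    ]
}

for pair in pair_prompts_with_outputs(request, response):
    print(repr(pair["prompt"]), "->", repr(pair["continuation"]))
```

For the empty prompt, the continuation is the whole generated text, which corresponds to the auto-generated sequences seen earlier.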
If you made it this far, I want to thank you for reading!