02 - Creating and Using a Mini Foundation Model
In this tutorial, we will explore how to create custom foundation models using EIR.
Here we use the term “foundation model” as a fancy way of saying we pretrain a model
for one task, and then use it or parts of it as a building block for other tasks.
We’ll be working with three different datasets: IMDB reviews, COCO 2017 images, and CIFAR-10 images.
The overall goal is as follows:
Train a mini-foundation model for image captioning, which includes an image and text encoder (feature extractors), and a text decoder (output module).
Use the text encoder part from the mini-foundation model to train a sentiment analysis model on IMDB reviews.
Use the image encoder part from the mini-foundation model to train an image classification model on CIFAR-10.
A - Data
For this tutorial, we will use datasets from three different domains:
Text Data: IMDB Reviews - More information can be found here.
Image Data: COCO 2017 - Used mainly for image-to-text tasks like image captioning. More details can be found at the COCO 2017 dataset.
Image Data: CIFAR-10 - A dataset of 60,000 32x32 color images in 10 different classes. Useful for object recognition tasks. Learn more here.
You can download all datasets for this tutorial from the following link.
After downloading the data, your folder structure should look similar to the following (we will create the config files as we go through the tutorial):
eir_tutorials/e_pretraining/02_mini_foundation
├── conf
│   ├── cifar
│   │   ├── cifar_fusion.yaml
│   │   ├── cifar_globals.yaml
│   │   ├── cifar_input.yaml
│   │   └── cifar_output.yaml
│   ├── fusion.yaml
│   ├── globals.yaml
│   ├── imdb
│   │   ├── imdb_fusion.yaml
│   │   ├── imdb_globals.yaml
│   │   ├── imdb_input.yaml
│   │   └── imdb_output.yaml
│   ├── inputs_image_array_cnn.yaml
│   ├── inputs_sequence.yaml
│   └── output_sequence.yaml
└── data
    ├── 02_mini_foundation
    │   ├── configs
    │   ├── logging_history.log
    │   ├── meta
    │   ├── model_info.txt
    │   ├── results
    │   ├── saved_models
    │   ├── serializations
    │   ├── tensorboard_logs
    │   ├── train_average_history.log
    │   ├── training_curve_LOSS-AVERAGE.pdf
    │   ├── training_curve_PERF-AVERAGE.pdf
    │   └── validation_average_history.log
    ├── CIFAR10
    │   ├── images
    │   └── images_classes.csv
    ├── IMDB
    │   ├── imdb_labels.csv
    │   └── imdb_reviews.csv
    ├── image_captioning
    │   ├── captions.csv
    │   └── images
    └── vocab.txt
Notice how in the downloaded data, we actually include a 02_mini_foundation
experiment. This is so that you do not have to train the entire model from scratch,
and also shows how one can share pre-trained models with others.
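A quick way to confirm that the download unpacked as expected is to check for the paths listed in the tree above. The following is a minimal sketch using only the standard library; the helper name `missing_paths` is made up for this example, and the list of paths is copied from the tree.

```python
from pathlib import Path

# Paths taken from the folder tree above, relative to the tutorial root.
EXPECTED_DATA_PATHS = [
    "data/02_mini_foundation/saved_models",
    "data/CIFAR10/images",
    "data/CIFAR10/images_classes.csv",
    "data/IMDB/imdb_labels.csv",
    "data/IMDB/imdb_reviews.csv",
    "data/image_captioning/captions.csv",
    "data/image_captioning/images",
    "data/vocab.txt",
]


def missing_paths(root, expected=EXPECTED_DATA_PATHS):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [path for path in expected if not (root / path).exists()]


if __name__ == "__main__":
    tutorial_root = "eir_tutorials/e_pretraining/02_mini_foundation"
    for path in missing_paths(tutorial_root):
        print(f"Missing: {path}")
```

If nothing is printed, all the listed files and folders were found.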
B - Training a Mini Foundation Model
Important
As mentioned above, you can download the pre-trained model for this tutorial and skip this section. However, if you want to train the model yourself, you can follow the steps below.
Here, we will show the training of a model for image captioning, similar to what we did in 03 - Image to Sequence: Image Captioning, where the model uses both an image and text input to generate a caption for the image.
The global configuration establishes the foundational settings for training:
# globals.yaml
output_folder: eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation
valid_size: 1024
n_saved_models: 1
checkpoint_interval: 500
plot_skip_steps: 200
sample_interval: 500
memory_dataset: true
dataloader_workers: 0
n_epochs: 20
batch_size: 256
lr: 0.0005
optimizer: "adabelief"
device: "mps"

# inputs_sequence.yaml
input_info:
  input_source: eir_tutorials/e_pretraining/02_mini_foundation/data/image_captioning/captions.csv
  input_name: text
  input_type: sequence
input_type_info:
  max_length: 128
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  vocab_file: eir_tutorials/e_pretraining/02_mini_foundation/data/vocab.txt
  modality_dropout_rate: 0.1
model_config:
  embedding_dim: 64

# inputs_image_array_cnn.yaml
input_info:
  input_source: eir_tutorials/e_pretraining/02_mini_foundation/data/image_captioning/images
  input_name: image_input
  input_type: image
model_config:
  model_type: cnn
  model_init_config:
    channel_exp_base: 5
    kernel_width: 2
    down_stride_width: 2
    kernel_height: 2
    down_stride_height: 2

# fusion.yaml
model_type: "pass-through"

# output_sequence.yaml
output_info:
  output_source: eir_tutorials/e_pretraining/02_mini_foundation/data/image_captioning/captions.csv
  output_name: text
  output_type: sequence
output_type_info:
  max_length: 128
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  vocab_file: eir_tutorials/e_pretraining/02_mini_foundation/data/vocab.txt
sampling_config:
  generated_sequence_length: 64
  n_eval_inputs: 10
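Both the sequence input and the sequence output set `max_length: 128` together with `sampling_strategy_if_longer: "uniform"`. As a rough mental model, and assuming that "uniform" means picking a random contiguous window of `max_length` tokens from a longer sequence, the behaviour could be sketched as below. EIR's actual implementation may differ; the function name and the `"from_start"` alternative are included only for illustration.

```python
import random


def truncate_sequence(tokens, max_length=128, strategy="uniform", rng=None):
    """Return at most max_length tokens from a sequence.

    Assumption: "uniform" picks a random contiguous window, while
    "from_start" (shown for contrast) keeps the first max_length tokens.
    """
    tokens = list(tokens)
    if len(tokens) <= max_length:
        return tokens  # Short sequences are returned unchanged.

    if strategy == "from_start":
        start = 0
    elif strategy == "uniform":
        rng = rng or random.Random()
        # Every valid start position is equally likely.
        start = rng.randint(0, len(tokens) - max_length)
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    return tokens[start : start + max_length]


if __name__ == "__main__":
    long_caption = [f"tok{i}" for i in range(300)]
    window = truncate_sequence(long_caption, rng=random.Random(0))
    print(len(window), window[0])
```

Under this reading, a long sequence contributes a different window each time it is sampled, so the model eventually sees all parts of it.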
To train, we use the following command:
eirtrain \
--global_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/globals.yaml \
--input_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/inputs_image_array_cnn.yaml eir_tutorials/e_pretraining/02_mini_foundation/conf/inputs_sequence.yaml \
--fusion_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/fusion.yaml \
--output_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/output_sequence.yaml \
--globals.output_folder=eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation
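The same command can also be assembled from Python, which is convenient later in the tutorial where only the config folder and a few overrides change between runs. This is a sketch: `build_eirtrain_command` is a helper invented for this example, not part of EIR, and it only arranges the flags shown in the command above.

```python
import shlex


def build_eirtrain_command(
    global_configs, input_configs, fusion_configs, output_configs, overrides=None
):
    """Build the eirtrain argument list from config paths and overrides."""
    command = ["eirtrain"]
    command += ["--global_configs", *global_configs]
    command += ["--input_configs", *input_configs]
    command += ["--fusion_configs", *fusion_configs]
    command += ["--output_configs", *output_configs]
    # Overrides are passed as --<config name>.<key>=<value>.
    for key, value in (overrides or {}).items():
        command.append(f"--{key}={value}")
    return command


if __name__ == "__main__":
    conf = "eir_tutorials/e_pretraining/02_mini_foundation/conf"
    command = build_eirtrain_command(
        global_configs=[f"{conf}/globals.yaml"],
        input_configs=[
            f"{conf}/inputs_image_array_cnn.yaml",
            f"{conf}/inputs_sequence.yaml",
        ],
        fusion_configs=[f"{conf}/fusion.yaml"],
        output_configs=[f"{conf}/output_sequence.yaml"],
        overrides={
            "globals.output_folder": (
                "eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation"
            )
        },
    )
    # Print the command; to launch it, pass the list to
    # subprocess.run(command, check=True) instead.
    print(shlex.join(command))
```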
Here we can see the training curve for the mini foundation model:
Now, given that we have either downloaded or trained the mini foundation model, we can use it to train other models.
C - Establishing an IMDB Baseline
Before using the mini foundation model, let’s first establish a baseline by training a model from scratch to perform sentiment analysis on IMDB reviews.
Here are the configurations:
# imdb_globals.yaml
output_folder: eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation
valid_size: 1024
n_saved_models: 1
checkpoint_interval: 100
plot_skip_steps: 0
sample_interval: 100
memory_dataset: true
dataloader_workers: 0
n_epochs: 20
batch_size: 64
lr: 0.0005
optimizer: "adabelief"
device: "cpu"

# imdb_input.yaml
input_info:
  input_source: eir_tutorials/e_pretraining/02_mini_foundation/data/IMDB/imdb_reviews.csv
  input_name: text
  input_type: sequence
input_type_info:
  max_length: 128
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  vocab_file: eir_tutorials/e_pretraining/02_mini_foundation/data/vocab.txt
model_config:
  embedding_dim: 64

# imdb_output.yaml
output_info:
  output_source: eir_tutorials/e_pretraining/02_mini_foundation/data/IMDB/imdb_labels.csv
  output_name: imdb_output
  output_type: tabular
output_type_info:
  target_cat_columns:
    - Sentiment
To kick off the training for IMDB from scratch, run the following command:
eirtrain \
--global_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_globals.yaml \
--input_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_input.yaml \
--fusion_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_fusion.yaml \
--output_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_output.yaml \
--imdb_globals.output_folder=eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation_imdb_from_scratch
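The final argument in the command follows a pattern that recurs throughout this tutorial: `--<config file name without .yaml>.<key>=<value>` replaces a single value from the named config file, here `output_folder` from `imdb_globals.yaml`. The sketch below mimics that behaviour on plain dictionaries to make the pattern concrete. It is not EIR's implementation, and `apply_override` is a name invented for this example.

```python
def apply_override(configs, override):
    """Apply one '--<stem>.<key>[.<subkey>...]=<value>' override in place.

    configs maps a config file stem (e.g. "imdb_globals") to its contents.
    """
    # Split on the first "=" only, since values may contain "=" themselves.
    dotted_key, value = override.lstrip("-").split("=", 1)
    stem, *keys = dotted_key.split(".")
    if not keys:
        raise ValueError(f"Expected '<stem>.<key>=<value>', got: {override}")

    node = configs[stem]
    for key in keys[:-1]:
        node = node.setdefault(key, {})  # Create nested sections as needed.
    node[keys[-1]] = value
    return configs


if __name__ == "__main__":
    configs = {
        "imdb_globals": {
            "output_folder": (
                "eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation"
            ),
            "batch_size": 64,
        }
    }
    apply_override(
        configs,
        "--imdb_globals.output_folder="
        "eir_tutorials/tutorial_runs/e_pretraining/"
        "02_mini_foundation_imdb_from_scratch",
    )
    print(configs["imdb_globals"]["output_folder"])
```

This is why the baseline run ends up in its own folder even though `imdb_globals.yaml` still lists the folder of the mini foundation model.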
The performance can be evaluated through these generated plots:
This serves as our baseline, which we’ll aim to improve in the next section by using the mini foundation model.
D - Using the Mini Foundation Model for IMDB
In this section, we’ll use the pre-trained mini foundation model as a starting point for training our IMDB sentiment analysis model. Specifically, we will only load the text encoder part of the mini foundation model while other parts of the IMDB model will be trained from scratch.
While the configuration files remain the same, there is a slight change in the training command:
eirtrain \
--global_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_globals.yaml \
--input_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_input.yaml \
--fusion_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_fusion.yaml \
--output_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_output.yaml \
--imdb_globals.output_folder=eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation_imdb_from_pretrained \
--imdb_input.pretrained_config.model_path=eir_tutorials/e_pretraining/02_mini_foundation/data/02_mini_foundation/saved_models/02_mini_foundation_model_18000_perf-average=0.0809.pt \
--imdb_input.pretrained_config.load_module_name=text
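Conceptually, `load_module_name=text` selects the parameters that belong to the input module named `text` in the saved model and uses them to initialize the module of the same name in the new model (both configs use `input_name: text`), while everything else starts from scratch. The sketch below illustrates the idea with plain dictionaries standing in for model weights. The `input_modules.<name>.` key layout and both function names are assumptions made for this example, and EIR's own loading logic may differ.

```python
def select_module_weights(state, module_name, prefix="input_modules"):
    """Keep only the entries that belong to one named input module."""
    wanted = f"{prefix}.{module_name}."
    return {key: value for key, value in state.items() if key.startswith(wanted)}


def initialize_from_pretrained(new_state, pretrained_state, module_name):
    """Return a copy of new_state with one module's weights replaced."""
    selected = select_module_weights(pretrained_state, module_name)
    merged = dict(new_state)
    for key, value in selected.items():
        if key in merged:  # Only copy weights the new model actually has.
            merged[key] = value
    return merged


if __name__ == "__main__":
    # Strings stand in for weight tensors; the key names are hypothetical.
    mini_foundation = {
        "input_modules.text.embedding": "pretrained text weights",
        "input_modules.image_input.conv": "pretrained image weights",
        "output_modules.text.head": "pretrained decoder weights",
    }
    imdb_model = {
        "input_modules.text.embedding": "random init",
        "output_modules.imdb_output.head": "random init",
    }
    for key, value in initialize_from_pretrained(
        imdb_model, mini_foundation, "text"
    ).items():
        print(key, "->", value)
```

In this toy example only the text encoder entry is replaced; the image encoder and the caption decoder of the mini foundation model are left out, and the sentiment output head keeps its fresh initialization.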
Let’s examine the performance improvements, if any:
In this specific case, the training and validation losses are only marginally lower than the baseline. This suggests that the mini foundation model did not contribute much to the model’s performance on IMDB sentiment analysis. One reason could be that the text the two models are trained on is very different: the mini foundation model was trained on somewhat robotic image captions, while the IMDB model is trained on varied movie reviews.
Note
You might notice that the pre-trained model was trained for more iterations. This is because early stopping was triggered earlier for the model trained from scratch, which might simply be due to randomness. Hence, the slightly better performance of the pre-trained model might come from the extra iterations, not necessarily from the pre-training.
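One way to separate the effect of pre-training from the effect of training longer is to compare the two runs at the last iteration that both of them reached. The following is a minimal sketch, assuming each run's validation history has already been loaded into a list of (iteration, loss) pairs. The function name is made up, and the numbers in the usage example are placeholders, not results from this tutorial.

```python
def compare_at_common_iteration(history_a, history_b):
    """Compare two runs at the last iteration that both of them logged.

    Each history is an iterable of (iteration, loss) pairs.
    Returns (iteration, loss_a, loss_b).
    """
    losses_a = dict(history_a)
    losses_b = dict(history_b)
    common = set(losses_a) & set(losses_b)
    if not common:
        raise ValueError("The two runs share no logged iteration.")
    iteration = max(common)
    return iteration, losses_a[iteration], losses_b[iteration]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    from_scratch = [(100, 0.70), (200, 0.65)]
    from_pretrained = [(100, 0.69), (200, 0.64), (300, 0.62)]
    print(compare_at_common_iteration(from_scratch, from_pretrained))
```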
While the performance improvements are not significant in the text case, we will not give up on our mini foundation model just yet. Let’s see how well the image encoder part of the mini foundation model performs when used for image classification.
E - Establishing a CIFAR10 Baseline
Just like for the IMDB case, we will first establish a baseline.
Here are the configurations for the CIFAR10 baseline:
# cifar_globals.yaml
output_folder: eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation
valid_size: 1024
n_saved_models: 1
checkpoint_interval: 100
plot_skip_steps: 0
sample_interval: 100
memory_dataset: true
dataloader_workers: 0
n_epochs: 20
batch_size: 64
lr: 0.0005
optimizer: "adabelief"
device: "mps"

# cifar_input.yaml
input_info:
  input_source: eir_tutorials/e_pretraining/02_mini_foundation/data/CIFAR10/images
  input_name: image_input
  input_type: image
model_config:
  model_type: cnn
  model_init_config:
    channel_exp_base: 5
    kernel_width: 2
    down_stride_width: 2
    kernel_height: 2
    down_stride_height: 2

# cifar_output.yaml
output_info:
  output_source: eir_tutorials/e_pretraining/02_mini_foundation/data/CIFAR10/images_classes.csv
  output_name: cifar_output
  output_type: tabular
output_type_info:
  target_cat_columns:
    - Class
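Before training, it can be useful to look at how the labels in `images_classes.csv` are distributed, since the share of the most frequent class gives a floor that any trained model should beat. A minimal sketch using only the standard library follows. It assumes the file is a regular comma-separated file with a header row containing the `Class` column named in the output config; the `ID` column and the rows in the usage example are made up for illustration.

```python
import csv
from collections import Counter


def class_counts(lines, target_column):
    """Count label values in CSV content (an open file or a list of lines)."""
    reader = csv.DictReader(lines)
    return Counter(row[target_column] for row in reader)


def majority_class_accuracy(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("No labels found.")
    return max(counts.values()) / total


if __name__ == "__main__":
    # Made-up rows; the ID column is an assumption about the file layout.
    example = ["ID,Class", "img_0,cat", "img_1,dog", "img_2,cat", "img_3,ship"]
    counts = class_counts(example, "Class")
    print(counts.most_common(1), majority_class_accuracy(counts))

    # For the real file, pass an open file handle instead:
    # path = "eir_tutorials/e_pretraining/02_mini_foundation/data/CIFAR10/images_classes.csv"
    # with open(path, newline="") as handle:
    #     counts = class_counts(handle, "Class")
```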
To initiate the training for CIFAR10 from scratch, execute the following command:
eirtrain \
--global_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_globals.yaml \
--input_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_input.yaml \
--fusion_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_fusion.yaml \
--output_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_output.yaml \
--cifar_globals.output_folder=eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation_cifar_from_scratch
Training curve:
This will serve as our baseline for CIFAR10, which we will compare against the model that uses the image encoder from the mini foundation model in the next section.
F - Using the Mini Foundation Model for CIFAR10
In this section, we’ll use the pre-trained mini foundation model for CIFAR10 image classification. Specifically, we’ll load only the image encoder from the mini foundation model, while the rest of the CIFAR10 model will be trained from scratch.
Again, the configuration files for this step are the same as in the baseline, with one change in the training command:
eirtrain \
--global_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_globals.yaml \
--input_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_input.yaml \
--fusion_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_fusion.yaml \
--output_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_output.yaml \
--cifar_globals.output_folder=eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation_cifar_from_pretrained \
--cifar_input.pretrained_config.model_path=eir_tutorials/e_pretraining/02_mini_foundation/data/02_mini_foundation/saved_models/02_mini_foundation_model_18000_perf-average=0.0809.pt \
--cifar_input.pretrained_config.load_module_name=image_input
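Both pre-trained runs point at the same checkpoint file by its full name. Judging from that name, it appears to encode the run name, the iteration (18000) and an average performance value (0.0809). If you trained the mini foundation model yourself, your file name will likely differ, so a small helper that looks the checkpoint up instead of hardcoding it may be convenient. This is a sketch: the naming pattern is inferred from the single file name above, and the helper names are invented for this example.

```python
import re
from pathlib import Path

# Pattern inferred from "02_mini_foundation_model_18000_perf-average=0.0809.pt".
CHECKPOINT_PATTERN = re.compile(
    r"^(?P<run>.+)_model_(?P<iteration>\d+)"
    r"_perf-average=(?P<perf>-?\d+(?:\.\d+)?)\.pt$"
)


def parse_checkpoint_name(filename):
    """Split a checkpoint file name into (run name, iteration, performance)."""
    match = CHECKPOINT_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected checkpoint name: {filename}")
    return match["run"], int(match["iteration"]), float(match["perf"])


def latest_checkpoint(saved_models_folder):
    """Return the path of the checkpoint with the highest iteration."""
    candidates = []
    for path in Path(saved_models_folder).glob("*.pt"):
        try:
            _, iteration, _ = parse_checkpoint_name(path.name)
        except ValueError:
            continue  # Skip files that do not follow the assumed pattern.
        candidates.append((iteration, path))
    if not candidates:
        raise FileNotFoundError(f"No checkpoints found in {saved_models_folder}")
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    name = "02_mini_foundation_model_18000_perf-average=0.0809.pt"
    print(parse_checkpoint_name(name))
```

The path returned by `latest_checkpoint` could then be passed as the value of `--cifar_input.pretrained_config.model_path`.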
Now, let’s review the impact on performance:
In contrast to the text-based IMDB model, the CIFAR10 model shows improvements in both the speed of convergence (e.g., the loss at iteration 1500 is lower for the pre-trained model than the model trained from scratch) and the final performance when initialized with the image encoder from the mini foundation model.
These results suggest that the image encoder from the mini foundation model can be transferred to image classification, indicating that one can successfully train and, in a modular fashion, transfer parts of a model to other tasks.
Thank you very much for reading this tutorial!