02 - Creating and Using a Mini Foundation Model

In this tutorial, we will explore how to create custom foundation models using EIR. Here we use the term “foundation model” as a fancy way of saying we pretrain a model for one task, and then use it or parts of it as a building block for other tasks.

We’ll be working with three different datasets: IMDB reviews, COCO 2017 images, and CIFAR-10 images.

The overall goal is as follows:

  1. Train a mini-foundation model for image captioning, which includes an image encoder and a text encoder (feature extractors), and a text decoder (output module).

  2. Use the text encoder part from the mini-foundation model to train a sentiment analysis model on IMDB reviews.

  3. Use the image encoder part from the mini-foundation model to train an image classification model on CIFAR-10.
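
The core mechanic behind steps 2 and 3 is reusing a named part of a trained network inside a new model. The snippet below is a minimal, generic PyTorch sketch of that idea only; it is not EIR’s code (EIR handles this for you via configuration, as we will see later), and all module names in it are made up for illustration:

import torch
from torch import nn

# A toy "foundation" model with named parts (hypothetical names, not EIR's).
foundation = nn.ModuleDict({
    "image_encoder": nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64)),
    "text_encoder": nn.Sequential(nn.Embedding(100, 64), nn.Linear(64, 64)),
    "text_decoder": nn.Linear(64, 100),
})
torch.save(foundation.state_dict(), "mini_foundation_demo.pt")

# A new toy classifier that reuses only the image encoder from above.
classifier = nn.ModuleDict({
    "image_encoder": nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64)),
    "head": nn.Linear(64, 10),
})

# Copy over only the tensors whose names start with "image_encoder";
# everything else in the classifier stays randomly initialized.
pretrained = torch.load("mini_foundation_demo.pt")
subset = {k: v for k, v in pretrained.items() if k.startswith("image_encoder")}
result = classifier.load_state_dict(subset, strict=False)
print(f"loaded {len(subset)} tensors, {len(result.missing_keys)} left untouched")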

A - Data

For this tutorial, we will use datasets from three different domains:

  1. Text Data: IMDB Reviews - More information can be found here.

  2. Image Data: COCO 2017 - Used mainly for image-to-text tasks like image captioning. More details can be found at the COCO 2017 dataset.

  3. Image Data: CIFAR-10 - A dataset of 60,000 32x32 color images in 10 different classes. Useful for object recognition tasks. Learn more here.

You can download all datasets for this tutorial from the following link.

After downloading the data, your folder structure should be organized similarly to the following (we will create the config files as we go through the tutorial):

eir_tutorials/e_pretraining/02_mini_foundation
├── conf
│   ├── cifar
│   │   ├── cifar_fusion.yaml
│   │   ├── cifar_globals.yaml
│   │   ├── cifar_input.yaml
│   │   └── cifar_output.yaml
│   ├── fusion.yaml
│   ├── globals.yaml
│   ├── imdb
│   │   ├── imdb_fusion.yaml
│   │   ├── imdb_globals.yaml
│   │   ├── imdb_input.yaml
│   │   └── imdb_output.yaml
│   ├── inputs_image_array_cnn.yaml
│   ├── inputs_sequence.yaml
│   └── output_sequence.yaml
└── data
    ├── 02_mini_foundation
    │   ├── configs
    │   ├── logging_history.log
    │   ├── meta
    │   ├── model_info.txt
    │   ├── results
    │   ├── saved_models
    │   ├── serializations
    │   ├── tensorboard_logs
    │   ├── train_average_history.log
    │   ├── training_curve_LOSS-AVERAGE.pdf
    │   ├── training_curve_PERF-AVERAGE.pdf
    │   └── validation_average_history.log
    ├── CIFAR10
    │   ├── images
    │   └── images_classes.csv
    ├── IMDB
    │   ├── imdb_labels.csv
    │   └── imdb_reviews.csv
    ├── image_captioning
    │   ├── captions.csv
    │   └── images
    └── vocab.txt

Notice that the downloaded data actually includes a finished 02_mini_foundation experiment. This is so that you do not have to train the entire model from scratch, and it also shows how one can share pre-trained models with others.
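
Before training anything, it can be useful to sanity-check the downloaded files. The following optional snippet only assumes the folder layout shown above; it prints the shape and columns of each CSV rather than assuming their contents:

from pathlib import Path

import pandas as pd

base = Path("eir_tutorials/e_pretraining/02_mini_foundation/data")

# Shape and columns of each label/caption file, without assuming their layout.
for csv in ["image_captioning/captions.csv", "IMDB/imdb_labels.csv", "CIFAR10/images_classes.csv"]:
    df = pd.read_csv(base / csv)
    print(csv, df.shape, list(df.columns))

# Number of image files available for each image dataset.
for folder in ["image_captioning/images", "CIFAR10/images"]:
    print(folder, sum(1 for _ in (base / folder).iterdir()), "files")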

B - Training a Mini Foundation Model

Important

As mentioned above, you can download the pre-trained model for this tutorial and skip this section. However, if you want to train the model yourself, you can follow the steps below.

Here, we will show the training of a model for image captioning, similar to what we did in 03 - Image to Sequence: Image Captioning, where the model uses both an image and text input to generate a caption for the image.

The global configuration establishes the foundational settings for training:

globals.yaml
output_folder: eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation
valid_size: 1024
n_saved_models: 1
checkpoint_interval: 500
plot_skip_steps: 200
sample_interval: 500
memory_dataset: true
dataloader_workers: 0
n_epochs: 20
batch_size: 256
lr: 0.0005
optimizer: "adabelief"
device: "mps"

inputs_sequence.yaml
input_info:
  input_source: eir_tutorials/e_pretraining/02_mini_foundation/data/image_captioning/captions.csv
  input_name: text
  input_type: sequence

input_type_info:
  max_length: 128
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  vocab_file: eir_tutorials/e_pretraining/02_mini_foundation/data/vocab.txt
  modality_dropout_rate: 0.1

model_config:
  embedding_dim: 64

inputs_image_array_cnn.yaml
input_info:
  input_source: eir_tutorials/e_pretraining/02_mini_foundation/data/image_captioning/images
  input_name: image_input
  input_type: image

model_config:
  model_type: cnn
  model_init_config:
    channel_exp_base: 5
    kernel_width: 2
    down_stride_width: 2
    kernel_height: 2
    down_stride_height: 2

fusion.yaml
model_type: "pass-through"

output_sequence.yaml
output_info:
  output_source: eir_tutorials/e_pretraining/02_mini_foundation/data/image_captioning/captions.csv
  output_name: text
  output_type: sequence

output_type_info:
  max_length: 128
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  vocab_file: eir_tutorials/e_pretraining/02_mini_foundation/data/vocab.txt

sampling_config:
  generated_sequence_length: 64
  n_eval_inputs: 10
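
A couple of the settings above are easy to misread. With split_on: "", the captions are split into individual characters (rather than whitespace-separated words) and looked up in vocab.txt, and channel_exp_base sets the channel count as a power of two, so 5 corresponds to 2**5 = 32 channels in the first convolutional block (see the EIR configuration documentation for the exact semantics). The snippet below only illustrates what character-level splitting implies; it is not EIR’s tokenizer:

# Illustration only: what splitting on the empty string means for tokenization.
# This is NOT EIR's internal tokenizer.
caption = "A dog jumps over a fence."

# split_on: "" -> one token per character.
tokens = list(caption)
print(tokens[:10])  # ['A', ' ', 'd', 'o', 'g', ' ', 'j', 'u', 'm', 'p']

# Each character is then mapped to an integer ID via a vocabulary,
# conceptually like the entries in vocab.txt.
vocab = {ch: i for i, ch in enumerate(sorted(set(tokens)))}
ids = [vocab[ch] for ch in tokens]
print(ids[:10])

# max_length: 128 caps the sequence length; longer sequences are handled
# according to sampling_strategy_if_longer.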

To train, we use the following command:

eirtrain \
--global_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/globals.yaml \
--input_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/inputs_image_array_cnn.yaml eir_tutorials/e_pretraining/02_mini_foundation/conf/inputs_sequence.yaml \
--fusion_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/fusion.yaml \
--output_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/output_sequence.yaml \
--globals.output_folder=eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation
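
While this runs, EIR writes progress into the output folder configured in globals.yaml (the downloaded data/02_mini_foundation folder in the tree above contains the same kinds of files). If you prefer raw numbers over the PDF curves, here is a small optional sketch; it assumes the history logs are plain CSV files and prints their columns instead of assuming them:

from pathlib import Path

import pandas as pd

# Point this at data/02_mini_foundation instead if you only downloaded the run.
run_dir = Path("eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation")

# Assumption: the *_history.log files are CSV-formatted metric logs.
for name in ["train_average_history.log", "validation_average_history.log"]:
    df = pd.read_csv(run_dir / name)
    print(name, list(df.columns))
    print(df.tail(3))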

Here we can see the training curve for the mini foundation model:

[Image: training_curve_LOSS_0_pretrain.png]

Now, given that we have either downloaded or trained the mini foundation model, we can use it to train other models.

C - Establishing an IMDB Baseline

Before using the mini foundation model, let’s first establish a baseline by training a model from scratch to perform sentiment analysis on IMDB reviews.

Here are the configurations:

imdb_globals.yaml
output_folder: eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation
valid_size: 1024
n_saved_models: 1
checkpoint_interval: 100
plot_skip_steps: 0
sample_interval: 100
memory_dataset: true
dataloader_workers: 0
n_epochs: 20
batch_size: 64
lr: 0.0005
optimizer: "adabelief"
device: "cpu"

imdb_input.yaml
input_info:
  input_source: eir_tutorials/e_pretraining/02_mini_foundation/data/IMDB/imdb_reviews.csv
  input_name: text
  input_type: sequence

input_type_info:
  max_length: 128
  split_on: ""
  sampling_strategy_if_longer: "uniform"
  vocab_file: eir_tutorials/e_pretraining/02_mini_foundation/data/vocab.txt

model_config:
  embedding_dim: 64

imdb_output.yaml
output_info:
  output_source: eir_tutorials/e_pretraining/02_mini_foundation/data/IMDB/imdb_labels.csv
  output_name: imdb_output
  output_type: tabular

output_type_info:
  target_cat_columns:
    - Sentiment
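
Since the output is a categorical tabular target (the Sentiment column declared above), it can be worth glancing at the label balance before training; a small optional check with pandas:

import pandas as pd

labels = pd.read_csv(
    "eir_tutorials/e_pretraining/02_mini_foundation/data/IMDB/imdb_labels.csv"
)

# Sentiment is the target column declared in imdb_output.yaml.
print(labels["Sentiment"].value_counts())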

To kick off the training for IMDB from scratch, run the following command:

eirtrain \
--global_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_globals.yaml \
--input_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_input.yaml \
--fusion_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_fusion.yaml \
--output_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_output.yaml \
--imdb_globals.output_folder=eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation_imdb_from_scratch

The performance can be evaluated through the generated training curve:

[Image: training_curve_LOSS_1_text_from_scratch1.png]

This serves as our baseline, which we’ll aim to improve in the next section by using the mini foundation model.

D - Using the Mini Foundation Model for IMDB

In this section, we’ll use the pre-trained mini foundation model as a starting point for training our IMDB sentiment analysis model. Specifically, we will only load the text encoder part of the mini foundation model, while the other parts of the IMDB model will be trained from scratch. Note that the IMDB text input uses the same vocabulary file, max_length, and embedding_dim as the text input of the mini foundation model, which is what keeps the loaded weights compatible.

While the configuration files remain the same, there is a slight change in the training command:

eirtrain \
--global_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_globals.yaml \
--input_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_input.yaml \
--fusion_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_fusion.yaml \
--output_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/imdb/imdb_output.yaml \
--imdb_globals.output_folder=eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation_imdb_from_pretrained \
--imdb_input.pretrained_config.model_path=eir_tutorials/e_pretraining/02_mini_foundation/data/02_mini_foundation/saved_models/02_mini_foundation_model_18000_perf-average=0.0809.pt \
--imdb_input.pretrained_config.load_module_name=text
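
The two extra flags point EIR at a saved checkpoint and name which module to load from it; here text, matching the input_name of the sequence input. If you are curious which modules a checkpoint actually contains, the following is a hedged way to peek at it, assuming the .pt file is a standard PyTorch state_dict (if it is wrapped differently, adjust accordingly):

import torch

ckpt = (
    "eir_tutorials/e_pretraining/02_mini_foundation/data/02_mini_foundation/"
    "saved_models/02_mini_foundation_model_18000_perf-average=0.0809.pt"
)

# Assumption: the checkpoint maps parameter names (e.g. "module.sub.weight") to tensors.
state_dict = torch.load(ckpt, map_location="cpu")

# The unique top-level prefixes give a rough picture of the saved modules.
print(sorted({key.split(".")[0] for key in state_dict}))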

Let’s examine the performance improvements, if any:

[Image: training_curve_LOSS_2_text_from_pretrain.png]

In this specific case, the training and validation losses are only marginally lower than the baseline. This indicates that the mini foundation model did not contribute significantly to the IMDB sentiment analysis performance. One reason could be that the text each model is trained on is very different: the mini foundation model was trained on somewhat robotic image captions, while the IMDB model is trained on free-form movie reviews.

Note

You might notice that the pre-trained model was trained for more iterations. This is because early stopping was triggered earlier in the model trained from scratch, which might simply be due to randomness. Hence, the fact that the pre-trained model performs slightly better might be because it was trained for more iterations, not necessarily because of the pre-training.

While the performance improvements are not significant in the text case, we will not give up on our mini foundation model just yet. Let’s see how well the image encoder part of the mini foundation model performs when used for image classification.

E - Establishing a CIFAR10 Baseline

Just like for the IMDB case, we will first establish a baseline.

Here are the configurations for the CIFAR10 baseline:

cifar_globals.yaml
output_folder: eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation
valid_size: 1024
n_saved_models: 1
checkpoint_interval: 100
plot_skip_steps: 0
sample_interval: 100
memory_dataset: true
dataloader_workers: 0
n_epochs: 20
batch_size: 64
lr: 0.0005
optimizer: "adabelief"
device: "mps"

cifar_input.yaml
input_info:
  input_source: eir_tutorials/e_pretraining/02_mini_foundation/data/CIFAR10/images
  input_name: image_input
  input_type: image

model_config:
  model_type: cnn
  model_init_config:
    channel_exp_base: 5
    kernel_width: 2
    down_stride_width: 2
    kernel_height: 2
    down_stride_height: 2

cifar_output.yaml
output_info:
  output_source: eir_tutorials/e_pretraining/02_mini_foundation/data/CIFAR10/images_classes.csv
  output_name: cifar_output
  output_type: tabular

output_type_info:
  target_cat_columns:
    - Class
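
As a quick optional sanity check that the inputs really are the 32x32 color images described in the data section, you can open a few of them with Pillow (paths as in the folder tree above):

from pathlib import Path

from PIL import Image

image_dir = Path("eir_tutorials/e_pretraining/02_mini_foundation/data/CIFAR10/images")

# Report size and mode for a handful of images (expected: (32, 32), RGB).
for path in sorted(image_dir.iterdir())[:5]:
    with Image.open(path) as img:
        print(path.name, img.size, img.mode)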

To initiate the training for CIFAR10 from scratch, execute the following command:

eirtrain \
--global_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_globals.yaml \
--input_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_input.yaml \
--fusion_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_fusion.yaml \
--output_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_output.yaml \
--cifar_globals.output_folder=eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation_cifar_from_scratch

Training curve:

[Image: training_curve_LOSS_3_image_from_scratch.png]

This will serve as our baseline for CIFAR10, which we will compare against the model that uses the image encoder from the mini foundation model in the next section.

F - Using the Mini Foundation Model for CIFAR10

In this section, we’ll use the pre-trained mini foundation model for CIFAR10 image classification. Specifically, we’ll load only the image encoder from the mini foundation model, while the rest of the CIFAR10 model will be trained from scratch.

Again, the configuration files for this step are the same as in the baseline, with one change in the training command:

eirtrain \
--global_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_globals.yaml \
--input_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_input.yaml \
--fusion_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_fusion.yaml \
--output_configs eir_tutorials/e_pretraining/02_mini_foundation/conf/cifar/cifar_output.yaml \
--cifar_globals.output_folder=eir_tutorials/tutorial_runs/e_pretraining/02_mini_foundation_cifar_from_pretrained \
--cifar_input.pretrained_config.model_path=eir_tutorials/e_pretraining/02_mini_foundation/data/02_mini_foundation/saved_models/02_mini_foundation_model_18000_perf-average=0.0809.pt \
--cifar_input.pretrained_config.load_module_name=image_input

Now, let’s review the impact on performance:

[Image: training_curve_LOSS_4_image_from_pretrain.png]

In contrast to the text-based IMDB model, the CIFAR10 model shows improvements in both speed of convergence (e.g., the loss at iteration 1500 is lower for the pre-trained model than for the model trained from scratch) and final performance when initialized with the image encoder from the mini foundation model.

These results suggest that the image encoder from the mini foundation model can be transferred to image classification, indicating that one can successfully train and, in a modular fashion, transfer parts of a model to other tasks.

Thank you very much for reading this tutorial!