Sequence Models

This page lists the external sequence models that can be used with EIR, which come from the excellent Transformers library.

There are 3 ways to use these models:

  • Configure and train specific architectures (e.g. BERT with a chosen number of layers) from scratch.

  • Train a specific architecture (e.g. bert-base-uncased) from scratch.

  • Use a pre-trained model (e.g. bert-base-uncased) and fine-tune it.

Please refer to this page for a complete list of pre-defined architectures, with the option of using pre-trained weights.

Configurable Models

The following models can be configured and trained from scratch.

The model type is specified in the model_type field of the configuration, while the model-specific configuration is specified in the model_init_config field.

For example, the Longformer architecture includes the num_attention_heads and num_hidden_layers parameters, and can be configured as follows:

input_configurable_sequence_model.yaml
input_info:
  input_source: eir_tutorials/a_using_eir/04_pretrained_sequence_tutorial/data/IMDB/IMDB_Reviews
  input_name: imdb_reviews_longformer
  input_type: sequence

input_type_info:
  sampling_strategy_if_longer: "uniform"
  max_length: 512
  split_on: " "
  min_freq: 10
  tokenizer: "basic_english"
  tokenizer_language: "en"

model_config:
  model_type: longformer
  pretrained_model: false
  position: embed
  pool: avg
  model_init_config:
    num_hidden_layers: 2
    hidden_size: 32
    num_attention_heads: 2
    intermediate_size: 32
    attention_window: 64
    max_position_embeddings: 1024
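
For reference, the entries under model_init_config above correspond to keyword arguments of the Hugging Face configuration class for the chosen model_type. Below is a minimal sketch of the equivalent LongformerConfig constructed directly with the transformers library (an illustration assuming EIR forwards these fields as keyword arguments; it is not EIR internals):

from transformers import LongformerConfig

# Same values as the model_init_config entries in the YAML above.
longformer_config = LongformerConfig(
    num_hidden_layers=2,
    hidden_size=32,
    num_attention_heads=2,
    intermediate_size=32,
    attention_window=64,
    max_position_embeddings=1024,
)
print(longformer_config)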

Pretrained Models

We can also train a specific architecture from scratch or fine-tune a pre-trained model. For example, a tiny BERT model can be used like so:

input_pre_trained_sequence_model.yaml
input_info:
  input_source: eir_tutorials/a_using_eir/04_pretrained_sequence_tutorial/data/IMDB/IMDB_Reviews
  input_name: imdb_reviews_tiny_bert
  input_type: sequence

input_type_info:
  sampling_strategy_if_longer: "uniform"
  max_length: 512
  split_on: " "
  min_freq: 10

model_config:
  model_type: "prajjwal1/bert-tiny"
  pretrained_model: true
  freeze_pretrained_model: false
  position: embed
  pool: avg
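
Conceptually, setting pretrained_model: true means that the architecture and weights of the named checkpoint are pulled from the Hugging Face Hub, while freeze_pretrained_model controls whether those weights are updated during training. A rough sketch of the equivalent steps done directly with the transformers library (an illustration of the idea, not EIR's internal loading code):

from transformers import AutoConfig, AutoModel

# Download the configuration and weights for the checkpoint named in model_type.
config = AutoConfig.from_pretrained("prajjwal1/bert-tiny")
model = AutoModel.from_pretrained("prajjwal1/bert-tiny")

# freeze_pretrained_model: true roughly corresponds to turning off gradient
# updates for the pretrained weights.
for parameter in model.parameters():
    parameter.requires_grad = False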

Below is a list of the configurable models that can be used with EIR.

class transformers.models.albert.configuration_albert.AlbertConfig(vocab_size=30000, embedding_size=128, hidden_size=4096, num_hidden_layers=12, num_hidden_groups=1, num_attention_heads=64, intermediate_size=16384, inner_group_num=1, hidden_act='gelu_new', hidden_dropout_prob=0, attention_probs_dropout_prob=0, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, classifier_dropout_prob=0.1, position_embedding_type='absolute', pad_token_id=0, bos_token_id=2, eos_token_id=3, **kwargs)

The ALBERT model was proposed in ALBERT: A Lite BERT for Self-supervised Learning of Language Representations by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. It presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT:

  • Splitting the embedding matrix into two smaller matrices.

  • Using repeating layers split among groups.

The abstract from the paper is the following:

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.

Tips:

  • ALBERT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

  • ALBERT uses repeating layers, which results in a small memory footprint; however, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, as it has to iterate through the same number of (repeating) layers.

  • The embedding size E is different from the hidden size H. This is justified because the embeddings are context independent (one embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a sequence of tokens), so it is more logical to have H >> E. Also, the embedding matrix is large since it is V x E (V being the vocab size). If E < H, the model has fewer parameters (see the short calculation after these tips).

  • Layers are split in groups that share parameters (to save memory).

Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have been swapped or not.
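
A quick back-of-the-envelope calculation illustrates the factorized-embedding point from the tips above, using the AlbertConfig defaults listed below (V = 30000, E = 128, H = 4096):

# Parameter count of a standard V x H embedding vs. ALBERT's factorized
# V x E embedding followed by an E x H projection.
V, E, H = 30_000, 128, 4_096

full_embedding = V * H          # 122,880,000 parameters
factorized = V * E + E * H      # 4,364,288 parameters

print(f"V x H         : {full_embedding:,}")
print(f"V x E + E x H : {factorized:,}")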

This model was contributed by lysandre. The JAX version of this model was contributed by kamalkraj. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 30000):

Vocabulary size of the ALBERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling AlbertModel or TFAlbertModel.

embedding_size (int, optional, defaults to 128):

Dimensionality of vocabulary embeddings.

hidden_size (int, optional, defaults to 4096):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_hidden_groups (int, optional, defaults to 1):

Number of groups for the hidden layers; parameters in the same group are shared.

num_attention_heads (int, optional, defaults to 64):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 16384):

The dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

inner_group_num (int, optional, defaults to 1):

The number of inner repetitions of the attention and feed-forward (FFN) blocks.

hidden_act (str or Callable, optional, defaults to “gelu_new”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling AlbertModel or TFAlbertModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

classifier_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for attached classifiers.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”. For positional embeddings use “absolute”. For more information on “relative_key”, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on “relative_key_query”, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

pad_token_id (int, optional, defaults to 0):

Padding token id.

bos_token_id (int, optional, defaults to 2):

Beginning of stream token id.

eos_token_id (int, optional, defaults to 3):

End of stream token id.

class transformers.models.bart.configuration_bart.BartConfig(vocab_size=50265, max_position_embeddings=1024, encoder_layers=12, encoder_ffn_dim=4096, encoder_attention_heads=16, decoder_layers=12, decoder_ffn_dim=4096, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, activation_function='gelu', d_model=1024, dropout=0.1, attention_dropout=0.0, activation_dropout=0.0, init_std=0.02, classifier_dropout=0.0, scale_embedding=False, use_cache=True, num_labels=3, pad_token_id=1, bos_token_id=0, eos_token_id=2, is_encoder_decoder=True, decoder_start_token_id=2, forced_eos_token_id=2, **kwargs)

The Bart model was proposed in BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.

According to the abstract,

  • Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).

  • The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, where spans of text are replaced with a single mask token.

  • BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.

Tips:

  • BART is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

  • Sequence-to-sequence model with an encoder and a decoder. The encoder is fed a corrupted version of the tokens, and the decoder is fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). A composition of the following transformations is applied on the pretraining tasks for the encoder:

    • mask random tokens (like in BERT)

    • delete random tokens

    • mask a span of k tokens with a single mask token (a span of 0 tokens is an insertion of a mask token)

    • permute sentences

    • rotate the document to make it start at a specific token

This model was contributed by sshleifer. The Authors’ code can be found here.

Args:
vocab_size (int, optional, defaults to 50265):

Vocabulary size of the BART model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BartModel or TFBartModel.

d_model (int, optional, defaults to 1024):

Dimensionality of the layers and the pooler layer.

encoder_layers (int, optional, defaults to 12):

Number of encoder layers.

decoder_layers (int, optional, defaults to 12):

Number of decoder layers.

encoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

decoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer decoder.

decoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in decoder.

encoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in encoder.

activation_function (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

activation_dropout (float, optional, defaults to 0.0):

The dropout ratio for activations inside the fully connected layer.

classifier_dropout (float, optional, defaults to 0.0):

The dropout ratio for classifier.

max_position_embeddings (int, optional, defaults to 1024):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

encoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

decoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the decoder. See the LayerDrop paper for more details.

scale_embedding (bool, optional, defaults to False):

Whether to scale embeddings by sqrt(d_model).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).

num_labels (int, optional, defaults to 3):

The number of labels to use in BartForSequenceClassification.

forced_eos_token_id (int, optional, defaults to 2):

The id of the token to force as the last generated token when max_length is reached. Usually set to eos_token_id.

class transformers.models.bert.configuration_bert.BertConfig(vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=0, position_embedding_type='absolute', use_cache=True, classifier_dropout=None, **kwargs)

The BERT model was proposed in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. It’s a bidirectional transformer pretrained using a combination of masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.

The abstract from the paper is the following:

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Tips:

  • BERT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

  • BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation.

  • BERT corrupts the inputs using random masking: during pretraining, a given percentage of tokens (usually 15%) is masked by (see the sketch after these tips):

    • a special mask token with probability 0.8

    • a random token different from the one masked with probability 0.1

    • the same token with probability 0.1

  • The model must predict the original sentence, but has a second objective: inputs are two sentences A and B (with a separation token in between). With probability 50%, the sentences are consecutive in the corpus, in the remaining 50% they are not related. The model has to predict if the sentences are consecutive or not.
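
The masking scheme described in the tips above can be sketched as follows (a toy illustration with made-up token ids, not the actual pretraining code from the transformers library; the [MASK] id of 103 is an assumption matching common BERT vocabularies):

import random

MASK_ID = 103        # assumed [MASK] token id
VOCAB_SIZE = 30_522

def corrupt(token_ids, mask_prob=0.15):
    corrupted = list(token_ids)
    for i in range(len(token_ids)):
        if random.random() < mask_prob:
            roll = random.random()
            if roll < 0.8:                    # 80%: the special mask token
                corrupted[i] = MASK_ID
            elif roll < 0.9:                  # 10%: a random token
                corrupted[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token
    return corrupted

print(corrupt([2023, 2003, 1037, 7099, 6251]))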

This model was contributed by thomwolf. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BertModel or TFBertModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling BertModel or TFBertModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”. For positional embeddings use “absolute”. For more information on “relative_key”, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on “relative_key_query”, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

is_decoder (bool, optional, defaults to False):

Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

classifier_dropout (float, optional):

The dropout ratio for the classification head.
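
If you would rather configure a small BERT from scratch (the first approach listed at the top of this page) than load prajjwal1/bert-tiny, the same keyword arguments can be placed under model_init_config, with model_type set to the corresponding architecture name (here assumed to be bert). The values below are illustrative, deliberately small ones, not the exact bert-tiny hyperparameters:

from transformers import BertConfig

# Illustrative small configuration; in EIR these fields would go under
# model_init_config (model_type assumed to be bert).
small_bert_config = BertConfig(
    num_hidden_layers=2,
    hidden_size=128,
    num_attention_heads=2,
    intermediate_size=256,
    max_position_embeddings=512,
)
print(small_bert_config)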

class transformers.models.bert_generation.configuration_bert_generation.BertGenerationConfig(vocab_size=50358, hidden_size=1024, num_hidden_layers=24, num_attention_heads=16, intermediate_size=4096, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=0, bos_token_id=2, eos_token_id=1, position_embedding_type='absolute', use_cache=True, **kwargs)

The BertGeneration model is a BERT model that can be leveraged for sequence-to-sequence tasks using EncoderDecoderModel as proposed in Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.

The abstract from the paper is the following:

Unsupervised pretraining of large neural models has recently revolutionized Natural Language Processing. By warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We developed a Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT, GPT-2 and RoBERTa checkpoints and conducted an extensive empirical study on the utility of initializing our model, both encoder and decoder, with these checkpoints. Our models result in new state-of-the-art results on Machine Translation, Text Summarization, Sentence Splitting, and Sentence Fusion.

class transformers.models.big_bird.configuration_big_bird.BigBirdConfig(vocab_size=50358, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu_new', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=4096, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, use_cache=True, pad_token_id=0, bos_token_id=1, eos_token_id=2, sep_token_id=66, attention_type='block_sparse', use_bias=True, rescale_embeddings=False, block_size=64, num_random_blocks=3, classifier_dropout=None, **kwargs)

The BigBird model was proposed in Big Bird: Transformers for Longer Sequences by Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon, Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others. BigBird is a sparse-attention-based transformer that extends Transformer-based models, such as BERT, to much longer sequences. In addition to sparse attention, BigBird also applies global attention as well as random attention to the input sequence. Theoretically, it has been shown that applying sparse, global, and random attention approximates full attention, while being computationally much more efficient for longer sequences. As a consequence of its capability to handle longer context, BigBird has shown improved performance on various long-document NLP tasks, such as question answering and summarization, compared to BERT or RoBERTa.

The abstract from the paper is the following:

Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.

Tips:

  • For an in-detail explanation on how BigBird’s attention works, see this blog post.

  • BigBird comes with 2 implementations: original_full and block_sparse. For sequence lengths < 1024, using original_full is advised as there is no benefit in using block_sparse attention (see the helper sketch below).

  • The code currently uses a window size of 3 blocks and 2 global blocks.

  • Sequence length must be divisible by block size.

  • The current implementation supports only ITC.

  • The current implementation doesn’t support num_random_blocks = 0.

  • BigBird is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

This model was contributed by vasudevgupta. The original code can be found here.
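
As a small illustration of the tips above about original_full vs. block_sparse attention and the divisibility constraint, the following hypothetical helper (not part of EIR or transformers) picks an attention type and pads the sequence length up to a multiple of block_size:

def choose_bigbird_settings(sequence_length: int, block_size: int = 64) -> dict:
    """Pick an attention type and a padded length following the tips above."""
    if sequence_length < 1024:
        # Below 1024 tokens there is no benefit to block_sparse attention.
        return {"attention_type": "original_full", "length": sequence_length}
    # block_sparse requires the sequence length to be divisible by block_size.
    padded = ((sequence_length + block_size - 1) // block_size) * block_size
    return {"attention_type": "block_sparse", "length": padded}

print(choose_bigbird_settings(512))   # {'attention_type': 'original_full', 'length': 512}
print(choose_bigbird_settings(4097))  # {'attention_type': 'block_sparse', 'length': 4160}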

Args:
vocab_size (int, optional, defaults to 50358):

Vocabulary size of the BigBird model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BigBirdModel.

hidden_size (int, optional, defaults to 768):

Dimension of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “gelu_new”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 4096):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 1024 or 2048 or 4096).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling BigBirdModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

is_decoder (bool, optional, defaults to False):

Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

attention_type (str, optional, defaults to “block_sparse”):

Whether to use block sparse attention (with n complexity) as introduced in the paper or the original attention layer (with n^2 complexity). Possible values are “original_full” and “block_sparse”.

use_bias (bool, optional, defaults to True):

Whether to use bias in query, key, value.

rescale_embeddings (bool, optional, defaults to False):

Whether to rescale embeddings with (hidden_size ** 0.5).

block_size (int, optional, defaults to 64):

Size of each block. Useful only when attention_type == “block_sparse”.

num_random_blocks (int, optional, defaults to 3):

Each query will attend to this many random blocks. Useful only when attention_type == “block_sparse”.

classifier_dropout (float, optional):

The dropout ratio for the classification head.

class transformers.models.bigbird_pegasus.configuration_bigbird_pegasus.BigBirdPegasusConfig(vocab_size=96103, max_position_embeddings=4096, encoder_layers=16, encoder_ffn_dim=4096, encoder_attention_heads=16, decoder_layers=16, decoder_ffn_dim=4096, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, use_cache=True, is_encoder_decoder=True, activation_function='gelu_new', d_model=1024, dropout=0.1, attention_dropout=0.0, activation_dropout=0.0, init_std=0.02, decoder_start_token_id=2, classifier_dropout=0.0, scale_embedding=True, pad_token_id=0, bos_token_id=2, eos_token_id=1, attention_type='block_sparse', block_size=64, num_random_blocks=3, use_bias=False, **kwargs)

The BigBird model was proposed in Big Bird: Transformers for Longer Sequences by Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon, Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others. BigBird is a sparse-attention-based transformer that extends Transformer-based models, such as BERT, to much longer sequences. In addition to sparse attention, BigBird also applies global attention as well as random attention to the input sequence. Theoretically, it has been shown that applying sparse, global, and random attention approximates full attention, while being computationally much more efficient for longer sequences. As a consequence of its capability to handle longer context, BigBird has shown improved performance on various long-document NLP tasks, such as question answering and summarization, compared to BERT or RoBERTa.

The abstract from the paper is the following:

Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.

Tips:

  • For an in-detail explanation on how BigBird’s attention works, see this blog post.

  • BigBird comes with 2 implementations: original_full and block_sparse. For sequence lengths < 1024, using original_full is advised as there is no benefit in using block_sparse attention.

  • The code currently uses a window size of 3 blocks and 2 global blocks.

  • Sequence length must be divisible by block size.

  • The current implementation supports only ITC.

  • The current implementation doesn’t support num_random_blocks = 0.

  • BigBirdPegasus uses the PegasusTokenizer.

  • BigBird is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

The original code can be found here.

Args:
vocab_size (int, optional, defaults to 96103):

Vocabulary size of the BigBirdPegasus model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BigBirdPegasusModel.

d_model (int, optional, defaults to 1024):

Dimension of the layers and the pooler layer.

encoder_layers (int, optional, defaults to 16):

Number of encoder layers.

decoder_layers (int, optional, defaults to 16):

Number of decoder layers.

encoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

decoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer decoder.

decoder_ffn_dim (int, optional, defaults to 4096):

Dimension of the “intermediate” (often named feed-forward) layer in decoder.

encoder_ffn_dim (int, optional, defaults to 4096):

Dimension of the “intermediate” (often named feed-forward) layer in encoder.

activation_function (str or function, optional, defaults to “gelu_new”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

activation_dropout (float, optional, defaults to 0.0):

The dropout ratio for activations inside the fully connected layer.

classifier_dropout (float, optional, defaults to 0.0):

The dropout ratio for classifier.

max_position_embeddings (int, optional, defaults to 4096):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 1024 or 2048 or 4096).

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

encoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

decoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the decoder. See the LayerDrop paper for more details.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).

attention_type (str, optional, defaults to “block_sparse”):

Whether to use block sparse attention (with n complexity) as introduced in the paper or the original attention layer (with n^2 complexity) in the encoder. Possible values are “original_full” and “block_sparse”.

use_bias (bool, optional, defaults to False):

Whether to use bias in query, key, value.

block_size (int, optional, defaults to 64):

Size of each block. Useful only when attention_type == “block_sparse”.

num_random_blocks (int, optional, defaults to 3):

Each query will attend to this many random blocks. Useful only when attention_type == “block_sparse”.

scale_embedding (bool, optional, defaults to True):

Whether to rescale embeddings with (hidden_size ** 0.5).

class transformers.models.biogpt.configuration_biogpt.BioGptConfig(vocab_size=42384, hidden_size=1024, num_hidden_layers=24, num_attention_heads=16, intermediate_size=4096, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=1024, initializer_range=0.02, layer_norm_eps=1e-12, scale_embedding=True, use_cache=True, layerdrop=0.0, activation_dropout=0.0, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs)

The BioGPT model was proposed in BioGPT: generative pre-trained transformer for biomedical text generation and mining by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu. BioGPT is a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. BioGPT follows the Transformer language model backbone, and is pre-trained on 15M PubMed abstracts from scratch.

The abstract from the paper is the following:

Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Among the two main branches of pre-trained language models in the general language domain, i.e. BERT (and its variants) and GPT (and its variants), the first one has been extensively studied in the biomedical domain, such as BioBERT and PubMedBERT. While they have achieved great success on a variety of discriminative downstream biomedical tasks, the lack of generation ability constrains their application scope. In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature. We evaluate BioGPT on six biomedical natural language processing tasks and demonstrate that our model outperforms previous models on most tasks. Especially, we get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA, creating a new record. Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature to generate fluent descriptions for biomedical terms.

Tips:

  • BioGPT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

  • BioGPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows BioGPT to generate syntactically coherent text, as can be observed in the run_generation.py example script.

  • The model can take the past_key_values (for PyTorch) as input, which is the previously computed key/value attention pairs. Using this (past_key_values or past) value prevents the model from re-computing pre-computed values in the context of text generation (see the sketch below). For PyTorch, see the past_key_values argument of the BioGptForCausalLM.forward() method for more information on its usage.

This model was contributed by kamalkraj. The original code can be found here.
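
In practice, the generation tip above looks roughly as follows; microsoft/biogpt is assumed here to be an available BioGPT checkpoint on the Hugging Face Hub, and generate() reuses past_key_values internally so that previous positions are not recomputed:

from transformers import BioGptForCausalLM, BioGptTokenizer

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

inputs = tokenizer("COVID-19 is", return_tensors="pt")
# use_cache=True lets generation reuse past_key_values between steps.
output_ids = model.generate(**inputs, max_new_tokens=20, use_cache=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))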

Args:
vocab_size (int, optional, defaults to 42384):

Vocabulary size of the BioGPT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BioGptModel.

hidden_size (int, optional, defaults to 1024):

Dimension of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 24):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 4096):

Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 1024):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

scale_embedding (bool, optional, defaults to True):

Whether to scale embeddings by sqrt(d_model).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

layerdrop (float, optional, defaults to 0.0):

Please refer to the LayerDrop paper (https://arxiv.org/abs/1909.11556) for further details.

activation_dropout (float, optional, defaults to 0.0):

The dropout ratio for activations inside the fully connected layer.

pad_token_id (int, optional, defaults to 1):

Padding token id.

bos_token_id (int, optional, defaults to 0):

Beginning of stream token id.

eos_token_id (int, optional, defaults to 2):

End of stream token id.

class transformers.models.blenderbot.configuration_blenderbot.BlenderbotConfig(vocab_size=8008, max_position_embeddings=128, encoder_layers=2, encoder_ffn_dim=10240, encoder_attention_heads=32, decoder_layers=24, decoder_ffn_dim=10240, decoder_attention_heads=32, encoder_layerdrop=0.0, decoder_layerdrop=0.0, use_cache=True, is_encoder_decoder=True, activation_function='gelu', d_model=2560, dropout=0.1, attention_dropout=0.0, activation_dropout=0.0, init_std=0.02, decoder_start_token_id=1, scale_embedding=False, pad_token_id=0, bos_token_id=1, eos_token_id=2, encoder_no_repeat_ngram_size=3, forced_eos_token_id=2, **kwargs)

The Blender chatbot model was proposed in Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau and Jason Weston on 30 Apr 2020.

The abstract of the paper is the following:

Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.

Tips:

  • Blenderbot is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

This model was contributed by sshleifer. The authors’ code can be found here.

Args:
vocab_size (int, optional, defaults to 8008):

Vocabulary size of the Blenderbot model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BlenderbotModel or TFBlenderbotModel.

d_model (int, optional, defaults to 2560):

Dimensionality of the layers and the pooler layer.

encoder_layers (int, optional, defaults to 2):

Number of encoder layers.

decoder_layers (int, optional, defaults to 24):

Number of decoder layers.

encoder_attention_heads (int, optional, defaults to 32):

Number of attention heads for each attention layer in the Transformer encoder.

decoder_attention_heads (int, optional, defaults to 32):

Number of attention heads for each attention layer in the Transformer decoder.

decoder_ffn_dim (int, optional, defaults to 10240):

Dimensionality of the “intermediate” (often named feed-forward) layer in decoder.

encoder_ffn_dim (int, optional, defaults to 10240):

Dimensionality of the “intermediate” (often named feed-forward) layer in encoder.

activation_function (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

activation_dropout (float, optional, defaults to 0.0):

The dropout ratio for activations inside the fully connected layer.

max_position_embeddings (int, optional, defaults to 128):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

encoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

decoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the decoder. See the LayerDrop paper for more details.

scale_embedding (bool, optional, defaults to False):

Whether to scale embeddings by sqrt(d_model).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).

forced_eos_token_id (int, optional, defaults to 2):

The id of the token to force as the last generated token when max_length is reached. Usually set to eos_token_id.

class transformers.models.blenderbot_small.configuration_blenderbot_small.BlenderbotSmallConfig(vocab_size=50265, max_position_embeddings=512, encoder_layers=8, encoder_ffn_dim=2048, encoder_attention_heads=16, decoder_layers=8, decoder_ffn_dim=2048, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, use_cache=True, is_encoder_decoder=True, activation_function='gelu', d_model=512, dropout=0.1, attention_dropout=0.0, activation_dropout=0.0, init_std=0.02, decoder_start_token_id=1, scale_embedding=False, pad_token_id=0, bos_token_id=1, eos_token_id=2, forced_eos_token_id=2, **kwargs)

The Blender chatbot model was proposed in Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau and Jason Weston on 30 Apr 2020.

The abstract of the paper is the following:

Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.

Tips:

  • Blenderbot Small is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

This model was contributed by patrickvonplaten. The authors’ code can be found here.

Args:
vocab_size (int, optional, defaults to 50265):

Vocabulary size of the BlenderbotSmall model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BlenderbotSmallModel or TFBlenderbotSmallModel.

d_model (int, optional, defaults to 512):

Dimensionality of the layers and the pooler layer.

encoder_layers (int, optional, defaults to 8):

Number of encoder layers.

decoder_layers (int, optional, defaults to 8):

Number of decoder layers.

encoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

decoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer decoder.

decoder_ffn_dim (int, optional, defaults to 2048):

Dimensionality of the “intermediate” (often named feed-forward) layer in decoder.

encoder_ffn_dim (int, optional, defaults to 2048):

Dimensionality of the “intermediate” (often named feed-forward) layer in encoder.

activation_function (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

activation_dropout (float, optional, defaults to 0.0):

The dropout ratio for activations inside the fully connected layer.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

encoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

decoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the decoder. See the LayerDrop paper for more details.

scale_embedding (bool, optional, defaults to False):

Whether to scale embeddings by sqrt(d_model).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).

forced_eos_token_id (int, optional, defaults to 2):

The id of the token to force as the last generated token when max_length is reached. Usually set to eos_token_id.

class transformers.models.bloom.configuration_bloom.BloomConfig(vocab_size=250880, hidden_size=64, n_layer=2, n_head=8, layer_norm_epsilon=1e-05, initializer_range=0.02, use_cache=True, bos_token_id=1, eos_token_id=2, apply_residual_connection_post_layernorm=False, hidden_dropout=0.0, attention_dropout=0.0, pretraining_tp=1, slow_but_exact=False, **kwargs)

The BLOOM model has been proposed with its various versions through the BigScience Workshop. BigScience is inspired by other open science initiatives where researchers have pooled their time and resources to collectively achieve a higher impact. The architecture of BLOOM is essentially similar to GPT-3 (an auto-regressive model for next-token prediction), but it has been trained on 46 different languages and 13 programming languages. Several smaller versions of the model have been trained on the same dataset.

Args:
vocab_size (int, optional, defaults to 250880):

Vocabulary size of the Bloom model. Defines the maximum number of different tokens that can be represented by the inputs_ids passed when calling BloomModel. Check this discussion on how the vocab_size has been defined.

hidden_size (int, optional, defaults to 64):

Dimensionality of the embeddings and hidden states.

n_layer (int, optional, defaults to 2):

Number of hidden layers in the Transformer encoder.

n_head (int, optional, defaults to 8):

Number of attention heads for each attention layer in the Transformer encoder.

layer_norm_epsilon (float, optional, defaults to 1e-5):

The epsilon to use in the layer normalization layers.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

apply_residual_connection_post_layernorm (bool, optional, defaults to False):

If enabled, use the layer norm of the hidden states as the residual in the transformer blocks.

hidden_dropout (float, optional, defaults to 0.0):

Dropout rate of the dropout function on the bias dropout.

attention_dropout (float, optional, defaults to 0.0):

Dropout rate applied to the attention probabilities.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).

pretraining_tp (int, optional, defaults to 1):

Experimental feature. Tensor parallelism rank used during pretraining with Megatron. Please refer to this document to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please refer to this issue. Note also that this is enabled only when slow_but_exact=True.

slow_but_exact (bool, optional, defaults to False):

Experimental feature. Whether to use a slow but exact implementation of the attention mechanism. While merging the TP rank tensors, due to slicing operations the results may be slightly different between the model trained on Megatron and our model. Please refer to this issue. A solution to obtain more accurate results is to enable this feature. Enabling this will increase inference time. It will probably be resolved in the future once the main model has been fine-tuned with TP_rank=1.

class transformers.models.camembert.configuration_camembert.CamembertConfig(vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=1, bos_token_id=0, eos_token_id=2, position_embedding_type='absolute', use_cache=True, classifier_dropout=None, **kwargs)

The CamemBERT model was proposed in CamemBERT: a Tasty French Language Model by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook’s RoBERTa model released in 2019. It is a model trained on 138GB of French text.

The abstract from the paper is the following:

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models –in all languages except English– very limited. Aiming to address this issue for French, we release CamemBERT, a French version of the Bi-directional Encoders for Transformers (BERT). We measure the performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and downstream applications for French NLP.

Tips:

  • This implementation is the same as RoBERTa. Refer to the documentation of RoBERTa for usage examples as well as the information relative to the inputs and outputs.

This model was contributed by camembert. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the CamemBERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling CamembertModel or TFCamembertModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling CamembertModel or TFCamembertModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”. For positional embeddings use “absolute”. For more information on “relative_key”, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on “relative_key_query”, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

is_decoder (bool, optional, defaults to False):

Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

classifier_dropout (float, optional):

The dropout ratio for the classification head.
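As with the other architectures on this page, the configuration class above can be instantiated directly in transformers. Below is a minimal sketch; the reduced hyperparameter values are illustrative only, not recommended settings:

>>> from transformers import CamembertConfig, CamembertModel
>>> # Initializing a small CamemBERT-style configuration (illustrative, reduced sizes)
>>> configuration = CamembertConfig(num_hidden_layers=2, num_attention_heads=2, hidden_size=64, intermediate_size=128)
>>> # Initializing a model from that configuration
>>> model = CamembertModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config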

class transformers.models.llama.configuration_llama.LlamaConfig(vocab_size=32000, hidden_size=4096, intermediate_size=11008, num_hidden_layers=32, num_attention_heads=32, num_key_value_heads=None, hidden_act='silu', max_position_embeddings=2048, initializer_range=0.02, rms_norm_eps=1e-06, use_cache=True, pad_token_id=None, bos_token_id=1, eos_token_id=2, pretraining_tp=1, tie_word_embeddings=False, rope_theta=10000.0, rope_scaling=None, attention_bias=False, attention_dropout=0.0, **kwargs)

The Code Llama model was proposed in Code Llama: Open Foundation Models for Code by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.

The abstract from the paper is the following:

We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% and 55% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.

Check out all Code Llama models here and the officially released ones in the codellama org.


The Llama2 family models, on which Code Llama is based, were trained using bfloat16, but the original inference uses float16. Let’s look at the different precisions:

  • float32: The PyTorch convention on model initialization is to load models in float32, no matter which dtype the model weights were stored in. transformers follows this convention for consistency with PyTorch, so float32 is picked by default. If you want the AutoModel API to load the checkpoints in the dtype the weights were stored in, you must specify torch_dtype="auto", e.g. model = AutoModelForCausalLM.from_pretrained("path", torch_dtype="auto").

  • bfloat16: Code Llama was trained with this precision, so we recommend using it for further training or fine-tuning.

  • float16: We recommend running inference using this precision, as it’s usually faster than bfloat16, and evaluation metrics show no discernible degradation with respect to bfloat16. You can also run inference using bfloat16, and we recommend you check inference results with both float16 and bfloat16 after fine-tuning.

As mentioned above, the dtype of the storage weights is mostly irrelevant unless you are using torch_dtype="auto" when initializing a model. The reason is that the model will first be downloaded (using the dtype of the checkpoints online) and then cast to the default dtype of torch (torch.float32). If a torch_dtype is specified, it is used instead.
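For illustration, a minimal sketch of the two loading modes described above, using the Code Llama 7B checkpoint referenced later in this section:

>>> from transformers import AutoModelForCausalLM
>>> import torch
>>> # Load in the dtype the weights were stored in (bfloat16 for Code Llama checkpoints)
>>> model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf", torch_dtype="auto")
>>> # Or request a specific precision explicitly, e.g. float16 for inference
>>> model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf", torch_dtype=torch.float16)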


Tips:

  • These models have the same architecture as the Llama2 models

  • The infilling task is supported out of the box: use tokenizer.fill_token where you want your input to be filled.

  • The model conversion script is the same as for the Llama2 family:

Here is a sample usage:

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path

Note that executing the script requires enough CPU RAM to host the whole model in float16 precision (even though the biggest versions come in several checkpoints, each checkpoint contains a part of each weight of the model, so all of them need to be loaded in RAM).

  • After conversion, the model and tokenizer can be loaded via:

>>> from transformers import LlamaForCausalLM, CodeLlamaTokenizer
>>> tokenizer = CodeLlamaTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
>>> model = LlamaForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")
>>> PROMPT = '''def remove_non_ascii(s: str) -> str:
    """ <FILL_ME>
    return result
'''
>>> input_ids = tokenizer(PROMPT, return_tensors="pt")["input_ids"]
>>> generated_ids = model.generate(input_ids, max_new_tokens=128)
>>> filling = tokenizer.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens = True)[0]
>>> print(PROMPT.replace("<FILL_ME>", filling))
def remove_non_ascii(s: str) -> str:
    """ Remove non-ASCII characters from a string.

    Args:
        s: The string to remove non-ASCII characters from.

    Returns:
        The string with non-ASCII characters removed.
    """
    result = ""
    for c in s:
        if ord(c) < 128:
            result += c
    return result

If you only want the infilled part:

>>> from transformers import pipeline
>>> import torch
>>> generator = pipeline("text-generation",model="codellama/CodeLlama-7b-hf",torch_dtype=torch.float16, device_map="auto")
>>> generator('def remove_non_ascii(s: str) -> str:\n    """ <FILL_ME>\n    return result', max_new_tokens = 128, return_type = 1)

Under the hood, the tokenizer automatically splits by <FILL_ME> (see https://huggingface.co/docs/transformers/main/model_doc/code_llama#transformers.CodeLlamaTokenizer.fill_token) to create a formatted input string that follows the original training pattern. This is more robust than preparing the pattern yourself: it avoids pitfalls, such as token glueing, that are very hard to debug. To see how much CPU and GPU memory you need for this model or others, try this calculator, which can help determine that value.

  • The LLaMA tokenizer is a BPE model based on sentencepiece. One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of the word (e.g. “Banana”), the tokenizer does not prepend the prefix space to the string.

This model was contributed by ArthurZucker. The original code of the authors can be found here.

Args:
vocab_size (int, optional, defaults to 32000):

Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling LlamaModel

hidden_size (int, optional, defaults to 4096):

Dimension of the hidden representations.

intermediate_size (int, optional, defaults to 11008):

Dimension of the MLP representations.

num_hidden_layers (int, optional, defaults to 32):

Number of hidden layers in the Transformer decoder.

num_attention_heads (int, optional, defaults to 32):

Number of attention heads for each attention layer in the Transformer decoder.

num_key_value_heads (int, optional):

This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper. If it is not specified, it will default to num_attention_heads.

hidden_act (str or function, optional, defaults to “silu”):

The non-linear activation function (function or string) in the decoder.

max_position_embeddings (int, optional, defaults to 2048):

The maximum sequence length that this model might ever be used with. Llama 1 supports up to 2048 tokens, Llama 2 up to 4096, CodeLlama up to 16384.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

rms_norm_eps (float, optional, defaults to 1e-06):

The epsilon used by the rms normalization layers.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

pad_token_id (int, optional):

Padding token id.

bos_token_id (int, optional, defaults to 1):

Beginning of stream token id.

eos_token_id (int, optional, defaults to 2):

End of stream token id.

pretraining_tp (int, optional, defaults to 1):

Experimental feature. Tensor parallelism rank used during pretraining. Please refer to this document to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please refer to this issue.

tie_word_embeddings (bool, optional, defaults to False):

Whether to tie the input and output word embeddings.

rope_theta (float, optional, defaults to 10000.0):

The base period of the RoPE embeddings.

rope_scaling (Dict, optional):

Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is {“type”: strategy name, “factor”: scaling factor}. When using this flag, don’t update max_position_embeddings to the expected new maximum. See the following thread for more information on how these scaling strategies behave: https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an experimental feature, subject to breaking API changes in future versions.

attention_bias (bool, optional, defaults to False):

Whether to use a bias in the query, key, value and output projection layers during self-attention.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

>>> from transformers import LlamaModel, LlamaConfig
>>> # Initializing a LLaMA llama-7b style configuration
>>> configuration = LlamaConfig()
>>> # Initializing a model from the llama-7b style configuration
>>> model = LlamaModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
class transformers.models.codegen.configuration_codegen.CodeGenConfig(vocab_size=50400, n_positions=2048, n_ctx=2048, n_embd=4096, n_layer=28, n_head=16, rotary_dim=64, n_inner=None, activation_function='gelu_new', resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0, layer_norm_epsilon=1e-05, initializer_range=0.02, use_cache=True, bos_token_id=50256, eos_token_id=50256, tie_word_embeddings=False, **kwargs)

The CodeGen model was proposed in A Conversational Paradigm for Program Synthesis by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong.

CodeGen is an autoregressive language model for program synthesis trained sequentially on The Pile, BigQuery, and BigPython.

The abstract from the paper is the following:

Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a conversational program synthesis approach via large language models, which addresses the challenges of searching over a vast program space and user intent specification faced in prior approaches. Our new approach casts the process of writing a specification and program as a multi-turn conversation between a user and a system. It treats program synthesis as a sequence prediction problem, in which the specification is expressed in natural language and the desired program is conditionally sampled. We train a family of large language models, called CodeGen, on natural language and programming language data. With weak supervision in the data and the scaling up of data size and model size, conversational capacities emerge from the simple autoregressive language modeling. To study the model behavior on conversational program synthesis, we develop a multi-turn programming benchmark (MTPB), where solving each problem requires multi-step synthesis via multi-turn conversation between the user and the model. Our findings show the emergence of conversational capabilities and the effectiveness of the proposed conversational program synthesis paradigm. In addition, our model CodeGen (with up to 16B parameters trained on TPU-v4) outperforms OpenAI’s Codex on the HumanEval benchmark. We make the training library JaxFormer, including checkpoints, available as an open source contribution: https://github.com/salesforce/codegen.

This model was contributed by Hiroaki Hayashi. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 50400):

Vocabulary size of the CodeGen model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling CodeGenModel.

n_positions (int, optional, defaults to 2048):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

n_ctx (int, optional, defaults to 2048):

This attribute is used in CodeGenModel.__init__ without any real effect.

n_embd (int, optional, defaults to 4096):

Dimensionality of the embeddings and hidden states.

n_layer (int, optional, defaults to 28):

Number of hidden layers in the Transformer encoder.

n_head (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

rotary_dim (int, optional, defaults to 64):

Number of dimensions in the embedding that Rotary Position Embedding is applied to.

n_inner (int, optional):

Dimensionality of the inner feed-forward layers. None will set it to 4 times n_embd

activation_function (str, optional, defaults to “gelu_new”):

Activation function, to be selected in the list [“relu”, “silu”, “gelu”, “tanh”, “gelu_new”].

resid_pdrop (float, optional, defaults to 0.0):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

embd_pdrop (int, optional, defaults to 0.0):

The dropout ratio for the embeddings.

attn_pdrop (float, optional, defaults to 0.0):

The dropout ratio for the attention.

layer_norm_epsilon (float, optional, defaults to 1e-05):

The epsilon to use in the layer normalization layers.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).

bos_token_id (int, optional, defaults to 50256):

Beginning of stream token id.

eos_token_id (int, optional, defaults to 50256):

End of stream token id.

tie_word_embeddings (bool, optional, defaults to False):

Whether the model’s input and output word embeddings should be tied. Note that this is only relevant if the model has a output word embedding layer.
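A minimal sketch of instantiating this configuration class directly, mirroring the configuration examples elsewhere on this page; the reduced sizes below are illustrative only:

>>> from transformers import CodeGenConfig, CodeGenModel
>>> # Initializing a small CodeGen-style configuration (illustrative values)
>>> configuration = CodeGenConfig(n_layer=2, n_head=4, n_embd=128, rotary_dim=16)
>>> # Initializing a model from that configuration
>>> model = CodeGenModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config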

class transformers.models.cohere.configuration_cohere.CohereConfig(vocab_size=256000, hidden_size=8192, intermediate_size=22528, logit_scale=0.0625, num_hidden_layers=40, num_attention_heads=64, num_key_value_heads=None, hidden_act='silu', max_position_embeddings=8192, initializer_range=0.02, layer_norm_eps=1e-05, use_cache=True, pad_token_id=0, bos_token_id=5, eos_token_id=255001, tie_word_embeddings=True, rope_theta=10000.0, attention_bias=False, attention_dropout=0.0, **kwargs)

The Cohere Command-R model was proposed in the blogpost Command-R: Retrieval Augmented Generation at Production Scale by the Cohere Team.

The abstract from the paper is the following:

Command-R is a scalable generative model targeting RAG and Tool Use to enable production-scale AI for enterprise. Today, we are introducing Command-R, a new LLM aimed at large-scale production workloads. Command-R targets the emerging “scalable” category of models that balance high efficiency with strong accuracy, enabling companies to move beyond proof of concept, and into production.

Command-R is a generative model optimized for long context tasks such as retrieval augmented generation (RAG) and using external APIs and tools. It is designed to work in concert with our industry-leading Embed and Rerank models to provide best-in-class integration for RAG applications and excel at enterprise use cases. As a model built for companies to implement at scale, Command-R boasts:

  • Strong accuracy on RAG and Tool Use

  • Low latency, and high throughput

  • Longer 128k context and lower pricing

  • Strong capabilities across 10 key languages

  • Model weights available on HuggingFace for research and evaluation

Check out the model checkpoints here. This model was contributed by Saurabh Dash and Ahmet Üstün. The code of the Hugging Face implementation is based on GPT-NeoX here.

Args:
vocab_size (int, optional, defaults to 256000):

Vocabulary size of the Cohere model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling CohereModel

hidden_size (int, optional, defaults to 8192):

Dimension of the hidden representations.

intermediate_size (int, optional, defaults to 22528):

Dimension of the MLP representations.

logit_scale (float, optional, defaults to 0.0625):

The scaling factor for the output logits.

num_hidden_layers (int, optional, defaults to 40):

Number of hidden layers in the Transformer decoder.

num_attention_heads (int, optional, defaults to 64):

Number of attention heads for each attention layer in the Transformer decoder.

num_key_value_heads (int, optional):

This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper. If it is not specified, it will default to num_attention_heads.

hidden_act (str or function, optional, defaults to “silu”):

The non-linear activation function (function or string) in the decoder.

max_position_embeddings (int, optional, defaults to 8192):

The maximum sequence length that this model might ever be used with.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-05):

The epsilon used by the layer normalization.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

pad_token_id (int, optional, defaults to 0):

Padding token id.

bos_token_id (int, optional, defaults to 5):

Beginning of stream token id.

eos_token_id (int, optional, defaults to 255001):

End of stream token id.

tie_word_embeddings (bool, optional, defaults to True):

Whether to tie the input and output word embeddings.

rope_theta (float, optional, defaults to 10000.0):

The base period of the RoPE embeddings.

attention_bias (bool, optional, defaults to False):

Whether to use a bias in the query, key, value and output projection layers during self-attention.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

>>> from transformers import CohereModel, CohereConfig
>>> # Initializing a Cohere model configuration
>>> configuration = CohereConfig()
>>> # Initializing a model from the Cohere configuration
>>> model = CohereModel(configuration) 
>>> # Accessing the model configuration
>>> configuration = model.config 
class transformers.models.ctrl.configuration_ctrl.CTRLConfig(vocab_size=246534, n_positions=256, n_embd=1280, dff=8192, n_layer=48, n_head=16, resid_pdrop=0.1, embd_pdrop=0.1, layer_norm_epsilon=1e-06, initializer_range=0.02, use_cache=True, **kwargs)

CTRL model was proposed in CTRL: A Conditional Transformer Language Model for Controllable Generation by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. It’s a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).

The abstract from the paper is the following:

Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution.

Tips:

  • CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences or links to generate coherent text. Refer to the original implementation for more information.

  • CTRL is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

  • CTRL was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows CTRL to generate syntactically coherent text as it can be observed in the run_generation.py example script.

  • The PyTorch models can take past_key_values as input, which is the previously computed key/value attention pairs. TensorFlow models accept past as input. Using past_key_values prevents the model from re-computing pre-computed values in the context of text generation. See the CTRLModel.forward method for more information on the usage of this argument.

This model was contributed by keskarnitishr. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 246534):

Vocabulary size of the CTRL model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling CTRLModel or TFCTRLModel.

n_positions (int, optional, defaults to 256):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

n_embd (int, optional, defaults to 1280):

Dimensionality of the embeddings and hidden states.

dff (int, optional, defaults to 8192):

Dimensionality of the inner dimension of the feed forward networks (FFN).

n_layer (int, optional, defaults to 48):

Number of hidden layers in the Transformer encoder.

n_head (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

resid_pdrop (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

embd_pdrop (int, optional, defaults to 0.1):

The dropout ratio for the embeddings.

layer_norm_epsilon (float, optional, defaults to 1e-06):

The epsilon to use in the layer normalization layers

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).
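For reference, a minimal sketch of configuring a small CTRL-style model directly from this configuration class; the values are illustrative only:

>>> from transformers import CTRLConfig, CTRLModel
>>> # Initializing a small CTRL-style configuration (illustrative values)
>>> configuration = CTRLConfig(n_layer=2, n_head=2, n_embd=64, dff=128)
>>> # Initializing a model from that configuration
>>> model = CTRLModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config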

class transformers.models.data2vec.configuration_data2vec_text.Data2VecTextConfig(vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=1, bos_token_id=0, eos_token_id=2, position_embedding_type='absolute', use_cache=True, classifier_dropout=None, **kwargs)

This is the configuration class to store the configuration of a Data2VecTextModel. It is used to instantiate a Data2VecText model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Data2VecText facebook/data2vec-text-base architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the Data2VecText model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling Data2VecTextModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling Data2VecTextModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”. For positional embeddings use “absolute”. For more information on “relative_key”, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on “relative_key_query”, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

is_decoder (bool, optional, defaults to False):

Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

classifier_dropout (float, optional):

The dropout ratio for the classification head.
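A minimal sketch of instantiating this configuration class directly, in the same style as the other examples on this page; the reduced sizes are illustrative only:

>>> from transformers import Data2VecTextConfig, Data2VecTextModel
>>> # Initializing a small Data2VecText-style configuration (illustrative values)
>>> configuration = Data2VecTextConfig(num_hidden_layers=2, num_attention_heads=2, hidden_size=64, intermediate_size=128)
>>> # Initializing a model from that configuration
>>> model = Data2VecTextModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config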

class transformers.models.deberta.configuration_deberta.DebertaConfig(vocab_size=50265, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=0, initializer_range=0.02, layer_norm_eps=1e-07, relative_attention=False, max_relative_positions=-1, pad_token_id=0, position_biased_input=True, pos_att_type=None, pooler_dropout=0, pooler_hidden_act='gelu', **kwargs)

The DeBERTa model was proposed in DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. It is based on Google’s BERT model released in 2018 and Facebook’s RoBERTa model released in 2019.

It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in RoBERTa.

The abstract from the paper is the following:

Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.

This model was contributed by DeBERTa. The TF 2.0 implementation of this model was contributed by kamalkraj. The original code can be found here.

Arguments:
vocab_size (int, optional, defaults to 50265):

Vocabulary size of the DeBERTa model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling DebertaModel or TFDebertaModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu”, “tanh”, “gelu_fast”, “mish”, “linear”, “sigmoid” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 0):

The vocabulary size of the token_type_ids passed when calling DebertaModel or TFDebertaModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-07):

The epsilon used by the layer normalization layers.

relative_attention (bool, optional, defaults to False):

Whether use relative position encoding.

max_relative_positions (int, optional, defaults to -1):

The range of relative positions [-max_position_embeddings, max_position_embeddings]. Use the same value as max_position_embeddings.

pad_token_id (int, optional, defaults to 0):

The value used to pad input_ids.

position_biased_input (bool, optional, defaults to True):

Whether add absolute position embedding to content embedding.

pos_att_type (List[str], optional):

The type of relative position attention, it can be a combination of [“p2c”, “c2p”], e.g. [“p2c”], [“p2c”, “c2p”].

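Following the pattern of the other configuration examples on this page, a small DeBERTa-style configuration can be instantiated directly; the values below are illustrative only:

>>> from transformers import DebertaConfig, DebertaModel
>>> # Initializing a small DeBERTa-style configuration (illustrative values)
>>> configuration = DebertaConfig(num_hidden_layers=2, num_attention_heads=2, hidden_size=64, intermediate_size=128)
>>> # Initializing a model from that configuration
>>> model = DebertaModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config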

class transformers.models.deberta_v2.configuration_deberta_v2.DebertaV2Config(vocab_size=128100, hidden_size=1536, num_hidden_layers=24, num_attention_heads=24, intermediate_size=6144, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=0, initializer_range=0.02, layer_norm_eps=1e-07, relative_attention=False, max_relative_positions=-1, pad_token_id=0, position_biased_input=True, pos_att_type=None, pooler_dropout=0, pooler_hidden_act='gelu', **kwargs)

The DeBERTa model was proposed in DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. It is based on Google’s BERT model released in 2018 and Facebook’s RoBERTa model released in 2019.

It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in RoBERTa.

The abstract from the paper is the following:

Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.

The following information is visible directly on the original implementation repository. DeBERTa v2 is the second version of the DeBERTa model. It includes the 1.5B model used for the SuperGLUE single-model submission, which achieved 89.9 versus the human baseline of 89.8. You can find more details about this submission in the authors’ blog.

New in v2:

  • Vocabulary In v2 the tokenizer is changed to use a new vocabulary of size 128K built from the training data. Instead of a GPT2-based tokenizer, the tokenizer is now a sentencepiece-based tokenizer.

  • nGiE (nGram Induced Input Encoding) The DeBERTa-v2 model uses an additional convolution layer alongside the first transformer layer to better learn the local dependencies of input tokens.

  • Sharing position projection matrix with content projection matrix in attention layer Based on previous experiments, this can save parameters without affecting the performance.

  • Apply bucket to encode relative positions The DeBERTa-v2 model uses log bucket to encode relative positions similar to T5.

  • 900M model & 1.5B model Two additional model sizes are available: 900M and 1.5B, which significantly improve performance on downstream tasks.

This model was contributed by DeBERTa. The TF 2.0 implementation of this model was contributed by kamalkraj. The original code can be found here.

Arguments:
vocab_size (int, optional, defaults to 128100):

Vocabulary size of the DeBERTa-v2 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling DebertaV2Model.

hidden_size (int, optional, defaults to 1536):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 24):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 24):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 6144):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu”, “tanh”, “gelu_fast”, “mish”, “linear”, “sigmoid” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 0):

The vocabulary size of the token_type_ids passed when calling DebertaModel or TFDebertaModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-7):

The epsilon used by the layer normalization layers.

relative_attention (bool, optional, defaults to False):

Whether to use relative position encoding.

max_relative_positions (int, optional, defaults to -1):

The range of relative positions [-max_position_embeddings, max_position_embeddings]. Use the same value as max_position_embeddings.

pad_token_id (int, optional, defaults to 0):

The value used to pad input_ids.

position_biased_input (bool, optional, defaults to True):

Whether to add absolute position embeddings to the content embeddings.

pos_att_type (List[str], optional):

The type of relative position attention, it can be a combination of [“p2c”, “c2p”], e.g. [“p2c”], [“p2c”, “c2p”].

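As above, a small DeBERTa-v2-style configuration can be instantiated directly from this class; a minimal sketch with illustrative values only:

>>> from transformers import DebertaV2Config, DebertaV2Model
>>> # Initializing a small DeBERTa-v2-style configuration (illustrative values)
>>> configuration = DebertaV2Config(num_hidden_layers=2, num_attention_heads=4, hidden_size=128, intermediate_size=256)
>>> # Initializing a model from that configuration
>>> model = DebertaV2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config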

class transformers.models.distilbert.configuration_distilbert.DistilBertConfig(vocab_size=30522, max_position_embeddings=512, sinusoidal_pos_embds=False, n_layers=6, n_heads=12, dim=768, hidden_dim=3072, dropout=0.1, attention_dropout=0.1, activation='gelu', initializer_range=0.02, qa_dropout=0.1, seq_classif_dropout=0.2, pad_token_id=0, **kwargs)

The DistilBERT model was proposed in the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT, and the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT’s performance as measured on the GLUE language understanding benchmark.

The abstract from the paper is the following:

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pretraining, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

Tips:

  • DistilBERT doesn’t have token_type_ids, you don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or [SEP]).

  • DistilBERT doesn’t have options to select the input positions (position_ids input). This could be added if necessary though, just let us know if you need this option.

  • Same as BERT but smaller. Trained by distillation of the pretrained BERT model, meaning it’s been trained to predict the same probabilities as the larger model. The actual objective is a combination of:

    • finding the same probabilities as the teacher model

    • predicting the masked tokens correctly (but no next-sentence objective)

    • a cosine similarity between the hidden states of the student and the teacher model

This model was contributed by victorsanh. The JAX version of this model was contributed by kamalkraj. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the DistilBERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling DistilBertModel or TFDistilBertModel.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

sinusoidal_pos_embds (boolean, optional, defaults to False):

Whether to use sinusoidal positional embeddings.

n_layers (int, optional, defaults to 6):

Number of hidden layers in the Transformer encoder.

n_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

dim (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

hidden_dim (int, optional, defaults to 3072):

The size of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

activation (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

qa_dropout (float, optional, defaults to 0.1):

The dropout probabilities used in the question answering model DistilBertForQuestionAnswering.

seq_classif_dropout (float, optional, defaults to 0.2):

The dropout probabilities used in the sequence classification and the multiple choice model DistilBertForSequenceClassification.
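A minimal sketch of instantiating this configuration class directly (note the DistilBERT-specific parameter names n_layers, n_heads, dim and hidden_dim); the values are illustrative only:

>>> from transformers import DistilBertConfig, DistilBertModel
>>> # Initializing a small DistilBERT-style configuration (illustrative values)
>>> configuration = DistilBertConfig(n_layers=2, n_heads=2, dim=64, hidden_dim=128)
>>> # Initializing a model from that configuration
>>> model = DistilBertModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config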

class transformers.models.electra.configuration_electra.ElectraConfig(vocab_size=30522, embedding_size=128, hidden_size=256, num_hidden_layers=12, num_attention_heads=4, intermediate_size=1024, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, summary_type='first', summary_use_proj=True, summary_activation='gelu', summary_last_dropout=0.1, pad_token_id=0, position_embedding_type='absolute', use_cache=True, classifier_dropout=None, **kwargs)

The ELECTRA model was proposed in the paper ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. ELECTRA is a new pretraining approach which trains two transformer models: the generator and the discriminator. The generator’s role is to replace tokens in a sequence, and is therefore trained as a masked language model. The discriminator, which is the model we’re interested in, tries to identify which tokens were replaced by the generator in the sequence.

The abstract from the paper is the following:

Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.

Tips:

  • ELECTRA is the pretraining approach, therefore there are nearly no changes made to the underlying model (BERT). The only change is the separation of the embedding size and the hidden size: the embedding size is generally smaller, while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from their embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no projection layer is used.

  • ELECTRA is a transformer model pretrained with the use of another (small) masked language model. The inputs are corrupted by that language model, which takes an input text that is randomly masked and outputs a text in which ELECTRA has to predict which token is an original and which one has been replaced. Like for GAN training, the small language model is trained for a few steps (but with the original texts as objective, not to fool the ELECTRA model like in a traditional GAN setting) then the ELECTRA model is trained for a few steps.

  • The ELECTRA checkpoints saved using Google Research’s implementation (https://github.com/google-research/electra) contain both the generator and discriminator. The conversion script requires the user to name which model to export into the correct architecture. Once converted to the HuggingFace format, these checkpoints may be loaded into all available ELECTRA models, however. This means that the discriminator may be loaded in the ElectraForMaskedLM model, and the generator may be loaded in the ElectraForPreTraining model (the classification head will be randomly initialized as it doesn’t exist in the generator).

This model was contributed by lysandre. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the ELECTRA model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling ElectraModel or TFElectraModel.

embedding_size (int, optional, defaults to 128):

Dimensionality of the token embeddings (projected to hidden_size by a linear layer when the two differ).

hidden_size (int, optional, defaults to 256):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 4):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 1024):

Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling ElectraModel or TFElectraModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

summary_type (str, optional, defaults to “first”):

Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

Has to be one of the following options:

  • “last”: Take the last token hidden state (like XLNet).

  • “first”: Take the first token hidden state (like BERT).

  • “mean”: Take the mean of all tokens hidden states.

  • “cls_index”: Supply a Tensor of classification token position (like GPT/GPT-2).

  • “attn”: Not implemented now, use multi-head attention.

summary_use_proj (bool, optional, defaults to True):

Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

Whether or not to add a projection after the vector extraction.

summary_activation (str, optional):

Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

Pass “gelu” for a gelu activation to the output, any other value will result in no activation.

summary_last_dropout (float, optional, defaults to 0.1):

Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

The dropout ratio to be used after the projection and activation.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”. For positional embeddings use “absolute”. For more information on “relative_key”, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on “relative_key_query”, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

classifier_dropout (float, optional):

The dropout ratio for the classification head.
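A minimal sketch of instantiating this configuration class directly, including the ELECTRA-specific separation of embedding_size and hidden_size; the values are illustrative only:

>>> from transformers import ElectraConfig, ElectraModel
>>> # Initializing a small ELECTRA-style configuration (illustrative values)
>>> configuration = ElectraConfig(embedding_size=32, hidden_size=64, num_hidden_layers=2, num_attention_heads=2, intermediate_size=128)
>>> # Initializing a model from that configuration
>>> model = ElectraModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config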

class transformers.models.ernie.configuration_ernie.ErnieConfig(vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, task_type_vocab_size=3, use_task_id=False, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=0, position_embedding_type='absolute', use_cache=True, classifier_dropout=None, **kwargs)

ERNIE is a series of powerful models proposed by Baidu, performing especially well on Chinese tasks; the series includes ERNIE1.0, ERNIE2.0, ERNIE3.0, ERNIE-Gram, ERNIE-health, etc.

These models are contributed by nghuyong and the official code can be found in PaddleNLP (in PaddlePaddle).

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the ERNIE model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling ErnieModel or TFErnieModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling ErnieModel or TFErnieModel.

task_type_vocab_size (int, optional, defaults to 3):

The vocabulary size of the task_type_ids for the ERNIE2.0/ERNIE3.0 models.

use_task_id (bool, optional, defaults to False):

Whether or not the model supports task_type_ids.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

pad_token_id (int, optional, defaults to 0):

Padding token id.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”. For positional embeddings use “absolute”. For more information on “relative_key”, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on “relative_key_query”, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

classifier_dropout (float, optional):

The dropout ratio for the classification head.
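A minimal sketch of instantiating this configuration class directly, in the same style as the other examples on this page; the reduced sizes are illustrative only:

>>> from transformers import ErnieConfig, ErnieModel
>>> # Initializing a small ERNIE-style configuration (illustrative values)
>>> configuration = ErnieConfig(num_hidden_layers=2, num_attention_heads=2, hidden_size=64, intermediate_size=128)
>>> # Initializing a model from that configuration
>>> model = ErnieModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config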

class transformers.models.falcon.configuration_falcon.FalconConfig(vocab_size=65024, hidden_size=4544, num_hidden_layers=32, num_attention_heads=71, layer_norm_epsilon=1e-05, initializer_range=0.02, use_cache=True, hidden_dropout=0.0, attention_dropout=0.0, num_kv_heads=None, alibi=False, new_decoder_architecture=False, multi_query=True, parallel_attn=True, bias=False, max_position_embeddings=2048, rope_theta=10000.0, rope_scaling=None, bos_token_id=11, eos_token_id=11, **kwargs)

Falcon is a class of causal decoder-only models built by TII. The largest Falcon checkpoints have been trained on >=1T tokens of text, with a particular emphasis on the RefinedWeb corpus. They are made available under the Apache 2.0 license.

Falcon’s architecture is modern and optimized for inference, with multi-query attention and support for efficient attention variants like FlashAttention. Both ‘base’ models trained only as causal language models as well as ‘instruct’ models that have received further fine-tuning are available.

Falcon models are (as of 2023) some of the largest and most powerful open-source language models, and consistently rank highly in the OpenLLM leaderboard.

Args:
vocab_size (int, optional, defaults to 65024):

Vocabulary size of the Falcon model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling FalconModel

hidden_size (int, optional, defaults to 4544):

Dimension of the hidden representations.

num_hidden_layers (int, optional, defaults to 32):

Number of hidden layers in the Transformer decoder.

num_attention_heads (int, optional, defaults to 71):

Number of attention heads for each attention layer in the Transformer encoder.

layer_norm_epsilon (float, optional, defaults to 1e-05):

The epsilon used by the layer normalization layers.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

use_cache (bool, optional, defaults to True):

Whether the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

hidden_dropout (float, optional, defaults to 0.0):

The dropout probability for MLP layers.

attention_dropout (float, optional, defaults to 0.0):

The dropout probability for attention layers.

num_kv_heads (int, optional):

Number of key-value heads to use per attention layer. If unset, defaults to the same value as num_attention_heads.

alibi (bool, optional, defaults to False):

Whether to use ALiBi positional biases during self-attention.

new_decoder_architecture (bool, optional, defaults to False):

Whether to use the new (Falcon-40B) decoder architecture. If True, the multi_query and parallel_attn arguments are ignored, as the new decoder always uses parallel attention.

multi_query (bool, optional, defaults to True):

Whether to use multi-query attention in the decoder. Ignored when new_decoder_architecture is True.

parallel_attn (bool, optional, defaults to True):

Whether to compute attention in parallel with the feedforward layer. If False, they are consecutive instead, as in the original Transformer architecture. Ignored when new_decoder_architecture is True.

bias (bool, optional, defaults to False):

Whether to use bias on Linear layers.

max_position_embeddings (int, optional, defaults to 2048):

The maximum sequence length that this model might ever be used with, when alibi is False. Pretrained Falcon models with RoPE support up to 2048 tokens.

rope_theta (float, optional, defaults to 10000.0):

The base period of the RoPE embeddings.

rope_scaling (Dict, optional):

Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is {“type”: strategy name, “factor”: scaling factor}. When using this flag, don’t update max_position_embeddings to the expected new maximum. See the following thread for more information on how these scaling strategies behave: https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an experimental feature, subject to breaking API changes in future versions.

bos_token_id (int, optional, defaults to 11):

The id of the “beginning-of-sequence” token.

eos_token_id (int, optional, defaults to 11):

The id of the “end-of-sequence” token.
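
As an illustrative sketch only, the snippet below builds a scaled-down Falcon configuration, including a rope_scaling dictionary in the {"type", "factor"} format described above; all values are arbitrary examples rather than recommended settings.

>>> from transformers import FalconConfig, FalconModel
>>> # Tiny, illustrative configuration; rope_scaling follows the {"type": ..., "factor": ...} format documented above
>>> configuration = FalconConfig(
...     vocab_size=1000,
...     hidden_size=64,
...     num_hidden_layers=2,
...     num_attention_heads=4,
...     alibi=False,
...     rope_scaling={"type": "linear", "factor": 2.0},
... )
>>> model = FalconModel(configuration)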

class transformers.models.flaubert.configuration_flaubert.FlaubertConfig(pre_norm=False, layerdrop=0.0, vocab_size=30145, emb_dim=2048, n_layers=12, n_heads=16, dropout=0.1, attention_dropout=0.1, gelu_activation=True, sinusoidal_embeddings=False, causal=False, asm=False, n_langs=1, use_lang_emb=True, max_position_embeddings=512, embed_init_std=0.02209708691207961, layer_norm_eps=1e-12, init_std=0.02, bos_index=0, eos_index=1, pad_index=2, unk_index=3, mask_index=5, is_encoder=True, summary_type='first', summary_use_proj=True, summary_activation=None, summary_proj_to_labels=True, summary_first_dropout=0.1, start_n_top=5, end_n_top=5, mask_token_id=0, lang_id=0, pad_token_id=2, bos_token_id=0, **kwargs)

The FlauBERT model was proposed in the paper FlauBERT: Unsupervised Language Model Pre-training for French by Hang Le et al. It’s a transformer model pretrained using a masked language modeling (MLM) objective (like BERT).

The abstract from the paper is the following:

Language models have become a key step to achieve state-of-the art results in many different Natural Language Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their contextualization at the sentence level. This has been widely demonstrated for English using contextualized representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the time they outperform other pretraining approaches. Different versions of FlauBERT as well as a unified evaluation protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research community for further reproducible experiments in French NLP.

This model was contributed by formiel. The original code can be found here.

Tips:

  • Like RoBERTa, without the sentence ordering prediction (so just trained on the MLM objective).

Args:
pre_norm (bool, optional, defaults to False):

Whether to apply the layer normalization before or after the feed forward layer following the attention in each layer (Vaswani et al., Tensor2Tensor for Neural Machine Translation. 2018)

layerdrop (float, optional, defaults to 0.0):

Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand with Structured Dropout. ICLR 2020)

vocab_size (int, optional, defaults to 30145):

Vocabulary size of the FlauBERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling FlaubertModel or TFFlaubertModel.

emb_dim (int, optional, defaults to 2048):

Dimensionality of the encoder layers and the pooler layer.

n_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

n_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.1):

The dropout probability for the attention mechanism.

gelu_activation (bool, optional, defaults to True):

Whether or not to use a gelu activation instead of relu.

sinusoidal_embeddings (bool, optional, defaults to False):

Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.

causal (bool, optional, defaults to False):

Whether or not the model should behave in a causal manner. Causal models use a triangular attention mask in order to only attend to the left-side context instead of a bidirectional context.

asm (bool, optional, defaults to False):

Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the prediction layer.

n_langs (int, optional, defaults to 1):

The number of languages the model handles. Set to 1 for monolingual models.

use_lang_emb (bool, optional, defaults to True):

Whether to use language embeddings. Some models use additional language embeddings, see the multilingual models page for information on how to use them.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

embed_init_std (float, optional, defaults to 2048^-0.5):

The standard deviation of the truncated_normal_initializer for initializing the embedding matrices.

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices except the embedding matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

bos_index (int, optional, defaults to 0):

The index of the beginning of sentence token in the vocabulary.

eos_index (int, optional, defaults to 1):

The index of the end of sentence token in the vocabulary.

pad_index (int, optional, defaults to 2):

The index of the padding token in the vocabulary.

unk_index (int, optional, defaults to 3):

The index of the unknown token in the vocabulary.

mask_index (int, optional, defaults to 5):

The index of the masking token in the vocabulary.

is_encoder (bool, optional, defaults to True):

Whether or not the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.

summary_type (string, optional, defaults to “first”):

Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

Has to be one of the following options:

  • “last”: Take the last token hidden state (like XLNet).

  • “first”: Take the first token hidden state (like BERT).

  • “mean”: Take the mean of all tokens hidden states.

  • “cls_index”: Supply a Tensor of classification token position (like GPT/GPT-2).

  • “attn”: Not implemented now, use multi-head attention.

summary_use_proj (bool, optional, defaults to True):

Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

Whether or not to add a projection after the vector extraction.

summary_activation (str, optional):

Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

Pass “tanh” for a tanh activation to the output, any other value will result in no activation.

summary_proj_to_labels (bool, optional, defaults to True):

Used in the sequence classification and multiple choice models.

Whether the projection outputs should have config.num_labels or config.hidden_size classes.

summary_first_dropout (float, optional, defaults to 0.1):

Used in the sequence classification and multiple choice models.

The dropout ratio to be used after the projection and activation.

start_n_top (int, optional, defaults to 5):

Used in the SQuAD evaluation script.

end_n_top (int, optional, defaults to 5):

Used in the SQuAD evaluation script.

mask_token_id (int, optional, defaults to 0):

Model agnostic parameter to identify masked tokens when generating text in an MLM context.

lang_id (int, optional, defaults to 0):

The ID of the language used by the model. This parameter is used when generating text in a given language.
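
Note that FlauBERT uses XLM-style parameter names (emb_dim, n_layers, n_heads) rather than the BERT-style names used by most other configurations on this page. A minimal, illustrative sketch:

>>> from transformers import FlaubertConfig, FlaubertModel
>>> # Illustrative small configuration using the XLM-style parameter names
>>> configuration = FlaubertConfig(
...     vocab_size=1000,
...     emb_dim=64,
...     n_layers=2,
...     n_heads=2,
... )
>>> model = FlaubertModel(configuration)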

class transformers.models.fnet.configuration_fnet.FNetConfig(vocab_size=32000, hidden_size=768, num_hidden_layers=12, intermediate_size=3072, hidden_act='gelu_new', hidden_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=4, initializer_range=0.02, layer_norm_eps=1e-12, use_tpu_fourier_optimizations=False, tpu_short_seq_length=512, pad_token_id=3, bos_token_id=1, eos_token_id=2, **kwargs)

The FNet model was proposed in FNet: Mixing Tokens with Fourier Transforms by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. The model replaces the self-attention layer in a BERT model with a fourier transform which returns only the real parts of the transform. The model is significantly faster than the BERT model because it has fewer parameters and is more memory efficient. The model achieves about 92-97% accuracy of BERT counterparts on GLUE benchmark, and trains much faster than the BERT model. The abstract from the paper is the following:

We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that “mix” input tokens. These linear mixers, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths, our FNet model is significantly faster: when compared to the “efficient” Transformers on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, while outpacing the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.

Tips on usage:

  • The model was trained without an attention mask as it is based on Fourier Transform. The model was trained with maximum sequence length 512 which includes pad tokens. Hence, it is highly recommended to use the same maximum sequence length for fine-tuning and inference.

This model was contributed by gchhablani. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 32000):

Vocabulary size of the FNet model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling FNetModel or TFFNetModel.

hidden_size (int, optional, defaults to 768):

Dimension of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “gelu_new”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 4):

The vocabulary size of the token_type_ids passed when calling FNetModel or TFFNetModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

use_tpu_fourier_optimizations (bool, optional, defaults to False):

Determines whether to use TPU optimized FFTs. If True, the model will favor axis-wise FFT transforms. Set to False for GPU/CPU hardware, in which case n-dimensional FFTs are used.

tpu_short_seq_length (int, optional, defaults to 512):

The sequence length that is expected by the model when using TPUs. This will be used to initialize the DFT matrix only when use_tpu_fourier_optimizations is set to True and the input sequence is shorter than or equal to 4096 tokens.
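
A minimal sketch of a small FNet configuration follows; the values are illustrative, and use_tpu_fourier_optimizations is left at False as recommended above for GPU/CPU hardware.

>>> from transformers import FNetConfig, FNetModel
>>> # Illustrative small configuration; TPU-specific FFT optimizations stay disabled for GPU/CPU use
>>> configuration = FNetConfig(
...     vocab_size=1000,
...     hidden_size=64,
...     num_hidden_layers=2,
...     intermediate_size=128,
...     use_tpu_fourier_optimizations=False,
... )
>>> model = FNetModel(configuration)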

class transformers.models.gemma.configuration_gemma.GemmaConfig(vocab_size=256000, hidden_size=3072, intermediate_size=24576, num_hidden_layers=28, num_attention_heads=16, num_key_value_heads=16, head_dim=256, hidden_act='gelu_pytorch_tanh', hidden_activation=None, max_position_embeddings=8192, initializer_range=0.02, rms_norm_eps=1e-06, use_cache=True, pad_token_id=0, eos_token_id=1, bos_token_id=2, tie_word_embeddings=True, rope_theta=10000.0, attention_bias=False, attention_dropout=0.0, **kwargs)

The Gemma model was proposed in Gemma: Open Models Based on Gemini Technology and Research by Gemma Team, Google. Gemma models are trained on 6T tokens, and released in two versions, 2b and 7b.

The abstract from the paper is the following:

This work introduces Gemma, a new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of our model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations

Tips:

  • The original checkpoints can be converted using the conversion script src/transformers/models/gemma/convert_gemma_weights_to_hf.py

This model was contributed by Arthur Zucker, Younes Belkada, Sanchit Gandhi, Pedro Cuenca.

Args:
vocab_size (int, optional, defaults to 256000):

Vocabulary size of the Gemma model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling GemmaModel

hidden_size (int, optional, defaults to 3072):

Dimension of the hidden representations.

intermediate_size (int, optional, defaults to 24576):

Dimension of the MLP representations.

num_hidden_layers (int, optional, defaults to 28):

Number of hidden layers in the Transformer decoder.

num_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer decoder.

num_key_value_heads (int, optional, defaults to 16):

This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA), if num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by meanpooling all the original heads within that group. For more details, check out this paper. If it is not specified, it will default to num_attention_heads.

head_dim (int, optional, defaults to 256):

The attention head dimension.

hidden_act (str or function, optional, defaults to “gelu_pytorch_tanh”):

The legacy activation function. It is overwritten by the hidden_activation.

hidden_activation (str or function, optional):

The non-linear activation function (function or string) in the decoder. Will default to “gelu_pytorch_tanh” if not specified. “gelu_pytorch_tanh” uses an approximation of the “gelu” activation function.

max_position_embeddings (int, optional, defaults to 8192):

The maximum sequence length that this model might ever be used with.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

rms_norm_eps (float, optional, defaults to 1e-06):

The epsilon used by the rms normalization layers.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

pad_token_id (int, optional, defaults to 0):

Padding token id.

eos_token_id (int, optional, defaults to 1):

End of stream token id.

bos_token_id (int, optional, defaults to 2):

Beginning of stream token id.

tie_word_embeddings (bool, optional, defaults to True):

Whether to tie weight embeddings

rope_theta (float, optional, defaults to 10000.0):

The base period of the RoPE embeddings.

attention_bias (bool, optional, defaults to False):

Whether to use a bias in the query, key, value and output projection layers during self-attention.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

>>> from transformers import GemmaModel, GemmaConfig
>>> # Initializing a Gemma gemma-7b style configuration
>>> configuration = GemmaConfig()
>>> # Initializing a model from the gemma-7b style configuration
>>> model = GemmaModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
class transformers.models.git.configuration_git.GitConfig(vision_config=None, vocab_size=30522, hidden_size=768, num_hidden_layers=6, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=1024, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=0, position_embedding_type='absolute', use_cache=True, tie_word_embeddings=False, bos_token_id=101, eos_token_id=102, num_image_with_embedding=None, **kwargs)

The GIT model was proposed in GIT: A Generative Image-to-text Transformer for Vision and Language by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. GIT is a decoder-only Transformer that leverages CLIP’s vision encoder to condition the model on vision inputs besides text. The model obtains state-of-the-art results on image captioning and visual question answering benchmarks.

The abstract from the paper is the following:

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.

Tips:

  • GIT is implemented in a very similar way to GPT-2, the only difference being that the model is also conditioned on pixel_values.

  • One can use GitProcessor to prepare images for the model, and the generate method for autoregressive generation.

GIT architecture. Taken from the original paper: https://arxiv.org/abs/2205.14100

This model was contributed by nielsr. The original code can be found here.

Args:
vision_config (dict, optional):

Dictionary of configuration options used to initialize GitVisionConfig.

vocab_size (int, optional, defaults to 30522):

Vocabulary size of the GIT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling GitModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 6):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 1024):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”. For positional embeddings use “absolute”. For more information on “relative_key”, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on “relative_key_query”, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).

num_image_with_embedding (int, optional):

The number of temporal embeddings to add, in case the model is used for video captioning/VQA.
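
A minimal sketch follows; if vision_config is omitted, a default GitVisionConfig is created internally, and the reduced num_hidden_layers only shrinks the text decoder (the value is illustrative).

>>> from transformers import GitConfig, GitModel
>>> # Illustrative text-decoder setting; a default GitVisionConfig is created since vision_config is not supplied
>>> configuration = GitConfig(num_hidden_layers=2)
>>> model = GitModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config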

class transformers.models.gpt2.configuration_gpt2.GPT2Config(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12, n_head=12, n_inner=None, activation_function='gelu_new', resid_pdrop=0.1, embd_pdrop=0.1, attn_pdrop=0.1, layer_norm_epsilon=1e-05, initializer_range=0.02, summary_type='cls_index', summary_use_proj=True, summary_activation=None, summary_proj_to_labels=True, summary_first_dropout=0.1, scale_attn_weights=True, use_cache=True, bos_token_id=50256, eos_token_id=50256, scale_attn_by_inverse_layer_idx=False, reorder_and_upcast_attn=False, **kwargs)

The GPT-Sw3 model was first proposed in Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.

Since that first paper, the authors have extended their work and trained new models on their new 1.2TB corpus, The Nordic Pile.

GPT-Sw3 is a collection of large decoder-only pretrained transformer language models that were developed by AI Sweden in collaboration with RISE and the WASP WARA for Media and Language. GPT-Sw3 has been trained on a dataset containing 320B tokens in Swedish, Norwegian, Danish, Icelandic, English, and programming code. The model was pretrained using a causal language modeling (CLM) objective utilizing the NeMo Megatron GPT implementation.

This model was contributed by AI Sweden.

The implementation uses the GPT2Model coupled with our GPTSw3Tokenizer. This means that AutoTokenizer and AutoModelForCausalLM map to our tokenizer implementation and the corresponding GPT2 model implementation respectively. Note that sentencepiece is required to use our tokenizer and can be installed with: pip install transformers[sentencepiece] or pip install sentencepiece
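
A hedged sketch of that mapping is shown below; the checkpoint id is only a placeholder, to be replaced with whichever GPT-Sw3 checkpoint you actually intend to load, and sentencepiece must be installed as noted above.

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> # Placeholder checkpoint id; substitute the GPT-Sw3 checkpoint you want to use
>>> checkpoint = "AI-Sweden-Models/gpt-sw3-126m"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)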

class transformers.models.gpt2.configuration_gpt2.GPT2Config(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12, n_head=12, n_inner=None, activation_function='gelu_new', resid_pdrop=0.1, embd_pdrop=0.1, attn_pdrop=0.1, layer_norm_epsilon=1e-05, initializer_range=0.02, summary_type='cls_index', summary_use_proj=True, summary_activation=None, summary_proj_to_labels=True, summary_first_dropout=0.1, scale_attn_weights=True, use_cache=True, bos_token_id=50256, eos_token_id=50256, scale_attn_by_inverse_layer_idx=False, reorder_and_upcast_attn=False, **kwargs)

OpenAI GPT-2 model was proposed in Language Models are Unsupervised Multitask Learners by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever from OpenAI. It’s a causal (unidirectional) transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.

The abstract from the paper is the following:

GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.

Tips:

  • GPT-2 is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

  • GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as it can be observed in the run_generation.py example script.

  • The model can take the past_key_values (for PyTorch) or past (for TF) as input, which is the previously computed key/value attention pairs. Using this (past_key_values or past) value prevents the model from re-computing pre-computed values in the context of text generation. For PyTorch, see past_key_values argument of the GPT2Model.forward method, or for TF the past argument of the TFGPT2Model.call method for more information on its usage.

  • Enabling the scale_attn_by_inverse_layer_idx and reorder_and_upcast_attn flags will apply the training stability improvements from Mistral (https://github.com/stanford-crfm/mistral/), for PyTorch only.

Write With Transformer is a webapp created and hosted by Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five different sizes: small, medium, large, xl and a distilled version of the small checkpoint: distilgpt-2.

This model was contributed by thomwolf. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 50257):

Vocabulary size of the GPT-2 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling GPT2Model or TFGPT2Model.

n_positions (int, optional, defaults to 1024):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

n_embd (int, optional, defaults to 768):

Dimensionality of the embeddings and hidden states.

n_layer (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

n_head (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

n_inner (int, optional):

Dimensionality of the inner feed-forward layers. None will set it to 4 times n_embd

activation_function (str, optional, defaults to “gelu_new”):

Activation function, to be selected in the list [“relu”, “silu”, “gelu”, “tanh”, “gelu_new”].

resid_pdrop (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

embd_pdrop (float, optional, defaults to 0.1):

The dropout ratio for the embeddings.

attn_pdrop (float, optional, defaults to 0.1):

The dropout ratio for the attention.

layer_norm_epsilon (float, optional, defaults to 1e-05):

The epsilon to use in the layer normalization layers.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

summary_type (string, optional, defaults to “cls_index”):

Argument used when doing sequence summary, used in the models GPT2DoubleHeadsModel and TFGPT2DoubleHeadsModel.

Has to be one of the following options:

  • “last”: Take the last token hidden state (like XLNet).

  • “first”: Take the first token hidden state (like BERT).

  • “mean”: Take the mean of all tokens hidden states.

  • “cls_index”: Supply a Tensor of classification token position (like GPT/GPT-2).

  • “attn”: Not implemented now, use multi-head attention.

summary_use_proj (bool, optional, defaults to True):

Argument used when doing sequence summary, used in the models GPT2DoubleHeadsModel and TFGPT2DoubleHeadsModel.

Whether or not to add a projection after the vector extraction.

summary_activation (str, optional):

Argument used when doing sequence summary. Used in for the multiple choice head in GPT2DoubleHeadsModel.

Pass “tanh” for a tanh activation to the output, any other value will result in no activation.

summary_proj_to_labels (bool, optional, defaults to True):

Argument used when doing sequence summary, used in the models GPT2DoubleHeadsModel and TFGPT2DoubleHeadsModel.

Whether the projection outputs should have config.num_labels or config.hidden_size classes.

summary_first_dropout (float, optional, defaults to 0.1):

Argument used when doing sequence summary, used in the models GPT2DoubleHeadsModel and TFGPT2DoubleHeadsModel.

The dropout ratio to be used after the projection and activation.

scale_attn_weights (bool, optional, defaults to True):

Scale attention weights by dividing by sqrt(hidden_size).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).

bos_token_id (int, optional, defaults to 50256):

Id of the beginning of sentence token in the vocabulary.

eos_token_id (int, optional, defaults to 50256):

Id of the end of sentence token in the vocabulary.

scale_attn_by_inverse_layer_idx (bool, optional, defaults to False):

Whether to additionally scale attention weights by 1 / (layer_idx + 1).

reorder_and_upcast_attn (bool, optional, defaults to False):

Whether to scale keys (K) prior to computing attention (dot-product) and upcast attention dot-product/softmax to float32 when training with mixed precision.
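
The sketch below assembles a small GPT-2 configuration using the n_embd/n_layer/n_head naming scheme documented above; the values are illustrative only.

>>> from transformers import GPT2Config, GPT2Model
>>> # Illustrative small configuration using GPT-2 style parameter names
>>> configuration = GPT2Config(
...     vocab_size=1000,
...     n_positions=512,
...     n_embd=64,
...     n_layer=2,
...     n_head=2,
... )
>>> model = GPT2Model(configuration)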

class transformers.models.gpt_bigcode.configuration_gpt_bigcode.GPTBigCodeConfig(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12, n_head=12, n_inner=None, activation_function='gelu_pytorch_tanh', resid_pdrop=0.1, embd_pdrop=0.1, attn_pdrop=0.1, layer_norm_epsilon=1e-05, initializer_range=0.02, scale_attn_weights=True, use_cache=True, bos_token_id=50256, eos_token_id=50256, attention_softmax_in_fp32=True, scale_attention_softmax_in_fp32=True, multi_query=True, **kwargs)

The GPTBigCode model was proposed in SantaCoder: don’t reach for the stars! by BigCode. The listed authors are: Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.

The abstract from the paper is the following:

The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL license at https://huggingface.co/bigcode.

The model is an optimized GPT-2 model with support for Multi-Query Attention.

Args:
vocab_size (int, optional, defaults to 50257):

Vocabulary size of the GPT-2 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling GPTBigCodeModel.

n_positions (int, optional, defaults to 1024):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

n_embd (int, optional, defaults to 768):

Dimensionality of the embeddings and hidden states.

n_layer (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

n_head (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

n_inner (int, optional, defaults to None):

Dimensionality of the inner feed-forward layers. None will set it to 4 times n_embd

activation_function (str, optional, defaults to “gelu_pytorch_tanh”):

Activation function, to be selected in the list [“relu”, “silu”, “gelu”, “tanh”, “gelu_new”, “gelu_pytorch_tanh”].

resid_pdrop (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

embd_pdrop (float, optional, defaults to 0.1):

The dropout ratio for the embeddings.

attn_pdrop (float, optional, defaults to 0.1):

The dropout ratio for the attention.

layer_norm_epsilon (float, optional, defaults to 1e-5):

The epsilon to use in the layer normalization layers.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

scale_attn_weights (bool, optional, defaults to True):

Scale attention weights by dividing by sqrt(hidden_size).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).

attention_softmax_in_fp32 (bool, optional, defaults to True):

Whether to call the fused softmax in float32.

scale_attention_softmax_in_fp32 (bool, optional, defaults to True):

Whether to scale the attention softmax in float32.

multi_query (bool, optional, defaults to True):

Whether to use Multi-Query Attention (True) or Multi-Head Attention (False).
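
A minimal, illustrative sketch; setting multi_query=True selects Multi-Query Attention as described above, and the remaining values are arbitrary small examples rather than recommended settings.

>>> from transformers import GPTBigCodeConfig, GPTBigCodeModel
>>> # Illustrative small configuration with Multi-Query Attention enabled
>>> configuration = GPTBigCodeConfig(
...     vocab_size=1000,
...     n_embd=64,
...     n_layer=2,
...     n_head=2,
...     multi_query=True,
... )
>>> model = GPTBigCodeModel(configuration)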

class transformers.models.gpt_neox.configuration_gpt_neox.GPTNeoXConfig(vocab_size=50432, hidden_size=6144, num_hidden_layers=44, num_attention_heads=64, intermediate_size=24576, hidden_act='gelu', rotary_pct=0.25, rotary_emb_base=10000, attention_dropout=0.0, hidden_dropout=0.0, classifier_dropout=0.1, max_position_embeddings=2048, initializer_range=0.02, layer_norm_eps=1e-05, use_cache=True, bos_token_id=0, eos_token_id=2, tie_word_embeddings=False, use_parallel_residual=True, rope_scaling=None, attention_bias=True, **kwargs)

We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile, whose weights will be made freely and openly available to the public through a permissive license. It is, to the best of our knowledge, the largest dense autoregressive model that has publicly available weights at the time of submission. In this work, we describe GPT-NeoX-20B’s architecture and training and evaluate its performance on a range of language-understanding, mathematics, and knowledge-based tasks. We find that GPT-NeoX-20B is a particularly powerful few-shot reasoner and gains far more in performance when evaluated five-shot than similarly sized GPT-3 and FairSeq models. We open-source the training and evaluation code, as well as the model weights, at https://github.com/EleutherAI/gpt-neox.

Development of the model was led by Sid Black, Stella Biderman and Eric Hallahan, and the model was trained with the generous support of CoreWeave.

GPT-NeoX-20B was trained with fp16, thus it is recommended to initialize the model as follows:

from transformers import GPTNeoXForCausalLM

model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b").half().cuda()

GPT-NeoX-20B also has a different tokenizer from the one used in GPT-J-6B and GPT-Neo. The new tokenizer allocates additional tokens to whitespace characters, making the model more suitable for certain tasks like code generation.

Args:
vocab_size (int, optional, defaults to 50432):

Vocabulary size of the GPTNeoX model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling GPTNeoXModel.

hidden_size (int, optional, defaults to 6144):

Dimension of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 44):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 64):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 24576):

Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” are supported.

rotary_pct (float, optional, defaults to 0.25):

Percentage of hidden dimensions to allocate to rotary embeddings.

rotary_emb_base (int, optional, defaults to 10000):

Base for computing rotary embeddings frequency.

attention_dropout (float, optional, defaults to 0.0):

The dropout probability for the attention score.

hidden_dropout (float, optional, defaults to 0.0):

The dropout ratio of (1) the word embeddings, (2) the post-attention hidden states, and (3) the post-mlp hidden states.

classifier_dropout (float, optional, defaults to 0.1):

Argument used when doing token classification, used in the model GPTNeoXForTokenClassification.

The dropout ratio for the hidden layer.

max_position_embeddings (int, optional, defaults to 2048):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-05):

The epsilon used by the layer normalization layers.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

use_parallel_residual (bool, optional, defaults to True):

Whether to use a “parallel” formulation in each Transformer layer, which can provide a slight training speedup at large scales (e.g. 20B).

rope_scaling (Dict, optional):

Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is {“type”: strategy name, “factor”: scaling factor}. When using this flag, don’t update max_position_embeddings to the expected new maximum. See the following thread for more information on how these scaling strategies behave: https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an experimental feature, subject to breaking API changes in future versions.

attention_bias (bool, optional, defaults to True):

Whether to use a bias in the query, key, value and output projection layers during self-attention.
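
A small, illustrative GPT-NeoX configuration is sketched below; rotary_pct=0.25 allocates a quarter of each head's dimensions to rotary embeddings, and the other values are arbitrary examples.

>>> from transformers import GPTNeoXConfig, GPTNeoXModel
>>> # Illustrative small configuration with the parallel residual formulation enabled
>>> configuration = GPTNeoXConfig(
...     vocab_size=1000,
...     hidden_size=64,
...     num_hidden_layers=2,
...     num_attention_heads=4,
...     intermediate_size=128,
...     rotary_pct=0.25,
...     use_parallel_residual=True,
... )
>>> model = GPTNeoXModel(configuration)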

class transformers.models.gpt_neox_japanese.configuration_gpt_neox_japanese.GPTNeoXJapaneseConfig(vocab_size=32000, hidden_size=2560, num_hidden_layers=32, num_attention_heads=32, intermediate_multiple_size=4, hidden_act='gelu', rotary_pct=1.0, rotary_emb_base=10000, max_position_embeddings=2048, initializer_range=0.02, layer_norm_eps=1e-05, use_cache=True, bos_token_id=31996, eos_token_id=31999, attention_dropout=0.1, hidden_dropout=0.0, **kwargs)

We introduce GPT-NeoX-Japanese, which is an autoregressive language model for Japanese, trained on top of https://github.com/EleutherAI/gpt-neox. Japanese is a unique language with its large vocabulary and a combination of hiragana, katakana, and kanji writing scripts. To address this distinct structure of the Japanese language, we use a special sub-word tokenizer. We are very grateful to tanreinama for open-sourcing this incredibly helpful tokenizer. Following the recommendations from Google’s research on PaLM, we have removed bias parameters from transformer blocks, achieving better model performance. Please refer to this article for details.

Development of the model was led by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori from ABEJA, Inc.. For more information on this model-building activity, please refer here (ja).

Args:
vocab_size (int, optional, defaults to 32000):

Vocabulary size of the GPTNeoXJapanese model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling GPTNeoXJapanese.

hidden_size (int, optional, defaults to 2560):

Dimension of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 32):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 32):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_multiple_size (int, optional, defaults to 4):

Dimension of the “intermediate” layer in the Transformer encoder is calculated by hidden_size * intermediate_multiple_size.

hidden_act (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler.

rotary_pct (float, optional, defaults to 1.00):

Percentage of hidden dimensions to allocate to rotary embeddings.

rotary_emb_base (int, optional, defaults to 10000):

Base for computing rotary embeddings frequency.

max_position_embeddings (int, optional, defaults to 2048):

The maximum sequence length that this model might ever be used with.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-5):

The epsilon used by the layer normalization layers.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

attention_dropout (float, optional, defaults to 0.1):

The dropout ratio for the attention.

hidden_dropout (float, optional, defaults to 0.0):

The dropout ratio for the hidden layer.
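
A minimal sketch follows; note that the intermediate dimension is derived as hidden_size * intermediate_multiple_size, and the values shown are illustrative only.

>>> from transformers import GPTNeoXJapaneseConfig, GPTNeoXJapaneseModel
>>> # Illustrative small configuration; intermediate size becomes hidden_size * intermediate_multiple_size
>>> configuration = GPTNeoXJapaneseConfig(
...     vocab_size=1000,
...     hidden_size=64,
...     num_hidden_layers=2,
...     num_attention_heads=4,
...     intermediate_multiple_size=4,
... )
>>> model = GPTNeoXJapaneseModel(configuration)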

class transformers.models.gptj.configuration_gptj.GPTJConfig(vocab_size=50400, n_positions=2048, n_embd=4096, n_layer=28, n_head=16, rotary_dim=64, n_inner=None, activation_function='gelu_new', resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0, layer_norm_epsilon=1e-05, initializer_range=0.02, use_cache=True, bos_token_id=50256, eos_token_id=50256, tie_word_embeddings=False, **kwargs)

The GPT-J model was released in the kingoflolz/mesh-transformer-jax repository by Ben Wang and Aran Komatsuzaki. It is a GPT-2-like causal language model trained on the Pile dataset.

This model was contributed by Stella Biderman.

Tips:

  • To load GPT-J in float32 one would need at least 2x model size RAM: 1x for initial weights and another 1x to load the checkpoint. So for GPT-J it would take at least 48GB RAM to just load the model. To reduce the RAM usage there are a few options. The torch_dtype argument can be used to initialize the model in half-precision on a CUDA device only. There is also a fp16 branch which stores the fp16 weights, which could be used to further minimize the RAM usage:

>>> from transformers import GPTJForCausalLM
>>> import torch
>>> device = "cuda"
>>> model = GPTJForCausalLM.from_pretrained(
...     "EleutherAI/gpt-j-6B",
...     revision="float16",
...     torch_dtype=torch.float16,
... ).to(device)
  • The model should fit on 16GB GPU for inference. For training/fine-tuning it would take much more GPU RAM. Adam optimizer for example makes four copies of the model: model, gradients, average and squared average of the gradients. So it would need at least 4x model size GPU memory, even with mixed precision as gradient updates are in fp32. This is not including the activations and data batches, which would again require some more GPU RAM. So one should explore solutions such as DeepSpeed, to train/fine-tune the model. Another option is to use the original codebase to train/fine-tune the model on TPU and then convert the model to Transformers format for inference. Instructions for that could be found here

  • Although the embedding matrix has a size of 50400, only 50257 entries are used by the GPT-2 tokenizer. These extra tokens are added for the sake of efficiency on TPUs. To avoid the mismatch between embedding matrix size and vocab size, the tokenizer for GPT-J contains 143 extra tokens <|extratoken_1|>… <|extratoken_143|>, so the vocab_size of tokenizer also becomes 50400.

Args:
vocab_size (int, optional, defaults to 50400):

Vocabulary size of the GPT-J model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling GPTJModel.

n_positions (int, optional, defaults to 2048):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

n_embd (int, optional, defaults to 4096):

Dimensionality of the embeddings and hidden states.

n_layer (int, optional, defaults to 28):

Number of hidden layers in the Transformer encoder.

n_head (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

rotary_dim (int, optional, defaults to 64):

Number of dimensions in the embedding that Rotary Position Embedding is applied to.

n_inner (int, optional, defaults to None):

Dimensionality of the inner feed-forward layers. None will set it to 4 times n_embd

activation_function (str, optional, defaults to “gelu_new”):

Activation function, to be selected in the list [“relu”, “silu”, “gelu”, “tanh”, “gelu_new”].

resid_pdrop (float, optional, defaults to 0.0):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

embd_pdrop (float, optional, defaults to 0.0):

The dropout ratio for the embeddings.

attn_pdrop (float, optional, defaults to 0.0):

The dropout ratio for the attention.

layer_norm_epsilon (float, optional, defaults to 1e-5):

The epsilon to use in the layer normalization layers.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).
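
The sketch below builds a small GPT-J style configuration; rotary_dim is deliberately kept below the per-head dimension (n_embd / n_head), and all values are illustrative rather than recommended.

>>> from transformers import GPTJConfig, GPTJModel
>>> # Illustrative small configuration; rotary_dim stays below the per-head dimension (64 / 4 = 16)
>>> configuration = GPTJConfig(
...     vocab_size=1000,
...     n_positions=512,
...     n_embd=64,
...     n_layer=2,
...     n_head=4,
...     rotary_dim=8,
... )
>>> model = GPTJModel(configuration)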

class transformers.models.ibert.configuration_ibert.IBertConfig(vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=1, bos_token_id=0, eos_token_id=2, position_embedding_type='absolute', quant_mode=False, force_dequant='none', **kwargs)

The I-BERT model was proposed in I-BERT: Integer-only BERT Quantization by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney and Kurt Keutzer. It’s a quantized version of RoBERTa running inference up to four times faster.

The abstract from the paper is the following:

Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for efficient inference at the edge, and even at the data center. While quantization can be a viable solution for this, previous work on quantizing Transformer based models use floating-point arithmetic during inference, which cannot efficiently utilize integer-only logical units such as the recent Turing Tensor Cores, or traditional integer-only ARM processors. In this work, we propose I-BERT, a novel quantization scheme for Transformer based models that quantizes the entire inference with integer-only arithmetic. Based on lightweight integer-only approximation methods for nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation. We evaluate our approach on GLUE downstream tasks using RoBERTa-Base/Large. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4 - 4.0x for INT8 inference on a T4 GPU system as compared to FP32 inference. The framework has been developed in PyTorch and has been open-sourced.

This model was contributed by kssteven. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the I-BERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling IBertModel

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling IBertModel

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”. For positional embeddings use “absolute”. For more information on “relative_key”, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on “relative_key_query”, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

quant_mode (bool, optional, defaults to False):

Whether to quantize the model or not.

force_dequant (str, optional, defaults to “none”):

Force dequantize specific nonlinear layers. Dequantized layers are then executed with full precision. “none”, “gelu”, “softmax”, “layernorm” and “nonlinear” are supported. By default, it is set to “none”, which does not dequantize any layers. Please specify “gelu”, “softmax”, or “layernorm” to dequantize GELU, Softmax, or LayerNorm, respectively. “nonlinear” will dequantize all nonlinear layers, i.e., GELU, Softmax, and LayerNorm.
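
As a minimal sketch of how this configuration could be used (the quant_mode=True choice below is purely illustrative and not prescribed by this page), an I-BERT model can be instantiated from an IBertConfig:

from transformers import IBertConfig, IBertModel

# Build a configuration; quant_mode=True enables the integer-only quantization mode (illustrative choice)
configuration = IBertConfig(quant_mode=True)

# Initialize a randomly weighted model from that configuration
model = IBertModel(configuration)

# Access the model configuration
configuration = model.config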

class transformers.models.imagegpt.configuration_imagegpt.ImageGPTConfig(vocab_size=513, n_positions=1024, n_embd=512, n_layer=24, n_head=8, n_inner=None, activation_function='quick_gelu', resid_pdrop=0.1, embd_pdrop=0.1, attn_pdrop=0.1, layer_norm_epsilon=1e-05, initializer_range=0.02, scale_attn_weights=True, use_cache=True, tie_word_embeddings=False, scale_attn_by_inverse_layer_idx=False, reorder_and_upcast_attn=False, **kwargs)

The ImageGPT model was proposed in Generative Pretraining from Pixels by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. ImageGPT (iGPT) is a GPT-2-like model trained to predict the next pixel value, allowing for both unconditional and conditional image generation.

The abstract from the paper is the following:

Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. We are also competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0% top-1 accuracy on a linear probe of our features.

Summary of the approach. Taken from the original paper.

This model was contributed by nielsr, based on this issue. The original code can be found here.

Tips:

  • ImageGPT is almost exactly the same as GPT-2, with the exception that a different activation function is used (namely “quick gelu”), and the layer normalization layers don’t mean-center the inputs. ImageGPT also doesn’t have tied input and output embeddings.

  • As the time and memory requirements of the attention mechanism of Transformers scale quadratically in the sequence length, the authors pre-trained ImageGPT on smaller input resolutions, such as 32x32 and 64x64. However, feeding a sequence of 32x32x3=3072 tokens from 0..255 into a Transformer is still prohibitively large. Therefore, the authors applied k-means clustering to the (R,G,B) pixel values with k=512. This way, we only have a 32*32 = 1024-long sequence, but now of integers in the range 0..511. So we are shrinking the sequence length at the cost of a bigger embedding matrix. In other words, the vocabulary size of ImageGPT is 512, + 1 for a special “start of sentence” (SOS) token, used at the beginning of every sequence. One can use ImageGPTImageProcessor to prepare images for the model.

  • Despite being pre-trained entirely unsupervised (i.e. without the use of any labels), ImageGPT produces fairly performant image features useful for downstream tasks, such as image classification. The authors showed that the features in the middle of the network are the most performant, and can be used as-is to train a linear model (such as a sklearn logistic regression model, for example). This is also referred to as “linear probing”. Features can be easily obtained by forwarding the image through the model with output_hidden_states=True, and then average-pooling the hidden states at whatever layer you like (see the sketch after the table below).

  • Alternatively, one can further fine-tune the entire model on a downstream dataset, similar to BERT. For this, you can use ImageGPTForImageClassification.

  • ImageGPT comes in different sizes: there’s ImageGPT-small, ImageGPT-medium and ImageGPT-large. The authors did also train an XL variant, which they didn’t release. The differences in size are summarized in the following table:

| Model variant | Depths | Hidden sizes | Decoder hidden size | Params (M) | ImageNet-1k Top 1 |
|---|---|---|---|---|---|
| MiT-b0 | [2, 2, 2, 2] | [32, 64, 160, 256] | 256 | 3.7 | 70.5 |
| MiT-b1 | [2, 2, 2, 2] | [64, 128, 320, 512] | 256 | 14.0 | 78.7 |
| MiT-b2 | [3, 4, 6, 3] | [64, 128, 320, 512] | 768 | 25.4 | 81.6 |
| MiT-b3 | [3, 4, 18, 3] | [64, 128, 320, 512] | 768 | 45.2 | 83.1 |
| MiT-b4 | [3, 8, 27, 3] | [64, 128, 320, 512] | 768 | 62.6 | 83.6 |
| MiT-b5 | [3, 6, 40, 3] | [64, 128, 320, 512] | 768 | 82.0 | 83.8 |
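
To make the linear-probing tip above concrete, here is a hedged sketch of extracting features with output_hidden_states=True; the openai/imagegpt-small checkpoint and the file name your_image.png are illustrative placeholders, not prescriptions from this page:

import torch
from PIL import Image
from transformers import ImageGPTImageProcessor, ImageGPTModel

processor = ImageGPTImageProcessor.from_pretrained("openai/imagegpt-small")
model = ImageGPTModel.from_pretrained("openai/imagegpt-small")

image = Image.open("your_image.png").convert("RGB")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Average-pool the hidden states of a middle layer to obtain one feature vector per image
middle_layer = len(outputs.hidden_states) // 2
features = outputs.hidden_states[middle_layer].mean(dim=1)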

Args:
vocab_size (int, optional, defaults to 512):

Vocabulary size of the GPT-2 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling ImageGPTModel or TFImageGPTModel.

n_positions (int, optional, defaults to 32*32):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

n_embd (int, optional, defaults to 512):

Dimensionality of the embeddings and hidden states.

n_layer (int, optional, defaults to 24):

Number of hidden layers in the Transformer encoder.

n_head (int, optional, defaults to 8):

Number of attention heads for each attention layer in the Transformer encoder.

n_inner (int, optional, defaults to None):

Dimensionality of the inner feed-forward layers. None will set it to 4 times n_embd

activation_function (str, optional, defaults to “quick_gelu”):

Activation function (can be one of the activation functions defined in src/transformers/activations.py). Defaults to “quick_gelu”.

resid_pdrop (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

embd_pdrop (int, optional, defaults to 0.1):

The dropout ratio for the embeddings.

attn_pdrop (float, optional, defaults to 0.1):

The dropout ratio for the attention.

layer_norm_epsilon (float, optional, defaults to 1e-5):

The epsilon to use in the layer normalization layers.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

scale_attn_weights (bool, optional, defaults to True):

Scale attention weights by dividing by sqrt(hidden_size).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).

scale_attn_by_inverse_layer_idx (bool, optional, defaults to False):

Whether to additionally scale attention weights by 1 / (layer_idx + 1).

reorder_and_upcast_attn (bool, optional, defaults to False):

Whether to scale keys (K) prior to computing attention (dot-product) and upcast attention dot-product/softmax to float() when training with mixed precision.

class transformers.models.layoutlm.configuration_layoutlm.LayoutLMConfig(vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=0, position_embedding_type='absolute', use_cache=True, max_2d_position_embeddings=1024, **kwargs)

The LayoutLM model was proposed in the paper LayoutLM: Pre-training of Text and Layout for Document Image Understanding by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. It’s a simple but effective pretraining method of text and layout for document image understanding and information extraction tasks, such as form understanding and receipt understanding. It obtains state-of-the-art results on several downstream tasks:

  • form understanding: the FUNSD dataset (a collection of 199 annotated forms comprising more than 30,000 words).

  • receipt understanding: the SROIE dataset (a collection of 626 receipts for training and 347 receipts for testing).

  • document image classification: the RVL-CDIP dataset (a collection of 400,000 images belonging to one of 16 classes).

The abstract from the paper is the following:

Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation, while neglecting layout and style information that is vital for document image understanding. In this paper, we propose the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we also leverage image features to incorporate words’ visual information into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pretraining. It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42).

Tips:

  • In addition to input_ids, ~transformers.LayoutLMModel.forward also expects the input bbox, which are the bounding boxes (i.e. 2D-positions) of the input tokens. These can be obtained using an external OCR engine such as Google’s Tesseract (there’s a Python wrapper available). Each bounding box should be in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on a 0-1000 scale. To normalize, you can use the following function:

def normalize_bbox(bbox, width, height):
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]

Here, width and height correspond to the width and height of the original document in which the token occurs. Those can be obtained using the Python Imaging Library (PIL), for example, as follows:

from PIL import Image

# Document can be a png, jpg, etc. PDFs must be converted to images.
image = Image.open(name_of_your_document).convert("RGB")

width, height = image.size
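
Putting the two snippets together, a hedged end-to-end sketch might look as follows; the file name and the pixel coordinates of the word box are hypothetical:

from PIL import Image

# Hypothetical document image and a word bounding box in (x0, y0, x1, y1) pixel coordinates
image = Image.open("form_page.png").convert("RGB")
width, height = image.size

word_box = (48, 84, 156, 108)
normalized_box = normalize_bbox(word_box, width, height)  # normalize_bbox as defined above
# normalized_box is now on the 0-1000 scale expected for the bbox input of LayoutLMModel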

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the LayoutLM model. Defines the number of different tokens that can be represented by the inputs_ids passed to the forward method of LayoutLMModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed into LayoutLMModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

pad_token_id (int, optional, defaults to 0):

The value used to pad input_ids.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”. For positional embeddings use “absolute”. For more information on “relative_key”, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on “relative_key_query”, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

max_2d_position_embeddings (int, optional, defaults to 1024):

The maximum value that the 2D position embedding might ever be used with. Typically set this to something large just in case (e.g., 1024).

class transformers.models.led.configuration_led.LEDConfig(vocab_size=50265, max_encoder_position_embeddings=16384, max_decoder_position_embeddings=1024, encoder_layers=12, encoder_ffn_dim=4096, encoder_attention_heads=16, decoder_layers=12, decoder_ffn_dim=4096, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, use_cache=True, is_encoder_decoder=True, activation_function='gelu', d_model=1024, dropout=0.1, attention_dropout=0.0, activation_dropout=0.0, init_std=0.02, decoder_start_token_id=2, classifier_dropout=0.0, pad_token_id=1, bos_token_id=0, eos_token_id=2, attention_window: List[int] | int = 512, **kwargs)

The LED model was proposed in Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.

The abstract from the paper is the following:

Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.

Tips:

  • LEDForConditionalGeneration is an extension of BartForConditionalGeneration exchanging the traditional self-attention layer with Longformer’s chunked self-attention layer. LEDTokenizer is an alias of BartTokenizer.

  • LED works very well on long-range sequence-to-sequence tasks where the input_ids largely exceed a length of 1024 tokens.

  • LED pads the input_ids to be a multiple of config.attention_window if required. Therefore, a small speed-up is gained when LEDTokenizer is used with the pad_to_multiple_of argument.

  • LED makes use of global attention by means of the global_attention_mask (see LongformerModel). For summarization, it is advised to put global attention only on the first <s> token. For question answering, it is advised to put global attention on all tokens of the question. A minimal sketch of this is shown below.

  • To fine-tune LED on all 16384 input tokens, gradient checkpointing can be enabled in case training leads to out-of-memory (OOM) errors. This can be done by executing model.gradient_checkpointing_enable(). Moreover, the use_cache=False flag can be used to disable the caching mechanism to save memory.

  • A notebook showing how to evaluate LED can be accessed here.

  • A notebook showing how to fine-tune LED can be accessed here.

  • LED is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

This model was contributed by patrickvonplaten.
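
To make the global-attention tip above concrete, here is a hedged summarization sketch; the allenai/led-base-16384 checkpoint, the placeholder document, and the generation settings are illustrative choices rather than recommendations from this page:

import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

long_document = "Replace this placeholder with the long document you want to summarize."
inputs = tokenizer(long_document, return_tensors="pt")

# Put global attention only on the first <s> token, as advised for summarization above
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    global_attention_mask=global_attention_mask,
    max_length=256,
    num_beams=4,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))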

Args:
vocab_size (int, optional, defaults to 50265):

Vocabulary size of the LED model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling LEDModel or TFLEDModel.

d_model (int, optional, defaults to 1024):

Dimensionality of the layers and the pooler layer.

encoder_layers (int, optional, defaults to 12):

Number of encoder layers.

decoder_layers (int, optional, defaults to 12):

Number of decoder layers.

encoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

decoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer decoder.

decoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in decoder.

encoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in encoder.

activation_function (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

activation_dropout (float, optional, defaults to 0.0):

The dropout ratio for activations inside the fully connected layer.

classifier_dropout (float, optional, defaults to 0.0):

The dropout ratio for classifier.

max_encoder_position_embeddings (int, optional, defaults to 16384):

The maximum sequence length that the encoder might ever be used with.

max_decoder_position_embeddings (int, optional, defaults to 1024):

The maximum sequence length that the decoder might ever be used with.

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

encoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

decoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the decoder. See the LayerDrop paper for more details.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models)

class transformers.models.llama.configuration_llama.LlamaConfig(vocab_size=32000, hidden_size=4096, intermediate_size=11008, num_hidden_layers=32, num_attention_heads=32, num_key_value_heads=None, hidden_act='silu', max_position_embeddings=2048, initializer_range=0.02, rms_norm_eps=1e-06, use_cache=True, pad_token_id=None, bos_token_id=1, eos_token_id=2, pretraining_tp=1, tie_word_embeddings=False, rope_theta=10000.0, rope_scaling=None, attention_bias=False, attention_dropout=0.0, **kwargs)

The LLaMA model was proposed in LLaMA: Open and Efficient Foundation Language Models by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. It is a collection of foundation language models ranging from 7B to 65B parameters.

The abstract from the paper is the following:

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

Tips:

  • Weights for the LLaMA models can be obtained by filling out this form.

  • After downloading the weights, they will need to be converted to the Hugging Face Transformers format using the conversion script. The script can be called with the following (example) command:

```bash
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path
```

  • After conversion, the model and tokenizer can be loaded via:

from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("/output/path")
model = LlamaForCausalLM.from_pretrained("/output/path")

Note that executing the script requires enough CPU RAM to host the whole model in float16 precision (even though the biggest versions come in several checkpoints, each checkpoint contains only a part of the model’s weights, so all of them need to be loaded in RAM). For the 65B model, this amounts to 130GB of RAM.

  • The LLaMA tokenizer is a BPE model based on sentencepiece. One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of the word (e.g. “Banana”), the tokenizer does not prepend the prefix space to the string.

This model was contributed by zphang with contributions from BlackSamorez. The code of the implementation in Hugging Face is based on GPT-NeoX here. The original code of the authors can be found here.

Args:
vocab_size (int, optional, defaults to 32000):

Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling LlamaModel

hidden_size (int, optional, defaults to 4096):

Dimension of the hidden representations.

intermediate_size (int, optional, defaults to 11008):

Dimension of the MLP representations.

num_hidden_layers (int, optional, defaults to 32):

Number of hidden layers in the Transformer decoder.

num_attention_heads (int, optional, defaults to 32):

Number of attention heads for each attention layer in the Transformer decoder.

num_key_value_heads (int, optional):

This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA), if num_key_value_heads=1 the model will use Multi Query Attention (MQA), otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by meanpooling all the original heads within that group. For more details, check out this paper. If it is not specified, will default to num_attention_heads.

hidden_act (str or function, optional, defaults to “silu”):

The non-linear activation function (function or string) in the decoder.

max_position_embeddings (int, optional, defaults to 2048):

The maximum sequence length that this model might ever be used with. Llama 1 supports up to 2048 tokens, Llama 2 up to 4096, CodeLlama up to 16384.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

rms_norm_eps (float, optional, defaults to 1e-06):

The epsilon used by the rms normalization layers.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

pad_token_id (int, optional):

Padding token id.

bos_token_id (int, optional, defaults to 1):

Beginning of stream token id.

eos_token_id (int, optional, defaults to 2):

End of stream token id.

pretraining_tp (int, optional, defaults to 1):

Experimental feature. Tensor parallelism rank used during pretraining. Please refer to this document to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please refer to this issue.

tie_word_embeddings (bool, optional, defaults to False):

Whether to tie weight embeddings

rope_theta (float, optional, defaults to 10000.0):

The base period of the RoPE embeddings.

rope_scaling (Dict, optional):

Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is {“type”: strategy name, “factor”: scaling factor}. When using this flag, don’t update max_position_embeddings to the expected new maximum. See the following thread for more information on how these scaling strategies behave: https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an experimental feature, subject to breaking API changes in future versions.

attention_bias (bool, optional, defaults to False):

Whether to use a bias in the query, key, value and output projection layers during self-attention.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

>>> from transformers import LlamaModel, LlamaConfig
>>> # Initializing a LLaMA llama-7b style configuration
>>> configuration = LlamaConfig()
>>> # Initializing a model from the llama-7b style configuration
>>> model = LlamaModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
class transformers.models.longformer.configuration_longformer.LongformerConfig(attention_window: List[int] | int = 512, sep_token_id: int = 2, pad_token_id: int = 1, bos_token_id: int = 0, eos_token_id: int = 2, vocab_size: int = 30522, hidden_size: int = 768, num_hidden_layers: int = 12, num_attention_heads: int = 12, intermediate_size: int = 3072, hidden_act: str = 'gelu', hidden_dropout_prob: float = 0.1, attention_probs_dropout_prob: float = 0.1, max_position_embeddings: int = 512, type_vocab_size: int = 2, initializer_range: float = 0.02, layer_norm_eps: float = 1e-12, onnx_export: bool = False, **kwargs)

The Longformer model was presented in Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.

The abstract from the paper is the following:

Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.

Tips:

  • Since the Longformer is based on RoBERTa, it doesn’t have token_type_ids. You don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or </s>).

  • A transformer model replacing the attention matrices by sparse matrices to go faster. Often, the local context (e.g., what are the two tokens left and right?) is enough to take action for a given token. Some preselected input tokens are still given global attention, but the attention matrix has far fewer parameters, resulting in a speed-up. See the local attention section for more information.

This model was contributed by beltagy. The Authors’ code can be found here.

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the Longformer model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling LongformerModel or TFLongformerModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling LongformerModel or TFLongformerModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

attention_window (int or List[int], optional, defaults to 512):

Size of an attention window around each token. If an int, use the same size for all layers. To specify a different window size for each layer, use a List[int] where len(attention_window) == num_hidden_layers.
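
As a minimal sketch of configuring a small Longformer from scratch with a per-layer attention window (all sizes below are arbitrary illustration values, not recommendations):

from transformers import LongformerConfig, LongformerModel

# One window size per hidden layer; len(attention_window) must equal num_hidden_layers
configuration = LongformerConfig(
    num_hidden_layers=4,
    num_attention_heads=4,
    hidden_size=256,
    intermediate_size=512,
    attention_window=[32, 32, 64, 64],
)
model = LongformerModel(configuration)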

class transformers.models.longt5.configuration_longt5.LongT5Config(vocab_size=32128, d_model=512, d_kv=64, d_ff=2048, num_layers=6, num_decoder_layers=None, num_heads=8, local_radius=127, global_block_size=16, relative_attention_num_buckets=32, relative_attention_max_distance=128, dropout_rate=0.1, layer_norm_epsilon=1e-06, initializer_factor=1.0, feed_forward_proj='relu', is_encoder_decoder=True, encoder_attention_type='local', use_cache=True, pad_token_id=0, eos_token_id=1, **kwargs)

The LongT5 model was proposed in LongT5: Efficient Text-To-Text Transformer for Long Sequences by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung and Yinfei Yang. It’s an encoder-decoder transformer pre-trained in a text-to-text denoising generative setting. The LongT5 model is an extension of the T5 model, and it enables using one of two different efficient attention mechanisms: (1) Local attention, or (2) Transient-Global attention.

The abstract from the paper is the following:

Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call {em Transient Global} (TGlobal), which mimics ETC’s local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.

Tips:

  • LongT5ForConditionalGeneration is an extension of T5ForConditionalGeneration exchanging the traditional encoder self-attention layer with either efficient local attention or transient-global (tglobal) attention.

  • Unlike the T5 model, LongT5 does not use a task prefix. Furthermore, it uses a different pre-training objective inspired by the pre-training of PegasusForConditionalGeneration.

  • The LongT5 model is designed to work efficiently and very well on long-range sequence-to-sequence tasks where the input sequence exceeds the commonly used 512 tokens. It is capable of handling input sequences of a length up to 16,384 tokens.

  • For Local Attention, the sparse sliding-window local attention operation allows a given token to attend only r tokens to the left and right of it (with r=127 by default). Local Attention does not introduce any new parameters to the model. The complexity of the mechanism is linear in input sequence length l: O(l*r).

  • Transient Global Attention is an extension of the Local Attention. It, furthermore, allows each input token to interact with all other tokens in the layer. This is achieved via splitting an input sequence into blocks of a fixed length k (with a default k=16). Then, a global token for such a block is obtained via summing and normalizing the embeddings of every token in the block. Thanks to this, the attention allows each token to attend to both nearby tokens like in Local attention, and also every global token like in the case of standard global attention (transient represents the fact the global tokens are constructed dynamically within each attention operation). As a consequence, TGlobal attention introduces a few new parameters – global relative position biases and a layer normalization for global token’s embedding. The complexity of this mechanism is O(l(r + l/k)).

  • An example showing how to evaluate a fine-tuned LongT5 model on the pubmed dataset is below.

>>> import evaluate
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer, LongT5ForConditionalGeneration
>>> dataset = load_dataset("scientific_papers", "pubmed", split="validation")
>>> model = (
...     LongT5ForConditionalGeneration.from_pretrained("Stancld/longt5-tglobal-large-16384-pubmed-3k_steps")
...     .to("cuda")
...     .half()
... )
>>> tokenizer = AutoTokenizer.from_pretrained("Stancld/longt5-tglobal-large-16384-pubmed-3k_steps")
>>> def generate_answers(batch):
...     inputs_dict = tokenizer(
...         batch["article"], max_length=16384, padding="max_length", truncation=True, return_tensors="pt"
...     )
...     input_ids = inputs_dict.input_ids.to("cuda")
...     attention_mask = inputs_dict.attention_mask.to("cuda")
...     output_ids = model.generate(input_ids, attention_mask=attention_mask, max_length=512, num_beams=2)
...     batch["predicted_abstract"] = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
...     return batch
>>> result = dataset.map(generate_answers, batched=True, batch_size=2)
>>> rouge = evaluate.load("rouge")
>>> rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"])

This model was contributed by stancld. The original code can be found here.

Arguments:
vocab_size (int, optional, defaults to 32128):

Vocabulary size of the LongT5 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling LongT5Model.

d_model (int, optional, defaults to 512):

Size of the encoder layers and the pooler layer.

d_kv (int, optional, defaults to 64):

Size of the key, query, value projections per attention head. d_kv has to be equal to d_model // num_heads.

d_ff (int, optional, defaults to 2048):

Size of the intermediate feed forward layer in each LongT5Block.

num_layers (int, optional, defaults to 6):

Number of hidden layers in the Transformer encoder.

num_decoder_layers (int, optional):

Number of hidden layers in the Transformer decoder. Will use the same value as num_layers if not set.

num_heads (int, optional, defaults to 8):

Number of attention heads for each attention layer in the Transformer encoder.

local_radius (int, optional, defaults to 127):

Number of tokens to the left/right for each token to locally self-attend in a local attention mechanism.

global_block_size (int, optional, defaults to 16):

Length of blocks an input sequence is divided into for a global token representation. Used only for encoder_attention_type = “transient-global”.

relative_attention_num_buckets (int, optional, defaults to 32):

The number of buckets to use for each attention layer.

relative_attention_max_distance (int, optional, defaults to 128):

The maximum distance of the longer sequences for the bucket separation.

dropout_rate (float, optional, defaults to 0.1):

The ratio for all dropout layers.

layer_norm_eps (float, optional, defaults to 1e-6):

The epsilon used by the layer normalization layers.

initializer_factor (float, optional, defaults to 1):

A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

feed_forward_proj (string, optional, defaults to “relu”):

Type of feed forward layer to be used. Should be one of “relu” or “gated-gelu”. LongT5v1.1 uses the “gated-gelu” feed forward projection. Original LongT5 implementation uses “gated-gelu”.

encoder_attention_type (string, optional, defaults to “local”):

Type of encoder attention to be used. Should be one of “local” or “transient-global”, which are supported by LongT5 implementation.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).

class transformers.models.luke.configuration_luke.LukeConfig(vocab_size=50267, entity_vocab_size=500000, hidden_size=768, entity_emb_size=256, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, use_entity_aware_attention=True, classifier_dropout=None, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs)

The LUKE model was proposed in LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda and Yuji Matsumoto. It is based on RoBERTa and adds entity embeddings as well as an entity-aware self-attention mechanism, which helps improve performance on various downstream tasks involving reasoning about entities such as named entity recognition, extractive and cloze-style question answering, entity typing, and relation classification.

The abstract from the paper is the following:

Entity representations are useful in natural language tasks involving entities. In this paper, we propose new pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed model treats words and entities in a given text as independent tokens, and outputs contextualized representations of them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question answering).

Tips:

  • This implementation is the same as RobertaModel with the addition of entity embeddings as well as an entity-aware self-attention mechanism, which improves performance on tasks involving reasoning about entities.

  • LUKE treats entities as input tokens; therefore, it takes entity_ids, entity_attention_mask, entity_token_type_ids and entity_position_ids as extra input. You can obtain those using LukeTokenizer; a minimal sketch is shown after these tips.

  • LukeTokenizer takes entities and entity_spans (character-based start and end positions of the entities in the input text) as extra input. entities typically consist of [MASK] entities or Wikipedia entities. A brief description of how to input these entities is as follows:

    • Inputting [MASK] entities to compute entity representations: The [MASK] entity is used to mask entities to be predicted during pretraining. When LUKE receives the [MASK] entity, it tries to predict the original entity by gathering the information about the entity from the input text. Therefore, the [MASK] entity can be used to address downstream tasks requiring the information of entities in text such as entity typing, relation classification, and named entity recognition.

    • Inputting Wikipedia entities to compute knowledge-enhanced token representations: LUKE learns rich information (or knowledge) about Wikipedia entities during pretraining and stores the information in its entity embedding. By using Wikipedia entities as input tokens, LUKE outputs token representations enriched by the information stored in the embeddings of these entities. This is particularly effective for tasks requiring real-world knowledge, such as question answering.

  • There are three head models for the former use case:

    • LukeForEntityClassification, for tasks to classify a single entity in an input text such as entity typing, e.g. the Open Entity dataset. This model places a linear head on top of the output entity representation.

    • LukeForEntityPairClassification, for tasks to classify the relationship between two entities such as relation classification, e.g. the TACRED dataset. This model places a linear head on top of the concatenated output representation of the pair of given entities.

    • LukeForEntitySpanClassification, for tasks to classify the sequence of entity spans, such as named entity recognition (NER). This model places a linear head on top of the output entity representations. You can address NER using this model by inputting all possible entity spans in the text to the model.

    LukeTokenizer has a task argument, which enables you to easily create an input to these head models by specifying task="entity_classification", task="entity_pair_classification", or task="entity_span_classification". Please refer to the example code of each head model.

    A demo notebook on how to fine-tune LukeForEntityPairClassification for relation classification can be found here.

    There are also 3 notebooks available, which showcase how you can reproduce the results as reported in the paper with the HuggingFace implementation of LUKE. They can be found here.
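
As a hedged sketch of the entity inputs described above (the studio-ousia/luke-base checkpoint and the example sentence are illustrative choices):

from transformers import LukeTokenizer, LukeModel

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

text = "Beyoncé lives in Los Angeles."
entity_spans = [(0, 7), (17, 28)]  # character spans of "Beyoncé" and "Los Angeles"

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
outputs = model(**inputs)

word_representations = outputs.last_hidden_state            # contextualized word tokens
entity_representations = outputs.entity_last_hidden_state   # contextualized entity tokens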

class transformers.models.m2m_100.configuration_m2m_100.M2M100Config(vocab_size=128112, max_position_embeddings=1024, encoder_layers=12, encoder_ffn_dim=4096, encoder_attention_heads=16, decoder_layers=12, decoder_ffn_dim=4096, decoder_attention_heads=16, encoder_layerdrop=0.05, decoder_layerdrop=0.05, use_cache=True, is_encoder_decoder=True, activation_function='relu', d_model=1024, dropout=0.1, attention_dropout=0.1, activation_dropout=0.0, init_std=0.02, decoder_start_token_id=2, scale_embedding=True, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs)

The M2M100 model was proposed in Beyond English-Centric Multilingual Machine Translation by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.

The abstract from the paper is the following:

Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.

This model was contributed by valhalla.
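
As a hedged usage sketch (the facebook/m2m100_418M checkpoint and the English-to-French direction are illustrative choices, not prescriptions from this page):

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "en"
encoded = tokenizer("Life is like a box of chocolates.", return_tensors="pt")

# Force the target language by setting the first generated token to the French language id
generated_tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))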

Args:
vocab_size (int, optional, defaults to 128112):

Vocabulary size of the M2M100 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling M2M100Model.

d_model (int, optional, defaults to 1024):

Dimensionality of the layers and the pooler layer.

encoder_layers (int, optional, defaults to 12):

Number of encoder layers.

decoder_layers (int, optional, defaults to 12):

Number of decoder layers.

encoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

decoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer decoder.

decoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in decoder.

encoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in encoder.

activation_function (str or function, optional, defaults to “relu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

activation_dropout (float, optional, defaults to 0.0):

The dropout ratio for activations inside the fully connected layer.

classifier_dropout (float, optional, defaults to 0.0):

The dropout ratio for classifier.

max_position_embeddings (int, optional, defaults to 1024):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

encoder_layerdrop (float, optional, defaults to 0.05):

The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

decoder_layerdrop (float, optional, defaults to 0.05):

The LayerDrop probability for the decoder. See the LayerDrop paper for more details.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).

class transformers.models.mamba.configuration_mamba.MambaConfig(vocab_size=50280, hidden_size=768, state_size=16, num_hidden_layers=32, layer_norm_epsilon=1e-05, pad_token_id=0, bos_token_id=0, eos_token_id=0, expand=2, conv_kernel=4, use_bias=False, use_conv_bias=True, hidden_act='silu', initializer_range=0.1, residual_in_fp32=True, time_step_rank='auto', time_step_scale=1.0, time_step_min=0.001, time_step_max=0.1, time_step_init_scheme='random', time_step_floor=0.0001, rescale_prenorm_residual=False, use_cache=True, **kwargs)

The Mamba model was proposed in Mamba: Linear-Time Sequence Modeling with Selective State Spaces by Albert Gu and Tri Dao.

This model is a new paradigm architecture based on state-space-models. You can read more about the intuition behind these here.

The abstract from the paper is the following:

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers’ computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.

Tips:

  • Mamba is a new state space model architecture that rivals the classic Transformers. It is based on the line of progress on structured state space models, with an efficient hardware-aware design and implementation in the spirit of FlashAttention.

  • Mamba stacks mixer layers, which are the equivalent of Attention layers. The core logic of mamba is held in the MambaMixer class.

  • Two implementations cohabit: one is optimized and uses fast cuda kernels, while the other one is naive but can run on any device!

  • The current implementation leverages the original cuda kernels: the equivalent of flash attention for Mamba is hosted in the mamba-ssm (https://github.com/state-spaces/mamba) and causal_conv1d (https://github.com/Dao-AILab/causal-conv1d) repositories. Make sure to install them if your hardware supports them!

  • Contributions to make the naive path faster are welcome 🤗

This model was contributed by ArthurZ. The original code can be found here.

Usage
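
A minimal generation sketch, assuming the state-spaces/mamba-130m-hf checkpoint as an illustrative choice:

from transformers import AutoTokenizer, MambaForCausalLM

tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-130m-hf")
model = MambaForCausalLM.from_pretrained("state-spaces/mamba-130m-hf")

input_ids = tokenizer("Hey how are you doing?", return_tensors="pt")["input_ids"]
output_ids = model.generate(input_ids, max_new_tokens=10)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))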

Args:
vocab_size (int, optional, defaults to 50280):

Vocabulary size of the MAMBA model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling MambaModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the embeddings and hidden states.

state_size (int, optional, defaults to 16):

Shape of the state space latents.

num_hidden_layers (int, optional, defaults to 32):

Number of hidden layers in the model.

layer_norm_epsilon (float, optional, defaults to 1e-05):

The epsilon to use in the layer normalization layers.

pad_token_id (int, optional, defaults to 0):

Padding token id.

bos_token_id (int, optional, defaults to 0):

The id of the beginning of sentence token in the vocabulary.

eos_token_id (int, optional, defaults to 0):

The id of the end of sentence token in the vocabulary.

expand (int, optional, defaults to 2):

Expanding factor used to determine the intermediate size.

conv_kernel (int, optional, defaults to 4):

Size of the convolution kernel.

use_bias (bool, optional, defaults to False):

Whether or not to use bias in [“in_proj”, “out_proj”] of the mixer block.

use_conv_bias (bool, optional, defaults to True):

Whether or not to use bias in the convolution layer of the mixer block.

hidden_act (str, optional, defaults to “silu”):

The non-linear activation function (function or string) in the decoder.

initializer_range (float, optional, defaults to 0.1):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

residual_in_fp32 (bool, optional, defaults to True):

Whether or not residuals should be in float32. If set to False residuals will keep the same dtype as the rest of the model

time_step_rank (Union[int,str], optional, defaults to “auto”):

Rank of the discretization projection matrix. “auto” means that it will default to math.ceil(self.hidden_size / 16).

time_step_scale (float, optional, defaults to 1.0):

Scale used to scale dt_proj.bias.

time_step_min (float, optional, defaults to 0.001):

Minimum time_step used to bound dt_proj.bias.

time_step_max (float, optional, defaults to 0.1):

Maximum time_step used to bound dt_proj.bias.

time_step_init_scheme (str, optional, defaults to “random”):

Init scheme used for dt_proj.weight. Should be one of [“random”, “uniform”].

time_step_floor (float, optional, defaults to 0.0001):

Minimum clamping value of the dt_proj.bias layer initialization.

rescale_prenorm_residual (bool, optional, defaults to False):

Whether or not to rescale out_proj weights when initializing.

use_cache (bool, optional, defaults to True):

Whether or not the cache should be used.

class transformers.models.marian.configuration_marian.MarianConfig(vocab_size=58101, decoder_vocab_size=None, max_position_embeddings=1024, encoder_layers=12, encoder_ffn_dim=4096, encoder_attention_heads=16, decoder_layers=12, decoder_ffn_dim=4096, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, use_cache=True, is_encoder_decoder=True, activation_function='gelu', d_model=1024, dropout=0.1, attention_dropout=0.0, activation_dropout=0.0, init_std=0.02, decoder_start_token_id=58100, scale_embedding=False, pad_token_id=58100, eos_token_id=0, forced_eos_token_id=0, share_encoder_decoder_embeddings=True, **kwargs)

A framework for translation models, using the same models as BART. Translations should be similar, but not identical to output in the test set linked to in each model card. This model was contributed by sshleifer.
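
A hedged translation sketch; the Helsinki-NLP/opus-mt-en-de checkpoint is just one example of the many released Marian checkpoints:

from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"  # example checkpoint (English to German)
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["The quick brown fox jumps over the lazy dog."], return_tensors="pt")
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))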

Args:
vocab_size (int, optional, defaults to 58101):

Vocabulary size of the Marian model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling MarianModel or TFMarianModel.

d_model (int, optional, defaults to 1024):

Dimensionality of the layers and the pooler layer.

encoder_layers (int, optional, defaults to 12):

Number of encoder layers.

decoder_layers (int, optional, defaults to 12):

Number of decoder layers.

encoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

decoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer decoder.

decoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in decoder.

encoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in encoder.

activation_function (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

activation_dropout (float, optional, defaults to 0.0):

The dropout ratio for activations inside the fully connected layer.

max_position_embeddings (int, optional, defaults to 1024):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

encoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

decoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the decoder. See the LayerDrop paper for more details.

scale_embedding (bool, optional, defaults to False):

Scale embeddings by dividing by sqrt(d_model).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models)

forced_eos_token_id (int, optional, defaults to 0):

The id of the token to force as the last generated token when max_length is reached. Usually set to eos_token_id.
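
For reference, a configuration like the one documented above can be instantiated directly in Python and passed to the corresponding model class; the parameter values in this sketch are purely illustrative, not recommended settings:

>>> from transformers import MarianConfig, MarianModel
>>> # Illustrative small Marian configuration; all other parameters keep the defaults shown above
>>> configuration = MarianConfig(d_model=256, encoder_layers=2, decoder_layers=2, encoder_attention_heads=4, decoder_attention_heads=4)
>>> model = MarianModel(configuration)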

class transformers.models.markuplm.configuration_markuplm.MarkupLMConfig(vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=0, bos_token_id=0, eos_token_id=2, max_xpath_tag_unit_embeddings=256, max_xpath_subs_unit_embeddings=1024, tag_pad_id=216, subs_pad_id=1001, xpath_unit_hidden_size=32, max_depth=50, position_embedding_type='absolute', use_cache=True, classifier_dropout=None, **kwargs)

The MarkupLM model was proposed in MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. MarkupLM is BERT, but applied to HTML pages instead of raw text documents. The model incorporates additional embedding layers to improve performance, similar to LayoutLM.

The model can be used for tasks like question answering on web pages or information extraction from web pages. It obtains state-of-the-art results on 2 important benchmarks:

  • WebSRC, a dataset for Web-Based Structural Reading Comprehension (a bit like SQuAD but for web pages)

  • SWDE, a dataset for information extraction from web pages (basically named-entity recognition on web pages)

The abstract from the paper is the following:

Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU), especially the fixed-layout documents such as scanned document images. While, there are still a large number of digital documents where the layout information is not fixed and needs to be interactively and dynamically rendered for visualization, making existing layout-based pre-training approaches not easy to apply. In this paper, we propose MarkupLM for document understanding tasks with markup languages as the backbone such as HTML/XML-based documents, where text and markup information is jointly pre-trained. Experiment results show that the pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding tasks. The pre-trained model and code will be publicly available.

Tips:

  • In addition to input_ids, ~MarkupLMModel.forward expects 2 additional inputs, namely xpath_tags_seq and xpath_subs_seq. These are the XPATH tags and subscripts respectively for each token in the input sequence.

  • One can use MarkupLMProcessor to prepare all data for the model. Refer to the usage guide for more info.

  • Demo notebooks can be found here.

MarkupLM architecture. Taken from the original paper (https://arxiv.org/abs/2110.08518).

This model was contributed by nielsr. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the MarkupLM model. Defines the number of different tokens that can be represented by the inputs_ids passed to the forward method of MarkupLMModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed into MarkupLMModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

max_tree_id_unit_embeddings (int, optional, defaults to 1024):

The maximum value that the tree id unit embedding might ever use. Typically set this to something large just in case (e.g., 1024).

max_xpath_tag_unit_embeddings (int, optional, defaults to 256):

The maximum value that the xpath tag unit embedding might ever use. Typically set this to something large just in case (e.g., 256).

max_xpath_subs_unit_embeddings (int, optional, defaults to 1024):

The maximum value that the xpath subscript unit embedding might ever use. Typically set this to something large just in case (e.g., 1024).

tag_pad_id (int, optional, defaults to 216):

The id of the padding token in the xpath tags.

subs_pad_id (int, optional, defaults to 1001):

The id of the padding token in the xpath subscripts.

xpath_tag_unit_hidden_size (int, optional, defaults to 32):

The hidden size of each tree id unit; one complete tree index will have a dimension of (50 * xpath_tag_unit_hidden_size).

max_depth (int, optional, defaults to 50):

The maximum depth in xpath.
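
As a minimal sketch (values chosen only for illustration), the xpath-related parameters documented above can be set directly on the configuration:

>>> from transformers import MarkupLMConfig, MarkupLMModel
>>> # Illustrative values; max_depth and xpath_unit_hidden_size control the xpath embeddings
>>> configuration = MarkupLMConfig(hidden_size=256, num_hidden_layers=2, num_attention_heads=4, intermediate_size=512, max_depth=50, xpath_unit_hidden_size=32)
>>> model = MarkupLMModel(configuration)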

class transformers.models.mbart.configuration_mbart.MBartConfig(vocab_size=50265, max_position_embeddings=1024, encoder_layers=12, encoder_ffn_dim=4096, encoder_attention_heads=16, decoder_layers=12, decoder_ffn_dim=4096, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, use_cache=True, is_encoder_decoder=True, activation_function='gelu', d_model=1024, dropout=0.1, attention_dropout=0.0, activation_dropout=0.0, init_std=0.02, classifier_dropout=0.0, scale_embedding=False, pad_token_id=1, bos_token_id=0, eos_token_id=2, forced_eos_token_id=2, **kwargs)

The MBart model was presented in Multilingual Denoising Pre-training for Neural Machine Translation by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.

According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual corpora in many languages using the BART objective. mBART is one of the first methods for pretraining a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text.

This model was contributed by valhalla. The authors’ code can be found here.

Args:
vocab_size (int, optional, defaults to 50265):

Vocabulary size of the MBART model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling MBartModel or TFMBartModel.

d_model (int, optional, defaults to 1024):

Dimensionality of the layers and the pooler layer.

encoder_layers (int, optional, defaults to 12):

Number of encoder layers.

decoder_layers (int, optional, defaults to 12):

Number of decoder layers.

encoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

decoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer decoder.

decoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in decoder.

encoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in the encoder.

activation_function (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

activation_dropout (float, optional, defaults to 0.0):

The dropout ratio for activations inside the fully connected layer.

classifier_dropout (float, optional, defaults to 0.0):

The dropout ratio for classifier.

max_position_embeddings (int, optional, defaults to 1024):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

encoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

decoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the decoder. See the LayerDrop paper for more details.

scale_embedding (bool, optional, defaults to False):

Scale embeddings by dividing by sqrt(d_model).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models)

forced_eos_token_id (int, optional, defaults to 2):

The id of the token to force as the last generated token when max_length is reached. Usually set to eos_token_id.
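
A minimal sketch of building the configuration above directly; the values are illustrative, and everything not set here keeps the defaults from the signature:

>>> from transformers import MBartConfig
>>> # Illustrative values only
>>> configuration = MBartConfig(encoder_layers=2, decoder_layers=2, scale_embedding=True, forced_eos_token_id=2)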

class transformers.models.mega.configuration_mega.MegaConfig(vocab_size=30522, hidden_size=128, num_hidden_layers=4, intermediate_size=256, ema_projection_size=16, bidirectional=True, shared_representation_size=64, use_chunking=False, chunk_size=-1, truncation=None, normalize_before_mega=True, normalization_type='scalenorm', norm_affine=True, activation='silu', attention_activation='softmax', dropout_prob=0.1, hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, use_feature_dropout=False, use_normalized_ffn=True, nffn_hidden_size=256, normalize_before_ffn=True, nffn_activation_dropout_prob=0.1, max_positions=2048, add_token_type_embeddings=False, type_vocab_size=2, initializer_range=0.02, ema_delta_alpha_range=0.2, ema_beta_range=0.02, ema_gamma_omega_range=1.0, pad_token_id=1, bos_token_id=0, eos_token_id=2, relative_positional_bias='rotary', classifier_dropout=None, use_cache=True, add_lm_hidden_dense_layer=True, **kwargs)

The MEGA model was proposed in Mega: Moving Average Equipped Gated Attention by Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. MEGA proposes a new approach to self-attention with each encoder layer having a multi-headed exponential moving average in addition to a single head of standard dot-product attention, giving the attention mechanism stronger positional biases. This allows MEGA to perform competitively to Transformers on standard benchmarks including LRA while also having significantly fewer parameters. MEGA’s compute efficiency allows it to scale to very long sequences, making it an attractive option for long-document NLP tasks.

The abstract from the paper is the following:

The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple chunks with fixed length. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.

Tips:

  • MEGA can perform quite well with relatively few parameters. See Appendix D in the MEGA paper for examples of architectural specs which perform well in various settings. If using MEGA as a decoder, be sure to set bidirectional=False to avoid errors with the default bidirectional setting.

  • Mega-chunk is a variant of MEGA that reduces time and space complexity from quadratic to linear. Utilize chunking with MegaConfig.use_chunking and control chunk size with MegaConfig.chunk_size.

This model was contributed by mnaylor. The original code can be found here.

Implementation Notes:

  • The original implementation of MEGA had an inconsistent expectation of attention masks for padding and causal self-attention between the softmax attention and Laplace/squared ReLU method. This implementation addresses that inconsistency.

  • The original implementation did not include token type embeddings; this implementation adds support for these, with the option controlled by MegaConfig.add_token_type_embeddings

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the Mega model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling MegaModel.

hidden_size (int, optional, defaults to 128):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 4):

Number of hidden layers in the Mega encoder.

intermediate_size (int, optional, defaults to 256):

Dimensionality of the hidden size (self-attention value projection) within the Mega encoder

ema_projection_size (int, optional, defaults to 16):

Dimensionality of the MegaMultiDimensionDampedEma

bidirectional (bool, optional, defaults to True):

Whether the MegaMultiDimensionDampedEma used in Mega’s self-attention should work bidirectionally (True) or unidirectionally (False). Bidirectional EMA is incompatible with causal decoding, so this should be False if you intend to use the model as a decoder.

shared_representation_size (int, optional, defaults to 64):

Dimensionality of the linear projection for shared representation of self-attention queries and keys

use_chunking (bool, optional, defaults to False):

Whether to chunk inputs for linear self-attention complexity (described as Mega-chunk in the paper)

chunk_size (int, optional, defaults to -1):

If use_chunking is set to True, determines the size of the chunks to apply to the input sequence. If chunking is used, input sequences must be padded to a multiple of chunk_size

truncation (int, optional):

If specified, the sequence length for which to truncate MegaMultiDimensionDampedEma

normalize_before_mega (bool, optional, defaults to True):

Whether to normalize before (True) or after (False) passing through Mega encoder blocks

normalization_type (str, optional, defaults to “scalenorm”):

Type of normalization to use in Mega encoder blocks. Choose one of “scalenorm”, “layernorm”, “rmsnorm”, “batchnorm”, or “syncbatchnorm” (GPU required for syncbatchnorm)

norm_affine (bool, optional, defaults to True):

If True, applies a parameterized affine transformation to inputs during normalization

activation (str, optional, defaults to “silu”):

Activation function to apply within Mega encoder blocks. Choose one of “silu”, “relu”, “linear”, “gelu”, or “gelu_accurate”

attention_activation (str, optional, defaults to “softmax”):

Activation function to apply for single-headed self-attention (a la Transformer). Choose one of “softmax”, “laplace”, or “relu2”

dropout_prob (float, optional, defaults to 0.1):

The dropout probability for EMA self-attention

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

use_feature_dropout (bool, optional, defaults to False):

Whether to use feature-based (True) or standard dropout (False)

use_normalized_ffn (bool, optional, defaults to True):

Whether to use the normalized feed-forward sub-layer in Mega blocks (True) or pass Mega encoder output as-is (False)

nffn_hidden_size (int, optional, defaults to 256):

If using the normalized feed-forward network (NFFN) layer within Mega (use_normalized_ffn = True), this is the hidden size of the NFFN

normalize_before_ffn (bool, optional, defaults to True):

Whether to normalize before (True) or after (False) the feed-forward portion of NFFN

nffn_activation_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the NFFN component.

max_positions (int, optional, defaults to 2048):

The maximum sequence length to use for positional representations. For “simple” relative positional bias, this is a hard limit on input length; “rotary” relative positional bias will extrapolate to longer sequences

add_token_type_embeddings (bool, optional, defaults to False):

Whether to account for token types in embeddings. Left as optional to maintain compatibility with original implementation while adding support for token types.

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling MegaModel. Only used if add_token_type_embeddings = True

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

ema_delta_alpha_range (float, optional, defaults to 0.2):

The standard deviation for initializing the delta (damping factor) and alpha (decay factor) parameters in MegaMultiDimensionDampedEma.

ema_beta_range (float, optional, defaults to 0.02):

The standard deviation for initializing the beta parameter (expansion matrix) in MegaMultiDimensionDampedEma.

ema_gamma_omega_range (float, optional, defaults to 1.0):

The standard deviation for initializing the gamma (projection matrix) and omega (residual weight) parameters in MegaMultiDimensionDampedEma.

relative_positional_bias (str, optional, defaults to “rotary”):

Type of relative positional encoding. Choose one of “rotary” or “simple”. If “simple” is selected, max_positions is used as a limit on input size, while “rotary” extrapolates beyond max_positions.

is_decoder (bool, optional, defaults to False):

Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

classifier_dropout (float, optional):

The dropout ratio for the classification head.

add_lm_hidden_dense_layer (bool, optional, defaults to True):

Whether to include a hidden layer for projection between encoder outputs and LM heads (True) or pass hidden states directly to LM head (False). Remains optional for compatibility with original implementation
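
Following the tips above, a decoder-style MEGA configuration disables the bidirectional EMA. This sketch uses illustrative values and assumes an installed transformers version that still ships the MEGA architecture:

>>> from transformers import MegaConfig
>>> # Illustrative decoder-style setup: bidirectional EMA is incompatible with causal decoding
>>> configuration = MegaConfig(hidden_size=128, num_hidden_layers=2, is_decoder=True, bidirectional=False)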

class transformers.models.megatron_bert.configuration_megatron_bert.MegatronBertConfig(vocab_size=29056, hidden_size=1024, num_hidden_layers=24, num_attention_heads=16, intermediate_size=4096, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=0, position_embedding_type='absolute', use_cache=True, **kwargs)

The MegatronBERT model was proposed in Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.

The abstract from the paper is the following:

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complimentary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).

Tips:

We have provided pretrained BERT-345M checkpoints for use in evaluating or fine-tuning downstream tasks.

To access these checkpoints, first sign up for and set up the NVIDIA GPU Cloud (NGC) Registry CLI. Further documentation for downloading models can be found in the NGC documentation.

Alternatively, you can directly download the checkpoints using:

BERT-345M-uncased:

`wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_uncased/zip -O megatron_bert_345m_v0_1_uncased.zip`

BERT-345M-cased:

`wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_cased/zip -O megatron_bert_345m_v0_1_cased.zip`

Once you have obtained the checkpoints from NVIDIA GPU Cloud (NGC), you have to convert them to a format that will easily be loaded by Hugging Face Transformers and our port of the BERT code.

The following commands allow you to do the conversion. We assume that the folder models/megatron_bert contains megatron_bert_345m_v0_1_{cased, uncased}.zip and that the commands are run from inside that folder:

`python3 $PATH_TO_TRANSFORMERS/models/megatron_bert/convert_megatron_bert_checkpoint.py megatron_bert_345m_v0_1_uncased.zip`

`python3 $PATH_TO_TRANSFORMERS/models/megatron_bert/convert_megatron_bert_checkpoint.py megatron_bert_345m_v0_1_cased.zip`

This model was contributed by jdemouth. The original code can be found here. That repository contains a multi-GPU and multi-node implementation of the Megatron Language models. In particular, it contains a hybrid model parallel approach using “tensor parallel” and “pipeline parallel” techniques.

Args:
vocab_size (int, optional, defaults to 29056):

Vocabulary size of the MEGATRON_BERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling MegatronBertModel.

hidden_size (int, optional, defaults to 1024):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 24):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling MegatronBertModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”. For positional embeddings use “absolute”. For more information on “relative_key”, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on “relative_key_query”, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

is_decoder (bool, optional, defaults to False):

Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
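
For illustration, the configuration can be instantiated with values much smaller than the defaults above (hidden_size=1024, num_hidden_layers=24); this is a sketch, not a substitute for the released checkpoints:

>>> from transformers import MegatronBertConfig, MegatronBertModel
>>> # Illustrative small configuration
>>> configuration = MegatronBertConfig(hidden_size=256, num_hidden_layers=2, num_attention_heads=4, intermediate_size=512)
>>> model = MegatronBertModel(configuration)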

class transformers.models.mixtral.configuration_mixtral.MixtralConfig(vocab_size=32000, hidden_size=4096, intermediate_size=14336, num_hidden_layers=32, num_attention_heads=32, num_key_value_heads=8, hidden_act='silu', max_position_embeddings=131072, initializer_range=0.02, rms_norm_eps=1e-05, use_cache=True, pad_token_id=None, bos_token_id=1, eos_token_id=2, tie_word_embeddings=False, rope_theta=1000000.0, sliding_window=None, attention_dropout=0.0, num_experts_per_tok=2, num_local_experts=8, output_router_logits=False, router_aux_loss_coef=0.001, **kwargs)

Mixtral-8x7B is Mistral AI’s second Large Language Model (LLM).

The Mixtral model was proposed by the Mistral AI team.

It was introduced in the Mixtral of Experts blogpost with the following introduction:

Today, the team is proud to release Mixtral 8x7B, a high-quality sparse mixture of experts model (SMoE) with open weights. Licensed under Apache 2.0. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference. It is the strongest open-weight model with a permissive license and the best model overall regarding cost/performance trade-offs. In particular, it matches or outperforms GPT3.5 on most standard benchmarks.

Tips:

  • The model needs to be converted using the conversion script.

  • If the model is quantized to 4bits, a single A100 is enough to fit the entire 45B model.

This model was contributed by Younes Belkada and Arthur Zucker. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 32000):

Vocabulary size of the Mixtral model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling MixtralModel

hidden_size (int, optional, defaults to 4096):

Dimension of the hidden representations.

intermediate_size (int, optional, defaults to 14336):

Dimension of the MLP representations.

num_hidden_layers (int, optional, defaults to 32):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 32):

Number of attention heads for each attention layer in the Transformer encoder.

num_key_value_heads (int, optional, defaults to 8):

This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper. If it is not specified, will default to 8.

hidden_act (str or function, optional, defaults to “silu”):

The non-linear activation function (function or string) in the decoder.

max_position_embeddings (int, optional, defaults to 4096*32):

The maximum sequence length that this model might ever be used with. Mixtral’s sliding window attention allows sequence of up to 4096*32 tokens.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

rms_norm_eps (float, optional, defaults to 1e-05):

The epsilon used by the rms normalization layers.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

pad_token_id (int, optional):

The id of the padding token.

bos_token_id (int, optional, defaults to 1):

The id of the “beginning-of-sequence” token.

eos_token_id (int, optional, defaults to 2):

The id of the “end-of-sequence” token.

tie_word_embeddings (bool, optional, defaults to False):

Whether the model’s input and output word embeddings should be tied.

rope_theta (float, optional, defaults to 1000000.0):

The base period of the RoPE embeddings.

sliding_window (int, optional):

Sliding window attention window size. If not specified, will default to 4096.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

num_experts_per_tok (int, optional, defaults to 2):

The number of experts to route per token; can also be interpreted as the top-p routing parameter.

num_local_experts (int, optional, defaults to 8):

Number of experts per Sparse MLP layer.

output_router_logits (bool, optional, defaults to False):

Whether or not the router logits should be returned by the model. Enabling this will also allow the model to output the auxiliary loss.

router_aux_loss_coef (float, optional, defaults to 0.001):

The aux loss factor for the total loss.

>>> from transformers import MixtralModel, MixtralConfig
>>> # Initializing a Mixtral 7B style configuration
>>> configuration = MixtralConfig()
>>> # Initializing a model from the Mixtral 7B style configuration
>>> model = MixtralModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
class transformers.models.mobilebert.configuration_mobilebert.MobileBertConfig(vocab_size=30522, hidden_size=512, num_hidden_layers=24, num_attention_heads=4, intermediate_size=512, hidden_act='relu', hidden_dropout_prob=0.0, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=0, embedding_size=128, trigram_input=True, use_bottleneck=True, intra_bottleneck_size=128, use_bottleneck_attention=False, key_query_shared_bottleneck=True, num_feedforward_networks=4, normalization_type='no_norm', classifier_activation=True, classifier_dropout=None, **kwargs)

The MobileBERT model was proposed in MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. It’s a bidirectional transformer based on the BERT model, which is compressed and accelerated using several approaches.

The abstract from the paper is the following:

Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known benchmarks. On the natural language inference tasks of GLUE, MobileBERT achieves a GLUE score of 77.7 (0.6 lower than BERT_BASE), and 62 ms latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT_BASE).

Tips:

  • MobileBERT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

  • MobileBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained with a causal language modeling (CLM) objective are better in that regard.

This model was contributed by vshampor. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the MobileBERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling MobileBertModel or TFMobileBertModel.

hidden_size (int, optional, defaults to 512):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 24):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 4):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 512):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “relu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.0):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling MobileBertModel or TFMobileBertModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

pad_token_id (int, optional, defaults to 0):

The ID of the token in the word embedding to use as padding.

embedding_size (int, optional, defaults to 128):

The dimension of the word embedding vectors.

trigram_input (bool, optional, defaults to True):

Use a convolution of trigram as input.

use_bottleneck (bool, optional, defaults to True):

Whether to use bottleneck in BERT.

intra_bottleneck_size (int, optional, defaults to 128):

Size of bottleneck layer output.

use_bottleneck_attention (bool, optional, defaults to False):

Whether to use attention inputs from the bottleneck transformation.

key_query_shared_bottleneck (bool, optional, defaults to True):

Whether to use the same linear transformation for query&key in the bottleneck.

num_feedforward_networks (int, optional, defaults to 4):

Number of FFNs in a block.

normalization_type (str, optional, defaults to “no_norm”):

The normalization type in MobileBERT.

classifier_dropout (float, optional):

The dropout ratio for the classification head.
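
A minimal sketch with illustrative values for the bottleneck-related parameters documented above:

>>> from transformers import MobileBertConfig, MobileBertModel
>>> # Illustrative values; use_bottleneck and intra_bottleneck_size are documented above
>>> configuration = MobileBertConfig(num_hidden_layers=4, use_bottleneck=True, intra_bottleneck_size=128, num_feedforward_networks=4)
>>> model = MobileBertModel(configuration)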

class transformers.models.mpnet.configuration_mpnet.MPNetConfig(vocab_size=30527, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, initializer_range=0.02, layer_norm_eps=1e-12, relative_attention_num_buckets=32, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs)

The MPNet model was proposed in MPNet: Masked and Permuted Pre-training for Language Understanding by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.

MPNet adopts a novel pre-training method, named masked and permuted language modeling, to inherit the advantages of masked language modeling and permuted language modeling for natural language understanding.

The abstract from the paper is the following:

BERT adopts masked language modeling (MLM) for pre-training and is one of the most successful pre-training models. Since BERT neglects dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pre-training to address this problem. However, XLNet does not leverage the full position information of a sentence and thus suffers from position discrepancy between pre-training and fine-tuning. In this paper, we propose MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations. MPNet leverages the dependency among predicted tokens through permuted language modeling (vs. MLM in BERT), and takes auxiliary position information as input to make the model see a full sentence and thus reducing the position discrepancy (vs. PLM in XLNet). We pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of down-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting.

Tips:

  • MPNet doesn’t have token_type_ids, so you don’t need to indicate which token belongs to which segment; just separate your segments with the separation token tokenizer.sep_token (or [sep]).

The original code can be found at https://github.com/microsoft/MPNet.

Args:
vocab_size (int, optional, defaults to 30527):

Vocabulary size of the MPNet model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling MPNetModel or TFMPNetModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

relative_attention_num_buckets (int, optional, defaults to 32):

The number of buckets to use for each attention layer.
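
As an illustrative sketch, a small MPNet configuration can be built directly; relative_attention_num_buckets is the MPNet-specific parameter documented above:

>>> from transformers import MPNetConfig, MPNetModel
>>> # Illustrative values only
>>> configuration = MPNetConfig(num_hidden_layers=2, num_attention_heads=4, relative_attention_num_buckets=32)
>>> model = MPNetModel(configuration)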

class transformers.models.mpt.configuration_mpt.MptConfig(d_model: int = 2048, n_heads: int = 16, n_layers: int = 24, expansion_ratio: int = 4, max_seq_len: int = 2048, vocab_size: int = 50368, resid_pdrop: float = 0.0, layer_norm_epsilon: float = 1e-05, emb_pdrop: float = 0.0, learned_pos_emb: bool = True, attn_config: transformers.models.mpt.configuration_mpt.MptAttentionConfig = None, init_device: str = 'cpu', logit_scale: float | str | NoneType = None, no_bias: bool = True, verbose: int = 0, embedding_fraction: float = 1.0, norm_type: str = 'low_precision_layernorm', use_cache: bool = False, initializer_range=0.02, **kwargs)

The MPT model was proposed by the MosaicML team and released with multiple sizes and finetuned variants. The MPT models are a series of open source and commercially usable LLMs pre-trained on 1T tokens.

MPT models are GPT-style decoder-only transformers with several improvements: performance-optimized layer implementations, architecture changes that provide greater training stability, and the elimination of context length limits by replacing positional embeddings with ALiBi.

  • MPT base: MPT base pre-trained models on next token prediction

  • MPT instruct: MPT base models fine-tuned on instruction based tasks

  • MPT storywriter: MPT base models fine-tuned for 2500 steps on 65k-token excerpts of fiction books contained in the books3 corpus; this enables the model to handle very long sequences

The original code is available in the llm-foundry repository (https://github.com/mosaicml/llm-foundry/tree/main).

Read more about it in the release blogpost.

Tips:

  • Learn more about some of the techniques behind the training of the model in this section of the llm-foundry repository.

  • If you want to use the advanced version of the model (triton kernels, direct flash attention integration), you can still use the original model implementation by adding trust_remote_code=True when calling from_pretrained.

  • Fine-tuning Notebook on how to fine-tune MPT-7B on a free Google Colab instance to turn the model into a Chatbot.

Args:
d_model (int, optional, defaults to 2048):

Dimensionality of the embeddings and hidden states.

n_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

n_layers (int, optional, defaults to 24):

Number of hidden layers in the Transformer encoder.

expansion_ratio (int, optional, defaults to 4):

The ratio of the up/down scale in the MLP.

max_seq_len (int, optional, defaults to 2048):

The maximum sequence length of the model.

vocab_size (int, optional, defaults to 50368):

Vocabulary size of the Mpt model. Defines the maximum number of different tokens that can be represented by the inputs_ids passed when calling MptModel. Check this discussion on how the vocab_size has been defined.

resid_pdrop (float, optional, defaults to 0.0):

The dropout probability applied to the attention output before combining with residual.

layer_norm_epsilon (float, optional, defaults to 1e-05):

The epsilon to use in the layer normalization layers.

emb_pdrop (float, optional, defaults to 0.0):

The dropout probability for the embedding layer.

learned_pos_emb (bool, optional, defaults to True):

Whether to use learned positional embeddings.

attn_config (dict, optional):

A dictionary used to configure the model’s attention module.

init_device (str, optional, defaults to “cpu”):

The device to use for parameter initialization. Defined for backward compatibility

logit_scale (float, optional):

If not None, scale the logits by this value.

no_bias (bool, optional, defaults to True):

Whether to use bias in all linear layers.

verbose (int, optional, defaults to 0):

The verbosity level to use for logging. Used in the previous versions of MPT models for logging. This argument is deprecated.

embedding_fraction (float, optional, defaults to 1.0):

The fraction to scale the gradients of the embedding layer by.

norm_type (str, optional, defaults to “low_precision_layernorm”):

Type of layer norm to use. All MPT models use the same layer norm implementation. Defined for backward compatibility.

use_cache (bool, optional, defaults to False):

Whether or not the model should return the last key/values attentions (not used by all models).

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
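
Note that MPT uses d_model / n_heads / n_layers rather than the hidden_size / num_attention_heads / num_hidden_layers names of the BERT-style configurations above; a minimal sketch with illustrative values:

>>> from transformers import MptConfig, MptModel
>>> # Illustrative small configuration (attn_config is left at its default here)
>>> configuration = MptConfig(d_model=256, n_heads=4, n_layers=2, max_seq_len=512)
>>> model = MptModel(configuration)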

class transformers.models.mra.configuration_mra.MraConfig(vocab_size=50265, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=1, initializer_range=0.02, layer_norm_eps=1e-05, position_embedding_type='absolute', block_per_row=4, approx_mode='full', initial_prior_first_n_blocks=0, initial_prior_diagonal_n_blocks=0, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs)

The MRA model was proposed in Multi Resolution Analysis (MRA) for Approximate Self-Attention by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, and Vikas Singh.

The abstract from the paper is the following:

Transformers have emerged as a preferred model for many tasks in natural language processing and vision. Recent efforts on training and deploying Transformers more efficiently have identified many strategies to approximate the self-attention matrix, a key module in a Transformer architecture. Effective ideas include various prespecified sparsity patterns, low-rank basis expansions and combinations thereof. In this paper, we revisit classical Multiresolution Analysis (MRA) concepts such as Wavelets, whose potential value in this setting remains underexplored thus far. We show that simple approximations based on empirical feedback and design choices informed by modern hardware and implementation challenges, eventually yield a MRA-based approach for self-attention with an excellent performance profile across most criteria of interest. We undertake an extensive set of experiments and demonstrate that this multi-resolution scheme outperforms most efficient self-attention proposals and is favorable for both short and long sequences. Code is available at https://github.com/mlpen/mra-attention.

This model was contributed by novice03. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 50265):

Vocabulary size of the Mra model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling MraModel.

hidden_size (int, optional, defaults to 768):

Dimension of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 1):

The vocabulary size of the token_type_ids passed when calling MraModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-5):

The epsilon used by the layer normalization layers.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”.

block_per_row (int, optional, defaults to 4):

Used to set the budget for the high resolution scale.

approx_mode (str, optional, defaults to “full”):

Controls whether both low and high resolution approximations are used. Set to “full” for both low and high resolution and “sparse” for only low resolution.

initial_prior_first_n_blocks (int, optional, defaults to 0):

The initial number of blocks for which high resolution is used.

initial_prior_diagonal_n_blocks (int, optional, defaults to 0):

The number of diagonal blocks for which high resolution is used.
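
A configuration-only sketch with illustrative values for the approximation-specific parameters documented above:

>>> from transformers import MraConfig
>>> # Illustrative values; approx_mode="full" uses both low and high resolution approximations
>>> configuration = MraConfig(num_hidden_layers=2, block_per_row=4, approx_mode="full")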

class transformers.models.mvp.configuration_mvp.MvpConfig(vocab_size=50267, max_position_embeddings=1024, encoder_layers=12, encoder_ffn_dim=4096, encoder_attention_heads=16, decoder_layers=12, decoder_ffn_dim=4096, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, activation_function='gelu', d_model=1024, dropout=0.1, attention_dropout=0.0, activation_dropout=0.0, init_std=0.02, classifier_dropout=0.0, scale_embedding=False, use_cache=True, pad_token_id=1, bos_token_id=0, eos_token_id=2, is_encoder_decoder=True, decoder_start_token_id=2, forced_eos_token_id=2, use_prompt=False, prompt_length=100, prompt_mid_dim=800, **kwargs)

The MVP model was proposed in MVP: Multi-task Supervised Pre-training for Natural Language Generation by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.

According to the abstract,

  • MVP follows a standard Transformer encoder-decoder architecture.

  • MVP is supervised pre-trained using labeled datasets.

  • MVP also has task-specific soft prompts to stimulate the model’s capacity in performing a certain task.

  • MVP is specially designed for natural language generation and can be adapted to a wide range of generation tasks, including but not limited to summarization, data-to-text generation, open-ended dialogue system, story generation, question answering, question generation, task-oriented dialogue system, commonsense generation, paraphrase generation, text style transfer, and text simplification. Our model can also be adapted to natural language understanding tasks such as sequence classification and (extractive) question answering.

Tips:

  • We have released a series of models here, including MVP, MVP with task-specific prompts, and multi-task pre-trained variants.

  • If you want to use a model without prompts (standard Transformer), you can load it through MvpForConditionalGeneration.from_pretrained(‘RUCAIBox/mvp’).

  • If you want to use a model with task-specific prompts, such as summarization, you can load it through MvpForConditionalGeneration.from_pretrained(‘RUCAIBox/mvp-summarization’).

  • Our model supports lightweight prompt tuning following Prefix-tuning with method set_lightweight_tuning().

This model was contributed by Tianyi Tang. The detailed information and instructions can be found here.

Args:
vocab_size (int, optional, defaults to 50267):

Vocabulary size of the MVP model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling MvpModel.

d_model (int, optional, defaults to 1024):

Dimensionality of the layers and the pooler layer.

encoder_layers (int, optional, defaults to 12):

Number of encoder layers.

decoder_layers (int, optional, defaults to 12):

Number of decoder layers.

encoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

decoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer decoder.

decoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in decoder.

encoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in the encoder.

activation_function (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

activation_dropout (float, optional, defaults to 0.0):

The dropout ratio for activations inside the fully connected layer.

classifier_dropout (float, optional, defaults to 0.0):

The dropout ratio for classifier.

max_position_embeddings (int, optional, defaults to 1024):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

encoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

decoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the decoder. See the LayerDrop paper for more details.

scale_embedding (bool, optional, defaults to False):

Scale embeddings by dividing by sqrt(d_model).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).

forced_eos_token_id (int, optional, defaults to 2):

The id of the token to force as the last generated token when max_length is reached. Usually set to eos_token_id.

use_prompt (bool, optional, defaults to False):

Whether or not to use prompt.

prompt_length (int, optional, defaults to 100):

The length of prompt.

prompt_mid_dim (int, optional, defaults to 800):

Dimensionality of the “intermediate” layer in prompt.
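
As a sketch with illustrative values, the prompt-related parameters documented above can be enabled directly on the configuration:

>>> from transformers import MvpConfig, MvpForConditionalGeneration
>>> # Illustrative values; use_prompt=True enables the task-specific soft prompts
>>> configuration = MvpConfig(encoder_layers=2, decoder_layers=2, use_prompt=True, prompt_length=100, prompt_mid_dim=800)
>>> model = MvpForConditionalGeneration(configuration)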

class transformers.models.nezha.configuration_nezha.NezhaConfig(vocab_size=21128, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, max_relative_position=64, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, classifier_dropout=0.1, pad_token_id=0, bos_token_id=2, eos_token_id=3, use_cache=True, **kwargs)

The Nezha model was proposed in NEZHA: Neural Contextualized Representation for Chinese Language Understanding by Junqiu Wei et al.

The abstract from the paper is the following:

The pre-trained language models have achieved great successes in various natural language understanding (NLU) tasks due to its capacity to capture the deep contextualized information in text by pre-training on large-scale corpora. In this technical report, we present our practice of pre-training language models named NEZHA (NEural contextualiZed representation for CHinese lAnguage understanding) on Chinese corpora and finetuning for the Chinese NLU tasks. The current version of NEZHA is based on BERT with a collection of proven improvements, which include Functional Relative Positional Encoding as an effective positional encoding scheme, Whole Word Masking strategy, Mixed Precision Training and the LAMB Optimizer in training the models. The experimental results show that NEZHA achieves the state-of-the-art performances when finetuned on several representative Chinese tasks, including named entity recognition (People’s Daily NER), sentence matching (LCQMC), Chinese sentiment classification (ChnSenti) and natural language inference (XNLI).

This model was contributed by sijunhe. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 21128):

Vocabulary size of the NEZHA model. Defines the number of different tokens that can be represented by the inputs_ids passed to the forward method of NezhaModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

The dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed into NezhaModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

classifier_dropout (float, optional, defaults to 0.1):

The dropout ratio for attached classifiers.

is_decoder (bool, optional, defaults to False):

Whether the model is used as a decoder or not. If False, the model is used as an encoder.
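
For illustration, here is a minimal, untested sketch of how a Nezha encoder might be configured in EIR using the parameters listed above. The model_type string ("nezha", assumed to match the Hugging Face model type for NezhaConfig) and the small model_init_config values are assumptions chosen for a quick experiment, not recommended settings.

```yaml
model_config:
  model_type: nezha            # assumed Hugging Face model type string
  pretrained_model: false
  position: embed
  pool: avg
  model_init_config:
    num_hidden_layers: 2       # illustrative, scaled down from the default of 12
    hidden_size: 64
    num_attention_heads: 2
    intermediate_size: 128
    max_position_embeddings: 512
    max_relative_position: 64
```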

class transformers.models.nllb_moe.configuration_nllb_moe.NllbMoeConfig(vocab_size=128112, max_position_embeddings=1024, encoder_layers=12, encoder_ffn_dim=4096, encoder_attention_heads=16, decoder_layers=12, decoder_ffn_dim=4096, decoder_attention_heads=16, encoder_layerdrop=0.05, decoder_layerdrop=0.05, use_cache=True, is_encoder_decoder=True, activation_function='relu', d_model=1024, dropout=0.1, attention_dropout=0.1, activation_dropout=0.0, init_std=0.02, decoder_start_token_id=2, scale_embedding=True, router_bias=False, router_dtype='float32', router_ignore_padding_tokens=False, num_experts=128, expert_capacity=64, encoder_sparse_step=4, decoder_sparse_step=4, router_z_loss_coef=0.001, router_aux_loss_coef=0.001, second_expert_policy='all', normalize_router_prob_before_dropping=False, batch_prioritized_routing=False, moe_eval_capacity_token_fraction=1.0, moe_token_dropout=0.2, pad_token_id=1, bos_token_id=0, eos_token_id=2, output_router_logits=False, **kwargs)

The NLLB model was presented in No Language Left Behind: Scaling Human-Centered Machine Translation by Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang.

The abstract of the paper is the following:

Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system.

Tips:

  • M2M100ForConditionalGeneration is the base model for both NLLB and NLLB MoE

  • The NLLB-MoE is very similar to the NLLB model, but its feed-forward layer is based on the implementation of SwitchTransformers.

  • The tokenizer is the same as the NLLB models.

This model was contributed by Arthur Zucker. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 128112):

Vocabulary size of the NllbMoe model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling NllbMoeModel.

d_model (int, optional, defaults to 1024):

Dimensionality of the layers and the pooler layer.

encoder_layers (int, optional, defaults to 12):

Number of encoder layers.

decoder_layers (int, optional, defaults to 12):

Number of decoder layers.

encoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

decoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer decoder.

decoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in decoder.

encoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in encoder.

activation_function (str or function, optional, defaults to “relu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

activation_dropout (float, optional, defaults to 0.0):

The dropout ratio for activations inside the fully connected layer.

classifier_dropout (float, optional, defaults to 0.0):

The dropout ratio for classifier.

max_position_embeddings (int, optional, defaults to 1024):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

encoder_layerdrop (float, optional, defaults to 0.05):

The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

decoder_layerdrop (float, optional, defaults to 0.05):

The LayerDrop probability for the decoder. See the LayerDrop paper for more details.

second_expert_policy (str, optional, defaults to “all”):

The policy used for sampling the second expert for each token.

normalize_router_prob_before_dropping (bool, optional, defaults to False):

Whether or not to normalize the router probabilities before applying a mask based on the experts capacity (capacity dropping).

batch_prioritized_routing (bool, optional, defaults to False):

Whether or not to order the tokens by their router probabilities before capacity dropping. This means that the tokens that have the highest probabilities will be routed before other tokens that might be further in the sequence.

moe_eval_capacity_token_fraction (float, optional, defaults to 1.0):

Fraction of tokens as capacity during validation, if set to negative, uses the same as training. Should be in range: (0.0, 1.0].

num_experts (int, optional, defaults to 128):

Number of experts for each NllbMoeSparseMlp layer.

expert_capacity (int, optional, defaults to 64):

Number of tokens that can be stored in each expert.

encoder_sparse_step (int, optional, defaults to 4):

Frequency of the sparse layers in the encoder. 4 means that one out of 4 layers will be sparse.

decoder_sparse_step (int, optional, defaults to 4):

Frequency of the sparse layers in the decoder. 4 means that one out of 4 layers will be sparse.

router_dtype (str, optional, defaults to “float32”):

The dtype used for the routers. It is preferable to keep the dtype to “float32” as specified in the selective precision discussion in the paper.

router_ignore_padding_tokens (bool, optional, defaults to False):

Whether to ignore padding tokens when routing. If False, the padding tokens are not routed to any experts.

router_bias (bool, optional, defaults to False):

Whether or not the classifier of the router should have a bias.

moe_token_dropout (float, optional, defaults to 0.2):

Masking rate for MoE expert output masking (EOM), which is implemented via a Dropout2d on the expert outputs.

output_router_logits (bool, optional, defaults to False):

Whether or not to return the router logits. Only set to True to get the auxiliary loss when training.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).
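
As a hedged illustration of where the routing-related parameters above would live in an EIR configuration: the model_type string ("nllb-moe") and all values below are assumptions chosen to keep the model small, not tested settings.

```yaml
model_config:
  model_type: nllb-moe         # assumed Hugging Face model type string
  pretrained_model: false
  position: embed
  pool: avg
  model_init_config:
    d_model: 64                # illustrative, scaled down from the default of 1024
    encoder_layers: 2
    decoder_layers: 2
    encoder_attention_heads: 2
    decoder_attention_heads: 2
    encoder_ffn_dim: 128
    decoder_ffn_dim: 128
    num_experts: 4             # default is 128
    expert_capacity: 64
    encoder_sparse_step: 2     # every other encoder layer is sparse
    decoder_sparse_step: 2
    max_position_embeddings: 512
```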

class transformers.models.nystromformer.configuration_nystromformer.NystromformerConfig(vocab_size=30000, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu_new', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=510, type_vocab_size=2, segment_means_seq_len=64, num_landmarks=64, conv_kernel_size=65, inv_coeff_init_option=False, initializer_range=0.02, layer_norm_eps=1e-05, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs)

The Nyströmformer model was proposed in *Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention* by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh.

The abstract from the paper is the following:

Transformers have emerged as a powerful tool for a broad range of natural language processing tasks. A key component that drives the impressive performance of Transformers is the self-attention mechanism that encodes the influence or dependence of other tokens on each specific token. While beneficial, the quadratic complexity of self-attention on the input sequence length has limited its application to longer sequences – a topic being actively studied in the community. To address this limitation, we propose Nyströmformer – a model that exhibits favorable scalability as a function of sequence length. Our idea is based on adapting the Nyström method to approximate standard self-attention with O(n) complexity. The scalability of Nyströmformer enables application to longer sequences with thousands of tokens. We perform evaluations on multiple downstream tasks on the GLUE benchmark and IMDB reviews with standard sequence length, and find that our Nyströmformer performs comparably, or in a few cases, even slightly better, than standard self-attention. On longer sequence tasks in the Long Range Arena (LRA) benchmark, Nyströmformer performs favorably relative to other efficient self-attention methods. Our code is available at this https URL.

This model was contributed by novice03. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 30000):

Vocabulary size of the Nystromformer model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling NystromformerModel.

hidden_size (int, optional, defaults to 768):

Dimension of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “gelu_new”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 510):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling NystromformerModel.

segment_means_seq_len (int, optional, defaults to 64):

Sequence length used in segment-means.

num_landmarks (int, optional, defaults to 64):

The number of landmark (or Nystrom) points to use in Nystrom approximation of the softmax self-attention matrix.

conv_kernel_size (int, optional, defaults to 65):

The kernel size of depthwise convolution used in Nystrom approximation.

inv_coeff_init_option (bool, optional, defaults to False):

Whether or not to use exact coefficient computation for the initial values for the iterative method of calculating the Moore-Penrose inverse of a matrix.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-05):

The epsilon used by the layer normalization layers.
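
A minimal sketch of a scaled-down Nyströmformer configuration in EIR, assuming the model_type string "nystromformer"; all values are illustrative and untuned. The num_landmarks and conv_kernel_size fields correspond to the Nyström approximation parameters described above.

```yaml
model_config:
  model_type: nystromformer    # assumed Hugging Face model type string
  pretrained_model: false
  position: embed
  pool: avg
  model_init_config:
    num_hidden_layers: 2       # illustrative, scaled down from the default of 12
    hidden_size: 64
    num_attention_heads: 2
    intermediate_size: 128
    num_landmarks: 32
    conv_kernel_size: 33
    max_position_embeddings: 510   # note the non-standard default of 510
```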

class transformers.models.openai.configuration_openai.OpenAIGPTConfig(vocab_size=40478, n_positions=512, n_embd=768, n_layer=12, n_head=12, afn='gelu', resid_pdrop=0.1, embd_pdrop=0.1, attn_pdrop=0.1, layer_norm_epsilon=1e-05, initializer_range=0.02, summary_type='cls_index', summary_use_proj=True, summary_activation=None, summary_proj_to_labels=True, summary_first_dropout=0.1, **kwargs)

OpenAI GPT model was proposed in Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It’s a causal (unidirectional) transformer pre-trained using language modeling on a large corpus with long range dependencies, the Toronto Book Corpus.

The abstract from the paper is the following:

Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pretraining of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.

Tips:

  • GPT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

  • GPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as it can be observed in the run_generation.py example script.

Write With Transformer is a webapp created and hosted by Hugging Face showcasing the generative capabilities of several models. GPT is one of them.

This model was contributed by thomwolf. The original code can be found here.

Note:

If you want to reproduce the original tokenization process of the OpenAI GPT paper, you will need to install ftfy and SpaCy:

```bash
pip install spacy ftfy==4.4.3
python -m spacy download en
```

If you don’t install ftfy and SpaCy, the OpenAIGPTTokenizer will default to tokenize using BERT’s BasicTokenizer followed by Byte-Pair Encoding (which should be fine for most usage, don’t worry).

Args:
vocab_size (int, optional, defaults to 40478):

Vocabulary size of the GPT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling OpenAIGPTModel or TFOpenAIGPTModel.

n_positions (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

n_embd (int, optional, defaults to 768):

Dimensionality of the embeddings and hidden states.

n_layer (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

n_head (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

afn (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

resid_pdrop (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

embd_pdrop (int, optional, defaults to 0.1):

The dropout ratio for the embeddings.

attn_pdrop (float, optional, defaults to 0.1):

The dropout ratio for the attention.

layer_norm_epsilon (float, optional, defaults to 1e-05):

The epsilon to use in the layer normalization layers.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

summary_type (str, optional, defaults to “cls_index”):

Argument used when doing sequence summary, used in the models OpenAIGPTDoubleHeadsModel and TFOpenAIGPTDoubleHeadsModel.

Has to be one of the following options:

  • “last”: Take the last token hidden state (like XLNet).

  • “first”: Take the first token hidden state (like BERT).

  • “mean”: Take the mean of all tokens hidden states.

  • “cls_index”: Supply a Tensor of classification token position (like GPT/GPT-2).

  • “attn”: Not implemented now, use multi-head attention.

summary_use_proj (bool, optional, defaults to True):

Argument used when doing sequence summary, used in the models OpenAIGPTDoubleHeadsModel and TFOpenAIGPTDoubleHeadsModel.

Whether or not to add a projection after the vector extraction.

summary_activation (str, optional):

Argument used when doing sequence summary, used in the models OpenAIGPTDoubleHeadsModel and TFOpenAIGPTDoubleHeadsModel.

Pass “tanh” for a tanh activation to the output, any other value will result in no activation.

summary_proj_to_labels (bool, optional, defaults to True):

Argument used when doing sequence summary, used in the models OpenAIGPTDoubleHeadsModel and TFOpenAIGPTDoubleHeadsModel.

Whether the projection outputs should have config.num_labels or config.hidden_size classes.

summary_first_dropout (float, optional, defaults to 0.1):

Argument used when doing sequence summary, used in the models OpenAIGPTDoubleHeadsModel and TFOpenAIGPTDoubleHeadsModel.

The dropout ratio to be used after the projection and activation.
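
A hedged sketch of a small GPT-style configuration in EIR using the n_* parameter names above; the model_type string ("openai-gpt") and the values are assumptions for illustration only.

```yaml
model_config:
  model_type: openai-gpt       # assumed Hugging Face model type string
  pretrained_model: false
  position: embed
  pool: avg
  model_init_config:
    n_layer: 2                 # illustrative, scaled down from the default of 12
    n_embd: 64
    n_head: 2
    n_positions: 512
```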

class transformers.models.opt.configuration_opt.OPTConfig(vocab_size=50272, hidden_size=768, num_hidden_layers=12, ffn_dim=3072, max_position_embeddings=2048, do_layer_norm_before=True, _remove_final_layer_norm=False, word_embed_proj_dim=None, dropout=0.1, attention_dropout=0.0, num_attention_heads=12, activation_function='relu', layerdrop=0.0, init_std=0.02, use_cache=True, pad_token_id=1, bos_token_id=2, eos_token_id=2, enable_bias=True, layer_norm_elementwise_affine=True, **kwargs)

The OPT model was proposed in Open Pre-trained Transformer Language Models by Meta AI. OPT is a series of open-sourced large causal language models which perform similarly to GPT-3.

The abstract from the paper is the following:

Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.

Tips:

  • OPT has the same architecture as BartDecoder.

  • Contrary to GPT2, OPT adds the EOS token </s> to the beginning of every prompt.

This model was contributed by Arthur Zucker, Younes Belkada, and Patrick Von Platen. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 50272):

Vocabulary size of the OPT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling OPTModel

hidden_size (int, optional, defaults to 768):

Dimensionality of the layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of decoder layers.

ffn_dim (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in decoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer decoder.

activation_function (str or function, optional, defaults to “relu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

max_position_embeddings (int, optional, defaults to 2048):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

do_layer_norm_before (bool, optional, defaults to True):

Whether to perform layer normalization before the attention block.

word_embed_proj_dim (int, optional):

word_embed_proj_dim can be set to down-project word embeddings, e.g. opt-350m. Defaults to hidden_size.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability. See the LayerDrop paper for more details.

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).

enable_bias (bool, optional, defaults to True):

Whether or not the linear layers in the attention blocks should use the bias term.

layer_norm_elementwise_affine (bool, optional, defaults to True):

Whether or not the layer norms should have learnable parameters.
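
For orientation, a minimal and untested EIR-style sketch for a small OPT decoder; the model_type string ("opt") and all values are assumptions chosen for illustration rather than recommended settings.

```yaml
model_config:
  model_type: opt              # assumed Hugging Face model type string
  pretrained_model: false
  position: embed
  pool: avg
  model_init_config:
    num_hidden_layers: 2       # illustrative, scaled down from the default of 12
    hidden_size: 64
    num_attention_heads: 2
    ffn_dim: 128
    max_position_embeddings: 512
```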

class transformers.models.pegasus.configuration_pegasus.PegasusConfig(vocab_size=50265, max_position_embeddings=1024, encoder_layers=12, encoder_ffn_dim=4096, encoder_attention_heads=16, decoder_layers=12, decoder_ffn_dim=4096, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, use_cache=True, is_encoder_decoder=True, activation_function='gelu', d_model=1024, dropout=0.1, attention_dropout=0.0, activation_dropout=0.0, init_std=0.02, decoder_start_token_id=0, scale_embedding=False, pad_token_id=0, eos_token_id=1, forced_eos_token_id=1, **kwargs)

The Pegasus model was proposed in PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.

According to the abstract,

  • Pegasus’ pretraining task is intentionally similar to summarization: important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.

  • Pegasus achieves SOTA summarization performance on all 12 downstream tasks, as measured by ROUGE and human eval.

This model was contributed by sshleifer. The Authors’ code can be found here.

Tips:

  • Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pretraining objective, called Gap Sentence Generation (GSG).

    • MLM: encoder input tokens are randomly replaced by a mask token and have to be predicted by the encoder (like in BERT)

    • GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, which has a causal mask to hide future words, like a regular auto-regressive transformer decoder.

Args:
vocab_size (int, optional, defaults to 50265):

Vocabulary size of the PEGASUS model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling PegasusModel or TFPegasusModel.

d_model (int, optional, defaults to 1024):

Dimensionality of the layers and the pooler layer.

encoder_layers (int, optional, defaults to 12):

Number of encoder layers.

decoder_layers (int, optional, defaults to 12):

Number of decoder layers.

encoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

decoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer decoder.

decoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in decoder.

encoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in encoder.

activation_function (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

activation_dropout (float, optional, defaults to 0.0):

The dropout ratio for activations inside the fully connected layer.

max_position_embeddings (int, optional, defaults to 1024):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

encoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

decoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the decoder. See the LayerDrop paper for more details.

scale_embedding (bool, optional, defaults to False):

Scale embeddings by dividing by sqrt(d_model).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models)

forced_eos_token_id (int, optional, defaults to 1):

The id of the token to force as the last generated token when max_length is reached. Usually set to eos_token_id.
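
A hedged example of how the encoder/decoder parameters above map onto an EIR model_init_config for a small Pegasus model; the model_type string ("pegasus") and the values are illustrative assumptions only.

```yaml
model_config:
  model_type: pegasus          # assumed Hugging Face model type string
  pretrained_model: false
  position: embed
  pool: avg
  model_init_config:
    d_model: 64                # illustrative, scaled down from the default of 1024
    encoder_layers: 2
    decoder_layers: 2
    encoder_attention_heads: 2
    decoder_attention_heads: 2
    encoder_ffn_dim: 128
    decoder_ffn_dim: 128
    max_position_embeddings: 512
```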

class transformers.models.pegasus_x.configuration_pegasus_x.PegasusXConfig(vocab_size=96103, max_position_embeddings=16384, encoder_layers=16, encoder_ffn_dim=4096, encoder_attention_heads=16, decoder_layers=16, decoder_ffn_dim=4096, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, use_cache=True, is_encoder_decoder=True, activation_function='gelu', d_model=1024, dropout=0.1, attention_dropout=0.0, activation_dropout=0.0, init_std=0.02, decoder_start_token_id=0, scale_embedding=True, pad_token_id=0, eos_token_id=1, forced_eos_token_id=1, num_global_tokens=32, block_size=512, stagger_local_blocks=True, **kwargs)

The PEGASUS-X model was proposed in Investigating Efficiently Extending Transformers for Long Input Summarization by Jason Phang, Yao Zhao and Peter J. Liu.

PEGASUS-X (PEGASUS eXtended) extends the PEGASUS models for long input summarization through additional long input pretraining and using staggered block-local attention with global tokens in the encoder.

The abstract from the paper is the following:

While large pretrained Transformer models have proven highly capable at tackling natural language tasks, handling long sequence inputs continues to be a significant challenge. One such task is long input summarization, where inputs are longer than the maximum input context of most pretrained models. Through an extensive set of experiments, we investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization. We find that a staggered, block-local Transformer with global encoder tokens strikes a good balance of performance and efficiency, and that an additional pretraining phase on long sequences meaningfully improves downstream summarization performance. Based on our findings, we introduce PEGASUS-X, an extension of the PEGASUS model with additional long input pretraining to handle inputs of up to 16K tokens. PEGASUS-X achieves strong performance on long input summarization tasks comparable with much larger models while adding few additional parameters and not requiring model parallelism to train.

Tips:

  • PEGASUS-X uses the same tokenizer as PEGASUS.

This model was contributed by zphang (https://huggingface.co/zphang). The original code can be found here.

Args:
vocab_size (int, optional, defaults to 96103):

Vocabulary size of the PEGASUS-X model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling PegasusXModel.

d_model (int, optional, defaults to 1024):

Dimension of the layers and the pooler layer.

encoder_layers (int, optional, defaults to 16):

Number of encoder layers.

decoder_layers (int, optional, defaults to 16):

Number of decoder layers.

encoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

decoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer decoder.

decoder_ffn_dim (int, optional, defaults to 4096):

Dimension of the “intermediate” (often named feed-forward) layer in decoder.

encoder_ffn_dim (int, optional, defaults to 4096):

Dimension of the “intermediate” (often named feed-forward) layer in encoder.

activation_function (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

activation_dropout (float, optional, defaults to 0.0):

The dropout ratio for activations inside the fully connected layer.

max_position_embeddings (int, optional, defaults to 16384):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

encoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

decoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the decoder. See the LayerDrop paper for more details.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models)

forced_eos_token_id (int, optional, defaults to 1):

The id of the token to force as the last generated token when max_length is reached. Usually set to eos_token_id.

num_global_tokens (int, optional, defaults to 32):

Number of global tokens to use for the encoder.

block_size (int, optional, defaults to 512):

Block size for encoder local attention. The sequence length should be an exact multiple of the block size. block_size must be a multiple of 2 if stagger_local_blocks is True.

stagger_local_blocks (bool, optional, defaults to True):

Whether to stagger every other local attention by half a block.
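
A hedged sketch showing where the block-local attention parameters above would be set in an EIR configuration; the model_type string ("pegasus_x") and all values are assumptions. Note that, per the description above, the sequence length should be an exact multiple of block_size.

```yaml
model_config:
  model_type: pegasus_x        # assumed Hugging Face model type string
  pretrained_model: false
  position: embed
  pool: avg
  model_init_config:
    d_model: 64                # illustrative, scaled down from the default of 1024
    encoder_layers: 2
    decoder_layers: 2
    encoder_attention_heads: 2
    decoder_attention_heads: 2
    encoder_ffn_dim: 128
    decoder_ffn_dim: 128
    block_size: 128            # sequence length should be a multiple of this
    num_global_tokens: 16
    stagger_local_blocks: true
    max_position_embeddings: 1024
```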

class transformers.models.persimmon.configuration_persimmon.PersimmonConfig(vocab_size=262144, hidden_size=4096, intermediate_size=16384, num_hidden_layers=36, num_attention_heads=64, hidden_act='relu2', max_position_embeddings=16384, initializer_range=0.02, layer_norm_eps=1e-05, use_cache=True, tie_word_embeddings=False, rope_theta=25000.0, rope_scaling=None, qk_layernorm=True, hidden_dropout=0.0, attention_dropout=0.0, partial_rotary_factor=0.5, pad_token_id=None, bos_token_id=1, eos_token_id=2, **kwargs)

The Persimmon model was created by ADEPT, and authored by Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani.

The authors introduced Persimmon-8B, a decoder model based on the classic transformers architecture, with query and key normalization. Persimmon-8B is a fully permissively-licensed model with approximately 8 billion parameters, released under the Apache license. Some of the key attributes of Persimmon-8B are long context size (16K), performance, and capabilities for multimodal extensions.

The authors showcase their approach to model evaluation, focusing on practical text generation, mirroring how users interact with language models. The work also includes a comparative analysis, pitting Persimmon-8B against other prominent models (MPT 7B Instruct and Llama 2 Base 7B 1-Shot), across various evaluation tasks. The results demonstrate Persimmon-8B’s competitive performance, even with limited training data.

In terms of model details, the work outlines the architecture and training methodology of Persimmon-8B, providing insights into its design choices, sequence length, and dataset composition. The authors present a fast inference code that outperforms traditional implementations through operator fusion and CUDA graph utilization while maintaining code coherence. They express their anticipation of how the community will leverage this contribution to drive innovation, hinting at further upcoming releases as part of an ongoing series of developments.

<Tip warning={true}>

The Persimmon models were trained using bfloat16, but the original inference uses float16. The checkpoints uploaded on the hub use torch_dtype='float16', which will be used by the AutoModel API to cast the checkpoints from torch.float32 to torch.float16.

The dtype of the online weights is mostly irrelevant, unless you are using torch_dtype="auto" when initializing a model with model = AutoModelForCausalLM.from_pretrained("path", torch_dtype="auto"). The reason is that the model will first be downloaded (using the dtype of the checkpoints online) and then cast to the default dtype of torch (torch.float32). Users should specify the torch_dtype they want; if they don't, it will be torch.float32.

Finetuning the model in float16 is not recommended and is known to produce nan; the model should therefore be fine-tuned in bfloat16.

</Tip>

Tips:

  • To convert the model, you need to clone the original repository using git clone https://github.com/persimmon-ai-labs/adept-inference, then get the checkpoints:

```bash
git clone https://github.com/persimmon-ai-labs/adept-inference
wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_base_model_release.tar
tar -xvf 8b_base_model_release.tar
python src/transformers/models/persimmon/convert_persimmon_weights_to_hf.py --input_dir /path/to/downloaded/persimmon/weights/ \
    --output_dir /output/path \
    --pt_model_path /path/to/8b_chat_model_release/iter_0001251/mp_rank_00/model_optim_rng.pt \
    --ada_lib_path /path/to/adept-inference
```

For the chat model:

```bash
wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_chat_model_release.tar
tar -xvf 8b_chat_model_release.tar
```

Thereafter, models can be loaded via:

```py
from transformers import PersimmonForCausalLM, PersimmonTokenizer

model = PersimmonForCausalLM.from_pretrained("/output/path")
tokenizer = PersimmonTokenizer.from_pretrained("/output/path")
```

This model was contributed by ArthurZ. The original code can be found here.

  • Persimmon uses a sentencepiece-based tokenizer, with a Unigram model. It supports bytefallback, which is only available in tokenizers==0.14.0 for the fast tokenizer.

The LlamaTokenizer is used as it is a standard wrapper around sentencepiece. The chat template will be updated with the templating functions in a follow up PR!

  • The authors suggest using the following prompt format for the chat mode: f"human: {prompt}\n\nadept:"

Args:
vocab_size (int, optional, defaults to 262144):

Vocabulary size of the Persimmon model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling PersimmonModel

hidden_size (int, optional, defaults to 4096):

Dimension of the hidden representations.

intermediate_size (int, optional, defaults to 16384):

Dimension of the MLP representations.

num_hidden_layers (int, optional, defaults to 36):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 64):

Number of attention heads for each attention layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “relu2”):

The non-linear activation function (function or string) in the decoder.

max_position_embeddings (int, optional, defaults to 16384):

The maximum sequence length that this model might ever be used with.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-5):

The epsilon used by the rms normalization layers.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

tie_word_embeddings(bool, optional, defaults to False):

Whether to tie weight embeddings

rope_theta (float, optional, defaults to 25000.0):

The base period of the RoPE embeddings.

rope_scaling (Dict, optional):

Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is {“type”: strategy name, “factor”: scaling factor}. When using this flag, don’t update max_position_embeddings to the expected new maximum. See the following thread for more information on how these scaling strategies behave: https://www.reddit.com/r/LocalPersimmon/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an experimental feature, subject to breaking API changes in future versions.

qk_layernorm (bool, optional, defaults to True):

Whether or not to normalize the Queries and Keys after projecting the hidden states.

hidden_dropout (float, optional, defaults to 0.0):

The dropout ratio after applying the MLP to the hidden states.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio after computing the attention scores.

partial_rotary_factor (float, optional, defaults to 0.5):

Percentage of the query and keys which will have rotary embedding.
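
A minimal, untested sketch of a heavily scaled-down Persimmon-style decoder in EIR; the model_type string ("persimmon") and the values are assumptions, whereas the real Persimmon-8B uses the much larger defaults listed above.

```yaml
model_config:
  model_type: persimmon        # assumed Hugging Face model type string
  pretrained_model: false
  position: embed
  pool: avg
  model_init_config:
    num_hidden_layers: 2       # illustrative, scaled down from the default of 36
    hidden_size: 64
    num_attention_heads: 2
    intermediate_size: 128
    max_position_embeddings: 512
    qk_layernorm: true
```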

class transformers.models.phi.configuration_phi.PhiConfig(vocab_size=51200, hidden_size=2048, intermediate_size=8192, num_hidden_layers=24, num_attention_heads=32, num_key_value_heads=None, resid_pdrop=0.0, embd_pdrop=0.0, attention_dropout=0.0, hidden_act='gelu_new', max_position_embeddings=2048, initializer_range=0.02, layer_norm_eps=1e-05, use_cache=True, tie_word_embeddings=False, rope_theta=10000.0, rope_scaling=None, partial_rotary_factor=0.5, qk_layernorm=False, bos_token_id=1, eos_token_id=2, **kwargs)

The Phi-1 model was proposed in Textbooks Are All You Need by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li.

The Phi-1.5 model was proposed in Textbooks Are All You Need II: phi-1.5 technical report by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.

Args:
vocab_size (int, optional, defaults to 51200):

Vocabulary size of the Phi model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling PhiModel.

hidden_size (int, optional, defaults to 2048):

Dimension of the hidden representations.

intermediate_size (int, optional, defaults to 8192):

Dimension of the MLP representations.

num_hidden_layers (int, optional, defaults to 24):

Number of hidden layers in the Transformer decoder.

num_attention_heads (int, optional, defaults to 32):

Number of attention heads for each attention layer in the Transformer decoder.

num_key_value_heads (int, optional):

This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper. If it is not specified, it will default to num_attention_heads.

resid_pdrop (float, optional, defaults to 0.0):

Dropout probability for mlp outputs.

embd_pdrop (int, optional, defaults to 0.0):

The dropout ratio for the embeddings.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio after computing the attention scores.

hidden_act (str or function, optional, defaults to “gelu_new”):

The non-linear activation function (function or string) in the decoder.

max_position_embeddings (int, optional, defaults to 2048):

The maximum sequence length that this model might ever be used with. Phi-1 and Phi-1.5 support up to 2048 tokens.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-05):

The epsilon used by the rms normalization layers.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

tie_word_embeddings (bool, optional, defaults to False):

Whether to tie weight embeddings

rope_theta (float, optional, defaults to 10000.0):

The base period of the RoPE embeddings.

rope_scaling (Dict, optional):

Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is {“type”: strategy name, “factor”: scaling factor}. When using this flag, don’t update max_position_embeddings to the expected new maximum. See the following thread for more information on how these scaling strategies behave: https://www.reddit.com/r/LocalPersimmon/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an experimental feature, subject to breaking API changes in future versions.

partial_rotary_factor (float, optional, defaults to 0.5):

Percentage of the query and keys which will have rotary embedding.

qk_layernorm (bool, optional, defaults to False):

Whether or not to normalize the Queries and Keys after projecting the hidden states.

bos_token_id (int, optional, defaults to 1):

Denotes beginning of sequences token id.

eos_token_id (int, optional, defaults to 2):

Denotes end of sequences token id.
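
As a hedged illustration, a small Phi-style configuration in EIR might look like the following; the model_type string ("phi") and all values are assumptions chosen to keep the model tiny rather than recommended settings.

```yaml
model_config:
  model_type: phi              # assumed Hugging Face model type string
  pretrained_model: false
  position: embed
  pool: avg
  model_init_config:
    num_hidden_layers: 2       # illustrative, scaled down from the default of 24
    hidden_size: 64
    num_attention_heads: 2
    intermediate_size: 128
    max_position_embeddings: 512
    partial_rotary_factor: 0.5
```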

class transformers.models.plbart.configuration_plbart.PLBartConfig(vocab_size=50005, max_position_embeddings=1024, encoder_layers=6, encoder_ffn_dim=3072, encoder_attention_heads=12, decoder_layers=6, decoder_ffn_dim=3072, decoder_attention_heads=12, encoder_layerdrop=0.0, decoder_layerdrop=0.0, use_cache=True, is_encoder_decoder=True, activation_function='gelu', d_model=768, dropout=0.1, attention_dropout=0.1, activation_dropout=0.0, init_std=0.02, classifier_dropout=0.0, scale_embedding=True, pad_token_id=1, bos_token_id=0, eos_token_id=2, forced_eos_token_id=2, **kwargs)

The PLBART model was proposed in Unified Pre-training for Program Understanding and Generation by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. This is a BART-like model which can be used to perform code-summarization, code-generation, and code-translation tasks. The pre-trained model plbart-base has been trained using multilingual denoising task on Java, Python and English.

According to the abstract,

Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in the English language, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART’s effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), logical flow (e.g., if block inside an else block is equivalent to else if block) that are crucial to program semantics and thus excels even with limited annotations.

This model was contributed by gchhablani. The Authors’ code can be found here.

Args:
vocab_size (int, optional, defaults to 50005):

Vocabulary size of the PLBART model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling PLBartModel.

d_model (int, optional, defaults to 768):

Dimensionality of the layers and the pooler layer.

encoder_layers (int, optional, defaults to 6):

Number of encoder layers.

decoder_layers (int, optional, defaults to 6):

Number of decoder layers.

encoder_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

decoder_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer decoder.

decoder_ffn_dim (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in decoder.

encoder_ffn_dim (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in encoder.

activation_function (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

activation_dropout (float, optional, defaults to 0.0):

The dropout ratio for activations inside the fully connected layer.

classifier_dropout (float, optional, defaults to 0.0):

The dropout ratio for classifier.

max_position_embeddings (int, optional, defaults to 1024):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

encoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

decoder_layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the decoder. See the LayerDrop paper for more details.

scale_embedding (bool, optional, defaults to True):

Scale embeddings by dividing by sqrt(d_model).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models)

forced_eos_token_id (int, optional, defaults to 2):

The id of the token to force as the last generated token when max_length is reached. Usually set to eos_token_id.
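
A hedged sketch of a scaled-down PLBART configuration in EIR, mirroring the configurable-model pattern used elsewhere on this page; the model_type string ("plbart") and the values are illustrative assumptions only.

```yaml
model_config:
  model_type: plbart           # assumed Hugging Face model type string
  pretrained_model: false
  position: embed
  pool: avg
  model_init_config:
    d_model: 64                # illustrative, scaled down from the default of 768
    encoder_layers: 2
    decoder_layers: 2
    encoder_attention_heads: 2
    decoder_attention_heads: 2
    encoder_ffn_dim: 128
    decoder_ffn_dim: 128
    max_position_embeddings: 512
```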

class transformers.models.prophetnet.configuration_prophetnet.ProphetNetConfig(activation_dropout: float | None = 0.1, activation_function: str | Callable | NoneType = 'gelu', vocab_size: int | None = 30522, hidden_size: int | None = 1024, encoder_ffn_dim: int | None = 4096, num_encoder_layers: int | None = 12, num_encoder_attention_heads: int | None = 16, decoder_ffn_dim: int | None = 4096, num_decoder_layers: int | None = 12, num_decoder_attention_heads: int | None = 16, attention_dropout: float | None = 0.1, dropout: float | None = 0.1, max_position_embeddings: int | None = 512, init_std: float | None = 0.02, is_encoder_decoder: bool | None = True, add_cross_attention: bool | None = True, decoder_start_token_id: int | None = 0, ngram: int | None = 2, num_buckets: int | None = 32, relative_max_distance: int | None = 128, disable_ngram_loss: bool | None = False, eps: float | None = 0.0, use_cache: bool | None = True, pad_token_id: int | None = 0, bos_token_id: int | None = 1, eos_token_id: int | None = 2, **kwargs)

The ProphetNet model was proposed in ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.

ProphetNet is an encoder-decoder model and can predict n-future tokens for “ngram” language modeling instead of just the next token.

The abstract from the paper is the following:

In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.

Tips:

  • ProphetNet is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

  • The model architecture is based on the original Transformer, but replaces the “standard” self-attention mechanism in the decoder with a main self-attention mechanism and a self- and n-stream (predict) self-attention mechanism.

The Authors’ code can be found here.

Args:
activation_dropout (float, optional, defaults to 0.1):

The dropout ratio for activations inside the fully connected layer.

activation_function (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

vocab_size (int, optional, defaults to 30522):

Vocabulary size of the ProphetNET model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling ProphetNetModel.

hidden_size (int, optional, defaults to 1024):

Dimensionality of the layers and the pooler layer.

encoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in encoder.

num_encoder_layers (int, optional, defaults to 12):

Number of encoder layers.

num_encoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

decoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the intermediate (often named feed-forward) layer in decoder.

num_decoder_layers (int, optional, defaults to 12):

Number of decoder layers.

num_decoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer decoder.

attention_dropout (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

add_cross_attention (bool, optional, defaults to True):

Whether cross-attention layers should be added to the model.

is_encoder_decoder (bool, optional, defaults to True):

Whether this is an encoder/decoder model.

pad_token_id (int, optional, defaults to 0):

Padding token id.

bos_token_id (int, optional, defaults to 1):

Beginning of stream token id.

eos_token_id (int, optional, defaults to 2):

End of stream token id.

ngram (int, optional, defaults to 2):

Number of future tokens to predict. Set to 1 to behave like a traditional language model, predicting only the next token.

num_buckets (int, optional, defaults to 32):

The number of buckets to use for each attention layer. This is for relative position calculation. See the T5 paper for more details.

relative_max_distance (int, optional, defaults to 128):

Relative distances greater than this number will be put into the last same bucket. This is for relative position calculation. See the T5 paper for more details.

disable_ngram_loss (bool, optional, defaults to False):

Whether to train the model to predict only the next token (i.e., disabling the future n-gram loss).

eps (float, optional, defaults to 0.0):

Controls the epsilon parameter value for label smoothing in the loss calculation. If set to 0, no label smoothing is performed.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).
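
As with the other configuration classes listed here, a ProphetNet model can be instantiated from scratch via its configuration. A minimal sketch follows; the reduced layer counts are illustrative values, not recommended settings:

>>> from transformers import ProphetNetConfig, ProphetNetModel
>>> # Down-scaled encoder/decoder depths for illustration only
>>> configuration = ProphetNetConfig(num_encoder_layers=2, num_decoder_layers=2, ngram=2)
>>> model = ProphetNetModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config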

class transformers.models.qwen2.configuration_qwen2.Qwen2Config(vocab_size=151936, hidden_size=4096, intermediate_size=22016, num_hidden_layers=32, num_attention_heads=32, num_key_value_heads=32, hidden_act='silu', max_position_embeddings=32768, initializer_range=0.02, rms_norm_eps=1e-06, use_cache=True, tie_word_embeddings=False, rope_theta=10000.0, use_sliding_window=False, sliding_window=4096, max_window_layers=28, attention_dropout=0.0, **kwargs)

Qwen2 is the new model series of large language models from the Qwen team. Previously, we released the Qwen series, including Qwen-72B, Qwen-1.8B, Qwen-VL, Qwen-Audio, etc.

Args:
vocab_size (int, optional, defaults to 151936):

Vocabulary size of the Qwen2 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling Qwen2Model

hidden_size (int, optional, defaults to 4096):

Dimension of the hidden representations.

intermediate_size (int, optional, defaults to 22016):

Dimension of the MLP representations.

num_hidden_layers (int, optional, defaults to 32):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 32):

Number of attention heads for each attention layer in the Transformer encoder.

num_key_value_heads (int, optional, defaults to 32):

This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA), if num_key_value_heads=1 the model will use Multi Query Attention (MQA), otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by meanpooling all the original heads within that group. For more details, check out this paper. If not specified, it will default to 32.

hidden_act (str or function, optional, defaults to “silu”):

The non-linear activation function (function or string) in the decoder.

max_position_embeddings (int, optional, defaults to 32768):

The maximum sequence length that this model might ever be used with.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

rms_norm_eps (float, optional, defaults to 1e-06):

The epsilon used by the rms normalization layers.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

tie_word_embeddings (bool, optional, defaults to False):

Whether the model’s input and output word embeddings should be tied.

rope_theta (float, optional, defaults to 10000.0):

The base period of the RoPE embeddings.

use_sliding_window (bool, optional, defaults to False):

Whether to use sliding window attention.

sliding_window (int, optional, defaults to 4096):

Sliding window attention (SWA) window size. If not specified, will default to 4096.

max_window_layers (int, optional, defaults to 28):

The number of layers that use SWA (Sliding Window Attention). The bottom layers use SWA while the top use full attention.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

>>> from transformers import Qwen2Model, Qwen2Config
>>> # Initializing a Qwen2 style configuration
>>> configuration = Qwen2Config()
>>> # Initializing a model from the Qwen2-7B style configuration
>>> model = Qwen2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
class transformers.models.reformer.configuration_reformer.ReformerConfig(attention_head_size=64, attn_layers=['local', 'lsh', 'local', 'lsh', 'local', 'lsh'], axial_norm_std=1.0, axial_pos_embds=True, axial_pos_shape=[64, 64], axial_pos_embds_dim=[64, 192], chunk_size_lm_head=0, eos_token_id=2, feed_forward_size=512, hash_seed=None, hidden_act='relu', hidden_dropout_prob=0.05, hidden_size=256, initializer_range=0.02, is_decoder=False, layer_norm_eps=1e-12, local_num_chunks_before=1, local_num_chunks_after=0, local_attention_probs_dropout_prob=0.05, local_attn_chunk_length=64, lsh_attn_chunk_length=64, lsh_attention_probs_dropout_prob=0.0, lsh_num_chunks_before=1, lsh_num_chunks_after=0, max_position_embeddings=4096, num_attention_heads=12, num_buckets=None, num_hashes=1, pad_token_id=0, vocab_size=320, tie_word_embeddings=False, use_cache=True, classifier_dropout=None, **kwargs)

The Reformer model was proposed in the paper Reformer: The Efficient Transformer by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.

The abstract from the paper is the following:

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(Llog(L)), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.

This model was contributed by patrickvonplaten. The Authors’ code can be found here.

Tips:

  • Reformer does not work with torch.nn.DataParallel due to a bug in PyTorch, see issue #36035.

  • Use Axial position encoding (see below for more details). It’s a mechanism to avoid having a huge positional encoding matrix (when the sequence length is very big) by factorizing it into smaller matrices.

  • Replace traditional attention by LSH (local-sensitive hashing) attention (see below for more details). It’s a technique to avoid computing the full product query-key in the attention layers.

  • Avoid storing the intermediate results of each layer by using reversible transformer layers to obtain them during the backward pass (subtracting the residuals from the input of the next layer gives them back) or recomputing them for results inside a given layer (less efficient than storing them but saves memory).

  • Compute the feedforward operations by chunks and not on the whole batch.

Args:
attention_head_size (int, optional, defaults to 64):

Dimensionality of the projected key, query and value vectors

attn_layers (List[str], optional, defaults to [“local”, “lsh”, “local”, “lsh”, “local”, “lsh”]):

List of attention layer types in ascending order. It can be chosen between a LSHSelfAttention layer (“lsh”) and a LocalSelfAttention layer (“local”).

For more information on the LSHSelfAttention layer, see LSH Self Attention. For more information on the LocalSelfAttention layer, see Local Self Attention.

axial_pos_embds (bool, optional, defaults to True):

Whether or not to use axial position embeddings. For more information on how axial position embeddings work, see Axial Position Encodings.

axial_norm_std (float, optional, defaults to 1.0):

The standard deviation of the normal_initializer for initializing the weight matrices of the axial positional encodings.

axial_pos_shape (List[int], optional, defaults to [64, 64]):

The position dims of the axial position encodings. During training, the product of the position dims has to be equal to the sequence length.

For more information on how axial position embeddings work, see Axial Position Encodings.

axial_pos_embds_dim (List[int], optional, defaults to [64, 192]):

The embedding dims of the axial position encodings. The sum of the embedding dims has to be equal to the hidden size.

For more information on how axial position embeddings work, see Axial Position Encodings.

chunk_size_lm_head (int, optional, defaults to 0):

The chunk size of the final language model feed forward head layer. A chunk size of 0 means that the feed forward layer is not chunked. A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time.

For more information on feed forward chunking, see How does Feed Forward Chunking work?.

eos_token_id (int, optional, defaults to 2):

The token id for the end-of-sentence token.

feed_forward_size (int, optional, defaults to 512):

Dimensionality of the feed_forward layer in the residual attention block.

hash_seed (int, optional):

Seed that can be used to make the locality sensitive hashing in LSHSelfAttention deterministic. This should only be set for testing purposes. For evaluation and training, hash_seed should be left as None to ensure fully random rotations in the locality sensitive hashing scheme.

hidden_act (str or Callable, optional, defaults to “relu”):

The non-linear activation function (function or string) in the feed forward layer in the residual attention block. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.05):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

hidden_size (int, optional, defaults to 256):

Dimensionality of the output hidden states of the residual attention blocks.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

is_decoder (bool, optional, defaults to False):

Whether or not to use a causal mask in addition to the attention_mask passed to ReformerModel. When using the Reformer for causal language modeling, this argument should be set to True.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

local_attn_chunk_length (int, optional, defaults to 64):

Length of chunk which attends to itself in LocalSelfAttention. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).

local_num_chunks_before (int, optional, defaults to 1):

Number of previous neighbouring chunks to attend to in LocalSelfAttention layer to itself.

local_num_chunks_after (int, optional, defaults to 0):

Number of following neighbouring chunks to attend to in LocalSelfAttention layer in addition to itself.

local_attention_probs_dropout_prob (float, optional, defaults to 0.05):

The dropout ratio for the attention probabilities in LocalSelfAttention.

lsh_attn_chunk_length (int, optional, defaults to 64):

Length of chunk which attends to itself in LSHSelfAttention. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).

lsh_num_chunks_before (int, optional, defaults to 1):

Number of previous neighbouring chunks to attend to in LSHSelfAttention layer to itself.

lsh_num_chunks_after (int, optional, defaults to 0):

Number of following neighbouring chunks to attend to in LSHSelfAttention layer to itself.

lsh_attention_probs_dropout_prob (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities in LSHSelfAttention.

max_position_embeddings (int, optional, defaults to 4096):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

num_buckets (int or List[int], optional):

Number of buckets the key/query vectors can be “hashed into” using the locality sensitive hashing scheme. Each query key vector is hashed into a hash in 1, …, num_buckets. The number of buckets can also be factorized into a list for improved memory complexity. In this case, each query key vector is hashed into a hash in 1-1, 1-2, …, num_buckets[0]-1, …, num_buckets[0]-num_buckets[1] if num_buckets is factorized into two factors. The number of buckets (or the product of the factors) should approximately equal sequence length / lsh_chunk_length. If num_buckets is not set, a good value is calculated on the fly.

num_hashes (int, optional, defaults to 1):

Number of hashing rounds (e.g., number of random rotations) in Local Sensitive Hashing scheme. The higher num_hashes, the more accurate the LSHSelfAttention becomes, but also the more memory and time intensive the hashing becomes.

pad_token_id (int, optional, defaults to 0):

The token id for the padding token.

vocab_size (int, optional, defaults to 320):

Vocabulary size of the Reformer model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling ReformerModel.

tie_word_embeddings (bool, optional, defaults to False):

Whether to tie input and output embeddings.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).

classifier_dropout (float, optional):

The dropout ratio for the classification head.
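
For example, a small Reformer configuration mixing local and LSH attention layers could be set up as sketched below. The specific values are illustrative only; as noted in the axial_pos_shape description above, the product of the position dims must match the training sequence length:

>>> from transformers import ReformerConfig, ReformerModel
>>> # Two attention layers (one local, one LSH); axial_pos_shape kept at its default product of 4096
>>> configuration = ReformerConfig(attn_layers=["local", "lsh"], axial_pos_shape=[64, 64])
>>> model = ReformerModel(configuration)
>>> configuration = model.config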

class transformers.models.rembert.configuration_rembert.RemBertConfig(vocab_size=250300, hidden_size=1152, num_hidden_layers=32, num_attention_heads=18, input_embedding_size=256, output_embedding_size=1664, intermediate_size=4608, hidden_act='gelu', hidden_dropout_prob=0.0, attention_probs_dropout_prob=0.0, classifier_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, use_cache=True, pad_token_id=0, bos_token_id=312, eos_token_id=313, **kwargs)

The RemBERT model was proposed in Rethinking Embedding Coupling in Pre-trained Language Models by Hyung Won Chung, Thibault Févry, Henry Tsai, Melvin Johnson, Sebastian Ruder.

The abstract from the paper is the following:

We re-evaluate the standard practice of sharing weights between input and output embeddings in state-of-the-art pre-trained language models. We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation in the input embedding of multilingual models. By reallocating the input embedding parameters in the Transformer layers, we achieve dramatically better performance on standard natural language understanding tasks with the same number of parameters during fine-tuning. We also show that allocating additional capacity to the output embedding provides benefits to the model that persist through the fine-tuning stage even though the output embedding is discarded after pre-training. Our analysis shows that larger output embeddings prevent the model’s last layers from overspecializing to the pre-training task and encourage Transformer representations to be more general and more transferable to other tasks and languages. Harnessing these findings, we are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.

Tips:

For fine-tuning, RemBERT can be thought of as a bigger version of mBERT with an ALBERT-like factorization of the embedding layer. The embeddings are not tied in pre-training, in contrast with BERT, which enables smaller input embeddings (preserved during fine-tuning) and bigger output embeddings (discarded at fine-tuning). The tokenizer is also similar to the Albert one rather than the BERT one.

Args:
vocab_size (int, optional, defaults to 250300):

Vocabulary size of the RemBERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling RemBertModel or TFRemBertModel.

hidden_size (int, optional, defaults to 1152):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 32):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 18):

Number of attention heads for each attention layer in the Transformer encoder.

input_embedding_size (int, optional, defaults to 256):

Dimensionality of the input embeddings.

output_embedding_size (int, optional, defaults to 1664):

Dimensionality of the output embeddings.

intermediate_size (int, optional, defaults to 4608):

Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0):

The dropout ratio for the attention probabilities.

classifier_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the classifier layer when fine-tuning.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling RemBertModel or TFRemBertModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

is_decoder (bool, optional, defaults to False):

Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
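
A RemBERT configuration can likewise be instantiated directly; the smaller layer count below is an illustrative assumption, and the decoupled input/output embedding sizes are left at their defaults:

>>> from transformers import RemBertConfig, RemBertModel
>>> # Fewer layers than the default 32, for illustration; embedding sizes stay decoupled
>>> configuration = RemBertConfig(num_hidden_layers=4)
>>> model = RemBertModel(configuration)
>>> configuration = model.config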

class transformers.models.roberta.configuration_roberta.RobertaConfig(vocab_size=50265, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=1, bos_token_id=0, eos_token_id=2, position_embedding_type='absolute', use_cache=True, classifier_dropout=None, **kwargs)

The RoBERTa model was proposed in RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google’s BERT model released in 2018.

It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

The abstract from the paper is the following:

Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

Tips:

  • This implementation is the same as BertModel with a tiny embeddings tweak as well as a setup for Roberta pretrained models.

  • RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.

  • RoBERTa doesn’t have token_type_ids, you don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or </s>)

  • Same as BERT with better pretraining tricks:

    • dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all

    • sentences are packed together to reach 512 tokens (so the sentences are in an order that may span several documents)

    • train with larger batches

    • use BPE with bytes as a subunit and not characters (because of unicode characters)

  • CamemBERT is a wrapper around RoBERTa. Refer to this page for usage examples.

This model was contributed by julien-c. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 50265):

Vocabulary size of the RoBERTa model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling RobertaModel or TFRobertaModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling RobertaModel or TFRobertaModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”. For positional embeddings use “absolute”. For more information on “relative_key”, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on “relative_key_query”, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

is_decoder (bool, optional, defaults to False):

Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

classifier_dropout (float, optional):

The dropout ratio for the classification head.
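
Following the same pattern as the Qwen2 example above, a RoBERTa model can be built from a default configuration:

>>> from transformers import RobertaConfig, RobertaModel
>>> # Initializing a RoBERTa configuration with default values
>>> configuration = RobertaConfig()
>>> model = RobertaModel(configuration)
>>> configuration = model.config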

class transformers.models.roberta_prelayernorm.configuration_roberta_prelayernorm.RobertaPreLayerNormConfig(vocab_size=50265, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=1, bos_token_id=0, eos_token_id=2, position_embedding_type='absolute', use_cache=True, classifier_dropout=None, **kwargs)

The RoBERTa-PreLayerNorm model was proposed in fairseq: A Fast, Extensible Toolkit for Sequence Modeling by Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli. It is identical to using the --encoder-normalize-before flag in fairseq.

The abstract from the paper is the following:

fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision training and inference on modern GPUs.

Tips:

  • The implementation is the same as RoBERTa except that instead of using Add and Norm it does Norm and Add. Add and Norm refer to the addition and layer normalization as described in Attention Is All You Need.

  • This is identical to using the --encoder-normalize-before flag in fairseq.

This model was contributed by andreasmaden. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 50265):

Vocabulary size of the RoBERTa-PreLayerNorm model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling RobertaPreLayerNormModel or TFRobertaPreLayerNormModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling RobertaPreLayerNormModel or TFRobertaPreLayerNormModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”. For positional embeddings use “absolute”. For more information on “relative_key”, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on “relative_key_query”, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

is_decoder (bool, optional, defaults to False):

Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

classifier_dropout (float, optional):

The dropout ratio for the classification head.
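
The pre-layer-norm variant is configured in exactly the same way as RoBERTa; only the configuration and model classes differ. A minimal sketch with default values:

>>> from transformers import RobertaPreLayerNormConfig, RobertaPreLayerNormModel
>>> configuration = RobertaPreLayerNormConfig()
>>> model = RobertaPreLayerNormModel(configuration)
>>> configuration = model.config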

class transformers.models.roc_bert.configuration_roc_bert.RoCBertConfig(vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, use_cache=True, pad_token_id=0, position_embedding_type='absolute', classifier_dropout=None, enable_pronunciation=True, enable_shape=True, pronunciation_embed_dim=768, pronunciation_vocab_size=910, shape_embed_dim=512, shape_vocab_size=24858, concat_input=True, **kwargs)

The RoCBert model was proposed in RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining by Hui Su, Weiwei Shi, Xiaoyu Shen, Xiao Zhou, Tuo Ji, Jiarui Fang, Jie Zhou. It’s a pretrained Chinese language model that is robust under various forms of adversarial attacks.

The abstract from the paper is the following:

Large-scale pretrained language models have achieved SOTA results on NLP tasks. However, they have been shown vulnerable to adversarial attacks especially for logographic languages like Chinese. In this work, we propose ROCBERT: a pretrained Chinese Bert that is robust to various forms of adversarial attacks like word perturbation, synonyms, typos, etc. It is pretrained with the contrastive learning objective which maximizes the label consistency under different synthesized adversarial examples. The model takes as input multimodal information including the semantic, phonetic and visual features. We show all these features are important to the model robustness since the attack can be performed in all the three forms. Across 5 Chinese NLU tasks, ROCBERT outperforms strong baselines under three blackbox adversarial algorithms without sacrificing the performance on clean testset. It also performs the best in the toxic content detection task under human-made attacks.

This model was contributed by weiweishi.

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the RoCBert model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling RoCBertModel.

hidden_size (int, optional, defaults to 768):

Dimension of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling RoCBertModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

is_decoder (bool, optional, defaults to False):

Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”. For positional embeddings use “absolute”. For more information on “relative_key”, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on “relative_key_query”, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

classifier_dropout (float, optional):

The dropout ratio for the classification head.

enable_pronunciation (bool, optional, defaults to True):

Whether or not the model uses the pronunciation embedding when training.

enable_shape (bool, optional, defaults to True):

Whether or not the model uses the shape embedding when training.

pronunciation_embed_dim (int, optional, defaults to 768):

Dimension of the pronunciation_embed.

pronunciation_vocab_size (int, optional, defaults to 910):

Pronunciation Vocabulary size of the RoCBert model. Defines the number of different tokens that can be represented by the input_pronunciation_ids passed when calling RoCBertModel.

shape_embed_dim (int, optional, defaults to 512):

Dimension of the shape_embed.

shape_vocab_size (int, optional, defaults to 24858):

Shape Vocabulary size of the RoCBert model. Defines the number of different tokens that can be represented by the input_shape_ids passed when calling RoCBertModel.

concat_input (bool, optional, defaults to True):

Defines the way of merging the shape_embed, pronunciation_embed and word_embed: if True, output_embed = torch.cat((word_embed, shape_embed, pronunciation_embed), -1); otherwise, output_embed = (word_embed + shape_embed + pronunciation_embed) / 3.
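
A RoCBert configuration exposes the pronunciation and shape embeddings described above. The sketch below simply keeps the defaults (both enabled, with concatenated embeddings):

>>> from transformers import RoCBertConfig, RoCBertModel
>>> # Defaults: enable_pronunciation=True, enable_shape=True, concat_input=True
>>> configuration = RoCBertConfig()
>>> model = RoCBertModel(configuration)
>>> configuration = model.config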

class transformers.models.roformer.configuration_roformer.RoFormerConfig(vocab_size=50000, embedding_size=None, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=1536, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=0, rotary_value=False, use_cache=True, **kwargs)

The RoFormer model was proposed in RoFormer: Enhanced Transformer with Rotary Position Embedding by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.

The abstract from the paper is the following:

Position encoding in transformer architecture provides supervision for dependency modeling between elements at different positions in the sequence. We investigate various methods to encode positional information in transformer-based language models and propose a novel implementation named Rotary Position Embedding(RoPE). The proposed RoPE encodes absolute positional information with rotation matrix and naturally incorporates explicit relative position dependency in self-attention formulation. Notably, RoPE comes with valuable properties such as flexibility of being expand to any sequence lengths, decaying inter-token dependency with increasing relative distances, and capability of equipping the linear self-attention with relative position encoding. As a result, the enhanced transformer with rotary position embedding, or RoFormer, achieves superior performance in tasks with long texts. We release the theoretical analysis along with some preliminary experiment results on Chinese data. The undergoing experiment for English benchmark will soon be updated.

Tips:

  • RoFormer is a BERT-like autoencoding model with rotary position embeddings. Rotary position embeddings have shown improved performance on classification tasks with long texts.

This model was contributed by junnyu. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 50000):

Vocabulary size of the RoFormer model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling RoFormerModel or TFRoFormerModel.

embedding_size (int, optional, defaults to None):

Dimensionality of the embeddings. Defaults to the hidden_size if not provided.

hidden_size (int, optional, defaults to 768):

Dimension of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 1536):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 1536).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling RoFormerModel or TFRoFormerModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

is_decoder (bool, optional, defaults to False):

Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

rotary_value (bool, optional, defaults to False):

Whether or not to apply rotary position embeddings to the value layer.
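
A RoFormer model with rotary position embeddings can be instantiated the same way; the sketch below keeps all defaults, with rotary_value left at False:

>>> from transformers import RoFormerConfig, RoFormerModel
>>> configuration = RoFormerConfig()
>>> model = RoFormerModel(configuration)
>>> configuration = model.config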

class transformers.models.rwkv.configuration_rwkv.RwkvConfig(vocab_size=50277, context_length=1024, hidden_size=4096, num_hidden_layers=32, attention_hidden_size=None, intermediate_size=None, layer_norm_epsilon=1e-05, bos_token_id=0, eos_token_id=0, rescale_every=6, tie_word_embeddings=False, use_cache=True, **kwargs)

The RWKV model was proposed in this repo.

It suggests a tweak in the traditional Transformer attention to make it linear. This way, the model can be used as a recurrent network: passing inputs for timestamp 0 and timestamp 1 together is the same as passing inputs at timestamp 0, then inputs at timestamp 1 along with the state of timestamp 0 (see example below).

This can be more efficient than a regular Transformer and can deal with sentences of any length (even if the model uses a fixed context length for training).

This model was contributed by sgugger. The original code can be found here.

Example of use as an RNN:

```py
import torch
from transformers import AutoTokenizer, RwkvConfig, RwkvModel

model = RwkvModel.from_pretrained("sgugger/rwkv-430M-pile")
tokenizer = AutoTokenizer.from_pretrained("sgugger/rwkv-430M-pile")

inputs = tokenizer("This is an example.", return_tensors="pt")
# Feed everything to the model
outputs = model(inputs["input_ids"])
output_whole = outputs.last_hidden_state

outputs = model(inputs["input_ids"][:, :2])
output_one = outputs.last_hidden_state

# Using the state computed on the first inputs, we will get the same output
outputs = model(inputs["input_ids"][:, 2:], state=outputs.state)
output_two = outputs.last_hidden_state

torch.allclose(torch.cat([output_one, output_two], dim=1), output_whole, atol=1e-5)
```

Args:
vocab_size (int, optional, defaults to 50277):

Vocabulary size of the RWKV model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling RwkvModel.

context_length (int, optional, defaults to 1024):

The maximum sequence length that this model can be used with in a single forward pass (using it in RNN mode lets it handle sequences of any length).

hidden_size (int, optional, defaults to 4096):

Dimensionality of the embeddings and hidden states.

num_hidden_layers (int, optional, defaults to 32):

Number of hidden layers in the model.

attention_hidden_size (int, optional):

Dimensionality of the attention hidden states. Will default to hidden_size if unset.

intermediate_size (int, optional):

Dimensionality of the inner feed-forward layers. Will default to 4 times hidden_size if unset.

layer_norm_epsilon (float, optional, defaults to 1e-05):

The epsilon to use in the layer normalization layers.

bos_token_id (int, optional, defaults to 0):

The id of the beginning of sentence token in the vocabulary. Defaults to 0 as RWKV uses the same tokenizer as GPTNeoX.

eos_token_id (int, optional, defaults to 0):

The id of the end of sentence token in the vocabulary. Defaults to 0 as RWKV uses the same tokenizer as GPTNeoX.

rescale_every (int, optional, defaults to 6):

At inference, the hidden states (and weights of the corresponding output layers) are divided by 2 every rescale_every layer. If set to 0 or a negative number, no rescale is done.

tie_word_embeddings (bool, optional, defaults to False):

Whether or not to tie the word embeddings with the input token embeddings.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last state.

class transformers.models.splinter.configuration_splinter.SplinterConfig(vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, use_cache=True, pad_token_id=0, question_token_id=104, **kwargs)

The Splinter model was proposed in Few-Shot Question Answering by Pretraining Span Selection by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy. Splinter is an encoder-only transformer (similar to BERT) pretrained using the recurring span selection task on a large corpus comprising Wikipedia and the Toronto Book Corpus.

The abstract from the paper is the following:

In several question answering benchmarks, pretrained models have reached human parity through fine-tuning on an order of 100,000 annotated questions and answers. We explore the more realistic few-shot setting, where only a few hundred training examples are available, and observe that standard models perform poorly, highlighting the discrepancy between current pretraining objectives and question answering. We propose a new pretraining scheme tailored for question answering: recurring span selection. Given a passage with multiple sets of recurring spans, we mask in each set all recurring spans but one, and ask the model to select the correct span in the passage for each masked span. Masked spans are replaced with a special token, viewed as a question representation, that is later used during fine-tuning to select the answer span. The resulting model obtains surprisingly good results on multiple benchmarks (e.g., 72.7 F1 on SQuAD with only 128 training examples), while maintaining competitive performance in the high-resource setting.

Tips:

  • Splinter was trained to predict answer spans conditioned on a special [QUESTION] token. These tokens contextualize to question representations which are used to predict the answers. This layer is called QASS, and is the default behaviour in the SplinterForQuestionAnswering class. Therefore:

  • Use SplinterTokenizer (rather than BertTokenizer), as it already contains this special token. Also, its default behavior is to use this token when two sequences are given (for example, in the run_qa.py script).

  • If you plan on using Splinter outside run_qa.py, please keep in mind the question token - it might be important for the success of your model, especially in a few-shot setting.

  • Please note there are two different checkpoints for each size of Splinter. Both are basically the same, except that one also has the pretrained weights of the QASS layer (tau/splinter-base-qass and tau/splinter-large-qass) and one doesn’t (tau/splinter-base and tau/splinter-large). This is done to support randomly initializing this layer at fine-tuning, as it is shown to yield better results for some cases in the paper.

This model was contributed by yuvalkirstain and oriram. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the Splinter model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling SplinterModel.

hidden_size (int, optional, defaults to 768):

Dimension of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling SplinterModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

question_token_id (int, optional, defaults to 104):

The id of the [QUESTION] token.
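
A Splinter model can be instantiated from its configuration as well; in the sketch below, question_token_id is written out explicitly at its default value purely for clarity:

>>> from transformers import SplinterConfig, SplinterModel
>>> # question_token_id defaults to 104, the id of the [QUESTION] token
>>> configuration = SplinterConfig(question_token_id=104)
>>> model = SplinterModel(configuration)
>>> configuration = model.config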

class transformers.models.squeezebert.configuration_squeezebert.SqueezeBertConfig(vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=0, embedding_size=768, q_groups=4, k_groups=4, v_groups=4, post_attention_groups=1, intermediate_groups=4, output_groups=4, **kwargs)

The SqueezeBERT model was proposed in SqueezeBERT: What can computer vision teach NLP about efficient neural networks? by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer. It’s a bidirectional transformer similar to the BERT model. The key difference between the BERT architecture and the SqueezeBERT architecture is that SqueezeBERT uses grouped convolutions instead of fully-connected layers for the Q, K, V and FFN layers.

The abstract from the paper is the following:

Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets, large computing systems, and better neural network models, natural language processing (NLP) technology has made significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. In particular, we consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today’s highly-accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods such as grouped convolutions have yielded significant speedups for computer vision networks, but many of these techniques have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in self-attention layers with grouped convolutions, and we use this technique in a novel network architecture called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test set. The SqueezeBERT code will be released.

Tips:

  • SqueezeBERT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

  • SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained with a causal language modeling (CLM) objective are better in that regard.

  • For best results when finetuning on sequence classification tasks, it is recommended to start with the squeezebert/squeezebert-mnli-headless checkpoint.

This model was contributed by forresti.

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the SqueezeBERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling SqueezeBertModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling BertModel or TFBertModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

pad_token_id (int, optional, defaults to 0):

The ID of the token in the word embedding to use as padding.

embedding_size (int, optional, defaults to 768):

The dimension of the word embedding vectors.

q_groups (int, optional, defaults to 4):

The number of groups in Q layer.

k_groups (int, optional, defaults to 4):

The number of groups in K layer.

v_groups (int, optional, defaults to 4):

The number of groups in V layer.

post_attention_groups (int, optional, defaults to 1):

The number of groups in the first feed forward network layer.

intermediate_groups (int, optional, defaults to 4):

The number of groups in the second feed forward network layer.

output_groups (int, optional, defaults to 4):

The number of groups in the third feed forward network layer.
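
The grouped-convolution parameters above map directly onto configuration arguments. The sketch below simply spells out the default group counts for illustration:

>>> from transformers import SqueezeBertConfig, SqueezeBertModel
>>> # q/k/v and post-attention group counts shown at their default values
>>> configuration = SqueezeBertConfig(q_groups=4, k_groups=4, v_groups=4, post_attention_groups=1)
>>> model = SqueezeBertModel(configuration)
>>> configuration = model.config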

class transformers.models.stablelm.configuration_stablelm.StableLmConfig(vocab_size=50304, intermediate_size=6912, hidden_size=2560, num_hidden_layers=32, num_attention_heads=32, num_key_value_heads=32, hidden_act='silu', max_position_embeddings=4096, initializer_range=0.02, layer_norm_eps=1e-05, use_cache=True, tie_word_embeddings=False, rope_theta=10000, rope_scaling=None, use_qkv_bias=False, hidden_dropout=0.0, attention_dropout=0.0, partial_rotary_factor=0.25, bos_token_id=0, eos_token_id=0, **kwargs)

StableLM 3B 4E1T was proposed in StableLM 3B 4E1T: Technical Report by Stability AI and is the first model in a series of multi-epoch pre-trained language models.

Args:
vocab_size (int, optional, defaults to 50304):

Vocabulary size of the StableLM model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling StableLmModel.

intermediate_size (int, optional, defaults to 6912):

Dimension of the MLP representations.

hidden_size (int, optional, defaults to 2560):

Dimension of the hidden representations.

num_hidden_layers (int, optional, defaults to 32):

Number of hidden layers in the Transformer decoder.

num_attention_heads (int, optional, defaults to 32):

Number of attention heads for each attention layer in the Transformer encoder.

num_key_value_heads (int, optional, defaults to 32):

This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA), if num_key_value_heads=1 the model will use Multi Query Attention (MQA), otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by meanpooling all the original heads within that group. For more details, check out this paper. If not specified, it will default to num_attention_heads.

hidden_act (str or function, optional, defaults to “silu”):

The non-linear activation function (function or string).

max_position_embeddings (int, optional, defaults to 4096):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-05):

The epsilon used by the normalization layers.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

tie_word_embeddings (bool, optional, defaults to False):

Whether the model’s input and output word embeddings should be tied.

rope_theta (float, optional, defaults to 10000.0):

The base period of the RoPE embeddings.

rope_scaling (Dict, optional):

Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is {“type”: strategy name, “factor”: scaling factor}. When using this flag, don’t update max_position_embeddings to the expected new maximum. See the following thread for more information on how these scaling strategies behave: https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an experimental feature, subject to breaking API changes in future versions.

use_qkv_bias (bool, optional, defaults to False):

Whether or not the model should use bias for qkv layers.

hidden_dropout (float, optional, defaults to 0.0):

The dropout ratio after applying the MLP to the hidden states.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

partial_rotary_factor (float, optional, defaults to 0.25):

Percentage of the query and keys which will have rotary embedding.

bos_token_id (int, optional, defaults to 0):

The id of the BOS token in the vocabulary.

eos_token_id (int, optional, defaults to 0):

The id of the EOS token in the vocabulary.
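
As a hedged sketch of how the grouped-query-attention behaviour described under num_key_value_heads can be exercised, the configuration can be instantiated directly and passed to the corresponding model class; the value of 8 key/value heads below is illustrative only, not a recommended setting:

>>> from transformers import StableLmConfig, StableLmModel
>>> # Default StableLM configuration, but with 8 key/value heads (GQA) instead of 32 (MHA)
>>> configuration = StableLmConfig(num_key_value_heads=8)
>>> # Randomly initialized model built from this configuration
>>> model = StableLmModel(configuration)
>>> configuration = model.config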

class transformers.models.starcoder2.configuration_starcoder2.Starcoder2Config(vocab_size=49152, hidden_size=3072, intermediate_size=12288, num_hidden_layers=30, num_attention_heads=24, num_key_value_heads=2, hidden_act='gelu_pytorch_tanh', max_position_embeddings=4096, initializer_range=0.018042, norm_epsilon=1e-05, use_cache=True, bos_token_id=50256, eos_token_id=50256, rope_theta=10000.0, sliding_window=None, attention_dropout=0.0, residual_dropout=0.0, embedding_dropout=0.0, use_bias=True, **kwargs)

StarCoder2 is a family of open LLMs for code and comes in 3 different sizes with 3B, 7B and 15B parameters. The flagship StarCoder2-15B model is trained on over 4 trillion tokens and 600+ programming languages from The Stack v2. All models use Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, and were trained using the Fill-in-the-Middle objective. The models have been released with the paper StarCoder 2 and The Stack v2: The Next Generation by Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries.

The abstract of the paper is the following:

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

Args:
vocab_size (int, optional, defaults to 49152):

Vocabulary size of the Starcoder2 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling Starcoder2Model.

hidden_size (int, optional, defaults to 3072):

Dimension of the hidden representations.

intermediate_size (int, optional, defaults to 12288):

Dimension of the MLP representations.

num_hidden_layers (int, optional, defaults to 30):

Number of hidden layers in the Transformer decoder.

num_attention_heads (int, optional, defaults to 24):

Number of attention heads for each attention layer in the Transformer decoder.

num_key_value_heads (int, optional, defaults to 2):

This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, refer to the paper introducing Grouped Query Attention. If it is not specified, it will default to 8.

hidden_act (str or function, optional, defaults to “gelu_pytorch_tanh”):

The non-linear activation function (function or string) in the decoder.

max_position_embeddings (int, optional, defaults to 4096):

The maximum sequence length that this model might ever be used with. Starcoder2’s sliding window attention allows sequences of up to 4096*32 tokens.

initializer_range (float, optional, defaults to 0.018042):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

norm_epsilon (float, optional, defaults to 1e-05):

Epsilon value for the layer norm.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

bos_token_id (int, optional, defaults to 50256):

The id of the “beginning-of-sequence” token.

eos_token_id (int, optional, defaults to 50256):

The id of the “end-of-sequence” token.

rope_theta (float, optional, defaults to 10000.0):

The base period of the RoPE embeddings.

sliding_window (int, optional):

Sliding window attention window size. If not specified, will default to None (no sliding window).

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

residual_dropout (float, optional, defaults to 0.0):

Residual connection dropout value.

embedding_dropout (float, optional, defaults to 0.0):

Embedding dropout.

use_bias (bool, optional, defaults to True):

Whether to use bias term on linear layers of the model.

>>> from transformers import Starcoder2Model, Starcoder2Config
>>> # Initializing a Starcoder2 7B style configuration
>>> configuration = Starcoder2Config()
>>> # Initializing a model from the Starcoder2 7B style configuration
>>> model = Starcoder2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
class transformers.models.switch_transformers.configuration_switch_transformers.SwitchTransformersConfig(vocab_size=32128, d_model=768, d_kv=64, d_ff=2048, expert_capacity=64, num_layers=12, num_sparse_encoder_layers=3, num_decoder_layers=12, num_sparse_decoder_layers=3, num_heads=12, num_experts=8, router_bias=False, router_jitter_noise=0.01, router_dtype='float32', router_ignore_padding_tokens=False, relative_attention_num_buckets=32, relative_attention_max_distance=128, dropout_rate=0.1, layer_norm_epsilon=1e-06, router_z_loss_coef=0.001, router_aux_loss_coef=0.001, initializer_factor=1.0, dense_act_fn='relu', is_encoder_decoder=True, add_router_probs=False, use_cache=True, pad_token_id=0, eos_token_id=1, **kwargs)

The SwitchTransformers model was proposed in Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity by William Fedus, Barret Zoph, Noam Shazeer.

The Switch Transformer model uses a sparse T5 encoder-decoder architecture, where the MLPs are replaced by a Mixture of Experts (MoE). A routing mechanism (top-1 in this case) associates each token with one of the experts, where each expert is a dense MLP. While Switch Transformers have many more weights than their equivalent dense models, the sparsity allows better scaling and better finetuning performance at scale. During a forward pass, only a fraction of the weights are used. The routing mechanism allows the model to select relevant weights on the fly, which increases the model capacity without increasing the number of operations.

The abstract from the paper is the following:

In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model – with outrageous numbers of parameters – but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability – we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus” and achieve a 4x speedup over the T5-XXL model.

Tips:

  • SwitchTransformers uses the T5Tokenizer, which can be loaded directly from each model’s repository.

  • The released weights are pretrained on an English masked language modeling task and should be finetuned.

This model was contributed by Younes Belkada and Arthur Zucker. The original code can be found here.

Arguments:
vocab_size (int, optional, defaults to 32128):

Vocabulary size of the SwitchTransformers model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling SwitchTransformersModel.

d_model (int, optional, defaults to 768):

Size of the encoder layers and the pooler layer.

d_kv (int, optional, defaults to 64):

Size of the key, query, value projections per attention head. d_kv has to be equal to d_model // num_heads.

d_ff (int, optional, defaults to 2048):

Size of the intermediate feed forward layer in each SwitchTransformersBlock.

expert_capacity (int, optional, defaults to 64):

Number of tokens that can be stored in each expert. If set to 1, the model will behave like a regular Transformer.

num_layers (int, optional, defaults to 12):

Number of dense hidden layers in the Transformer encoder.

num_sparse_encoder_layers (int, optional, defaults to 3):

Number of sparse (MoE) hidden layers in the Transformer encoder.

num_decoder_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer decoder. Will use the same value as num_layers if not set.

num_sparse_decoder_layers (int, optional, defaults to 3):

Number of sparse (MoE) hidden layers in the Transformer decoder.

num_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

num_experts (int, optional, defaults to 8):

Number of experts for each SwitchTransformer layer.

router_bias (bool, optional, defaults to False):

Whether to add a bias to the router.

router_jitter_noise (float, optional, defaults to 0.01):

Amount of noise to add to the router.

router_dtype (str, optional, defaults to “float32”):

The dtype used for the routers. It is preferable to keep the dtype to “float32” as specified in the selective precision discussion in the paper.

router_ignore_padding_tokens (bool, optional, defaults to False):

Whether to ignore padding tokens when routing.

relative_attention_num_buckets (int, optional, defaults to 32):

The number of buckets to use for each attention layer.

relative_attention_max_distance (int, optional, defaults to 128):

The maximum distance of the longer sequences for the bucket separation.

dropout_rate (float, optional, defaults to 0.1):

The ratio for all dropout layers.

layer_norm_epsilon (float, optional, defaults to 1e-6):

The epsilon used by the layer normalization layers.

router_z_loss_coef (float, optional, defaults to 0.001):

The z loss factor for the total loss.

router_aux_loss_coef (float, optional, defaults to 0.001):

The aux loss factor for the total loss.

initializer_factor (float, optional, defaults to 1.0):

A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

dense_act_fn (string, optional, defaults to “relu”):

Type of feed forward layer to be used. Should be one of “relu” or “gated-gelu”. SwitchTransformersv1.1 uses the “gated-gelu” feed forward projection. Original SwitchTransformers uses “relu”.

add_router_probs (bool, optional, defaults to False):

Whether to output router probabilities to compute router auxiliary loss.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).
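
As a sketch (mirroring the StarCoder2 example above), the sparse expert settings can be adjusted when building the configuration; the reduced number of experts here is purely illustrative:

>>> from transformers import SwitchTransformersConfig, SwitchTransformersModel
>>> # Illustrative: use 4 experts per sparse layer instead of the default 8
>>> configuration = SwitchTransformersConfig(num_experts=4)
>>> # Randomly initialized sparse encoder-decoder model
>>> model = SwitchTransformersModel(configuration)
>>> configuration = model.config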

class transformers.models.t5.configuration_t5.T5Config(vocab_size=32128, d_model=512, d_kv=64, d_ff=2048, num_layers=6, num_decoder_layers=None, num_heads=8, relative_attention_num_buckets=32, relative_attention_max_distance=128, dropout_rate=0.1, layer_norm_epsilon=1e-06, initializer_factor=1.0, feed_forward_proj='relu', is_encoder_decoder=True, use_cache=True, pad_token_id=0, eos_token_id=1, classifier_dropout=0.0, **kwargs)

The T5 model was presented in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.

The abstract from the paper is the following:

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pretraining objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.

Tips:

  • T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g., for translation: translate English to German: …, for summarization: summarize: ….

  • The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).

  • Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and replacing them with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the original sentence and the target is then the dropped out tokens delimited by their sentinel tokens.

  • T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.

  • See the training, inference and scripts sections below for all details regarding usage.

T5 comes in different sizes:

  • t5-small

  • t5-base

  • t5-large

  • t5-3b

  • t5-11b.

Based on the original T5 model, Google has released some follow-up works:

  • T5v1.1: T5v1.1 is an improved version of T5 with some architectural tweaks, and is pre-trained on C4 only without mixing in the supervised tasks. Refer to the documentation of T5v1.1 which can be found here.

  • mT5: mT5 is a multilingual T5 model. It is pre-trained on the mC4 corpus, which includes 101 languages. Refer to the documentation of mT5 which can be found here.

  • byT5: byT5 is a T5 model pre-trained on byte sequences rather than SentencePiece subword token sequences. Refer to the documentation of byT5 which can be found here.

  • UL2: UL2 is a T5 like model pretrained on various denoising objectives

  • Flan-T5: Flan is a pretraining method based on prompting. The Flan-T5 models are T5 models trained on the Flan collection of datasets, which include: taskmaster2, djaym7/wiki_dialog, deepmind/code_contests, lambada, gsm8k, aqua_rat, esnli, quasc and qed.

  • Flan-UL2: the UL2 model finetuned using the “Flan” prompt tuning and dataset collection.

  • UMT5: UMT5 is a multilingual T5 model trained on an improved and refreshed mC4 multilingual corpus, 29 trillion characters across 107 languages, using a new sampling method, UniMax. Refer to the documentation of mT5 which can be found here.

All checkpoints can be found on the hub.

This model was contributed by thomwolf. The original code can be found here.

Arguments:
vocab_size (int, optional, defaults to 32128):

Vocabulary size of the T5 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling T5Model or TFT5Model.

d_model (int, optional, defaults to 512):

Size of the encoder layers and the pooler layer.

d_kv (int, optional, defaults to 64):

Size of the key, query, value projections per attention head. The inner_dim of the projection layer will be defined as num_heads * d_kv.

d_ff (int, optional, defaults to 2048):

Size of the intermediate feed forward layer in each T5Block.

num_layers (int, optional, defaults to 6):

Number of hidden layers in the Transformer encoder.

num_decoder_layers (int, optional):

Number of hidden layers in the Transformer decoder. Will use the same value as num_layers if not set.

num_heads (int, optional, defaults to 8):

Number of attention heads for each attention layer in the Transformer encoder.

relative_attention_num_buckets (int, optional, defaults to 32):

The number of buckets to use for each attention layer.

relative_attention_max_distance (int, optional, defaults to 128):

The maximum distance of the longer sequences for the bucket separation.

dropout_rate (float, optional, defaults to 0.1):

The ratio for all dropout layers.

classifier_dropout (float, optional, defaults to 0.0):

The dropout ratio for the classifier.

layer_norm_epsilon (float, optional, defaults to 1e-6):

The epsilon used by the layer normalization layers.

initializer_factor (float, optional, defaults to 1):

A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

feed_forward_proj (string, optional, defaults to “relu”):

Type of feed forward layer to be used. Should be one of “relu” or “gated-gelu”. T5v1.1 uses the “gated-gelu” feed forward projection. Original T5 uses “relu”.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).
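
For illustration, the configuration defaults above can be made explicit when constructing a T5 model from scratch; note that the inner projection dimension is num_heads * d_kv, so the values below (which simply restate the defaults) are mutually consistent:

>>> from transformers import T5Config, T5Model
>>> # Explicitly restating the defaults: inner projection dim = num_heads * d_kv = 8 * 64
>>> configuration = T5Config(d_model=512, d_kv=64, d_ff=2048, num_layers=6, num_heads=8)
>>> # Randomly initialized encoder-decoder model
>>> model = T5Model(configuration)
>>> configuration = model.config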

class transformers.models.visual_bert.configuration_visual_bert.VisualBertConfig(vocab_size=30522, hidden_size=768, visual_embedding_dim=512, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, bypass_transformer=False, special_visual_initialize=True, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs)

The VisualBERT model was proposed in VisualBERT: A Simple and Performant Baseline for Vision and Language by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. VisualBERT is a neural network trained on a variety of (image, text) pairs.

The abstract from the paper is the following:

We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2, and Flickr30K show that VisualBERT outperforms or rivals with state-of-the-art models while being significantly simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.

Tips:

  1. Most of the checkpoints provided work with the VisualBertForPreTraining configuration. Other checkpoints provided are the fine-tuned checkpoints for down-stream tasks - VQA (‘visualbert-vqa’), VCR (‘visualbert-vcr’), NLVR2 (‘visualbert-nlvr2’). Hence, if you are not working on these downstream tasks, it is recommended that you use the pretrained checkpoints.

  2. For the VCR task, the authors use a fine-tuned detector for generating visual embeddings, for all the checkpoints. We do not provide the detector and its weights as a part of the package, but it will be available in the research projects, and the states can be loaded directly into the detector provided.

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the VisualBERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling VisualBertModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

visual_embedding_dim (int, optional, defaults to 512):

Dimensionality of the visual embeddings to be passed to the model.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling VisualBertModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

bypass_transformer (bool, optional, defaults to False):

Whether or not the model should bypass the transformer for the visual embeddings. If set to True, the model directly concatenates the visual embeddings from VisualBertEmbeddings with the text output from the transformer, and then passes the result to a self-attention layer.

special_visual_initialize (bool, optional, defaults to True):

Whether or not the visual token type and position type embedding weights should be initialized the same as the textual token type and position type embeddings. When set to True, the weights of the textual token type and position type embeddings are copied to the respective visual embedding layers.
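
As a minimal, non-authoritative sketch, the visual embedding dimensionality can be set when the configuration is created; the value below is illustrative and should match the dimensionality of the visual features you intend to pass to the model:

>>> from transformers import VisualBertConfig, VisualBertModel
>>> # Illustrative: expect 1024-dimensional visual embeddings instead of the default 512
>>> configuration = VisualBertConfig(visual_embedding_dim=1024)
>>> # Randomly initialized model following this configuration
>>> model = VisualBertModel(configuration)
>>> configuration = model.config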

class transformers.models.xglm.configuration_xglm.XGLMConfig(vocab_size=256008, max_position_embeddings=2048, d_model=1024, ffn_dim=4096, num_layers=24, attention_heads=16, activation_function='gelu', dropout=0.1, attention_dropout=0.1, activation_dropout=0.0, layerdrop=0.0, init_std=0.02, scale_embedding=True, use_cache=True, decoder_start_token_id=2, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs)

The XGLM model was proposed in Few-shot Learning with Multilingual Language Models by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.

The abstract from the paper is the following:

Large-scale autoregressive language models such as GPT-3 are few-shot learners that can perform a wide range of language tasks without fine-tuning. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities in a wide range of tasks. Our largest model with 7.5 billion parameters sets new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in 4-shot settings) and natural language inference (+5.4% in each of 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 out of 182 translation directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning on some tasks, while there is still room for improvement on surface form robustness and adaptation to tasks that do not have a natural cloze form. Finally, we evaluate our models in social value tasks such as hate speech detection in five languages and find it has limitations similar to comparable sized GPT-3 models.

This model was contributed by Suraj. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 256008):

Vocabulary size of the XGLM model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling XGLMModel or FlaxXGLMModel.

max_position_embeddings (int, optional, defaults to 2048):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

d_model (int, optional, defaults to 1024):

Dimension of the layers and the pooler layer.

ffn_dim (int, optional, defaults to 4096):

Dimension of the “intermediate” (often named feed-forward) layer in the decoder.

num_layers (int, optional, defaults to 24):

Number of hidden layers in the Transformer decoder.

attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer decoder.

activation_function (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, decoder, and pooler.

attention_dropout (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

activation_dropout (float, optional, defaults to 0.0):

The dropout ratio for activations inside the fully connected layer.

layerdrop (float, optional, defaults to 0.0):

The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

scale_embedding (bool, optional, defaults to True):

Scale embeddings by dividing by sqrt(d_model).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).
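
A short sketch of instantiating the default configuration, in the same style as the StarCoder2 example above:

>>> from transformers import XGLMConfig, XGLMModel
>>> # Default XGLM configuration (24 decoder layers, d_model=1024)
>>> configuration = XGLMConfig()
>>> # Randomly initialized decoder-only model
>>> model = XGLMModel(configuration)
>>> configuration = model.config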

class transformers.models.xlm.configuration_xlm.XLMConfig(vocab_size=30145, emb_dim=2048, n_layers=12, n_heads=16, dropout=0.1, attention_dropout=0.1, gelu_activation=True, sinusoidal_embeddings=False, causal=False, asm=False, n_langs=1, use_lang_emb=True, max_position_embeddings=512, embed_init_std=0.02209708691207961, layer_norm_eps=1e-12, init_std=0.02, bos_index=0, eos_index=1, pad_index=2, unk_index=3, mask_index=5, is_encoder=True, summary_type='first', summary_use_proj=True, summary_activation=None, summary_proj_to_labels=True, summary_first_dropout=0.1, start_n_top=5, end_n_top=5, mask_token_id=0, lang_id=0, pad_token_id=2, bos_token_id=0, **kwargs)

The XLM model was proposed in Cross-lingual Language Model Pretraining by Guillaume Lample, Alexis Conneau. It’s a transformer pretrained using one of the following objectives:

  • a causal language modeling (CLM) objective (next token prediction),

  • a masked language modeling (MLM) objective (BERT-like), or

  • a Translation Language Modeling (TLM) objective (extension of BERT’s MLM to multiple language inputs)

The abstract from the paper is the following:

Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we obtain 34.3 BLEU on WMT’16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT’16 Romanian-English, outperforming the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.

Tips:

  • XLM has many different checkpoints, which were trained using different objectives: CLM, MLM or TLM. Make sure to select the correct objective for your task (e.g. MLM checkpoints are not suitable for generation).

  • XLM has multilingual checkpoints which leverage a specific lang parameter. Check out the multi-lingual page for more information.

  • A transformer model trained on several languages. There are three different type of training for this model and the library provides checkpoints for all of them:

    • Causal language modeling (CLM) which is the traditional autoregressive training (so this model could be in the previous section as well). One of the languages is selected for each training sample, and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages.

    • Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample, and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages, with dynamic masking of the tokens.

    • A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two different languages, with random masking. To predict one of the masked tokens, the model can use both, the surrounding context in language 1 and the context given by language 2.

This model was contributed by thomwolf. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 30145):

Vocabulary size of the XLM model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling XLMModel or TFXLMModel.

emb_dim (int, optional, defaults to 2048):

Dimensionality of the encoder layers and the pooler layer.

n_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

n_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.1):

The dropout probability for the attention mechanism.

gelu_activation (bool, optional, defaults to True):

Whether or not to use gelu for the activations instead of relu.

sinusoidal_embeddings (bool, optional, defaults to False):

Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.

causal (bool, optional, defaults to False):

Whether or not the model should behave in a causal manner. Causal models use a triangular attention mask in order to only attend to the left-side context instead of a bidirectional context.

asm (bool, optional, defaults to False):

Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the prediction layer.

n_langs (int, optional, defaults to 1):

The number of languages the model handles. Set to 1 for monolingual models.

use_lang_emb (bool, optional, defaults to True)

Whether to use language embeddings. Some models use additional language embeddings, see the multilingual models page for information on how to use them.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

embed_init_std (float, optional, defaults to 2048^-0.5):

The standard deviation of the truncated_normal_initializer for initializing the embedding matrices.

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices except the embedding matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

bos_index (int, optional, defaults to 0):

The index of the beginning of sentence token in the vocabulary.

eos_index (int, optional, defaults to 1):

The index of the end of sentence token in the vocabulary.

pad_index (int, optional, defaults to 2):

The index of the padding token in the vocabulary.

unk_index (int, optional, defaults to 3):

The index of the unknown token in the vocabulary.

mask_index (int, optional, defaults to 5):

The index of the masking token in the vocabulary.

is_encoder(bool, optional, defaults to True):

Whether or not the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.

summary_type (string, optional, defaults to “first”):

Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

Has to be one of the following options:

  • “last”: Take the last token hidden state (like XLNet).

  • “first”: Take the first token hidden state (like BERT).

  • “mean”: Take the mean of all tokens hidden states.

  • “cls_index”: Supply a Tensor of classification token position (like GPT/GPT-2).

  • “attn”: Not implemented now, use multi-head attention.

summary_use_proj (bool, optional, defaults to True):

Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

Whether or not to add a projection after the vector extraction.

summary_activation (str, optional):

Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

Pass “tanh” for a tanh activation to the output, any other value will result in no activation.

summary_proj_to_labels (bool, optional, defaults to True):

Used in the sequence classification and multiple choice models.

Whether the projection outputs should have config.num_labels or config.hidden_size classes.

summary_first_dropout (float, optional, defaults to 0.1):

Used in the sequence classification and multiple choice models.

The dropout ratio to be used after the projection and activation.

start_n_top (int, optional, defaults to 5):

Used in the SQuAD evaluation script.

end_n_top (int, optional, defaults to 5):

Used in the SQuAD evaluation script.

mask_token_id (int, optional, defaults to 0):

Model agnostic parameter to identify masked tokens when generating text in an MLM context.

lang_id (int, optional, defaults to 0):

The ID of the language used by the model. This parameter is used when generating text in a given language.
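
As an illustrative sketch only, the language-related settings described above (n_langs, use_lang_emb, causal) can be passed directly when building the configuration:

>>> from transformers import XLMConfig, XLMModel
>>> # Illustrative: a bidirectional (non-causal) model with two language embeddings
>>> configuration = XLMConfig(n_langs=2, use_lang_emb=True, causal=False)
>>> # Randomly initialized model following this configuration
>>> model = XLMModel(configuration)
>>> configuration = model.config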

class transformers.models.xlm_prophetnet.configuration_xlm_prophetnet.XLMProphetNetConfig(activation_dropout: float | None = 0.1, activation_function: str | Callable | NoneType = 'gelu', vocab_size: int | None = 30522, hidden_size: int | None = 1024, encoder_ffn_dim: int | None = 4096, num_encoder_layers: int | None = 12, num_encoder_attention_heads: int | None = 16, decoder_ffn_dim: int | None = 4096, num_decoder_layers: int | None = 12, num_decoder_attention_heads: int | None = 16, attention_dropout: float | None = 0.1, dropout: float | None = 0.1, max_position_embeddings: int | None = 512, init_std: float | None = 0.02, is_encoder_decoder: bool | None = True, add_cross_attention: bool | None = True, decoder_start_token_id: int | None = 0, ngram: int | None = 2, num_buckets: int | None = 32, relative_max_distance: int | None = 128, disable_ngram_loss: bool | None = False, eps: float | None = 0.0, use_cache: bool | None = True, pad_token_id: int | None = 0, bos_token_id: int | None = 1, eos_token_id: int | None = 2, **kwargs)

The XLM-ProphetNet model was proposed in ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.

XLM-ProphetNet is an encoder-decoder model and can predict n-future tokens for “ngram” language modeling instead of just the next token. Its architecture is identical to ProphetNet, but the model was trained on the multi-lingual “wiki100” Wikipedia dump.

The abstract from the paper is the following:

In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.

The Authors’ code can be found here.

Tips:

  • XLM-ProphetNet’s model architecture and pretraining objective are the same as ProphetNet’s, but XLM-ProphetNet was pre-trained on the cross-lingual dataset XGLUE.

Args:
activation_dropout (float, optional, defaults to 0.1):

The dropout ratio for activations inside the fully connected layer.

activation_function (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

vocab_size (int, optional, defaults to 30522):

Vocabulary size of the ProphetNET model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling XLMProphetNetModel.

hidden_size (int, optional, defaults to 1024):

Dimensionality of the layers and the pooler layer.

encoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in the encoder.

num_encoder_layers (int, optional, defaults to 12):

Number of encoder layers.

num_encoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

decoder_ffn_dim (int, optional, defaults to 4096):

Dimensionality of the intermediate (often named feed-forward) layer in the decoder.

num_decoder_layers (int, optional, defaults to 12):

Number of decoder layers.

num_decoder_attention_heads (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer decoder.

attention_dropout (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

init_std (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

add_cross_attention (bool, optional, defaults to True):

Whether cross-attention layers should be added to the model.

is_encoder_decoder (bool, optional, defaults to True):

Whether this is an encoder/decoder model.

pad_token_id (int, optional, defaults to 0)

Padding token id.

bos_token_id (int, optional, defaults to 1)

Beginning of stream token id.

eos_token_id (int, optional, defaults to 2)

End of stream token id.

ngram (int, optional, defaults to 2)

Number of future tokens to predict. Set to 1 to behave like a traditional language model and predict only the next token.

num_buckets (int, optional, defaults to 32)

The number of buckets to use for each attention layer. This is for relative position calculation. See the T5 paper for more details.

relative_max_distance (int, optional, defaults to 128)

Relative distances greater than this number will be put into the last same bucket. This is for relative position calculation. See the T5 paper for more details.

disable_ngram_loss (bool, optional, defaults to False):

Whether the model should be trained to predict only the next token (disabling the n-gram loss).

eps (float, optional, defaults to 0.0):

Controls the epsilon parameter value for label smoothing in the loss calculation. If set to 0, no label smoothing is performed.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models).
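
A minimal sketch, assuming the standard Transformers config/model pattern, showing how the n-gram prediction depth can be set via the configuration; ngram=2 simply restates the default:

>>> from transformers import XLMProphetNetConfig, XLMProphetNetModel
>>> # ngram=2 restates the default: predict the next two future tokens at each step
>>> configuration = XLMProphetNetConfig(ngram=2)
>>> # Randomly initialized encoder-decoder model
>>> model = XLMProphetNetModel(configuration)
>>> configuration = model.config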

class transformers.models.xlm_roberta.configuration_xlm_roberta.XLMRobertaConfig(vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=1, bos_token_id=0, eos_token_id=2, position_embedding_type='absolute', use_cache=True, classifier_dropout=None, **kwargs)

The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook’s RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.

The abstract from the paper is the following:

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.

Tips:

  • XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require lang tensors to understand which language is used, and should be able to determine the correct language from the input ids.

  • Uses RoBERTa tricks on the XLM approach, but does not use the translation language modeling objective. It only uses masked language modeling on sentences coming from one language.

  • This implementation is the same as RoBERTa. Refer to the documentation of RoBERTa for usage examples as well as the information relative to the inputs and outputs.

This model was contributed by stefan-it. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 30522):

Vocabulary size of the XLM-RoBERTa model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling XLMRobertaModel or TFXLMRobertaModel.

hidden_size (int, optional, defaults to 768):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 512):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 2):

The vocabulary size of the token_type_ids passed when calling XLMRobertaModel or TFXLMRobertaModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”. For positional embeddings use “absolute”. For more information on “relative_key”, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on “relative_key_query”, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

is_decoder (bool, optional, defaults to False):

Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

classifier_dropout (float, optional):

The dropout ratio for the classification head.
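
In the same spirit as the StarCoder2 example above, an XLM-RoBERTa model can be built from the default configuration; no pretrained weights are loaded in this sketch:

>>> from transformers import XLMRobertaConfig, XLMRobertaModel
>>> # Default configuration, matching the parameter defaults listed above
>>> configuration = XLMRobertaConfig()
>>> # Randomly initialized model (no pretrained weights are loaded here)
>>> model = XLMRobertaModel(configuration)
>>> configuration = model.config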

class transformers.models.xlm_roberta_xl.configuration_xlm_roberta_xl.XLMRobertaXLConfig(vocab_size=250880, hidden_size=2560, num_hidden_layers=36, num_attention_heads=32, intermediate_size=10240, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=514, type_vocab_size=1, initializer_range=0.02, layer_norm_eps=1e-05, pad_token_id=1, bos_token_id=0, eos_token_id=2, position_embedding_type='absolute', use_cache=True, classifier_dropout=None, **kwargs)

The XLM-RoBERTa-XL model was proposed in Larger-Scale Transformers for Multilingual Masked Language Modeling by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau.

The abstract from the paper is the following:

Recent work has demonstrated the effectiveness of cross-lingual language model pretraining for cross-lingual understanding. In this study, we present the results of two larger multilingual masked language models, with 3.5B and 10.7B parameters. Our two new models dubbed XLM-R XL and XLM-R XXL outperform XLM-R by 1.8% and 2.4% average accuracy on XNLI. Our model also outperforms the RoBERTa-Large model on several English tasks of the GLUE benchmark by 0.3% on average while handling 99 more languages. This suggests pretrained models with larger capacity may obtain both strong performance on high-resource languages while greatly improving low-resource languages. We make our code and models publicly available.

Tips:

  • XLM-RoBERTa-XL is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require lang tensors to understand which language is used, and should be able to determine the correct language from the input ids.

This model was contributed by Soonhwan-Kwon and stefan-it. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 250880):

Vocabulary size of the XLM_ROBERTA_XL model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling XLMRobertaXLModel.

hidden_size (int, optional, defaults to 2560):

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 36):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 32):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 10240):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

hidden_act (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 514):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 1):

The vocabulary size of the token_type_ids passed when calling XLMRobertaXLModel or TFXLMRobertaXLModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-5):

The epsilon used by the layer normalization layers.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”. For positional embeddings use “absolute”. For more information on “relative_key”, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on “relative_key_query”, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

classifier_dropout (float, optional):

The dropout ratio for the classification head.
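
Because the default configuration above corresponds to a multi-billion-parameter model, the hedged sketch below shrinks the main dimensions to something more practical for experimentation; the exact values are illustrative only:

>>> from transformers import XLMRobertaXLConfig, XLMRobertaXLModel
>>> # Illustrative, heavily shrunken configuration (the defaults describe the roughly 3.5B-parameter XL model)
>>> configuration = XLMRobertaXLConfig(num_hidden_layers=4, hidden_size=256, num_attention_heads=4, intermediate_size=1024)
>>> # Randomly initialized small model for experimentation
>>> model = XLMRobertaXLModel(configuration)
>>> configuration = model.config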

class transformers.models.xlnet.configuration_xlnet.XLNetConfig(vocab_size=32000, d_model=1024, n_layer=24, n_head=16, d_inner=4096, ff_activation='gelu', untie_r=True, attn_type='bi', initializer_range=0.02, layer_norm_eps=1e-12, dropout=0.1, mem_len=512, reuse_len=None, use_mems_eval=True, use_mems_train=False, bi_data=False, clamp_len=-1, same_length=False, summary_type='last', summary_use_proj=True, summary_activation='tanh', summary_last_dropout=0.1, start_n_top=5, end_n_top=5, pad_token_id=5, bos_token_id=1, eos_token_id=2, **kwargs)

The XLNet model was proposed in XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. XLnet is an extension of the Transformer-XL model pre-trained using an autoregressive method to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence factorization order.

The abstract from the paper is the following:

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

Tips:

  • The specific attention pattern can be controlled at training and test time using the perm_mask input.

  • Due to the difficulty of training a fully auto-regressive model over the various factorization orders, XLNet is pretrained using only a subset of the output tokens as targets, which are selected with the target_mapping input.

  • To use XLNet for sequential decoding (i.e., not in a fully bi-directional setting), use the perm_mask and target_mapping inputs to control the attention span and the outputs (see examples in examples/pytorch/text-generation/run_generation.py).

  • XLNet is one of the few models that has no sequence length limit.

  • XLNet is not a traditional autoregressive model, but uses a training strategy that builds on that idea. It permutes the tokens in the sentence and then allows the model to use the last n tokens to predict token n+1. Since this is all done with a mask, the sentence is actually fed into the model in the right order, but instead of masking the first n tokens for n+1, XLNet uses a mask that hides the previous tokens in some given permutation of 1, …, sequence length.

  • XLNet also uses the same recurrence mechanism as Transformer-XL to build long-term dependencies.

This model was contributed by thomwolf. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 32000):

Vocabulary size of the XLNet model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling XLNetModel or TFXLNetModel.

d_model (int, optional, defaults to 1024):

Dimensionality of the encoder layers and the pooler layer.

n_layer (int, optional, defaults to 24):

Number of hidden layers in the Transformer encoder.

n_head (int, optional, defaults to 16):

Number of attention heads for each attention layer in the Transformer encoder.

d_inner (int, optional, defaults to 4096):

Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

ff_activation (str or Callable, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “silu” and “gelu_new” are supported.

untie_r (bool, optional, defaults to True):

Whether or not to untie relative position biases.

attn_type (str, optional, defaults to “bi”):

The attention type used by the model. Set “bi” for XLNet, “uni” for Transformer-XL.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

dropout (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

mem_len (int or None, optional):

The number of tokens to cache. The key/value pairs that have already been pre-computed in a previous forward pass won’t be re-computed. See the quickstart for more information.

reuse_len (int, optional):

The number of tokens in the current batch to be cached and reused in the future.

bi_data (bool, optional, defaults to False):

Whether or not to use bidirectional input pipeline. Usually set to True during pretraining and False during finetuning.

clamp_len (int, optional, defaults to -1):

Clamp all relative distances larger than clamp_len. Setting this attribute to -1 means no clamping.

same_length (bool, optional, defaults to False):

Whether or not to use the same attention length for each token.

summary_type (str, optional, defaults to “last”):

Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

Has to be one of the following options:

  • “last”: Take the last token hidden state (like XLNet).

  • “first”: Take the first token hidden state (like BERT).

  • “mean”: Take the mean of all tokens hidden states.

  • “cls_index”: Supply a Tensor of classification token position (like GPT/GPT-2).

  • “attn”: Not implemented for now; would use multi-head attention.

summary_use_proj (bool, optional, defaults to True):

Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

Whether or not to add a projection after the vector extraction.

summary_activation (str, optional):

Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

Pass “tanh” for a tanh activation to the output, any other value will result in no activation.

summary_proj_to_labels (bool, optional, defaults to True):

Used in the sequence classification and multiple choice models.

Whether the projection outputs should have config.num_labels or config.hidden_size classes.

summary_last_dropout (float, optional, defaults to 0.1):

Used in the sequence classification and multiple choice models.

The dropout ratio to be used after the projection and activation.

start_n_top (int, optional, defaults to 5):

Used in the SQuAD evaluation script.

end_n_top (int, optional, defaults to 5):

Used in the SQuAD evaluation script.

use_mems_eval (bool, optional, defaults to True):

Whether or not the model should make use of the recurrent memory mechanism in evaluation mode.

use_mems_train (bool, optional, defaults to False):

Whether or not the model should make use of the recurrent memory mechanism in train mode.

Tip: For pretraining, it is recommended to set use_mems_train to True. For fine-tuning, it is recommended to set use_mems_train to False as discussed here. If use_mems_train is set to True, one has to make sure that the train batches are correctly pre-processed, e.g. batch_1 = [[This line is], [This is the]] and batch_2 = [[ the first line], [ second line]], and that all batches are of equal size.
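
If one wanted to configure a small XLNet model from scratch in EIR, the documented arguments above (n_layer, d_model, n_head, d_inner, ff_activation, and so on) would likewise be supplied through model_init_config. The fragment below is a sketch only; the model_type string "xlnet" and the chosen values are assumptions, and the input_info / input_type_info blocks are omitted because they follow the same pattern as the examples at the top of this page.

model_config:
  model_type: xlnet
  pretrained_model: false
  position: embed
  pool: avg
  model_init_config:
    n_layer: 2           # library default is 24
    d_model: 64          # library default is 1024
    n_head: 4            # d_model must be divisible by n_head
    d_inner: 128         # library default is 4096
    ff_activation: gelu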

class transformers.models.yoso.configuration_yoso.YosoConfig(vocab_size=50265, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=4096, type_vocab_size=1, initializer_range=0.02, layer_norm_eps=1e-12, position_embedding_type='absolute', use_expectation=True, hash_code_len=9, num_hash=64, conv_window=None, use_fast_hash=True, lsh_backward=True, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs)

The YOSO model was proposed in You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh. YOSO approximates standard softmax self-attention via a Bernoulli sampling scheme based on Locality Sensitive Hashing (LSH). In principle, all the Bernoulli random variables can be sampled with a single hash.

The abstract from the paper is the following:

Transformer-based models are widely used in natural language processing (NLP). Central to the transformer model is the self-attention mechanism, which captures the interactions of token pairs in the input sequences and depends quadratically on the sequence length. Training such models on longer sequences is expensive. In this paper, we show that a Bernoulli sampling attention mechanism based on Locality Sensitive Hashing (LSH), decreases the quadratic complexity of such models to linear. We bypass the quadratic cost by considering self-attention as a sum of individual tokens associated with Bernoulli random variables that can, in principle, be sampled at once by a single hash (although in practice, this number may be a small constant). This leads to an efficient sampling scheme to estimate self-attention which relies on specific modifications of LSH (to enable deployment on GPU architectures). We evaluate our algorithm on the GLUE benchmark with standard 512 sequence length where we see favorable performance relative to a standard pretrained Transformer. On the Long Range Arena (LRA) benchmark, for evaluating performance on long sequences, our method achieves results consistent with softmax self-attention but with sizable speed-ups and memory savings and often outperforms other efficient self-attention methods. Our code is available at this https URL

Tips:

  • The YOSO attention algorithm is implemented through custom CUDA kernels, functions written in CUDA C++ that can be executed multiple times in parallel on a GPU.

  • The kernels provide a fast_hash function, which approximates the random projections of the queries and keys using the Fast Hadamard Transform. Using these hash codes, the lsh_cumulation function approximates self-attention via LSH-based Bernoulli sampling.

  • To use the custom kernels, the user should set config.use_expectation = False. To ensure that the kernels are compiled successfully, the user must install the correct version of PyTorch and cudatoolkit. By default, config.use_expectation = True, which uses YOSO-E and does not require compiling CUDA kernels.

[Figure: YOSO Attention Algorithm, taken from the original paper (https://arxiv.org/abs/2111.09714).]

This model was contributed by novice03. The original code can be found here.

Args:
vocab_size (int, optional, defaults to 50265):

Vocabulary size of the YOSO model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling YosoModel.

hidden_size (int, optional, defaults to 768):

Dimension of the encoder layers and the pooler layer.

num_hidden_layers (int, optional, defaults to 12):

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, optional, defaults to 12):

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, optional, defaults to 3072):

Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

hidden_act (str or function, optional, defaults to “gelu”):

The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” are supported.

hidden_dropout_prob (float, optional, defaults to 0.1):

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (float, optional, defaults to 0.1):

The dropout ratio for the attention probabilities.

max_position_embeddings (int, optional, defaults to 4096):

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

type_vocab_size (int, optional, defaults to 1):

The vocabulary size of the token_type_ids passed when calling YosoModel.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, optional, defaults to 1e-12):

The epsilon used by the layer normalization layers.

position_embedding_type (str, optional, defaults to “absolute”):

Type of position embedding. Choose one of “absolute”, “relative_key”, “relative_key_query”.

use_expectation (bool, optional, defaults to True):

Whether or not to use YOSO Expectation. Overrides any effect of num_hash.

hash_code_len (int, optional, defaults to 9):

The length of hashes generated by the hash functions.

num_hash (int, optional, defaults to 64):

Number of hash functions used in YosoSelfAttention.

conv_window (int, optional):

Kernel size of depth-wise convolution.

use_fast_hash (bool, optional, defaults to True):

Whether or not to use custom cuda kernels which perform fast random projection via hadamard transform.

lsh_backward (bool, optional, defaults to True):

Whether or not to perform backpropagation using Locality Sensitive Hashing.
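
As a final illustration, a small YOSO model could be configured in the same way as the other architectures on this page. The fragment below is a hedged sketch; the model_type string "yoso" and the chosen values are assumptions, and use_expectation is left at its default of True (YOSO-E) so that, per the tips above, no custom CUDA kernels need to be compiled.

model_config:
  model_type: yoso
  pretrained_model: false
  position: embed
  pool: avg
  model_init_config:
    num_hidden_layers: 2     # library default is 12
    hidden_size: 64          # library default is 768
    num_attention_heads: 2   # must divide hidden_size evenly
    intermediate_size: 128   # library default is 3072
    use_expectation: true    # YOSO-E; avoids compiling the custom CUDA kernels
    hash_code_len: 9         # library default
    num_hash: 64             # library default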