Image Models

This page contains the list of external image models that can be used with EIR, coming from the great timm library.

There are 3 ways to use these models:

Configure and train specific architectures (e.g. ResNet with chosen number of layers) from scratch.
Train a pre-defined architecture (e.g. resnet18) from scratch.
Use a pre-trained model (e.g. resnet18) and fine-tune it.

Please refer to this page for more detailed information about configurable architectures, and this page for a list of pre-defined architectures, with the option of using pre-trained weights.

Configurable Models

The following models can be configured and trained from scratch.

The model type is specified in the model_type field of the configuration, while the model-specific configuration is specified in the model_init_config field.

For example, the ResNet architecture includes the layers and block parameters, and can be configured as follows:

input_configurable_image_model.yaml

input_info:
  input_source: eir_tutorials/a_using_eir/05_image_tutorial/data/hot_dog_not_hot_dog/food_images
  input_name: hot_dog
  input_type: image

input_type_info:
  mixing_subtype: "cutmix"
  size:
    - 64

model_config:
  model_type: "ResNet"
  model_init_config:
    layers: [1, 1, 1, 1]
    block: "BasicBlock"

interpretation_config:
    num_samples_to_interpret: 30
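
As a rough illustration of how the two fields relate, the sketch below looks up a constructor by model_type and passes model_init_config to it as keyword arguments, rejecting unknown keys first. The resnet_stub function, the MODEL_REGISTRY mapping and the build_model helper are hypothetical stand-ins written for this example; they are not part of EIR or timm, and only the field names model_type and model_init_config come from the configuration above.

```python
import inspect
from typing import Any, Callable, Dict, List


def resnet_stub(
    block: str, layers: List[int], num_classes: int = 1000, in_chans: int = 3
) -> Dict[str, Any]:
    """Hypothetical stand-in for a model constructor; it just echoes its arguments."""
    return {
        "block": block,
        "layers": layers,
        "num_classes": num_classes,
        "in_chans": in_chans,
    }


# Hypothetical registry mapping model_type strings to constructors.
MODEL_REGISTRY: Dict[str, Callable[..., Any]] = {"ResNet": resnet_stub}


def build_model(model_config: Dict[str, Any]) -> Any:
    """Look up model_type and pass model_init_config as keyword arguments."""
    constructor = MODEL_REGISTRY[model_config["model_type"]]
    init_config = model_config.get("model_init_config", {})

    # Reject keys the constructor does not accept, so typos fail early.
    accepted = set(inspect.signature(constructor).parameters)
    unknown = set(init_config) - accepted
    if unknown:
        raise ValueError(f"Unknown model_init_config keys: {sorted(unknown)}")

    return constructor(**init_config)


model_config = {
    "model_type": "ResNet",
    "model_init_config": {"layers": [1, 1, 1, 1], "block": "BasicBlock"},
}
model = build_model(model_config)
```

The point of the key check is that a misspelled parameter in model_init_config surfaces as an explicit error rather than being silently ignored.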

class timm.models.beit.Beit(img_size: int | ~typing.Tuple[int, int] = 224, patch_size: int | ~typing.Tuple[int, int] = 16, in_chans: int = 3, num_classes: int = 1000, global_pool: str = 'avg', embed_dim: int = 768, depth: int = 12, num_heads: int = 12, qkv_bias: bool = True, mlp_ratio: float = 4.0, swiglu_mlp: bool = False, scale_mlp: bool = False, drop_rate: float = 0.0, pos_drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, norm_layer: ~typing.Callable = <class 'timm.layers.norm.LayerNorm'>, init_values: float | None = None, use_abs_pos_emb: bool = True, use_rel_pos_bias: bool = False, use_shared_rel_pos_bias: bool = False, head_init_scale: float = 0.001): Vision Transformer with support for patch or hybrid CNN input stage
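
To make the img_size and patch_size parameters concrete, the following sketch computes the patch grid they imply. It assumes non-overlapping square or rectangular patches and floor division, which is the usual Vision Transformer convention; the helper names are hypothetical and the code is not taken from timm.

```python
from typing import Tuple, Union

IntOrPair = Union[int, Tuple[int, int]]


def to_pair(value: IntOrPair) -> Tuple[int, int]:
    """Expand a single int to (int, int); pass pairs through unchanged."""
    if isinstance(value, int):
        return (value, value)
    return (value[0], value[1])


def patch_grid(img_size: IntOrPair = 224, patch_size: IntOrPair = 16) -> Tuple[int, int]:
    """Number of patches along height and width, assuming non-overlapping patches."""
    height, width = to_pair(img_size)
    patch_h, patch_w = to_pair(patch_size)
    return (height // patch_h, width // patch_w)


grid = patch_grid(224, 16)
num_patches = grid[0] * grid[1]
```

With the defaults in the signature above (img_size=224, patch_size=16), this gives a 14 x 14 grid, i.e. 196 patch tokens before any class token is added.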

class timm.models.byobnet.ByobNet(cfg: ByoModelCfg, num_classes: int = 1000, in_chans: int = 3, global_pool: str = 'avg', output_stride: int = 32, img_size: int | Tuple[int, int] | None = None, drop_rate: float = 0.0, drop_path_rate: float = 0.0, zero_init_last: bool = True, **kwargs)

‘Bring-your-own-blocks’ Net

A flexible network backbone that allows building model stem + blocks via dataclass cfg definition w/ factory functions for module instantiation.

Current assumption is that both stem and blocks are in conv-bn-act order (w/ block ending in act).

class timm.models.cait.Cait(img_size=224, patch_size=16, in_chans=3, num_classes=1000, global_pool='token', embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, block_layers=<class 'timm.models.cait.LayerScaleBlock'>, block_layers_token=<class 'timm.models.cait.LayerScaleBlockClassAttn'>, patch_layer=<class 'timm.layers.patch_embed.PatchEmbed'>, norm_layer=functools.partial(<class 'torch.nn.modules.normalization.LayerNorm'>, eps=1e-06), act_layer=<class 'torch.nn.modules.activation.GELU'>, attn_block=<class 'timm.models.cait.TalkingHeadAttn'>, mlp_block=<class 'timm.layers.mlp.Mlp'>, init_values=0.0001, attn_block_token_only=<class 'timm.models.cait.ClassAttn'>, mlp_block_token_only=<class 'timm.layers.mlp.Mlp'>, depth_token_only=2, mlp_ratio_token_only=4.0)

class timm.models.coat.CoaT(img_size=224, patch_size=16, in_chans=3, num_classes=1000, embed_dims=(64, 128, 320, 512), serial_depths=(3, 4, 6, 3), parallel_depth=0, num_heads=8, mlp_ratios=(4, 4, 4, 4), qkv_bias=True, drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=<class 'timm.layers.norm.LayerNorm'>, return_interm_layers=False, out_features=None, crpe_window=None, global_pool='token'): CoaT class.

class timm.models.convit.ConVit(img_size=224, patch_size=16, in_chans=3, num_classes=1000, global_pool='token', embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=False, drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, hybrid_backbone=None, norm_layer=<class 'timm.layers.norm.LayerNorm'>, local_up_to_layer=3, locality_strength=1.0, use_pos_embed=True): Vision Transformer with support for patch or hybrid CNN input stage

class timm.models.convmixer.ConvMixer(dim, depth, kernel_size=9, patch_size=7, in_chans=3, num_classes=1000, global_pool='avg', drop_rate=0.0, act_layer=<class 'torch.nn.modules.activation.GELU'>, **kwargs)

class timm.models.convnext.ConvNeXt(in_chans: int = 3, num_classes: int = 1000, global_pool: str = 'avg', output_stride: int = 32, depths: Tuple[int, ...] = (3, 3, 9, 3), dims: Tuple[int, ...] = (96, 192, 384, 768), kernel_sizes: int | Tuple[int, ...] = 7, ls_init_value: float | None = 1e-06, stem_type: str = 'patch', patch_size: int = 4, head_init_scale: float = 1.0, head_norm_first: bool = False, head_hidden_size: int | None = None, conv_mlp: bool = False, conv_bias: bool = True, use_grn: bool = False, act_layer: str | Callable = 'gelu', norm_layer: str | Callable | None = None, norm_eps: float | None = None, drop_rate: float = 0.0, drop_path_rate: float = 0.0): A PyTorch impl of : A ConvNet for the 2020s - https://arxiv.org/pdf/2201.03545.pdf

class timm.models.crossvit.CrossVit(img_size=224, img_scale=(1.0, 1.0), patch_size=(8, 16), in_chans=3, num_classes=1000, embed_dim=(192, 384), depth=((1, 3, 1), (1, 3, 1), (1, 3, 1)), num_heads=(6, 12), mlp_ratio=(2.0, 2.0, 4.0), multi_conv=False, crop_scale=False, qkv_bias=True, drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=functools.partial(<class 'torch.nn.modules.normalization.LayerNorm'>, eps=1e-06), global_pool='token'): Vision Transformer with support for patch or hybrid CNN input stage

class timm.models.cspnet.CspNet(cfg: CspModelCfg, in_chans=3, num_classes=1000, output_stride=32, global_pool='avg', drop_rate=0.0, drop_path_rate=0.0, zero_init_last=True, **kwargs)

Cross Stage Partial base model.

Paper: CSPNet: A New Backbone that can Enhance Learning Capability of CNN - https://arxiv.org/abs/1911.11929
Ref Impl: https://github.com/WongKinYiu/CrossStagePartialNetworks

NOTE: There are differences in the way I handle the 1x1 ‘expansion’ conv in this impl vs the darknet impl. I did it this way for simplicity and fewer special cases.

class timm.models.davit.DaVit(in_chans=3, depths=(1, 1, 3, 1), embed_dims=(96, 192, 384, 768), num_heads=(3, 6, 12, 24), window_size=7, mlp_ratio=4, qkv_bias=True, norm_layer='layernorm2d', norm_layer_cl='layernorm', norm_eps=1e-05, attn_types=('spatial', 'channel'), ffn=True, cpe_act=False, drop_rate=0.0, drop_path_rate=0.0, num_classes=1000, global_pool='avg', head_norm_first=False)

A PyTorch implementation of DaViT: Dual Attention Vision Transformers - https://arxiv.org/abs/2204.03645

Supports arbitrary input sizes and pyramid feature extraction.

Parameters:

in_chans (int) – Number of input image channels. Default: 3
num_classes (int) – Number of classes for classification head. Default: 1000
depths (tuple(int)) – Number of blocks in each stage. Default: (1, 1, 3, 1)
embed_dims (tuple(int)) – Patch embedding dimension. Default: (96, 192, 384, 768)
num_heads (tuple(int)) – Number of attention heads in different layers. Default: (3, 6, 12, 24)
window_size (int) – Window size. Default: 7
mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim. Default: 4
qkv_bias (bool) – If True, add a learnable bias to query, key, value. Default: True
drop_path_rate (float) – Stochastic depth rate. Default: 0.0
norm_layer (nn.Module) – Normalization layer. Default: nn.LayerNorm.
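
The depths and drop_path_rate parameters interact when stochastic depth is enabled. A common convention, assumed here and not confirmed by this page, is to ramp the per-block drop rate linearly from 0 up to drop_path_rate across all blocks and then group the rates by stage. The helper below is a hypothetical sketch of that convention in plain Python.

```python
from typing import List, Sequence


def drop_path_schedule(depths: Sequence[int], drop_path_rate: float) -> List[List[float]]:
    """Linearly increasing per-block drop rates, grouped by stage (assumed convention)."""
    total_blocks = sum(depths)
    if total_blocks > 1:
        rates = [drop_path_rate * i / (total_blocks - 1) for i in range(total_blocks)]
    else:
        rates = [0.0] * total_blocks

    # Split the flat list of rates into one sub-list per stage.
    schedule = []
    start = 0
    for depth in depths:
        schedule.append(rates[start:start + depth])
        start += depth
    return schedule


schedule = drop_path_schedule(depths=(1, 1, 3, 1), drop_path_rate=0.1)
```

Under this convention the first block is never dropped and the last block uses the full drop_path_rate, so deeper blocks are regularized more strongly.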

class timm.models.deit.VisionTransformerDistilled(*args, **kwargs)

Vision Transformer w/ Distillation Token and Head

Distillation token & head support for DeiT: Data-efficient Image Transformers

https://arxiv.org/abs/2012.12877

class timm.models.densenet.DenseNet(growth_rate=32, block_config=(6, 12, 24, 16), num_classes=1000, in_chans=3, global_pool='avg', bn_size=4, stem_type='', act_layer='relu', norm_layer='batchnorm2d', aa_layer=None, drop_rate=0.0, proj_drop_rate=0.0, memory_efficient=False, aa_stem_only=True)

Densenet-BC model class, based on “Densely Connected Convolutional Networks”

Parameters:

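growth_rate (int) – how many filters to add each layer (k in paper)
block_config (list of 4 ints) – how many layers in each pooling block
bn_size (int) – multiplicative factor for number of bottleneck layers (i.e. bn_size * k features in the bottleneck layer)
drop_rate (float) – dropout rate before classifier layer
proj_drop_rate (float) – dropout rate after each dense layer
num_classes (int) – number of classification classes
memory_efficient (bool) – If True, uses checkpointing. Much more memory efficient, but slower. Default: False. See “paper”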
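
To show how growth_rate and block_config determine the channel counts, the sketch below tracks the number of feature channels at the end of each dense block. It assumes the stem produces 2 * growth_rate channels and that each transition between blocks halves the channel count, as in the DenseNet-BC design; both are assumptions for this example rather than statements about the exact timm implementation.

```python
from typing import List, Sequence


def densenet_feature_counts(
    growth_rate: int = 32, block_config: Sequence[int] = (6, 12, 24, 16)
) -> List[int]:
    """Channel count at the end of each dense block (assumed DenseNet-BC layout)."""
    num_features = 2 * growth_rate  # assumed stem width
    counts = []
    for index, num_layers in enumerate(block_config):
        # Every layer in a dense block concatenates growth_rate new channels.
        num_features += num_layers * growth_rate
        counts.append(num_features)
        # Assumed transition layer: halve channels between blocks, not after the last.
        if index != len(block_config) - 1:
            num_features //= 2
    return counts


counts = densenet_feature_counts()
```

With the default arguments from the signature above, the final block ends with 1024 channels, which is the feature width usually quoted for DenseNet-121.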

class timm.models.dla.DLA(levels, channels, output_stride=32, num_classes=1000, in_chans=3, global_pool='avg', cardinality=1, base_width=64, block=<class 'timm.models.dla.DlaBottle2neck'>, shortcut_root=False, drop_rate=0.0)

class timm.models.dpn.DPN(k_sec=(3, 4, 20, 3), inc_sec=(16, 32, 24, 128), k_r=96, groups=32, num_classes=1000, in_chans=3, output_stride=32, global_pool='avg', small=False, num_init_features=64, b=False, drop_rate=0.0, norm_layer='batchnorm2d', act_layer='relu', fc_act_layer='elu')

class timm.models.edgenext.EdgeNeXt(in_chans=3, num_classes=1000, global_pool='avg', dims=(24, 48, 88, 168), depths=(3, 3, 9, 3), global_block_counts=(0, 1, 1, 1), kernel_sizes=(3, 5, 7, 9), heads=(8, 8, 8, 8), d2_scales=(2, 2, 3, 4), use_pos_emb=(False, True, False, False), ls_init_value=1e-06, head_init_scale=1.0, expand_ratio=4, downsample_block=False, conv_bias=True, stem_type='patch', head_norm_first=False, act_layer=<class 'torch.nn.modules.activation.GELU'>, drop_path_rate=0.0, drop_rate=0.0)

class timm.models.efficientformer.EfficientFormer(depths, embed_dims=None, in_chans=3, num_classes=1000, global_pool='avg', downsamples=None, num_vit=0, mlp_ratios=4, pool_size=3, layer_scale_init_value=1e-05, act_layer=<class 'torch.nn.modules.activation.GELU'>, norm_layer=<class 'torch.nn.modules.batchnorm.BatchNorm2d'>, norm_layer_cl=<class 'torch.nn.modules.normalization.LayerNorm'>, drop_rate=0.0, proj_drop_rate=0.0, drop_path_rate=0.0, **kwargs)

class timm.models.efficientnet.EfficientNet(block_args, num_classes=1000, num_features=1280, in_chans=3, stem_size=32, fix_stem=False, output_stride=32, pad_type='', round_chs_fn=<function round_channels>, act_layer=None, norm_layer=None, se_layer=None, drop_rate=0.0, drop_path_rate=0.0, global_pool='avg')

A flexible and performant PyTorch implementation of efficient network architectures, including:

EfficientNet-V2 Small, Medium, Large, XL & B0-B3
EfficientNet B0-B8, L2
EfficientNet-EdgeTPU
EfficientNet-CondConv
MixNet S, M, L, XL
MnasNet A1, B1, and small
MobileNet-V2
FBNet C
Single-Path NAS Pixel1
TinyNet

class timm.models.efficientvit_mit.EfficientVit(in_chans=3, widths=(), depths=(), head_dim=32, expand_ratio=4, norm_layer=<class 'torch.nn.modules.batchnorm.BatchNorm2d'>, act_layer=<class 'torch.nn.modules.activation.Hardswish'>, global_pool='avg', head_widths=(), drop_rate=0.0, num_classes=1000)

class timm.models.efficientvit_msra.EfficientVitMsra(img_size=224, in_chans=3, num_classes=1000, embed_dim=(64, 128, 192), key_dim=(16, 16, 16), depth=(1, 2, 3), num_heads=(4, 4, 4), window_size=(7, 7, 7), kernels=(5, 5, 5, 5), down_ops=(('', 1), ('subsample', 2), ('subsample', 2)), global_pool='avg', drop_rate=0.0)

class timm.models.eva.Eva(img_size: int | ~typing.Tuple[int, int] = 224, patch_size: int | ~typing.Tuple[int, int] = 16, in_chans: int = 3, num_classes: int = 1000, global_pool: str = 'avg', embed_dim: int = 768, depth: int = 12, num_heads: int = 12, qkv_bias: bool = True, qkv_fused: bool = True, mlp_ratio: float = 4.0, swiglu_mlp: bool = False, scale_mlp: bool = False, scale_attn_inner: bool = False, drop_rate: float = 0.0, pos_drop_rate: float = 0.0, patch_drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, norm_layer: ~typing.Callable = <class 'timm.layers.norm.LayerNorm'>, init_values: float | None = None, class_token: bool = True, use_abs_pos_emb: bool = True, use_rot_pos_emb: bool = False, use_post_norm: bool = False, dynamic_img_size: bool = False, dynamic_img_pad: bool = False, ref_feat_shape: int | ~typing.Tuple[int, int] | None = None, head_init_scale: float = 0.001)

Eva Vision Transformer w/ Abs & Rotary Pos Embed

This class implements the EVA and EVA02 models that were based on the BEiT ViT variant

EVA - abs pos embed, global avg pool
EVA02 - abs + rope pos embed, global avg pool, SwiGLU, scale Norm in MLP (ala normformer)

class timm.models.focalnet.FocalNet(in_chans: int = 3, num_classes: int = 1000, global_pool: str = 'avg', embed_dim: int = 96, depths: ~typing.Tuple[int, ...] = (2, 2, 6, 2), mlp_ratio: float = 4.0, focal_levels: ~typing.Tuple[int, ...] = (2, 2, 2, 2), focal_windows: ~typing.Tuple[int, ...] = (3, 3, 3, 3), use_overlap_down: bool = False, use_post_norm: bool = False, use_post_norm_in_modulation: bool = False, normalize_modulator: bool = False, head_hidden_size: int | None = None, head_init_scale: float = 1.0, layerscale_value: float | None = None, drop_rate: bool = 0.0, proj_drop_rate: bool = 0.0, drop_path_rate: bool = 0.1, norm_layer: ~typing.Callable = functools.partial(<class 'timm.layers.norm.LayerNorm2d'>, eps=1e-05)): Focal Modulation Networks (FocalNets)

class timm.models.gcvit.GlobalContextVit(in_chans: int = 3, num_classes: int = 1000, global_pool: str = 'avg', img_size: Tuple[int, int] = 224, window_ratio: Tuple[int, ...] = (32, 32, 16, 32), window_size: Tuple[int, ...] = None, embed_dim: int = 64, depths: Tuple[int, ...] = (3, 4, 19, 5), num_heads: Tuple[int, ...] = (2, 4, 8, 16), mlp_ratio: float = 3.0, qkv_bias: bool = True, layer_scale: float | None = None, drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, weight_init='', act_layer: str = 'gelu', norm_layer: str = 'layernorm2d', norm_layer_cl: str = 'layernorm', norm_eps: float = 1e-05)

class timm.models.ghostnet.GhostNet(cfgs, num_classes=1000, width=1.0, in_chans=3, output_stride=32, global_pool='avg', drop_rate=0.2, version='v1')

class timm.models.hgnet.HighPerfGpuNet(cfg, in_chans=3, num_classes=1000, global_pool='avg', use_last_conv=True, class_expand=2048, drop_rate=0.0, drop_path_rate=0.0, use_lab=False, **kwargs)

class timm.models.hrnet.HighResolutionNet(cfg, in_chans=3, num_classes=1000, output_stride=32, global_pool='avg', drop_rate=0.0, head='classification', **kwargs)

class timm.models.inception_resnet_v2.InceptionResnetV2(num_classes=1000, in_chans=3, drop_rate=0.0, output_stride=32, global_pool='avg', norm_layer='batchnorm2d', norm_eps=0.001, act_layer='relu')

class timm.models.inception_v3.InceptionV3(num_classes=1000, in_chans=3, drop_rate=0.0, global_pool='avg', aux_logits=False, norm_layer='batchnorm2d', norm_eps=0.001, act_layer='relu'): Inception-V3

class timm.models.inception_v4.InceptionV4(num_classes=1000, in_chans=3, output_stride=32, drop_rate=0.0, global_pool='avg', norm_layer='batchnorm2d', norm_eps=0.001, act_layer='relu')

class timm.models.levit.Levit(img_size=224, in_chans=3, num_classes=1000, embed_dim=(192,), key_dim=64, depth=(12,), num_heads=(3,), attn_ratio=2.0, mlp_ratio=2.0, stem_backbone=None, stem_stride=None, stem_type='s16', down_op='subsample', act_layer='hard_swish', attn_act_layer=None, use_conv=False, global_pool='avg', drop_rate=0.0, drop_path_rate=0.0)

Vision Transformer with support for patch or hybrid CNN input stage

NOTE: distillation is defaulted to True since pretrained weights use it; this will cause problems w/ train scripts that don’t take tuple outputs.

class timm.models.maxxvit.MaxxVitCfg(embed_dim: Tuple[int, ...] = (96, 192, 384, 768), depths: Tuple[int, ...] = (2, 3, 5, 2), block_type: Tuple[Union[str, Tuple[str, ...]], ...] = ('C', 'C', 'T', 'T'), stem_width: Union[int, Tuple[int, int]] = 64, stem_bias: bool = False, conv_cfg: timm.models.maxxvit.MaxxVitConvCfg = <factory>, transformer_cfg: timm.models.maxxvit.MaxxVitTransformerCfg = <factory>, head_hidden_size: int = None, weight_init: str = 'vit_eff')

class timm.models.metaformer.MetaFormer(in_chans=3, num_classes=1000, global_pool='avg', depths=(2, 2, 6, 2), dims=(64, 128, 320, 512), token_mixers=<class 'timm.models.metaformer.Pooling'>, mlp_act=<class 'timm.models.metaformer.StarReLU'>, mlp_bias=False, drop_path_rate=0.0, proj_drop_rate=0.0, drop_rate=0.0, layer_scale_init_values=None, res_scale_init_values=(None, None, 1.0, 1.0), downsample_norm=<class 'timm.models.metaformer.LayerNorm2dNoBias'>, norm_layers=<class 'timm.models.metaformer.LayerNorm2dNoBias'>, output_norm=<class 'timm.layers.norm.LayerNorm2d'>, use_mlp_head=True, **kwargs)

A PyTorch impl of: MetaFormer Baselines for Vision - https://arxiv.org/abs/2210.13452

Parameters:

in_chans (int) – Number of input image channels.
num_classes (int) – Number of classes for classification head.
global_pool – Pooling for classifier head.
depths (list or tuple) – Number of blocks at each stage.
dims (list or tuple) – Feature dimension at each stage.
token_mixers (list, tuple or token_fcn) – Token mixer for each stage.
mlp_act – Activation layer for MLP.
mlp_bias (boolean) – Enable or disable mlp bias term.
drop_path_rate (float) – Stochastic depth rate.
drop_rate (float) – Dropout rate.
layer_scale_init_values (list, tuple, float or None) – Init value for Layer Scale. None means not use the layer scale. From: https://arxiv.org/abs/2103.17239.
res_scale_init_values (list, tuple, float or None) – Init value for res Scale on residual connections. None means not use the res scale. From: https://arxiv.org/abs/2110.09456.
downsample_norm (nn.Module) – Norm layer used in stem and downsampling layers.
norm_layers (list, tuple or norm_fcn) – Norm layers for each stage.
output_norm – Norm layer before classifier head.
use_mlp_head – Use MLP classification head.

class timm.models.mobilenetv3.MobileNetV3(block_args: ~typing.List[~typing.List[~typing.Dict[str, ~typing.Any]]], num_classes: int = 1000, in_chans: int = 3, stem_size: int = 16, fix_stem: bool = False, num_features: int = 1280, head_bias: bool = True, pad_type: str | int | ~typing.Tuple[int, int] = '', act_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, norm_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, se_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, se_from_exp: bool = True, round_chs_fn: ~typing.Callable = <function round_channels>, drop_rate: float = 0.0, drop_path_rate: float = 0.0, global_pool: str = 'avg')

MobileNet-V3

Based on my EfficientNet implementation and building blocks, this model utilizes the MobileNet-v3 specific ‘efficient head’, where global pooling is done before the head convolution without a final batch-norm layer before the classifier.

Paper: Searching for MobileNetV3 - https://arxiv.org/abs/1905.02244

Other architectures utilizing MobileNet-V3 efficient head that are supported by this impl include:

HardCoRe-NAS - https://arxiv.org/abs/2102.11646 (defn in hardcorenas.py uses this class)
FBNet-V3 - https://arxiv.org/abs/2006.02049
LCNet - https://arxiv.org/abs/2109.15099

class timm.models.mvitv2.MultiScaleVit(cfg: MultiScaleVitCfg, img_size: Tuple[int, int] = (224, 224), in_chans: int = 3, global_pool: str | None = None, num_classes: int = 1000, drop_path_rate: float = 0.0, drop_rate: float = 0.0)

Improved Multiscale Vision Transformers for Classification and Detection. Yanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer*. https://arxiv.org/abs/2112.01526

Multiscale Vision Transformers. Haoqi Fan*, Bo Xiong*, Karttikeya Mangalam*, Yanghao Li*, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer*. https://arxiv.org/abs/2104.11227

class timm.models.nasnet.NASNetALarge(num_classes=1000, in_chans=3, stem_size=96, channel_multiplier=2, num_features=4032, output_stride=32, drop_rate=0.0, global_pool='avg', pad_type='same'): NASNetALarge (6 @ 4032)

class timm.models.nest.Nest(img_size=224, in_chans=3, patch_size=4, num_levels=3, embed_dims=(128, 256, 512), num_heads=(4, 8, 16), depths=(2, 2, 20), num_classes=1000, mlp_ratio=4.0, qkv_bias=True, drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.5, norm_layer=None, act_layer=None, pad_type='', weight_init='', global_pool='avg')

Nested Transformer (NesT)

A PyTorch impl of: Aggregating Nested Transformers - https://arxiv.org/abs/2105.12723

class timm.models.nfnet.NormFreeNet(cfg: NfCfg, num_classes: int = 1000, in_chans: int = 3, global_pool: str = 'avg', output_stride: int = 32, drop_rate: float = 0.0, drop_path_rate: float = 0.0, **kwargs)

Normalization-Free Network

As described in:

Characterizing signal propagation to close the performance gap in unnormalized ResNets - https://arxiv.org/abs/2101.08692
High-Performance Large-Scale Image Recognition Without Normalization - https://arxiv.org/abs/2102.06171

This model aims to cover both the NFRegNet-Bx models as detailed in the paper’s code snippets and the (preact) ResNet models described earlier in the paper.

There are a few differences:

channels are rounded to be divisible by 8 by default (keep tensor core kernels happy), this changes channel dim and param counts slightly from the paper models
activation correcting gamma constants are moved into the ScaledStdConv as it has less performance impact in PyTorch when done with the weight scaling there. This likely wasn’t a concern in the JAX impl.
a config option gamma_in_act can be enabled to not apply gamma in StdConv as described above, but apply it in each activation. This is slightly slower, numerically different, but matches official impl.
skipinit is disabled by default, it seems to have a rather drastic impact on GPU memory use and throughput for what it is/does. Approx 8-10% throughput loss.
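
The first difference, rounding channel counts so they are divisible by 8, can be sketched as follows. The function name is hypothetical, and the round_limit guard (bumping the result up when rounding would shrink the width by more than 10%) is an assumption borrowed from common practice in efficient-network code, not something stated on this page.

```python
def round_channels_to_multiple(
    channels: float, divisor: int = 8, round_limit: float = 0.9
) -> int:
    """Round a channel count to the nearest multiple of divisor (never below divisor)."""
    rounded = max(divisor, int(channels + divisor / 2) // divisor * divisor)
    # Assumed guard: do not let rounding reduce the width by more than ~10%.
    if rounded < round_limit * channels:
        rounded += divisor
    return rounded


examples = {c: round_channels_to_multiple(c) for c in (3, 11, 30, 64, 100)}
```

Because widths move to the nearest multiple of 8, channel dimensions and parameter counts differ slightly from the unrounded values, which is the effect the list item describes.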

class timm.models.pit.PoolingVisionTransformer(img_size: int = 224, patch_size: int = 16, stride: int = 8, stem_type: str = 'overlap', base_dims: Sequence[int] = (48, 48, 48), depth: Sequence[int] = (2, 6, 4), heads: Sequence[int] = (2, 4, 8), mlp_ratio: float = 4, num_classes=1000, in_chans=3, global_pool='token', distilled=False, drop_rate=0.0, pos_drop_drate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0)

Pooling-based Vision Transformer

A PyTorch implementation of ‘Rethinking Spatial Dimensions of Vision Transformers’

https://arxiv.org/abs/2103.16302

class timm.models.pnasnet.PNASNet5Large(num_classes=1000, in_chans=3, output_stride=32, drop_rate=0.0, global_pool='avg', pad_type='')

class timm.models.pvt_v2.PyramidVisionTransformerV2(in_chans=3, num_classes=1000, global_pool='avg', depths=(3, 4, 6, 3), embed_dims=(64, 128, 256, 512), num_heads=(1, 2, 4, 8), sr_ratios=(8, 4, 2, 1), mlp_ratios=(8.0, 8.0, 4.0, 4.0), qkv_bias=True, linear=False, drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=<class 'timm.layers.norm.LayerNorm'>)

class timm.models.regnet.RegNet(cfg: RegNetCfg, in_chans=3, num_classes=1000, output_stride=32, global_pool='avg', drop_rate=0.0, drop_path_rate=0.0, zero_init_last=True, **kwargs)

RegNet-X, Y, and Z Models

Paper: https://arxiv.org/abs/2003.13678
Original Impl: https://github.com/facebookresearch/pycls/blob/master/pycls/models/regnet.py

class timm.models.repghost.RepGhostNet(cfgs, num_classes=1000, width=1.0, in_chans=3, output_stride=32, global_pool='avg', drop_rate=0.2, reparam=True)

class timm.models.repvit.RepVit(in_chans=3, img_size=224, embed_dim=(48, ), depth=(2, ), mlp_ratio=2, global_pool='avg', kernel_size=3, num_classes=1000, act_layer=<class 'torch.nn.modules.activation.GELU'>, distillation=True, drop_rate=0.0, legacy=False)

class timm.models.resnet.ResNet(block: ~timm.models.resnet.BasicBlock | ~timm.models.resnet.Bottleneck, layers: ~typing.List[int], num_classes: int = 1000, in_chans: int = 3, output_stride: int = 32, global_pool: str = 'avg', cardinality: int = 1, base_width: int = 64, stem_width: int = 64, stem_type: str = '', replace_stem_pool: bool = False, block_reduce_first: int = 1, down_kernel_size: int = 1, avg_down: bool = False, act_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>, norm_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.batchnorm.BatchNorm2d'>, aa_layer: ~typing.Type[~torch.nn.modules.module.Module] | None = None, drop_rate: float = 0.0, drop_path_rate: float = 0.0, drop_block_rate: float = 0.0, zero_init_last: bool = True, block_args: ~typing.Dict[str, ~typing.Any] | None = None)

ResNet / ResNeXt / SE-ResNeXt / SE-Net

This class implements all variants of ResNet, ResNeXt, SE-ResNeXt, and SENet that

have > 1 stride in the 3x3 conv layer of bottleneck
have conv-bn-act ordering

This ResNet impl supports a number of stem and downsample options based on the v1c, v1d, v1e, and v1s variants included in the MXNet Gluon ResNetV1b model. The C and D variants are also discussed in the ‘Bag of Tricks’ paper: https://arxiv.org/pdf/1812.01187. The B variant is equivalent to torchvision default.

ResNet variants (the same modifications can be used in SE/ResNeXt models as well):

normal, b - 7x7 stem, stem_width = 64, same as torchvision ResNet, NVIDIA ResNet ‘v1.5’, Gluon v1b
c - 3 layer deep 3x3 stem, stem_width = 32 (32, 32, 64)
d - 3 layer deep 3x3 stem, stem_width = 32 (32, 32, 64), average pool in downsample
e - 3 layer deep 3x3 stem, stem_width = 64 (64, 64, 128), average pool in downsample
s - 3 layer deep 3x3 stem, stem_width = 64 (64, 64, 128)
t - 3 layer deep 3x3 stem, stem width = 32 (24, 48, 64), average pool in downsample
tn - 3 layer deep 3x3 stem, stem width = 32 (24, 32, 64), average pool in downsample
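
The variant list above can be captured as a small lookup table, which makes it easier to compare variants programmatically. The StemVariant dataclass and RESNET_VARIANTS mapping are hypothetical names for this sketch; the values are transcribed from the list, and nothing here calls into timm.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class StemVariant:
    """Hypothetical record describing one ResNet stem/downsample variant."""
    deep_stem: bool  # True for the 3 layer deep 3x3 stem, False for the 7x7 stem
    stem_width: int
    stem_channels: Optional[Tuple[int, int, int]]  # per-layer widths of the deep stem
    avg_pool_downsample: bool


# Values transcribed from the variant list above.
RESNET_VARIANTS: Dict[str, StemVariant] = {
    "b": StemVariant(False, 64, None, False),
    "c": StemVariant(True, 32, (32, 32, 64), False),
    "d": StemVariant(True, 32, (32, 32, 64), True),
    "e": StemVariant(True, 64, (64, 64, 128), True),
    "s": StemVariant(True, 64, (64, 64, 128), False),
    "t": StemVariant(True, 32, (24, 48, 64), True),
    "tn": StemVariant(True, 32, (24, 32, 64), True),
}

# Variants that use average pooling in the downsample path.
avg_down_variants = sorted(
    name for name, variant in RESNET_VARIANTS.items() if variant.avg_pool_downsample
)
```

Reading the table this way shows, for example, that c and d differ only in the downsample path, as do s and e.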

ResNeXt

normal - 7x7 stem, stem_width = 64, standard cardinality and base widths
same c, d, e, s variants as ResNet can be enabled

SE-ResNeXt

normal - 7x7 stem, stem_width = 64
same c, d, e, s variants as ResNet can be enabled

SENet-154 - 3 layer deep 3x3 stem (same as v1c-v1s), stem_width = 64, cardinality=64, reduction by 2 on width of first bottleneck convolution, 3x3 downsample convs after first block

class timm.models.resnetv2.ResNetV2(layers, channels=(256, 512, 1024, 2048), num_classes=1000, in_chans=3, global_pool='avg', output_stride=32, width_factor=1, stem_chs=64, stem_type='', avg_down=False, preact=True, act_layer=<class 'torch.nn.modules.activation.ReLU'>, norm_layer=functools.partial(<class 'timm.layers.norm_act.GroupNormAct'>, num_groups=32), conv_layer=<class 'timm.layers.std_conv.StdConv2d'>, drop_rate=0.0, drop_path_rate=0.0, zero_init_last=False): Implementation of Pre-activation (v2) ResNet model.

class timm.models.rexnet.RexNet(in_chans=3, num_classes=1000, global_pool='avg', output_stride=32, initial_chs=16, final_chs=180, width_mult=1.0, depth_mult=1.0, se_ratio=0.08333333333333333, ch_div=1, act_layer='swish', dw_act_layer='relu6', drop_rate=0.2, drop_path_rate=0.0)

class timm.models.selecsls.SelecSls(cfg, num_classes=1000, in_chans=3, drop_rate=0.0, global_pool='avg')

SelecSls42 / SelecSls60 / SelecSls84

Parameters:

cfg (network config dictionary specifying block type, feature, and head args)
num_classes (int, default 1000) – Number of classification classes.
in_chans (int, default 3) – Number of input (color) channels.
drop_rate (float, default 0.) – Dropout probability before classifier, for training
global_pool (str, default 'avg') – Global pooling type. One of ‘avg’, ‘max’, ‘avgmax’, ‘catavgmax’
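
The four global_pool options can be illustrated on a tiny feature map in plain Python. The sketch assumes that ‘avgmax’ means the mean of the average-pooled and max-pooled values and that ‘catavgmax’ concatenates the two along the channel dimension; this is my reading of the option names and should be checked against the pooling implementation in use. The function name is hypothetical.

```python
from typing import List

FeatureMaps = List[List[List[float]]]  # channels x height x width


def global_pool(feature_maps: FeatureMaps, pool_type: str = "avg") -> List[float]:
    """Reduce each channel's 2D map to a scalar according to pool_type."""
    avg = [
        sum(sum(row) for row in channel) / sum(len(row) for row in channel)
        for channel in feature_maps
    ]
    peak = [max(max(row) for row in channel) for channel in feature_maps]

    if pool_type == "avg":
        return avg
    if pool_type == "max":
        return peak
    if pool_type == "avgmax":
        # Assumed: element-wise mean of the average and max results.
        return [0.5 * (a + m) for a, m in zip(avg, peak)]
    if pool_type == "catavgmax":
        # Assumed: concatenation, which doubles the output feature count.
        return avg + peak
    raise ValueError(f"Unknown pool_type: {pool_type}")


features = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
pooled = {name: global_pool(features, name) for name in ("avg", "max", "avgmax", "catavgmax")}
```

Note that under this reading ‘catavgmax’ is the only option that changes the number of features passed to the classifier.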

class timm.models.senet.SENet(block, layers, groups, reduction, drop_rate=0.2, in_chans=3, inplanes=64, input_3x3=False, downsample_kernel_size=1, downsample_padding=0, num_classes=1000, global_pool='avg')

class timm.models.sequencer.Sequencer2d(num_classes=1000, img_size=224, in_chans=3, global_pool='avg', layers=(4, 3, 8, 3), patch_sizes=(7, 2, 2, 1), embed_dims=(192, 384, 384, 384), hidden_sizes=(48, 96, 96, 96), mlp_ratios=(3.0, 3.0, 3.0, 3.0), block_layer=<class 'timm.models.sequencer.Sequencer2dBlock'>, rnn_layer=<class 'timm.models.sequencer.LSTM2d'>, mlp_layer=<class 'timm.layers.mlp.Mlp'>, norm_layer=functools.partial(<class 'torch.nn.modules.normalization.LayerNorm'>, eps=1e-06), act_layer=<class 'torch.nn.modules.activation.GELU'>, num_rnn_layers=1, bidirectional=True, union='cat', with_fc=True, drop_rate=0.0, drop_path_rate=0.0, nlhb=False, stem_norm=False)

class timm.models.swin_transformer.SwinTransformer(img_size: int | ~typing.Tuple[int, int] = 224, patch_size: int = 4, in_chans: int = 3, num_classes: int = 1000, global_pool: str = 'avg', embed_dim: int = 96, depths: ~typing.Tuple[int, ...] = (2, 2, 6, 2), num_heads: ~typing.Tuple[int, ...] = (3, 6, 12, 24), head_dim: int | None = None, window_size: int | ~typing.Tuple[int, int] = 7, mlp_ratio: float = 4.0, qkv_bias: bool = True, drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.1, embed_layer: ~typing.Callable = <class 'timm.layers.patch_embed.PatchEmbed'>, norm_layer: str | ~typing.Callable = <class 'torch.nn.modules.normalization.LayerNorm'>, weight_init: str = '', **kwargs)

Swin Transformer

A PyTorch impl of: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows - https://arxiv.org/pdf/2103.14030

class timm.models.swin_transformer_v2.SwinTransformerV2(img_size: int | ~typing.Tuple[int, int] = 224, patch_size: int = 4, in_chans: int = 3, num_classes: int = 1000, global_pool: str = 'avg', embed_dim: int = 96, depths: ~typing.Tuple[int, ...] = (2, 2, 6, 2), num_heads: ~typing.Tuple[int, ...] = (3, 6, 12, 24), window_size: int | ~typing.Tuple[int, int] = 7, mlp_ratio: float = 4.0, qkv_bias: bool = True, drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.1, norm_layer: ~typing.Callable = <class 'torch.nn.modules.normalization.LayerNorm'>, pretrained_window_sizes: ~typing.Tuple[int, ...] = (0, 0, 0, 0), **kwargs)

Swin Transformer V2

A PyTorch impl of: Swin Transformer V2: Scaling Up Capacity and Resolution - https://arxiv.org/abs/2111.09883

class timm.models.swin_transformer_v2_cr.SwinTransformerV2Cr(img_size: ~typing.Tuple[int, int] = (224, 224), patch_size: int = 4, window_size: int | None = None, img_window_ratio: int = 32, in_chans: int = 3, num_classes: int = 1000, embed_dim: int = 96, depths: ~typing.Tuple[int, ...] = (2, 2, 6, 2), num_heads: ~typing.Tuple[int, ...] = (3, 6, 12, 24), mlp_ratio: float = 4.0, init_values: float | None = 0.0, drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, norm_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.normalization.LayerNorm'>, extra_norm_period: int = 0, extra_norm_stage: bool = False, sequential_attn: bool = False, global_pool: str = 'avg', weight_init='skip', **kwargs: ~typing.Any)

Swin Transformer V2

A PyTorch impl of: Swin Transformer V2: Scaling Up Capacity and Resolution - https://arxiv.org/pdf/2111.09883

Parameters:

img_size – Input resolution.
window_size – Window size. If None, computed as img_size // img_window_ratio.
img_window_ratio – Image size to window size ratio, used as the divisor when window_size is None.
patch_size – Patch size.
in_chans – Number of input channels.
depths – Depth of the stage (number of layers).
num_heads – Number of attention heads to be utilized.
embed_dim – Patch embedding dimension.
num_classes – Number of output classes.
mlp_ratio – Ratio of the hidden dimension in the FFN to the input channels.
drop_rate – Dropout rate.
proj_drop_rate – Projection dropout rate.
attn_drop_rate – Dropout rate of attention map.
drop_path_rate – Stochastic depth rate.
norm_layer – Type of normalization layer to be utilized.
extra_norm_period – Insert an extra norm layer on the main branch every N (period) blocks in a stage.
extra_norm_stage – End each stage with an extra norm layer in the main branch.
sequential_attn – If True, sequential self-attention is performed.
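The drop_path_rate parameter is a single number, while the network has sum(depths) blocks. A common convention for stochastic depth is to increase the per-block rate linearly from 0 at the first block to drop_path_rate at the last one. The sketch below illustrates that convention in plain Python; the linear rule is an assumption about how the rate is spread, and drop_path_schedule is a hypothetical helper, not a timm or EIR function.

```python
from typing import List, Sequence


def drop_path_schedule(
    drop_path_rate: float, depths: Sequence[int]
) -> List[List[float]]:
    """Split a linearly increasing stochastic depth rate across stages.

    Hypothetical helper. Assumes the rate grows linearly from 0.0 at the
    first block to drop_path_rate at the last block of the network.
    """
    total_blocks = sum(depths)
    if total_blocks <= 1:
        rates = [0.0] * total_blocks
    else:
        rates = [drop_path_rate * i / (total_blocks - 1) for i in range(total_blocks)]

    # Slice the flat list into one sub-list per stage.
    per_stage = []
    start = 0
    for depth in depths:
        per_stage.append(rates[start:start + depth])
        start += depth
    return per_stage


# Illustrative value 0.1 with the default depths (2, 2, 6, 2).
for stage_rates in drop_path_schedule(0.1, (2, 2, 6, 2)):
    print([round(rate, 3) for rate in stage_rates])
```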

get_classifier() → Module

Method returns the classification head of the model.

Returns: Current classification head

Return type: head (nn.Module)

reset_classifier(num_classes: int, global_pool: str | None = None) → None

Method resets the classification head

Parameters:

num_classes (int) – Number of classes to be predicted
global_pool (str) – Unused

update_input_size(new_img_size: Tuple[int, int] | None = None, new_window_size: int | None = None, img_window_ratio: int = 32) → None

Method updates the image resolution to be processed and the window size, and with them the pair-wise relative positions.

Parameters:

new_window_size (Optional[int]) – New window size; if None, it is derived as new_img_size // img_window_ratio
new_img_size (Optional[Tuple[int, int]]) – New input resolution; if None, the current resolution is used
img_window_ratio (int) – Divisor for calculating the window size from the image size
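The fallback rule for the window size described above can be sketched in a few lines of plain Python. This is only an illustration of the documented rule (integer division of the image size by img_window_ratio when no window size is given); derive_window_size is a hypothetical helper and not part of timm or EIR.

```python
from typing import Optional, Tuple


def derive_window_size(
    img_size: Tuple[int, int],
    img_window_ratio: int = 32,
    window_size: Optional[int] = None,
) -> Tuple[int, int]:
    """Return the (height, width) window size.

    Hypothetical helper. An explicit window_size wins; otherwise each
    image dimension is integer-divided by img_window_ratio.
    """
    if window_size is not None:
        return (window_size, window_size)
    if img_window_ratio <= 0:
        raise ValueError("img_window_ratio must be positive")
    return (img_size[0] // img_window_ratio, img_size[1] // img_window_ratio)


print(derive_window_size((224, 224)))                 # (7, 7)
print(derive_window_size((384, 384)))                 # (12, 12)
print(derive_window_size((224, 224), window_size=8))  # (8, 8)
```

With the default ratio of 32, small inputs such as 64 x 64 images would give a 2 x 2 window, so for low-resolution data it may be worth setting window_size or img_window_ratio explicitly.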

class timm.models.tiny_vit.TinyVit(in_chans=3, num_classes=1000, global_pool='avg', embed_dims=(96, 192, 384, 768), depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24), window_sizes=(7, 7, 14, 7), mlp_ratio=4.0, drop_rate=0.0, drop_path_rate=0.1, use_checkpoint=False, mbconv_expand_ratio=4.0, local_conv_size=3, act_layer=<class 'torch.nn.modules.activation.GELU'>)

class timm.models.tnt.TNT(img_size=224, patch_size=16, in_chans=3, num_classes=1000, global_pool='token', embed_dim=768, inner_dim=48, depth=12, num_heads_inner=4, num_heads_outer=12, mlp_ratio=4.0, qkv_bias=False, drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=<class 'torch.nn.modules.normalization.LayerNorm'>, first_stride=4): Transformer in Transformer - https://arxiv.org/abs/2103.00112

class timm.models.tresnet.TResNet(layers, in_chans=3, num_classes=1000, width_factor=1.0, v2=False, global_pool='fast', drop_rate=0.0, drop_path_rate=0.0)

class timm.models.twins.Twins(img_size=224, patch_size=4, in_chans=3, num_classes=1000, global_pool='avg', embed_dims=(64, 128, 256, 512), num_heads=(1, 2, 4, 8), mlp_ratios=(4, 4, 4, 4), depths=(3, 4, 6, 3), sr_ratios=(8, 4, 2, 1), wss=None, drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=functools.partial(<class 'torch.nn.modules.normalization.LayerNorm'>, eps=1e-06), block_cls=<class 'timm.models.twins.Block'>)

Twins Vision Transformer (Revisiting Spatial Attention)

Adapted from PVT (PyramidVisionTransformer) class at https://github.com/whai362/PVT.git

class timm.models.vgg.VGG(cfg: ~typing.List[~typing.Any], num_classes: int = 1000, in_chans: int = 3, output_stride: int = 32, mlp_ratio: float = 1.0, act_layer: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.activation.ReLU'>, conv_layer: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.conv.Conv2d'>, norm_layer: ~torch.nn.modules.module.Module = None, global_pool: str = 'avg', drop_rate: float = 0.0)

class timm.models.visformer.Visformer(img_size=224, patch_size=16, in_chans=3, num_classes=1000, init_channels=32, embed_dim=384, depth=12, num_heads=6, mlp_ratio=4.0, drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=<class 'timm.layers.norm.LayerNorm2d'>, attn_stage='111', use_pos_embed=True, spatial_conv='111', vit_stem=False, group=8, global_pool='avg', conv_init=False, embed_norm=None)

class timm.models.vision_transformer.VisionTransformer(img_size: int | ~typing.Tuple[int, int] = 224, patch_size: int | ~typing.Tuple[int, int] = 16, in_chans: int = 3, num_classes: int = 1000, global_pool: ~typing.Literal['', 'avg', 'token', 'map'] = 'token', embed_dim: int = 768, depth: int = 12, num_heads: int = 12, mlp_ratio: float = 4.0, qkv_bias: bool = True, qk_norm: bool = False, init_values: float | None = None, class_token: bool = True, no_embed_class: bool = False, reg_tokens: int = 0, pre_norm: bool = False, fc_norm: bool | None = None, dynamic_img_size: bool = False, dynamic_img_pad: bool = False, drop_rate: float = 0.0, pos_drop_rate: float = 0.0, patch_drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, weight_init: ~typing.Literal['skip', 'jax', 'jax_nlhb', 'moco', ''] = '', fix_init: bool = False, embed_layer: ~typing.Callable = <class 'timm.layers.patch_embed.PatchEmbed'>, norm_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, act_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, block_fn: ~typing.Type[~torch.nn.modules.module.Module] = <class 'timm.models.vision_transformer.Block'>, mlp_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'timm.layers.mlp.Mlp'>)

Vision Transformer

A PyTorch implementation of An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

https://arxiv.org/abs/2010.11929
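As a rough guide to how img_size, patch_size, class_token and reg_tokens interact, the sketch below counts the tokens the transformer blocks would see. It assumes non-overlapping patches and an image size that is an exact multiple of the patch size; vit_token_count is a hypothetical helper for illustration, not a timm or EIR function.

```python
from typing import Tuple, Union

IntOrPair = Union[int, Tuple[int, int]]


def _to_pair(value: IntOrPair) -> Tuple[int, int]:
    """Expand a single int into an (int, int) pair."""
    return (value, value) if isinstance(value, int) else value


def vit_token_count(
    img_size: IntOrPair = 224,
    patch_size: IntOrPair = 16,
    class_token: bool = True,
    reg_tokens: int = 0,
) -> int:
    """Count patch tokens plus optional class and register tokens.

    Hypothetical helper. Assumes non-overlapping patches and rejects
    image sizes that are not a multiple of the patch size.
    """
    height, width = _to_pair(img_size)
    patch_h, patch_w = _to_pair(patch_size)
    if height % patch_h or width % patch_w:
        raise ValueError("img_size must be divisible by patch_size")
    num_patches = (height // patch_h) * (width // patch_w)
    return num_patches + int(class_token) + reg_tokens


print(vit_token_count())             # 197 (14 * 14 patches + 1 class token)
print(vit_token_count(img_size=64))  # 17 (4 * 4 patches + 1 class token)
```

Since self-attention cost grows quadratically with the token count, this number is a quick way to compare candidate img_size and patch_size settings before training.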

get_intermediate_layers(x: Tensor, n: int | Sequence = 1, reshape: bool = False, return_prefix_tokens: bool = False, norm: bool = False) → Tuple[Tensor | Tuple[Tensor]]: Intermediate layer accessor (NOTE: This is a WIP experiment). Inspired by DINO / DINOv2 interface

class timm.models.vision_transformer_relpos.VisionTransformerRelPos(img_size: int | ~typing.Tuple[int, int] = 224, patch_size: int | ~typing.Tuple[int, int] = 16, in_chans: int = 3, num_classes: int = 1000, global_pool: ~typing.Literal['', 'avg', 'token', 'map'] = 'avg', embed_dim: int = 768, depth: int = 12, num_heads: int = 12, mlp_ratio: float = 4.0, qkv_bias: bool = True, qk_norm: bool = False, init_values: float | None = 1e-06, class_token: bool = False, fc_norm: bool = False, rel_pos_type: str = 'mlp', rel_pos_dim: int | None = None, shared_rel_pos: bool = False, drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, weight_init: ~typing.Literal['skip', 'jax', 'moco', ''] = 'skip', fix_init: bool = False, embed_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'timm.layers.patch_embed.PatchEmbed'>, norm_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, act_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, block_fn: ~typing.Type[~torch.nn.modules.module.Module] = <class 'timm.models.vision_transformer_relpos.RelPosBlock'>)

Vision Transformer w/ Relative Position Bias

Differing from the classic ViT, this implementation:

uses relative position index (swin v1 / beit) or relative log coord + mlp (swin v2) pos embed
defaults to no class token (can be enabled)
defaults to global avg pool for head (can be changed)
layer-scale (residual branch gain) enabled

class timm.models.vision_transformer_sam.VisionTransformerSAM(img_size: int = 1024, patch_size: int = 16, in_chans: int = 3, num_classes: int = 768, embed_dim: int = 768, depth: int = 12, num_heads: int = 12, mlp_ratio: float = 4.0, qkv_bias: bool = True, qk_norm: bool = False, init_values: float | None = None, pre_norm: bool = False, drop_rate: float = 0.0, pos_drop_rate: float = 0.0, patch_drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, weight_init: str = '', embed_layer: ~typing.Callable = functools.partial(<class 'timm.layers.patch_embed.PatchEmbed'>, output_fmt=<Format.NHWC: 'NHWC'>, strict_img_size=False), norm_layer: ~typing.Callable | None = <class 'torch.nn.modules.normalization.LayerNorm'>, act_layer: ~typing.Callable | None = <class 'torch.nn.modules.activation.GELU'>, block_fn: ~typing.Callable = <class 'timm.models.vision_transformer_sam.Block'>, mlp_layer: ~typing.Callable = <class 'timm.layers.mlp.Mlp'>, use_abs_pos: bool = True, use_rel_pos: bool = False, use_rope: bool = False, window_size: int = 14, global_attn_indexes: ~typing.Tuple[int, ...] = (), neck_chans: int = 256, global_pool: str = 'avg', head_hidden_size: int | None = None, ref_feat_shape: ~typing.Tuple[~typing.Tuple[int, int], ~typing.Tuple[int, int]] | None = None)

Vision Transformer for Segment-Anything Model (SAM)

A PyTorch implementation of Exploring Plain Vision Transformer Backbones for Object Detection or Segment Anything Model (SAM)

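https://arxiv.org/abs/2203.16527 (Exploring Plain Vision Transformer Backbones for Object Detection) and https://arxiv.org/abs/2304.02643 (Segment Anything)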

class timm.models.volo.VOLO(layers, img_size=224, in_chans=3, num_classes=1000, global_pool='token', patch_size=8, stem_hidden_dim=64, embed_dims=None, num_heads=None, downsamples=(True, False, False, False), outlook_attention=(True, False, False, False), mlp_ratio=3.0, qkv_bias=False, drop_rate=0.0, pos_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=<class 'torch.nn.modules.normalization.LayerNorm'>, post_layers=('ca', 'ca'), use_aux_head=True, use_mix_token=False, pooling_scale=2)

Vision Outlooker, the main class of our model

forward_train(x): A separate forward fn for training with mix_token (if a train script supports). Combining multiple modes in as single forward with different return types is torchscript hell.

class timm.models.vovnet.VovNet(cfg, in_chans=3, num_classes=1000, global_pool='avg', output_stride=32, norm_layer=<class 'timm.layers.norm_act.BatchNormAct2d'>, act_layer=<class 'torch.nn.modules.activation.ReLU'>, drop_rate=0.0, drop_path_rate=0.0, **kwargs)

class timm.models.xception.Xception(num_classes=1000, in_chans=3, drop_rate=0.0, global_pool='avg'): Xception optimized for the ImageNet dataset, as specified in https://arxiv.org/pdf/1610.02357.pdf

class timm.models.xception_aligned.XceptionAligned(block_cfg: ~typing.List[~typing.Dict], num_classes: int = 1000, in_chans: int = 3, output_stride: int = 32, preact: bool = False, act_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>, norm_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.batchnorm.BatchNorm2d'>, drop_rate: float = 0.0, drop_path_rate: float = 0.0, global_pool: str = 'avg'): Modified Aligned Xception

class timm.models.xcit.Xcit(img_size=224, patch_size=16, in_chans=3, num_classes=1000, global_pool='token', embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, act_layer=None, norm_layer=None, cls_attn_layers=2, use_pos_embed=True, eta=1.0, tokens_norm=False): Based on the timm and DeiT code bases: https://github.com/rwightman/pytorch-image-models/tree/master/timm and https://github.com/facebookresearch/deit/