Image Models

This page lists the external image models that can be used with EIR, all of which come from the excellent timm library.

There are three ways to use these models:

  • Configure and train specific architectures (e.g. ResNet with a chosen number of layers) from scratch.

  • Train a specific architecture (e.g. resnet18) from scratch.

  • Use a pre-trained model (e.g. resnet18) and fine-tune it.

Please refer to this page for more detailed information about configurable architectures, and this page for a list of pre-defined architectures, with the option of using pre-trained weights.
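
For the pre-trained route, a minimal sketch of the model part of an image input configuration might look like the following; note that the exact field for enabling pre-trained weights (assumed here to be pretrained_model) should be verified against the pages linked above:

model_config:
  model_type: "resnet18"
  pretrained_model: true  # assumed field name for loading pre-trained weights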

Configurable Models

The following models can be configured and trained from scratch.

The model type is specified in the model_type field of the configuration, while the model-specific configuration is specified in the model_init_config field.

For example, the ResNet architecture accepts the layers and block parameters, and can be configured as follows:

input_configurable_image_model.yaml
input_info:
  input_source: eir_tutorials/a_using_eir/05_image_tutorial/data/hot_dog_not_hot_dog/food_images
  input_name: hot_dog
  input_type: image

input_type_info:
  mixing_subtype: "cutmix"
  size:
    - 64

model_config:
  model_type: "ResNet"
  model_init_config:
    layers: [1, 1, 1, 1]
    block: "BasicBlock"

interpretation_config:
  num_samples_to_interpret: 30
class timm.models.beit.Beit(img_size: int | ~typing.Tuple[int, int] = 224, patch_size: int | ~typing.Tuple[int, int] = 16, in_chans: int = 3, num_classes: int = 1000, global_pool: str = 'avg', embed_dim: int = 768, depth: int = 12, num_heads: int = 12, qkv_bias: bool = True, mlp_ratio: float = 4.0, swiglu_mlp: bool = False, scale_mlp: bool = False, drop_rate: float = 0.0, pos_drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, norm_layer: ~typing.Callable = <class 'timm.layers.norm.LayerNorm'>, init_values: float | None = None, use_abs_pos_emb: bool = True, use_rel_pos_bias: bool = False, use_shared_rel_pos_bias: bool = False, head_init_scale: float = 0.001)

Vision Transformer with support for patch or hybrid CNN input stage
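
As a sketch of how a configurable transformer such as Beit could be set up in EIR, the constructor arguments above map directly onto model_init_config. The model_type string and the concrete values below are illustrative assumptions, not documented defaults:

model_config:
  model_type: "Beit"
  model_init_config:
    img_size: 64       # assumed to match the size set in input_type_info
    patch_size: 8
    embed_dim: 384
    depth: 12
    num_heads: 6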

class timm.models.byobnet.ByobNet(cfg: ByoModelCfg, num_classes: int = 1000, in_chans: int = 3, global_pool: str = 'avg', output_stride: int = 32, img_size: int | Tuple[int, int] | None = None, drop_rate: float = 0.0, drop_path_rate: float = 0.0, zero_init_last: bool = True, **kwargs)

‘Bring-your-own-blocks’ Net

A flexible network backbone that allows building model stem + blocks via dataclass cfg definition w/ factory functions for module instantiation.

Current assumption is that both stem and blocks are in conv-bn-act order (w/ block ending in act).

class timm.models.cait.Cait(img_size=224, patch_size=16, in_chans=3, num_classes=1000, global_pool='token', embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, block_layers=<class 'timm.models.cait.LayerScaleBlock'>, block_layers_token=<class 'timm.models.cait.LayerScaleBlockClassAttn'>, patch_layer=<class 'timm.layers.patch_embed.PatchEmbed'>, norm_layer=functools.partial(<class 'torch.nn.modules.normalization.LayerNorm'>, eps=1e-06), act_layer=<class 'torch.nn.modules.activation.GELU'>, attn_block=<class 'timm.models.cait.TalkingHeadAttn'>, mlp_block=<class 'timm.layers.mlp.Mlp'>, init_values=0.0001, attn_block_token_only=<class 'timm.models.cait.ClassAttn'>, mlp_block_token_only=<class 'timm.layers.mlp.Mlp'>, depth_token_only=2, mlp_ratio_token_only=4.0)
class timm.models.coat.CoaT(img_size=224, patch_size=16, in_chans=3, num_classes=1000, embed_dims=(64, 128, 320, 512), serial_depths=(3, 4, 6, 3), parallel_depth=0, num_heads=8, mlp_ratios=(4, 4, 4, 4), qkv_bias=True, drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=<class 'timm.layers.norm.LayerNorm'>, return_interm_layers=False, out_features=None, crpe_window=None, global_pool='token')

CoaT class.

class timm.models.convit.ConVit(img_size=224, patch_size=16, in_chans=3, num_classes=1000, global_pool='token', embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=False, drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, hybrid_backbone=None, norm_layer=<class 'timm.layers.norm.LayerNorm'>, local_up_to_layer=3, locality_strength=1.0, use_pos_embed=True)

Vision Transformer with support for patch or hybrid CNN input stage

class timm.models.convmixer.ConvMixer(dim, depth, kernel_size=9, patch_size=7, in_chans=3, num_classes=1000, global_pool='avg', drop_rate=0.0, act_layer=<class 'torch.nn.modules.activation.GELU'>, **kwargs)
class timm.models.convnext.ConvNeXt(in_chans: int = 3, num_classes: int = 1000, global_pool: str = 'avg', output_stride: int = 32, depths: Tuple[int, ...] = (3, 3, 9, 3), dims: Tuple[int, ...] = (96, 192, 384, 768), kernel_sizes: int | Tuple[int, ...] = 7, ls_init_value: float | None = 1e-06, stem_type: str = 'patch', patch_size: int = 4, head_init_scale: float = 1.0, head_norm_first: bool = False, head_hidden_size: int | None = None, conv_mlp: bool = False, conv_bias: bool = True, use_grn: bool = False, act_layer: str | Callable = 'gelu', norm_layer: str | Callable | None = None, norm_eps: float | None = None, drop_rate: float = 0.0, drop_path_rate: float = 0.0)

A PyTorch impl of: A ConvNet for the 2020s - https://arxiv.org/pdf/2201.03545.pdf
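
Following the same pattern as the ResNet example above, a smaller ConvNeXt could plausibly be configured by passing the depths and dims arguments from the signature; the model_type string and the values below are illustrative assumptions:

model_config:
  model_type: "ConvNeXt"
  model_init_config:
    depths: [2, 2, 6, 2]
    dims: [48, 96, 192, 384]
    drop_path_rate: 0.1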

class timm.models.crossvit.CrossVit(img_size=224, img_scale=(1.0, 1.0), patch_size=(8, 16), in_chans=3, num_classes=1000, embed_dim=(192, 384), depth=((1, 3, 1), (1, 3, 1), (1, 3, 1)), num_heads=(6, 12), mlp_ratio=(2.0, 2.0, 4.0), multi_conv=False, crop_scale=False, qkv_bias=True, drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=functools.partial(<class 'torch.nn.modules.normalization.LayerNorm'>, eps=1e-06), global_pool='token')

Vision Transformer with support for patch or hybrid CNN input stage

class timm.models.cspnet.CspNet(cfg: CspModelCfg, in_chans=3, num_classes=1000, output_stride=32, global_pool='avg', drop_rate=0.0, drop_path_rate=0.0, zero_init_last=True, **kwargs)

Cross Stage Partial base model.

Paper: CSPNet: A New Backbone that can Enhance Learning Capability of CNN - https://arxiv.org/abs/1911.11929 Ref Impl: https://github.com/WongKinYiu/CrossStagePartialNetworks

NOTE: There are differences in the way I handle the 1x1 ‘expansion’ conv in this impl vs the darknet impl. I did it this way for simplicity and fewer special cases.

class timm.models.davit.DaVit(in_chans=3, depths=(1, 1, 3, 1), embed_dims=(96, 192, 384, 768), num_heads=(3, 6, 12, 24), window_size=7, mlp_ratio=4, qkv_bias=True, norm_layer='layernorm2d', norm_layer_cl='layernorm', norm_eps=1e-05, attn_types=('spatial', 'channel'), ffn=True, cpe_act=False, drop_rate=0.0, drop_path_rate=0.0, num_classes=1000, global_pool='avg', head_norm_first=False)
DaViT

A PyTorch implementation of DaViT: Dual Attention Vision Transformers - https://arxiv.org/abs/2204.03645. Supports arbitrary input sizes and pyramid feature extraction.

Parameters:
  • in_chans (int) – Number of input image channels. Default: 3

  • num_classes (int) – Number of classes for classification head. Default: 1000

  • depths (tuple(int)) – Number of blocks in each stage. Default: (1, 1, 3, 1)

  • embed_dims (tuple(int)) – Patch embedding dimension. Default: (96, 192, 384, 768)

  • num_heads (tuple(int)) – Number of attention heads in different layers. Default: (3, 6, 12, 24)

  • window_size (int) – Window size. Default: 7

  • mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim. Default: 4

  • qkv_bias (bool) – If True, add a learnable bias to query, key, value. Default: True

  • drop_path_rate (float) – Stochastic depth rate. Default: 0.1

  • norm_layer (nn.Module) – Normalization layer. Default: nn.LayerNorm.
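
A hedged sketch of how the DaViT parameters above might be passed through model_init_config; the values simply mirror the documented defaults and the model_type string is assumed to be the class name:

model_config:
  model_type: "DaVit"
  model_init_config:
    depths: [1, 1, 3, 1]
    embed_dims: [96, 192, 384, 768]
    num_heads: [3, 6, 12, 24]
    window_size: 7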

class timm.models.deit.VisionTransformerDistilled(*args, **kwargs)

Vision Transformer w/ Distillation Token and Head

Distillation token & head support for DeiT: Data-efficient Image Transformers
class timm.models.densenet.DenseNet(growth_rate=32, block_config=(6, 12, 24, 16), num_classes=1000, in_chans=3, global_pool='avg', bn_size=4, stem_type='', act_layer='relu', norm_layer='batchnorm2d', aa_layer=None, drop_rate=0.0, proj_drop_rate=0.0, memory_efficient=False, aa_stem_only=True)

Densenet-BC model class, based on “Densely Connected Convolutional Networks”

Parameters:
  • growth_rate (int) – how many filters to add each layer (k in paper)

  • block_config (list of 4 ints) – how many layers in each pooling block

  • bn_size (int) – multiplicative factor for the number of bottleneck layers (i.e. bn_size * k features in the bottleneck layer)

  • drop_rate (float) – dropout rate

  • proj_drop_rate (float) – projection dropout rate

  • num_classes (int) – number of classification classes

  • memory_efficient (bool) – if True, uses checkpointing; much more memory-efficient, but slower. Default: False
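
For example, a compact DenseNet could plausibly be configured as below; this is a sketch assuming the model_type string matches the class name, with illustrative growth_rate and block_config values:

model_config:
  model_type: "DenseNet"
  model_init_config:
    growth_rate: 16
    block_config: [4, 8, 12, 8]
    drop_rate: 0.1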

class timm.models.dla.DLA(levels, channels, output_stride=32, num_classes=1000, in_chans=3, global_pool='avg', cardinality=1, base_width=64, block=<class 'timm.models.dla.DlaBottle2neck'>, shortcut_root=False, drop_rate=0.0)
class timm.models.dpn.DPN(k_sec=(3, 4, 20, 3), inc_sec=(16, 32, 24, 128), k_r=96, groups=32, num_classes=1000, in_chans=3, output_stride=32, global_pool='avg', small=False, num_init_features=64, b=False, drop_rate=0.0, norm_layer='batchnorm2d', act_layer='relu', fc_act_layer='elu')
class timm.models.edgenext.EdgeNeXt(in_chans=3, num_classes=1000, global_pool='avg', dims=(24, 48, 88, 168), depths=(3, 3, 9, 3), global_block_counts=(0, 1, 1, 1), kernel_sizes=(3, 5, 7, 9), heads=(8, 8, 8, 8), d2_scales=(2, 2, 3, 4), use_pos_emb=(False, True, False, False), ls_init_value=1e-06, head_init_scale=1.0, expand_ratio=4, downsample_block=False, conv_bias=True, stem_type='patch', head_norm_first=False, act_layer=<class 'torch.nn.modules.activation.GELU'>, drop_path_rate=0.0, drop_rate=0.0)
class timm.models.efficientformer.EfficientFormer(depths, embed_dims=None, in_chans=3, num_classes=1000, global_pool='avg', downsamples=None, num_vit=0, mlp_ratios=4, pool_size=3, layer_scale_init_value=1e-05, act_layer=<class 'torch.nn.modules.activation.GELU'>, norm_layer=<class 'torch.nn.modules.batchnorm.BatchNorm2d'>, norm_layer_cl=<class 'torch.nn.modules.normalization.LayerNorm'>, drop_rate=0.0, proj_drop_rate=0.0, drop_path_rate=0.0, **kwargs)
class timm.models.efficientnet.EfficientNet(block_args, num_classes=1000, num_features=1280, in_chans=3, stem_size=32, fix_stem=False, output_stride=32, pad_type='', round_chs_fn=<function round_channels>, act_layer=None, norm_layer=None, se_layer=None, drop_rate=0.0, drop_path_rate=0.0, global_pool='avg')
A flexible and performant PyTorch implementation of efficient network architectures, including:
  • EfficientNet-V2 Small, Medium, Large, XL & B0-B3

  • EfficientNet B0-B8, L2

  • EfficientNet-EdgeTPU

  • EfficientNet-CondConv

  • MixNet S, M, L, XL

  • MnasNet A1, B1, and small

  • MobileNet-V2

  • FBNet C

  • Single-Path NAS Pixel1

  • TinyNet
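
Because EfficientNet is driven by a block_args definition (a nested list of block dictionaries) rather than a handful of scalar arguments, it is usually easier to pick one of the named variants above via the pre-defined / pre-trained route described at the top of this page. A hedged sketch, re-using the assumed pretrained_model field from the earlier example:

model_config:
  model_type: "efficientnet_b0"
  pretrained_model: true  # assumed field name for loading pre-trained weights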

class timm.models.efficientvit_mit.EfficientVit(in_chans=3, widths=(), depths=(), head_dim=32, expand_ratio=4, norm_layer=<class 'torch.nn.modules.batchnorm.BatchNorm2d'>, act_layer=<class 'torch.nn.modules.activation.Hardswish'>, global_pool='avg', head_widths=(), drop_rate=0.0, num_classes=1000)
class timm.models.efficientvit_msra.EfficientVitMsra(img_size=224, in_chans=3, num_classes=1000, embed_dim=(64, 128, 192), key_dim=(16, 16, 16), depth=(1, 2, 3), num_heads=(4, 4, 4), window_size=(7, 7, 7), kernels=(5, 5, 5, 5), down_ops=(('', 1), ('subsample', 2), ('subsample', 2)), global_pool='avg', drop_rate=0.0)
class timm.models.eva.Eva(img_size: int | ~typing.Tuple[int, int] = 224, patch_size: int | ~typing.Tuple[int, int] = 16, in_chans: int = 3, num_classes: int = 1000, global_pool: str = 'avg', embed_dim: int = 768, depth: int = 12, num_heads: int = 12, qkv_bias: bool = True, qkv_fused: bool = True, mlp_ratio: float = 4.0, swiglu_mlp: bool = False, scale_mlp: bool = False, scale_attn_inner: bool = False, drop_rate: float = 0.0, pos_drop_rate: float = 0.0, patch_drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, norm_layer: ~typing.Callable = <class 'timm.layers.norm.LayerNorm'>, init_values: float | None = None, class_token: bool = True, use_abs_pos_emb: bool = True, use_rot_pos_emb: bool = False, use_post_norm: bool = False, dynamic_img_size: bool = False, dynamic_img_pad: bool = False, ref_feat_shape: int | ~typing.Tuple[int, int] | None = None, head_init_scale: float = 0.001)

Eva Vision Transformer w/ Abs & Rotary Pos Embed

This class implements the EVA and EVA02 models that were based on the BEiT ViT variant
  • EVA - abs pos embed, global avg pool

  • EVA02 - abs + rope pos embed, global avg pool, SwiGLU, scale Norm in MLP (ala normformer)

class timm.models.focalnet.FocalNet(in_chans: int = 3, num_classes: int = 1000, global_pool: str = 'avg', embed_dim: int = 96, depths: ~typing.Tuple[int, ...] = (2, 2, 6, 2), mlp_ratio: float = 4.0, focal_levels: ~typing.Tuple[int, ...] = (2, 2, 2, 2), focal_windows: ~typing.Tuple[int, ...] = (3, 3, 3, 3), use_overlap_down: bool = False, use_post_norm: bool = False, use_post_norm_in_modulation: bool = False, normalize_modulator: bool = False, head_hidden_size: int | None = None, head_init_scale: float = 1.0, layerscale_value: float | None = None, drop_rate: bool = 0.0, proj_drop_rate: bool = 0.0, drop_path_rate: bool = 0.1, norm_layer: ~typing.Callable = functools.partial(<class 'timm.layers.norm.LayerNorm2d'>, eps=1e-05))

Focal Modulation Networks (FocalNets)

class timm.models.gcvit.GlobalContextVit(in_chans: int = 3, num_classes: int = 1000, global_pool: str = 'avg', img_size: Tuple[int, int] = 224, window_ratio: Tuple[int, ...] = (32, 32, 16, 32), window_size: Tuple[int, ...] = None, embed_dim: int = 64, depths: Tuple[int, ...] = (3, 4, 19, 5), num_heads: Tuple[int, ...] = (2, 4, 8, 16), mlp_ratio: float = 3.0, qkv_bias: bool = True, layer_scale: float | None = None, drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, weight_init='', act_layer: str = 'gelu', norm_layer: str = 'layernorm2d', norm_layer_cl: str = 'layernorm', norm_eps: float = 1e-05)
class timm.models.ghostnet.GhostNet(cfgs, num_classes=1000, width=1.0, in_chans=3, output_stride=32, global_pool='avg', drop_rate=0.2, version='v1')
class timm.models.hgnet.HighPerfGpuNet(cfg, in_chans=3, num_classes=1000, global_pool='avg', use_last_conv=True, class_expand=2048, drop_rate=0.0, drop_path_rate=0.0, use_lab=False, **kwargs)
class timm.models.hrnet.HighResolutionNet(cfg, in_chans=3, num_classes=1000, output_stride=32, global_pool='avg', drop_rate=0.0, head='classification', **kwargs)
class timm.models.inception_resnet_v2.InceptionResnetV2(num_classes=1000, in_chans=3, drop_rate=0.0, output_stride=32, global_pool='avg', norm_layer='batchnorm2d', norm_eps=0.001, act_layer='relu')
class timm.models.inception_v3.InceptionV3(num_classes=1000, in_chans=3, drop_rate=0.0, global_pool='avg', aux_logits=False, norm_layer='batchnorm2d', norm_eps=0.001, act_layer='relu')

Inception-V3

class timm.models.inception_v4.InceptionV4(num_classes=1000, in_chans=3, output_stride=32, drop_rate=0.0, global_pool='avg', norm_layer='batchnorm2d', norm_eps=0.001, act_layer='relu')
class timm.models.levit.Levit(img_size=224, in_chans=3, num_classes=1000, embed_dim=(192,), key_dim=64, depth=(12,), num_heads=(3,), attn_ratio=2.0, mlp_ratio=2.0, stem_backbone=None, stem_stride=None, stem_type='s16', down_op='subsample', act_layer='hard_swish', attn_act_layer=None, use_conv=False, global_pool='avg', drop_rate=0.0, drop_path_rate=0.0)

Vision Transformer with support for patch or hybrid CNN input stage

NOTE: distillation defaults to True since the pretrained weights use it; this will cause problems with train scripts that don’t handle tuple outputs.

class timm.models.maxxvit.MaxxVitCfg(embed_dim: Tuple[int, ...] = (96, 192, 384, 768), depths: Tuple[int, ...] = (2, 3, 5, 2), block_type: Tuple[Union[str, Tuple[str, ...]], ...] = ('C', 'C', 'T', 'T'), stem_width: Union[int, Tuple[int, int]] = 64, stem_bias: bool = False, conv_cfg: timm.models.maxxvit.MaxxVitConvCfg = <factory>, transformer_cfg: timm.models.maxxvit.MaxxVitTransformerCfg = <factory>, head_hidden_size: int = None, weight_init: str = 'vit_eff')
class timm.models.metaformer.MetaFormer(in_chans=3, num_classes=1000, global_pool='avg', depths=(2, 2, 6, 2), dims=(64, 128, 320, 512), token_mixers=<class 'timm.models.metaformer.Pooling'>, mlp_act=<class 'timm.models.metaformer.StarReLU'>, mlp_bias=False, drop_path_rate=0.0, proj_drop_rate=0.0, drop_rate=0.0, layer_scale_init_values=None, res_scale_init_values=(None, None, 1.0, 1.0), downsample_norm=<class 'timm.models.metaformer.LayerNorm2dNoBias'>, norm_layers=<class 'timm.models.metaformer.LayerNorm2dNoBias'>, output_norm=<class 'timm.layers.norm.LayerNorm2d'>, use_mlp_head=True, **kwargs)
A PyTorch impl of MetaFormer Baselines for Vision - https://arxiv.org/abs/2210.13452

Parameters:
  • in_chans (int) – Number of input image channels.

  • num_classes (int) – Number of classes for classification head.

  • global_pool – Pooling for classifier head.

  • depths (list or tuple) – Number of blocks at each stage.

  • dims (list or tuple) – Feature dimension at each stage.

  • token_mixers (list, tuple or token_fcn) – Token mixer for each stage.

  • mlp_act – Activation layer for MLP.

  • mlp_bias (boolean) – Enable or disable mlp bias term.

  • drop_path_rate (float) – Stochastic depth rate.

  • drop_rate (float) – Dropout rate.

  • layer_scale_init_values (list, tuple, float or None) – Init value for Layer Scale. None means the layer scale is not used. From: https://arxiv.org/abs/2103.17239.

  • res_scale_init_values (list, tuple, float or None) – Init value for res Scale on residual connections. None means the res scale is not used. From: https://arxiv.org/abs/2110.09456.

  • downsample_norm (nn.Module) – Norm layer used in stem and downsampling layers.

  • norm_layers (list, tuple or norm_fcn) – Norm layers for each stage.

  • output_norm – Norm layer before classifier head.

  • use_mlp_head – Use MLP classification head.
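
As with the other configurable models, the MetaFormer arguments above would be passed via model_init_config; the model_type string and the values below (taken from the documented defaults) are an illustrative sketch:

model_config:
  model_type: "MetaFormer"
  model_init_config:
    depths: [2, 2, 6, 2]
    dims: [64, 128, 320, 512]
    drop_path_rate: 0.1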

class timm.models.mobilenetv3.MobileNetV3(block_args: ~typing.List[~typing.List[~typing.Dict[str, ~typing.Any]]], num_classes: int = 1000, in_chans: int = 3, stem_size: int = 16, fix_stem: bool = False, num_features: int = 1280, head_bias: bool = True, pad_type: str | int | ~typing.Tuple[int, int] = '', act_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, norm_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, se_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, se_from_exp: bool = True, round_chs_fn: ~typing.Callable = <function round_channels>, drop_rate: float = 0.0, drop_path_rate: float = 0.0, global_pool: str = 'avg')

MobileNet-V3

Based on my EfficientNet implementation and building blocks, this model utilizes the MobileNet-v3 specific ‘efficient head’, where global pooling is done before the head convolution without a final batch-norm layer before the classifier.

Paper: Searching for MobileNetV3 - https://arxiv.org/abs/1905.02244

Other architectures utilizing the MobileNet-V3 efficient head are also supported by this impl.
class timm.models.mvitv2.MultiScaleVit(cfg: MultiScaleVitCfg, img_size: Tuple[int, int] = (224, 224), in_chans: int = 3, global_pool: str | None = None, num_classes: int = 1000, drop_path_rate: float = 0.0, drop_rate: float = 0.0)

Improved Multiscale Vision Transformers for Classification and Detection - Yanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer* - https://arxiv.org/abs/2112.01526

Multiscale Vision Transformers - Haoqi Fan*, Bo Xiong*, Karttikeya Mangalam*, Yanghao Li*, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer* - https://arxiv.org/abs/2104.11227

class timm.models.nasnet.NASNetALarge(num_classes=1000, in_chans=3, stem_size=96, channel_multiplier=2, num_features=4032, output_stride=32, drop_rate=0.0, global_pool='avg', pad_type='same')

NASNetALarge (6 @ 4032)

class timm.models.nest.Nest(img_size=224, in_chans=3, patch_size=4, num_levels=3, embed_dims=(128, 256, 512), num_heads=(4, 8, 16), depths=(2, 2, 20), num_classes=1000, mlp_ratio=4.0, qkv_bias=True, drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.5, norm_layer=None, act_layer=None, pad_type='', weight_init='', global_pool='avg')

Nested Transformer (NesT)

A PyTorch impl of Aggregating Nested Transformers
class timm.models.nfnet.NormFreeNet(cfg: NfCfg, num_classes: int = 1000, in_chans: int = 3, global_pool: str = 'avg', output_stride: int = 32, drop_rate: float = 0.0, drop_path_rate: float = 0.0, **kwargs)

Normalization-Free Network

As described in: Characterizing signal propagation to close the performance gap in unnormalized ResNets, and High-Performance Large-Scale Image Recognition Without Normalization - https://arxiv.org/abs/2102.06171

This model aims to cover both the NFRegNet-Bx models as detailed in the paper’s code snippets and the (preact) ResNet models described earlier in the paper.

There are a few differences:
  • channels are rounded to be divisible by 8 by default (keep tensor core kernels happy), this changes channel dim and param counts slightly from the paper models

  • activation correcting gamma constants are moved into the ScaledStdConv as it has less performance impact in PyTorch when done with the weight scaling there. This likely wasn’t a concern in the JAX impl.

  • a config option gamma_in_act can be enabled to not apply gamma in StdConv as described above, but apply it in each activation. This is slightly slower, numerically different, but matches official impl.

  • skipinit is disabled by default, it seems to have a rather drastic impact on GPU memory use and throughput for what it is/does. Approx 8-10% throughput loss.

class timm.models.pit.PoolingVisionTransformer(img_size: int = 224, patch_size: int = 16, stride: int = 8, stem_type: str = 'overlap', base_dims: Sequence[int] = (48, 48, 48), depth: Sequence[int] = (2, 6, 4), heads: Sequence[int] = (2, 4, 8), mlp_ratio: float = 4, num_classes=1000, in_chans=3, global_pool='token', distilled=False, drop_rate=0.0, pos_drop_drate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0)

Pooling-based Vision Transformer

A PyTorch implementation of ‘Rethinking Spatial Dimensions of Vision Transformers’
class timm.models.pnasnet.PNASNet5Large(num_classes=1000, in_chans=3, output_stride=32, drop_rate=0.0, global_pool='avg', pad_type='')
class timm.models.pvt_v2.PyramidVisionTransformerV2(in_chans=3, num_classes=1000, global_pool='avg', depths=(3, 4, 6, 3), embed_dims=(64, 128, 256, 512), num_heads=(1, 2, 4, 8), sr_ratios=(8, 4, 2, 1), mlp_ratios=(8.0, 8.0, 4.0, 4.0), qkv_bias=True, linear=False, drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=<class 'timm.layers.norm.LayerNorm'>)
class timm.models.regnet.RegNet(cfg: RegNetCfg, in_chans=3, num_classes=1000, output_stride=32, global_pool='avg', drop_rate=0.0, drop_path_rate=0.0, zero_init_last=True, **kwargs)

RegNet-X, Y, and Z Models

Paper: https://arxiv.org/abs/2003.13678 Original Impl: https://github.com/facebookresearch/pycls/blob/master/pycls/models/regnet.py

class timm.models.repghost.RepGhostNet(cfgs, num_classes=1000, width=1.0, in_chans=3, output_stride=32, global_pool='avg', drop_rate=0.2, reparam=True)
class timm.models.repvit.RepVit(in_chans=3, img_size=224, embed_dim=(48, ), depth=(2, ), mlp_ratio=2, global_pool='avg', kernel_size=3, num_classes=1000, act_layer=<class 'torch.nn.modules.activation.GELU'>, distillation=True, drop_rate=0.0, legacy=False)
class timm.models.resnet.ResNet(block: ~timm.models.resnet.BasicBlock | ~timm.models.resnet.Bottleneck, layers: ~typing.List[int], num_classes: int = 1000, in_chans: int = 3, output_stride: int = 32, global_pool: str = 'avg', cardinality: int = 1, base_width: int = 64, stem_width: int = 64, stem_type: str = '', replace_stem_pool: bool = False, block_reduce_first: int = 1, down_kernel_size: int = 1, avg_down: bool = False, act_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>, norm_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.batchnorm.BatchNorm2d'>, aa_layer: ~typing.Type[~torch.nn.modules.module.Module] | None = None, drop_rate: float = 0.0, drop_path_rate: float = 0.0, drop_block_rate: float = 0.0, zero_init_last: bool = True, block_args: ~typing.Dict[str, ~typing.Any] | None = None)

ResNet / ResNeXt / SE-ResNeXt / SE-Net

This class implements all variants of ResNet, ResNeXt, SE-ResNeXt, and SENet that
  • have > 1 stride in the 3x3 conv layer of bottleneck

  • have conv-bn-act ordering

This ResNet impl supports a number of stem and downsample options based on the v1c, v1d, v1e, and v1s variants included in the MXNet Gluon ResNetV1b model. The C and D variants are also discussed in the ‘Bag of Tricks’ paper: https://arxiv.org/pdf/1812.01187. The B variant is equivalent to torchvision default.

ResNet variants (the same modifications can be used in SE/ResNeXt models as well):
  • normal, b - 7x7 stem, stem_width = 64, same as torchvision ResNet, NVIDIA ResNet ‘v1.5’, Gluon v1b

  • c - 3 layer deep 3x3 stem, stem_width = 32 (32, 32, 64)

  • d - 3 layer deep 3x3 stem, stem_width = 32 (32, 32, 64), average pool in downsample

  • e - 3 layer deep 3x3 stem, stem_width = 64 (64, 64, 128), average pool in downsample

  • s - 3 layer deep 3x3 stem, stem_width = 64 (64, 64, 128)

  • t - 3 layer deep 3x3 stem, stem width = 32 (24, 48, 64), average pool in downsample

  • tn - 3 layer deep 3x3 stem, stem width = 32 (24, 32, 64), average pool in downsample

ResNeXt
  • normal - 7x7 stem, stem_width = 64, standard cardinality and base widths

  • same c, d, e, s variants as ResNet can be enabled

SE-ResNeXt
  • normal - 7x7 stem, stem_width = 64

  • same c, d, e, s variants as ResNet can be enabled

SENet-154 - 3 layer deep 3x3 stem (same as v1c-v1s), stem_width = 64, cardinality=64, reduction by 2 on width of first bottleneck convolution, 3x3 downsample convs after first block
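
Since this class also covers ResNeXt-style models, a ResNeXt-50-like network can be described purely through model_init_config by combining the block, layers, cardinality, and base_width arguments from the signature above. A sketch whose values mirror the common 32x4d setting:

model_config:
  model_type: "ResNet"
  model_init_config:
    block: "Bottleneck"
    layers: [3, 4, 6, 3]
    cardinality: 32
    base_width: 4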

class timm.models.resnetv2.ResNetV2(layers, channels=(256, 512, 1024, 2048), num_classes=1000, in_chans=3, global_pool='avg', output_stride=32, width_factor=1, stem_chs=64, stem_type='', avg_down=False, preact=True, act_layer=<class 'torch.nn.modules.activation.ReLU'>, norm_layer=functools.partial(<class 'timm.layers.norm_act.GroupNormAct'>, num_groups=32), conv_layer=<class 'timm.layers.std_conv.StdConv2d'>, drop_rate=0.0, drop_path_rate=0.0, zero_init_last=False)

Implementation of Pre-activation (v2) ResNet models.

class timm.models.rexnet.RexNet(in_chans=3, num_classes=1000, global_pool='avg', output_stride=32, initial_chs=16, final_chs=180, width_mult=1.0, depth_mult=1.0, se_ratio=0.08333333333333333, ch_div=1, act_layer='swish', dw_act_layer='relu6', drop_rate=0.2, drop_path_rate=0.0)
class timm.models.selecsls.SelecSls(cfg, num_classes=1000, in_chans=3, drop_rate=0.0, global_pool='avg')

SelecSls42 / SelecSls60 / SelecSls84

Parameters:
  • cfg (network config dictionary specifying block type, feature, and head args)

  • num_classes (int, default 1000) – Number of classification classes.

  • in_chans (int, default 3) – Number of input (color) channels.

  • drop_rate (float, default 0.) – Dropout probability before classifier, for training

  • global_pool (str, default 'avg') – Global pooling type. One of ‘avg’, ‘max’, ‘avgmax’, ‘catavgmax’

class timm.models.senet.SENet(block, layers, groups, reduction, drop_rate=0.2, in_chans=3, inplanes=64, input_3x3=False, downsample_kernel_size=1, downsample_padding=0, num_classes=1000, global_pool='avg')
class timm.models.sequencer.Sequencer2d(num_classes=1000, img_size=224, in_chans=3, global_pool='avg', layers=(4, 3, 8, 3), patch_sizes=(7, 2, 2, 1), embed_dims=(192, 384, 384, 384), hidden_sizes=(48, 96, 96, 96), mlp_ratios=(3.0, 3.0, 3.0, 3.0), block_layer=<class 'timm.models.sequencer.Sequencer2dBlock'>, rnn_layer=<class 'timm.models.sequencer.LSTM2d'>, mlp_layer=<class 'timm.layers.mlp.Mlp'>, norm_layer=functools.partial(<class 'torch.nn.modules.normalization.LayerNorm'>, eps=1e-06), act_layer=<class 'torch.nn.modules.activation.GELU'>, num_rnn_layers=1, bidirectional=True, union='cat', with_fc=True, drop_rate=0.0, drop_path_rate=0.0, nlhb=False, stem_norm=False)
class timm.models.swin_transformer.SwinTransformer(img_size: int | ~typing.Tuple[int, int] = 224, patch_size: int = 4, in_chans: int = 3, num_classes: int = 1000, global_pool: str = 'avg', embed_dim: int = 96, depths: ~typing.Tuple[int, ...] = (2, 2, 6, 2), num_heads: ~typing.Tuple[int, ...] = (3, 6, 12, 24), head_dim: int | None = None, window_size: int | ~typing.Tuple[int, int] = 7, mlp_ratio: float = 4.0, qkv_bias: bool = True, drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.1, embed_layer: ~typing.Callable = <class 'timm.layers.patch_embed.PatchEmbed'>, norm_layer: str | ~typing.Callable = <class 'torch.nn.modules.normalization.LayerNorm'>, weight_init: str = '', **kwargs)

Swin Transformer

A PyTorch impl of Swin Transformer: Hierarchical Vision Transformer using Shifted Windows - https://arxiv.org/pdf/2103.14030
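
A hedged sketch of a small Swin Transformer configuration using the arguments listed above; the model_type string and concrete values are assumptions, and img_size and window_size must be compatible with the input image size:

model_config:
  model_type: "SwinTransformer"
  model_init_config:
    img_size: 64       # assumed to match the size set in input_type_info
    patch_size: 4
    window_size: 4
    embed_dim: 96
    depths: [2, 2, 6, 2]
    num_heads: [3, 6, 12, 24]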

class timm.models.swin_transformer_v2.SwinTransformerV2(img_size: int | ~typing.Tuple[int, int] = 224, patch_size: int = 4, in_chans: int = 3, num_classes: int = 1000, global_pool: str = 'avg', embed_dim: int = 96, depths: ~typing.Tuple[int, ...] = (2, 2, 6, 2), num_heads: ~typing.Tuple[int, ...] = (3, 6, 12, 24), window_size: int | ~typing.Tuple[int, int] = 7, mlp_ratio: float = 4.0, qkv_bias: bool = True, drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.1, norm_layer: ~typing.Callable = <class 'torch.nn.modules.normalization.LayerNorm'>, pretrained_window_sizes: ~typing.Tuple[int, ...] = (0, 0, 0, 0), **kwargs)

Swin Transformer V2

A PyTorch impl of Swin Transformer V2: Scaling Up Capacity and Resolution
class timm.models.swin_transformer_v2_cr.SwinTransformerV2Cr(img_size: ~typing.Tuple[int, int] = (224, 224), patch_size: int = 4, window_size: int | None = None, img_window_ratio: int = 32, in_chans: int = 3, num_classes: int = 1000, embed_dim: int = 96, depths: ~typing.Tuple[int, ...] = (2, 2, 6, 2), num_heads: ~typing.Tuple[int, ...] = (3, 6, 12, 24), mlp_ratio: float = 4.0, init_values: float | None = 0.0, drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, norm_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.normalization.LayerNorm'>, extra_norm_period: int = 0, extra_norm_stage: bool = False, sequential_attn: bool = False, global_pool: str = 'avg', weight_init='skip', **kwargs: ~typing.Any)
Swin Transformer V2

A PyTorch impl of Swin Transformer V2: Scaling Up Capacity and Resolution - https://arxiv.org/pdf/2111.09883

Parameters:
  • img_size – Input resolution.

  • window_size – Window size. If None, img_size // window_div

  • img_window_ratio – Window size to image size ratio.

  • patch_size – Patch size.

  • in_chans – Number of input channels.

  • depths – Depth of the stage (number of layers).

  • num_heads – Number of attention heads to be utilized.

  • embed_dim – Patch embedding dimension.

  • num_classes – Number of output classes.

  • mlp_ratio – Ratio of the hidden dimension in the FFN to the input channels.

  • drop_rate – Dropout rate.

  • proj_drop_rate – Projection dropout rate.

  • attn_drop_rate – Dropout rate of attention map.

  • drop_path_rate – Stochastic depth rate.

  • norm_layer – Type of normalization layer to be utilized.

  • extra_norm_period – Insert extra norm layer on main branch every N (period) blocks in stage

  • extra_norm_stage – End each stage with an extra norm layer in main branch

  • sequential_attn – If True, sequential self-attention is performed.

get_classifier() → Module

Returns the current classification head of the model (nn.Module).

reset_classifier(num_classes: int, global_pool: str | None = None) → None

Resets the classification head.

Parameters:
  • num_classes (int) – Number of classes to be predicted

  • global_pool (str) – Unused

update_input_size(new_img_size: Tuple[int, int] | None = None, new_window_size: int | None = None, img_window_ratio: int = 32) → None

Updates the image resolution to be processed and the window size, and with them the pair-wise relative positions.

Parameters:
  • new_window_size (Optional[int]) – New window size, if None based on new_img_size // window_div

  • new_img_size (Optional[Tuple[int, int]]) – New input resolution, if None current resolution is used

  • img_window_ratio (int) – divisor for calculating window size from image size

class timm.models.tiny_vit.TinyVit(in_chans=3, num_classes=1000, global_pool='avg', embed_dims=(96, 192, 384, 768), depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24), window_sizes=(7, 7, 14, 7), mlp_ratio=4.0, drop_rate=0.0, drop_path_rate=0.1, use_checkpoint=False, mbconv_expand_ratio=4.0, local_conv_size=3, act_layer=<class 'torch.nn.modules.activation.GELU'>)
class timm.models.tnt.TNT(img_size=224, patch_size=16, in_chans=3, num_classes=1000, global_pool='token', embed_dim=768, inner_dim=48, depth=12, num_heads_inner=4, num_heads_outer=12, mlp_ratio=4.0, qkv_bias=False, drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=<class 'torch.nn.modules.normalization.LayerNorm'>, first_stride=4)

Transformer in Transformer - https://arxiv.org/abs/2103.00112

class timm.models.tresnet.TResNet(layers, in_chans=3, num_classes=1000, width_factor=1.0, v2=False, global_pool='fast', drop_rate=0.0, drop_path_rate=0.0)
class timm.models.twins.Twins(img_size=224, patch_size=4, in_chans=3, num_classes=1000, global_pool='avg', embed_dims=(64, 128, 256, 512), num_heads=(1, 2, 4, 8), mlp_ratios=(4, 4, 4, 4), depths=(3, 4, 6, 3), sr_ratios=(8, 4, 2, 1), wss=None, drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=functools.partial(<class 'torch.nn.modules.normalization.LayerNorm'>, eps=1e-06), block_cls=<class 'timm.models.twins.Block'>)

Twins Vision Transformer (Revisiting Spatial Attention)

Adapted from PVT (PyramidVisionTransformer) class at https://github.com/whai362/PVT.git

class timm.models.vgg.VGG(cfg: ~typing.List[~typing.Any], num_classes: int = 1000, in_chans: int = 3, output_stride: int = 32, mlp_ratio: float = 1.0, act_layer: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.activation.ReLU'>, conv_layer: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.conv.Conv2d'>, norm_layer: ~torch.nn.modules.module.Module = None, global_pool: str = 'avg', drop_rate: float = 0.0)
class timm.models.visformer.Visformer(img_size=224, patch_size=16, in_chans=3, num_classes=1000, init_channels=32, embed_dim=384, depth=12, num_heads=6, mlp_ratio=4.0, drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=<class 'timm.layers.norm.LayerNorm2d'>, attn_stage='111', use_pos_embed=True, spatial_conv='111', vit_stem=False, group=8, global_pool='avg', conv_init=False, embed_norm=None)
class timm.models.vision_transformer.VisionTransformer(img_size: int | ~typing.Tuple[int, int] = 224, patch_size: int | ~typing.Tuple[int, int] = 16, in_chans: int = 3, num_classes: int = 1000, global_pool: ~typing.Literal['', 'avg', 'token', 'map'] = 'token', embed_dim: int = 768, depth: int = 12, num_heads: int = 12, mlp_ratio: float = 4.0, qkv_bias: bool = True, qk_norm: bool = False, init_values: float | None = None, class_token: bool = True, no_embed_class: bool = False, reg_tokens: int = 0, pre_norm: bool = False, fc_norm: bool | None = None, dynamic_img_size: bool = False, dynamic_img_pad: bool = False, drop_rate: float = 0.0, pos_drop_rate: float = 0.0, patch_drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, weight_init: ~typing.Literal['skip', 'jax', 'jax_nlhb', 'moco', ''] = '', fix_init: bool = False, embed_layer: ~typing.Callable = <class 'timm.layers.patch_embed.PatchEmbed'>, norm_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, act_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, block_fn: ~typing.Type[~torch.nn.modules.module.Module] = <class 'timm.models.vision_transformer.Block'>, mlp_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'timm.layers.mlp.Mlp'>)

Vision Transformer

A PyTorch impl of An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

get_intermediate_layers(x: Tensor, n: int | Sequence = 1, reshape: bool = False, return_prefix_tokens: bool = False, norm: bool = False) → Tuple[Tensor | Tuple[Tensor]]

Intermediate layer accessor (NOTE: This is a WIP experiment). Inspired by DINO / DINOv2 interface
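
A hedged sketch of configuring a small Vision Transformer from scratch with the arguments above; the model_type string and values are illustrative, and img_size should agree with the size given in input_type_info:

model_config:
  model_type: "VisionTransformer"
  model_init_config:
    img_size: 64
    patch_size: 8
    embed_dim: 192
    depth: 12
    num_heads: 3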

class timm.models.vision_transformer_relpos.VisionTransformerRelPos(img_size: int | ~typing.Tuple[int, int] = 224, patch_size: int | ~typing.Tuple[int, int] = 16, in_chans: int = 3, num_classes: int = 1000, global_pool: ~typing.Literal['', 'avg', 'token', 'map'] = 'avg', embed_dim: int = 768, depth: int = 12, num_heads: int = 12, mlp_ratio: float = 4.0, qkv_bias: bool = True, qk_norm: bool = False, init_values: float | None = 1e-06, class_token: bool = False, fc_norm: bool = False, rel_pos_type: str = 'mlp', rel_pos_dim: int | None = None, shared_rel_pos: bool = False, drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, weight_init: ~typing.Literal['skip', 'jax', 'moco', ''] = 'skip', fix_init: bool = False, embed_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'timm.layers.patch_embed.PatchEmbed'>, norm_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, act_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, block_fn: ~typing.Type[~torch.nn.modules.module.Module] = <class 'timm.models.vision_transformer_relpos.RelPosBlock'>)

Vision Transformer w/ Relative Position Bias

Differing from classic vit, this impl
  • uses relative position index (swin v1 / beit) or relative log coord + mlp (swin v2) pos embed

  • defaults to no class token (can be enabled)

  • defaults to global avg pool for head (can be changed)

  • layer-scale (residual branch gain) enabled

class timm.models.vision_transformer_sam.VisionTransformerSAM(img_size: int = 1024, patch_size: int = 16, in_chans: int = 3, num_classes: int = 768, embed_dim: int = 768, depth: int = 12, num_heads: int = 12, mlp_ratio: float = 4.0, qkv_bias: bool = True, qk_norm: bool = False, init_values: float | None = None, pre_norm: bool = False, drop_rate: float = 0.0, pos_drop_rate: float = 0.0, patch_drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, weight_init: str = '', embed_layer: ~typing.Callable = functools.partial(<class 'timm.layers.patch_embed.PatchEmbed'>, output_fmt=<Format.NHWC: 'NHWC'>, strict_img_size=False), norm_layer: ~typing.Callable | None = <class 'torch.nn.modules.normalization.LayerNorm'>, act_layer: ~typing.Callable | None = <class 'torch.nn.modules.activation.GELU'>, block_fn: ~typing.Callable = <class 'timm.models.vision_transformer_sam.Block'>, mlp_layer: ~typing.Callable = <class 'timm.layers.mlp.Mlp'>, use_abs_pos: bool = True, use_rel_pos: bool = False, use_rope: bool = False, window_size: int = 14, global_attn_indexes: ~typing.Tuple[int, ...] = (), neck_chans: int = 256, global_pool: str = 'avg', head_hidden_size: int | None = None, ref_feat_shape: ~typing.Tuple[~typing.Tuple[int, int], ~typing.Tuple[int, int]] | None = None)

Vision Transformer for Segment-Anything Model (SAM)

A PyTorch impl of Exploring Plain Vision Transformer Backbones for Object Detection or Segment Anything Model (SAM)
class timm.models.volo.VOLO(layers, img_size=224, in_chans=3, num_classes=1000, global_pool='token', patch_size=8, stem_hidden_dim=64, embed_dims=None, num_heads=None, downsamples=(True, False, False, False), outlook_attention=(True, False, False, False), mlp_ratio=3.0, qkv_bias=False, drop_rate=0.0, pos_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=<class 'torch.nn.modules.normalization.LayerNorm'>, post_layers=('ca', 'ca'), use_aux_head=True, use_mix_token=False, pooling_scale=2)

Vision Outlooker, the main class of our model

forward_train(x)

A separate forward fn for training with mix_token (if a train script supports it). Combining multiple modes in a single forward with different return types is torchscript hell.

class timm.models.vovnet.VovNet(cfg, in_chans=3, num_classes=1000, global_pool='avg', output_stride=32, norm_layer=<class 'timm.layers.norm_act.BatchNormAct2d'>, act_layer=<class 'torch.nn.modules.activation.ReLU'>, drop_rate=0.0, drop_path_rate=0.0, **kwargs)
class timm.models.xception.Xception(num_classes=1000, in_chans=3, drop_rate=0.0, global_pool='avg')

Xception optimized for the ImageNet dataset, as specified in https://arxiv.org/pdf/1610.02357.pdf

class timm.models.xception_aligned.XceptionAligned(block_cfg: ~typing.List[~typing.Dict], num_classes: int = 1000, in_chans: int = 3, output_stride: int = 32, preact: bool = False, act_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>, norm_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.batchnorm.BatchNorm2d'>, drop_rate: float = 0.0, drop_path_rate: float = 0.0, global_pool: str = 'avg')

Modified Aligned Xception

class timm.models.xcit.Xcit(img_size=224, patch_size=16, in_chans=3, num_classes=1000, global_pool='token', embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, act_layer=None, norm_layer=None, cls_attn_layers=2, use_pos_embed=True, eta=1.0, tokens_norm=False)

Based on timm and DeiT code bases https://github.com/rwightman/pytorch-image-models/tree/master/timm https://github.com/facebookresearch/deit/