An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows


MBLLEN: Low-light Image/Video Enhancement Using CNNs

BLNet: A Fast Deep Learning Framework for Low-Light Image Enhancement with Noise Removal and Color Restoration

DA-DRN: Degradation-Aware Deep Retinex Network for Low-Light Image Enhancement

Generative Query Network

Single-stage Keypoint-based Category-level Object Pose Estimation from an RGB Image

StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets

Transfer Learning for Pose Estimation of Illustrated Characters

Block-NeRF: Scalable Large Scene Neural View Synthesis

Rank Minimization for Snapshot Compressive Imaging

Unsupervised Scale-consistent Depth Learning from Video

HINet: Half Instance Normalization Network for Image Restoration

Learning To Count Everything

SAFA: Structure Aware Face Animation

FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization

Learning to Detect Every Thing in an Open World

Rethinking Keypoint Representations: Modeling Keypoints and Poses as Objects for Multi-Person Human Pose Estimation

Objects as Points

Learning Privacy-preserving Optics for Human Pose Estimation

Patches Are All You Need?

CLIPasso: Semantically-Aware Object Sketching

MobileStereoNet: Towards Lightweight Deep Networks for Stereo Matching

FILM: Frame Interpolation for Large Motion

SSAST: Self-Supervised Audio Spectrogram Transformer

Boundary-Aware Segmentation Network for Mobile and Web Applications

End-to-end Lane Shape Prediction with Transformers

SRWarp: Generalized Image Super-Resolution under Arbitrary Transformation

Deep High-Resolution Representation Learning for Visual Recognition

Suppress and Balance: A Simple Gated Network for Salient Object Detection

Deblurring Face Images using Uncertainty Guided Multi-Stream Semantic Networks

BlendGAN: Implicitly GAN Blending for Arbitrary Stylized Face Generation

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Autoregressive Diffusion Models

Vector Quantized Diffusion Model for Text-to-Image Synthesis


AniGAN: Style-Guided Generative Adversarial Networks for Unsupervised Anime Face Generation

SSUL: Semantic Segmentation with Unknown Label for Exemplar-based Class-Incremental Learning

FSA-Net: Learning Fine-Grained Structure Aggregation for Head Pose Estimation From a Single Image

WHENet: Real-time Fine-Grained Estimation for Wide Range Head Pose

Graph2Pix: A Graph-Based Image to Image Translation Framework

Generalization in Dexterous Manipulation via Geometry-Aware Multi-Task Learning

A System for General In-Hand Object Re-Orientation

MG-GAN: A Multi-Generator Model Preventing Out-of-Distribution Samples in Pedestrian Trajectory Prediction

The Center of Attention: Center-Keypoint Grouping via Attention for Multi-Person Pose Estimation

MOTSynth: How Can Synthetic Data Help Pedestrian Detection and Tracking?

JoJoGAN: One Shot Face Stylization

textless-lib: a Library for Textless Spoken Language Processing


Score-Based Generative Modeling with Critically-Damped Langevin Diffusion

Slicing Aided Hyper Inference and Fine-tuning for Small Object Detection

scpi: Uncertainty Quantification for Synthetic Control Estimators

RAFT: Recurrent All-Pairs Field Transforms for Optical Flow

Learning Aberrance Repressed Correlation Filters for Real-Time UAV Tracking

VinVL: Revisiting Visual Representations in Vision-Language Models

YOLOX: Exceeding YOLO Series in 2021

It's Raw! Audio Generation with State-Space Models

Cyclical Focal Loss

Masked-attention Mask Transformer for Universal Image Segmentation

Human Pose Regression with Residual Log-likelihood Estimation

EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction

Predicting emotion from music videos: exploring the relative contribution of visual and auditory information to affective responses

How Do Vision Transformers Work?

RainGAN: Unsupervised Raindrop Removal via Decomposition and Composition

GANcraft: Unsupervised 3D Neural Rendering of Minecraft Worlds

HITNet: Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching

TabNet: Attentive Interpretable Tabular Learning

Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut

Neural Outlier Rejection for Self-Supervised Keypoint Learning

Self-Supervised 3D Mesh Reconstruction From Single Images

Unsupervised Learning of Action Classes With Continuous Temporal Embedding

UNISURF: Unifying Neural Implicit Surfaces and Radiance Fields for Multi-View Reconstruction

3D Reconstruction of Novel Object Shapes from Single Images

HumanGPS: Geodesic PreServing Feature for Dense Human Correspondences

BoxInst: High-Performance Instance Segmentation with Box Annotations

End-to-End Video Instance Segmentation with Transformers

Local Deep Implicit Functions for 3D Shape

Is Space-Time Attention All You Need for Video Understanding?

PatchmatchNet: Learned Multi-View Patchmatch Stereo

DataMix: Efficient Privacy-Preserving Edge-Cloud Inference

RepVGG: Making VGG-style ConvNets Great Again

A Morphable Model For The Synthesis Of 3D Faces

Do 2D GANs Know 3D Shape? Unsupervised 3D shape reconstruction from 2D Image GANs

NeRF++: Analyzing and Improving Neural Radiance Fields

Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction

pi-GAN: Periodic Implicit Generative Adversarial Networks for 3D-Aware Image Synthesis

Geo-PIFu: Geometry and Pixel Aligned Implicit Functions for Single-view Human Reconstruction

Test-Time Training with Self-Supervision for Generalization under Distribution Shifts

CoReNet: Coherent 3D scene reconstruction from a single RGB image

Synthesize then Compare: Detecting Failures and Anomalies for Semantic Segmentation

Are Labels Necessary for Neural Architecture Search?