User profiles for Jianwei Yang
Jianwei Yang, Principal Researcher, Microsoft Research, Redmond. Verified email at microsoft.com. Cited by 12937.
Hierarchical question-image co-attention for visual question answering
A number of recent works have proposed attention models for Visual Question Answering (VQA)
that generate spatial maps highlighting image regions relevant to answering the …
VinVL: Revisiting visual representations in vision-language models
This paper presents a detailed study of improving vision features and develops an improved
object detection model for vision language (VL) tasks. Compared to the most widely used …
RegionCLIP: Region-based language-image pretraining
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive
results on image classification in both zero-shot and transfer learning settings. However, …
Joint unsupervised learning of deep representations and image clusters
In this paper, we propose a recurrent framework for joint unsupervised learning of deep
representations and image clusters. In our framework, successive operations in a clustering …
Neural baby talk
We introduce a novel framework for image captioning that can produce natural language
explicitly grounded in entities that object detectors find in the image. Our approach reconciles …
Grounded language-image pre-training
This paper presents a grounded language-image pre-training (GLIP) model for learning
object-level, language-aware, and semantic-rich visual representations. GLIP unifies object …
Segment everything everywhere all at once
In this work, we present SEEM, a promptable and interactive model for segmenting everything
everywhere all at once in an image. In SEEM, we propose a novel and versatile decoding …
Florence: A new foundation model for computer vision
Automated visual understanding of our diverse and open world demands computer vision
models to generalize well with minimal customization for specific tasks, similar to human vision…
GLIGEN: Open-set grounded text-to-image generation
Large-scale text-to-image diffusion models have made amazing advances. However, the
status quo is to use text input alone, which can impede controllability. In this work, we propose …
Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection
In this paper, we present an open-set object detector, called Grounding DINO, by marrying
Transformer-based detector DINO with grounded pre-training, which can detect arbitrary …