Jianwei Yang

Principal Researcher, Microsoft Research, Redmond
Verified email at microsoft.com
Cited by 12937

Hierarchical question-image co-attention for visual question answering

J Lu, J Yang, D Batra, D Parikh - Advances in neural …, 2016 - proceedings.neurips.cc
A number of recent works have proposed attention models for Visual Question Answering (VQA)
that generate spatial maps highlighting image regions relevant to answering the …

VinVL: Revisiting visual representations in vision-language models

P Zhang, X Li, X Hu, J Yang, L Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com
This paper presents a detailed study of improving vision features and develops an improved
object detection model for vision language (VL) tasks. Compared to the most widely used …

RegionCLIP: Region-based language-image pretraining

Y Zhong, J Yang, P Zhang, C Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive
results on image classification in both zero-shot and transfer learning settings. However, …

Joint unsupervised learning of deep representations and image clusters

J Yang, D Parikh, D Batra - … of the IEEE conference on computer …, 2016 - cv-foundation.org
In this paper, we propose a recurrent framework for joint unsupervised learning of deep
representations and image clusters. In our framework, successive operations in a clustering …

Neural baby talk

J Lu, J Yang, D Batra, D Parikh - Proceedings of the IEEE …, 2018 - openaccess.thecvf.com
We introduce a novel framework for image captioning that can produce natural language
explicitly grounded in entities that object detectors find in the image. Our approach reconciles …

Grounded language-image pre-training

LH Li, P Zhang, H Zhang, J Yang, C Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper presents a grounded language-image pre-training (GLIP) model for learning
object-level, language-aware, and semantic-rich visual representations. GLIP unifies object …

Segment everything everywhere all at once

X Zou, J Yang, H Zhang, F Li, L Li… - Advances in …, 2024 - proceedings.neurips.cc
In this work, we present SEEM, a promptable and interactive model for segmenting everything
everywhere all at once in an image. In SEEM, we propose a novel and versatile decoding …

Florence: A new foundation model for computer vision

…, Y Shi, L Wang, J Wang, B Xiao, Z Xiao, J Yang… - arXiv preprint arXiv …, 2021 - arxiv.org
Automated visual understanding of our diverse and open world demands computer vision
models to generalize well with minimal customization for specific tasks, similar to human vision…

GLIGEN: Open-set grounded text-to-image generation

Y Li, H Liu, Q Wu, F Mu, J Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale text-to-image diffusion models have made amazing advances. However, the
status quo is to use text input alone, which can impede controllability. In this work, we propose …

Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection

…, T Ren, F Li, H Zhang, J Yang, C Li, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we present an open-set object detector, called Grounding DINO, by marrying
Transformer-based detector DINO with grounded pre-training, which can detect arbitrary …