User profiles for Jianwei Yang
Jianwei Yang, Principal Researcher, Microsoft Research, Redmond. Verified email at microsoft.com. Cited by 12937.
Hierarchical question-image co-attention for visual question answering
A number of recent works have proposed attention models for Visual Question Answering (VQA)
that generate spatial maps highlighting image regions relevant to answering the …
VinVL: Revisiting visual representations in vision-language models
This paper presents a detailed study of improving vision features and develops an improved
object detection model for vision language (VL) tasks. Compared to the most widely used …
RegionCLIP: Region-based language-image pretraining
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive
results on image classification in both zero-shot and transfer learning settings. However, …
Joint unsupervised learning of deep representations and image clusters
In this paper, we propose a recurrent framework for joint unsupervised learning of deep
representations and image clusters. In our framework, successive operations in a clustering …
Neural baby talk
We introduce a novel framework for image captioning that can produce natural language
explicitly grounded in entities that object detectors find in the image. Our approach reconciles …
Grounded language-image pre-training
This paper presents a grounded language-image pre-training (GLIP) model for learning
object-level, language-aware, and semantic-rich visual representations. GLIP unifies object …
Segment everything everywhere all at once
In this work, we present SEEM, a promptable and interactive model for segmenting everything
everywhere all at once in an image. In SEEM, we propose a novel and versatile decoding …
Florence: A new foundation model for computer vision
Automated visual understanding of our diverse and open world demands computer vision
models to generalize well with minimal customization for specific tasks, similar to human vision…
GLIGEN: Open-set grounded text-to-image generation
Large-scale text-to-image diffusion models have made amazing advances. However, the
status quo is to use text input alone, which can impede controllability. In this work, we propose …
Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection
In this paper, we present an open-set object detector, called Grounding DINO, by marrying
Transformer-based detector DINO with grounded pre-training, which can detect arbitrary …