Shyamgopal Karthik

I just finished my PhD at the University of Tübingen. In the past few years, I've broadly worked on problems at the intersection of Vision and Language. In the past year, I've especially enjoyed working on post-training diffusion models with human feedback. I'm excited to develop better controllability and alignment of image and video models in the coming years.

Before this, I completed my Bachelor's and Master's degrees at the International Institute of Information Technology, Hyderabad, where I worked with Prof. Vineet Gandhi on a variety of computer vision problems.

I also did an internship with the Creative Vision team at Snap Research in Santa Monica, where I worked with Anil Kag and Jian Ren on improving Direct Preference Optimization for text-to-image models. Before that, I did an internship at Naver Labs Europe, working with Boris Chidlovskii and Jerome Revaud on self-supervised learning methods for learning from long-tailed data.

In a previous life, I used to be a terrible chess player.

News

Selected Publications

Generative Models

RankDPO publication image

Scalable Ranked Preference Optimization for Text-to-Image Generation

Shyamgopal Karthik, Huseyin Coskun, Zeynep Akata, Sergey Tulyakov, Jian Ren, and Anil Kag
ICCV 2025, Hawaii, USA.
paper webpage bibtex

While ReNO did an amazing job at improving the quality of text-to-image models, this came with an increased runtime. As a result, we looked at DPO-based techniques for improving these models instead. It turns out the biggest bottleneck in applying DPO here is that the public preference datasets aren't of great quality. To address this, we generated and labelled a new preference dataset using newer text-to-image models and off-the-shelf reward models. This also allowed us to collect preference rankings and develop a ranking-based objective that improves upon the standard DPO objective.
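
For intuition, here is a toy PyTorch sketch of a ranking-weighted pairwise preference loss in this spirit; it is not the exact RankDPO objective, and the `scores`/`ranks` tensors are hypothetical stand-ins for the policy-vs-reference log-ratios and the reward-model rankings.

```python
import torch
import torch.nn.functional as F

def ranked_pairwise_dpo_loss(scores, ranks, beta=1.0):
    # scores: (N,) log-prob ratios (policy vs. reference) for N generations of one prompt
    # ranks:  (N,) integer ranks from a reward model (0 = most preferred)
    losses, weights = [], []
    n = scores.shape[0]
    for i in range(n):
        for j in range(n):
            if ranks[i] < ranks[j]:  # image i is preferred over image j
                # discount pairs the way ranking metrics do: mistakes near the top matter more
                w = 1.0 / torch.log2(ranks[i].float() + 2) - 1.0 / torch.log2(ranks[j].float() + 2)
                losses.append(-F.logsigmoid(beta * (scores[i] - scores[j])))
                weights.append(w)
    losses, weights = torch.stack(losses), torch.stack(weights)
    return (weights * losses).sum() / weights.sum()
```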

ReNO publication image

ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization

Luca Eyring*, Shyamgopal Karthik*, Karsten Roth*, Alexey Dosovitskiy, Zeynep Akata
NeurIPS 2024, Vancouver, Canada.
paper code demo bibtex

We knew that best-of-n sampling with a reward model was already an extremely strong baseline. However, could we go one step further and optimize the initial noise to improve this even more? This problem stumped us for a long while, since backpropagating through the whole diffusion process was expensive and suffered from exploding gradients. We finally found the solution with one-step models! However, would one-step models be good enough to work with? It turns out that optimizing the noise of one-step text-to-image models could give us results that were competitive with proprietary closed-source models that were 10x larger! This also capped off a fruitful 1.5-year journey of trying my best to find interesting research directions without updating a single parameter of any model.
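
A minimal sketch of the core loop, assuming a hypothetical `one_step_model` (a distilled text-to-image generator) and a differentiable `reward_model`; the actual method adds further regularization of the noise, which is omitted here.

```python
import torch

def optimize_noise(one_step_model, reward_model, prompt, steps=50, lr=5.0):
    noise = torch.randn(1, 4, 64, 64, requires_grad=True)  # initial latent noise for one image
    opt = torch.optim.SGD([noise], lr=lr)
    for _ in range(steps):
        image = one_step_model(noise, prompt)   # a single forward pass, so backprop stays cheap
        loss = -reward_model(image, prompt)     # ascend the reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return one_step_model(noise.detach(), prompt)
```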

ImageSelect publication image

If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection

Shyamgopal Karthik*, Karsten Roth*, Massimiliano Mancini, Zeynep Akata
ICCV Workshop on Multimodal Foundation Models 2023, Paris, France.
paper code bibtex

This paper started my journey into text-to-image generation. The main challenge we had was that Stable Diffusion models were doing a decent job at generating high-quality images, but there were tons of issues in closely following the prompt. While several methods had been proposed, especially ones focusing on the attention maps during inference, we realized that best-of-n sampling with a human-preference reward model went a long way in improving the results. While this was quite trivial in some ways, it set the stage for us to continue exploring the effectiveness of reward models and the effect of the seed in image generation.
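
To make the idea concrete, here is a minimal sketch of best-of-n selection, assuming a diffusers-style `pipe` and a preference scorer `reward_model` (e.g. PickScore or HPS); both are placeholders rather than the exact setup from the paper.

```python
import torch

def best_of_n(pipe, reward_model, prompt, n=10):
    best_image, best_score = None, float("-inf")
    for seed in range(n):
        generator = torch.Generator().manual_seed(seed)      # each seed gives a different initial noise
        image = pipe(prompt, generator=generator).images[0]  # sample one candidate
        score = reward_model(image, prompt)                  # how well does it follow the prompt?
        if score > best_score:
            best_image, best_score = image, score
    return best_image
```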

Compositionality and VLMs

Good CREPE publication image

A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks

Vishaal Udandarao*, Mehdi Cherti, Shyamgopal Karthik, Jenia Jitsev, Samuel Albanie, Matthias Bethge
EVAL-FOMO Workshop at CVPR 2025, Nashville, USA.
paper

This was a fun exploration into a bunch of benchmarks designed to evaluate the compositional understanding of VLMs (e.g. SugarCREPE). Turns out, many of them have severe issues, allowing blind baselines and heuristics to outperform VLMs in several cases, and fixing these benchmarks isn't too straightforward either. Hopefully, we're able to provide useful insights to guide benchmark construction going forward!
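
As an illustration of what a "blind" baseline can look like on such benchmarks, the sketch below ignores the image entirely and picks whichever candidate caption a text-only language model finds more plausible; the model choice (GPT-2) and the scoring are purely illustrative, not the specific baselines from the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def lm_log_likelihood(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    out = lm(ids, labels=ids)               # loss is mean negative log-likelihood per token
    return -out.loss.item() * ids.shape[1]  # total log-likelihood (higher = more plausible)

def blind_choice(positive_caption: str, hard_negative: str) -> bool:
    # True if the "blind" baseline solves the example without ever seeing the image
    return lm_log_likelihood(positive_caption) > lm_log_likelihood(hard_negative)
```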

EgoCVR publication image

EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval

Thomas Hummel*, Shyamgopal Karthik*, Mariana-Iuliana Georgescu, Zeynep Akata
ECCV 2024, Milan, Italy.
paper code bibtex

Building on our previous work, we were keen on exploring Composed Video Retrieval. The biggest issue was that the existing benchmark (WebVid-CoVR) was focused excessively on images and did not really require the whole video to solve the task. To address this, we spent a lot of time manually curating an evaluation set from Ego4D, which eventually turned into a very nice benchmark. CIReVL adapted for videos also turned out to be a very nice training-free method that was competitive with methods trained on millions of videos!

Vision-by-Language publication image

Vision-by-Language for Training-Free Compositional Image Retrieval

Shyamgopal Karthik*, Karsten Roth*, Massimiliano Mancini, Zeynep Akata
ICLR 2024, Vienna, Austria.
paper code bibtex

We started off looking at the Composed Image Retrieval task, where we have a query image and a textual instruction that modifies the query. Popular methods for this task were trained similarly to textual inversion methods and predicted a "pseudo-token" for the query image. Our immediate instinct was that using an off-the-shelf captioning model must provide a stronger and more interpretable signal than these trained pseudo-tokens. Therefore, our "vision-by-language" method was simply to caption the image, reformulate the caption based on the textual instruction, and retrieve images based on the reformulated caption. Not only was this method more interpretable and training-free, it also allowed us to double the state-of-the-art performance on some popular benchmarks.
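
A rough sketch of the pipeline, with `captioner`, `llm`, and `clip_text_encode` as placeholders for off-the-shelf models (e.g. a BLIP-style captioner, a GPT-style LLM, and a CLIP text encoder); the prompt wording is made up for illustration.

```python
def compose_and_retrieve(query_image, instruction, gallery_features,
                         captioner, llm, clip_text_encode):
    caption = captioner(query_image)                 # e.g. "a brown dog on a beach"
    prompt = (f"Image description: {caption}\n"
              f"Modification: {instruction}\n"
              f"Rewrite the description so it reflects the modification:")
    target_caption = llm(prompt)                     # the composed query, in plain language
    query_feat = clip_text_encode(target_caption)    # embed the rewritten caption
    scores = gallery_features @ query_feat           # cosine similarity if features are normalized
    return scores.argsort(descending=True)           # ranked gallery indices
```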

Representation Learning

KG-SP publication image

KG-SP: Knowledge Guided Simple Primitives for Open World Compositional Zero-Shot Learning

Shyamgopal Karthik, Massimiliano Mancini, Zeynep Akata
CVPR 2022, New Orleans, USA.
paper code bibtex

In this work, we looked at the problem of Compositional Zero-Shot Learning, where the goal is to predict (attribute, object) labels for an image, and generalize to unseen (attribute, object) pairs. Recent methods had tried to model attributes and objects jointly using a variety of ideas. Here, we show that predicting attributes and objects independently can work quite well for this task. Additionally, we show how a knowledge base can be incorporated to improve the performance of the model at inference. Finally, we introduce a new partially labeled setting where we show how we can train our model in the absence of compositional labels.
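
A small sketch of the inference-time idea, assuming hypothetical `attr_head`/`obj_head` classifiers and a precomputed `feasibility_mask` (e.g. derived from word-embedding similarity or an external knowledge base); this captures the spirit of the approach rather than the exact implementation.

```python
import torch

def predict_composition(features, attr_head, obj_head, feasibility_mask):
    # feasibility_mask: (num_attrs, num_objs), 1 for pairs deemed plausible, 0 otherwise
    p_attr = attr_head(features).softmax(-1)           # (B, num_attrs)
    p_obj = obj_head(features).softmax(-1)             # (B, num_objs)
    p_pair = p_attr.unsqueeze(2) * p_obj.unsqueeze(1)  # independence: p(a, o) = p(a) * p(o)
    p_pair = p_pair * feasibility_mask                 # zero out implausible compositions
    flat = p_pair.flatten(1).argmax(-1)
    num_objs = p_obj.shape[-1]
    return flat // num_objs, flat % num_objs           # predicted (attribute, object) indices
```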

SSL publication image

Learning from Long-Tailed Data with Noisy Labels

Shyamgopal Karthik, Jerome Revaud, Boris Chidlovskii
ICCV 2021 Workshop on Self-supervised Learning for Next-Generation Industry-level Autonomous Driving, Virtual.
paper bibtex

This paper started off as a fun journey towards developing methods that are robust to both label noise and long-tailed class distributions. Methods tailored for one of these challenges collapsed when the other challenge was introduced. In the end, it turned out that vanilla self-supervised training went a long way in learning representations that were robust to both label noise and long-tailed distributions.

ICLR publication image

No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks

Shyamgopal Karthik, Ameya Prabhu, Puneet Dokania, Vineet Gandhi
ICLR 2021, Virtual.
paper code bibtex

The main motivation behind this work was to see if we could reduce the severity of mistakes in a classification setting. To do this, we make use of label hierarchies which are readily available through taxonomies like WordNet. For our method, we show that a simple algorithm from Duda and Hart's 1973 Pattern Classification textbook can be effectively used in a post-hoc manner while retaining the calibration of the base model.
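
The post-hoc step itself is tiny; here is a sketch of conditional risk minimization, assuming a `hierarchy_cost` matrix precomputed from a taxonomy such as WordNet.

```python
import torch

def crm_predict(probs: torch.Tensor, hierarchy_cost: torch.Tensor) -> torch.Tensor:
    # probs:          (B, C) softmax outputs of the (unchanged) base classifier
    # hierarchy_cost: (C, C) cost of predicting class i when the true class is j
    expected_cost = probs @ hierarchy_cost.T  # (B, C): expected hierarchical cost of each prediction
    return expected_cost.argmin(dim=-1)       # pick the least risky class instead of the argmax
```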

Tracking publication image

Simple Unsupervised Multi-Object Tracking

Shyamgopal Karthik, Ameya Prabhu, Vineet Gandhi
arXiv 2020.
paper bibtex

We revisited the Re-Identification models that are widely used in Multi-Object Tracking algorithms. In various trackers, this is often the only component that requires video-level supervision. Our insight was that we could train a ReID model using pseudo-labels generated from a Kalman-filter-based tracker in a self-supervised fashion. The resulting ReID model can be used as a drop-in replacement for the supervised ReID models used in trackers. Alternatively, using these ReID features as a post-processing step in trackers that don't use a ReID model can reduce the number of ID switches by 67%. In hindsight, I hope this was useful in motivating some later works that strengthened the traditional tracking-by-detection paradigm.
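
A sketch of the training stage under stated assumptions: given person crops and track-ID pseudo-labels produced by a motion-only tracker (e.g. SORT), train a standard classification-style ReID embedding; `reid_net` and its `embed_dim` attribute are placeholders, not code from the paper.

```python
import torch
import torch.nn as nn

def train_reid_on_pseudo_labels(reid_net, loader, num_track_ids, epochs=10):
    # loader yields (person_crop, track_id) pairs, where track_id comes from a
    # Kalman-filter tracker run on unlabeled video rather than human annotation.
    classifier = nn.Linear(reid_net.embed_dim, num_track_ids)
    opt = torch.optim.Adam(list(reid_net.parameters()) + list(classifier.parameters()), lr=3e-4)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for crops, track_ids in loader:
            logits = classifier(reid_net(crops))  # embedding -> pseudo-identity logits
            loss = ce(logits, track_ids)          # noisy, but free, supervision
            opt.zero_grad()
            loss.backward()
            opt.step()
    return reid_net  # usable as a drop-in replacement for a supervised ReID model
```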

Website style cloned from this wonderful website.
Last update: May 2025