Shyamgopal Karthik

I am a final-year PhD candidate at the University of Tübingen, where I've been advised by Prof. Zeynep Akata. In the past few years, I've broadly worked on problems at the intersection of Vision and Language. In the past year, I've predominantly worked on enhancing text-to-image models with human feedback.

Before this, I completed my Bachelor's and Master's degrees at the International Institute of Information Technology, Hyderabad in 2021, where I worked with Prof. Vineet Gandhi on a variety of computer vision problems.

Recently, I also had the pleasure of doing an internship with the Creative Vision team at Snap Research in Santa Monica, where I worked on preference optimization of text-to-image models with Jian Ren and Anil Kag. Previously, I did an internship at Naver Labs Europe, working on self-supervised learning with Boris Chidlovskii and Jérôme Revaud.

In a previous life, I used to be a terrible chess player.

Update: I am looking for full-time positions from Summer 2025! Please reach out if you think I would be a good fit for your team.

News

  • 27 September 2024. ReNO was accepted to NeurIPS 2024!
  • July 2024. EgoCVR was accepted to ECCV 2024!
  • 15 April 2024. Started my internship with Snap in Santa Monica.
  • January 2024. CIReVL was accepted to ICLR 2024!

Selected Publications

Scalable Ranked Preference Optimization for Text-to-Image Generation
Shyamgopal Karthik, Huseyin Coskun, Zeynep Akata, Sergey Tulyakov, Jian Ren, and Anil Kag
arXiv 2024
paper webpage bibtex

While ReNO did an amazing job of improving the quality of text-to-image models, this came at the cost of increased runtime. As a result, we looked at DPO-based techniques for improving text-to-image models instead. It turns out that the biggest bottleneck in applying DPO to these models is that the public preference datasets aren't of great quality. To address this, we generated and labelled a new preference dataset using newer text-to-image models and off-the-shelf reward models. This also allowed us to collect full preference rankings and develop a nice ranking-based objective that improves upon the standard pairwise DPO objective.
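
For a rough flavour of the data side, here is a minimal sketch (my own illustration, not the paper's exact procedure) of turning reward-model scores over several generations per prompt into ranked preference pairs; `generate` and `reward_fn` are hypothetical callables:

    import torch

    def build_ranked_pairs(generate, reward_fn, prompt, n=4):
        # Generate several candidates and score each with a reward model.
        images = [generate(prompt, seed=s) for s in range(n)]
        scores = torch.tensor([reward_fn(img, prompt) for img in images])
        # Sort by reward: every (higher-ranked, lower-ranked) pair becomes
        # a (preferred, rejected) training pair for a DPO-style objective.
        ranked = [images[i] for i in scores.argsort(descending=True).tolist()]
        return [(ranked[i], ranked[j])
                for i in range(n) for j in range(i + 1, n)]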

ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization
Luca Eyring*, Shyamgopal Karthik*, Karsten Roth, Alexey Dosovitskiy, and Zeynep Akata
NeurIPS 2024, Vancouver, Canada.
paper code demo bibtex

We knew that best-of-n sampling with a reward model was already an extremely strong baseline. However, could we go one step further and optimize the initial noise to improve on it? This problem stumped us for a long while, since backpropagating through the whole diffusion process was expensive and suffered from exploding gradients. We finally found the solution with one-step models! But would one-step models be good enough to work with? It turns out that optimizing the noise of one-step text-to-image models could give us results competitive with proprietary closed-source models that were 10x larger! This also marked the culmination of a fruitful 1.5-year journey of trying my best to find interesting research directions without updating a single parameter of any model.
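
For a flavour of the idea, here is a minimal sketch of reward-based noise optimization, assuming a differentiable one-step generator and a differentiable reward model; `generator` and `reward_fn` are hypothetical stand-ins, and the latent shape is illustrative:

    import torch

    def optimize_noise(generator, reward_fn, prompt, steps=50, lr=5.0):
        # The initial latent noise is the only "parameter" being trained.
        noise = torch.randn(1, 4, 64, 64, requires_grad=True)
        opt = torch.optim.SGD([noise], lr=lr)
        for _ in range(steps):
            image = generator(noise, prompt)   # one-step generation
            loss = -reward_fn(image, prompt)   # ascend the reward
            opt.zero_grad()
            loss.backward()
            opt.step()
        return generator(noise, prompt).detach()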

If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection
Shyamgopal Karthik*, Karsten Roth*, Massimiliano Mancini, and Zeynep Akata
ICCV Workshop on Multimodal Foundation Models 2023, Paris, France.
paper code bibtex

This paper started my journey into text-to-image generation. The main challenge was that Stable Diffusion models were doing a decent job of generating high-quality images, but there were tons of issues with closely following the prompt. While several methods had been proposed, especially ones manipulating attention maps during inference, we realized that best-of-n sampling with a human-preference reward model went a long way in improving the results. While this was quite trivial in some ways, it set the stage for us to continue exploring the effectiveness of reward models and the effect of the seed in image generation.
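
The recipe itself is tiny; a minimal sketch, again with hypothetical `generate` and `reward_fn` callables standing in for the diffusion model and the human-preference reward model:

    import torch

    def best_of_n(generate, reward_fn, prompt, n=10):
        # Sample n images from different seeds, keep the highest-reward one.
        candidates = [generate(prompt, seed=s) for s in range(n)]
        scores = torch.tensor([reward_fn(img, prompt) for img in candidates])
        return candidates[scores.argmax().item()]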

EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval
Thomas Hummel*, Shyamgopal Karthik*, Mariana-Iuliana Georgescu, and Zeynep Akata
ECCV 2024, Milan, Italy.
paper code bibtex

Building on our previous work, we were keen on exploring Composed Video Retrieval. The biggest issue was that the existing benchmark (WebVid-CoVR) focused excessively on images and did not really require the whole video to solve the task. To address this, we spent a lot of time manually curating an evaluation set from Ego4D, which eventually turned into a very nice benchmark. CIReVL adapted for videos also turned out to be a strong training-free method, competitive with methods trained on millions of videos!

Vision-by-Language for Training-Free Compositional Image Retrieval
Shyamgopal Karthik*, Karsten Roth*, Massimiliano Mancini, and Zeynep Akata
ICLR 2024, Vienna, Austria.
paper code bibtex

We started off looking at the Composed Image Retrieval task, where we have a query image and a textual instruction that modifies the query. Popular methods for this task were trained similarly to textual inversion and predicted a "pseudo-token" for the query image. Our immediate instinct was that an off-the-shelf captioning model should provide a stronger and more interpretable signal than these trained pseudo-tokens. Therefore, our "vision-by-language" method simply captions the query image, reformulates the caption based on the textual instruction, and retrieves images based on the reformulated caption. Not only was this method more interpretable and training-free, it also allowed us to double the state-of-the-art performance on some popular benchmarks.
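
A minimal sketch of the three-step pipeline, with hypothetical callables standing in for the off-the-shelf captioner, LLM, and CLIP text encoder:

    import torch

    def compose_and_retrieve(caption_fn, llm_fn, text_encoder,
                             gallery_feats, query_image, instruction):
        # 1) Describe the query image in natural language.
        caption = caption_fn(query_image)
        # 2) Have an LLM rewrite the caption according to the instruction.
        target = llm_fn(f"Caption: {caption}\n"
                        f"Instruction: {instruction}\n"
                        f"Rewritten caption:")
        # 3) Rank gallery images by similarity to the rewritten caption.
        sims = gallery_feats @ text_encoder(target)   # [num_gallery]
        return sims.argsort(descending=True)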

KG-SP: Knowledge Guided Simple Primitives for Open World Compositional Zero-Shot Learning
Shyamgopal Karthik, Massimiliano Mancini, and Zeynep Akata
CVPR 2022, New Orleans, USA.
paper code bibtex

In this work, we looked at the problem of Compositional Zero-Shot Learning, where the goal is to predict (attribute, object) labels for an image and generalize to unseen (attribute, object) pairs. Recent methods had tried to model attributes and objects jointly using a variety of ideas. Here, we show that predicting attributes and objects independently can work quite well for this task. Additionally, we show how a knowledge base can be incorporated to improve the performance of the model at inference. Finally, we introduce a new partially labeled setting where we show how we can train our model in the absence of compositional labels.
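
The scoring rule is easy to sketch: two independent heads, an independence assumption to combine them, and a feasibility mask from the knowledge base (all names below are hypothetical):

    import torch

    def score_compositions(attr_logits, obj_logits, feasible):
        # Independent predictions for attributes and objects...
        p_attr = attr_logits.softmax(dim=-1)       # [num_attrs]
        p_obj = obj_logits.softmax(dim=-1)         # [num_objs]
        # ...combined under an independence assumption, then pruned:
        # pairs the knowledge base deems infeasible are zeroed out.
        joint = p_attr[:, None] * p_obj[None, :]   # [num_attrs, num_objs]
        return joint.masked_fill(~feasible, 0.0)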

Learning from Long-Tailed Data with Noisy Labels
Shyamgopal Karthik, Jérôme Revaud, and Boris Chidlovskii
ICCV 2021 Workshop on Self-supervised Learning for Next-Generation Industry-level Autonomous Driving, Virtual.
paper bibtex

This paper started off as a fun journey towards developing methods that are robust to both label noise and long-tailed class distributions. Methods tailored for one of these challenges collapsed when the other challenge was introduced. In the end, it turned out that vanilla self-supervised training went a long way in learning representations that were robust to both label noise and long-tailed distributions.

No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks
Shyamgopal Karthik, Ameya Prabhu, Puneet Dokania, and Vineet Gandhi
ICLR 2021, Virtual.
paper code bibtex

The main motivation behind this work was to see if we could reduce the severity of mistakes in a classification setting. To do this, we make use of label hierarchies, which are readily available through taxonomies like WordNet. We show that a simple algorithm from Duda and Hart's 1973 textbook, Pattern Classification and Scene Analysis, can be used effectively in a post-hoc manner while retaining the calibration of the base model.
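
The decision rule amounts to conditional risk minimization: rather than taking the argmax of the posterior, pick the class with the lowest expected hierarchical cost. A minimal sketch, where `cost[i, j]` is a hypothetical matrix of hierarchy distances (e.g. tree distance in WordNet) between classes i and j:

    import torch

    def min_risk_prediction(probs, cost):
        # probs: [K] posterior from the (unchanged) base model
        # cost:  [K, K] severity of predicting class i when the truth is j
        risk = cost @ probs          # expected cost of each possible prediction
        return risk.argmin().item()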

Simple Unsupervised Multi-Object Tracking
Shyamgopal Karthik, Ameya Prabhu, and Vineet Gandhi
arXiv 2020.
paper bibtex

We revisited the Re-Identification (ReID) models widely used in Multi-Object Tracking algorithms. In various trackers, this is often the only component that requires video-level supervision. Our insight was that we could train a ReID model in a self-supervised fashion using pseudo-labels generated by a Kalman-filter-based tracker. The resulting ReID model can be used as a drop-in replacement for the supervised ReID models used in trackers. Alternatively, using these ReID features as a post-processing step in trackers that don't use a ReID model can reduce the number of ID switches by 67%. In hindsight, I hope this was useful in motivating some later works that strengthened the traditional tracking-by-detection paradigm.
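
A minimal sketch of one training step under this scheme, where `tracklet_ids` are identities assigned by a Kalman-filter tracker (e.g. SORT) instead of human annotation, and `reid_model` is a hypothetical network with a classification head over pseudo-identities:

    import torch
    import torch.nn.functional as F

    def reid_train_step(reid_model, optimizer, crops, tracklet_ids):
        # Tracklet IDs from the unsupervised tracker act as identity labels.
        logits = reid_model(crops)                 # [B, num_tracklets]
        loss = F.cross_entropy(logits, tracklet_ids)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()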


Website style cloned from this wonderful website.
Last update: November 2024