PASTA: Collaborative AI for Personalized Image Generation


Struggling to translate your exact vision into an AI-generated image? You input a prompt, generate, and the result is close but misses the mark. Refining your prompt with more detail still fails to bridge the gap between your idea and the final output. This common frustration stems from text-to-image (T2I) models' difficulty capturing nuanced, individual creative intent from a single prompt. We introduce a transformative solution: turning image generation into a collaborative conversation.

Our research, “Preference Adaptive and Sequential Text-to-image Agent” (PASTA), presents a reinforcement learning (RL) agent that actively collaborates with users to progressively refine T2I results, eliminating the need for time-consuming, trial-and-error prompt engineering. Through extensive human evaluations, we developed a novel dataset of sequential preferences, enabling us to benchmark PASTA against leading models. Results confirm that PASTA, trained on a hybrid of real and simulated data, consistently delivers more satisfying images. We proudly release our foundational dataset, featuring over 7,000 human rater interactions with PASTA, to accelerate community innovation.

How PASTA Revolutionizes Image Generation

Training an AI agent to precisely adapt to individual user preferences demands vast, diverse interaction data. However, collecting this data from real users presents significant challenges, particularly concerning privacy. Our research overcomes this by employing a sophisticated two-stage training strategy that integrates authentic human feedback with large-scale user simulation.

Initially, we curated a high-quality foundational dataset comprising over 7,000 sequential interactions from human raters. These interactions involved prompt expansions generated by a Gemini Flash large multimodal model, paired with corresponding images produced by a Stable Diffusion XL (SDXL) T2I model. This core set of authentic preference data then fueled the training of a highly accurate user simulator, meticulously designed to replicate real human choices and preferences, generating scalable synthetic data.
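
To make the shape of this data concrete, the sketch below shows how a single rater session might be represented. The field names are illustrative assumptions for exposition, not the schema of the released dataset.

```python
from dataclasses import dataclass, field

@dataclass
class InteractionTurn:
    """One turn of a rater session. Field names here are illustrative
    assumptions, not the released dataset's actual schema."""
    prompt_expansions: list[str]  # LMM-generated elaborations of the prompt
    image_paths: list[str]        # SDXL renderings, one per expansion
    selected_index: int           # which image the rater chose this turn

@dataclass
class RaterSession:
    """A full sequential interaction, from initial prompt to final image."""
    initial_prompt: str
    turns: list[InteractionTurn] = field(default_factory=list)
```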

At the core of our methodology lies an advanced user model. This model features two critical components: 1) a utility model that accurately predicts user satisfaction levels for any given set of images, and 2) a choice model that determines which image set a user will select when presented with multiple options. We constructed this user model using pre-trained CLIP encoders, augmented with user-specific parameters. Training employed an expectation-maximization algorithm, enabling simultaneous learning of precise user preferences and the identification of latent “user types”—clusters of users exhibiting similar tastes, such as a predilection for images featuring animals, scenic landscapes, or abstract art.
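
The sketch below illustrates this structure, assuming precomputed CLIP image embeddings and a simple linear utility head per user type; the actual model and its EM updates are richer, so treat this as a schematic rather than the implementation.

```python
import numpy as np

def slate_utility(image_embs: np.ndarray, w: np.ndarray) -> float:
    """Utility of a slate of images: a linear score over pooled CLIP
    embeddings. `w` stands in for user-type-specific parameters."""
    return float(image_embs.mean(axis=0) @ w)

def choice_probs(slates: list[np.ndarray], w: np.ndarray) -> np.ndarray:
    """Softmax choice model: probability the user selects each slate."""
    u = np.array([slate_utility(s, w) for s in slates])
    u -= u.max()                      # numerical stability
    p = np.exp(u)
    return p / p.sum()

def e_step(sessions, type_weights, prior):
    """E-step of expectation-maximization: posterior responsibility of each
    latent user type for each session, given the observed choices.
    `sessions` holds (slates, chosen_index) pairs; the M-step (omitted)
    would re-fit each type's parameters under these responsibilities."""
    resp = np.zeros((len(sessions), len(type_weights)))
    for i, (slates, chosen) in enumerate(sessions):
        for k, w in enumerate(type_weights):
            resp[i, k] = prior[k] * choice_probs(slates, w)[chosen]
        resp[i] /= resp[i].sum()
    return resp, resp.mean(axis=0)    # responsibilities and updated prior
```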

The resulting trained user simulator effectively provides feedback, expresses preferences on generated images, and makes informed selections from presented image sets. This capability allows us to generate over 30,000 simulated interaction trajectories. Our approach transcends simple data augmentation; it establishes a controlled environment for exploring a broad spectrum of user behaviors, empowering us to train the PASTA agent for optimal user collaboration.
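
A single simulated trajectory can be rolled out roughly as follows. Here `propose_slates` is a hypothetical stand-in for the agent's candidate mechanism, and the softmax choice mirrors the user model sketched above.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trajectory(propose_slates, type_weights, prior, n_turns=5):
    """Roll out one synthetic interaction (sketch). A latent user type is
    sampled once per trajectory; at each turn the simulator chooses among
    the agent's proposed slates via a softmax over its utilities."""
    w = type_weights[rng.choice(len(type_weights), p=prior)]
    trajectory, state = [], None
    for _ in range(n_turns):
        slates = propose_slates(state)                 # list of (n_img, d) arrays
        utils = np.array([s.mean(axis=0) @ w for s in slates])
        p = np.exp(utils - utils.max())                # softmax choice
        p /= p.sum()
        chosen = int(rng.choice(len(slates), p=p))
        trajectory.append((state, chosen, float(utils[chosen])))
        state = (state, chosen)                        # placeholder state update
    return trajectory
```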


Our user simulator effectively identifies distinct user types from preference data. Each row showcases top-rated images for an emergent user profile, revealing clear preferences for categories such as "Animals" or "Food."

Leveraging this robust, data-driven foundation, we trained the PASTA agent to proficiently engage with diverse users, generating images that precisely match their individual preferences. The agent operates as a value-based reinforcement learning model, adeptly selecting the optimal set of prompt expansions—elaborations of the current prompt used to generate subsequent images—to present to the user at each interaction turn. Its primary objective: maximizing the user's cumulative satisfaction throughout the entire collaborative process.
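
In value-based terms, the selector scores each candidate slate with a learned Q-function and presents the highest-scoring one. The sketch below assumes a generic `q_net` callable over concatenated state and slate features; PASTA's actual architecture differs in its details.

```python
import numpy as np

def select_slate(q_net, state_emb: np.ndarray,
                 candidate_slates: list[np.ndarray]) -> int:
    """Pick the slate with the highest estimated cumulative future user
    satisfaction (Q-value). `q_net` maps a feature vector to a scalar."""
    scores = [q_net(np.concatenate([state_emb, slate.mean(axis=0)]))
              for slate in candidate_slates]
    return int(np.argmax(scores))
```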

Upon PASTA's deployment, a user initiates the process with an initial prompt. PASTA then employs a candidate generator, a powerful large multimodal model, to produce a varied set of potential prompt expansions. Subsequently, a candidate selector—our trained RL agent—identifies and selects the optimal slate of four prompt expansions. These are used to generate corresponding images for user review. The user selects the image that most closely aligns with their vision, providing crucial feedback that guides PASTA's subsequent suggestions. This iterative, collaborative dialogue enables the model to rapidly learn user preferences, effectively steering the creative process toward their ideal outcome with each exchange.
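
Put together, one session looks roughly like the loop below; every callable here is an assumed interface for exposition, not a released API.

```python
def pasta_session(initial_prompt, candidate_generator, candidate_selector,
                  t2i_model, get_user_choice, n_turns=5):
    """End-to-end interaction loop (sketch). Each turn: the LMM proposes
    prompt expansions, the RL selector picks a slate of four, the T2I
    model renders one image per expansion, and the user's choice is
    folded back into the context for the next turn."""
    context = [initial_prompt]
    final_image = None
    for _ in range(n_turns):
        candidates = candidate_generator(context)        # many expansions
        slate = candidate_selector(context, candidates)  # best four
        images = [t2i_model(p) for p in slate]           # one image each
        choice = get_user_choice(images)                 # user's pick
        context.append(slate[choice])
        final_image = images[choice]
    return context[-1], final_image
```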

[Silent looping video]

Starting with a simple prompt for "A white cat", PASTA engages the user in a visually grounded dialogue. The user's selections (highlighted in blue) help the agent quickly learn their preference for a more fantastical and colorful style.

Empirical Validation of PASTA

To rigorously evaluate our approach, we trained PASTA as a value-based reinforcement learning agent employing implicit Q-learning (IQL). Our critical objective was to assess the performance impact of distinct training data configurations. We developed three agent variants: 1) trained exclusively on real volunteer-rater data, 2) trained solely on simulated data, and 3) trained on a synergistic combination of both real and simulated datasets.
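
At the heart of IQL is expectile regression: the value function is fit toward an upper expectile of the Q-values, which lets the agent learn from offline interaction data without evaluating out-of-distribution actions. A minimal sketch of that loss follows (tau = 0.7 is a common choice, not necessarily the setting used for PASTA):

```python
import numpy as np

def expectile_loss(diff: np.ndarray, tau: float = 0.7) -> np.ndarray:
    """Asymmetric squared loss. With tau > 0.5, positive residuals
    (Q above V) are penalized more, pushing V toward an upper expectile
    of Q. `diff` is Q(s, a) - V(s) over a batch of transitions."""
    weight = np.where(diff > 0.0, tau, 1.0 - tau)
    return weight * diff ** 2

# Training alternates two regression targets (schematically):
#   L_V = expectile_loss(Q(s, a) - V(s)).mean()              # fit V
#   L_Q = ((r + gamma * V(s_next) - Q(s, a)) ** 2).mean()    # fit Q
```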

Subsequently, we conducted comprehensive human evaluations, directly comparing these agents against a baseline model (comprising base Gemini Flash and SDXL models without additional fine-tuning). Performance was measured across four key metrics: accuracy on the Pick-a-Pic dataset, Spearman's rank correlation, choice model accuracy, and cross-turn accuracy. Pick-a-Pic accuracy and Spearman's rank correlation assess the model's proficiency in predicting user preferences and rankings on established, large-scale, single-turn datasets. Choice model accuracy and cross-turn accuracy specifically measure the model's ability to predict a user's image selection at a given turn and confirm whether chosen images represent an improvement over the preceding turn, respectively.
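
The two interaction-specific metrics reduce to simple comparisons over model predictions, sketched below; SciPy's `spearmanr` covers the rank-correlation metric.

```python
import numpy as np
from scipy.stats import spearmanr

def choice_accuracy(predicted: np.ndarray, actual: np.ndarray) -> float:
    """Fraction of turns where the model predicts the image set the user
    actually selected."""
    return float((predicted == actual).mean())

def cross_turn_accuracy(chosen_utility: np.ndarray,
                        previous_utility: np.ndarray) -> float:
    """Fraction of turns where the model scores the newly chosen image
    above the previous turn's choice, i.e., agrees the session improved."""
    return float((chosen_utility > previous_utility).mean())

# Rank agreement with human scores on a single-turn benchmark:
# rho, _ = spearmanr(predicted_utilities, human_scores)
```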

The evaluation results revealed that training PASTA solely on synthetic data failed to surpass the baseline. The agent trained exclusively on authentic human data improved markedly over the simulation-only variant, yet it too did not outperform the baseline. Crucially, the agent trained on the combination of both real and simulated data exhibited superior performance, surpassing the baseline and both single-source variants. This outcome validates that our user simulation captures essential human interaction dynamics while providing the scale necessary for robust reinforcement learning training.


The graphs above plot the accuracy of the trained user model (y-axis) against the number of latent user types considered (x-axis). The top row displays the model's accuracy on the Pick-a-Pic test set (left) and its Spearman's rank correlation on the HPS test set (right). The bottom row shows the model's choice accuracy (left) and cross-turn preference accuracy (right), both evaluated on our human-rated test set.

Direct comparisons of final images generated by our best-performing PASTA agent against the baseline revealed a striking preference: 85% of human raters favored PASTA's outputs. This advantage is particularly pronounced for abstract prompts. Starting from a simple concept like "an image of love," PASTA successfully adapted to diverse user types, producing a wide range of results, from intimate portraits to complex, geometric abstract art.


Using the identical starting prompt, "An image of happiness", PASTA generates dramatically different results for two distinct user types (User Type A and User Type B), powerfully demonstrating its capacity to adapt to an individual's unique creative style. For instance, the result for Type A corresponds to a prompt such as “Abstract happy faces, Art Deco inspired geometric shapes, muted jewel-toned background.”

Future Directions and Open-Source Contributions

PASTA unequivocally demonstrates that the future of generative AI lies in interactivity, preference adaptation, and enhanced collaboration. The innovative methodologies we've developed, especially the sophisticated use of robust user simulators, possess broad applicability across numerous generative tasks, paving the way for AI systems that more effectively align with and adapt to human users.

To foster continued research and development within the AI community, we have open-sourced our sequential rater dataset and simulated user data. We eagerly anticipate the groundbreaking innovations the community will achieve with these valuable resources.

Acknowledgements

The contributing authors are: Ofir Nabati, Guy Tennenholtz, ChihWei Hsu, Moonkyung Ryu, Deepak Ramachandran, Yinlam Chow, Xiang Li, and Craig Boutilier. We extend special thanks to Mark Simborg for his invaluable assistance in crafting this blog post and to Kimberly Schwede for producing the compelling figures featured herein.
