REGEN: Personalized Recommendations with Natural Language

Author: DeepGeek

Large language models (LLMs) are changing how recommender systems interact with users. Current systems predict what users might like based on past actions. However, the true goal is creating systems that converse with users, understand their needs through feedback, and explain recommendations. Until now, no datasets supported these advanced capabilities.

We introduce REGEN, or Reviews Enhanced with GEnerative Narratives. This new benchmark dataset includes item recommendations, natural language critiques, and personalized narratives. We built REGEN by adding conversational elements to the popular Amazon Product Reviews dataset using Gemini 1.5 Flash. REGEN lets us test new recommender systems that use user feedback and generate natural language explanations. Our research shows LLMs trained on REGEN can create both recommendations and fitting narratives, matching current top-performing systems.

Building the REGEN Dataset

Existing datasets for training conversational recommenders often miss key aspects of real conversations. They may cover only item sequences, contain only short dialogues, or lack explicit user feedback. We chose the Amazon Product Reviews dataset for its extensive item vocabulary, much of which LLMs are unlikely to have seen during pretraining.

REGEN adds two main parts to the Amazon Reviews dataset:

Critiques

Critiques are vital for conversational recommendations. They help users state preferences and guide the system. In REGEN, critiques help steer the recommender from one item to a preferred one. For instance, a user might critique a "red pen" by asking for a "black one."

To ensure critiques are relevant, we generate them only for similar item pairs. We use the item categories from Amazon Reviews to check similarity. Gemini 1.5 Flash creates several critique options, and we randomly pick one for the dataset.

Narratives

Narratives offer detailed information about recommended items, improving the user experience. REGEN includes various narratives, such as:

  • Purchase reasons: Explaining why an item suits a user.
  • Product endorsements: Describing item benefits and features.
  • User summaries: Brief profiles of user preferences and past purchases.

These narratives differ in detail and length, creating a rich dataset for training conversational recommenders.

Experiments

To test REGEN well, we wanted to see if models could not only recommend items but also explain their choices, adapt to feedback, and generate user-specific language. We created a new task: conversational recommendation that generates both items and explanations together. The goal is simple but effective: given a user's history and a critique (like "I need more storage"), the model must recommend an item and create a matching narrative.

This task mirrors how users naturally interact with recommendation systems when they can express needs in their own words. It also avoids separate systems for recommending and generating text. Instead, we treat both as part of one complete goal.

We developed two baseline systems. The first is a hybrid system. A sequential recommender (FLARE) predicts the next item using past data and content information. This prediction then goes to a small LLM (Gemma 2B) that creates the narrative. This setup is common in real systems, where different parts handle different tasks.

The second system is LUMEN (LLM-based Unified Multi-task Model with Critiques, Recommendations, and Narratives). LUMEN handles all tasks within a single LLM. It learns to process critiques, generate recommendations, and create narratives cohesively. During generation, the model decides when to output an item ID and when to write text. We extended the model's vocabulary and embedding layers to cover both item IDs and text tokens, allowing the model to treat item recommendation as part of the generation process.
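The vocabulary-extension idea can be shown with a toy embedding table: item-ID tokens are appended to the text vocabulary and the embedding matrix grows accordingly. The sizes, token names, and random initialization below are illustrative, not LUMEN's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy text vocabulary and embedding table (real LLMs have tens of thousands of tokens).
text_vocab = {"<pad>": 0, "I": 1, "need": 2, "more": 3, "storage": 4}
embed_dim = 8
embeddings = rng.normal(size=(len(text_vocab), embed_dim))

# Append one token per catalog item so the model can emit item IDs directly.
item_tokens = ["item_B001", "item_B002", "item_B003"]
for token in item_tokens:
    text_vocab[token] = len(text_vocab)

# Grow the embedding table; the new item rows start fresh and are learned
# during fine-tuning alongside the existing text embeddings.
new_rows = rng.normal(size=(len(item_tokens), embed_dim))
embeddings = np.vstack([embeddings, new_rows])
```

With this shared token space, "recommend an item" and "write a sentence" become the same operation: predicting the next token.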

This two-part approach—hybrid versus fully generative—lets us compare modularity with integration. It provides a strong way to measure how well models handle this complete conversational task.

Results

Our tests show REGEN effectively challenges and distinguishes models in both recommendation and generation. In the Office section of the Amazon Product Reviews dataset, adding user critiques consistently improved recommendation results for both systems. For instance, the FLARE hybrid model's Recall@10, the fraction of cases where the desired item appears in the top 10 predictions, rose from 0.124 to 0.1402 when critiques were added. This shows how much natural language feedback helps.
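For reference, Recall@K is straightforward to compute: score each test case 1 if the target item appears among the model's top K predictions and 0 otherwise, then average. The predictions and targets below are toy data.

```python
def recall_at_k(ranked_items: list[str], target_item: str, k: int = 10) -> float:
    """1.0 if the desired item appears in the top-k predictions, else 0.0."""
    return 1.0 if target_item in ranked_items[:k] else 0.0

# Toy example: two test cases, one hit and one miss -> Recall@10 of 0.5.
predictions = [["a", "b", "c"], ["x", "y", "z"]]
targets = ["b", "q"]
score = sum(recall_at_k(p, t, k=10) for p, t in zip(predictions, targets)) / len(targets)
```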

LUMEN's performance was good, though slightly lower on standard recommendation metrics. This is expected, as generating the item and narrative at once is harder. However, LUMEN excels at keeping the item and its explanation consistent. Unlike systems where different parts may create unrelated explanations, LUMEN's narratives usually fit better with the user's history and critique.

For text generation, we used BLEU, ROUGE, and semantic similarity. The hybrid model generally scored higher on BLEU and ROUGE, especially for product details and reasons for purchase. This is likely because the LLM received the correct item as input. LUMEN had slightly less text overlap but maintained strong meaning connections, particularly for user summaries based on long-term behavior, not just the specific item. See the paper for full details.
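The semantic-similarity metric mentioned above typically reduces to cosine similarity between sentence embeddings. The sketch below uses hand-picked toy vectors in place of a real sentence encoder, purely to show the computation.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for embeddings of a generated narrative and a reference narrative.
generated = np.array([0.9, 0.1, 0.3])
reference = np.array([0.8, 0.2, 0.4])
score = cosine_similarity(generated, reference)
```

Unlike BLEU and ROUGE, which reward exact word overlap, this metric can credit a narrative that paraphrases the reference, which is why LUMEN's user summaries scored well on it despite lower overlap.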

[Figure: FLARE-3a-Benchmarks]

These results show interesting trends. Narratives focused on the user, like preference summaries, are easier for both models to create well. But when the narrative closely relates to the item, like a product endorsement, recommendation accuracy is key. If the model picks the wrong item, the entire explanation can be flawed. This is more noticeable with LUMEN, where generating both item and narrative together is a tougher test of how well everything works together.

We also tested performance with a much larger item selection in the Clothing category, which has over 370,000 unique items. This is 5 to 60 times more than other product types. No other system we know of tests on such a large Clothing dataset, a key feature of FLARE and REGEN. Even in this complex setting, the hybrid system performed well. We saw clear Recall@10 gains from 0.1264 to 0.1355 when critiques were used, proving REGEN's value in rewarding smart, user-guided decisions.

[Figure: FLARE-Recommendation]

Conclusion

REGEN provides a dataset with consistent user preferences, recommendations, and generated narratives. This allows us to study LLM abilities in conversational recommendation. We tested REGEN with LUMEN, an LLM for combined recommendation and narrative creation, showing its usefulness alongside sequential recommender models. We believe REGEN is a key resource for studying conversational recommender models, a vital step toward personalized multi-turn systems.

REGEN improves conversational recommendation by making language a core part of how recommenders understand and respond to user needs. This encourages research into multi-turn interactions, where systems can have longer talks to improve recommendations based on changing user feedback.

The dataset also promotes better models and training methods. It supports expanding model power, using advanced training methods, and adapting the approach to different areas beyond Amazon reviews, such as travel, education, and music.

Ultimately, REGEN sets a new path for recommender systems, focusing on understanding and interaction. This leads to more natural, helpful, and human-like recommendation experiences.
