Wenjie Wang (wenjiewang96@gmail.com), National University of Singapore, Singapore; Xinyu Lin (xylin1028@gmail.com), National University of Singapore, Singapore; Fuli Feng (fulifeng93@gmail.com), University of Science and Technology of China, China; Xiangnan He (xiangnanhe@gmail.com), University of Science and Technology of China, China; and Tat-Seng Chua (dcscts@nus.edu.sg), National University of Singapore, Singapore
Abstract.
Recommender systems typically retrieve items from an item corpus for personalized recommendations. However, such a retrieval-based recommender paradigm faces two limitations: 1) the human-generated items in the corpus might fail to satisfy the users’ diverse information needs, and 2) users usually adjust the recommendations via passive and inefficient feedback such as clicks. Nowadays, AI-Generated Content (AIGC) has revealed significant success across various domains, offering the potential to overcome these limitations: 1) generative AI can produce personalized items to satisfy users’ specific information needs, and 2) the newly emerged large language models with strong language understanding and generation abilities significantly reduce the efforts of users to precisely express information needs via natural language instructions. In this light, the boom of AIGC points the way towards the next-generation recommender paradigm with two new objectives: 1) generating personalized content through generative AI, and 2) integrating user instructions to guide content generation.
To this end, we propose a novel Generative Recommender paradigm named GeneRec, which adopts an AI generator to personalize content generation and leverages user instructions to acquire users’ information needs. Specifically, we pre-process users’ instructions and traditional feedback (e.g., clicks) via an instructor to output the generation guidance. Given the guidance, we instantiate the AI generator through an AI editor and an AI creator to repurpose existing items and create new items, respectively. Eventually, GeneRec can perform content retrieval, repurposing, and creation to satisfy users’ information needs. Besides, to ensure the trustworthiness of the generated items, we emphasize various fidelity checks such as authenticity and legality checks. Moreover, we provide a roadmap to envision future developments of GeneRec and we present several domain-specific applications of GeneRec with some potential research tasks. Lastly, we study the feasibility of implementing the AI editor and AI creator on micro-video generation, showing promising results.
Generative Recommender Paradigm, AI-generated Content, Next-generation Recommender Systems, Large Language Models, Generative Models
CCS Concepts: Information systems → Recommender systems

1. Introduction
Recommender systems fulfill users’ information needs by retrieving item content in a personalized manner. Traditional recommender systems primarily retrieve human-generated content such as expert-generated movies on Netflix and user-generated micro-videos on TikTok (Gomez-Uribe and Hunt, 2015). However, AI-Generated Content (AIGC) has emerged as a prevailing trend across various domains. The advent of powerful neural networks, exemplified by diffusion models (Rombach et al., 2022), has enabled generative AI to produce superhuman content. As shown in Figure 1, ChatGPT (Ouyang et al., 2022; Brown et al., 2020) demonstrates a remarkable ability to interact with users via natural language, and generative AI makes it possible to create multimodal content such as video, text, and audio to form new items. Driven by the boom of AIGC, recommender systems must move beyond human-generated content by envisioning a generative recommender paradigm that can automatically repurpose existing items or create new items. (Here, repurposing means editing existing items for a different purpose, i.e., satisfying another user’s personalized preference.)
Footnotes: ChatGPT: https://chat.openai.com/chat/; Stable Diffusion: https://stablediffusionweb.com/.

To envision the generative recommender paradigm, we first revisit the traditional retrieval-based recommender paradigm. The traditional paradigm ranks human-generated items in the item corpus, recommends the top-ranked items to users, and then collects user feedback (e.g., clicks) and context (e.g., interaction time) to optimize the future rankings for users (Davidson et al., 2010). Despite its success, such a traditional paradigm suffers from two limitations. 1) The content available in the item corpus might be insufficient to satisfy users’ personalized information needs. For instance, users may prefer a music video performed by a singer in a specific style (see Figure 1), while generating such a multimodal music video by humans is impossible or costly (Wu et al., 2023b). And 2) users are currently able to refine the recommendations mostly via passive feedback (e.g., clicks), which cannot express their information needs explicitly and efficiently (Liu et al., 2010; Liang and Willemsen, 2023).
AIGC offers the potential to overcome the inherent limitations of the retrieval-based recommender paradigm. In particular, 1) generative AI can generate personalized content to supplement existing items, including repurposing existing items and creating new items (Brooks et al., 2023; Singer et al., 2022). Additionally, 2) the newly emerged Large Language Models (LLMs) show strong language understanding and generation abilities (Wei et al., 2022b), which can effectively reduce users’ efforts to convey their diverse information needs via natural language instructions (Figure 1). Compared to traditional conversational agents, users will engage more readily with advanced ChatGPT-like models, supplementing traditional user feedback. In this light, the emerging AIGC has spurred new objectives for next-generation recommender systems: 1) the automatic generation of personalized content through generative AI, and 2) the integration of user instructions to guide content generation.
To this end, we propose a novel Generative Recommender paradigm called GeneRec, which integrates powerful generative AI for personalized content generation, including both repurposing and creation. Figure 2 illustrates how GeneRec adds a loop between an AI generator and users. Taking user instructions and feedback as inputs, the AI generator needs to understand users’ information needs and generate personalized content. The generated content can be either added to the item corpus for ranking or directly recommended to the users. The user instructions are not limited to textual conversations but can also include multimodal conversations, i.e., fusing images, videos, audio, and natural language to express information needs.
To instantiate the GeneRec paradigm, we formulate one instructor module to process the instructions, as well as two modules for item repurposing and creation. Specifically, the instructor module pre-processes user instructions and feedback to determine whether to initiate content generation, and also encodes the instructions and feedback to guide the content generation. Given the guidance, an AI editor repurposes an existing item to fulfill users’ specific preferences, i.e., personalized item editing, and an AI creator directly creates new items for personalized item creation. To ensure the trustworthiness and high quality of generated items, we emphasize the importance of various fidelity checks from the aspects of bias, privacy, safety, authenticity, and legality (Wu et al., 2023b; Guo et al., 2023; Wang et al., 2023). Moreover, in Section 2, we present a roadmap to explain the future development trends of GeneRec. Besides, we introduce several application scenarios of GeneRec across different domains in Section 3.3 and detail some potential research tasks under GeneRec in Section 3.4. Lastly, to explore the feasibility of applying the recent advances in AIGC to implement the AI editor and AI creator, we devise several tasks of micro-video generation and conduct experiments on a high-quality micro-video dataset. Empirical results show that existing AIGC methods can accomplish some repurposing and creation tasks, and it is promising to achieve the grand objectives of GeneRec in the future. We release our code and dataset at https://github.com/Linxyhaha/GeneRec.
To summarize, our contributions are fourfold.
- •
We highlight the essential role of AIGC in recommender systems and point out the extended objectives for next-generation recommender systems: moving towards a generative recommender paradigm, which can naturally interact with users via multimodal instructions, and flexibly retrieve, repurpose, and/or create item content to meet users’ diverse information needs in various recommendation domains.
- •
We propose to instantiate the generative recommender paradigm by formulating three key modules:the instructor for interacting with users and processing user instructions to guide content generation, the AI editor for personalized item editing, and the AI creator for personalized item creation.
- •
We spotlight the essential perspectives of fidelity checks and present a roadmap with several application scenarios and potential research tasks to envision the future directions of GeneRec.
- •
We investigate the feasibility of utilizing existing AIGC methods to implement the AI editor and AI creator in the micro-video recommendation domain.
2. Generative Recommender Paradigm
We propose two new objectives for the next-generation recommender systems: 1) automatically repurposing or creating items via generative AI, and 2) integrating rich user instructions.To achieve these objectives, we present GeneRec to complement the traditional retrieval-based recommender paradigm.
Overview. Figure 2 presents the overview of the proposed GeneRec paradigm with two loops. In the traditional retrieval-based user-system loop, human uploaders, including domain experts (e.g., musicians) and regular users (e.g., micro-video users), generate and upload items to the item corpus. These items are then ranked for recommendations according to the user preference, where the preference is learned from the context (e.g., interaction time) and user feedback over historical recommendations.
To complement this traditional paradigm, GeneRec adds another loop between the AI generator and users. Users can control the content generated by the AI generator through user instructions and feedback.Thereafter, the generated items can be directly exposed to the users without ranking if the users clearly express their expectations for AI-generated items or if they have rejected human-generated items via negative feedback (e.g., dislikes) many times. In addition, the AI-generated items can be ranked together with the human-generated items to output the recommendations.
User instructions. The strong conversational ability of ChatGPT-like LLMs can enrich the interaction modes between users and the AI generator. Users can flexibly control content generation via conversational instructions, where the instructions can be either textual or multimodal conversations. Through instructions, users can 1) freely enable the AI generator to generate their preferred content at any time, and 2) express their information needs more quickly and efficiently than through interaction-based feedback. In particular, users can not only express what they like but also indicate what they dislike via instructions. In the past, users were unwilling to make the effort to give explicit feedback such as instructions; in the future, LLM-powered advanced interfaces will be able to collect user instructions via audio or multimodal inputs, reducing the burden on users.
Content generation. Before content generation, the AI generator might need to pre-process the user instructions: for instance, some pre-trained language models might require designing prompts (Brown et al., 2020) or instruction tuning (Wei et al., 2022a); diffusion models may need to simplify queries or extract instruction embeddings as inputs for image synthesis (Rombach et al., 2022). In addition to user instructions, user feedback such as clicks can also guide the content generation, since user instructions might ignore some implicit user preferences and the AI generator can infer such preferences from users’ historical interactions. Learning implicit user preference from noisy and passive user feedback (e.g., clicks and dwell time) has long been a central focus in the field of recommendation. GeneRec can borrow prior experience from the traditional recommendation domain for user preference modeling.
Subsequently, the AI generator learns personalized information needs from user instructions and feedback, and then generates personalized item content accordingly.The generation includes both repurposing existing items and creating new items from scratch. For example, to repurpose a micro-video, we may convert it into multiple styles or split it into clips with different themes for distinct users; besides, the AI generator may select a topic and create a new micro-video based on user instructions and collected Web data (e.g., facts and knowledge).
Post-processing is essential to ensure the quality of generated content. The AI generator can judge whether the generated content will satisfy users’ information needs and further refine it, such as adding captions and subtitles for micro-videos. The relevance between the generated content and users’ information needs is also significant. Besides, it is also vital to ensure the trustworthiness of generated content through fidelity checks.
Fidelity checks. To ensure the generated content is accurate, fair, and safe, GeneRec should pass fidelity checks. Although different recommendation scenarios may require various fidelity checks, they generally should include but not be limited to the following perspectives.
- 1)
Bias and fairness: the AI generator might learn from biased data(Baeza-Yates, 2020), and thus should confirm that the generated content does not perpetuate stereotypes, promote hate speech and discrimination, cause unfairness to certain populations, or reinforce other harmful biases(Gao and Shah, 2020; Fu etal., 2020; Zehlike etal., 2022). For example, the news generation, especially regarding sensitive topics, should pay more attention to the checks of bias and fairness to avoid ethical and social issues.
- 2)
Privacy: the generated content cannot disseminate any sensitive or personal information that may violate someone’s privacy(Shin etal., 2018). GeneRec may generate an item based on a user’s personalized data, and then the generated item may be disseminated to other users, possibly leading to privacy leakage. Many recommendation scenarios such as news, tweets, and micro-videos are sensitive to such privacy leakage.
- 3)
Safety: the AI generator must not pose any risks of harm to users, including the risks of physical and psychological harm(Amodei etal., 2016). For instance, the generated micro-video for teenagers should not contain any unhealthy content. Besides, it is crucial to prevent GeneRec from various attacks such as shilling attack(Gunes etal., 2014; Chirita etal., 2005).
- 4)
Authenticity: to prevent misinformation spread, we need to verify that the facts, statistics, and claims made in the generated content are accurate based on reliable sources(Maras and Alexandrou, 2019). Users need to access factual information in certain recommendation scenarios, such as daily events and historical facts, making the authenticity of the generated content extremely important.
- 5)
Legal compliance: more importantly, AIGC must comply with all relevant laws and regulations(Buiten, 2019). For instance, if the generated micro-videos are about recommending healthy food, they must follow the regulations of healthcare.Besides, copyright regulation should also be considered to tackle the intricacies of authorship and ownership regarding the content edited or created by the generative AI(Lucchi, 2023). The government should work with the recommender platform to publish new regulations to clarify the ownership regarding AI-generated content.
- 6)
Identifiability: to assist with AIGC supervision, we suggest adding digital watermarks (Van Schyndel et al., 1994) to AI-generated content for distinguishing human-generated and AI-generated items. For example, in the fashion domain, designers’ patents are essential. Therefore, it is necessary to distinguish between AI-generated, human-generated, and AI-assisted human-generated content. We can develop AI technologies to automatically identify AI-generated items (Mitchell et al., 2023). Furthermore, we may consider deleting the AI-generated items after browsing by users to prevent them from being reused in inappropriate contexts.
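As a minimal illustration of how such checks could be composed into a gating step before an AI-generated item is released, the Python sketch below chains independent check functions and blocks the item if any check fails; the check names, item fields, and threshold are hypothetical and only meant to show the composition pattern, not GeneRec’s actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""

# Each fidelity check maps a generated item (here a plain dict) to a verdict.
FidelityCheck = Callable[[dict], CheckResult]

def run_fidelity_checks(item: dict, checks: List[FidelityCheck]) -> List[CheckResult]:
    """Run every check; the item is released only if all checks pass."""
    return [check(item) for check in checks]

# Illustrative checks; real systems would call trained classifiers or rule engines.
def bias_check(item: dict) -> CheckResult:
    return CheckResult("bias", passed=item.get("bias_score", 0.0) < 0.2)

def watermark_check(item: dict) -> CheckResult:
    return CheckResult("identifiability", passed=item.get("has_watermark", False))

if __name__ == "__main__":
    item = {"bias_score": 0.05, "has_watermark": True}
    results = run_fidelity_checks(item, [bias_check, watermark_check])
    print("release" if all(r.passed for r in results) else "block", results)
```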
Evaluation. To evaluate the generated content, we propose two kinds of evaluation setups: 1) item-side evaluation and 2) user-side evaluation. Item-side evaluation emphasizes measurements on the item itself, including item quality measurements (e.g., using the Fréchet Video Distance (FVD) metric (Voleti et al., 2022) to measure micro-video quality), the relevance between the generated content and users’ information needs, and various fidelity checks. User-side evaluation judges the quality of generated content based on users’ satisfaction. The satisfaction can be collected either by explicit feedback or implicit feedback, as in the traditional retrieval-based recommender paradigm. In detail, 1) explicit feedback includes users’ ratings and conversational feedback, e.g., “I like this item” in natural language. Moreover, we can design multiple facets to facilitate users’ evaluation, for instance, the style, length, and thumbnail for the evaluation of generated micro-videos. And 2) implicit feedback (e.g., clicks) can also be used for evaluation. The widely used metrics such as click-through rate, dwell time, and user retention rate are still applicable to measure users’ satisfaction with the generated content.
Roadmap. In Figure 3, we outline the future development roadmap of GeneRec by examining three separate aspects: user-system interaction, content generation, and recommender algorithms.
- 1)
User-system interaction. Users’ passive feedback (e.g., clicks) actually contains many subtle preferences, which might even be difficult for users themselves to articulate. Historically, recommender systems have heavily relied on user features and such passive feedback to model user preference. Beyond passive feedback, we believe that future recommender systems will be able to engage in multimodal conversations with users through voice, visual, and textual interactions, much like the AI assistant Jarvis (https://en.wikipedia.org/wiki/J.A.R.V.I.S.), to understand users’ needs. By combining conversation with historical passive feedback, recommender systems can build more comprehensive user profiles, and subsequently, they can retrieve or generate content as recommendations to fulfill users’ information needs.
Existing LLMs have already achieved proficient textual conversation and burgeoning capabilities in multimodal interaction (Driess et al., 2023), effectively meeting the requirements for barrier-free communication with users. In the coming years, a direction worth exploring is how to enable LLMs to quickly acquire and comprehend the item content in the recommendation domain: new item content is continually emerging and item popularity shifts quickly over time. By efficiently understanding the item content, LLMs can better communicate with users and understand their information needs. Additionally, another direction to investigate is how to combine active user conversation with passive feedback for more comprehensive user modeling.
- 2)
Content generation. Content generation has three evolution phases. 1) Expert-generated professional content, such as carefully crafted movies and music. This type of content typically has high quality yet also comes with high production costs, making it challenging to generate large quantities quickly. 2) User-generated content, driven by the rise of micro-videos, has turned many users on the platform into content creators, greatly enriching content diversity and reach. However, user-generated content often varies widely in quality, with many items being of lower quality. 3) In the era of AIGC, AI can assist users in content creation and may even have the potential to independently generate new content. We can leverage AI to assist content creators and users themselves in content generation tasks, such as generating music for micro-videos or refining news articles. This helps improve the quality of user-generated content. During the phase of AI-assisted generation, users may still need to assume a significant portion of responsibility, including providing instructions and content design. If generative AI continues to advance, it is promising that generative AI can learn user preference directly from a vast amount of high-quality content and continuously create new content. In this way, AI-generated content will complement user-generated or expert-generated content to meet users’ information needs.
- 3)
Algorithm. Discriminative algorithms have traditionally been the mainstream approach for item ranking in recommendation, including but not limited to methods such as matrix factorization (Rendle et al., 2009; Wu et al., 2021), graph neural network-based methods (He et al., 2020; Chang et al., 2021), and transformer-based sequential algorithms (Kang and McAuley, 2018; Xie et al., 2021; Zhou et al., 2020). Generative models such as variational autoencoders have been explored (Wang et al., 2022b; Liang et al., 2018), although they have received relatively limited attention. However, the recent emergence of LLMs has sparked a trend in using generative models for recommendation. Researchers are exploring the integration of various recommendation tasks, such as sequential recommendation, click-through rate prediction, recommendation explanation, and conversational recommendation, within a generative framework based on LLMs (Geng et al., 2022). Furthermore, they are also delving into how to better leverage world knowledge within LLMs for recommendation purposes (Lin et al., 2023) and how to enable LLMs to effectively capture collaborative filtering signals in user-item interactions (Zhang et al., 2023a). We believe that in the future, there will be a coexistence and mutual promotion of discriminative and generative models to accomplish various recommendation tasks. It is even possible that a unified LLM-based recommender model may emerge for various recommendation tasks, covering item retrieval, repurposing, and creation.
3. Demonstration
To instantiate the proposed GeneRec, we develop three modules: an instructor, an AI editor, and an AI creator. As described in Figure4, the instructor is responsible for initializing the content generation and pre-processing user instructions, while the AI editor and AI creator implement the AI generator for personalized item editing and creation, respectively.Lastly, we present several application scenarios and potential research tasks.
3.1. Instructor
The instructor aims to pre-process user instructions and feedback to initialize the AI generator and guide the content generation process.
Input: Users’ multimodal conversational instructions and the feedback over historically recommended items.
Processing: Given the inputs, the instructor may still need to engage in multi-turn interactions with the users to fully understand users’ information needs. Thereafter, the instructor analyzes the multimodal instructions and user feedback to determine whether there is a need to initiate the AI generator to meet users’ information needs. If the users have explicitly requested AIGC via instructions or rejected human-generated items many times, the instructor may enable the AI generator for content generation. The instructor then pre-processes users’ instructions and feedback as guidance signals, according to the input requirements of the AI generator. For instance, some pre-trained language models may need appropriately designed prompts, and diffusion models might require the extraction of guidance embeddings from users’ instructions and historically liked item features.
Output: 1) The decision on whether to initiate the AI generator, and 2) the guidance signals for content generation.
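A minimal sketch of the instructor’s decision logic described above is given below, assuming two simplified signals: an explicit AIGC request detected in the instruction and repeated negative feedback. The field names, keyword matching, and threshold are illustrative assumptions rather than the actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Guidance:
    activate_generator: bool
    prompt: Optional[str]                      # e.g., a prompt for an LLM-based generator
    preference_vector: Optional[List[float]]   # e.g., a user embedding learned from feedback

def instructor(instruction: str,
               recent_feedback: List[str],
               user_embedding: List[float],
               reject_threshold: int = 3) -> Guidance:
    """Decide whether to initiate the AI generator and build guidance signals."""
    # Explicit request for AI-generated content in the (possibly multimodal) instruction.
    explicit_request = any(w in instruction.lower() for w in ("generate", "create"))
    # Repeated rejection of human-generated items, e.g., consecutive dislikes.
    repeated_rejection = recent_feedback[-reject_threshold:].count("dislike") >= reject_threshold

    if not (explicit_request or repeated_rejection):
        return Guidance(False, None, None)

    # Pre-process the instruction and feedback into generator-specific guidance,
    # e.g., a text prompt plus a preference embedding.
    return Guidance(True, prompt=instruction, preference_vector=user_embedding)
```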
3.2. AI Editor and AI Creator
To implement the content generation, we formulate two modules: an AI editor and an AI creator.
3.2.1. AI editor for personalized item editing
As depicted in Figure4(b), the AI editor intends to refine and repurpose existing items (generated by either humans or AI) in the item corpus according to the instructions and historical feedback from either human uploaders or users. Human uploaders such as video uploaders on YouTube can interact with the AI editor to generate different versions of items, which can be fed into the general item corpus for rankings. Users can also generate personalized items for themselves with the assistance of the AI editor.
Input: 1) The guidance signals extracted from user instructions and feedback by the instructor, 2) an existing item in the corpus, and 3) the facts and knowledge from the Web data.
Processing: Given the input data, the AI editor leverages powerful neural networks to learn the users’ information needs and preferences, and then repurposes the input item accordingly. The “facts and knowledge” here can provide factual events, generation skills, common knowledge, laws, and regulations to help generate accurate, safe, and legal items. Note that the “facts and knowledge” can be continually updated with new events and regulations. For instance, based on daily events, the AI editor can help to revise and generate personalized news reports for users.
Output: An edited item that better fulfills users’ information preference than the original one.
3.2.2. AI creator for personalized item creation
Besides the AI editor, we also develop an AI creator to generate new items based on personalized instructions and feedback.
Input: 1) The guidance signals extracted from user instructions and feedback by the instructor, and 2) the facts and knowledge from the Web data.
Processing: Given the guidance signals, facts, and knowledge, the AI creator learns the users’ information needs and creates new items to fulfill them. As illustrated in Figure 1, the AI creator may learn the user’s preference for Jay Chou’s songs from the user’s historical feedback, determine the singer to be Zhen Tian based on user instructions, and learn the singing style of Zhen Tian from the Web to make a music video of “Worldly Tavern” performed by Zhen Tian.
Output: A new item that fulfills users’ information needs.
3.3. Domain Applications of GeneRec
High-quality AIGC via GeneRec is applicable in multiple domains. The related domains include but are not limited to the recommendations of textual content (e.g., news, articles, books, and financial reports), advertisements, videos (e.g., micro-videos, animation, and movies), images (e.g., art portraits and landscape), and audio (e.g., music).The users and human producers can leverage the AI generators to produce extensive items such as movies, music, advertisements, and micro-videos.Furthermore, it is also viable to harness AI generators for the design of various products, including fashion apparel and accessories, with the potential for a seamless transition into the manufacturing process.
In detail, we present several examples in different domains.
- •
News recommendation.Human news editors or users on the news platform can define rules by instructions for GeneRec to generate news. AI generators can also infer potential user preference by analyzing the feedback from editors’ or users’ historical interactions. Consequently, GeneRec can produce personalized news articles daily, taking into account newly occurring events.After news generation, fidelity checks are needed to detect misleading headlines, misinformation, satire, biased and deceptive news(Zhou and Zafarani, 2020).
- •
Fashion recommendation. GeneRec can assist fashion designers in creating various versions of fashion products or directly generate personalized items for users. The passive feedback of fashion designers and users, such as clicks, plays a crucial role by providing AI generators with valuable insights into their implicit clothing preferences. AI-generated digital products with high value (e.g., popular products liked by many people) can be sent to factories for customized production. Currently, some fashion brands are exploring this direction, such as Tribute Brand (https://www.tribute-brand.com/) and The Dematerialised (https://thedematerialised.com/). In the fashion domain, fidelity checks should primarily focus on the realism of product details, ensuring image clarity, color accuracy, and a high level of detail while considering style consistency and compatibility.
- •
Music recommendation.AI generators have demonstrated their ability to produce high-quality music(Yu etal., 2023). Under the GeneRec framework, to enhance the personalization of music generation, we should empower AI generators to learn users’ implicit preferences for artists, lyrics, melodies, and singing styles from users’ feedback.In terms of fidelity checks, music generation should address copyright and legality concerns, ensuring compliance with all relevant regulations.
- •
Micro-video recommendation.Micro-video generation within the GeneRec framework is significantly challenging, as it involves the generation of multiple modalities, including textual subtitles, cover images, videos, background music, or ambient sounds. Nevertheless, AI-assisted user generation is a promising starting point. Extensive tools for AI-assisted content creation are emerging, encompassing features such as automatic music generation, cover image refinement, and subtitle generation. The field of AI-generated micro-videos is likely to transition progressively from AI-human collaboration toward AI-driven creation.Regarding fidelity checks, micro-video generation necessitates careful consideration of many aspects, including but not limited to bias, privacy, authenticity, safety, and realism.
3.4. Potential Research Tasks
Under the novel paradigm of GeneRec, there are various promising research tasks that deserve future exploration. We introduce some potential tasks as follows.
- 1)
Instruction tuning for the LLM-based instructor. One crucial factor in implementing the instructor of GeneRec is to enhance its instruction-following capabilities, allowing the instructor to comprehend user intentions and take the right actions. These actions may involve generating responses, activating AI generators for content creation, or providing guidance for content generation. Existing work cannot yet effectively implement such an instructor. On the one hand, previous conversational recommender models lack strong instruction-following capabilities and struggle to understand user intentions. On the other hand, LLMs exhibit limited instruction-following abilities for item recommendation (Zhang et al., 2023b; Bao et al., 2023). This limitation arises from a lack of instruction tuning within the recommendation domain. Consequently, improving LLM-based instructors for GeneRec is an essential research task in the future.
- 2)
Controlling AI generator activation. Under GeneRec, an essential task is to control whether to activate the AI generators. Besides, after generating a new item, the system needs to decide between directly recommending the generated item to users and ranking it with existing items. These two decisions depend on the recommendation context and user behaviors, including users’ instructions and feedback. For instance, if a user explicitly expresses a need for a generated item through instructions or consistently provides negative feedback on existing items, the AI generator should be initiated for content generation. In the future, collecting diverse user behaviors and required actions to train the LLM-based instructor is vital for enhancing its decision-making capabilities.
- 3)
Personalized item editing. One of the core tasks within GeneRec is the implementation of AI editors to assist human uploaders or users in personalized content editing. As mentioned in Section 3.3, item editing should be explored separately in specific domains. While generative AI has shown the ability to generate high-quality content in some domains, it is usually guided by user instructions. However, user instructions cannot describe every nuanced aspect of an item, and crafting complex instructions is challenging and time-consuming. Furthermore, many human uploaders and users may have implicit preferences that they themselves are not aware of. As such, extracting user interests from noisy and implicit user feedback is helpful, which aligns with the central focus of recommendation models in recent years. Hence, integrating users’ implicit feedback into AI editors and effectively capturing users’ implicit preferences for personalized item editing holds great promise.
- 4)
Personalized item creation.Going beyond personalized item editing, personalized item creation represents a more challenging task. This endeavor can begin with relatively easier domains, such as news and music recommendations, where generative AI technologies are relatively mature.
- 5)
Domain-specific fidelity checks.As discussed in Section3.3, implementing GeneRec in various domains requires careful consideration of domain-specific fidelity checks. It is of paramount importance to design fidelity evaluators tailored to specific domains. For instance, in news generation, special attention should be given to aspects like news authenticity and bias.
4. Discussion
In this section, we present the comparison between the GeneRec paradigm and two related tasks: conversational recommendation and traditional AI generation. Moreover, we present a possible vision for future developments of GeneRec.
4.1. Comparison
We illustrate how GeneRec differs from conversational recommendation and traditional AI generation tasks.
4.1.1. Comparison with conversational recommendation
Conversational recommender systems rely on multi-turn natural language conversations with users to acquire user preference and provide recommendations(Sun and Zhang, 2018; Zhang etal., 2018; Qu etal., 2018). Although conversational recommender systems also consider multi-turn conversations, we highlight two critical differences with the GeneRec paradigm:1) Previous conversational recommender models lack instruction-following abilities, leading to poor user experience. The dramatic development of LLMs, especially ChatGPT, has brought a revolution to traditional conversational systems by significantly improving language understanding and generative abilities. Based on LLMs, we can build a more powerful instructor for GeneRec.And 2) GeneRec automatically repurposes or creates items through the AI generator to meet users’ specific information needs while conversational recommender systems are still retrieving human-generated items.
4.1.2. Comparison with traditional content generation
There exist various cross-modal AIGC tasks such as text, image, and audio generation conditional on users’ images(Yang etal., 2022a), single-turn queries(Rombach etal., 2022), and multi-turn conversations(Jiang etal., 2021).Nevertheless, there are essential differences between traditional AIGC tasks and GeneRec. 1) GeneRec can utilize the user modeling techniques from traditional recommender systems to capture implicit user preference.For example, GeneRec may leverage user feedback such as clicks and dwell time to dig out the implicit user preference that is not indicated through user instructions. Users may not be aware of their preference for a particular type of micro-videos, whereas their clicks and long dwell time on these micro-videos can indicate this preference.Learning implicit user preference from user features and behaviors, including long-term, short-term, noisy, and implicit user feedback, has been the core research in the recommendation domain in the past years.Many prior techniques in the recommendation domain can be used to exploit implicit user preference for GeneRec. In this case, GeneRec can consider both users’ explicit instructions and implicit preference to complement each other.And 2) despite the success of AIGC, retrieving human-generated content remains indispensable in many cases to satisfy users’ information needs.For example, if an emergency event occurs, journalists can send the latest video reports from the scene; besides, many human content producers have unique experiences or creativity that are difficult to replicate by generative AI. Therefore, compared to previous generative AI, GeneRec considers the cooperation between AI-generated and human-generated content to meet user information needs.
4.2. A vision for future GeneRec.
In the future, generative AI might be incorporated into various information platforms to supplement human-generated content.It might start with using AI generators to assist human content producers in repurposing existing items or creating new items. With the technical advancement of AI generators, they may gradually begin to independently generate content in some simple scenarios.Moreover, from a technical view, constrained by the development of existing AIGC technologies, we need to design different modules to achieve the generation tasks across multiple modalities. However, we believe that building a unified AI model for multimodal content generation is a feasible and worthwhile endeavor(Wu etal., 2023a; Li etal., 2023a).Under the GeneRec paradigm, we expect to see the increasing maturity of various generation tasks, along with growing integration between these tasks.Ultimately, with the inputs of user instructions and feedback, GeneRec will be able to perform retrieval, repurposing, and creation tasks by a unified model for personalized recommendations, leading to a totally new information seeking paradigm.
5. Feasibility Study
To investigate the feasibility of instantiating GeneRec, we employ AIGC methods to implement some simple demos of the AI editor and AI creator in the micro-video application scenario, due to the widespread popularity of micro-video content. The instructor could be a ChatGPT-like interactive conversational user interface (Gao et al., 2023), an option-based interface for the user to choose “like” or “dislike” (Zhang and Sundar, 2019), or a recommender model that collects the user interactions. The obtained user instructions or user feedback are then pre-processed to guide the AI editor or AI creator to repurpose or create the micro-video content. In our experiments, we mainly focus on implementing the AI editor and AI creator, so we simply use the recommender model MMGCN (Wei et al., 2019) as the instructor and obtain the user’s historical interactions or the user embeddings from the well-trained MMGCN as the guidance.
Dataset. We utilize a high-quality micro-video dataset with raw videos. It contains interactions between users and micro-videos of diverse genres (e.g., news and celebrities). Each micro-video is longer than eight seconds and has a thumbnail with an approximate resolution of . We follow the micro-video pre-processing in (Voleti et al., 2022), and each pre-processed micro-video has 400 frames with resolution.
5.1. AI Editor
We design three tasks for personalized micro-video editing and separately tailor different methods for the tasks.
5.1.1. Thumbnail selection and generation
Considering that personalized thumbnails might better attract users to click on micro-videos(Liu etal., 2015), we devise the tasks of personalized thumbnail selection and generation to present a more attractive and personalized micro-video thumbnail for users.
Task. We aim to generate personalized thumbnails based on user feedback without requiring user instructions. Formally, given a micro-video in the corpus and a user’s historical feedback, the AI editor should select a frame from the micro-video as a thumbnail or generate a thumbnail to match the user’s preference.
Implementation. To implement personalized thumbnail selection, we utilize the image encoder of a representative Contrastive Language Image Pre-training model (CLIP) (Radford et al., 2021) to conduct zero-shot selection. As illustrated in Figure 5(a), given a set of frames $\{f_1, \dots, f_N\}$ from a micro-video and the set of thumbnails $\{t_1, \dots, t_M\}$ of a user’s historically liked micro-videos, we calculate

$k = \arg\max_{i \in \{1, \dots, N\}} \; \mathbf{u}^{\top}\mathbf{f}_i, \qquad \mathbf{u} = \frac{1}{M}\sum_{j=1}^{M}\mathbf{t}_j, \qquad (1)$

where $\mathbf{u}$ is the average CLIP representation of the liked thumbnails, $\mathbf{f}_i$ is the CLIP representation of the $i$-th frame, and we select the $k$-th frame as the recommended thumbnail due to the highest dot product score between the user representation $\mathbf{u}$ and the frame representations. For performance comparison, we randomly select a frame from the micro-video (“Random Frame”) and utilize the original thumbnail (“Original”) as the two baselines.
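A minimal sketch of this zero-shot selection with a public CLIP checkpoint from Hugging Face transformers is shown below; it assumes the frames and liked thumbnails are provided as PIL images and is a simplified illustration of Eq. (1), not the exact experimental code.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Public CLIP checkpoint; the paper's exact checkpoint is not specified here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode_images(pil_images):
    """Encode a list of PIL images into CLIP image embeddings (one row per image)."""
    inputs = processor(images=pil_images, return_tensors="pt")
    return model.get_image_features(**inputs)

def select_thumbnail(frames, liked_thumbnails):
    """Eq. (1): return the index of the frame whose CLIP embedding has the highest
    dot product with the averaged embedding of the user's liked thumbnails."""
    frame_feats = encode_images(frames)                      # N x d frame representations
    user_vec = encode_images(liked_thumbnails).mean(dim=0)   # user representation u
    scores = frame_feats @ user_vec                          # N dot-product scores
    return int(scores.argmax())
```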
To achieve personalized thumbnail generation, we adopt a newly pre-trained Retrieval-augmented Diffusion Model (RDM)(Blattmann etal., 2022), in which an external item corpus can be plugged as conditions to guide image synthesis.To generate personalized thumbnails for a micro-video, we combine this micro-video and the user’s historically liked micro-videos as the input conditions of RDM (see Figure5(b)).
Evaluation. To evaluate the selected and generated thumbnails, we propose two metrics: 1) Cosine@K, which takes the average cosine similarity between the selected/generated thumbnails of the K recommended items and the user’s historically liked thumbnails; and 2) PS@K, which calculates the Prediction Score (PS) from a well-trained MMGCN (Wei et al., 2019) using the features of the selected/generated thumbnails. In detail, we train an MMGCN using the thumbnail features and user representations, and then average the prediction scores between the selected/generated thumbnails and the target user representation. Higher scores of Cosine@K and PS@K imply better results. For each user, we randomly choose K non-interacted items as recommendations and report the average results of ten experimental runs to ensure reliability.
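For concreteness, a small sketch of the Cosine@K computation is given below, assuming the CLIP features of the selected/generated thumbnails and of the historically liked thumbnails have already been extracted; PS@K additionally requires the trained MMGCN and is omitted.

```python
import numpy as np

def cosine_at_k(recommended_feats: np.ndarray, liked_feats: np.ndarray) -> float:
    """Average cosine similarity between the K selected/generated thumbnails
    (K x d) and the user's historically liked thumbnails (M x d)."""
    def normalize(x: np.ndarray) -> np.ndarray:
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    rec = normalize(recommended_feats)
    liked = normalize(liked_feats)
    # Mean cosine similarity over all (recommended, liked) pairs.
    return float((rec @ liked.T).mean())
```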
Table 1. Thumbnail Selection and Generation

Method        Cosine@5  Cosine@10  PS@5     PS@10
Random Frame  0.4796    0.4786     22.6735  23.1950
Original      0.4978    0.4970     22.2606  22.7445
CLIP          0.5142    0.5134     22.7682  23.2854
RDM           0.5369    0.5347     23.0145  23.3712
Results. The results of thumbnail selection and generation w.r.t. Cosine@K and PS@K are reported in Table 1, from which we have the following findings. 1) “Original” usually yields better Cosine@K scores than “Random Frame” since the thumbnails manually selected by the uploaders are more appealing to users than random frames. 2) CLIP outperforms “Random Frame” and “Original” by considering user feedback, validating the efficacy of personalized thumbnail selection. 3) RDM achieves the best results, justifying the superiority of using diffusion models to generate personalized thumbnails. The superior results are reasonable since RDM can generate thumbnails beyond existing frames, leading to better alignment with user preference.
Case study. For an intuitive understanding, we visualize several cases from CLIP and RDM. From the cases of CLIP in Figure 6, we observe that the second frame containing a rural landscape is selected for User 5 due to the user’s historical preference for magnificent natural scenery. In contrast, the frame containing a fancy sports car is chosen for User 100 because this user likes to browse stylish vehicles. This reveals the effectiveness of CLIP in selecting personalized thumbnails according to different user preferences. The case of RDM for thumbnail generation is presented in Figure 7. From the generated result, we can find that RDM tries to insert some elements into the generated thumbnail to better align with user preference while maintaining the key information of the original micro-video. For instance, RDM decorates the man with a white shirt and a red tie for User 2943 based on this user’s historical preference. Such an observation reveals the potential of using generative AI to repurpose existing items to meet personalized user preferences. Nevertheless, we can see that the generated thumbnail lacks fidelity to some extent, probably due to the domain gap between this micro-video dataset and the pre-training data of RDM.
5.1.2. Micro-video clipping
Given a long micro-video (e.g., one longer than 1 minute), the task of personalized micro-video clipping aims to recommend only the users’ preferred clip in order to save users’ time and improve users’ browsing experience(Luo etal., 2022).
Task. Given an existing micro-video and a user’s historical feedback, the AI editor needs to select a shorter clip comprising a set of consecutive frames from the original micro-video as the personalized recommendation.
Implementation. Similar to the thumbnail selection, we leverage CLIP (Radford et al., 2021) to obtain personalized micro-video clips, as shown in Figure 5(c). Given the user representation $\mathbf{u}$ obtained from historically liked thumbnails via Eq. (1), and a set of clips $\{c_1, \dots, c_L\}$ where each clip has several consecutive frames (the frame number is a hyper-parameter for micro-video clipping; we tune it and choose 8 due to its better scores w.r.t. Cosine@5), we compute

$k = \arg\max_{j \in \{1, \dots, L\}} \; \mathbf{u}^{\top}\mathbf{c}_j, \qquad (2)$

where $\mathbf{c}_j$ denotes the representation of the $j$-th clip, calculated by averaging its frame representations, and we select the $k$-th clip as the recommended one due to its highest similarity with the user representation $\mathbf{u}$. For performance comparison, we select a random clip (“Random”), the first clip (“1st Clip”), and the original unclipped micro-video (“Unclipped”) as baselines.
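Analogously to thumbnail selection, the clip selection in Eq. (2) can be sketched as follows, assuming pre-computed CLIP frame features and non-overlapping windows of eight consecutive frames; this is a simplification of the actual clipping procedure.

```python
import torch

def select_clip(frame_feats: torch.Tensor, user_vec: torch.Tensor, window: int = 8) -> int:
    """Eq. (2): return the start frame index of the clip whose averaged frame
    representation has the highest dot product with the user representation."""
    # Group frames into consecutive, non-overlapping windows of `window` frames.
    n = frame_feats.shape[0] // window * window
    clips = frame_feats[:n].reshape(-1, window, frame_feats.shape[-1]).mean(dim=1)  # L x d
    scores = clips @ user_vec                                                        # L scores
    return int(scores.argmax()) * window
```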
Table 2. Micro-video Clipping

Method     Cosine@5  Cosine@10  PS@5     PS@10
Random     0.4864    0.4851     22.1483  23.1401
1st Clip   0.4910    0.4899     22.1509  23.1657
Unclipped  0.4969    0.4976     22.1685  23.1700
CLIP       0.5052    0.5038     22.1863  23.1758
Results. Similar to thumbnail selection and generation, we use Cosine@K and PS@K for evaluation by replacing thumbnails with frames to calculate the cosine and prediction scores. From Table 2, we find that CLIP surpasses the other approaches because it utilizes user feedback to select personalized clips that match users’ specific preferences. Besides, the superior performance of “Unclipped” over “Random” and “1st Clip” makes sense because the random clip and the first clip may lose some of the users’ preferred content.
Case study. In Figure 8, CLIP chooses two clips from a raw micro-video for two users with different preferences. For User 83, the clip with the tiger is selected because of this user’s interest in wild animals; in contrast, the clip with a man facing the camera is chosen for User 36 due to this user’s preference for portraits.
5.1.3. Micro-video content editing
Users might wish to repurpose and refine the micro-video content according to personalized user preferences. As such, we implement the AI editor to edit the micro-video content to satisfy users’ information needs.
Task. Given an existing micro-video in the corpus, user instructions, and user feedback (we do not consider the “facts and knowledge” in Section 3.2.1 to simplify the implementation, leaving the knowledge-enhanced implementation to future work), the AI editor is asked to repurpose and edit the micro-video content to meet user preference.
Implementation. We consider two subtasks for micro-video content editing: 1) micro-video style transfer based on user instructions, where we simulate user instructions to select some pre-defined styles and utilize an interactive tool, VToonify (Yang et al., 2022b; https://github.com/williamyang1991/vtoonify/), to achieve the style transfer; and 2) micro-video content revision based on user feedback. We resort to a newly published Masked Conditional Video Diffusion model (MCVD) (Voleti et al., 2022) for micro-video revision. The revision process is presented in Figure 9(a). We first fine-tune MCVD on this micro-video dataset by reconstructing the users’ liked micro-videos conditional on user feedback. During inference, we forwardly corrupt the input micro-video by gradually adding noise, and then perform step-by-step denoising to generate an edited micro-video guided by user feedback. The user feedback for fine-tuning and inference can be obtained from: 1) user embeddings from a pre-trained recommender model such as MMGCN (denoted as “User_Emb”), and 2) averaged features of the user’s historically liked micro-videos (“User_Hist”). To evaluate the quality of the generated micro-videos, we follow (Voleti et al., 2022) and adopt the widely used FVD metric (Unterthiner et al., 2019), which measures the distribution gap between real micro-videos and generated micro-videos. Specifically, FVD builds on the principles underlying the Fréchet Inception Distance (FID (Heusel et al., 2017)) and additionally considers the temporal coherence of videos. The FVD metric is formally written as:
$\mathrm{FVD} = \left\| \mu_{R} - \mu_{G} \right\|^{2} + \mathrm{Tr}\!\left( \Sigma_{R} + \Sigma_{G} - 2\left( \Sigma_{R}\Sigma_{G} \right)^{1/2} \right), \qquad (3)$

where $\mu_{R}$ and $\mu_{G}$ are the means, and $\Sigma_{R}$ and $\Sigma_{G}$ are the covariance matrices of the feature distributions of real-world videos and generated videos, respectively. The feature representations are obtained by the pre-trained Inflated 3D ConvNet (I3D (Carreira and Zisserman, 2017)), which considers the temporal coherence of the visual content across a sequence of frames. A lower FVD score indicates higher quality.
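Given the I3D feature statistics of real and generated videos, Eq. (3) reduces to the standard Fréchet distance; a sketch of the computation is shown below (I3D feature extraction itself is omitted).

```python
import numpy as np
from scipy.linalg import sqrtm

def fvd(mu_r: np.ndarray, sigma_r: np.ndarray,
        mu_g: np.ndarray, sigma_g: np.ndarray) -> float:
    """Fréchet distance between real and generated I3D feature distributions (Eq. (3))."""
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):        # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```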
Table 3. Micro-video Content Revision

Method     Cosine@5  Cosine@10  PS@5     PS@10    FVD
Original   0.5010    0.5083     25.8900  24.6800  -
User_Hist  0.5166    0.5127     25.9012  24.7107  783.7505
User_Emb   0.5273    0.5200     26.0200  24.7900  646.7156
Results. We show some cases of micro-video style transfer in Figure 10. The same micro-video is transferred into different styles according to users’ instructions. From Figure 10, we can observe that the repurposed videos show high fidelity and quality, validating that the task of micro-video style transfer can be well accomplished by existing generative models. However, it might be costly for users to give detailed instructions for more complex tasks; thus, considering user feedback as guidance, especially implicit feedback as in micro-video content revision, is worth studying.
Quantitative results of micro-video content revision are summarized in Table 3, from which we have the following observations. 1) The edited items (“User_Hist” and “User_Emb”) cater more to user preference, validating the feasibility of generative models for personalized micro-video revision. 2) “User_Emb” outperforms “User_Hist”, possibly because the pre-trained user embeddings contain compressed critical information, whereas “User_Hist” directly fuses the raw features from the user’s historically liked micro-videos, which inevitably include some noise. And 3) the FVD score of “User_Emb” is significantly smaller than that of the unconditional revision, indicating the high quality of the edited micro-videos.
In addition, we analyze the cases of two users in Figure11, where the original micro-video depicts a male standing in front of a backdrop. Given the users’ instruction “revise this micro-video based on my historical preference”, we initiate MCVD to repurpose the micro-video to meet personalized user preference.Specifically, since User 2650 prefers male portraits in a black suit and a white shirt, MCVD converts the dressing style of the man to match this user’s preference. In contrast, for User 2450 who favors black shirts, MCVD alters both the suit and shirt to black accordingly.Despite the edited micro-videos having some deficiencies (e.g., corrupted background for User 2650), we can find that integrating user instructions and feedback for higher-quality micro-video content revision is a promising direction. Besides, we highlight that the content revision and generation should add necessary watermarks and seek permission from all relevant stakeholders.
5.2. AI Creator
In this subsection, we explore instantiating the AI creator for micro-video content creation.
5.2.1. Micro-video content creation
Beyond repurposing thumbnails, clips, and content of existing micro-videos, we formulate an AI creator to create new micro-videos from scratch.
Task. Given the user instructions and the user feedback over historical recommendations, the AI creator aims to create new micro-videos to meet personalized user preference.
Implementation. Before implementing content creation, we investigate the performance of image synthesis based on user instructions, where we construct users’ single-turn instructions and apply Stable Diffusion (Rombach et al., 2022) for image synthesis. From the generated images in Figure 12, we can find that Stable Diffusion is capable of generating high-quality images according to users’ single-turn instructions. Here, we explore the possibility of micro-video content creation via the video diffusion model MCVD. As presented in Figure 9(b), MCVD first samples a random noise from the standard normal distribution, and then generates a micro-video based on personalized user instructions and user feedback through the denoising process. We manually write some textual single-turn instructions, such as “a man with a beard is talking”, and encode them via CLIP (Radford et al., 2021). The encoded instruction representation is then combined with the user feedback for the conditional generation of MCVD. To represent user feedback, we still employ the pre-trained user embeddings (“User_Emb”) and the averaged features of the user’s historically liked micro-videos (“User_Hist”). Similar to micro-video content revision, we fine-tune MCVD on the micro-video dataset conditional on users’ encoded instructions and feedback, and then apply it for content creation.
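As a rough sketch of how such a conditioning signal could be assembled, the snippet below encodes a single-turn instruction with CLIP’s text encoder and fuses it with a user-feedback vector (a pre-trained user embedding or averaged liked-item features) by simple concatenation; the fusion choice and dimensions are assumptions for illustration, not the exact conditioning used by MCVD.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def build_condition(instruction: str, user_feedback: torch.Tensor) -> torch.Tensor:
    """Encode the instruction with CLIP's text encoder and fuse it with the
    user-feedback representation into one conditioning vector."""
    tokens = tokenizer([instruction], return_tensors="pt", padding=True)
    text_feat = clip.get_text_features(**tokens).squeeze(0)   # CLIP text embedding
    # Simple fusion by concatenation; a video diffusion model would then be
    # fine-tuned to accept this vector as its conditioning input.
    return torch.cat([text_feat, user_feedback], dim=-1)
```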
Results: From Table 4, we can find that the generated micro-videos achieve higher Cosine@K values (we cannot calculate PS@K via MMGCN because the newly created items do not have item ID embeddings, which are necessary for MMGCN prediction), as the generation is guided by personalized user feedback and instructions. In spite of the high Cosine@K values, the quality of the generated micro-videos from “User_Emb” and “User_Hist” is worse than that of the unconditional creation, as shown by the larger FVD scores. The case study in Figure 13(a) also validates the unsatisfactory generation quality. Specifically, MCVD generates a micro-video containing a woman with long brown hair for User 11, where the woman’s face is distorted, evident in the blurred rendering of the mouth and the generated hair extending to the chin region (see Figure 13(b)). The generated portrait for User 38 is also blurred and distorted, as shown in Figure 13(a) and (b): the collar and coat are conjoined, and the left arm remains absent in the generated content.
Table 4. Micro-video Content Creation

Method     Cosine@5  Cosine@10  FVD
Original   0.4883    0.4907     -
User_Hist  0.4902    0.4915     735.0413
User_Emb   0.5356    0.5376     743.1090
Admittedly, the current results of personalized micro-video creation are less satisfactory. This inferior performance might be attributed to the following: 1) the simple single-turn instructions fail to describe all the details in the micro-video; and 2) the current micro-video dataset lacks sufficient micro-videos, facts, and knowledge, thus limiting the creative ability of the AI generator. In this light, it is promising to enhance the AI generator in future work from three aspects: 1) enabling powerful ChatGPT-like tools (in this work, we do not explore the usage of ChatGPT because it was recently released and we had insufficient time for comprehensive exploration) to implement the instructor and acquire detailed user instructions; 2) pursuing more comprehensive user modeling and better encoding users’ implicit preferences into AI generators; and 3) using more advanced generative AI algorithms with strong prior world knowledge through pre-training on more data in different modalities, such as image (e.g., Midjourney), video (e.g., PikaLab), and audio (e.g., AudioCraft). We believe that the development of video generation might follow the trajectory of image generation (see Figure 12) and eventually achieve satisfactory generation results.
Moreover, we emphasize the importance of addressing copyright issues regarding AI-generated content. For GeneRec in the micro-video domain, we offer potential solutions from two perspectives:1) authorship belongs to the platform when the micro-video is fully generated by the AI creator;2) the uploaders hold the authorship of the generated micro-video if the generated content is edited from the uploader-created content.Nevertheless, the policy for tackling copyright violations needs formal regulations, especially from the government’s view.
Running example of GeneRec on micro-video recommendation. Running the whole GeneRec paradigm for micro-video recommendation involves the following steps (a minimal sketch of this loop is given after the list):
1) A user watches the recommended micro-videos, or directly searches for or expresses what they prefer to watch (i.e., their information needs) at the moment.
2) The instructor, through its conversational interaction interface, acquires the user's information needs and pre-processes the instructions.
3) Based on the pre-processed instructions and historical feedback, the instructor decides whether to use the AI editor or the AI creator to generate personalized micro-videos. For example, if the user requests creative AI-generated content, the AI creator is called to generate a new micro-video as the response.
4) Post-processing of the micro-video, such as checks of quality, relevance, and fidelity, is then conducted.
5) The instructor then decides either to directly recommend the AI-generated micro-video to the user or to rank it against all existing micro-videos in the item corpus.
6) The final recommendation is provided, and the process returns to step 1) or 2) based on the user's information needs.
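To illustrate how these steps could fit together, here is a minimal control-loop sketch. The class and method names (Instructor-style pre_process, fidelity_checks, recommend_directly, and the editor/creator/ranker interfaces) are hypothetical placeholders mirroring the modules described above, not an actual implementation of GeneRec.

from dataclasses import dataclass

@dataclass
class Guidance:
    instruction_emb: list   # pre-processed representation of the user's instructions
    feedback_emb: list      # representation of the user's historical feedback
    needs_creation: bool    # whether brand-new content is requested

class GeneRecLoop:
    """Hypothetical orchestration of the GeneRec modules for micro-video recommendation."""

    def __init__(self, instructor, ai_editor, ai_creator, corpus, ranker):
        self.instructor = instructor  # pre-processes instructions and feedback
        self.ai_editor = ai_editor    # repurposes existing micro-videos
        self.ai_creator = ai_creator  # creates new micro-videos
        self.corpus = corpus          # existing items in the corpus
        self.ranker = ranker          # traditional retrieval-based recommender

    def step(self, user_input, user_history):
        # Steps 2-3: acquire information needs and route to the editor or the creator.
        guidance: Guidance = self.instructor.pre_process(user_input, user_history)
        if guidance.needs_creation:
            item = self.ai_creator.generate(guidance)
        else:
            item = self.ai_editor.repurpose(self.corpus, guidance)

        # Step 4: post-processing, e.g., quality, relevance, and fidelity checks.
        if not self.instructor.fidelity_checks(item):
            return self.ranker.top_k(self.corpus, user_history, k=1)

        # Step 5: either recommend directly or rank against the existing corpus.
        if self.instructor.recommend_directly(item, guidance):
            return [item]
        return self.ranker.top_k(self.corpus + [item], user_history, k=1)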
6. Related Work
Recommender Systems. The traditional retrieval-based recommender paradigm constructs a loop between the recommender models and users (Ren et al., 2022; Zou et al., 2022): recommender models rank the items in a corpus and recommend the top-ranked items to the users (Zehlike et al., 2022; Hu et al., 2022), and the collected user feedback and context are then used to optimize the next round of recommendations. Following this paradigm, recommender systems have been widely investigated. Technically speaking, the most representative method is Collaborative Filtering (CF), which assumes that users with similar behaviors share similar preferences (He et al., 2017; Ebesu et al., 2018; Konstas et al., 2009). Early approaches directly utilize the collaborative behaviors of similar users (i.e., user-based CF) or items (i.e., item-based CF). Later on, MF (Rendle et al., 2009) decomposes the interaction matrix into separate user and item matrices (see the formulation after this paragraph), laying the foundation for subsequent neural CF methods (He et al., 2017) and graph-based CF methods (He et al., 2020). Beyond purely using user-item interactions, prior work incorporates context (Rendle et al., 2011) and the content features of users and items (Wei et al., 2019; Guy et al., 2010) for session-based, sequential, and multimedia recommendations (Hidasi et al., 2016; Kang and McAuley, 2018; Deng et al., 2021). In recent years, various new recommender frameworks have been proposed, such as conversational recommender systems (Sun and Zhang, 2018; Tu et al., 2022), which acquire user preferences via conversations, and user-controllable recommender systems (Wang et al., 2022a), which control the attributes of recommended items. Recently, some researchers have considered using large language models for recommendation (Bao et al., 2023; Hou et al., 2023; Dai et al., 2023; Li et al., 2023b; Liu et al., 2023), though these works mainly enhance recommender algorithms rather than producing content through generative AI or revolutionizing the user-system interaction interface.
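For reference, the MF model and the BPR objective cited above (Rendle et al., 2009) take the standard form
\[
\hat{y}_{ui} = \mathbf{p}_u^{\top} \mathbf{q}_i, \qquad
\mathcal{L}_{\mathrm{BPR}} = -\sum_{(u,i,j)\in\mathcal{D}} \ln \sigma\big(\hat{y}_{ui} - \hat{y}_{uj}\big) + \lambda \lVert \Theta \rVert^2,
\]
where $\mathbf{p}_u$ and $\mathbf{q}_i$ are the user and item embeddings factorized from the interaction matrix, $(u,i,j)$ pairs an observed item $i$ with a sampled unobserved item $j$ for user $u$, $\sigma$ is the sigmoid function, and $\Theta$ collects the model parameters.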
Past work only recommends human-generated items, which might fail to satisfy users' diverse information needs. In our work, we propose to empower the traditional recommender paradigm with the ability of content generation to meet users' information needs, presenting a novel generative paradigm for next-generation recommender systems. Satisfying personalized information needs has also been studied in the Human-Computer Interaction (HCI) domain (Arazy et al., 2015). While studies from HCI mainly focus on user modeling based on device- and context-specific information (Völkel et al., 2019), we emphasize harnessing ChatGPT-like models for advanced user-system interactions, supplementing traditional user feedback with better user engagement.
Generative AI. The development of content generation roughly falls into three stages. At the early stage, most platforms relied heavily on high-quality professionally-generated content, which, however, struggles to meet the demand for large-scale content production due to the high cost of professional experts. Later on, User-Generated Content (UGC) became prevalent thanks to the emergence of smartphones and well-packaged generation tools; despite its rapid growth, the quality of UGC is usually not guaranteed. Beyond human-generated content, recent years have witnessed a third stage of content generation driven by groundbreaking generative AI techniques, leading to various AIGC-driven applications.
As the core of AIGC, generative AI has been extensively investigated across a diverse range of applications, as depicted in Figure 14. In the text synthesis domain, numerous methods have been proposed for different tasks such as article writing and dialog systems; for example, the recently released ChatGPT demonstrates an impressive ability for conversational interactions. Image and video synthesis are two other prevailing AIGC tasks: recent advances in diffusion models show promising results in generating high-quality images in various aesthetic styles (Dhariwal and Nichol, 2021; Nichol and Dhariwal, 2021; Song et al., 2021) as well as highly coherent videos (Ho et al., 2022; Voleti et al., 2022; Hong et al., 2022). Besides, previous work has explored audio synthesis (Ren et al., 2020); for instance, Donahue et al. (2021) propose a generative model that aligns text with waveforms, enabling high-fidelity text-to-speech generation. Furthermore, extensive generative models have been crafted for other fields, such as 3D generation (Guo et al., 2020), gaming (de Pontes and Gomes, 2020), and chemistry (Polykovskiy et al., 2020).
The revolution of generative AI has catalyzed the production of high-quality content and brought a promising future for generative recommender systems. The advent of AIGC can complement existing human-generated content to better satisfy users’ information needs.Besides, the powerful generative language models can help acquire users’ information needs via multimodal conversations.
7. Conclusion and Future Work
In this work, we empowered recommender systems with the abilities of content generation and instruction guidance. In particular, we proposed the GeneRec paradigm, which can 1) acquire users' information needs via user instructions and feedback, and 2) perform item retrieval, repurposing, and creation to meet those needs. To instantiate GeneRec, we formulated three modules: an instructor for pre-processing user instructions and feedback, an AI editor for repurposing existing items, and an AI creator for creating new items. Besides, we highlighted the importance of multiple fidelity checks to ensure the trustworthiness of the generated content, and pointed out the roadmap, application scenarios, and future research tasks of GeneRec. Finally, we explored the feasibility of implementing GeneRec on micro-video generation, and the experiments reveal both weaknesses and promising results of existing AIGC methods on various tasks.
This work formulates a new generative paradigm for next-generation recommender systems, leaving many valuable research directions for future work. In particular: 1) it is critical to learn users' information needs from their multimodal instructions and feedback; in detail, GeneRec should learn to ask questions for efficient information acquisition, reduce the modality gap to understand users' multimodal instructions, and integrate user feedback to complement instructions for better generation guidance. 2) Developing more powerful generative modules for various tasks (e.g., thumbnail generation and micro-video creation) is essential; besides, some generation tasks might be implemented through a unified model, where multiple tasks may promote each other. And 3) we should devise new metrics, standards, and technologies to enrich the evaluation and fidelity checks of AIGC; introducing human-machine collaboration for GeneRec evaluation and the various fidelity checks is a promising direction.
References
- Amodei etal. (2016)Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016.Concrete Problems in AI Safety.arXiv:1606.06565.
- Arazy etal. (2015)Ofer Arazy, Oded Nov, and Nanda Kumar. 2015.Personalityzation: UI personalization, theoretical grounding in HCI and design research.THCI 7, 2 (2015), 43–69.
- Baeza-Yates (2020)Ricardo Baeza-Yates. 2020.Bias in Search and Recommender Systems. In RecSys. ACM, 2–2.
- Bao etal. (2023)Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023.Tallrec: An effective and efficient tuning framework to align large language model with recommendation. In RecSys. ACM.
- Blattmann etal. (2022)Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. 2022.Semi-Parametric Neural Image Synthesis. In NeurIPS. Curran Associates, Inc.
- Brooks etal. (2023)Tim Brooks, Aleksander Holynski, and AlexeiA Efros. 2023.Instructpix2pix: Learning to follow image editing instructions. In CVPR. 18392–18402.
- Brown etal. (2020)Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, JaredD Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, etal. 2020.Language Models Are Few-shot Learners. In NeurIPS. Curran Associates, Inc., 1877–1901.
- Buiten (2019)MiriamC Buiten. 2019.Towards Intelligent Regulation of Artificial Intelligence.Eur J Risk Regul 10, 1 (2019), 41–59.
- Carreira and Zisserman (2017)Joao Carreira and Andrew Zisserman. 2017.Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR. 6299–6308.
- Chang etal. (2021)Jianxin Chang, Chen Gao, Yu Zheng, Yiqun Hui, Yanan Niu, Yang Song, Depeng Jin, and Yong Li. 2021.Sequential recommendation with graph neural networks. In SIGIR. 378–387.
- Chirita etal. (2005)Paul-Alexandru Chirita, Wolfgang Nejdl, and Cristian Zamfir. 2005.Preventing Shilling Attacks in Online Recommender Systems. In WIDM. ACM, 67–74.
- Dai etal. (2023)Sunhao Dai, Ninglu Shao, Haiyuan Zhao, Weijie Yu, Zihua Si, Chen Xu, Zhongxiang Sun, Xiao Zhang, and Jun Xu. 2023.Uncovering ChatGPT’s Capabilities in Recommender Systems.arXiv:2305.02182.
- Davidson etal. (2010)James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor VanVleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, etal. 2010.The YouTube video recommendation system. In RecSys. 293–296.
- dePontes and Gomes (2020)RafaelGuerra de Pontes and HermanMartins Gomes. 2020.Evolutionary Procedural Content Generation for An Endless Platform Game. In SBGames. IEEE, 80–89.
- Deng etal. (2021)Yang Deng, Yaliang Li, Fei Sun, Bolin Ding, and Wai Lam. 2021.Unified Conversational Recommendation Policy Learning via Graph-based Reinforcement Learning. In SIGIR. ACM, 1431–1441.
- Dhariwal and Nichol (2021)Prafulla Dhariwal and Alexander Nichol. 2021.Diffusion Models Beat Gans on Image Synthesis. In NeurIPS. Curran Associates, Inc., 8780–8794.
- Donahue etal. (2021)Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, Erich Elsen, and Karen Simonyan. 2021.End-to-end Adversarial Text-to-speech. In ICLR.
- Driess etal. (2023)Danny Driess, Fei Xia, MehdiSM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, etal. 2023.Palm-e: An embodied multimodal language model.arXiv:2303.03378.
- Ebesu etal. (2018)Travis Ebesu, Bin Shen, and Yi Fang. 2018.Collaborative Memory Network for Recommendation Systems. In SIGIR. ACM, 515–524.
- Fu etal. (2020)Zuohui Fu, Yikun Xian, Ruoyuan Gao, Jieyu Zhao, Qiaoying Huang, Yingqiang Ge, Shuyuan Xu, Shijie Geng, Chirag Shah, Yongfeng Zhang, etal. 2020.Fairness-aware Explainable Recommendation Over Knowledge Graphs. In SIGIR. ACM, 69–78.
- Gao and Shah (2020)Ruoyuan Gao and Chirag Shah. 2020.Counteracting Bias and Increasing Fairness in Search and Recommender Systems. In RecSys. ACM, 745–747.
- Gao etal. (2023)Yunfan Gao, Tao Sheng, Youlin Xiang, Yun Xiong, Haofen Wang, and Jiawei Zhang. 2023.Chat-rec: Towards interactive and explainable llms-augmented recommender system.arXiv:2303.14524.
- Geng etal. (2022)Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022.Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). In RecSys. ACM, 299–315.
- Gomez-Uribe and Hunt (2015)CarlosA Gomez-Uribe and Neil Hunt. 2015.The netflix recommender system: Algorithms, business value, and innovation.TMIS 6, 4 (2015), 1–19.
- Gunes etal. (2014)Ihsan Gunes, Cihan Kaleli, Alper Bilge, and Huseyin Polat. 2014.Shilling Attacks Against Recommender Systems: A Comprehensive Survey.Artificial Intelligence Review 42, 4 (2014).
- Guo etal. (2020)Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. 2020.Action2motion: Conditioned Generation of 3d Human Motions. In MM. ACM, 2021–2029.
- Guo etal. (2023)Danhuai Guo, Huixuan Chen, Ruoling Wu, and Yangang Wang. 2023.AIGC challenges and opportunities related to public safety: a case study of ChatGPT.Journal of Safety Science and Resilience 4, 4 (2023), 329–339.
- Guy etal. (2010)Ido Guy, Naama Zwerdling, Inbal Ronen, David Carmel, and Erel Uziel. 2010.Social Media Recommendation Based on People and Tags. In SIGIR. ACM, 194–201.
- He etal. (2020)Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020.Lightgcn: Simplifying and Powering Graph Convolution Network for Recommendation. In SIGIR. ACM, 639–648.
- He etal. (2017)Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017.Neural Collaborative Filtering. In WWW. ACM, 173–182.
- Heusel etal. (2017)Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017.Gans trained by a two time-scale update rule converge to a local nash equilibrium.NeurIPS 30 (2017).
- Hidasi etal. (2016)Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016.Session-based Recommendations with Recurrent Neural Networks. In ICLR.
- Ho etal. (2022)Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and DavidJ Fleet. 2022.Video Diffusion Models.arXiv:2204.03458.
- Hong etal. (2022)Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. 2022.Cogvideo: Large-scale Pretraining for Text-to-video Generation via Transformers.arXiv:2205.15868.
- Hou etal. (2023)Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and WayneXin Zhao. 2023.Large language models are zero-shot rankers for recommender systems.arXiv:2305.08845.
- Hu etal. (2022)Chenhao Hu, Shuhua Huang, Yansen Zhang, and Yubao Liu. 2022.Learning to Infer User Implicit Preference in Conversational Recommendation. In SIGIR. ACM, 256–266.
- Jiang etal. (2021)Yuming Jiang, Ziqi Huang, Xingang Pan, ChenChange Loy, and Ziwei Liu. 2021.Talk-to-edit: Fine-grained Facial Editing via Dialog. In CVPR. IEEE, 13799–13808.
- Kang and McAuley (2018)Wang-Cheng Kang and Julian McAuley. 2018.Self-attentive sequential recommendation. In ICDM. IEEE, 197–206.
- Konstas etal. (2009)Ioannis Konstas, Vassilios Stathopoulos, and JoemonM Jose. 2009.On Social Networks and Collaborative Recommendation. In SIGIR. ACM, 195–202.
- Li etal. (2023a)Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, and Jianfeng Gao. 2023a.Multimodal foundation models: From specialists to general-purpose assistants.arXiv:2309.10020 1, 2 (2023), 2.
- Li etal. (2023b)Jinming Li, Wentao Zhang, Tian Wang, Guanglei Xiong, Alan Lu, and Gerard Medioni. 2023b.GPT4Rec: A Generative Framework for Personalized Recommendation and User Interests Interpretation.arXiv preprint arXiv:2304.03879 (2023).
- Liang etal. (2018)Dawen Liang, RahulG Krishnan, MatthewD Hoffman, and Tony Jebara. 2018.Variational Autoencoders for Collaborative Filtering. In WWW. ACM, 689–698.
- Liang and Willemsen (2023)Yu Liang and MartijnC Willemsen. 2023.Promoting music exploration through personalized nudging in a genre exploration recommender.International Journal of Human–Computer Interaction 39, 7 (2023), 1495–1518.
- Lin etal. (2023)Xinyu Lin, Wenjie Wang, Yongqi Li, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. 2023.A Multi-facet Paradigm to Bridge Large Language Model and Recommendation.arXiv:2310.06491 (2023).
- Liu etal. (2023)Junling Liu, Chao Liu, Renjie Lv, Kang Zhou, and Yan Zhang. 2023.Is ChatGPT a Good Recommender? A Preliminary Study.arXiv preprint arXiv:2304.10149 (2023).
- Liu etal. (2010)NathanN Liu, EvanW Xiang, Min Zhao, and Qiang Yang. 2010.Unifying explicit and implicit feedback for collaborative filtering. In CIKM. 1445–1448.
- Liu etal. (2015)Wu Liu, Tao Mei, Yongdong Zhang, Cherry Che, and Jiebo Luo. 2015.Multi-task Deep Visual-semantic Embedding for Video Thumbnail Selection. In CVPR. IEEE, 3707–3715.
- Lucchi (2023)Nicola Lucchi. 2023.ChatGPT: a case study on copyright challenges for generative artificial intelligence systems.European Journal of Risk Regulation (2023), 1–23.
- Luo etal. (2022)Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2022.CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval and Captioning.Neurocomputing 508 (2022), 293–304.
- Maras and Alexandrou (2019)Marie-Helen Maras and Alex Alexandrou. 2019.Determining Authenticity of Video Evidence in the Age of Artificial Intelligence and In the Wake of Deepfake Videos.The International Journal of Evidence & Proof 23, 3 (2019), 255–262.
- Mitchell etal. (2023)Eric Mitchell, Yoonho Lee, Alexander Khazatsky, ChristopherD Manning, and Chelsea Finn. 2023.DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature.arXiv:2301.11305.
- Nichol and Dhariwal (2021)AlexanderQuinn Nichol and Prafulla Dhariwal. 2021.Improved Denoising Diffusion Probabilistic Models. In ICML. PMLR, 8162–8171.
- Ouyang etal. (2022)Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, CarrollL Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, etal. 2022.Training Language Models to Follow Instructions with Human Feedback.arXiv:2203.02155.
- Polykovskiy etal. (2020)Daniil Polykovskiy, Alexander Zhebrak, Benjamin Sanchez-Lengeling, Sergey Golovanov, Oktai Tatanov, Stanislav Belyaev, Rauf Kurbanov, Aleksey Artamonov, Vladimir Aladinskiy, Mark Veselov, etal. 2020.Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models.Front. Pharmacol. 11 (2020), 565644.
- Qu etal. (2018)Chen Qu, Liu Yang, WBruce Croft, JohanneR Trippas, Yongfeng Zhang, and Minghui Qiu. 2018.Analyzing and Characterizing User Intent in Information-seeking Conversations. In SIGIR. ACM, 989–992.
- Radford etal. (2021)Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, etal. 2021.Learning Transferable Visual Models from Natural Language Supervision. In ICML. PMLR, 8748–8763.
- Ren etal. (2020)Yi Ren, Jinzheng He, Xu Tan, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2020.Popmag: Pop Music Accompaniment Generation. In MM. ACM, 1198–1206.
- Ren etal. (2022)Zhaochun Ren, Zhi Tian, Dongdong Li, Pengjie Ren, Liu Yang, Xin Xin, Huasheng Liang, Maarten de Rijke, and Zhumin Chen. 2022.Variational Reasoning about User Preferences for Conversational Recommendation. In SIGIR. ACM, 165–175.
- Rendle etal. (2009)Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009.BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI. AUAI Press, 452–461.
- Rendle etal. (2011)Steffen Rendle, Zeno Gantner, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2011.Fast Context-aware Recommendations with Factorization Machines. In SIGIR. ACM, 635–644.
- Rombach etal. (2022)Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022.High-resolution Image Synthesis with Latent Diffusion Models. In CVPR. IEEE, 10684–10695.
- Shin etal. (2018)Hyejin Shin, Sungwook Kim, Junbum Shin, and Xiaokui Xiao. 2018.Privacy Enhanced Matrix Factorization for Recommendation with Local Differential Privacy.TKDE 30, 9 (2018), 1770–1782.
- Singer etal. (2022)Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, etal. 2022.Make-a-video: Text-to-video generation without text-video data.arXiv:2209.14792.
- Song etal. (2021)Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021.Denoising Diffusion Implicit Models. In ICLR.
- Sun and Zhang (2018)Yueming Sun and Yi Zhang. 2018.Conversational Recommender System. In SIGIR. ACM, 235–244.
- Tu etal. (2022)Quan Tu, Shen Gao, Yanran Li, Jianwei Cui, Bin Wang, and Rui Yan. 2022.Conversational Recommendation via Hierarchical Information Modeling. In SIGIR. ACM, 2201–2205.
- Unterthiner etal. (2019)Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. 2019.FVD: A new metric for video generation.(2019).
- VanSchyndel etal. (1994)RonG VanSchyndel, AndrewZ Tirkel, and CharlesF Osborne. 1994.A Digital Watermark. In ICIP. IEEE, 86–90.
- Voleti etal. (2022)Vikram Voleti, Alexia Jolicoeur-Martineau, and Christopher Pal. 2022.MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation. In NeurIPS. Curran Associates, Inc.
- Völkel etal. (2019)SarahTheres Völkel, Ramona Schödel, Daniel Buschek, Clemens Stachl, Quay Au, Bernd Bischl, Markus Bühner, and Heinrich Hussmann. 2019.Opportunities and challenges of utilizing personality traits for personalization in HCI.Personalized Human-Computer Interaction 31 (2019).
- Wang etal. (2023)Tao Wang, Yushu Zhang, Shuren Qi, Ruoyu Zhao, Zhihua Xia, and Jian Weng. 2023.Security and privacy on generative data in aigc: A survey.arXiv:2309.09435 (2023).
- Wang etal. (2022a)Wenjie Wang, Fuli Feng, Liqiang Nie, and Tat-Seng Chua. 2022a.User-controllable Recommendation Against Filter Bubbles. In SIGIR. ACM, 1251–1261.
- Wang etal. (2022b)Wenjie Wang, Xinyu Lin, Fuli Feng, Xiangnan He, Min Lin, and Tat-Seng Chua. 2022b.Causal Representation Learning for Out-of-Distribution Recommendation. In WWW. ACM, 3562–3571.
- Wei etal. (2022a)Jason Wei, Maarten Bosma, VincentY Zhao, Kelvin Guu, AdamsWei Yu, Brian Lester, Nan Du, AndrewM Dai, and QuocV Le. 2022a.Finetuned Language Models Are Zero-shot Learners. In ICLR.
- Wei etal. (2022b)Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, etal. 2022b.Emergent abilities of large language models.arXiv:2206.07682 (2022).
- Wei etal. (2019)Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng Chua. 2019.MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video. In MM. ACM, 1437–1445.
- Wu etal. (2023b)Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, and Hong Lin. 2023b.Ai-generated content (aigc): A survey.arXiv:2304.06632 (2023).
- Wu etal. (2021)Jiancan Wu, Xiang Wang, Fuli Feng, Xiangnan He, Liang Chen, Jianxun Lian, and Xing Xie. 2021.Self-supervised graph learning for recommendation. In SIGIR. 726–735.
- Wu etal. (2023a)Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. 2023a.Next-gpt: Any-to-any multimodal llm.arXiv:2309.05519 (2023).
- Xie etal. (2021)Zhe Xie, Chengxuan Liu, Yichi Zhang, Hongtao Lu, Dong Wang, and Yue Ding. 2021.Adversarial and contrastive variational autoencoder for sequential recommendation. In WWW. ACM, 449–459.
- Yang etal. (2022a)Shuai Yang, Liming Jiang, Ziwei Liu, and ChenChange Loy. 2022a.Pastiche Master: Exemplar-based High-resolution Portrait Style Transfer. In CVPR. IEEE, 7693–7702.
- Yang etal. (2022b)Shuai Yang, Liming Jiang, Ziwei Liu, and ChenChange Loy. 2022b.VToonify: Controllable High-Resolution Portrait Video Style Transfer.TOG 41, 6 (2022), 1–15.
- Yu etal. (2023)Dingyao Yu, Kaitao Song, Peiling Lu, Tianyu He, Xu Tan, Wei Ye, Shikun Zhang, and Jiang Bian. 2023.MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models.arXiv:2310.11954.
- Zehlike etal. (2022)Meike Zehlike, Tom Sühr, Ricardo Baeza-Yates, Francesco Bonchi, Carlos Castillo, and Sara Hajian. 2022.Fair Top-k Ranking with multiple protected groups.IPM 59, 1 (2022), 102707.
- Zhang and Sundar (2019)Bo Zhang and SShyam Sundar. 2019.Proactive vs. reactive personalization: Can customization of privacy enhance user experience?Int. J. Hum. Comput. 128 (2019), 86–99.
- Zhang etal. (2023b)Junjie Zhang, Ruobing Xie, Yupeng Hou, WayneXin Zhao, Leyu Lin, and Ji-Rong Wen. 2023b.Recommendation as instruction following: A large language model empowered recommendation approach.arXiv:2305.07001.
- Zhang etal. (2018)Yongfeng Zhang, Xu Chen, Qingyao Ai, Liu Yang, and WBruce Croft. 2018.Towards Conversational Search and Recommendation: System Ask, User Respond. In CIKM. ACM, 177–186.
- Zhang etal. (2023a)Yang Zhang, Fuli Feng, Jizhi Zhang, Keqin Bao, Qifan Wang, and Xiangnan He. 2023a.CoLLM: Integrating Collaborative Embeddings into Large Language Models for Recommendation.arXiv:2310.19488.
- Zhou etal. (2020)Kun Zhou, Hui Wang, WayneXin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020.S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization. In CIKM. 1893–1902.
- Zhou and Zafarani (2020)Xinyi Zhou and Reza Zafarani. 2020.A survey of fake news: Fundamental theories, detection methods, and opportunities.CSUR 53, 5 (2020), 1–40.
- Zou etal. (2022)Jie Zou, Evangelos Kanoulas, Pengjie Ren, Zhaochun Ren, Aixin Sun, and Cheng Long. 2022.Improving Conversational Recommender Systems via Transformer-based Sequential Modelling. In SIGIR. ACM, 2319–2324.