Reinforcement Learning based Recommender Systems for Web Applications: scenarios of Radio and Game aggregators

Introduction

With the progression of time and the continuous evolution of digital entertainment services such as YouTube, Netflix, Spotify, and online gaming platforms, recommendation systems have become an essential daily tool for users. These systems save users time by analyzing various content, facilitating searches, and suggesting relevant content in a personalized manner. However, the same level of personalization is not consistently found across all media domains, particularly within the radio streaming and gaming sectors.

The requirements for scalability and constant improvement make reinforcement learning approaches particularly attractive for recommender system application scenarios.

The main goal of this project was to develop an innovative and efficient recommendation system, empowered by reinforcement learning, for two distinct applications: a game aggregator website and a radio aggregator website. This system should be capable of providing personalized game and radio station suggestions, adapting to users' preferences and interests, as well as to changes in consumption patterns over time.

High Level Design

Every reinforcement learning system has at least five elements: an agent, an environment, a state, a reward, and an action. The agent is the learner or decision-maker that interacts with the environment, continuously making decisions to achieve the best possible outcome. The environment represents the external context or system with which the agent interacts. All the data gathered from the “sensors” is captured and represented in the state, providing the agent with the necessary information to make informed decisions and select an appropriate action. Actions are decisions made by the agent that influence the environment. These decisions are then evaluated, and a reward is the feedback the agent receives after each action. The figure shows the system components diagram.

Reinforcement Learning Diagram


Our recommender system was developed to integrate with web platforms without requiring much preparation. The process begins when a user accesses the web application and requests a page. Upon receiving the request, the web server checks for any cached recommendations corresponding to the current state; if none are found, the recommendation system generates a new list of items, which is then retrieved and presented to the user through the web application. Each interaction a user has with the recommended items is captured through a POST request sent to the server, where it is stored in the database for future reference and analysis.

Unlike conventional RLRSs, which typically provide immediate recommendations in response to a user’s interaction, our system introduces a strategic delay in this process; the length of the delay varies with the specific application scenario. By accumulating more user interactions before generating recommendations, the system allows for a more nuanced and informed evaluation of user preferences and behaviors in different contexts, producing recommendations that are more tailored, thoughtful, and relevant to the user’s actual needs.

This architecture allows for seamless integration of the recommender system into the web application, providing users with personalized recommendations based on their previous interactions and preferences. Storing user interactions and recommendations in a database enables continuous improvement of the recommendation algorithm, resulting in more accurate and relevant recommendations over time.
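The request flow described above can be sketched as follows. This is a minimal in-memory illustration, not the production implementation: the names `recommendation_cache`, `generate_recommendations`, and `interaction_log` are illustrative stand-ins for the real cache, RL policy, and database.

```python
import time

# Illustrative in-memory stand-ins for the cache and the database.
recommendation_cache = {}   # state key -> list of recommended item ids
interaction_log = []        # stored user interactions

def generate_recommendations(state_key):
    # Placeholder for the RL agent's policy; returns a fixed list here.
    return ["item_a", "item_b", "item_c"]

def handle_page_request(state_key):
    """Serve cached recommendations if present, otherwise generate and cache."""
    if state_key not in recommendation_cache:
        recommendation_cache[state_key] = generate_recommendations(state_key)
    return recommendation_cache[state_key]

def handle_interaction_post(user_id, item_id, event):
    """Persist each user interaction for later (delayed) policy updates."""
    interaction_log.append({
        "user": user_id, "item": item_id,
        "event": event, "timestamp": time.time(),
    })
```

The delayed-update strategy corresponds to the agent training on batches drawn from `interaction_log` rather than reacting to each POST individually.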


Recommendation Process Sequence Diagram

Software Design

Modeling the Recommendation Task: The recommendation task is modeled as a Markov Decision Process (MDP), where a Recommender Agent (RA) interacts with the environment (users) over sequential time steps. The objective is to maximize cumulative rewards by making strategic item recommendations that resonate with user preferences and browsing history.

Components of the MDP:

  • State space S: A state st = {s1t, s2t, …, sNt} is defined as the interaction history of a user, i.e., the previous N items that the user browsed before time t. The items in st are sorted in chronological order.
  • Action space A: An action at = {a1t, a2t, …, aKt} is to recommend a list of items to a user at time t based on the current state st, where K is the number of items the RA recommends to the user each time.
  • Reward R: After the recommender agent takes an action at at state st, i.e., recommends a list of items to a user, the user browses these items and provides feedback: they can skip (not click), click, or play the items. The agent receives an immediate reward r(st, at) according to this feedback.
  • Transition probability P: The transition probability p(st+1|st,at) defines the probability of the state transition from st to st+1 when the RA takes action at. We assume that the MDP satisfies p(st+1|st,at,…,s1,a1) = p(st+1|st,at). If the user skips all the recommended items, then the next state st+1 = st; whereas if the user clicks or plays some of the items, the next state st+1 is updated.
  • Discount factor γ ∈ [0,1]: defines the discount applied when measuring the present value of future rewards. In particular, when γ = 0, the RA considers only the immediate reward; when γ = 1, all future rewards are counted fully toward the current action.
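A minimal sketch of these components, under illustrative assumptions (window size N, item ids, and the rule that only clicked items extend the state; the production state update may differ):

```python
from collections import deque

N = 5  # number of past items kept in the state (illustrative)

def update_state(state, recommended, clicked):
    """Return s_{t+1}: unchanged if the user skipped every item,
    otherwise extended with the clicked items (oldest entries drop out)."""
    if not clicked:
        return state                      # all items skipped: s_{t+1} = s_t
    new_state = deque(state, maxlen=N)
    for item in recommended:
        if item in clicked:
            new_state.append(item)
    return new_state

def discounted_return(rewards, gamma):
    """Present value of a reward sequence: sum over k of gamma^k * r_k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))
```

With gamma = 0 only the immediate reward counts, and with gamma = 1 every future reward is weighted fully, matching the two limiting cases above.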


Feature Extraction and Selection

In recommender systems, using only discrete item identifiers is insufficient for modeling item relationships and properties. An alternative is to construct item embeddings using auxiliary information such as attributes, text, or historical interactions. In this work, we adopted an approach that exploits user-item interaction history to generate embeddings.

The methodology treats items analogously to words, with a user’s sequential interaction history analogized to a textual sentence. Latent semantic connections between items can thus be extracted by applying word embedding techniques on the corpus of historical sessions.

Specifically, we have employed a pipeline which engineers features like popularity, encodes categorical metadata via one-hot encoding, and extracts semantic vectors from title text via TF-IDF and SVD. The concatenated output is a rich item embedding integrating metadata, text, and collaborative signals. Dimensionality reduction improves efficiency for downstream tasks. This representation learning approach induces embeddings which capture not only attributes but also latent relationships between items emerging from community usage patterns.
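A pure-Python sketch of this pipeline under simplifying assumptions: the toy catalog is illustrative, the TF-IDF is computed by hand rather than with a library implementation, and the SVD dimensionality-reduction step is omitted for brevity.

```python
import math
from collections import Counter

# Illustrative toy catalog: (item id, category, title, play count).
CATALOG = [
    ("g1", "puzzle", "block puzzle mania", 120),
    ("g2", "puzzle", "word puzzle daily", 80),
    ("g3", "racing", "turbo racing legends", 200),
]

def one_hot(value, vocabulary):
    """One-hot encode a categorical value against a fixed vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

def tfidf_vectors(titles):
    """Plain TF-IDF over titles; each title plays the role of a 'document'."""
    vocab = sorted({w for t in titles for w in t.split()})
    df = Counter(w for t in titles for w in set(t.split()))
    n = len(titles)
    vectors = []
    for t in titles:
        tf = Counter(t.split())
        vectors.append([tf[w] / len(t.split()) * math.log((1 + n) / (1 + df[w]))
                        for w in vocab])
    return vectors

def build_embeddings(catalog):
    """Concatenate popularity, one-hot category, and text features per item."""
    categories = sorted({c for _, c, _, _ in catalog})
    max_plays = max(p for _, _, _, p in catalog)
    text_vecs = tfidf_vectors([title for _, _, title, _ in catalog])
    embeddings = {}
    for (item, cat, _, plays), text_vec in zip(catalog, text_vecs):
        popularity = [plays / max_plays]            # engineered feature
        embeddings[item] = popularity + one_hot(cat, categories) + text_vec
    return embeddings
```

In practice library implementations (e.g. a TF-IDF vectorizer followed by truncated SVD) would replace the hand-rolled text features, but the concatenation structure is the same.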

In this way, we move beyond simple indexes to a learned semantic vector space capturing nuanced connections between items. Our usage-based technique provides a complementary learning signal to attributes or descriptions. The resulting embeddings can augment existing recommender approaches, providing useful features to better discern and recommend relevant items.

Cold Start

The cold start problem is a significant challenge in recommendation algorithms. Three distinct types of cold start scenarios can be identified:

  • New Users (a): The algorithm lacks historical data or preferences to make accurate recommendations for new users.
  • New Items (b): The system doesn’t have sufficient user interactions or feedback to accurately gauge the relevance of new items.
  • Both Users and Items are New (c): This represents the most complex situation. The algorithm faces a double challenge, since it must navigate the absence of historical data for both users and items, complicating the recommendation process.

To tackle the cold start problem, a multi-faceted approach is proposed. To address recommendations for new users, we recommend items based on their popularity among existing users, leveraging the current trends in the system. As for new items, we devise item embeddings that utilize item metadata, such as descriptions, categories, and sub-categories. By leveraging these metadata-based embeddings, we can establish connections between recently added items and previously existing items, facilitating more informed recommendations. This approach enables the system to make intelligent inferences and mitigate the impact of the cold start problem in recommendation algorithms.
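The popularity fallback for new users can be sketched as follows; the interaction tuples, item ids, and function names are illustrative, not the production API.

```python
from collections import Counter

def popular_items(interactions, k=3):
    """Rank items by interaction count among existing users; this ranking
    serves as the recommendation fallback for brand-new users."""
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(k)]

def recommend(user_id, known_users, personalized, interactions, k=3):
    """Fall back to popularity when the user has no interaction history."""
    if user_id not in known_users:
        return popular_items(interactions, k)
    return personalized(user_id, k)
```

For new items, the analogous fallback is to place them in the metadata-based embedding space described above, so they inherit neighbors from similar existing items.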

Reward Design

At the heart of reinforcement learning lies the concept of rewards, which provide a numerical measure of the quality or desirability of an agent’s action in a particular state. The agent’s objective is to learn a policy that maximizes the expected cumulative reward over interactions. In light of this, the design of the reward system has a strong influence on the agent’s behavior, directing it toward the desired results.


Objective

The reward function is meticulously designed to encompass various facets of user interactions and engagement, directly tied to the effectiveness of the recommendations. Two pivotal components constitute the reward function:

Components of Reward

Click Distribution (CD): This component gauges the distribution of user clicks across the array of recommended items. It plays a vital role in ensuring diversity and balance within the recommended content. A more evenly spread click distribution signifies that users find a broader array of recommendations appealing and engaging, enhancing the overall utility and attractiveness of the recommendation system. This aspect is particularly crucial for ensuring user satisfaction, thereby increasing the probability of return.
Dwell Time (DT): This is a measure of the time users spend interacting with a recommended item. A more extended dwell time is indicative of the user finding the recommended content engaging and captivating, which is a positive outcome from a recommendation quality perspective. In the context of web applications reliant on ad revenue, a lengthier dwell time enhances the likelihood of users viewing and engaging with more advertisements, thus boosting potential ad revenue.


Formulation of Reward

Given these crucial components, the reward function ( R(s,a) ) for a state ( s ) and an action ( a ) is strategically crafted as follows:

R(s,a) = w_1 * DT + w_2 * CD

In this formulation, ( w_1 ) and ( w_2 ) are weights assigned to calibrate the significance of each component within the reward function. The careful design of this reward function allows for a sound evaluation of the recommendation system’s performance, ensuring that it is not just theoretically sound but also practically aligned with real-world business objectives.
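One way to instantiate this reward is sketched below. The entropy-based measure of click distribution, the weight values, and the assumption that dwell time is normalized to [0, 1] upstream are illustrative choices, not the exact production formula.

```python
import math

def click_distribution_score(clicks_per_item):
    """Normalized entropy of clicks over the recommended list:
    close to 1.0 when clicks are evenly spread, 0.0 when one item takes them all."""
    total = sum(clicks_per_item)
    if total == 0 or len(clicks_per_item) < 2:
        return 0.0
    probs = [c / total for c in clicks_per_item if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(clicks_per_item))

def reward(dwell_time, clicks_per_item, w1=0.7, w2=0.3):
    """R(s, a) = w1 * DT + w2 * CD, assuming DT already normalized to [0, 1]."""
    return w1 * dwell_time + w2 * click_distribution_score(clicks_per_item)
```

Tuning w_1 versus w_2 trades off engagement depth (dwell time) against recommendation diversity (click spread).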

Given the recommendation system described earlier, it’s crucial to illustrate its application within different areas – one being a game aggregator and the other being a radio aggregator. Both areas, while distinct in their offerings, provide a unique perspective on the functionality and adaptability of our algorithm. Consequently, this will help us understand the system’s versatility and its ability to generate meaningful recommendations in disparate domains.

Results

The system was tested in two real-world domains: a game aggregator and a radio aggregator.

Self-improving Lists for Radios

Data Collected for Evaluation

As previously mentioned, our system underwent testing in a real-world, live application setting with genuine users. To initiate this, we primed our algorithm using data amassed over the preceding three months. Following this preparatory phase, we deployed the system on Colombia’s regional website. Over the ensuing two months, the algorithm operated in this live environment, autonomously gathering and processing user interaction data. Our evaluation, as detailed in this chapter, is centered on analyzing the outcomes and insights derived from this two-month operational period, during which our algorithm produced 60 lists.

Analysis of the Results of the Algorithm

To effectively assess the progression and performance of our algorithm, we will employ two pivotal metrics. The first, the cumulative reward, offers insight into the total benefits the algorithm accumulates over time. The second metric, regret, provides a comparative perspective, highlighting the gap between the outcomes achieved by our algorithm and the optimal possible outcomes.
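These two metrics can be computed straightforwardly; the reward sequences below are illustrative, and regret is taken here as the gap to a per-step optimal reward.

```python
def cumulative_reward(rewards):
    """Total reward the algorithm accumulates over time."""
    return sum(rewards)

def cumulative_regret(rewards, optimal_rewards):
    """Gap between the best achievable outcomes and what the policy obtained."""
    return sum(opt - r for r, opt in zip(rewards, optimal_rewards))
```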

                  First Month    Second Month
Average Reward         20              30
Average Regret         80              70


Coverage

In the specific case under study, the coverage is evaluated based on the selection of radio stations in Colombia. Over a period of two months, the system consistently exhibited a coverage value of approximately 25%. This was determined by the selection of 92 radios out of a possible 367, illustrating the system’s ability to access and utilize a diverse range of radio options.
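Coverage here is simply the fraction of the catalog that actually appeared in recommendations, as in this short sketch:

```python
def coverage(recommended_items, catalog_size):
    """Fraction of distinct catalog items that appeared in recommendations."""
    return len(set(recommended_items)) / catalog_size
```

With 92 distinct radios recommended out of a catalog of 367, this yields roughly 25%, matching the figure reported above.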

The concept of coverage in this study is crucial as it reflects the system’s capacity to recommend a broad spectrum of radios, ensuring that users are provided with varied options.

Popularity

Popularity trends serve as a cornerstone for understanding the robustness and adaptability of our recommendation algorithm in the realm of radio broadcasting.
From the data illustrated in the figure below, and further corroborated by the collected dataset, we discerned noteworthy patterns during two distinct intervals: June–July and July–August. The data shows a clear trend in which the algorithm favors a specific set of radios that appear most relevant to listeners over time.

Radios Popularity Graph

Online Metrics Evaluation

The figure below provides a visual representation of our system’s progression in the Colombian market from June 21 to August 21. A close look reveals a general upward trend, underscoring an improving alignment between user inclinations and our recommendations. However, it also highlights a noteworthy dip between weeks 29 and 31. Such occasional deviations can be attributed to the explorative nature of reinforcement learning algorithms, which sometimes take seemingly sub-optimal actions that are essential for broader, long-term learning.

Average Played Time Rate Per Week Chart
Delving deeper into the deviations observed between weeks 29 and 31, the figure below draws attention to an intriguing pattern: it highlights the correlation between radios that failed to play back and the dip in our performance metrics. Specifically, during weeks 30 and 31, an average of 2.9 of the 6 recommended radios did not play. While this might seem minor in isolation, compared to the 2.1 average of the other weeks the difference becomes significant. Given that our lists contain six items and, on average, nearly half of them fail to play back due to errors, this limitation becomes glaring and demands immediate attention.

Average Radios With Error Per Week Chart


Game Recommender System

In our evaluation process, we collected data from Reludi, our game aggregator web application.

Data Collected for Evaluation

The dataset contained chronological user interaction metrics, such as clicks and playtime, accumulated over a span of approximately one month. This sampled data was used for the initial training of our recommendation system, enabling the model to harness valuable insights and patterns from user interactions. We began by sorting the data in chronological order and then divided it into two parts: the first three weeks of collected data for training and the remaining week for testing. This organization was crucial to preserve the sequential integrity of user interactions, facilitating the extraction of meaningful temporal insights.

Following the offline evaluation, the pre-trained model was deployed into a live production environment. In this online setting, the model used the knowledge acquired from offline training to adapt to real-world user interactions and feedback. This phase was conducted over a period of six months, and the structured implementation of the model in a dynamic environment allowed for refined tuning and optimization, enhancing its performance and recommendation capabilities based on actual user engagement and behaviors.

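The chronological split can be sketched as follows; the event dictionaries and the three-week cutoff mirror the description above, while the field names are illustrative.

```python
from datetime import datetime, timedelta

def chronological_split(events, train_weeks=3):
    """Sort interactions by time and cut after `train_weeks` weeks:
    earlier events train the model, the remainder is held out for testing."""
    events = sorted(events, key=lambda e: e["timestamp"])
    cutoff = events[0]["timestamp"] + timedelta(weeks=train_weeks)
    train = [e for e in events if e["timestamp"] < cutoff]
    test = [e for e in events if e["timestamp"] >= cutoff]
    return train, test
```

Splitting by time rather than at random prevents the model from training on interactions that happen after the ones it is tested on.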
Analysis of the Results of the Algorithm

Since the basic structure of our system remains unchanged, it is essential to use the same evaluation measures as previously outlined. Reviewing the table below, it is clear that the offline setting yields a higher average reward and a lower average regret than the online setting. In the offline environment, the algorithm shows satisfactory performance, evident from the average rewards and regrets.

                  Offline Setting    Online Setting
Average Reward          40                 20
Average Regret          60                 80


Coverage

Over a period of six months, games were periodically introduced into the system by the content team, a dynamic that the system had to adapt to and manage efficiently. Despite the fluctuating game pool, the system displayed a remarkable adaptability, maintaining a coverage metric of approximately 53.15% by recommending 430 games out of an evolving pool that reached up to 809 games.

Popularity

The plot below was created using our production data: the y-axis shows the percentage of times a game was recommended relative to the total number of recommendations, and the x-axis shows the game name.

Games Popularity Graph


Online Metrics Evaluation

In the figure below, a comprehensive evolutionary trajectory of our recommender system is portrayed through users’ interactions. The visualization presents two essential metrics, the total duration in seconds and the total clicks, on the y-axis, while the x-axis presents the number of interactions. Within the illustration, individual lines represent unique users, whereas a more pronounced red line marks the average user in both plots, establishing a comparative benchmark across the evaluated metrics.

Games Online Evaluation Graph


Web interface

The figures below show the web interface for both of our scenarios: the game aggregator (Reludi) and the radio aggregator (MyTunner).

Reludi Recommender System Web Interface

Mytunner Recommender System Web Interface

Conclusion

In the course of this work, a recommendation system was successfully designed and implemented, drawing from the foundational DDPG algorithm and its methodology. This system integrates seamlessly into two distinct applications: a game aggregation platform and a radio station aggregation platform. It tailors game and radio station suggestions by adapting to user preferences and evolving consumption patterns.

The user interface of the recommender system is not only visually appealing and user-friendly but also harmonizes with the existing design of both platforms. This fusion mitigates user disorientation, fostering a unified interaction experience. As a result, the system operates effectively, significantly enhancing user engagement, a pivotal objective of our research.

Nevertheless, certain challenges emerged during implementation. Handling a substantial volume of pre-test data could increase processing time as item databases expand. Despite this challenge, our primary goal of demonstrating the practicality and usefulness of an RL-based recommendation system has been satisfactorily achieved. Evaluation outcomes revealed a marked improvement in the system’s proficiency, adeptly adapting to fluctuating user consumption patterns, and the recommendations provided were diverse and relevant.

In summary, this project highlights the potential of the developed recommendation system in delivering precise, personalized user recommendations. Furthermore, the system’s recommendations demonstrate the capability for temporal evolution, synchronizing with user preferences to enrich the engagement experience comprehensively.