

Unlocking Efficiency and Revenue: Deep Q-learning with Batch Constraints

Introduction

In revenue management, the search for optimal pricing strategies is perpetual. Traditional supervised learning models have limitations, especially when historical data is sparse or inconsistent. Deep reinforcement learning (RL) presents a compelling alternative. Specifically, a Deep Q-Network (DQN) augmented with batch constraints, an approach known as batch-constrained deep Q-learning (BCQ), can offer robust solutions. This post describes how BCQ can be used to generate bid prices in dynamic environments, with the aim of stable training, efficiency, and real-time decision-making in the travel industry.

The Need for Reinforcement Learning

Supervised learning methods typically rely on historical data with clearly defined labels. However, in pricing strategies, defining these labels is often tricky. Conventional forecasting techniques, which require high-quality, regularly spaced time series data, can falter when data is missing or inconsistent. Reinforcement learning, particularly DQN, addresses these limitations by letting models learn policies from interactions with the environment, whether live or logged, rather than from hand-defined labels, even when historical data is imperfect.

The Power of Deep Q-Learning

Q-learning is a type of RL that aims to learn the optimal policy by estimating the action-value function, which maps a state and action to the expected cumulative discounted reward. DQNs approximate this function with a neural network and add experience replay and target networks, which break correlations within the training data and stabilize the training process.
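As a rough illustration of the two stabilizers mentioned above, the sketch below samples a minibatch uniformly from a replay buffer and computes Bellman targets from the outputs of a separate target network. The function names, array shapes, toy numbers, and discount factor are assumptions made for illustration, not FLYR's implementation.

```python
import numpy as np

def sample_minibatch(buffer_size, batch_size, rng):
    """Uniformly sample replay-buffer indices so consecutive steps are not trained together."""
    return rng.choice(buffer_size, size=batch_size, replace=False)

def dqn_targets(rewards, next_q_target, dones, gamma):
    """Bellman targets r + gamma * max_a' Q_target(s', a'), with no bootstrap after terminal steps."""
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)

# Toy replay buffer of four transitions with two discrete actions (made-up numbers).
rewards = np.array([1.0, 0.0, 2.0, 0.5])
next_q_target = np.array([[0.5, 1.5], [2.0, 1.0], [3.0, 4.0], [0.0, 1.0]])  # target-network outputs
dones = np.array([0.0, 0.0, 1.0, 0.0])  # 1.0 marks a terminal transition

rng = np.random.default_rng(0)
idx = sample_minibatch(len(rewards), batch_size=2, rng=rng)
batch_targets = dqn_targets(rewards[idx], next_q_target[idx], dones[idx], gamma=0.9)
all_targets = dqn_targets(rewards, next_q_target, dones, gamma=0.9)
```

In a full training loop the online network would be regressed toward `batch_targets`, and the target network's weights would be refreshed only periodically.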

Addressing Extrapolation Error with Batch Constraints

One major issue in Q-learning on a fixed dataset is extrapolation error, caused by the mismatch between the training dataset and the current policy’s state-action visitation. This error becomes particularly problematic when historical data does not cover all possible state-action pairs that a new policy might encounter. BCQ mitigates this by restricting the available actions to those likely to have occurred historically, thereby stabilizing learning and producing more accurate Q estimates.

Figure: DQN graph showing historic data vs. learned trend

Scientific Approach: BCQ for Bid Pricing

Figure: BCQ for bid pricing diagram

Technique Overview

BCQ optimizes the next sellable seat’s bid price by ensuring the policy is grounded in historical pricing decisions. This approach uses two fully connected neural networks for learning Q-values and discretizes the bid-price action space into fixed buckets to stabilize training. Training the sub-networks jointly with a single optimizer further improves the model’s stability and performance.
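The discretization step can be pictured with a short sketch. Evenly spaced buckets, the price range, the bucket count, and the use of bucket midpoints as representative prices are all assumptions here; the post does not specify how the buckets are defined.

```python
import numpy as np

def make_bucket_edges(min_price, max_price, n_buckets):
    """Evenly spaced edges that split [min_price, max_price] into n_buckets intervals."""
    return np.linspace(min_price, max_price, n_buckets + 1)

def price_to_action(prices, edges):
    """Map continuous bid prices to bucket indices 0..n_buckets-1.

    Only the interior edges are used, so prices outside the range fall
    into the first or last bucket.
    """
    return np.digitize(prices, edges[1:-1])

def action_to_price(actions, edges):
    """Represent each bucket by its midpoint price."""
    midpoints = (edges[:-1] + edges[1:]) / 2.0
    return midpoints[actions]

edges = make_bucket_edges(0.0, 500.0, n_buckets=10)  # hypothetical range and bucket count
actions = price_to_action(np.array([0.0, 120.0, 500.0, 640.0]), edges)
prices = action_to_price(actions, edges)
```

With a discrete action space like this, the Q-network only needs one output per bucket.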

Learning the Bid-Price Curve

To generate the full bid price vector, the model fits two Generalized Logistic Functions using mean-squared-error loss. This approach models the inflection points that occur both ahead of and behind the next sellable seat, ensuring accurate representation of the bid price curve. During inference, these curves are adjusted to align with the optimal next-sellable bid price learned by BCQ.
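A minimal sketch of the curve-fitting idea follows, for a single curve; the post fits two, one on each side of the next sellable seat. It is built on assumptions: the parameterization below is one common form of the generalized logistic (Richards) curve, the grid search is a stand-in for whatever optimizer is actually used, and the vertical shift in `align_to_anchor` is only one plausible way to align a curve with the BCQ price. All names and numbers are hypothetical.

```python
import numpy as np

def generalized_logistic(x, lower, upper, growth, midpoint, nu):
    """One common parameterization of the generalized logistic (Richards) curve."""
    return lower + (upper - lower) / (1.0 + np.exp(-growth * (x - midpoint))) ** (1.0 / nu)

def mse(pred, target):
    """Mean-squared-error loss between a fitted curve and observed bid prices."""
    return float(np.mean((pred - target) ** 2))

def fit_growth(x, target, lower, upper, midpoint, nu, candidates):
    """Pick the growth rate with the lowest MSE from a candidate grid (assumed stand-in for a real optimizer)."""
    losses = [mse(generalized_logistic(x, lower, upper, g, midpoint, nu), target) for g in candidates]
    return candidates[int(np.argmin(losses))]

def align_to_anchor(curve, anchor_index, anchor_price):
    """Shift the curve vertically so it passes through the anchor price (assumed adjustment)."""
    return curve + (anchor_price - curve[anchor_index])

seats = np.arange(21, dtype=float)
# Synthetic "historical" curve that decreases from near 300 toward 50 as the seat index grows.
observed = generalized_logistic(seats, 50.0, 300.0, -0.5, 10.0, 1.0)
best_growth = fit_growth(seats, observed, 50.0, 300.0, 10.0, 1.0, candidates=[-1.0, -0.5, -0.25])
fitted = generalized_logistic(seats, 50.0, 300.0, best_growth, 10.0, 1.0)
# Align the fitted curve with a hypothetical BCQ bid price of 200 at seat index 10.
aligned = align_to_anchor(fitted, anchor_index=10, anchor_price=200.0)
```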

Model Architecture

The model has three components. The action-value network estimates a Q-value for each action and produces the final output. The target network stabilizes learning by providing consistent targets. The action probability network (P-network) estimates the likelihood of each action having been taken historically. The optimal action is the one with the highest estimated Q-value among actions to which the P-network assigns sufficient probability, which reduces the risk of unrealistic bid prices.
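The constrained action choice described above can be sketched as follows. The relative-probability rule (each action's probability divided by that of the most likely action) and the threshold value are assumptions borrowed from the usual discrete BCQ formulation; the post only states that actions need "sufficient probability".

```python
import numpy as np

def constrained_greedy_action(q_values, action_probs, threshold=0.3):
    """Return argmax Q over actions whose probability, relative to the most likely action, exceeds threshold."""
    relative = action_probs / action_probs.max(axis=1, keepdims=True)
    allowed = relative > threshold
    # Disallowed actions get -inf so they can never win the argmax.
    masked_q = np.where(allowed, q_values, -np.inf)
    return masked_q.argmax(axis=1)

# One state, three price buckets (made-up numbers).
q_values = np.array([[1.0, 5.0, 3.0]])        # action-value network output
action_probs = np.array([[0.6, 0.02, 0.38]])  # P-network output

unconstrained = q_values.argmax(axis=1)
constrained = constrained_greedy_action(q_values, action_probs)
```

Here the unconstrained choice is bucket 1, which was rarely taken historically, while the constrained choice falls back to bucket 2, the best action that has enough historical support. Because the most likely action always has a relative probability of 1, at least one action stays available for any threshold below 1.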

Results and Quality Review

Batch constraining helps the loss value for action probabilities reach stability by the end of the training cycle, with no signs of overfitting. The correlation between Q-values and actual discounted rewards converges to approximately 0.9, indicating robust performance.
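A check of this kind can be sketched by computing discounted returns for an episode and correlating them with the predicted Q-values. The reward sequence, predicted values, and discount factor below are made-up numbers, and Pearson correlation is an assumed choice of metric.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):  # walk backwards from the final step
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = np.array([1.0, 0.0, 2.0])      # hypothetical per-step rewards
predicted_q = np.array([1.4, 1.1, 1.9])  # hypothetical Q-values for the actions taken

returns = discounted_returns(rewards, gamma=0.5)
correlation = np.corrcoef(predicted_q, returns)[0, 1]  # Pearson correlation coefficient
```

A correlation close to 1 means the Q-values rank state-action pairs in nearly the same order as the rewards actually realized.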

Impact of Batch Constraints

Batch constraints are crucial in limiting the legal action space, preventing the model from choosing price actions that are sparse or absent in the historical data. The model can still explore nearby, potentially optimal price actions, which supports its adaptability and accuracy.

Flexibility of Shapes

The chosen logistic functions are flexible enough to model various S-shaped curves, crucial for capturing the growth and drop in bid prices throughout the day. This flexibility is essential for accurately representing different pricing scenarios.

Conclusion and Future Steps

Adding batch constraints to DQN significantly reduces extrapolation error, leading to better policies both at an aggregate level and for individual itineraries. The current discrete BCQ framework is promising, but future efforts will aim to develop a continuous version to further stabilize training.

Moving forward, live A/B testing is essential to validate the BCQ model’s performance in real-world scenarios, especially in tracking revenue growth. Additionally, exploring different action-probability distributions and shape functions could enhance the model’s adaptability across various domains.

By integrating BCQ into revenue management systems, businesses can achieve more stable and accurate pricing strategies. This gives travel companies room to innovate while optimizing revenue and improving real-time decision-making.


For further details or inquiries, you can reach out to:

Naman Shukla, Product Manager, Applied Science, naman.shukla@flyr.com
Akhil Gupta, Senior Applied Scientist, akhil.gupta@flyr.com
