Off-Policy Evaluation: Valuing New Pricing Models Using Reinforcement Learning

Dynamic pricing models are at the core of the success of companies such as Uber, Airbnb, and ourselves here at FLYR Labs, where we help major airlines optimize their day-to-day pricing and forecast revenue. These models are responsible for setting prices that balance supply and demand to optimize key company metrics, and they manage billions of dollars in revenue annually. We continuously develop improvements to our deep learning pricing models and retrain them on the most recent data in order to ensure they perform well under changing market conditions. However, deploying new models in production means changing the client’s pricing strategy, so it is imperative that we maintain or improve our revenue performance with each new deployment.

For this reason, we decided to implement a process to estimate the revenue a pricing strategy will earn before it actually goes into production. This would prevent us from deploying a model that will earn less revenue than the existing pricing strategy and allow us to deploy only the best-performing revenue generation models out of a set of candidates. Furthermore, this process should estimate a pricing strategy’s value regardless of how that strategy sets prices, meaning it should work the same for our artificial intelligence (AI)-generated pricing models as for an airline’s human-powered pricing methods.

Typical approaches to this kind of problem might include econometric modeling of the elasticity of demand or the use of simulation environments. While companies like Uber and Airbnb operate platforms that have fine-grained visibility into demand for their services, the airline pricing domain offers only sparse bookings data rather than a record of every traveler’s search for flights. This makes it difficult to accurately represent demand with a model or a simulation: in the common scenario where no tickets were purchased at a particular price, we cannot distinguish between a lack of demand for the flight and a price that was set too high.

Our desire for a strategy-agnostic estimation method that doesn’t directly model demand led us to a subset of machine learning models in the field of reinforcement learning (RL). RL models gained popularity in recent years by surpassing expert human performance in a variety of complex domains, like chess and Atari video games. Through trial and error, they navigate sequential decision-making problems by receiving feedback on each of their decisions and discover the best strategy by optimizing for the total feedback they receive, similar to how humans and animals learn in the real world.

RL frameworks consist of an agent (i.e. the model) that observes the state of its environment, chooses an action in order to earn reward, and then observes the state in the next time period where it will take another action. The process of observing the state of the environment and choosing an action defines the strategy learned by the agent, which is called the policy in RL lingo. The agent can learn either from direct interaction with the environment (online learning), or more commonly by using historical data from interactions of other policies with the environment (offline learning).

The canonical RL diagram

Applied to the context of airline revenue management, the agent represents a team of revenue management analysts with legacy pricing controls or one of FLYR’s pricing models. The state of the environment represents information about the flight, like the departure date and how many seats are still available. The action is the price set by the agent, and the reward is the revenue earned from the chosen action.
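To make this mapping concrete, here is a minimal sketch of how one logged pricing decision could be represented as an RL transition. The class names, the two state fields, and the assumption of one pricing decision per day are hypothetical simplifications for illustration; a real state would carry far more information about the flight.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlightState:
    """Hypothetical, simplified state: what the agent observes before pricing."""
    days_to_departure: int
    seats_available: int


@dataclass(frozen=True)
class Transition:
    """One logged interaction: state, action (price), reward (revenue), next state."""
    state: FlightState
    price: float
    revenue: float
    next_state: Optional[FlightState]  # None once the flight has departed


def make_transition(state: FlightState, price: float, seats_sold: int) -> Transition:
    """Build a transition, assuming one pricing decision per day (an assumption)."""
    if not 0 <= seats_sold <= state.seats_available:
        raise ValueError("seats_sold must be between 0 and seats_available")
    revenue = price * seats_sold  # reward: revenue earned at the chosen price
    if state.days_to_departure == 0:
        next_state = None  # departure day is the last state in the episode
    else:
        next_state = FlightState(
            days_to_departure=state.days_to_departure - 1,
            seats_available=state.seats_available - seats_sold,
        )
    return Transition(state, price, revenue, next_state)
```

For example, `make_transition(FlightState(2, 10), 100.0, 3)` records a reward of 300.0 and a next state with one day and seven seats remaining.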

In the majority of RL models, the agent is trained to select the next action that maximizes total discounted reward across all future time periods. The agent learns the value of different actions from the relationship between states, actions, rewards, and the next state that follows from the previous state and action. This usually happens by estimating the action-value function $Q(s_t, a_t)$, which accepts as inputs a state $s_t$ and the action $a_t$ taken in that state, and outputs the total discounted reward that can be earned from the current time period onwards by following the actions according to the policy $\pi$. This function is updated iteratively during training according to the recursive Bellman equation for the Q-learning target:

$$Q(s_t, a_t) \leftarrow r_t + \gamma \, d_t \, Q(s_{t+1}, a^*_{t+1})$$

where:

  • $r_t$ is the reward earned in state $s_t$ from taking action $a_t$;
  • $\gamma$ is the discount factor that weights future reward less than current reward, usually set to 0.99;
  • $s_{t+1}$ is the state in the next time period that the environment transitions to based on the previous state and action;
  • $a^*_{t+1} = \operatorname{argmax}_{a_{t+1}} Q(s_{t+1}, a_{t+1})$ is the action in the next time period that maximizes $Q(s_{t+1}, \cdot)$, or the model’s estimate of future reward from the next state onwards; and
  • $d_t$ is a binary flag that simply zeros out the estimate of future reward when the current state is the last one in the episode (0 in that case, 1 otherwise).
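The target described above can be sketched in a few lines of NumPy. This is an illustration only: it assumes a discrete set of candidate actions (for example, a fixed list of price points), and the function and argument names are hypothetical rather than taken from our codebase.

```python
import numpy as np


def q_learning_target(rewards, next_q_values, not_terminal, gamma=0.99):
    """Compute the Q-learning target for a batch of transitions.

    rewards:       shape (batch,), reward earned by each logged action
    next_q_values: shape (batch, n_actions), Q estimates for every action
                   in the next state
    not_terminal:  shape (batch,), binary flag that is 0 when the current
                   state is the last one in the episode, else 1
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q_values = np.asarray(next_q_values, dtype=float)
    not_terminal = np.asarray(not_terminal, dtype=float)
    # Value of the best next action: max over actions of Q(next state, action)
    best_next_q = next_q_values.max(axis=1)
    return rewards + gamma * not_terminal * best_next_q
```

With rewards `[100, 50]`, next-state estimates `[[10, 20], [5, 1]]`, and flags `[1, 0]`, the first target adds the discounted best estimate (20) to the reward, while the second is just the reward because the episode has ended.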

In this way, the Q values converge to the true discounted future reward by starting from a random initialization near zero for all state-action pairs and updating progressively towards the reward earned from each state-action pair plus the discounted estimate of future reward earned from taking the best action in the next state onwards. The policy that should theoretically be learned by the agent with this approach is called the optimal policy.
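As a small worked example of the quantity being estimated, the total discounted reward from each time step onwards can be computed directly for a single observed episode by applying the same recursion backwards from the final step. The function name is hypothetical, and this computes the return of one logged trajectory, not the value of the optimal policy.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    future = 0.0  # no future reward after the last step of the episode
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        returns[t] = future
    return returns
```

With rewards `[1, 1, 1]` and a discount factor of 0.5, the returns are 1.75, 1.5, and 1.0: each step's value is its own reward plus half of the value of the step that follows.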

Convergence of an RL agent’s estimate of Q values as it is trained longer (Mnih et al., 2015)

Our goal is to value the actions from a new policy using historical data. It turns out that the offline learning problem can be altered slightly: instead of taking actions that maximize total discounted reward and learning the value of the optimal policy, the agent takes the new policy’s actions as given and learns the value of that policy. This area of research is known as off-policy evaluation (OPE), and it requires only a small change to the update formula above:

$$Q^{\pi'}(s_t, a_t) \leftarrow r_t + \gamma \, d_t \, Q^{\pi'}(s_{t+1}, a'_{t+1})$$

where we simply change the action $a^*_{t+1}$, which maximizes the estimate of future reward, to the action $a'_{t+1}$, the action that the new policy $\pi'$ would take in state $s_{t+1}$ (with $r_t$, $\gamma$, and the binary end-of-episode flag $d_t$ unchanged). In this way, the Q values are updated in the direction of the reward earned from the historical state-action pairs plus the estimated performance of the new policy’s actions in the next state. This approach to OPE has been shown theoretically to converge to the true value of the new policy with some bounds on the estimation error, and has been demonstrated empirically to be one of the most robust in stress tests.
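The change from the Q-learning target to the OPE target amounts to replacing a maximum over next actions with a lookup of the action the new policy would choose. The sketch below assumes discrete actions indexed by integers; the function and argument names are hypothetical.

```python
import numpy as np


def ope_target(rewards, next_q_values, next_policy_actions, not_terminal, gamma=0.99):
    """Compute the OPE target for a batch of historical transitions.

    next_q_values:       shape (batch, n_actions), Q estimates in the next state
    next_policy_actions: shape (batch,), index of the action the NEW policy
                         would take in the next state
    not_terminal:        shape (batch,), 0 when the current state is the last
                         one in the episode, else 1
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q_values = np.asarray(next_q_values, dtype=float)
    next_policy_actions = np.asarray(next_policy_actions, dtype=int)
    not_terminal = np.asarray(not_terminal, dtype=float)
    rows = np.arange(len(next_policy_actions))
    # Q(next state, new policy's action) instead of the max over actions
    policy_next_q = next_q_values[rows, next_policy_actions]
    return rewards + gamma * not_terminal * policy_next_q
```

With next-state estimates `[[10, 20], [5, 1]]` and new-policy actions `[0, 1]`, the target uses the estimates 10 and 1 respectively, even though 20 and 5 would be the maximizing choices. That is the whole difference from the Q-learning target.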

We selected a very simple dueling deep Q network (DDQN) architecture for the OPE models to start, where two dense layers form the body of the model and feed into two separate heads: one for learning the value of being in a particular state, and the other for learning the advantage that each action provides over the others in that state. The outputs of these heads are added together to produce the final Q value that we compare against the Q-learning target to compute loss.
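The forward pass of such an architecture can be sketched in plain NumPy. Layer sizes, the ReLU activation, the weight initialization, and the squared-error loss are all assumptions made for illustration, and every name here is hypothetical. The heads are combined by simple addition as described above; a common variant in the literature also subtracts the mean advantage before adding.

```python
import numpy as np


def init_params(state_dim, hidden_dim, n_actions, seed=0):
    """Small random weights near zero for two body layers and two heads."""
    rng = np.random.default_rng(seed)

    def dense(n_in, n_out):
        return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)

    return {
        "body1": dense(state_dim, hidden_dim),
        "body2": dense(hidden_dim, hidden_dim),
        "value": dense(hidden_dim, 1),              # value of being in the state
        "advantage": dense(hidden_dim, n_actions),  # advantage of each action
    }


def dueling_q_values(params, states):
    """Return Q estimates of shape (batch, n_actions)."""
    h = np.asarray(states, dtype=float)
    for name in ("body1", "body2"):
        w, b = params[name]
        h = np.maximum(h @ w + b, 0.0)  # dense layer followed by ReLU
    w_v, b_v = params["value"]
    w_a, b_a = params["advantage"]
    value = h @ w_v + b_v       # shape (batch, 1)
    advantage = h @ w_a + b_a   # shape (batch, n_actions)
    return value + advantage    # value broadcasts across the action axis


def squared_error_loss(q_values, actions, targets):
    """Mean squared error between Q(s, a) for the logged actions and the targets."""
    actions = np.asarray(actions, dtype=int)
    chosen = q_values[np.arange(len(actions)), actions]
    return float(np.mean((chosen - np.asarray(targets, dtype=float)) ** 2))
```

Only the Q value of the action actually taken in each historical transition enters the loss, which is why the sketch indexes the output by the logged action before comparing it to the target.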

Moving from theory to implementation on Google Cloud Platform, we built a pipeline that uses Dataflow to add the inference outputs from multiple pricing models to an existing dataset, trains OPE models in parallel on AI Platform to learn the value of each specified pricing policy (including the airline’s historical pricing policy), runs inference on these OPE models, and saves the outputs to BigQuery, where they can be visualized in dashboards.

We used this pipeline to compare three pricing policies against each other for one of our airline clients: the production pricing model currently setting prices for the airline and generating revenue uplift, a former candidate model that we discarded because it didn’t meet our model promotion standards, and the historical pricing policy of the airline’s revenue management analysts. The results from the OPE models corroborate what we had seen when previously analyzing the three policies.

Our production model (blue) and the candidate model (turquoise) are each estimated to earn slightly less in terms of average remaining discounted revenue far away from departure compared to the airline’s historical pricing decisions (red). However, our production model catches up to the historical policy by 125 days from departure and far surpasses it until the flight departs, which we can confirm with its real-world revenue performance. On the other hand, the discarded candidate model only catches up to the historical policy at 50 days out, and remains approximately level with it until departure.

We have just shown that these OPE methods can be a useful step in our model promotion process to ensure that we are only deploying models that meet our standards for revenue performance. Additional research in this area will enable us to measure the prediction error of the OPE models in our pricing model’s domain, enable them to learn more complex relationships in the data by using more advanced model architectures and input data structures, and extend their application to other domains beyond airline revenue management.
