Peek behind the curtain: how we built AI Smart Summaries, our first Generative AI feature
So, how did it all start?
The idea for AI Smart Summaries was born during a discussion with our in-house Revenue Managers (members of the Revenue Strategy Services department - RSS) and Bill Daviau, our VP of Strategic Accounts and Partnerships.
We discovered that a big part of their daily routine was looking at Excel reports generated in Spider Analytics (Lighthouse's BI solution) and summarizing the main data points into a textual format that is easy for hotel General Managers to digest.
Their routine looked like this:
Open the report
Look into the summarized table
Analyze the numbers
Write a summary of the data:
Last night we picked up 19 rooms and easily beat our r28. We finished with 185 rooms on the books with a $176 ADR. Great job team! This weekend we have about 50 rooms left to sell. We are still the highest in the brand compset, but we can see the other brand hotels are priced much higher. We were not seeing much pickup at the higher rates.
We did see some pick-up last night for next week and we are starting to see some rate movement in the compset.
Pick-up outside of next week was minimal, with Saturday the 8th being the peak of pick-up.
Recommend (or decide) what actions should be taken:
Last night we picked up 19 rooms and easily beat our r28. We finished with 185 rooms on the books with a $176 ADR. Great job team! This weekend we have about 50 rooms left to sell. We are still the highest in the brand compset, but we can see the other brand hotels are priced much higher.
We were not seeing much pickup at the higher rates so we will not push up that high, but I will bring Saturday up to $219.We did see some pick-up last night for next week and we are starting to see some rate movement in the compset. I will push our Mon-Wed rate to $179 as we continue to build our rate.
Pick-up outside of next week was minimal, with Saturday the 8th being the peak of pick-up. I will push that Saturday's rate up by $10.
While we recognized that making specific action recommendations is not something Generative AI excels at (yet), we saw immense potential in automating data summarization and kicked off a collaborative project to see what we could do.
Step 1: Understanding context
A crucial part of the task was identifying key data points and structuring the summary accordingly. The Data R&D team at Lighthouse holds a unique advantage with access to high-quality data and easily accessible industry experts.
We spent considerable time collaborating with Revenue Managers (RMs) to grasp what they were looking for in the data and how they decided what was important to highlight. This was an invaluable opportunity to document expert knowledge and understand how exactly certain analyses are performed.
We realized that there is a set of standards that almost every RM seeks:
Summarizing yesterday’s performance
Analyzing the books and pickup segmentation
Reviewing numbers for the next two weeks (and comparing them to the previous year)
Conducting a high-level analysis of the next few months to find specific outliers/trend changes
For each of these parts, there are a handful of specific data points that are relevant to talk about in the summary.
With this information, our Data R&D team focused on writing specific queries to extract the most important data.
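To give a rough idea of what that looks like in practice, here is a minimal sketch of such an extraction query, run from Python against BigQuery. The table and column names (`hotel_daily_metrics`, `rooms_otb`, `adr`, and so on) are illustrative, not our actual schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and columns - the real schema differs.
QUERY = """
    SELECT stay_date, rooms_otb, adr, pickup_rooms, occupancy_forecast
    FROM `analytics.hotel_daily_metrics`
    WHERE hotel_id = @hotel_id
      AND stay_date BETWEEN CURRENT_DATE()
          AND DATE_ADD(CURRENT_DATE(), INTERVAL 14 DAY)
    ORDER BY stay_date
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("hotel_id", "STRING", "hotel_123")]
)
rows = client.query(QUERY, job_config=job_config).result()

# Each row becomes one data point the summary can reference.
data_points = [dict(row) for row in rows]
```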
Step 2: Data gathering
Collecting and extracting tabular, numeric data was relatively straightforward. For each customer, we already store their data in a clean format and in scalable infrastructure provided by Google Cloud Platform.
The bottleneck was the training data for the actual textual summaries. First, we tried to use out-of-the-box Large Language Models (LLMs) to “write a summary of the data points like a Revenue Manager in a hotel”.
We quickly noticed that this was not the right direction - the summaries lacked industry context and the tone was rather "flat". The LLM struggled to differentiate positive from negative pieces of information, and hallucinations were a real problem (e.g. "OTB" being expanded to "Out of the Box" instead of "On the Books").
We decided that the best way forward was to work closely with our colleagues from Revenue Strategy Services. Over several months, they generated hundreds of examples for anonymized hotels, covering all sorts of different performance scenarios.
We also ensured that we could link the "numerical data" to the "textual summaries" based on date and hotel. This allowed us to later prompt the models with "summaries of similar days in the past".
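Conceptually, the link is just a join on hotel and date. A minimal sketch, with illustrative file and column names rather than our real ones:

```python
import pandas as pd

# Hypothetical frames: one with extracted numeric data points,
# one with the RM-written example summaries.
numeric_df = pd.read_parquet("numeric_data_points.parquet")   # hotel_id, report_date, metrics...
summaries_df = pd.read_parquet("rm_summaries.parquet")        # hotel_id, report_date, summary_text

# Link each written summary to the numbers it describes,
# so similar past days can later be retrieved as examples.
training_df = numeric_df.merge(summaries_df, on=["hotel_id", "report_date"], how="inner")
```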
Step 3: Searching for the best solution
Research setup
At Lighthouse, the Data R&D team follows a structured research approach. Our Data Scientists test different options and compare the results.
Our team is composed of experts from diverse backgrounds: from economics to theoretical physics, with experiences spanning business, academia, and various industries like telecommunications, food, finance, and taxi services.
But Generative AI, and Large Language Models in particular, were not something any of us had worked with before (their widespread use only began a few months ago).
So, we made sure we spent enough time acquiring knowledge about them and understanding best practices.
We realized that one of the bigger challenges with this awesome new technology is that there is no easy evaluation method like there is for well-established predictive AI models. We decided that the best way to deal with that was - again - to involve our RSS colleagues.
With each new iteration, we asked them to rate generated summaries and used that score to define progress.
Challenges we faced
Prompt engineering is hard
Coding is something Data Scientists at Lighthouse are good at. It’s structured, precise and clean.
Initially, we expected that a similar way of communicating with the LLM would be the best way to go. However, we quickly figured out that it’s not so simple and that there's actually not much engineering in “prompt engineering”: there's a lot of linguistics and semantics you need to take into account.
We had to consider different meanings and nuances of language, anticipate how the model might misinterpret instructions, and provide just the right amount of context without overwhelming the model.
It’s a delicate balance, and each prompt needs to work well across a variety of scenarios.
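To make that concrete, here is an illustrative prompt template (not our production prompt) showing the kind of ingredients we ended up relying on: a domain glossary to avoid misreadings like the "OTB" example above, explicit rules against inventing numbers, and a constrained output format.

```python
# Illustrative prompt template - not the production prompt.
SUMMARY_PROMPT = """You are a hotel Revenue Manager writing a short update for a General Manager.

Glossary (use these meanings only):
- OTB: rooms On The Books
- ADR: Average Daily Rate
- Compset: the hotel's competitive set

Rules:
- Only mention numbers that appear in the data below; never invent figures.
- Keep the tone factual and concise (3-4 sentences).
- Do not recommend rate changes.

Data:
{data_points}
"""

def build_prompt(data_points: str) -> str:
    # Fill the template with the pre-formatted numeric data points.
    return SUMMARY_PROMPT.format(data_points=data_points)
```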
Too much context is confusing
Early in our research, we realized that LLMs struggle when given a lot of context covering various topics and points at the same time.
To address this, we decided to split the textual data into a few smaller segments:
Yesterday’s performance
Best and 2nd best-performing segment
Near future (Today & Tomorrow)
Far future (Next 2 weeks)
Monthly
This of course meant we had to rework the data provided initially to fit nicely into the data structure needed for the AI models.
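As a sketch of what that reworked structure can look like, the example below groups hypothetical input fields into the five segments listed above, so each prompt only receives the context it actually needs.

```python
from typing import Any

# Hypothetical shape of the reworked input: one small dict per segment,
# so each prompt only sees the slice of data relevant to it.
def split_into_segments(data: dict[str, Any]) -> dict[str, dict[str, Any]]:
    return {
        "yesterday": {"pickup": data["pickup_yesterday"], "otb": data["otb"], "adr": data["adr"]},
        "top_segments": {"best": data["best_segment"], "second_best": data["second_best_segment"]},
        "near_future": {"today": data["today"], "tomorrow": data["tomorrow"]},
        "far_future": {"next_two_weeks": data["next_two_weeks"]},
        "monthly": {"months_ahead": data["months_ahead"]},
    }

# Each segment is then summarized with its own focused prompt,
# and the partial summaries are stitched together afterwards.
```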
To fine-tune or to RAG?
I am not going to delve deeply into technical details in this post, but we mainly tested two approaches (after figuring out that a "plain" LLM would not work for our use cases):
Fine-tuning - this involves updating the weights of the final layers of the model using specifically prepared datasets (formatted into question-answer pairs). Think of it as the model focusing a bit more on writing Revenue Management-focused data summaries as the last course in its learning process.
RAG (Retrieval-Augmented Generation) - this method adds an additional step between the question and the answer that fetches relevant context or examples using embeddings. It’s like providing specific examples of how a similar day could be summarized when asking, "How did yesterday perform?"
We tried both methods and their performance was very similar (big thanks again to our colleagues for the evaluation).
Considering all the downsides of the fine-tuning option (cost, lack of flexibility, higher maintenance), we decided to go with the RAG approach.
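For readers who want a feel for the retrieval step, here is a minimal sketch: embed a textual description of the current day, find the most similar historical days, and feed their RM-written summaries to the model as examples. The embedding call is a placeholder (in practice it would be backed by an embedding model), and the field names are illustrative.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for an embedding model call (e.g. a text-embedding model on Vertex AI)."""
    ...

def retrieve_examples(current_day_text: str, history: list[dict], k: int = 3) -> list[str]:
    """Return the RM-written summaries of the k most similar past days.

    `history` items are assumed to hold a precomputed 'embedding' and a 'summary'.
    """
    query_vec = embed(current_day_text)
    # Cosine similarity between the current day and each historical day.
    scores = [
        float(np.dot(query_vec, h["embedding"])
              / (np.linalg.norm(query_vec) * np.linalg.norm(h["embedding"])))
        for h in history
    ]
    top = np.argsort(scores)[-k:][::-1]
    return [history[i]["summary"] for i in top]

# The retrieved summaries are appended to the prompt as few-shot examples
# before asking the model to summarize the current day.
```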
Choosing the right model
We had to figure out which LLM we wanted to use out of the providers available. We compared the cost, the speed, and, of course, the quality of various models provided by Google and OpenAI.
To ensure quality, we generated summaries for the same data points and asked Revenue Managers to choose the best one (without knowing which model produced which summary). We also added some human-generated summaries to the set for the sake of comparison (more details below).
It was very interesting to see how some models were more elaborate and verbose, while others were straight to the point. In the end, we picked the model with the best cost/quality trade-off.
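The blind setup itself is simple: shuffle summaries from the different sources (including the human-written baselines), strip the source labels, and only join the labels back once the scores are in. A minimal sketch with illustrative source names:

```python
import random

# candidates: list of {"source": "model_a" | "model_b" | "human", "summary": "..."}
def prepare_blind_review(candidates: list[dict]) -> tuple[list[dict], dict[int, str]]:
    """Shuffle summaries and strip source labels so reviewers score them blind."""
    shuffled = random.sample(candidates, k=len(candidates))
    blind_items = [{"review_id": i, "summary": c["summary"]} for i, c in enumerate(shuffled)]
    answer_key = {i: c["source"] for i, c in enumerate(shuffled)}
    return blind_items, answer_key

# The scores are joined back to the answer key only after rating,
# giving a per-model average alongside the human-written baseline.
```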
Hallucinations
When it comes to hallucinations, there's only so much we can do. We managed to reduce them to a minimum with all the things described above, but with this technology, there is never 100% certainty.
Here are a few examples:
Model: "Tonight and tomorrow are forecasted to be undersold compared to the compset. Next weekend is forecasted to be sold out, and we are currently priced well above compset."
RM Feedback: "We can't know if we are going to be undersold compared to the comp set. 92% does not indicate a sellout"
Model: "Tonight and tomorrow are forecasted to be in the low 40%. We currently have 106 rooms OTB tonight and 88 tomorrow. Rates can be increased for Thursday through Saturday as forecasted occupancy is above 45% and our rate is currently lower than compset."
RM Feedback: "Next two weeks: the occupancy for tomorrow is 38%, not in the low 40s - room counts/occupancies are very literal. I do not agree with the rate lift suggestion"
In our final validation sessions, we concluded that errors happen very rarely, and we feel confident they are no more common than the ones a human might make.
Below you can see a distribution of the scores given to Smart Summaries by our Revenue Managers, compared to the baseline score, which was the "average score for human-written summaries."
1. Bad: the summary is wrong and/or unusable.
2. Decent: correct, but a large part would need to be rewritten before sending it to the client.
3. Decent: would need some manual modifications to sound more knowledgeable.
4. Good: not indistinguishable from an RM, but good enough to be sent without modifications.
5. Perfect: could be sent as written; indistinguishable from an RM.
We continuously perform checks on the data and ask our RMs to randomly review summaries. We also closely monitor feedback and make adjustments as needed.
Step 4: Building for scale
Building a toy project to showcase a potential feature is easy; building something production-grade, at scale, and integrated within our product is not.
Additionally, moving LLM-based features into “production” was also not something we had experience with - figuring out what technical solutions are best was part of the challenge.
We decided on a setup written mostly in Python and the hottest new programming language - English (for prompt engineering). Using Google Cloud Platform infrastructure and Google's LLM (Gemini), we made sure the setup can scale from a few beta users to thousands of customers if things go well.
We also wrapped the whole setup in an API that can communicate with our products - so that a summary can be generated “on the fly” if the right numeric input data is provided.
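As an illustration of the shape of such an API (the endpoint path, request fields, and the `generate_summary` entry point below are hypothetical), here is a minimal sketch using FastAPI:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SummaryRequest(BaseModel):
    hotel_id: str
    segments: dict  # pre-aggregated numeric inputs, grouped per segment

class SummaryResponse(BaseModel):
    summary: str

def generate_summary(hotel_id: str, segments: dict) -> str:
    """Placeholder for the internal pipeline that builds the prompts,
    retrieves similar-day examples and calls the LLM."""
    raise NotImplementedError

@app.post("/v1/smart-summary", response_model=SummaryResponse)
def smart_summary(request: SummaryRequest) -> SummaryResponse:
    # Generate the summary "on the fly" from the provided numeric input data.
    return SummaryResponse(summary=generate_summary(request.hotel_id, request.segments))
```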
Another consideration is that model providers deprecate their model versions annually. That means that every year we need to switch to a new model, which carries a risk: there is no guarantee that the new model will behave the same way as the previous one.
While building the production solution, we made sure switching to a new model could be done easily and that we could focus on testing and potentially improving the prompts - without working out a new technical setup.
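One way to achieve that - sketched below with illustrative names, not our actual production code - is to keep the provider call behind a thin, configuration-driven wrapper, so a model deprecation only touches configuration while prompts and evaluation code stay untouched.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    provider: str             # e.g. "vertex-ai"
    model_name: str           # e.g. "gemini-1.5-pro" - swapped via config, not code
    temperature: float = 0.2

class SummaryModel:
    """Thin wrapper so switching to a new model version is a configuration change,
    leaving the prompts and the evaluation tooling untouched."""

    def __init__(self, config: ModelConfig):
        self.config = config

    def generate(self, prompt: str) -> str:
        # Dispatch to the configured provider's SDK here.
        raise NotImplementedError
```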
Step 5: Model evaluation - human in the loop
And then there is a big "not solved yet" problem: how to keep the quality up without labor-intensive manual processes in place.
We are evaluating some simple approaches (measuring the number of people opting out, the number of issues reported, etc.) and some more tech-heavy ones, such as using another model to verify whether the responses are correct (though based on our first tests, we are not convinced that's the right way to go either).
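The simple approaches boil down to tracking a few rates over time. A minimal sketch of what such tracking could look like (the field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class QualityMetrics:
    summaries_generated: int = 0
    opt_outs: int = 0
    issues_reported: int = 0

    @property
    def opt_out_rate(self) -> float:
        # Share of users who switched the feature off.
        return self.opt_outs / self.summaries_generated if self.summaries_generated else 0.0

    @property
    def issue_rate(self) -> float:
        # Share of generated summaries that received an issue report.
        return self.issues_reported / self.summaries_generated if self.summaries_generated else 0.0

# These rates are watched over time; a sustained increase would trigger
# a manual review round with the Revenue Managers.
```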
Lessons learned & what’s coming next
The main lesson we learned is that collaboration between domain experts and engineers is key to using AI to help Revenue Managers.
It has been an invaluable experience that helped us understand each other and made everyone genuinely excited about the next experiments to work on together.
Here are our 5 key takeaways from the challenges we faced:
Prompt engineering is challenging and requires a “non-engineering” approach to avoid model misinterpretation.
Large Language Models can struggle with handling a lot of different contexts, so it's better to split textual data into smaller segments.
Using Retrieval-Augmented Generation in combination with a foundation model is flexible and requires less maintenance than a fine-tuned model.
Choosing the right foundation LLM requires consideration of cost, speed, and quality, with the latter being hard to evaluate.
Hallucinations can still occur despite efforts to reduce them, but with the right approach, errors are rare and not more common than those made by humans.
And while Lighthouse's Smart Summaries have already been featured in the innovation category by Hotel Tech Report, we are definitely not done innovating and experimenting with GenAI.
Our Data R&D team, in collaboration with Product Managers & domain experts, is continuously working on our innovation roadmap.
You can expect exciting new AI-driven features soon!