Solving Training-Serving Discrepancies

Introduction

In machine learning, one critical challenge is ensuring that the features used for model training (offline) match those used at inference time (online serving). Discrepancies between training and serving features can significantly degrade model performance, so it is crucial to identify and address them as quickly as possible.

At Taboola, we specialize in content discovery and native advertising, enabling users to find and engage with personalized content across the web. Our advanced machine-learning-powered recommendation systems serve billions of recommendations every day, helping publishers, advertisers, and brands reach their target audiences effectively.

In this blog post, I’m going to discuss the challenges of training-serving feature discrepancies in machine learning models and how they can affect model performance. I’ll explain how we tackle these discrepancies at Taboola, including the design and implementation of our solution, Sherlock. Finally, I’ll discuss some of the key discoveries made and the significant impact Sherlock has had on our recommendation systems.

Discrepancies?? No Way…

Training-serving feature discrepancies can occur for several reasons. The most common reason is the way we handle large features. Due to their significant storage requirements, these features are often not reported back as-is after serving in the online environment. Instead, before training, they are re-calculated from sampled data and various data sources. The re-calculated features may differ from the original calculations performed during serving. This may be due to data differences, changes in external data between serving and training times, or calculation logic differences.
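To make the re-calculation problem concrete, here is a minimal sketch of how a feature can drift when the external data it depends on changes between serving and training. It is purely illustrative: the feature, the `ctr_feature` helper, the publisher IDs, and the numbers are all hypothetical and not taken from our production code.

```python
# Hypothetical feature: a publisher's click-through rate, looked up from an
# external table. All names and values here are made up for illustration.
def ctr_feature(publisher_id, ctr_table):
    """Return the CTR feature for a publisher, or None if it is unknown."""
    return ctr_table.get(publisher_id)

# State of the external data when the request was served.
ctr_at_serving = {"pub_a": 0.031, "pub_b": 0.012}
# The same table later on, when the training data is re-calculated.
ctr_at_training = {"pub_a": 0.031, "pub_b": 0.019}

publishers = ("pub_a", "pub_b")
served = {p: ctr_feature(p, ctr_at_serving) for p in publishers}
recomputed = {p: ctr_feature(p, ctr_at_training) for p in publishers}

# Features whose re-calculated value no longer matches what the model saw online.
drifted = [p for p in publishers if served[p] != recomputed[p]]
print(drifted)  # ['pub_b']
```

The calculation logic is identical in both cases, yet the model would be trained on a value it never saw at serving time.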

Another cause of discrepancies is cache misses and database query timeouts in the online serving environment. To return recommendations to the client within hundreds of milliseconds, we must enforce very strict timeouts on database queries. As a result of these misses and timeouts, certain features may have no value at serving time, while the report-back pipeline, which runs with less strict timeouts, does produce values for them.
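The timeout effect can be sketched in a few lines. This is a simplified illustration under assumed names and numbers (`slow_lookup`, `fetch_feature`, the delays and timeouts are all hypothetical), not our serving code: the same lookup returns no value under a strict online budget and a real value under a relaxed offline one.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def slow_lookup(key, delay_s):
    """Stand-in for a database query that takes delay_s seconds (hypothetical)."""
    time.sleep(delay_s)
    return {"user_42": 0.87}.get(key)

def fetch_feature(executor, key, delay_s, timeout_s):
    """Return the feature value, or None if the lookup exceeds timeout_s."""
    future = executor.submit(slow_lookup, key, delay_s)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        return None  # the model is served without this feature

with ThreadPoolExecutor(max_workers=2) as executor:
    # Online serving: strict budget, so the slow query times out.
    serving_value = fetch_feature(executor, "user_42", delay_s=0.2, timeout_s=0.05)
    # Report-back pipeline: relaxed budget, so the same query succeeds.
    offline_value = fetch_feature(executor, "user_42", delay_s=0.2, timeout_s=2.0)

print(serving_value, offline_value)
```

The model served the request with a missing feature, but the training row for that same request would carry a populated one.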

Additionally, integration bugs can introduce further discrepancies. Poorly integrated components or erroneous data pipelines can cause mismatches between the features used during training and those available during serving.

Meet Sherlock

To tackle this issue, we developed Sherlock, a robust system designed to detect and alert on training-serving feature discrepancies in our models. Developed as part of our continuous efforts to improve the reliability and performance of our recommendation systems at Taboola, Sherlock allows us to promptly detect discrepancies and quickly address them, ensuring consistent and accurate model performance.
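As a rough illustration of what "detect and alert" can mean in practice, the sketch below computes a per-feature mismatch rate over pairs of serving-time and training-time feature values for the same requests, and flags features whose rate exceeds a threshold. This is not Sherlock's implementation: the function names, the sample data, and the 5% threshold are assumptions made for the example.

```python
import math

def values_match(a, b, rel_tol=1e-6):
    """Compare two feature values, treating None as 'missing'."""
    if a is None or b is None:
        return a is None and b is None
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b

def mismatch_rates(pairs):
    """pairs: iterable of (serving_features, training_features) dicts per request."""
    total, mismatched = {}, {}
    for serving, training in pairs:
        for name in serving.keys() | training.keys():
            total[name] = total.get(name, 0) + 1
            if not values_match(serving.get(name), training.get(name)):
                mismatched[name] = mismatched.get(name, 0) + 1
    return {name: mismatched.get(name, 0) / total[name] for name in total}

def features_to_alert(rates, threshold=0.05):
    """Return the features whose mismatch rate exceeds the (assumed) threshold."""
    return sorted(name for name, rate in rates.items() if rate > threshold)

# Hypothetical sample: four requests, one with a feature missing at serving time.
pairs = [
    ({"user_ctr": 0.12, "country": "US"}, {"user_ctr": 0.12, "country": "US"}),
    ({"user_ctr": None, "country": "US"}, {"user_ctr": 0.30, "country": "US"}),
    ({"user_ctr": 0.07, "country": "FR"}, {"user_ctr": 0.07, "country": "FR"}),
    ({"user_ctr": 0.25, "country": "DE"}, {"user_ctr": 0.25, "country": "DE"}),
]
rates = mismatch_rates(pairs)
print(features_to_alert(rates))  # ['user_ctr']
```

Tracking the rate per feature, rather than a single global number, makes it easier to trace an alert back to a specific data source or pipeline.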