Improving information retrieval in the Elastic Stack: Introducing Elastic Learned Sparse Encoder, our new retrieval model

Learn about the Elastic Learned Sparse Encoder (ELSER), its retrieval performance, architecture, and training process.


In this blog, we discuss the work we've been doing to augment Elastic's out-of-the-box retrieval with a pre-trained language model: the Elastic Learned Sparse Encoder (ELSER).

In our previous blog post in this series, we discussed some of the challenges of applying dense models to retrieval in a zero-shot setting. These challenges are well known and were highlighted by the BEIR benchmark, which assembles diverse retrieval tasks as a proxy for the performance one might expect from a model applied to an unseen data set. Good retrieval in a zero-shot setting is exactly what we want to achieve: a one-click experience that enables textual fields to be searched using a pre-trained model.

This new capability fits into the Elasticsearch _search endpoint as just another query clause, a text_expansion query. This is attractive because it allows search engineers to continue to tune queries with all the tools Elasticsearch already provides. Furthermore, to truly achieve a one-click experience, we've integrated it with the new Elasticsearch Relevance Engine. However, rather than focus on the integration, this blog digs a little into ELSER's model architecture and the work we did to train it.
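To give a feel for how this looks in practice, here is a minimal sketch of a text_expansion query issued through the Python client. It assumes an index whose documents were ingested through an inference pipeline running ELSER, with the resulting token-weight expansions stored in a rank_features field; the index name, field name, document fields, and query text are placeholders, and .elser_model_1 is the ID the model shipped under at the time of writing.

```python
# Sketch only: assumes documents in "my-index" were ingested through an
# inference pipeline running ELSER, which stored each document's token-weight
# expansion in the rank_features field "ml.tokens". Names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="my-index",
    size=10,
    query={
        "text_expansion": {
            "ml.tokens": {                     # rank_features field holding the expansion
                "model_id": ".elser_model_1",  # deployed ELSER model ID at time of writing
                "model_text": "how to improve zero-shot retrieval",
            }
        }
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))  # "title" is a placeholder field
```

Because text_expansion is an ordinary query clause, it composes with bool queries, filters, and boosts just like any other clause in the query DSL.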

We had another goal at the outset of this project. The natural language processing (NLP) field is fast moving: new architectures and training methodologies are introduced all the time. While some of our users will keep on top of the latest developments and want full control over the models they deploy, others simply want to consume a high-quality search product. By developing our own training pipeline, we have a playground for implementing and evaluating the latest ideas, such as new retrieval-relevant pre-training tasks or more effective distillation tasks, and making the best ones available to our users.

Finally, it is worth mentioning that we view this feature as complementary to the existing model deployment and vector search capabilities in the Elastic Stack, which are needed for more customized use cases such as cross-modal retrieval.

ELSER performance results

Before looking at the details of ELSER's architecture and how we trained it, it's worth reviewing the results we get, since ultimately the proof of the pudding is in the eating.

As we discussed before, we use a subset of BEIR to evaluate our performance. While this benchmark is by no means perfect and won't necessarily reflect how the model behaves on your own data, we at least found it challenging to improve on significantly, so we're confident that gains on the benchmark translate into real improvements in the model. Since absolute performance numbers on benchmarks aren't particularly informative by themselves, it's useful to compare against other strong baselines, which we do below.
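The comparisons below are reported in NDCG@10, which rewards rankings that place highly relevant documents near the top of the first ten results. As a refresher, here is a toy sketch of the computation; the relevance grades are invented for illustration, and real implementations differ in details such as the gain function (this sketch uses the simple linear-gain form).

```python
# Toy sketch of NDCG@10. Relevance grades below are made up for illustration;
# implementations vary (e.g., linear vs. exponential gain) -- this uses linear gain.
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (0-based ranks)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending-sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance grades of the top results returned for one query (illustrative):
ranking = [3, 2, 0, 1, 0, 0, 2, 0, 0, 0]
print(f"NDCG@10 = {ndcg_at_k(ranking):.3f}")
```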

The table below shows the performance of the Elastic Learned Sparse Encoder compared to Elasticsearch's BM25 with an English analyzer, broken down by the 12 data sets we evaluated. ELSER achieves 10 wins, 1 draw, and 1 loss, with an average improvement in NDCG@10 of 17%.
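For context, the baseline here is Elasticsearch's default BM25 scoring over a field indexed with the built-in english analyzer. The following is a minimal sketch of that style of setup, not the exact benchmark configuration; the index and field names are placeholders.

```python
# Sketch of a BM25 baseline: a text field indexed with Elasticsearch's
# built-in "english" analyzer, scored by BM25 (the default similarity).
# Index and field names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="bm25-baseline",
    mappings={
        "properties": {
            "text": {"type": "text", "analyzer": "english"}
        }
    },
)

# The baseline query is a plain match query against the english-analyzed field.
response = es.search(
    index="bm25-baseline",
    size=10,
    query={"match": {"text": "how to improve zero-shot retrieval"}},
)
```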