Finding similar articles with Elasticsearch

warren.boult
Published on April 16th 2020
Here at Candide, we have lots of really interesting articles and how-to guides in our Discover area, covering topics from the latest gardening news, to plant propagation tips, to suggestions for activities to do in your garden. As a user of the app, I would love to be able to browse and discover good articles without having to go looking for them, and I’m sure the editors of the article content would appreciate a less manual curation process. One way of achieving this could be to suggest similar articles...

More Like This

This is where a tool such as Elasticsearch’s More Like This query could come in handy. We use Elasticsearch for all our search-related stuff in the app, so it’s an ideal fit.
The More Like This query type finds documents similar to a given document or piece of text. You can configure which fields it searches on, specify an ‘unlike’ clause, and tweak other juicy bits to customise it for your use case. It decides what counts as a ‘similar’ document using TF-IDF: it picks out the terms with the highest TF-IDF scores in the input and then searches for other documents containing those terms. I won’t go into how TF-IDF scoring works here, but this website explains it well if you’re interested. Essentially, it is trying to find important words shared between documents as a measure of similarity.
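To make the "important words" idea a bit more concrete, here is a minimal Python sketch of textbook TF-IDF over three made-up sentences. It is only an illustration: the `tf_idf` helper and the toy documents are hypothetical, and the formula is the simple classroom version, not the exact scoring Elasticsearch uses internally.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document, using a basic TF-IDF formula."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        # A term scores highly when it is frequent here but rare elsewhere.
        scores.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return scores

# Toy documents, invented for this example.
docs = [
    "propagate garden plants from cuttings",
    "water garden plants every morning",
    "feed garden birds every winter",
]
scores = tf_idf(docs)

# Terms of the first document, most distinctive first.
for term in sorted(scores[0], key=scores[0].get, reverse=True):
    print(f"{term}: {scores[0][term]:.3f}")
```

A word that appears in every document, like "garden" here, gets a score of zero, while a word unique to one document, like "propagate", scores highest. Those high-scoring terms are the kind of thing a similarity search wants to match on.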

Let's try it!

For our use case of finding similar articles, we probably want to compare at least the title and the article body’s text. Articles by the same author may well be more likely to be similar too, so let’s throw the author in as well.
For our example, let’s use this story about plant propagation.
This gives us a query looking like this:
GET _search
{
  "query": {
    "more_like_this": {
      "fields": ["text", "title", "author"],
      "like": [
        {
          "_index": "story-gb-0",
          "_id": "story_a904ecbb-f243-4d72-82cd-65f22523432f"
        }
      ],
      "min_term_freq": 1
    }
  },
  "size": 10
}
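We have chosen the article text, title and author to compare on, and are limiting it to 10 results. Setting min_term_freq to 1 (the default is 2) means that terms appearing only once in the source article can still be used for matching.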
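If you would rather build this query in code than type it into a console, something like the following Python sketch could work. The `build_more_like_this_query` helper is hypothetical (it is not part of any Elasticsearch client), and it only constructs the JSON body; sending it to the `_search` endpoint is left to whichever client you use.

```python
import json

def build_more_like_this_query(index, doc_id, fields, size=10,
                               min_term_freq=1, unlike=None):
    """Build a more_like_this search body for one reference document."""
    mlt = {
        "fields": list(fields),
        # The reference document is looked up by index and id.
        "like": [{"_index": index, "_id": doc_id}],
        "min_term_freq": min_term_freq,
    }
    if unlike:
        # Optional: documents or text whose terms should be avoided.
        mlt["unlike"] = unlike
    return {"query": {"more_like_this": mlt}, "size": size}

body = build_more_like_this_query(
    index="story-gb-0",
    doc_id="story_a904ecbb-f243-4d72-82cd-65f22523432f",
    fields=["text", "title", "author"],
)
print(json.dumps(body, indent=2))
```

Keeping the fields, size and `min_term_freq` as arguments makes it easy to reuse the same builder for other entities later on.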
And here is a sample of the response:
{
  "took": 43,
  "timed_out": false,
  "_shards": {
    "total": 10,
    "successful": 10,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1621,
    "max_score": 48.61967,
    "hits": [
      {
        "_index": "story-gb-0",
        "_type": "_doc",
        "_id": "story_9b828568-5caa-4b56-83b4-b6fe3cf3ca7b",
        "_score": 48.61967,
        "_source": {
          "type": "story",
          "id": "9b828568-5caa-4b56-83b4-b6fe3cf3ca7b",
          "title": "Propagating From Nodes",
          "text": "Plant propagation is a way of getting new plants for free..."
        }
      }
    ]
  }
}
We can see the returned articles have a score used to rank them, and promisingly the top-scoring article mentions propagation. Let’s look at some of the articles that were retrieved:
We’ve got lots more articles about plant propagation! And it did better than just retrieving articles that also have the word propagation in the title. Not a bad start.
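For eyeballing results like these, it can help to boil the response down to just scores and titles. Here is a small Python sketch that does that; `summarise_hits` is a hypothetical helper, and `sample_response` is a trimmed copy of the response shown earlier, so it assumes each hit carries a `title` in its `_source`.

```python
def summarise_hits(response):
    """Return (score, id, title) for each hit, highest score first."""
    hits = response.get("hits", {}).get("hits", [])
    rows = [
        (hit["_score"], hit["_id"], hit.get("_source", {}).get("title", ""))
        for hit in hits
    ]
    return sorted(rows, key=lambda row: row[0], reverse=True)

# Trimmed version of the sample response from this article.
sample_response = {
    "hits": {
        "total": 1621,
        "max_score": 48.61967,
        "hits": [
            {
                "_index": "story-gb-0",
                "_id": "story_9b828568-5caa-4b56-83b4-b6fe3cf3ca7b",
                "_score": 48.61967,
                "_source": {"title": "Propagating From Nodes"},
            }
        ],
    }
}

for score, doc_id, title in summarise_hits(sample_response):
    print(f"{score:.2f}  {title}  ({doc_id})")
```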
Next up is to play around with the parameters some more, and maybe give it a go with some of the other entities in our app: plants, businesses, posts, and more!
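As a starting point for that experimentation, here is a rough Python sketch that generates one query body per combination of parameter values. The `parameter_variants` helper and the values in `grid` are made up for illustration; `min_term_freq` and `max_query_terms` are real More Like This parameters, but which values are worth trying is something you would need to find out on your own data.

```python
import copy
from itertools import product

def parameter_variants(base_body, grid):
    """Yield (settings, body) pairs, one per combination of values in grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        settings = dict(zip(names, values))
        # Copy so the base query is never modified.
        body = copy.deepcopy(base_body)
        body["query"]["more_like_this"].update(settings)
        yield settings, body

base_body = {
    "query": {
        "more_like_this": {
            "fields": ["text", "title", "author"],
            "like": [
                {
                    "_index": "story-gb-0",
                    "_id": "story_a904ecbb-f243-4d72-82cd-65f22523432f",
                }
            ],
            "min_term_freq": 1,
        }
    },
    "size": 10,
}

# Illustrative values only.
grid = {"min_term_freq": [1, 2], "max_query_terms": [12, 25]}
variants = list(parameter_variants(base_body, grid))
for settings, _ in variants:
    print(settings)
```

Each generated body can then be run and its top results compared side by side.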
