Just a few days ago elasticsearch released version 1.3.0 of their flagship product. Two features in this release stand out. The first is the long-awaited top hits aggregation. Basically this is what is usually called grouping: you group items based on one characteristic, and within each group you get the best matching result(s) based on score. The second is the improved support for scripts, with better security options through sandboxed script languages.
In this blog post I am going to explain and show the top_hits feature as well as the new scripting support.
Top hits
I am going to show a very simple example of top hits using my music index. This index contains all the songs I have in my iTunes library. The first step is to find songs by genre. The following request has no query of its own, so it returns the default 10 hits based on the match_all query, together with the terms aggregation as requested.
GET /mymusic/_search
{
  "aggs": {
    "byGenre": {
      "terms": {
        "field": "genre",
        "size": 10
      }
    }
  }
}
The response is of the format:
{
  "hits": {},
  "aggregations": {
    "byGenre": {
      "buckets": [
        {"key":"rock","doc_count":1910},
        ...
      ]
    }
  }
}
Now we add a query to the request: songs containing the word love in the title.
GET /mymusic/_search
{
  "query": {
    "match": {
      "name": "love"
    }
  },
  "aggs": {
    "byGenre": {
      "terms": {
        "field": "genre",
        "size": 10
      }
    }
  }
}
Now we have fewer hits, but still a number of buckets, each with the number of songs in that bucket that match our query. The biggest change is the score of the returned hits. In the previous query the score was always 1; now the score differs per hit because of the query we execute. The highest score now goes to the song Love by The Mission. The genre for this song is Rock and the song is from the year 1990. Time to introduce the top hits aggregation. With the following query we can return the top song containing the word love in the title per genre.
GET /mymusic/_search
{
  "query": {
    "match": {
      "name": "love"
    }
  },
  "aggs": {
    "byGenre": {
      "terms": {
        "field": "genre",
        "size": 5
      },
      "aggs": {
        "topFoundHits": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
Again we get hits, but they are no different from the previous query. The interesting part is in the aggs part. Here we add a sub aggregation to the byGenre aggregation. This aggregation is called topFoundHits and is of type top_hits. With a size of 1 it returns only the best hit per genre. The next code block shows the part of the response with the top hits. I removed most of the content of the _source field in the top_hits to keep the response short.
{
  "took": 4,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "failed": 0 },
  "hits": { "total": 141, "max_score": 0, "hits": [] },
  "aggregations": {
    "byGenre": {
      "buckets": [
        {
          "key": "rock",
          "doc_count": 52,
          "topFoundHits": {
            "hits": {
              "total": 52,
              "max_score": 4.715253,
              "hits": [
                {
                  "_index": "mymusic", "_type": "itunes", "_id": "4147",
                  "_score": 4.715253,
                  "_source": { "name": "Love", }
                }
              ]
            }
          }
        },
        {
          "key": "pop",
          "doc_count": 39,
          "topFoundHits": {
            "hits": {
              "total": 39,
              "max_score": 3.3341873,
              "hits": [
                {
                  "_index": "mymusic", "_type": "itunes", "_id": "11381",
                  "_score": 3.3341873,
                  "_source": { "name": "Love To Love You", }
                }
              ]
            }
          }
        },
        {
          "key": "alternative",
          "doc_count": 12,
          "topFoundHits": {
            "hits": {
              "total": 12,
              "max_score": 4.1945505,
              "hits": [
                {
                  "_index": "mymusic", "_type": "itunes", "_id": "7889",
                  "_score": 4.1945505,
                  "_source": { "name": "Love Love Love", }
                }
              ]
            }
          }
        },
        {
          "key": "b",
          "doc_count": 9,
          "topFoundHits": {
            "hits": {
              "total": 9,
              "max_score": 3.0271564,
              "hits": [
                {
                  "_index": "mymusic", "_type": "itunes", "_id": "2549",
                  "_score": 3.0271564,
                  "_source": { "name": "First Love", }
                }
              ]
            }
          }
        },
        {
          "key": "r",
          "doc_count": 7,
          "topFoundHits": {
            "hits": {
              "total": 7,
              "max_score": 3.0271564,
              "hits": [
                {
                  "_index": "mymusic", "_type": "itunes", "_id": "2549",
                  "_score": 3.0271564,
                  "_source": { "name": "First Love", }
                }
              ]
            }
          }
        }
      ]
    }
  }
}
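Did you notice a problem with my analyser for genre? Hint: R&B! The genre ends up as the two separate terms r and b, which is why both of those buckets show the same song.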
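To make the grouping idea a bit more concrete, here is a minimal Python sketch of what a terms aggregation with a top_hits sub aggregation conceptually does. It is not how elasticsearch implements it. The function names, the sample songs and their scores are made up for illustration, the `analyze` function is only a rough stand-in for a real analyser, and breaking ties on the term is my own choice to keep the output stable.

```python
import re
from collections import defaultdict


def analyze(text):
    """Rough stand-in for an analyser: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def terms_with_top_hits(hits, field, buckets=5, top=1):
    """Group scored hits per term of `field`, keeping the best `top` hits per bucket."""
    grouped = defaultdict(list)
    for hit in hits:
        # A document lands in one bucket per distinct term of the field.
        for term in set(analyze(hit[field])):
            grouped[term].append(hit)
    # Order buckets by document count (descending); ties are broken on the term.
    ordered = sorted(grouped.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [
        {
            "key": term,
            "doc_count": len(docs),
            "top_hits": sorted(docs, key=lambda h: -h["score"])[:top],
        }
        for term, docs in ordered[:buckets]
    ]


# Hypothetical hits, as if returned by the match query on "love".
hits = [
    {"name": "Love", "genre": "Rock", "score": 4.7},
    {"name": "Love Hurts", "genre": "Rock", "score": 3.0},
    {"name": "Love To Love You", "genre": "Pop", "score": 3.3},
    {"name": "First Love", "genre": "R&B", "score": 3.0},
]

for bucket in terms_with_top_hits(hits, "genre"):
    print(bucket["key"], bucket["doc_count"], bucket["top_hits"][0]["name"])
```

In this sketch the R&B song shows up as the best hit of both the r bucket and the b bucket, which mirrors the analyser problem in the response above.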
More information on the top_hits aggregation can be found in the elasticsearch documentation.
Scripting
Elasticsearch has had support for scripts for a long time. The default scripting language was and is mvel up to version 1.3. It will change to groovy in 1.4. Mvel is not a well known scripting language. Its biggest advantage is that it is very powerful. The disadvantage is that it is too powerful. Mvel does not come with a sandbox. Therefore it is possible to write some very nasty scripts, even when only doing a query. This was very well shown by a colleague of mine (Byron Voorbach), who created a query to read private keys on the machines of developers who did not safeguard their elasticsearch instance. Therefore dynamic scripting was switched off by default in version 1.2.
This came with a very big disadvantage: it was no longer possible to use scripts in the function_score query without resorting to scripts stored on the server. In version 1.3 of elasticsearch a much better way is introduced. Now you can use sandboxed scripting languages like groovy to keep using the flexible approach. The groovy sandbox can be configured to specify which object creations and method calls are allowed. More information about this is provided in the elasticsearch documentation about scripting.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
Next is an example query against my music index. This index contains all the songs from my music library. The query selects all the songs from after the year 1999 and calculates the score based on the year, so the newest songs get the highest score. And yes, I know a sort by year desc would have given the same result.
GET mymusic/_search
{
  "query": {
    "function_score": {
      "query": {
        "range": {
          "year": {
            "gte": 2000
          }
        }
      },
      "functions": [
        {
          "script_score": {
            "lang": "groovy",
            "script": "_score * doc['year'].value"
          }
        }
      ]
    }
  }
}
The scores now become high. Since we do a range query, every hit starts with a score of one. The function_score multiplies that score by the year, so the end score is the year itself. I added the year as the only field to return. Some of the results then are:
{
  "took": 4,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "failed": 0 },
  "hits": {
    "total": 2895,
    "max_score": 2014,
    "hits": [
      {
        "_index": "mymusic", "_type": "itunes", "_id": "12965",
        "_score": 2014,
        "fields": { "year": [ "2014" ] }
      },
      {
        "_index": "mymusic", "_type": "itunes", "_id": "12975",
        "_score": 2014,
        "fields": { "year": [ "2014" ] }
      }
    ]
  }
}
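The arithmetic behind these scores can be sketched in a few lines of Python. This is only an illustration under two assumptions: the range query gives every matching song a score of 1.0, and the function result is multiplied with the query score. The function names and the three sample songs are hypothetical.

```python
def script_score(query_score, doc):
    """Mirrors the script `_score * doc['year'].value`."""
    return query_score * doc["year"]


def function_score(docs, min_year=2000):
    """Score all docs with year >= min_year, highest score first."""
    results = []
    for doc in docs:
        if doc["year"] >= min_year:
            # Assumption: the range query scores every match with 1.0.
            query_score = 1.0
            # Assumption: query score and function result are multiplied.
            final_score = query_score * script_score(query_score, doc)
            results.append((final_score, doc))
    return sorted(results, key=lambda r: -r[0])


# Hypothetical documents, not taken from the real index.
songs = [
    {"name": "A", "year": 1990},
    {"name": "B", "year": 2014},
    {"name": "C", "year": 2005},
]

for score, doc in function_score(songs):
    print(score, doc["name"])
```

The song from 1990 is filtered out by the range, and the remaining songs get their year as score, just like the max_score of 2014 in the response above.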
Next up is the last sample, a combination of top_hits and scripting.
Top hits with scripting
We start with the sample from the top hits section, again using my music index. Now we want to sort the buckets on the score of the best matching document in the bucket. By default they are sorted on the number of documents in the bucket. As mentioned in the documentation, you need a trick to do this.
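The top_hits aggregator isn’t a metric aggregator and therefor can’t be used in the order option of the terms aggregator. The trick is to add a second sub aggregation, a max aggregation (called best_hit in the query below) that uses a script to return the score, and to order the buckets on that.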
GET /mymusic/_search?search_type=count
{
  "query": {
    "match": {
      "name": "love"
    }
  },
  "aggs": {
    "byGenre": {
      "terms": {
        "field": "genre",
        "size": 5,
        "order": {
          "best_hit":"desc"
        }
      },
      "aggs": {
        "topFoundHits": {
          "top_hits": {
            "size": 1
          }
        },
        "best_hit": {
          "max": {
            "lang": "groovy",
            "script": "doc.score"
          }
        }
      }
    }
  }
}
The results of this query, again with most of the _source taken out, follow. Compare them to the results in the top hits section. Notice the different genres that we get back now. Also check the scores.
{
  "took": 4,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "failed": 0 },
  "hits": { "total": 141, "max_score": 0, "hits": [] },
  "aggregations": {
    "byGenre": {
      "buckets": [
        {
          "key": "rock",
          "doc_count": 37,
          "topFoundHits": {
            "hits": {
              "total": 37,
              "max_score": 4.715253,
              "hits": [
                {
                  "_index": "mymusic", "_type": "itunes", "_id": "4147",
                  "_score": 4.715253,
                  "_source": { "name": "Love", }
                }
              ]
            }
          },
          "best_hit": { "value": 4.715252876281738 }
        },
        {
          "key": "alternative",
          "doc_count": 12,
          "topFoundHits": {
            "hits": {
              "total": 12,
              "max_score": 4.1945505,
              "hits": [
                {
                  "_index": "mymusic", "_type": "itunes", "_id": "7889",
                  "_score": 4.1945505,
                  "_source": { "name": "Love Love Love", }
                }
              ]
            }
          },
          "best_hit": { "value": 4.194550514221191 }
        },
        {
          "key": "punk",
          "doc_count": 3,
          "topFoundHits": {
            "hits": {
              "total": 3,
              "max_score": 4.1945505,
              "hits": [
                {
                  "_index": "mymusic", "_type": "itunes", "_id": "7889",
                  "_score": 4.1945505,
                  "_source": { "name": "Love Love Love", }
                }
              ]
            }
          },
          "best_hit": { "value": 4.194550514221191 }
        },
        {
          "key": "pop",
          "doc_count": 24,
          "topFoundHits": {
            "hits": {
              "total": 24,
              "max_score": 3.3341873,
              "hits": [
                {
                  "_index": "mymusic", "_type": "itunes", "_id": "11381",
                  "_score": 3.3341873,
                  "_source": { "name": "Love To Love You", }
                }
              ]
            }
          },
          "best_hit": { "value": 3.3341872692108154 }
        },
        {
          "key": "b",
          "doc_count": 7,
          "topFoundHits": {
            "hits": {
              "total": 7,
              "max_score": 3.0271564,
              "hits": [
                {
                  "_index": "mymusic", "_type": "itunes", "_id": "2549",
                  "_score": 3.0271564,
                  "_source": { "name": "First Love", }
                }
              ]
            }
          },
          "best_hit": { "value": 3.027156352996826 }
        }
      ]
    }
  }
}
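Why do other genres show up once the order changes? A small Python sketch may help. It assumes that the buckets are first ordered and only then cut off at the requested size, so a small genre with one very good hit can push out a bigger genre. The function name, the sample hits and the scores are made up, and the genre is treated as a single term to keep things simple.

```python
from collections import defaultdict


def genre_buckets(hits, size, order="doc_count"):
    """Bucket hits per genre and keep the first `size` buckets in the given order."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["genre"]].append(hit)
    buckets = [
        {
            "key": genre,
            "doc_count": len(docs),
            # Plays the role of the max aggregation on the score.
            "best_hit": max(h["score"] for h in docs),
        }
        for genre, docs in grouped.items()
    ]
    # Sort descending on the chosen value, then cut off at `size`.
    buckets.sort(key=lambda b: b[order], reverse=True)
    return buckets[:size]


# Hypothetical scored hits; genre is already a single term here.
hits = [
    {"genre": "pop", "score": 3.3},
    {"genre": "pop", "score": 2.1},
    {"genre": "pop", "score": 1.9},
    {"genre": "rock", "score": 4.7},
    {"genre": "rock", "score": 3.0},
    {"genre": "punk", "score": 4.2},
]

print([b["key"] for b in genre_buckets(hits, size=2)])
print([b["key"] for b in genre_buckets(hits, size=2, order="best_hit")])
```

Ordered on document count, the two buckets in this sketch are pop and rock. Ordered on best_hit they are rock and punk, which is the same kind of shift as punk entering the real response above.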
This is just a first introduction to top_hits and scripting. Stay tuned for more blog posts around these topics.