Sometimes you are indexing data and have little influence, or even no influence at all, on the input. Still, you need to make changes: you want different content or other fields, or maybe you even want to remove fields. Elasticsearch 1.3 introduced a new feature called Transform. In this blog post I am going to show some aspects of this new feature.

Insert the document with the problem

The input we get comes from a system that puts the string null in a field if it is empty. We do not want null as a string in the elasticsearch index. Therefore we want to remove this field completely when indexing such a document. We start with the example and the proof that you can search on the field.

PUT /transform/simple/1
{
  "title":"This is a document with text",
  "description":"null"
}

Now search for the word null in the description field, for example with a match query.

For completeness I’ll show you the response as well.

Response:
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.30685282,
      "hits": [
         {
            "_index": "transform",
            "_type": "simple",
            "_id": "1",
            "_score": 0.30685282,
            "_source": {
               "title": "This is a document with text",
               "description": "null"
            }
         }
      ]
   }
}

Change mapping to contain transform

Next we are going to use the transform functionality to remove the field if it contains the string null. To do that we need to delete the index and create it again with a mapping that contains the transform. We use the Groovy language for the script. Beware that the script is only validated when the first document is inserted.

PUT /transform
{
  "mappings": {
    "simple": {
      "transform": {
        "lang":"groovy",
        "script":"if (ctx._source['description']?.equals('null')) ctx._source['description'] = null"
      },
      "properties": {
        "title": {
          "type": "string"
        },
        "description": {
          "type": "string"
        }
      }
    }
  }
}

When we insert the same document as before and execute the same query, we get no hits. The description field is no longer indexed. An important aspect is that the actual _source is not changed: when requesting the _source of the document you still get back the original document.

GET transform/simple/1/_source
Response:
{
   "title": "This is a document with text",
   "description": "null"
}
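
To make the difference between the stored _source and the indexed view a bit more tangible, here is a minimal sketch in plain Python. It does not talk to elasticsearch at all; the function name transform and the variable names are made up for illustration, and the code only mimics what I understand the Groovy script in the mapping to do.

```python
import copy


def transform(source):
    """Mimic the Groovy transform from the mapping on a copy of the document."""
    # Work on a copy: the original document (the _source) is never modified.
    ctx_source = copy.deepcopy(source)
    # Same check as the script; a missing description simply does not match.
    if ctx_source.get("description") == "null":
        ctx_source["description"] = None  # a null value is not indexed
    return ctx_source


original = {"title": "This is a document with text", "description": "null"}
indexed_view = transform(original)

print(original["description"])      # null (the string, as returned by _source)
print(indexed_view["description"])  # None (nothing left to search on)
```

The copy is the important part of the sketch: the search sees the transformed version, while a request for the _source returns what was originally sent.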

Add a field to the mapping

To add a bit more complexity, we add a field called nullField which will contain the name of the field that was null. Not very useful, but it serves to show the possibilities.

PUT /transform
{
  "mappings": {
    "simple": {
      "transform": {
        "lang":"groovy",
        "script":"if (ctx._source['description']?.equals('null')) {ctx._source['description'] = null;ctx._source['nullField'] = 'description';}"
      },
      "properties": {
        "title": {
          "type": "string"
        },
        "description": {
          "type": "string"
        },
        "nullField": {
          "type": "string"
        }
      }
    }
  }
}

Notice that the script has changed: not only do we remove the description field, we now also add a new field called nullField. Check that the _source is still unchanged. Now we do a search and return only the fields description and nullField. Before scrolling to the response, think about the response that you would expect.

GET /transform/_search
{
  "query": {
    "match_all": {}
  },
  "fields": ["nullField","description"]
}

Did you really think about it? Try it out and notice that the nullField is not returned. That is because we did not store it in the index, and it cannot be obtained from the _source either, since the _source still holds the original document. So if we really need this value, we can store the nullField in the index and we are fine.

PUT /transform
{
  "mappings": {
    "simple": {
      "transform": {
        "lang":"groovy",
        "script":"if (ctx._source['description']?.equals('null')) {ctx._source['description'] = null;ctx._source['nullField'] = 'description';}"
      },
      "properties": {
        "title": {
          "type": "string"
        },
        "description": {
          "type": "string"
        },
        "nullField": {
          "type": "string",
          "store": "yes"
        }
      }
    }
  }
}
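
Why storing the field makes a difference can be sketched in a few lines of plain Python. This is a simplified model, not the real implementation: the function fetch_fields and its arguments are hypothetical, and it assumes that a requested field is taken from the stored fields when available and otherwise looked up in the original _source, which is consistent with the search responses in this post.

```python
def fetch_fields(requested, stored_fields, source):
    """Simplified model of the fields lookup: stored value first, else _source."""
    result = {}
    for name in requested:
        if name in stored_fields:
            # Stored fields hold the transformed value that went into the index.
            result[name] = [stored_fields[name]]
        elif name in source:
            # Fallback: the original, untransformed document.
            result[name] = [source[name]]
    return result


source = {"title": "This is a document with text", "description": "null"}
requested = ["nullField", "description"]

# nullField not stored: it only exists in the transformed document, so it is lost.
not_stored = fetch_fields(requested, {}, source)
# nullField stored: the value created by the transform can be returned.
stored = fetch_fields(requested, {"nullField": "description"}, source)

print(not_stored)  # {'description': ['null']}
print(stored)      # {'nullField': ['description'], 'description': ['null']}
```

Note that description still comes back as the string null in both cases, because it is read from the original _source.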

Then, with the match all query for two fields, we get the following response.

GET /transform/_search
{
  "query": {
    "match_all": {}
  },
  "fields": ["nullField","description"]
}
Response:
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "transform",
            "_type": "simple",
            "_id": "1",
            "_score": 1,
            "fields": {
               "description": [
                  "null"
               ],
               "nullField": [
                  "description"
               ]
            }
         }
      ]
   }
}

Yes, now we do have the new field. That is it. But wait, there is more you need to know: there is a way to check what is actually passed to the index for a certain document.

GET transform/simple/1?pretty&_source_transform
Result:
{
   "_index": "transform",
   "_type": "simple",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "description": null,
      "nullField": "description",
      "title": "This is a document with text"
   }
}

Notice the null description and the nullField in the _source.

Final remark

You cannot update the transform part of the mapping. Think about what would happen to your index if some documents passed through version 1 of the transform and others through version 2.
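
The problem with mixed transform versions can be illustrated with a small sketch in plain Python. The functions transform_v1 and transform_v2 are hypothetical stand-ins for the two scripts used in this post, and the list named index is just a toy replacement for a real index.

```python
def transform_v1(doc):
    """First script: only blank out a description containing the string null."""
    out = dict(doc)
    if out.get("description") == "null":
        out["description"] = None
    return out


def transform_v2(doc):
    """Second script: also record which field was null."""
    out = transform_v1(doc)
    if doc.get("description") == "null":
        out["nullField"] = "description"
    return out


doc = {"title": "This is a document with text", "description": "null"}

# The same input, indexed once before and once after the script changed.
index = [transform_v1(doc), transform_v2(doc)]

# A search on nullField only finds the document indexed with the newer script.
matches = [d for d in index if d.get("nullField") == "description"]
print(len(matches))  # 1
```

Two identical documents now behave differently in a search, which is exactly the kind of inconsistency you would get from changing the transform on a live index.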

I would be careful with this feature: try to solve the problem before sending the data to elasticsearch. But maybe you have just the use case for this feature, and now you know it exists.

In my next blog post I will dive a little deeper into the scripting module.
