Learning about ElasticSearch

This week I attended a training at Trifork Amsterdam given by Martijn and Uri from ElasticSearch. The training was a very nice in-depth look at the capabilities of ElasticSearch. As with all trainings and conferences, I got motivated to try out the technology immediately. I am working on a plugin for ElasticSearch using AngularJS. More on this in a future blog post at Trifork.

In this blog post I describe the steps I took to get going with ElasticSearch. I show some of the steps for installing ElasticSearch and explain the first steps in using the Groovy library. I also show how to index this WordPress blog using ElasticSearch. Once everything is indexed I will of course show some of the queries you can perform.

Let us get going.

Installing and configuring

I do not want to make this part too extensive. The documentation on the ElasticSearch site is thorough, so extra lines from me would be too much. Some of the things I want to accomplish in the configuration are:

Start ElasticSearch in the foreground, with the logs in your command prompt: -f
Specify the location of the configuration file: -Des.config=/path/to/config/file

Change the name of the cluster and the node: cluster.name: jc-elasticsearch and node.name: "Node-gridshore"
Use a single shard and no replicas for local development: index.number_of_shards: 1 and index.number_of_replicas: 0
Set the path to the location of the configuration files: path.conf: /path/to/conf
Set the other paths, such as the work folder, logs folder, data folder and plugin folder.

Install the head plugin to have a look at what is happening in your cluster. I chose to install the plugin in the default directory and then copy it to my configured plugins folder.

Don’t forget to install and use Java 7; all that Java 6 stuff is really old and slow compared to 7. ElasticSearch can do a better job when running on 7.

Those are the first commands to get everything running, so once more the steps:

Download ElasticSearch: http://www.elasticsearch.org/download/. I used 0.20.2
Create a directory structure outside of the downloaded ElasticSearch folder: config, data, logs, plugins, work.
Copy elasticsearch.yml and logging.yml to the config folder and make the changes as mentioned.
Install the plugin and copy it to your plugins folder: bin/plugin -install Aconex/elasticsearch-head
Move to the main folder of your ElasticSearch download and execute the command (with your location of the config file):

bin/elasticsearch -f -Des.config=/.../elasticsearch/projects/gridshore/config/elasticsearch.yml

Now you can browse to the head plugin and see that not a lot has happened yet.

http://localhost:9200/_plugin/head/index.html

Before going to the index creation you can play around with some aspects of the REST API that ElasticSearch exposes. Try a few of these requests with curl to see the data that ElasticSearch returns. Don’t use pretty=true in production.

http://localhost:9200/_cluster/state?pretty=true
http://localhost:9200/_cluster/nodes?pretty=true
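
If you prefer to stay in Groovy instead of curl, the same requests can be sketched as below. This is only a minimal sketch: the helper names endpointUrl and fetchJson are my own, and the host and port are assumed to be the local defaults used in this post.

```groovy
import groovy.json.JsonSlurper

// Hypothetical helper: builds the URL for one of the REST endpoints above.
// Host and port default to a local node, which is an assumption.
String endpointUrl(String path, boolean pretty = true, String host = 'localhost', int port = 9200) {
    String url = "http://${host}:${port}/${path}"
    return pretty ? url + '?pretty=true' : url
}

// Hypothetical helper: fetches an endpoint and parses the JSON into maps and lists.
def fetchJson(String path) {
    def text = new URL(endpointUrl(path, false)).getText('UTF-8')
    return new JsonSlurper().parseText(text)
}

// Example usage (requires a running node):
// def state = fetchJson('_cluster/state')
// println state.keySet()
```

The parsed result is an ordinary Groovy map, so you can explore it with keySet() and property access instead of reading the raw JSON.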

Reading the data using Groovy

Of course you can create the index and the required mapping using the curl-based API. You can also just insert a document; the index and a default mapping will then be created automatically. In our case I first want to introduce the Groovy client I use to read data from the blog over XML-RPC, along with the data model. Then we move on to the mapping.

In the sample code I use Gradle to configure the project and my environment (IntelliJ).

dependencies {
    groovy 'org.codehaus.groovy:groovy:2.0.6'
    groovy 'org.codehaus.groovy:groovy-all:2.0.6'
    groovy ('org.codehaus.groovy:groovy-xmlrpc:0.8') {
        exclude module: 'groovy-all'
    }
    groovy 'commons-cli:commons-cli:1.2'
    groovy 'log4j:log4j:1.2.16'

    groovy ('org.elasticsearch:elasticsearch-lang-groovy:1.2.0') {
        exclude module: 'groovy-all'
    }
}

There is something strange going on with the groovy and groovy-all libraries, but I had some issues combining IntelliJ and Gradle and this seems to work for me. Notice that we need the groovy-xmlrpc library and of course the elasticsearch-lang-groovy library.

Using the following Groovy class we can read all posts from the gridshore blog. Since the code is straightforward, I am not going into any details.

package nl.gridshore.wordpress

import groovy.net.xmlrpc.XMLRPCServerProxy

class WordPressReader {
    private String xmlrpcUrl
    private String username
    private String password

    private XMLRPCServerProxy serverProxy

    WordPressReader(xmlrpcUrl, username, password) {
        this.xmlrpcUrl = xmlrpcUrl
        this.username = username
        this.password = password

        serverProxy = new XMLRPCServerProxy(xmlrpcUrl)
        serverProxy.setBasicAuth(username, password)
    }

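    // Fetch the most recent posts through the MetaWeblog API and map them to BlogItem objects
    def obtainMostRecentPosts(int number = 10) {
        def posts = []
        // Arguments: blog id, username, password, maximum number of posts
        def foundPosts = serverProxy.metaWeblog.getRecentPosts(1, username, password, number)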
        foundPosts.each {post ->
            def blogItem = new BlogItem()
            blogItem.id = post['postid']
            blogItem.link = post['permaLink']
            blogItem.status = post['post_status']
            blogItem.keywords = post['mt_keywords']
            blogItem.title = post['title']
            blogItem.createdOn = post['dateCreated']
            blogItem.content = post['description']
            blogItem.categories = post['categories']
            blogItem.author = post['wp_author_display_name']
            blogItem.slug = post['wp_slug']
            posts.add(blogItem)
        }
        return posts
    }
}
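
The BlogItem class used by the reader is not shown here. A minimal sketch of what it could look like follows; the property names are taken from the assignments in obtainMostRecentPosts, but the types are my assumption, so check the sources on GitHub for the real class.

```groovy
// Assumed data holder for one blog post; it would live in package nl.gridshore.wordpress.
// Property names mirror the assignments in WordPressReader, the types are guesses.
class BlogItem {
    def id            // post id as returned by the XML-RPC call
    String link       // permalink of the post
    String status     // for example the publication status
    String keywords   // tags of the post
    String title
    def createdOn     // creation date as returned by the XML-RPC call
    String content
    def categories    // collection of category names
    String author
    String slug
}
```

Because Groovy generates the getters and setters, the reader can simply assign the properties one by one as it does above.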

Time to put some stuff into ElasticSearch.

Put some data into the index

The Groovy library is very easy to use. The following code block shows the opening and closing of the connection. We make use of a client node that does not contain data. Take special notice of the line where we set a property on GXContentBuilder; this is required to enable the configuration using a closure. The constructor opens the connection, and you have to close it yourself using the close method.

package nl.gridshore.elasticsearch

import nl.gridshore.wordpress.BlogItem
import org.elasticsearch.action.index.IndexResponse
import org.elasticsearch.groovy.common.xcontent.GXContentBuilder
import org.elasticsearch.groovy.node.GNode
import org.elasticsearch.groovy.node.GNodeBuilder

import static org.elasticsearch.groovy.node.GNodeBuilder.nodeBuilder

class ElasticSearchGateway {
    GNode node

    ElasticSearchGateway() {
        GXContentBuilder.rootResolveStrategy = Closure.DELEGATE_FIRST // required to use a closure as settings

        // Client node: joins the cluster but holds no data
        GNodeBuilder builder = nodeBuilder()
        builder.settings {
            node {
                client = true
            }

            cluster {
                name = "jc-elasticsearch"
            }
        }

        node = builder.node()
    }

    public close() {
        node.stop().close()
    }
}

Now that we know how to obtain a connection, let us obtain some data and store the documents in the ElasticSearch index. The next method also comes from the ElasticSearchGateway class that I have created. This is the most basic version, which auto-creates an index and the complete mapping.

    public indexBlogItem(BlogItem blogItem) {
        def future = node.client.index {
            index = "gridshore"
            type = "blog"
            source {
                blogId = blogItem.id
                link = blogItem.link
                status = blogItem.status
                keywords = blogItem.keywords
                title = blogItem.title
                createdOn_date = blogItem.createdOn
                content = blogItem.content
                categories = blogItem.categories
                author = blogItem.author
                slug = blogItem.slug
            }
        }

        future.success = {IndexResponse response ->
            println "Indexed $response.index/$response.type/$response.id"
        }
    }
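
To make the shape of the indexed document explicit, here is a minimal sketch that builds the same fields as a plain map and prints them as JSON. The helper name toSource is my own and not part of the Groovy client; it only illustrates which field names end up in the document, for example that the post id is stored as blogId and the creation date as createdOn_date.

```groovy
import groovy.json.JsonOutput

// Hypothetical helper (not part of the ElasticSearch client): builds the same
// fields that the source closure in indexBlogItem sets, as a plain map.
Map toSource(blogItem) {
    return [
        blogId        : blogItem.id,
        link          : blogItem.link,
        status        : blogItem.status,
        keywords      : blogItem.keywords,
        title         : blogItem.title,
        createdOn_date: blogItem.createdOn,
        content       : blogItem.content,
        categories    : blogItem.categories,
        author        : blogItem.author,
        slug          : blogItem.slug
    ]
}

// Works with a BlogItem or, as here, with a map that has the same keys.
// The sample values are made up for illustration.
def sample = [id: '1', title: 'Example post', categories: ['groovy']]
println JsonOutput.prettyPrint(JsonOutput.toJson(toSource(sample)))
```

Printing the document like this is a quick way to check the field names before looking at the generated mapping in the head plugin.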

With the following script we obtain 100 items from my blog and add them to the ElasticSearch index.

import nl.gridshore.elasticsearch.ElasticSearchGateway
import nl.gridshore.wordpress.BlogItem
import nl.gridshore.wordpress.WordPressReader

def rpcUrl = "https://www.gridshore.nl/xmlrpc.php"
def username = "?"
def password = "?"

def reader = new WordPressReader(rpcUrl,username,password)

def posts = reader.obtainMostRecentPosts(100)

ElasticSearchGateway gateway = new ElasticSearchGateway()

posts.each {BlogItem item ->
    println item.title
    gateway.indexBlogItem(item)
}

// Indexing is asynchronous: wait for Enter so the responses can arrive before the node is closed
System.in.withReader {
    print 'input: '
    println it.readLine()
}

gateway.close()

Execute queries on the data

Now that I told you I have data in the index, let us create a query to check if there is actually something in the index. Of course we use groovy to query the index and to print the results.

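    // requires: import org.elasticsearch.search.SearchHit
    public queryIndex(theTerm) {
        def search = node.client.search {
            // Method calls on the request; written with a colon, Groovy would treat
            // these lines as labels and silently ignore the restriction
            indices "gridshore"
            types "blog"
            source {
                query {
                    term(_all: theTerm)
                }
            }
        }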

        search.response.hits.each {SearchHit hit ->
            println "Got hit $hit.id from $hit.index/$hit.type with title $hit.source.title"
        }
    }
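
The source closure above is turned into a JSON request body by the Groovy client. As a rough sketch of what that body looks like, the snippet below builds the equivalent JSON by hand; the helper name termQueryJson is my own. Keep in mind that a term query is not analyzed, so with the default analyzer you should pass the term in lowercase.

```groovy
import groovy.json.JsonOutput

// Hypothetical helper: builds the JSON body of a term query on the given field.
String termQueryJson(String field, String term) {
    return JsonOutput.toJson([query: [term: [(field): term]]])
}

// Show the query that matches the closure in queryIndex
println JsonOutput.prettyPrint(termQueryJson('_all', 'groovy'))
```

You could send this body with curl to the _search endpoint of the gridshore index to compare the results with the Groovy client.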

Running the following script gives the results shown below it.

import nl.gridshore.elasticsearch.ElasticSearchGateway

ElasticSearchGateway gateway = new ElasticSearchGateway()

gateway.queryIndex("groovy")

System.in.withReader {
    print 'input: '
    println it.readLine()
}

gateway.close()

Got hit Huib3BPiTEua7Yvy6kc83w from gridshore/blog with title Doing more with groovy
Got hit lp79gd1UQ6mSdp3yp7O8Jg from gridshore/blog with title Analyzing beet results with groovy
Got hit CoRdYwD9ScyNnocqDvLGYg from gridshore/blog with title Cleaning up your maven repository with groovy
Got hit _U2q3miSRn6dTdUJB9PNYA from gridshore/blog with title Cleaning up artifactory with a groovy script
Got hit StrcQtnwTJSIvlu0pYojpA from gridshore/blog with title Exposing jmx through jmxmp and reading the jmx data with groovy
Got hit iQz0QFtYT3af9-Rls4d8Zw from gridshore/blog with title Use Grails and Axon to create a CQRS application (part II)
Got hit fajgAgrgST6AnNwCeyyocA from gridshore/blog with title Recap of the year 2010
Got hit V6h65a_rQ2mOvpE-POTF4Q from gridshore/blog with title Use Grails and Axon to create a CQRS application (part I)
Got hit xdi2KN9zQf2JD8qdbhBmdQ from gridshore/blog with title Using the NOS open data API with the springframework and jackson
Got hit DOiahRyMS0KUbg9e6BQHpQ from gridshore/blog with title Doing grails, yes I like it

The last function I want to show is counting documents. The following code block shows how to count all documents of type blog in the gridshore index. Given the number of blog items we imported, the result should be obvious.

    public countAllDocuments() {
        def count = node.client.count {
            // Restrict the count to the gridshore index and the blog type
            indices "gridshore"
            types "blog"
        }

        println "Number of found blog items : $count.response.count"
    }

Number of found blog items : 100

This is it. You can find the sources online on GitHub. Check my GridshoreSamples project; in there you’ll find a small project called groovy-es-client.

https://github.com/jettro/GridshoreSamples
