Scheduled bulk data loading to Elasticsearch + Kibana 5 from CSV files

This article shows how to:

  • Bulk load CSV files to Elasticsearch.
  • Visualize the data with Kibana interactively.
  • Schedule the data loading every hour using cron.

This guide assumes you are using Ubuntu 16.10 or macOS.

Set up Elasticsearch and Kibana 5

Step 1. Download and start Elasticsearch.

You can find releases on the Elasticsearch website. For the smallest setup, unzip the package and run the ./bin/elasticsearch command:

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.3.0.zip
unzip elasticsearch-5.3.0.zip
cd elasticsearch-5.3.0
./bin/elasticsearch

Step 2. Download and start Kibana:

You can find releases on the Kibana website. Open a new console and run the following commands:

wget https://artifacts.elastic.co/downloads/kibana/kibana-5.3.0-linux-x86_64.tar.gz
tar zxvf kibana-5.3.0-linux-x86_64.tar.gz
cd kibana-5.3.0-linux-x86_64
./bin/kibana

Note: If you’re using macOS, download https://artifacts.elastic.co/downloads/kibana/kibana-5.3.0-darwin-x86_64.tar.gz instead.

Now Elasticsearch and Kibana are running. Open http://localhost:5601/ in your browser to see Kibana’s graphical interface.

Set up Embulk

Step 1. Download Embulk binary:

You can find the latest Embulk binary on GitHub Releases. Because Embulk is a single executable binary, you can simply download it to the ~/.embulk/bin directory and set the executable flag as follows:

curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar"
chmod +x ~/.embulk/bin/embulk
echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

Step 2. Install Elasticsearch plugin

You also need the Elasticsearch output plugin for Embulk. You can install it with this command:

embulk gem install embulk-output-elasticsearch

Embulk ships with a built-in CSV file reader. Now everything is ready to use.

Loading a CSV file

Assume you have CSV files in the ./mydata/csv/ directory. If you don’t have any, you can generate sample files with the embulk example ./mydata command.
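If you prefer not to use embulk example, you can create sample data by hand. The sketch below is an illustration: the file name and the row values are made up, but the columns match the schema guessed later in this article.

```shell
# Create a sample gzipped CSV matching the id/account/time/purchase/comment
# schema. File name and values are illustrative.
mkdir -p ./mydata/csv
cat > ./mydata/csv/sample_01.csv <<'EOF'
id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
EOF
# Embulk's gzip decoder reads compressed input transparently.
gzip -f ./mydata/csv/sample_01.csv
ls ./mydata/csv/
```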

Create this configuration file and save it as seed.yml:

  in:
    type: file
    path_prefix: ./mydata/csv/
  out:
    type: elasticsearch
    index: embulk
    index_type: embulk
    nodes:
      - {host: localhost}

In fact, this configuration lacks some important information, but Embulk can guess the missing parts. So the next step is to tell Embulk to guess them:

embulk guess ./mydata/seed.yml -o config.yml

The generated config.yml file should contain complete information, as follows:

  in:
    type: file
    path_prefix: ./mydata/csv/
    decoders:
      - {type: gzip}
    parser:
      charset: UTF-8
      newline: CRLF
      type: csv
      delimiter: ','
      quote: '"'
      escape: ''
      null_string: 'NULL'
      skip_header_lines: 1
      columns:
        - {name: id, type: long}
        - {name: account, type: long}
        - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
        - {name: purchase, type: timestamp, format: '%Y%m%d'}
        - {name: comment, type: string}
  out:
    type: elasticsearch
    index: embulk
    index_type: embulk
    nodes:
      - {host: localhost}

Note: If the CSV file contains timestamps in a local time zone, add the default_timezone parameter to the parser section as follows (time zones are assumed to be UTC by default).

    default_timezone: 'Asia/Tokyo'
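To make the nesting explicit, the parameter belongs inside the parser block of the generated config; a sketch of the relevant fragment, with the other parser settings omitted:

```yaml
in:
  parser:
    type: csv
    default_timezone: 'Asia/Tokyo'
```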

Now, you can run the bulk loading:

embulk run config.yml -c diff.yml

Scheduling loading by cron

In the last step, you ran the embulk command with the -c diff.yml option. The diff.yml file should contain a parameter named last_path:

in: {last_path: mydata/csv/sample_01.csv.gz}
out: {}

With this configuration, Embulk loads only the files whose names come after last_path in alphabetical order.

For example, if you create a ./mydata/csv/sample_02.csv.gz file, the next run skips sample_01.csv.gz and loads only sample_02.csv.gz. The updated diff.yml then records last_path: mydata/csv/sample_02.csv.gz for the run after that.
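The "newer in alphabetical order" rule is a plain string comparison on file names. A quick shell illustration, using the hypothetical file names from above:

```shell
# Compare candidate file names against last_path as strings,
# the same way Embulk decides which files to pick up next.
last_path="mydata/csv/sample_01.csv.gz"
for f in mydata/csv/sample_01.csv.gz mydata/csv/sample_02.csv.gz; do
  if [ "$f" \> "$last_path" ]; then
    echo "load: $f"
  else
    echo "skip: $f"
  fi
done
# prints:
#   skip: mydata/csv/sample_01.csv.gz
#   load: mydata/csv/sample_02.csv.gz
```

This is also why zero-padded sequence numbers (sample_01, sample_02, ...) matter: without padding, sample_10 would sort before sample_2 and be skipped.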

So, if you want to load newly created files every hour, you can set up this cron schedule:

0 * * * * embulk run /path/to/config.yml -c /path/to/diff.yml
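Note that cron runs jobs with a minimal environment, so the bare embulk command may not be found unless ~/.embulk/bin is on cron's PATH. One approach is to set PATH at the top of the crontab and redirect output to a log file; the user name and log path below are illustrative:

```
PATH=/home/youruser/.embulk/bin:/usr/local/bin:/usr/bin:/bin
0 * * * * embulk run /path/to/config.yml -c /path/to/diff.yml >> /home/youruser/embulk.log 2>&1
```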