Efficent api-call logging from nginx

Mar 21, 2014

This is about how we build up an infrastructure for logging API-calls and prepare the logs to be analyzed.

data collection

The starting point are a minimum of 100 million api calls to be logged per day. These calls are HTTP GET- and POST-Requests originating from a nginx server which acts as a load-balancer and SSL-offloader. The log has far more information than a plain commonlog (information about used backend-servers and performance for example).

Some piece of information has to be extracted by nginx_lua first - for it’s not available as standard nginx-variables (e.g. first x bytes of POST-bodies and some headers to be tracked). The log is then send to local syslog-ng via nginx-syslog which distributes the log to logstash (running behind a HA-Proxy).

data transformation

We need data transformation in this case for

  • “Sessionize” data
  • to enrich log information with coordinates via geo-ip
  • user-agent extraction
  • ensure data privacy with hashing ip-adresses
  • take all performance information to same unit (milliseconds)
  • aggregate some information into new fields (e.g. map different caching states to simple boolean)
  • make real maps/lists out of “CSV-encoded” variables

All data transformation is done via logstash.

data analysis

The transformed data is fed into two outputs: elasticsearch and HDFS/Hive

For real-time-analysis via kibana and custom queries (show availibilty of service to customers) elasticsearch is used. We used the out-of-the-box output already present in logstash to write data into one seperate index per day.

For long-term-analysis and -storing we decided to go for the json-SerDe and snappy-compression. The problem at this point was that there was no usable hdfs output for logstash (append support, compression, no heavy dependencies), so we did our own logstash-webhdfs module (available via github).

Result of reports on Hadoop/Hive are fed back into elasticsearch via EsStorageHandler, so there are available to customers without the need to access the hadoop-cluster.

the past: a wrong track

Our first setup was to send the log data from syslog-ng to Hadoop/HDFS/Hive via HDFS-sink of flume-ng (long-term) and to logstash/elasticsearch (real-time).

The incoming log was simply appended and a table (partitioned by date and customer) with regex-SerDe was created in the Hive metastore. Transformation into a RCFILE/snappy-compressed was done by Hive-QL ‘insert into table … select ‘ with a cron-based job. This new table was the starting point for long-term-analysis.

Cons for this solution

  • non-centralized transformation led to misunderstandings in data (from different field-names to different units)
  • logs were sent to two destinations and processed in these two destinations
  • Hive-QL/flume with morphline is quite limited in transformation compared to logstash

the future: scaling

Minimal to no processing on log-data is done on the nginx-servers. When neccessary the log protocol can be changed in future from syslog to something like zeromq.

The logstash machines are clustered by HA-Proxy, so scaling out is a no-brainer. Elasticsearch clustering comes out-of-the-box, so scaling is about adding memory, more memory and more machines.