Efficient API-call logging from nginx
Mar 21, 2014
This is about how we built up an infrastructure for logging API calls and preparing the logs for analysis.
The starting point is a minimum of 100 million API calls to be logged per day. These calls are HTTP GET and POST requests arriving at an nginx server which acts as a load balancer and SSL offloader. The log carries far more information than a plain common log (for example, information about the backend servers used and their performance).
Some pieces of information have to be extracted by nginx_lua first, since they are not available as standard nginx variables (e.g. the first x bytes of POST bodies and some headers to be tracked). The log is then sent to the local syslog-ng via nginx-syslog, which distributes it to logstash (running behind an HA-Proxy).
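The nginx_lua extraction can be sketched roughly like this; variable names, the header being tracked, and the log format are illustrative placeholders, not our exact configuration:

```nginx
# Hypothetical sketch: capture extra request data via nginx_lua
# and expose it as variables for the access log.
log_format apilog '$remote_addr [$time_local] "$request" $status '
                  '$upstream_addr $upstream_response_time '
                  '"$request_body_start" "$tracked_header"';

server {
    location /api/ {
        set $request_body_start "";
        set $tracked_header "";
        rewrite_by_lua '
            -- grab the first 256 bytes of the POST body
            ngx.req.read_body()
            local body = ngx.req.get_body_data()
            if body then
                ngx.var.request_body_start = string.sub(body, 1, 256)
            end
            -- track a custom header (header name is an assumption)
            ngx.var.tracked_header = ngx.req.get_headers()["X-Api-Key"] or ""
        ';
        proxy_pass http://backend;
    }
}
```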
We need data transformation in this case to
- “sessionize” the data
- enrich log entries with coordinates via GeoIP
- extract user-agent information
- ensure data privacy by hashing IP addresses
- normalize all performance figures to the same unit (milliseconds)
- aggregate some information into new fields (e.g. map the different caching states to a simple boolean)
- turn “CSV-encoded” variables into real maps/lists
All data transformation is done via logstash.
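A logstash filter chain for these steps might look roughly like this (logstash 1.x syntax); all field names and the salt are illustrative, not our production configuration:

```ruby
filter {
  # enrich with coordinates via GeoIP
  geoip { source => "client_ip" }

  # extract user-agent details
  useragent { source => "agent" target => "ua" }

  # hash IP addresses for privacy
  anonymize {
    fields    => ["client_ip"]
    algorithm => "SHA1"
    key       => "some-secret-salt"
  }

  mutate {
    # turn a "CSV-encoded" variable into a real list
    split => ["upstream_times", ","]
  }

  # normalize units with a small ruby snippet (seconds -> milliseconds)
  ruby {
    code => "event['request_time_ms'] = (event['request_time'].to_f * 1000).round"
  }
}
```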
For real-time analysis via Kibana and custom queries (e.g. showing service availability to customers), Elasticsearch is used. We use the out-of-the-box output already present in logstash to write data into one separate index per day.
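The daily index is just a sprintf-style date in the index name of the stock elasticsearch output; the host name below is a placeholder:

```ruby
output {
  elasticsearch {
    host  => "es-node-1"
    index => "apilog-%{+YYYY.MM.dd}"
  }
}
```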
For long-term analysis and storage we decided to go with the JSON SerDe and snappy compression. The problem at this point was that there was no usable HDFS output for logstash (append support, compression, no heavy dependencies), so we wrote our own logstash-webhdfs module (available via GitHub).
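Using the module looks roughly like this; host, path layout, and option names are meant as a sketch and may differ from the version on GitHub:

```ruby
output {
  webhdfs {
    host        => "namenode.example.com"   # WebHDFS endpoint
    port        => 50070
    path        => "/logs/api/%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user        => "hdfs"
    compression => "snappy"
  }
}
```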
Results of reports on Hadoop/Hive are fed back into Elasticsearch via the EsStorageHandler, so they are available to customers without the need to access the Hadoop cluster.
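The feedback path is an external Hive table backed by elasticsearch-hadoop; the columns, index name, and host below are placeholders for illustration:

```sql
-- Sketch: export an aggregated report to Elasticsearch via
-- the elasticsearch-hadoop EsStorageHandler.
CREATE EXTERNAL TABLE report_es (
  day      STRING,
  customer STRING,
  calls    BIGINT,
  avg_ms   DOUBLE
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'reports/daily',
  'es.nodes'    = 'es-node-1:9200'
);

INSERT OVERWRITE TABLE report_es
SELECT day, customer, count(*), avg(request_time_ms)
FROM apilog
GROUP BY day, customer;
```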
the past: a wrong track
Our first setup sent the log data from syslog-ng to Hadoop/HDFS/Hive via the HDFS sink of flume-ng (long-term) and to logstash/elasticsearch (real-time).
The incoming log was simply appended, and a table (partitioned by date and customer) with a regex SerDe was created in the Hive metastore. Transformation into an RCFILE/snappy-compressed table was done by a cron-based Hive-QL ‘insert into table … select’ job. This new table was the starting point for long-term analysis.
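In outline, the old setup looked something like this; the columns, regex, and partition values are illustrative, not the real schema:

```sql
-- Regex-SerDe table over the raw appended log
CREATE EXTERNAL TABLE apilog_raw (
  client_ip STRING,
  ts        STRING,
  request   STRING,
  status    STRING
)
PARTITIONED BY (day STRING, customer STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '(\\S+) \\[([^\\]]+)\\] "([^"]*)" (\\d+).*'
)
LOCATION '/logs/api/raw';

-- Cron-driven transformation into an RCFILE/snappy table
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

INSERT INTO TABLE apilog PARTITION (day='2014-03-21', customer='acme')
SELECT client_ip, ts, request, status
FROM apilog_raw
WHERE day = '2014-03-21' AND customer = 'acme';
```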
Cons of this solution:
- non-centralized transformation led to misunderstandings in data (from different field-names to different units)
- logs were sent to two destinations and processed in these two destinations
- Hive-QL/flume with morphline is quite limited in transformation compared to logstash
the future: scaling
Minimal to no processing of log data is done on the nginx servers. If necessary, the log protocol can be changed in the future from syslog to something like ZeroMQ.
The logstash machines are clustered behind HA-Proxy, so scaling out is a no-brainer. Elasticsearch clustering comes out of the box, so scaling means adding memory, more memory, and more machines.