ログの一部を抽出して計算するには sampling-filter + numeric-monitor の組み合わせがぴったり!

大規模データ収集のログコレクタとして fluentd を使用している場合、全部のログの中から一部を抽出してデータ分析を行いたい場合がないだろうか。例えば、サービスのレスポンスタイムが遅くなったり(早くなったり)していないかを測定することができれば、異常検知に役立つ。つまり、ログ(母集団)をデータベースまで運ぶ間にちょっとつまみ喰い(標本)するといった具合だ。平均や特定のパーセンタイル値を出しておけばパフォーマンス監視としては十分だと思う。こういった要求を満たすのに便利な fluentd プラグインの組み合わせが fluent-plugin-sampling-filter と fluent-plugin-numeric-monitor だ。

nginx のアクセスログの一部を抽出してパフォーマンス測定を行う場合

ここでは具体例として nginx のアクセスログを考えてみる。例えば以下のアクセスログから vhost ごとの reqtime をサンプリングしてパフォーマンスの指標として vhost ごとの平均と 99 パーセンタイルを出してみる。

access.log

time:15/Mar/2018:01:00:00 +0900  host:10.0.0.123 forwardedfor:12.123.123.123 req:GET /index.html HTTP/1.1    status:200  size:123    referer:-   ua:-    reqtime:0.001   vhost:blog.ayakumo.net
time:15/Mar/2018:01:00:00 +0900 host:10.0.0.123 forwardedfor:12.123.123.123 req:GET /index.html HTTP/1.1    status:200  size:123    referer:-   ua:-    reqtime:0.001   vhost:www.ayakumo.net
time:15/Mar/2018:01:00:00 +0900 host:10.0.0.123 forwardedfor:12.123.123.123 req:GET /index.html HTTP/1.1    status:200  size:123    referer:-   ua:-    reqtime:0.002   vhost:blog.ayakumo.net
time:15/Mar/2018:01:00:00 +0900 host:10.0.0.123 forwardedfor:12.123.123.123 req:GET /index.html HTTP/1.1    status:200  size:123    referer:-   ua:-    reqtime:0.002   vhost:www.ayakumo.net
time:15/Mar/2018:01:00:00 +0900 host:10.0.0.123 forwardedfor:12.123.123.123 req:GET /index.html HTTP/1.1    status:200  size:123    referer:-   ua:-    reqtime:0.003   vhost:blog.ayakumo.net
time:15/Mar/2018:01:00:02 +0900 host:10.0.0.123 forwardedfor:12.123.123.123 req:GET /index.html HTTP/1.1    status:200  size:123    referer:-   ua:-    reqtime:0.004   vhost:blog.ayakumo.net
time:15/Mar/2018:01:00:03 +0900 host:10.0.0.123 forwardedfor:12.123.123.123 req:GET /index.html HTTP/1.1    status:200  size:123    referer:-   ua:-    reqtime:0.005   vhost:blog.ayakumo.net
time:15/Mar/2018:01:00:05 +0900 host:10.0.0.123 forwardedfor:12.123.123.123 req:GET /index.html HTTP/1.1    status:200  size:123    referer:-   ua:-    reqtime:0.006   vhost:blog.ayakumo.net
time:15/Mar/2018:01:00:06 +0900 host:10.0.0.123 forwardedfor:12.123.123.123 req:GET /index.html HTTP/1.1    status:200  size:123    referer:-   ua:-    reqtime:0.007   vhost:blog.ayakumo.net
time:15/Mar/2018:01:00:08 +0900 host:10.0.0.123 forwardedfor:12.123.123.123 req:GET /index.html HTTP/1.1    status:200  size:123    referer:-   ua:-    reqtime:0.008   vhost:blog.ayakumo.net
time:15/Mar/2018:01:00:08 +0900 host:10.0.0.123 forwardedfor:12.123.123.123 req:GET /index.html HTTP/1.1    status:200  size:123    referer:-   ua:-    reqtime:0.009   vhost:blog.ayakumo.net
time:15/Mar/2018:01:00:08 +0900 host:10.0.0.123 forwardedfor:12.123.123.123 req:GET /index.html HTTP/1.1    status:200  size:123    referer:-   ua:-    reqtime:0.010   vhost:blog.ayakumo.net

fluent プラグインの組み合わせによる実装

以下 4 つのプラグインを組み合わせることによりパフォーマンス測定を実装する。rewrite-tag-filter は昔の rewriteruleN 記法ではなく新しい <rule> ... </rule> 記法を使用することでマッチした部分をタグとして利用できるのが素晴らしい。

fluent-plugin-rewrite-tag-filter を使用して vhost ごとにタグを切り分ける。ここでは blog と www を切り分けている。
fluent-plugin-sampling-filter を使用して 1/100 のサンプリングを行う。ここでは 100行のうち 1 行を抽出している
fluent-plugin-numeric-monitor を使用してパーセンタイルの計算を行う。ここでは50と99を算出している。
fluent-plugin-prometheus を使用して prometheus サーバから pull できるように加工する。

sample_td_agent.conf

<source>
  @type forward
  port 88888
</source>

<source>
  @type prometheus
</source>

<match nginx.access>
  @type rewrite_tag_filter
  <rule>
    key vhost
    pattern ^(.*)\.ayakumo\.net$
    tag nginx.access.$1
  </rule>
</match>

<match nginx.access.*>
  @type sampling_filter
  interval 100 # 1/100 sampling
  sample_unit all
  add_prefix sampled
</match>

<match sampled.nginx.access.*>
  @type forest
  subtype numeric_monitor
  <template>
    tag reqtime.${tag_parts[3]}
    unit minute
    aggregate all
    monitor_key reqtime
    percentiles 50,99
  </template>
</match>

<match reqtime.*>
  @type forest
  subtype copy
  <template>
    <store>
      @type prometheus
      <metric>
        name reqtime_percentile_50_${tag_parts[1]}
        type gauge
        desc The 50 percentile of reqtime per minute.
        key  percentile_50
      </metric>
      <metric>
        name reqtime_percentile_99_${tag_parts[1]}
        type gauge
        desc The 99 percentile of reqtime per minute.
        key  percentile_99
      </metric>
    </store>
    <store>
      @type stdout
    </store>
  </template>
</match>

<match **>
  @type null
</match>

nginx のアクセスログを擬似的に発行するために fluent-post を以下のように使用する。reqtime のところは $RANDOM を使う代わりに shuf -i 1-100 -n 1 でもよいかもしれない。

fluent-post.sh

for sld in www blog; do
  for i in `seq 1 100`; do
    r=$(($RANDOM % 100))
    /opt/td-agent/embedded/bin/fluent-post -p 88888 -t nginx.access -v time=time:15/Mar/2018:01:00:01 -v host=10.0.0.123 -v forwardedfor=12.123.123.123 -v req="GET /    index.html HTTP/1.1" -v status=200 -v referer="-" -v ua="-" -v reqtime=0.${r} -v vhost=${sld}.ayakumo.net
  done
done

こうすると @type stdout で td-agent.log に出力している numeric-monitor の基本機能により avg や num は自動的に計算されていることがわかる。www と blog の 2 つの sld(Second Level Domain) に対して 100 回(seq 1 100) 投げているが、サンプリング粒度を 1/100 にしているので(interval 100) 標本のサイズはそれぞれ 10 個づつ("num": 10) となる。

td-agent.log

2018-03-15 01:02:32 +0900 [info]: out_forest plants new output: numeric_monitor for tag 'sampled.nginx.access.www'
2018-03-15 01:02:32 +0900 reqtime.blog: {"num":10,"min":0.1,"max":0.98,"avg":0.587,"sum":5.87,"percentile_50":0.6,"percentile_99":0.94}
2018-03-15 01:02:32 +0900 [info]: out_forest plants new output: copy for tag 'reqtime.www'
2018-03-15 01:02:32 +0900 reqtime.www: {"num":10,"min":0.14,"max":0.79,"avg":0.433,"sum":4.33,"percentile_50":0.36,"percentile_99":0.7}

prometheus のメトリクス名としているパーセンタイル 50, 99 も取得できる。

$ curl -s http://localhost:24231/metrics/ | grep -v '^#'
reqtime_percentile_50_blog 0.6
reqtime_percentile_99_blog 0.94
reqtime_percentile_50_www 0.36
reqtime_percentile_99_www 0.7

fluentd のプラグインは便利なものがたくさんそろっているので、組み合わせ次第でいろいろなことができて本当に助かっている。(^_^)