The Heroku's Ugly Secret blog post went viral last week. I wrote in defense of Heroku, which has now responded with an official Routing Performance Update.

Random request queuing has been discussed in the past in Tim Watson's post based on a response by Heroku's Adam Wiggins. While the documentation may not have been accurate or even somewhat misleading, we, at Artsy, understood the strategy and the limitations of the routing mesh for quite sometime. Therefore, we have been making continuous efforts to improve our application's performance and reduce the negative impact of random routing inside the routing mesh over the past few months.

One thing we didn't do, was to measure the actual wait time inside a dyno. In restrospect, it seems obvious that we should have. In this post we'll describe a middleware to do so. This is entirely based on the work of David Yeu, Jason R Clark and RG's own Andrew Warner.

With this code in place, here's a 12 hour graph of our website's API performance. The dyno wait time for our application, in green, averaged 61.1ms for a total of 301ms average per request, which is 1/5th of the total request time. It's certainly a lot, but we do spend a lot more time in our own code.

Note that the single peak on the right of the graph corresponds to a dyno auto-scale job. We double the number of dynos with early morning traffic, which causes new dynos to boot up and accumulate requests before they are "warm" enough to process requests at their normal rate.

Queue Logger Middleware

Heroku adds an X-Request-Start header as documented here into every request it routes. We can then subtract the value of this header from Time.now once we're inside our code. We're also removing the the X-Heroku-Queue-Wait-Time header, as it's mostly zero with the current Heroku routing strategy and gets used as queue time by the NewRelic RPM. Finally, we're setting env['HTTP_X_QUEUE_TIME'], which will be picked up by NewRelic as documented here and adding a X-Queue-Time header to be able to see the queue time in every response with client tools.

config/queue_time_logger.rb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
# https://gist.github.com/a-warner/f5db30857ed3423cea79
# combination of https://gist.github.com/daveyeu/4960893
# and https://gist.github.com/jasonrclark/d82a1ea7695daac0b9ee
class QueueTimeLogger
  attr_reader :app

  def initialize(app, options = {})
    @app = app
  end

  def call(env)
    now = Time.now.to_f

    env.delete("HTTP_X_HEROKU_QUEUE_WAIT_TIME")

    microseconds = (now * 1_000_000).to_i
    env["HTTP_X_MIDDLEWARE_START"] = "t=#{microseconds}"

    perf_headers = {}
    if (request_start = env["HTTP_X_REQUEST_START"])
      request_start_microseconds = request_start.gsub("t=", "").to_i * 1_000
      queue_time_microseconds = [ microseconds - request_start_microseconds, 0 ].max
      env["HTTP_X_QUEUE_TIME"] = "t=#{queue_time_microseconds}"

      queue_time_milliseconds = (queue_time_microseconds / 1_000).to_i
      perf_headers["X-Queue-Time"] = queue_time_milliseconds.to_s
    end

    status, headers, body = app.call(env)
    [ status, headers.merge(perf_headers), body ]
  end
end

We insert this middleware into Rails. Remember that the middleware is executed in reverse order, so you should put this in the end of your config/environment.rb.

config/environment.rb
1
2
require File.expand_path('../queue_time_logger', __FILE__)
config.middleware.use QueueTimeLogger

Time Skew

It's important to note that since the X-Request-Start header is inserted by the router, we're not capturing queue wait time, we're capturing (queue wait time) + (clock skew between the router and the machine servicing the request). The time skew has a non-negligible contribution to the sum, especially that the sign of the clock skew contribution is unknown and we are replacing any negative time difference with 0. We can only hope that Heroku does a reasonable effort at synchronizing clocks between the router and the dyno servers.

What About Dumb Routing?

One of the basic issues with one-request-at-a-time web servers and random routing is how single-threaded web servers accept connections. It sounds technically feasible that the web server could report back to the router that it's currently processing a request and have the router pick another dyno, but there're two non-trivial difficulties with implementing this.

The first is that it would require cooperation from the Heroku router, as currently, closing a TCP socket would cause it to return a 503 to the client.

The second is in the way EventMachine accepts requests in a single-threaded scenario: a request will block the EventMachine reactor, and only once it has unblocked the reactor, will it accept more requests. Those requests will sit in the TCP queue for the duration of the long request, defeating the whole concept.

Improving Throughput on Heroku

It's important to understand that with every system you will get increasingly unfair scheduling at the load balancer when you have more than your serviceable load. To improve this on Heroku you have to either reduce the time to service each request or provision more dynos. All things considered, I think that being able to service long-running requests without any significant impact on the entire distributed system would be a luxury.

Links

Categories: heroku


Comments