Sunday, October 31, 2010

Catching the DeadlineExceeded error on App Engine

One of the neat features of App Engine's APIs is that you can request multiple items by key from memcache or the datastore in a single call, which cuts down on the number of RPCs. This can have an unintended consequence, though, if you don't think about failures while iterating over the results. In Browserscope's code base we were previously doing the following:

  stats = memcache.get_multi(browsers, **memcache_params)
  for browser in browsers:
    if browser not in stats:
      # Not in memcache, so fall back to a (slow) datastore call.
      medians, num_scores = test_set.GetMediansAndNumScores(browser)
      # (The results then get stored under stats[browser].)
  # Update memcache with any new key/value pairs in the data structure
  memcache.set_multi(stats, **memcache_params)

The problem is that we were hitting DeadlineExceededError while walking through the list of browsers whenever the datastore calls in GetMediansAndNumScores() returned large result sets. Because set_multi() only ran after the loop, nothing computed so far was saved, so the next request started from scratch and timed out too. Worse than a DeadlineExceededError is doing nothing to prevent another one. Conveniently, you can catch this exception in a try/except block on App Engine and then do something smart: store the state of things in memcache so that the next call doesn't start from scratch on a mission it will never complete. The updated code looks like this:

    import logging
    from google.appengine.api import memcache
    from google.appengine.runtime import DeadlineExceededError

    dirty = False
    stats = memcache.get_multi(browsers, **memcache_params)
    try:
      for browser in browsers:
        if browser not in stats:
          dirty = True
          medians, num_scores = test_set.GetMediansAndNumScores(browser)
          # (The results then get stored under stats[browser].)
    except DeadlineExceededError:
      # Out of time: save what we have so far so the next request can
      # pick up from here instead of starting over.
      memcache.set_multi(stats, **memcache_params)
      dirty = False  # Already written, so skip the second write below.
      logging.info('Whew, made it.')

    if dirty:
      memcache.set_multi(stats, **memcache_params)

This is just a cool pattern of defensive programming, and thankfully it works because App Engine raises DeadlineExceededError a little before the hard deadline. That leaves a short grace period in which you can do a small amount of work, like writing something to memcache (as opposed to the much slower datastore).
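
To see the pattern in isolation, here is a minimal sketch that runs without App Engine. Everything in it is a stand-in assumed for illustration: FakeDeadlineExceeded plays the role of DeadlineExceededError, a plain dict plays the role of memcache, and slow_fetch plays the role of GetMediansAndNumScores(), "timing out" after a fixed number of calls per request.

```python
class FakeDeadlineExceeded(Exception):
    """Stand-in for google.appengine.runtime.DeadlineExceededError."""


def make_slow_fetch(calls_per_request):
    """Return a hypothetical fetch that 'times out' after N calls."""
    state = {'calls': 0}

    def slow_fetch(key):
        if state['calls'] >= calls_per_request:
            raise FakeDeadlineExceeded()
        state['calls'] += 1
        return 'stats for %s' % key

    return slow_fetch


def handle_request(keys, cache, fetch):
    """Fill in missing keys; on a deadline, keep the partial results.

    Returns True if every key was available, False if the deadline hit.
    """
    # Like get_multi: only the keys already cached come back.
    stats = dict((k, cache[k]) for k in keys if k in cache)
    finished = True
    try:
        for key in keys:
            if key not in stats:
                stats[key] = fetch(key)
    except FakeDeadlineExceeded:
        finished = False
    # Like set_multi: save whatever we managed to compute.
    cache.update(stats)
    return finished


browsers = ['Chrome 7', 'Firefox 3.6', 'IE 8', 'Opera 10', 'Safari 5']
cache = {}
first = handle_request(browsers, cache, make_slow_fetch(2))
second = handle_request(browsers, cache, make_slow_fetch(2))
third = handle_request(browsers, cache, make_slow_fetch(2))
print('%s %s %s %d' % (first, second, third, len(cache)))
```

Each simulated request gets a fresh fetch budget but shares the cache, so the first two requests run out of time and the third only has one key left to compute.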

The end result is that test result tables in Browserscope should load more consistently and more quickly, and if we respond with a 500 once, we at least might not have to on the next request. One thing I'm wondering about is adding a redirect so that the request retries itself automatically: since each attempt caches a bit more, it should eventually succeed.
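
The retry idea can be sketched as a bounded loop. This is only an illustration with made-up names (run_with_retries, make_attempt, max_attempts); in a real handler each iteration would be a separate request reached through a redirect, and the cap is there so that a request that stops making progress doesn't redirect forever.

```python
def run_with_retries(attempt, max_attempts):
    """Call attempt() until it returns True or the cap is reached.

    Returns the number of attempts used, or None if the cap was hit first.
    """
    for n in range(1, max_attempts + 1):
        if attempt():
            return n
    return None


def make_attempt(keys, cache, per_attempt):
    """Hypothetical attempt: computes at most per_attempt missing keys."""
    def attempt():
        missing = [k for k in keys if k not in cache]
        for key in missing[:per_attempt]:
            cache[key] = 'stats for %s' % key
        # Done only if this call was able to cover everything still missing.
        return len(missing) <= per_attempt
    return attempt


keys = ['a', 'b', 'c', 'd', 'e']
used = run_with_retries(make_attempt(keys, {}, 2), max_attempts=5)
gave_up = run_with_retries(make_attempt(keys, {}, 2), max_attempts=2)
```

With five keys and two computed per attempt, the first call needs three attempts, while the second hits its cap of two and gives up.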