Friday, April 16, 2010

Production versus local development w/ App Engine - pychecker to the rescue!

This has been a furious week of coding, staying up til 3am, and banging my head on the keyboard. I wish I could say that in the end I feel like I was extremely productive, but there is the voice of rationality in the back of my mind saying I could have done much better.

I've been working on the Browserscope code base to add a feature to make it possible for users to run browser tests on their own servers and then make use of the Browserscope to store their results, download their results, and display their results with ease. We've worked really hard to build a system that should scale and maintain runtime medians (thanks to slamm). It seems like every week I see two or three new browser tests online and yet I don't see people collecting and aggregating the results in meaningful ways. It's hard! Hopefully we'll have this live for people to bang on in the next couple of weeks.

Meanwhile, I want to braindump a few lessons learned the hard way. After four days of intense development using dev_appserver.py I decided to deploy my code to a test app_id and see how it ran in production. I was rather miffed to see that my code was not working, and/or intermittently failing. That is pain.

My debugging experience led me to cut and paste Tracebacks (when I got them!) into Google to look for errors with django's urlresolvers.py file, and a seeming inability to load my custom_filters. But that didn't make much sense to me, and I could find nothing to fix in that regard. So then I started seeing errors about importing modules. And I couldn't reproduce them locally. Pain.

My officemate slamm suggested that I try using the built-in django with App Engine instead of appengine_django. This led me to start rewriting our main.py file. Fortunately, Guido's rietveld app is online on Google Code and contains a, in his own words, battle-worn main.py file which loads Django 1.1. After a fair amount of refactoring and generally feeling like I was making progress (I stopped seeing the errors about my custom_filters not loading after moving the add_to_builtins call into my main.py file), alas, I still kept getting errors loading some module that I could not explain.

So then I found a post in the Google Group mentioning that it can be troublesome to name your modules the same as any of the python system modules. Sure enough, I'd just created a new directory module named "users" (a pretty common directory one might have in a webapp). I got into a python console, and bingo - import users - loaded! Wow, so I then tried to import every directory and filename in my application. What a headache, but in the end, the only one I had named the same as a system module was users. Ok, so refactor and search/replace, not too big a deal. But it didn't fix my problem. Incidentally, Guido explained to me that the reason it is bad to name your modules the same as python system modules is that the sys.path in production puts your application directory at the END of the sys.path, while on dev_appserver it is on the front of the path. So that at least would explain why code that works in dev_appserver would not work in production if indeed you make this mistake. Apparently the fix is non-trivial, but at least it's in the "known-issues" list.

So I decided I had to have a circular import problem. I could come up with nothing else. And finding a circular import is pretty hard, following every import down a tree is very mind-challenging at 3 in the morning (as your app spews 500s). Searching for tools to help uncover recursive imports, I came across a mention of a tool called pychecker online (although it sounds like it has generally been deprecated by most in favor of pylint). Anyhow, I added a bunch of directories to my PYTHONPATH environment variable and ran "pychecker main.py myhandler.py" and bingo! "Error importing module foo". I could reproduce a problem! And it only took about 5 seconds for pychecker to run.. which is significant since my prior debugging path included making guesslike changes to my code, and redeploying (which takes more like 25 seconds) and praying and then banging my head. I found my recursive import, fixed the issue, and wow - no more 500s (well, except for the dynamic request limit ones).

Anyway, I just wanted to post this in case anyone starts searching and finds this post useful. Happy coding!

No comments: