Tangled in the Threads
Jon Udell, February 7, 2001
Website API discovery
From stunnel to Proxomitron
In the first of a two-parter, Jon reviews why and how to discover and use website APIs
For a couple of projects recently, I've ended up reverse-engineering website APIs and then writing scripts to control sites using those APIs. This is a black art that, we can only hope, will soon fade away as the web's new architecture of communicating services takes hold. But Internet time ain't what it used to be, so I expect I'll be doing this kind of thing for a while yet. Here are some examples of what I mean:
Automating common tasks in a web-based issue-tracker. A couple of weeks ago I mentioned RequestTracker. It's really handy, but the novelty of pointing and clicking wears thin when you're trying to process dozens or hundreds of similar items. So, I've written a script to power through these chores.
Verifying website security. One of my projects is a site that reacts to certain kinds of spider activity. The best way to test these defenses is to probe with a spider that impersonates an authenticated user.
Reformulating web statistics. For another project, I'm reformulating web statistics. This, by the way, is a perfect example of the kind of problem that SOAP-style interfaces will solve. DON'T lock users into a specific HTML presentation. DO offer interfaces, use them yourself to create a default presentation, but allow others to use them directly to create alternate presentations. That's the vision, anyway. In reality, I'll bet the web stats reprocessor I wrote yesterday won't be my last.
In cases like these, the name of the game is to first discover, and then use, the website's API. The fact that websites have APIs, even when they don't intend to, is one of the most remarkable aspects of the first-generation web. I've shown elsewhere how it's possible to build novel web services using existing sites (AltaVista, Yahoo) as components. So what comprises a website's implicit API? Basically, just these things:
HTTP headers. The Cookie and Authorization headers are the two you most often need to manipulate, typically in response to Set-Cookie and WWW-Authenticate headers received from a server.
GET requests. Many sites, notably search engines, use GET exclusively. This is handy because the API is always visible on the browser's "command line" (aka URL input box), and queries that you form (e.g. http://alltheweb.com/cgi-bin/search?offset=10&type=all&query=udell+mindshare) are, in effect, little special-purpose programs that you can save, reuse, and distribute.
POST requests. When web UI grows more complex, GETs become unwieldy. So shopping carts and other forms-intensive sites tend to use POST requests, which send the form's name/value pairs as a bundle of data separate from the URL itself.
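To make the two styles concrete, here's a quick sketch in modern Python (the stdlib urllib.parse module, which of course postdates this column's toolkit); the shopping-cart field names are invented for illustration:

```python
from urllib.parse import urlencode

# A GET request carries its parameters in the URL itself -- the
# browser's "command line" -- so the whole query is a little program
# you can save, reuse, and distribute.
params = {"offset": 10, "type": "all", "query": "udell mindshare"}
get_url = "http://alltheweb.com/cgi-bin/search?" + urlencode(params)
print(get_url)
# http://alltheweb.com/cgi-bin/search?offset=10&type=all&query=udell+mindshare

# A POST request sends the same kind of name/value pairs as a body
# separate from the URL -- which is why forms-heavy sites use it.
# (Field names here are made up for the example.)
post_body = urlencode({"item": "widget", "qty": 2}).encode("ascii")
print(post_body)
```

The GET version reconstructs exactly the alltheweb query shown above; the POST version is the same data, just moved out of the URL.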
Since HTTP and HTML are both just ASCII text, it is in principle quite easy to discover and use website APIs made from this stuff. Although Perl, Python, and other scripting languages offer sophisticated HTTP modules, under the covers it's just text flowing through sockets. If you're wondering what headers a site sends, you can always do this:

telnet www.byte.com 80
GET /
and the raw stuff will come spewing out.
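What telnet does here is easy to replicate in a script. A minimal Python sketch -- substituting a throwaway local server for www.byte.com so it runs anywhere:

```python
import socket
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Stand in for www.byte.com with a disposable local server, so the
# raw exchange is visible without depending on a live site.
server = HTTPServer(("127.0.0.1", 0), SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# This is all telnet is doing: open a socket, write ASCII, read ASCII.
sock = socket.create_connection((host, port))
sock.sendall(b"GET / HTTP/1.0\r\nHost: localhost\r\n\r\n")
raw = sock.recv(4096).decode("latin-1")
sock.close()
server.shutdown()

print(raw.splitlines()[0])   # the status line, e.g. "HTTP/1.0 200 OK"
```

Everything after the status line -- headers, blank line, body -- comes spewing out of the same socket.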
What about secure sites?
In fact, you can't always just telnet to a site, or do the equivalent using script-language HTTP modules. When the site in question is secure, things get a lot more complicated. Or rather, they used to. Now that the RSA patent has expired, they're getting simple again, thankfully. You can, for example, use a marvelous open-source tool called stunnel to turn the OpenSSL (formerly SSLeay) libraries into a general-purpose encryptor/decryptor that will let you telnet (or Perl, or Python) your way into a secure site:

stunnel -d localhost:443 -c -r some.secure.site:443
telnet localhost 443
GET /
The "-d" tells stunnel to listen on the local port 443 (which could be any port, including a high-numbered one if you're on a Unix box without root and can't access the lower-numbered ports). The "-c" says "be an SSL client" with respect to the remote host specified by "-r".
This is a terrific enabler. Unlike ssh tunneling, it doesn't depend on a cooperating sshd on the far end. But wait! There's more! Watch this:

stunnel -d localhost:443 -r localhost:80
Pretend that the service running at localhost:80 is plaintext, but you want to secure it. Maybe it's a homegrown tool, maybe it's Zope, whatever. From the outside looking in, port 443 is now a secure HTTPS service, though under the covers it only relays requests, decrypted, to an ordinary and unmodified HTTP service. The stunnel distribution comes with a default server certificate which, of course, isn't signed by VeriSign, so has to be accepted by the user. And I wouldn't recommend this approach for heavily-trafficked sites. But these limitations are quite acceptable in a great many situations where you need to deploy some service for a small but distributed team, and do it securely.
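Strip away the crypto and stunnel is a socket relay. Here's that plumbing sketched in Python -- plain TCP only, with the SSL wrap/unwrap that stunnel adds left out:

```python
import socket
import threading

def relay(target_host, target_port):
    """Listen on an ephemeral local port and shuttle bytes both ways
    to the target -- stunnel's plumbing, minus the SSL at one end.
    Returns the port it is listening on."""
    lsock = socket.socket()
    lsock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    lsock.bind(("127.0.0.1", 0))
    lsock.listen(5)

    def pump(src, dst):
        try:
            while (data := src.recv(4096)):
                dst.sendall(data)
        except OSError:
            pass
        finally:
            try:
                dst.shutdown(socket.SHUT_WR)   # pass EOF along
            except OSError:
                pass

    def serve():
        while True:
            client, _ = lsock.accept()
            upstream = socket.create_connection((target_host, target_port))
            threading.Thread(target=pump, args=(client, upstream), daemon=True).start()
            threading.Thread(target=pump, args=(upstream, client), daemon=True).start()

    threading.Thread(target=serve, daemon=True).start()
    return lsock.getsockname()[1]

# Demo: put the relay in front of a throwaway local HTTP server.
from http.server import HTTPServer, SimpleHTTPRequestHandler
from urllib.request import urlopen

backend = HTTPServer(("127.0.0.1", 0), SimpleHTTPRequestHandler)
threading.Thread(target=backend.serve_forever, daemon=True).start()

port = relay("127.0.0.1", backend.server_address[1])
resp = urlopen("http://127.0.0.1:%d/" % port)
print(resp.status)   # the backend answers through the relay
```

In stunnel's server mode, the only difference is that the listening side unwraps SSL before pumping the bytes onward.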
Snooping on website APIs
Let's return to the question of website reverse engineering. In principle, as I've said, it's easy because the web is pretty much an open book. But in practice, it's tedious to work out the sequences of requests and responses that define a website's API. GET requests that involve no header manipulation are a no-brainer, but POST requests that send complex form data, along with an HTTP authentication header (name/password), and maybe also a cookie header (with session state information), take a bit more doing.
You can of course issue requests from an HTTP-aware script language, for example Perl with its LWP module, and use the splendid facilities of Perl to analyze and programmatically respond to pages that you fetch from a site. Doing this on a secure site is straightforward too, thanks to Perl's Crypt::SSLeay module, which enables LWP to work with encrypted https-style pages as well as normal http pages. (Alternatively, if you can't or don't want to add SSL capability to your installation of Perl, you can put stunnel between Perl and the encrypted website.) But even for Perl hackers, it's a bit tedious to use Perl to both explore and automate website APIs. Why does that matter? One of the reasons you play this game is to develop software that drives a site through a complex series of interactions, in order to stress-test it, or to build a baseline profile for regression testing -- that is, to ensure that the site continues to behave in expected ways over time. It would be great if you didn't need a Perl hacker to gather this information, but could instead let an ordinary civilian with a browser -- and possibly more knowledge of the application domain -- do it instead.
Here's one example. To automate access to a password-protected site, I needed to capture and then retransmit a cookie. My first try at this, using LWP, failed because the page that sent the cookie did an immediate redirect to another page, which was the one that LWP returned to my script. Once I realized this was happening, it was straightforward to decompose that operation -- which was atomic from LWP's point of view -- into two parts. First, fetch the cookie and the URL of the next page. Then, fetch the next page, sending the cookie along with the request for that URL. But the problem was that I didn't realize this was even required -- until I watched the back-and-forth traffic in Proxomitron's log window.
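Here's roughly what that two-step decomposition looks like, sketched in Python at the raw HTTP level (with an invented stand-in server, since the real site goes unnamed; http.client, like a socket, never follows redirects for you):

```python
import threading
import http.client
from http.server import BaseHTTPRequestHandler, HTTPServer

class FakeSite(BaseHTTPRequestHandler):
    """Stand-in for the password-protected site: /login sets a
    cookie and immediately redirects; /home demands the cookie back.
    (Paths and cookie value are invented for the example.)"""
    def do_GET(self):
        if self.path == "/login":
            self.send_response(302)
            self.send_header("Set-Cookie", "session=abc123")
            self.send_header("Location", "/home")
            self.end_headers()
        elif "session=abc123" in self.headers.get("Cookie", ""):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"welcome")
        else:
            self.send_response(403)
            self.end_headers()
    def log_message(self, *args):   # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), FakeSite)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# Step 1: fetch the login page at a level where the redirect is NOT
# followed automatically, so Set-Cookie and Location are ours to read.
conn = http.client.HTTPConnection(host, port)
conn.request("GET", "/login")
resp = conn.getresponse()
cookie = resp.getheader("Set-Cookie")
next_path = resp.getheader("Location")

# Step 2: fetch the next page, sending the cookie along.
conn = http.client.HTTPConnection(host, port)
conn.request("GET", next_path, headers={"Cookie": cookie})
body = conn.getresponse().read()
print(body)
```

A high-level client that treats redirect-plus-fetch as one atomic operation hides exactly the step -- the Set-Cookie on the intermediate response -- that this sequence needs to capture.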
Here's a testimonial from Peter Hess:
I picked all this stuff up a year or two ago when trying to get MS COM Internet Services (precursor to SOAP) to work through authenticating proxies. CIS worked by setting up a tunnel to port 593 on the remote server through the proxy, then shooting RPC over the tunnel. It depended on the proxy accepting CONNECT requests to arbitrary ports and simply passing them through untouched (and unSSLized, BTW). But the chief failing of CIS was that, while it knew that '200 - Connection Established' indicated success, it didn't know that '407 - Proxy Authentication Required' wasn't an error per se, but rather a challenge in the basic authentication scheme documented in the HTTP spec.
All that to say that, in the process of tracking down the problem, I discovered Proxomitron (courtesy of these newsgroups) and the HTTP log window. By telling Proxomitron to use an upstream proxy, we could watch the HTTP traffic without having to search it out of a network trace.
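The essence of that log window -- a proxy that records each request line as it relays it -- can be sketched in a few lines of Python. This toy handles only plain-HTTP GETs; Proxomitron does vastly more:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer, SimpleHTTPRequestHandler
from urllib.request import urlopen, ProxyHandler, build_opener

captured = []   # our stand-in for Proxomitron's log window

class SnoopProxy(BaseHTTPRequestHandler):
    """A toy logging proxy: record the request line, then fetch the
    absolute URL the client asked for and relay the response."""
    def do_GET(self):
        captured.append(self.requestline)   # e.g. "GET http://... HTTP/1.1"
        upstream = urlopen(self.path)       # proxy requests carry full URLs
        body = upstream.read()
        self.send_response(upstream.status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):
        pass

proxy = HTTPServer(("127.0.0.1", 0), SnoopProxy)
threading.Thread(target=proxy.serve_forever, daemon=True).start()

# Demo: a throwaway origin server, and a client pointed at the proxy.
origin = HTTPServer(("127.0.0.1", 0), SimpleHTTPRequestHandler)
threading.Thread(target=origin.serve_forever, daemon=True).start()

proxy_url = "http://127.0.0.1:%d" % proxy.server_address[1]
opener = build_opener(ProxyHandler({"http": proxy_url}))
opener.open("http://127.0.0.1:%d/" % origin.server_address[1]).read()
print(captured[0])   # the request line, as the log window would show it
```

Because the browser (here, urllib's opener) is told to route through the proxy, every request line lands in the log on its way past -- no network trace required.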
In short, you can use Proxomitron to do much more than "say goodbye to slow-loading cyberspam and other web-gimmickry," or confound browser detection scripts with the enigmatic signature:

SpaceBison/0.01 [fu] (Win67; X; ShonenKnife)
instead of the expected MSIE or Netscape signature. Until web services are made available in a more rational way, with formal XML APIs (a la SOAP), integrators who choose to regard existing websites as programmable services will find tools like Proxomitron an indispensable aid.
Can Proxomitron do SSL proxying? The current version doesn't, though you can achieve a partial solution with the ever-useful stunnel. That led me to explore some alternatives, including a Perl-based proxy. Secure proxying turns out to be a fascinating subject, and while discussing it in the newsgroup, Peter Hess mentioned that a beta version of Proxomitron can, in fact, open a window onto the request/response flow between a browser and a secure site. I'll fill in the details next week but, to cut to the chase, the SSL-aware Proxomitron works beautifully, and this is great news for anybody who needs to reverse-engineer a secure site. Meanwhile, there are still a few unanswered questions, notably how to implement a Perl-based proxy that works the same way. If you know the right tool or technique, do drop by my newsgroup and enlighten us!
Jon Udell (http://udell.roninhouse.com/) was BYTE Magazine's executive editor for new media, the architect of the original www.byte.com, and author of BYTE's Web Project column. He's now an independent Web/Internet consultant, and is the author of Practical Internet Groupware, from O'Reilly and Associates. His recent BYTE.com columns are archived at http://www.byte.com/index/threads
This work is licensed under a Creative Commons License.