
September 10, 2004

Getting urllib2 to use HTTP/1.1

Python's urllib2 uses HTTP/1.0 by default. HTTP/1.0 is dated, and using HTTP/1.1 is preferable. Thankfully, urllib2 relies on httplib, which supports both HTTP/1.0 and 1.1.

A small bit of background information: urllib2 has a convenience function called urlopen which takes a URL and attempts to fetch it, provided it understands the protocol scheme, e.g. http:// or ftp://.
urllib2.urlopen('https://2entwine.com/')
urllib2's urlopen function is just a thin wrapper around a globally instantiated opener. An opener is a manager class for a set of protocol handlers, and it's the opener's job to dispatch a request to the correct handler. Openers have a default set of handlers for all the supported protocols. For example, urllib2 contains an HTTPHandler which handles HTTP requests on behalf of the opener. It's possible to provide a handler to be used in place of a default handler. To displace a default handler, a new handler has to be subclassed from the default handler and then passed to the opener when it's created.
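
For instance, the urlopen call above is roughly equivalent to building an opener with the default handlers and calling its open method yourself (a sketch of the dispatch path, not the library's literal code):
import urllib2

# roughly what the module-level urlopen does behind the scenes
opener = urllib2.build_opener()              # an opener with the default set of handlers
response = opener.open('https://2entwine.com/')
data = response.read()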

To get urllib2 to use HTTP/1.1 by default with urlopen, you would use the following:
import httplib, urllib2

class HTTP11(httplib.HTTP):
    _http_vsn = 11
    _http_vsn_str = 'HTTP/1.1'

class HTTP11Handler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(HTTP11, req)

opener = urllib2.build_opener(HTTP11Handler())
urllib2.install_opener(opener)
Theoretically, this is all you need to get HTTP/1.1 working with urllib2. Unfortunately, it doesn't work, but more about that below.

The HTTP11 class is a subclass of httplib's HTTP class, which defaults to HTTP/1.0. However, you can coax it into using HTTP/1.1 by just overriding the _http_vsn and _http_vsn_str class attributes. There's more to supporting HTTP/1.1 than changing the version number, but thankfully the rest is already in the httplib machinery.
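
To convince yourself that those two overrides are all it takes to change the request line, you can drive the HTTP11 class directly through httplib's old-style interface (a quick sanity check, assuming 2entwine.com is reachable; set_debuglevel makes httplib echo the outgoing request):
# continues from the example above, where httplib is imported and HTTP11 is defined
h = HTTP11('2entwine.com')
h.set_debuglevel(1)                 # echoes the request so the HTTP/1.1 request line is visible
h.putrequest('GET', '/')
h.putheader('Connection', 'close')  # keeps getreply() from waiting on a persistent connection
h.endheaders()
code, msg, headers = h.getreply()
print code, msg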

For HTTP requests, urllib2 uses HTTPHandler by default, which in turn relies on httplib's HTTP class, and that class speaks HTTP/1.0. So we create a new handler that uses our HTTP11 class for requests instead of the regular HTTP class.
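
For comparison, the stock handler's http_open boils down to handing httplib's plain HTTP class to do_open, which is why overriding that one method is enough (paraphrased from the urllib2 source of this era, not the verbatim code):
import httplib, urllib2

# roughly what the stock urllib2.HTTPHandler does
class StockStyleHTTPHandler(urllib2.AbstractHTTPHandler):
    def http_open(self, req):
        return self.do_open(httplib.HTTP, req)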

The last two lines of the code example instantiate a new opener that uses the HTTP/1.1 handler and then make the new opener the default for urllib2.

So why doesn't the code example work? There are two problems. First, the do_open method of HTTPHandler doesn't account for the HTTP/1.1 functionality in httplib. If you use the code example above, you'll get a request that looks like this:
GET /atom.xml HTTP/1.1
Host: 2entwine.com
Accept-Encoding: identity
User-agent: Python-urllib/2.1
Host: 2entwine.com

HTTPHandler's do_open injects a Host header, but for HTTP/1.1 connections httplib adds one as well, which is why the header shows up twice.
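
If you want to see the duplicated header for yourself without a packet sniffer, a throwaway socket server that prints whatever it receives is enough (a rough sketch; it assumes port 8000 is free on localhost and handles exactly one request):
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(('localhost', 8000))
s.listen(1)
conn, addr = s.accept()
print conn.recv(4096)    # the raw request, duplicate Host header and all
conn.send('HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok')
conn.close()
s.close()
From a second interpreter with the HTTP11 opener installed, fetch http://localhost:8000/ and the request will arrive with two Host lines.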

The other problem with the example above is that, unlike HTTP/1.0, the connection doesn't necessarily close as soon as the response has been sent. This seems to cause problems because httplib waits until the connection times out before returning the results to the caller.
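
A blunt way to limit the damage while poking at this is a global socket timeout (available from Python 2.3 on); it only turns the indefinite hang into a socket.timeout exception, it doesn't fix anything:
import socket

# a stopgap for debugging, not the real fix
socket.setdefaulttimeout(10)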

Since urllib2 doesn't reuse connections or pipeline requests, the expected behavior is for the connection to close as soon as the server has handled the request. To get that, the client has to send a Connection: close header.
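
As an aside, the header can also be attached per request through a Request object instead of inside the handler; that takes care of the hang but not the duplicated Host header (a sketch using the same feed URL as the trace above):
import urllib2

req = urllib2.Request('http://2entwine.com/atom.xml')
req.add_header('Connection', 'close')   # ask the server to close once it has responded
f = urllib2.urlopen(req)
data = f.read()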

Here's a version of the code example above with the two fixes applied inside the do_open method:
import httplib, socket, urllib2

class HTTP11(httplib.HTTP):
    _http_vsn = 11
    _http_vsn_str = 'HTTP/1.1'

class HTTP11Handler(urllib2.HTTPHandler):

    def http_open(self, req):
        return self.do_open(HTTP11, req)

    def do_open(self, http_class, req):
        host = req.get_host()
        if not host:
            raise urllib2.URLError('no host given')

        h = http_class(host) # will parse host:port
        if req.has_data():
            data = req.get_data()
            h.putrequest('POST', req.get_selector())
            if not 'Content-type' in req.headers:
                h.putheader('Content-type',
                            'application/x-www-form-urlencoded')
            if not 'Content-length' in req.headers:
                h.putheader('Content-length', '%d' % len(data))
        else:
            h.putrequest('GET', req.get_selector())

        h.putheader('Connection', 'close')

##      scheme, sel = splittype(req.get_selector())
##      sel_host, sel_path = splithost(sel)
##      h.putheader('Host', sel_host or host)
        for name, value in self.parent.addheaders:
            name = name.capitalize()
            if name not in req.headers:
                h.putheader(name, value)
        for k, v in req.headers.items():
            h.putheader(k, v)
        # httplib will attempt to connect() here.  be prepared
        # to convert a socket error to a URLError.
        try:
            h.endheaders()
        except socket.error, err:
            raise urllib2.URLError(err)
        if req.has_data():
            h.send(data)

        code, msg, hdrs = h.getreply()
        fp = h.getfile()
        if code == 200:
            return urllib2.addinfourl(fp, hdrs, req.get_full_url())
        else:
            return self.parent.error('http', req, fp, code, msg, hdrs)

opener = urllib2.build_opener(HTTP11Handler())
urllib2.install_opener(opener)
The do_open method was borrowed from AbstractHTTPHandler, which HTTPHandler subclasses from. do_open should probably be refactored, but that's a discussion for another day. h.putheader('Connection', 'close') has been added to make sure that the connection closes right after the server has handled the request. The three lines that have been commented out are the ones responsible for adding the extra Host header.
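
With the opener installed, a plain urlopen call now goes through the HTTP/1.1 handler. A quick smoke test, assuming the feed URL from the trace above is still reachable:
# continues from the example above, where urllib2 is imported and the opener installed
f = urllib2.urlopen('http://2entwine.com/atom.xml')
print f.info()                 # response headers as reported by the server
print len(f.read()), 'bytes'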

This recipe is working for me right now, but it's possible that something has been overlooked. I really wish this functionality were provided by default inside urllib2.
Posted by Dudley at 04:28 AM

This site is licensed under a Creative Commons License.