Finishing Off that Bug in djangocms-picture (part trois)

In the prior episode of this story, I had written a test to detect the bug I want to fix. The djangocms-picture plugin includes errant whitespace inside of HTML anchor tags, which renders as annoying blue underlines next to the images.

I had also discovered why my test didn’t fail as it should have, because Django’s assertHTMLEqual() doesn’t detect all semantically significant whitespace differences. I’m sure it seemed a good idea at the time. But sorry, that’s not working for me.

Fortunately, it’s not that difficult to redesign assertHTMLEqual().

The workhorse of assertHTMLEqual() is django.test.html, a 200-something-line module based on Python’s HTMLParser that does two things: (1) validates that each HTML start tag has a corresponding, appropriately nested end tag and (2) assembles the HTML into an abstract syntax tree that can be compared and stringified to canonical form. If I’m willing to give up the start-end-tag validation (which I am), then all we really care about is putting the HTML in canonical form. If two different HTML snippets are put into canonical form, we can compared them even more quickly and easily than we can two data structures.

Why would I be willing to give up the validation? First of all, (a) I’m testing against expected HTML which I know to be valid. One doesn’t write test assertions to verify that one’s test data is correct. Rather, the code under test validates the test data. And if I’m truly concerned, I can use an external HTML validator, one time while writing the test. Secondly, (b) if the other HTML snippet is invalid, it won’t match the expected HTML. Therefore, the validation feature is superfluous.

The thing is, django.test.html doesn’t need an AST at all. It’s way overdesigned for this purpose. I guess I could imagine a use case in which we’d want to navigate the structure of the HTML in order to do tests. But for that, I suspect something like html5lib would be more useful. django.test.html is not designed to be extensible and is only used for comparing HTML.

I reduced the implementation down to ~40 lines of code that, for now at least, I’ve just put into my test module as private code.

The first thing I did is to stop calling it a “parser.” Yes, it’s based on HTMLParser, but that’s really an implementation detail. Its purpose is to normalize the HTML so that it can be compared to similarly normalized HTML.

class _NormalizedHTML(HTMLParser):

We derive from HTMLParser because it knows how to distinguish between tags and text and each is normalized differently. But we don’t need to “parse” them. All we need to do is to store the normalized strings. I put them in an array I call segments, which will ultimately be joined together to produce the full normalized HTML.

I also have the constructor accept the original HTML string, which it automatically feeds to the parser. That gets rid of an unnecessary step in initializing the class. (If needed, the html argument could be made optional. But that’s not needed at this time.)

    def __init__(self, html):
        HTMLParser.__init__(self)
        self.segments = []
        self.feed(html)
        self.close()

To normalize HTML start tags, I normalize the attributes by sorting them and also by normalizing the class attribute value.

    def handle_starttag(self, tag, attrs):
        normalized_attrs = [
            (name, " ".join(sorted(value.split(" "))))
            if name == "class"
            else (name, value)
            for name, value in sorted(attrs)
        ]

This code is taken almost verbatim from the original, except that I changed the name of the variable to be more self-documenting and sorted the attributes in the same statement.

(Now that I’m writing this post, I’m questioning whether this code is correct. It seems to me that the attributes should be split on arbitrary amounts of whitespace, and not on the space character. That is value.split() instead of what’s above. But we’ll stick with this for purposes of this discussion.)

The Python community has sometimes railed against Perl’s way of writing this same code. Apparently, Python is supposed to be more intuitive and more beautiful, and Perl is more cryptic and more ugly. Just for yuks, let’s compare how I would write this in Perl:

my @normalized_attrs = map {
    my ($name => $value) = @$_;
    $value = join(' ', sort split(' ', $value))
        if $name eq 'class';
    [ $name => $value ]
} sort _as_vectors @attrs;

There are a couple of significant differences here.

  • The Perl code actually has a block within the map statement, rather than Python’s transformed_value if condition else original_value. I could’ve done something similar to the Python code, maybe using Perl’s ?: operator, but I think that would’ve been less readable. I’m sorry— I’m confused: is the Python way still supposed to be the one best way to do this? I know that in Perl there’s more than one way to do it, but we still recognize that there are plenty of bad ways to do it, and we can choose not to do it those ways.

  • I also still find Perl’s join syntax more intuitive. If we were to make join a method, it should be a method on the array, not on the join string. This is what Perl 6 does. The equivalent split-sort-join expression in Perl 6 would look like this: $value.split(' ').sort().join(' '). Explain to me again the one best way to do this?

  • Since I’m talking about Perl 6 (which is actually a whole other language than Perl, and a much more powerful one at that), the whole statement would look like this in Perl 6 (or at least this is one of the better variations):

    my @normalized_attrs = @attrs.sort().map(
        -> ($name, $value) {
            if $name eq 'class' {
                ($name, $value.split(' ').sort().join(' '))
            } else {
                ($name, $value)
            }
        }
    );
    

Also (returning to the Perl 5 snippet above), the _as_vectors comparator needs to be written.

sub _as_vectors {
    for (0..$#{$a}) {
        my $r = $a->[$_] cmp $b->[$_]
        return $r if $r;
    }
    return @$a <=> @$b;
}

I actually do love Python’s orthogonality. I love that you can write sorted(foo), and it will All Just Work™. That’s the power of polymorphism built into Python’s DNA. You see something similar in lines like from my.package import my_module, which has no Perl equivalent.

Back to _NormalizedHTML. Once we have an array of normalized attributes, we stringify the entire tag, including attributes, and append it to the segments array.

        attr_str = " ".join([
            '%s="%s"' % (name, value)
            if value
            else "%s" % name
            for name, value in normalized_attrs
        ])
        tag_str = "<%s %s>" % (tag, attr_str)
        self.segments.append(tag_str)

You may notice that if the tag has no attributes, it will be stringified with an excess space, e.g., "<p >". While we don’t usually write HTML this way, as long as this is our “normal” form, it works. This code could include an additional conditional to stringify tags-with-attributes separately from tags-without-attributes, but it’s not necessary. (Although it does have a nice rhythm to it.) The only advantage would be to make the diff output more easily readable in the case of a test failure.

I also handled the end tags similarly to the start tags, although that was much easier.

    def handle_endtag(self, tag):
        tag_str = "</%s>" % tag
        self.segments.append(tag_str)

To normalize string data between tags, we need to collapse whitespace. But other than that it’s the same old same old.

    def handle_data(self, data):
        normalized_data = re.sub(r'\s+', " ", data)
        self.segments.append(normalized_data)

The original code split this into several paragraphs over a number of lines, an example of the sort of over-engineering I was referring to before.

WHITESPACE = re.compile(r'\s+')

And then:

def normalize_whitespace(string):
    return WHITESPACE.sub(' ', string)

And then elsewhere:

        if isinstance(element, six.string_types):
            element = force_text(element)
            element = normalize_whitespace(element)
            if self.children:
                if isinstance(self.children[-1], six.string_types):
                    self.children[-1] += element
                    self.children[-1] = normalize_whitespace(self.children[-1])
                    return

I elided the call to force_text. If I understand correctly, that might be useful for Python 2 compatibility, but I’m foregoing that for now. (The code I’m writing is only intended to run on Python 3. I would not, however, submit a pull request until adding Python 2 support. That’s another pain, BTW, that we don’t often have to deal with in the Perl world.)

I also am letting HTMLParser handle all the entity strings, which means that I don’t have to handle the case in which one block of text is concatenated to a previous block of text. But even if I did get two blocks of text in sequence, I believe the code as I have it would work just fine; we’d just end up with multiple text strings appended to segments.

Lastly, to generate the normalized HTML, we join all the segments together. We can also safely strip off any initial and final whitespace, as it’s likely to be semantically insignificant. The only case in which it might have significance is if it’s inserted into a context that requires no whitespace. But I’d like to leave that issue for higher-level integration.

This happens when we stringify the _NormalizedHTML object.

    def __str__(self):
        return ("".join(self.segments)).strip()

I also provide a convenience function to use the object and return the normalized HTML.

def _normalized(html):
    normalized_html = _NormalizedHTML(html)
    return str(normalized_html)

And I need to override assertHTMLEqual in my test class.

    def assertHTMLEqual(self, html1, html2, msg=None):
        self.assertEqual(_normalized(html1), _normalized(html2), msg)

This implementation elides some diagnostics that are present in Django’s assertHTMLEqual. However, the stock assertEqual does a really good job, and the final failure message has all the elements of a proper bug report: (a) what we were doing, (b) what we expected, (c) what happened instead.

Which reminds me, now the test (finally) failed.

Fixing the bug was easy and straightforward: I added {% spaceless %} and {% endspaceless %} around the template code.

Reran the test, saw it pass.

Then I did an end-to-end system test (manually with the browser), and the errant blue underscores disappeared. A somewhat anticlimactic denouement, however.

In the end, I was disappointed with what I saw of Django. I really love its architecture and its feature set. However, when even an almost-trivial fix like this requires so many hours of hacking with poorly-designed source code, that’s not a good sign.

(Part of my grief, to be sure, had nothing to do with Django, but with Django CMS. But the other major part was Django itself.)

So I don’t know what I’m actually going to do with my website revamp. I don’t think I’m going to pursue Django any further on this project. Get out before I’ve dug the hole any deeper.

For a long time, I’ve been thinking about an alternative web architecture, one that would allow me to pull together all my sites under a single, lightweight app, while providing a decoupled administration and content-creation UI that wouldn’t even have to run on a public server. I might pursue that (and it might even be the next post in the website-revamp series).

I also need to revise the content and design of my hub site. I’d like to make it more like an online resume, with colorful graphics and independent sections for each area of accomplishment.

Until next time…

Still typing…

This entry was posted in website revamp and tagged , , , . Bookmark the permalink.

Leave a reply