I’m not a hackathon kinda guy. I don’t get off on solving hard or novel problems. I don’t believe in contests. I’m not particularly energized by the idea that a whole bunch of us are trying to solve the same problem at once— I do believe in community, and I appreciate being able to interact with other developers, but on a different level than just getting into the same (virtual or physical) room together, eyes to the computer screen.
So why then did I participate in the Dallas/Fort Worth Perl Mongers Winter 2013 Hackathon?
If we’re looking for root causes, the article that ultimately inspired my motivation, I think, was Paul Graham’s essay, “Beating the Averages.” In this famous piece, Paul tells the story of how, back in 1995, he and Robert Morris created Viaweb, which was an early e-commerce platform (bought by Yahoo! in 1998, to become Yahoo! Store). They chose to develop Viaweb using Lisp, because, Graham says:
Our hypothesis was that if we wrote our software in Lisp, we’d be able to get features done faster than our competitors, and also to do things in our software that they couldn’t do. And because Lisp was so high-level, we wouldn’t need a big development team, so our costs would be lower… We would end up getting all the users, and our competitors would get none, and eventually go out of business… Somewhat surprisingly, it worked.
The thing is, this sounds an awful lot like what Buddy Burden calls “Getting Shit Done”:
With other languages, I have to spell out every little thing. With Perl I can do the coding equivalent of saying “you know: do the thing with the thing” and it will just trundle off and do that. People say that magical variables like $_ are hard to learn. They’re not. They’re hard to teach. They’re sometimes hard to understand completely, difficult to grok, we might say. But they’re not hard to learn, because there’s nothing to learn. They just work.
Writing in a computer language must accomplish two things: (1) tell the computer how to solve the problem with which it’s being tasked, and (2) tell humans how the computer is solving the problem with which it’s being tasked. The first you can accomplish in Brainfuck. The second makes high-level development possible and requires a much more expressive language.
And that is what Perl was designed to do.
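A tiny, contrived example (mine, not Buddy’s) of what that “do the thing with the thing” expressiveness looks like in practice, with $_ doing its implicit work:

```perl
use strict;
use warnings;

# Filter some sample log text for error lines. The topic variable $_
# is set implicitly by grep and for, so the "thing" never needs a name.
my @lines = (
    "INFO  starting up",
    "ERROR disk full",
    "INFO  retrying",
    "ERROR disk still full",
);

my @errors = grep { /^ERROR/ } @lines;   # $_ is each line, implicitly
print "$_\n" for @errors;                # $_ again, still implicit

print scalar(@errors), " error(s)\n";
```

There is nothing to set up and nothing to tear down; the loop variable, the match target, and the print argument are all the same unnamed “thing.”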
But this is not really an article about the power of Perl. I’m a long-time Perl devotee, true, but that’s because I believe, on faith and experience, in the value of the Perl ecosystem.
However, in my software career, I have often had to battle the forces of quick-and-dirty development, and the prevalent—usually false—belief that there is a time and place for “quick and dirty.”
How fast can a good developer produce quality software using a disciplined process?
The hackathon had a simple aim: to develop a Perl program that could scan 100GB worth of files and find all the duplicates, as fast as possible. But the contest had other competition categories, besides Fastest Runtime. In particular, it had categories for Most Features and Best Documentation.
So I decided to participate in the hackathon. I read up on deduplication theory (most of which has to do with deduplicating database records, not files), analyzed the problem, and decided on a suitable algorithm.
Briefly, I would hash files using a variety of hash functions, grouping them by hash value. The simplest, broadest hash function would simply look up and return each file’s size in bytes. (Yes, file size is indeed a hash function.) Any collisions in file size would be resolved by using progressively better hash functions, culminating in something like SHA-1 (the same hash git uses to uniquely identify objects).
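A minimal sketch of that progressive-hashing idea (the names are mine, not the contest code): bucket files by size first, and only pay the cost of reading and hashing content for files whose sizes collide.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);
use File::Temp qw(tempdir);

# Sketch of progressive hashing: bucket files by size (a cheap "hash"),
# then re-bucket only the colliding files by SHA-1 of their content.
sub find_duplicates {
    my @files = @_;

    my %by_size;
    push @{ $by_size{ -s $_ } }, $_ for @files;

    my @dup_groups;
    for my $group (values %by_size) {
        next if @$group < 2;           # unique size => unique file
        my %by_sha;
        for my $file (@$group) {
            open my $fh, '<:raw', $file or die "open $file: $!";
            local $/;                  # slurp; real code would read in chunks
            push @{ $by_sha{ sha1_hex(<$fh>) } }, $file;
        }
        push @dup_groups, grep { @$_ > 1 } values %by_sha;
    }
    return @dup_groups;
}

# Demo on a throwaway directory.
my $dir = tempdir(CLEANUP => 1);
for ([a => 'hello'], [b => 'hello'], [c => 'world'], [d => 'xy']) {
    open my $fh, '>', "$dir/$_->[0]" or die $!;
    print {$fh} $_->[1];
}
my @dups = find_duplicates(map "$dir/$_", qw(a b c d));
# One duplicate group: a and b. File c collides with them on size
# (5 bytes), but SHA-1 separates it; d never gets hashed at all.
```

The payoff is that the expensive hash only ever runs on the (usually small) subset of files that the cheap hash couldn’t distinguish.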
By this time, most of the month-long contest period had already elapsed. My first commit happened on December 29, against a January 8 drop-dead date (and we needed to submit our code for judging a couple of days before that).
It seemed a perfect opportunity to test my preferred development values:
- Develop a proper architecture, with separation of concerns; not a 500-line hack script. (And I’ve done plenty of those in my day, too.)
- Unit tests, always and forever, and they have to pass.
- And write the unit tests first, before writing the code that makes them pass.
- Refactor the code often to simplify how it expresses its ideas, and to improve the solution’s architecture.
- Write documentation, lots of it. Write (most of) it while coding, not months after you’ve forgotten what the code was supposed to do.
I was looking to develop high-quality software. If the contest had had a category for Highest Design Quality, I would have been so on that. But good software design and good documentation, in my view, are not goods in themselves, but means to ease future software development. So Most Features was the primary competition category I was aiming for, together with a reasonably fast runtime (because if my solution didn’t run reasonably fast, it wouldn’t matter how flexible or easy it was to use).
Other twists: I chose to use p5-mop plus signatures, for a few reasons. First, they provided a succinct syntax for my Perl objects and subroutines. These cutting-edge features also represented the “client insists on the ‘cool’ technology” factor in the experiment, making it more realistic. And finally, it gave me an excuse to develop something semi-involved using p5-mop, which I had been eager to do. (So I myself was the “client” who insisted on the “cool” technology.)
I packaged my solution using Dist::Zilla, for ease of testing and distribution. And so that I could try out Dist::Zilla. (Also, the contest rules bestowed special recognition on anyone who provided a solution in the form of a distributable package, which served as an added incentive.)
Would you just let me figure this out?!
The hackathon challenge was simple in concept. We had a month (or in my case, about a week) to develop an application that would:
- Scan 100GB worth of files.
- Find and report which of them are duplicates.
- Run as fast as possible.
- In Perl.
Sounds doable— Oh, and by the way, links are not dups. The test data includes hard links and symbolic links, and your program needs to handle both gracefully.
Well, I guess that’s reasonable— Oh, and by the way, if the same file has multiple hard links, the path of the first hard link in sorted order is the one that must be reported as the authoritative copy.
Well, that’s an interesting question. I had assumed it wouldn’t matter which— Oh, and by the way, zero-length files are just like any other file; they should be reported as duplicates of each other.
Uh… My initial solution works a little differently, but maybe I can tweak— Oh, and by the way, there’s now a formal output specification that your program needs to conform to, so that we can verify that it is correctly reporting the correct results.
Hrmm… In other words, just like a real-life software-development project.
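The link rules above can be sketched in a few lines (again, my names, not the contest code): skip symbolic links entirely, and treat paths that share a device and inode number as one file, reporting the first path in sorted order as the authoritative copy.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Sketch of the link-handling rules: symlinks are not dups, and among
# hard links to the same inode, the first path in sorted order wins.
sub authoritative_paths {
    my @paths = @_;
    my %by_inode;
    for my $path (sort @paths) {
        next if -l $path;                    # symlinks are not dups
        my ($dev, $ino) = (lstat $path)[0, 1];
        push @{ $by_inode{"$dev:$ino"} }, $path;
    }
    # The first path pushed for each inode is the sorted-first path.
    return map { $_->[0] } values %by_inode;
}

# Demo (assumes a POSIX filesystem with hard- and symlink support).
my $dir = tempdir(CLEANUP => 1);
open my $fh, '>', "$dir/orig" or die $!; print {$fh} "data"; close $fh;
link    "$dir/orig", "$dir/alias" or die $!;   # hard link, same inode
symlink "$dir/orig", "$dir/sym"   or die $!;   # symlink, ignored

my @auth = authoritative_paths("$dir/orig", "$dir/alias", "$dir/sym");
# "alias" sorts before "orig", so alias is the authoritative path.
```

Using lstat rather than stat is the crux: stat would follow the symlink and happily report it as a third hard link.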
I started writing code on December 29. By December 30, I had a basic deduplication engine class, Data::Dedup::Engine, and had poked it, tweaked it, grappled with it, extended it, and refactored the hell out of it. This class can deduplicate any set of objects, using any set of hashing algorithms. So it encapsulates the deduplication logic, separating it from the object (file) data and the hash functions used to interpret that data.
As I expected, I was able to develop this module without any supporting framework, other than the unit tests that I had written during its development. And the code turned out more complex than I had originally thought it would. Frankly, I can’t imagine how I would have developed this module, at least not as quickly as I did, without unit testing.
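I don’t have the contest test suite at hand, but a test-first sketch of an engine-like interface looks something like this. The names here are hypothetical, not the real Data::Dedup::Engine API; the Test::More assertions come first in spirit, and the tiny package exists only to make them pass.

```perl
use strict;
use warnings;
use Test::More tests => 2;

# A tiny stand-in engine (hypothetical API, not the real
# Data::Dedup::Engine interface): group added items by a
# caller-supplied key function.
package TinyEngine {
    sub new    { my ($c, %a) = @_; bless { key => $a{key_for}, g => {} }, $c }
    sub add    { my ($s, $x) = @_; push @{ $s->{g}{ $s->{key}->($x) } }, $x }
    sub groups { my ($s) = @_; values %{ $s->{g} } }
}

# The tests pin down the behavior: items with equal keys end up in the
# same group; an item with a unique key gets a group of its own.
my $engine = TinyEngine->new( key_for => sub { length $_[0] } );
$engine->add($_) for qw(foo bar quux);

my @groups = sort { @$a <=> @$b } $engine->groups;
is_deeply( [ sort @{ $groups[1] } ], [qw(bar foo)], 'length-3 items grouped' );
is_deeply( $groups[0], ['quux'], 'unique item in its own group' );
```

Because the engine knows nothing about files, the same tests work whether the key function is a file size, an MD5, or a SHA-1.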
(Originally, the engine class was called DeDup::Engine—yes, even with different capitalization. During the course of the week, I ended up renaming DeDup::, by stages, to Data::Dedup::.)
The following day, December 31, I finished a simple file scanner (Data::Dedup::Files) and a primitive command-line interface. The file scanner encapsulated knowledge about how to scan the filesystem and interpret file data. The primitive CLI was there only so that I could experiment with the now-basically-complete file-deduplication code, using the real files on my actual hard drive.
That was New Year’s Eve. I spent my New Year handling zero-length files and hard links as per the new requirements.
For the next few days, I played with different hash functions and experimented with optimizing the code. The architecture I had developed made swapping hash functions exceptionally easy: there was really no code to modify; I simply called a different block of code or a different CPAN module. I ran the code through the NYTProf profiler, trying to wring more performance out of it, especially on optimized or high-speed filesystems. I dumped data structures from memory, trying to reduce the memory footprint. In both of these efforts, p5-mop was the bottleneck (which I’ve mentioned elsewhere).
On January 5, I refactored the CLI main script dedup_files to use a new Data::Dedup::Files::CLI class, which uses CLI::Startup. I also added Dist::Zilla support. That gave me just enough time to shine the porcelain before…
I submitted my code for judging on January 6. Further tweaks, documentation, and cleanup followed, for the presentation on the 8th.
Official runtime: 321 seconds, not blazing fast, but respectable. This was done in 763 lines of code using 522MB RAM, both toward the high end of the scale. And I was awarded prizes for Most Features, Best Documentation, and Best Effort (measured by counting significant git commits).
The power of disciplined development?
So does this tell us anything about how easy (or difficult) it is to develop complex software features using a disciplined approach? I’m not sure it does, because the experiment was uncontrolled. That is, there was no similar developer trying to accomplish the same thing as I, using a more ad-hoc process.
However, I was quite happy with my progress and with the ease of development. And my gut reaction is that the next time I face a similar scenario, too many features to develop in too short a time, I’ll similarly insist on a disciplined approach.