Some Simple Stats with Perl 6

When my daughter told me she was taking statistics, I wondered how she could do that without knowing any calculus. Then on second thought, I guess there is a certain amount you can do with statistics, even without knowing any calculus.

Ah, yes. And last night’s homework assignment drilled that home. She asked me to help her. First, she needed to find a quantitative data set she could use to construct a histogram. Looking around the living room, I suggested, “What about the heights of the books on my book shelves? You could measure the spines and write them down in a data table.”

Now, I currently have nine full shelves of books in my living room, everything from Dave Barry and Dilbert to Peopleware to marketing and memoirs, theology and psychology, short stories and the theory of writing, to Holly Lisle, Douglas Adams, Robert Heinlein, and The Sisterhood of the Traveling Pants. (And I am not making this up.) And those are just the books that are actually on shelves, not the books that are still in boxes.

I think she was a little overwhelmed by the idea.

“You don’t have to include all the books, just one or two shelves’ worth. Maybe that one over there, half full of mass-market paperbacks all the same size.” Easy to measure, easy to count.

So we chose that shelf and the shelf just under it, a nice cross-section of fiction in mass-market, trade paperback, and hardcover. And then while she was manually collating the data, inspiration struck me. Why don’t I write a quick Perl 6 script to check her results? It’ll give me something to write about; plus I’ll see whether Perl 6 is really useful for anything, especially minor scripts.

And as it turns out, it might be.

Ha! I bet you thought I was joking about Holly Lisle, Douglas Adams, Robert Heinlein, and The Sisterhood of the Traveling Pants.

Getting the Data into Memory

Perl has always been good at reading and parsing streams of text data, and Perl 6 is no different. I knew I wanted to enter the book heights as a simple text file, one height per line, with an optional quantifier. So for example:

8.5 x 3
7
8.5
9 x 2

This sequence would indicate measurements for 7 books, 4 of them 8.5″ tall, one of them 7″ tall, and the remaining 2 of them 9″ tall. (We rounded all measurements to the nearest ¼″… more or less. And if we happened to measure wrong, well, no one will ever find out, anyhow.)

So the first thing I needed to do to write my script was to read in the data, either from standard input or from a file listed on the command line. In Perl 5, we used to do that with while (<>) magic. Perl 6 has a better way.

for lines() {
    # . . .
}

Why is that better? Because no magic, that’s why. This does exactly what it says, no more, no less. For each of the lines in the default input, do… Do what?

Well, I want to parse out the height and quantifier using one of those très kewl Perl-6 regular expressions.

for lines() {
    if m/^ \s* (\d+[\.\d+]?) [\s* x \s* (\d+)]? \s* $/ {
        # . . .
    }
}

Okay, I didn’t actually use a Perl 6 grammar. That’s a topic for a different post. But I did write this simple regex in Perl-6 format. If you know standard Perl-5-inspired regular expressions, this probably looks familiar. But don’t be fooled! It doesn’t say what you think it does. Let’s take it one piece at a time.

Firstly, the match operator m//. Everything inside of the slashes is the regular expression. By default, in P6, whitespace in the regular expression does not matter. The whitespace is only there to make the regex more readable, not to change how the regex behaves.

[Photo: “Get it? Carrots and Lettuce.” © 2009 Jim Forest, CC BY-NC-ND 2.0]

The first thing in the regular expression is a caret ^, which matches the beginning of the string. Not the beginning of a line, as it can in Perl 5 under the /m modifier; always the beginning of the whole string. In P6 there is no /m modifier to change what ^ and $ mean. That doesn’t matter to us one way or the other, because we’re processing one line at a time. Our regex m/^ ... $/ matches everything from the beginning to the end of the string, which is one line long. (Don’t worry, Perl 6 still lets you match the beginnings and ends of individual lines in a multiline string, if you want, with the ^^ and $$ anchors.)
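To see the difference between the whole-string anchors and the per-line anchors, here is a minimal sketch. The two-line string is made up for illustration; it is not how the script reads its data.

```raku
# A two-line string standing in for multi-line input (made-up sample data).
my $text = "8.5 x 3\n9 x 2";

# ^ and $ anchor to the whole string.
my $whole-start = so $text ~~ / ^ 9 /;    # False: the string starts with "8"
my $whole-end   = so $text ~~ / 3 $ /;    # False: the string ends with "2"

# ^^ and $$ anchor to the start and end of any line within the string.
my $line-start  = so $text ~~ / ^^ 9 /;   # True: "9" starts the second line
my $line-end    = so $text ~~ / 3 $$ /;   # True: "3" ends the first line

say "whole-string: $whole-start $whole-end";
say "per-line:     $line-start $line-end";
```

Because the script feeds the regex one line at a time, either pair of anchors would behave the same way there.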

The next thing I tell P6 to do is to discard any whitespace that might happen to be at the beginning of the line. I do this by matching \s*, which reads just like it does in Perl 5. This whitespace is not captured by parentheses.

However, the next term on the line is captured by parentheses. (\d+[\.\d+]?). This is one or more digits followed optionally by a decimal point plus one or more additional digits. The square brackets [] are not a character class; in P6 they indicate a non-capturing group. (In P6, enumerated character classes are rolled into the more general-purpose “extensible metasyntax,” which is a different blog post.)

(Note, this is a very restrictive rule for a number. It requires that you provide an integer part to the number, even if it’s 0; and if you use a decimal point, you must also provide a fractional part. So numbers like 5. and .25 won’t be parsed. But this code is quick and dirty and should suffice for our purposes. Remember, this is a quick hack that I needed to throw together and get working in a few minutes.)

In the input, we can follow this captured number by an optional x Int, the quantifier, which is parsed by [\s* x \s* (\d+)]?. That integer, if it exists in the input, is captured by another set of parentheses.
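Here is a small sketch that runs the same regex against a few sample strings, so you can see what lands in each capture and which inputs are rejected. The sample strings and the @report array are my own, chosen only for illustration.

```raku
# The regex from the post, stored in a variable so it can be reused.
my $line-re = / ^ \s* (\d+[\.\d+]?) [\s* x \s* (\d+)]? \s* $ /;

my @report;
for "8.5 x 3", "  7  ", "9x2", "5.", ".25" -> $line {
    if $line ~~ $line-re {
        # $0 is the height; $1 is defined only when "x N" was present.
        my $qty = $1.defined ?? ~$1 !! "none";
        @report.push: "height=$0 quantity=$qty";
    } else {
        @report.push: "no match";
    }
}

.say for @report;
```

The last two strings, 5. and .25, fall through to "no match," which is the restrictive number rule described above at work.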

Can You Hash It?

Okay! Now I’m ready to take these data and stuff them into a hash:

my %num_of_height;
for lines() {
    if m/^ \s* (\d+[\.\d+]?) [\s* x \s* (\d+)]? \s* $/ {
        my $height = +$0;
        my $num = +($1 || 1);
        %num_of_height{$height} += $num;
    }
}

The $0 and $1 variables are shorthand sugar for $/[0] and $/[1], the first two parenthesized captures in the match object $/, which is lexically scoped. (All of this is different from Perl 5, but is more orthogonal, as you can shorthand some other stuff from the match object $/ in a similar way.)

Each of these I evaluate in a numeric context; that’s the unary + in each assignment. In my first try, I forgot to do this. The code that came after only partially worked, with strange behaviors and type errors, because P6 was processing the data as strings (just as I had told it to) rather than as numbers.
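A quick sketch of what the unary + changes. The input string is made up, and the variable names are only for illustration; the point is the type of each value before and after coercion.

```raku
# Match a made-up sample line so that $0 and $1 are populated.
"8.5 x 3" ~~ / (\d+[\.\d+]?) \s* x \s* (\d+) /;

my $raw-type = $0.^name;   # the raw capture is a Match object, not a number
my $height   = +$0;        # unary + coerces "8.5" to a Rat
my $count    = +$1;        # unary + coerces "3" to an Int

say "$raw-type, {$height.^name}, {$count.^name}";   # Match, Rat, Int
say $height * $count;                               # 25.5
```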

Also, if the quantifier, assigned to $num, is not specified in the input, it defaults to 1. So with no x Int in the input, $1 will evaluate to False, and evaluation will fall through to the || 1 in that line.

Lastly, I increase the quantity of books in the %num_of_height hash. I refer to the hash element for the corresponding height, and add to it the quantity of books from the current line of input.
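To try the accumulation without a data file, here is a sketch that swaps lines() for an in-memory array. The @input array is my stand-in for illustration; everything inside the loop is the same as above.

```raku
# Sample input held in memory instead of read via lines(), for illustration.
my @input = "8.5 x 3", "7", "8.5", "9 x 2";

my %num_of_height;
for @input {
    if m/^ \s* (\d+[\.\d+]?) [\s* x \s* (\d+)]? \s* $/ {
        my $height = +$0;          # numeric height, e.g. 8.5
        my $num    = +($1 || 1);   # quantity, defaulting to 1 when "x N" is absent
        %num_of_height{$height} += $num;
    }
}

say %num_of_height{8.5};           # 4: three from "8.5 x 3" plus one from "8.5"
say [+] %num_of_height.values;     # 7 books in total
```

Notice that the two lines for 8.5 end up in the same hash element, which is exactly what the += is for.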

Whew! The P6 code is way more succinct than the English.

Histogram, Histogram, Where Have You Been?

I coded all of the above in a few minutes. It took much longer to explain it.

Unfortunately, I’ve run out of space in this week’s post, so I’ll just leave you with one more thing for now.

The first task my daughter needed to do for her homework assignment was to draw a histogram of book heights, which involved counting the number of books of each height. I did not draw any fancy ASCII-art histogram graphics. But I did have my script report the number of books of each height:

say "Histogram data:";
for sort { $^a <=> $^b }, keys %num_of_height -> $height {
    my $num = %num_of_height{$height};
    say "$height x $num";
}

This loops through all of the keys in %num_of_height, which are the heights of the books, sorted from smallest to largest. For each height, I retrieve the number of books of that height and report it on standard output.
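Why bother with the explicit { $^a <=> $^b } block? Hash keys come back as strings, so a plain sort would compare them as text. Here is a minimal sketch of the difference, using a made-up hash that includes a 10.25″ book to make the problem visible.

```raku
# A made-up hash of height => count; the keys are stored as strings.
my %num_of_height = '8.5' => 4, '7' => 1, '9' => 2, '10.25' => 1;

# Default sort compares the keys as strings, so "10.25" sorts before "7".
my @as-strings = %num_of_height.keys.sort;

# The <=> operator compares numerically, giving smallest to largest.
my @as-numbers = %num_of_height.keys.sort({ $^a <=> $^b });

say @as-strings.join(", ");   # 10.25, 7, 8.5, 9
say @as-numbers.join(", ");   # 7, 8.5, 9, 10.25
```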

Next time, I’ll point out a couple of interesting P6 features in this code, and show you how I then computed the minimum, maximum, median, and lower and upper quartiles, each with just a few lines of code.


P.S. This is part of an extended series of posts on Perl 6. It started with a summary of Perl 6’s top 3 coolest features.
