More Simple Stats with Perl 6

Last time, a time long, long ago, in a universe far, far away, I left you with the following Perl 6 code (or something like it). This code reads in a series of book heights and displays the number of books of each height (in order from shortest to tallest).

my %num_of_height;
for lines() {
    if m/^ \s* (\d+[\.\d+]?) [\s* x \s* (\d+)]? \s* $/ {
        my $height = +$0;
        my $num = +($1 || 1);
        %num_of_height{$height} += $num;
    } elsif ! m/^\s*$/ {
        note "Invalid line ignored: $_";
    }
}

#| unique book heights seen, in order from shortest to tallest
my @heights = sort { $^a <=> $^b }, keys %num_of_height;

say "Histogram data:";
for @heights -> $height {
    my $num = %num_of_height{$height};
    say "$height x $num";
}

As you may recall, this little bit of somewhat less-than-useful P6 script was inspired by a homework assignment in my daughter’s high-school statistics course. While she manually counted up the heights of the books we measured, in order to draw a histogram of the data, I wrote a Perl script to do the same counting.

Since then, for kicks, I’ve also hacked together about 50 lines of code to display an actual histogram of these data, with horizontal bars of *****. I’m not going to go into that code here, because I did not write it at the time, but if you want, you can check it out on Github.

However, I did discover how succinctly Perl 6 could answer the other questions from the assignment.

Tops, Bottoms, and Middles

After drawing a histogram of the data, the assignment asked for the minimum, maximum, and median of the data.

The min and max are easy enough to compute. Perl 6 even has operators to do that.

say "min = ", [min] @heights;
say "max = ", [max] @heights;

The P6 min and max infix operators simply return the minimum or maximum of their arguments. So in the classic algorithm, they could be used in a loop:

my $min = -Inf;
my $max = +Inf;
for @heights -> $height {
    $min = $min min $height;
    $max = $max max $height;
}

However, why code up several lines of code when a single line will do? In Perl 6, you can put square brackets around any infix operator to turn it into a reduce operator. That is, [min] and [max] do to the lists that follows them… Well, they do exactly what the above loop does, but in one sentence rather than three.

A brief note on “pointy blocks,” the -> $height { ... } in the code above. I love pointy blocks. I read it thusly: “For array ‘heights’ as $height…” I’m not saying this is easier to read than Perl’s foreach my $height (@heights) { ... }, because it’s not. But P6 pointy blocks can be used anywhere, and can be read the same way. For example, @heights.grep( -> $height { $height > 3 } ) to find all the heights greater than three… Okay, that’s a silly example, but only because the lambda is so short. If it were just a few lines longer, the example would no longer be silly.

So we have our minimum and maximum values. However, in this case, @heights is already sorted, in order from smallest to largest. Therefore, we can find the minimum and maximum height simply by taking the first and last elements of the array, respectively.

say "min = ", @heights[0];
say "max = ", @heights[*-1];

This last line uses the “whatever” star, which is spelled *, and is pronounced “whatever,” and means “whatever makes sense here.” In the context of an array index, * refers to the number of elements in the array. So a common Perl 6 idiom is to use @heights[*-1] to pull off the last height in the array.

Medians, and Medians of Medians

I found a suitable Perl 6 algorithm to find the median on Rosetta Code. The bulk of the algorithm is only one line long. But it requires all the data as distinct elements in a sorted array. In other words, I need an array not of unique book heights, but of all the books’ heights, every single one, even duplicates, all in order from shortest to tallest.

#| all books' heights, in order from shortest to tallest
my @all_heights = @heights >>xx<< %num_of_height{@heights};

As you may be able to figure out from the context, >>xx<< takes each element of the array on the left and xxs it with each corresponding element from the array on the right. In fact, you can take any infix operator and put it inside >>□<< in order to turn it into a “hyper operator.”

Now that we have an ordered list of all the heights, we can compute our median directly:

sub median (@a) {
    return ( @a[ @a.elems / 2 ] + @a[ @a.end / 2 ] ) / 2;
}

say "median = ", median(@all_heights);

Agoura Road Median

I ripped off the above algorithm from Rosetta Code, and refactored it slightly. All you may need to know is that .elems is the number of elements in the array. In some languages, this is called the “length” of the array. But Perl 6 generally avoids the word “length,” because it’s vague. Length? Length in terms of what? Similarly, .end is the last array index (which is .elems - 1).

Lastly, the assignment called for us to find the first and third quartiles. Checking Wikipedia, I see that no one can agree on which algorithm to use to find the quartiles. The concept is simple: you split the data at the median, and then find the median of each half. But in practice, the answer you get depends on whether you include the median itself in the data halves, and on how you handle the situation where a quartile needs to be an average of two data points. Fortunately, my daughter knew which algorithm she had been taught to use.

my $idx_median = @all_heights.end / 2;
my $median_is_datum = @all_heights.end %% 2;

say "first quartile = ", median(@all_heights[0 .. $idx_median]);
say "third quartile = ", median(@all_heights[
    ($median_is_datum ?? $idx_median !! $idx_median+1) .. *
]);

Nothing much magic going on here. The index of the median is one half the top index of the data. (Try it; it’s true.) This number may be “something and a half,” in which case the median is actually the average between two other data points. Therefore, we can find out whether the median is a data point by determining whether the top index is divisible by 2.

Then we take the median of each slice. The first quartile is the median of the first slice, from index 0 up to the median. If the median is “something and a half,” this expression will do what we expect and truncate the index down to an integer. According to the desired algorithm, we include the median in both slices if it is itself a data point. But if the median falls between two data points, the second half begins at the next data point. And the second half continues through to “whatever” (i.e., the end of the array).

Oh, I almost forgot. ?? !! is the way Perl 6 spells ? :, and that’s because the ? and : are used for other purposes now.

And there you have it!

Of course, the code only took about a half hour to write. All in all, it was an informative experiment. I discovered that Perl 6′s expressiveness made the code easy to write. And going back to it now, the code still reads well; it communicates my intent, not just the algorithm I used. I’m not ready to use P6 for every project that comes down the pike, because it’s still not widely and deeply enough supported. And then there are the performance issues. But I think a class of problems exists for which ease and quality of development trumps execution speed, and Perl 6 should increasingly become a desirable option for these solutions.

-TimK

P.S. This is part of an extended series of posts on Perl 6. It started with a summary of Perl 6′s top 3 coolest features.

This entry was posted in Uncategorized and tagged , , . Bookmark the permalink.

4 Responses to "More Simple Stats with Perl 6"

Leave a reply