Common Plisp: List Processing in Perl

I was recently talking on the phone with a person who lives at least 2,200 miles away and whom I’d never met or spoken to before. This is a surprisingly common occurrence in this day and age. I was explaining some of the things that I like about Perl. When I got to the part about how I love writing 4 or 5 lines of code where programmers of other languages have to write 20 or 30, my new friend hinted that he thought I was talking about completely unreadable Perl code. I was quick to point out that I wasn’t talking about the Perl golf that looks like encrypted text, but rather about using higher-order functions like map and grep. I’ve written about this sort of thing before. See, for example, the article titled Concise Programming and Super General Functions.

Here’s another example of how concise Perl can be. I’m going to present a program with 4 lines of code that reads a quoted CSV file (with a column-headings row) and parses it into an array of hash references, so that you can access a piece of data by row number and field name, like this:

print "keyword: $hdata[2]->{keyword}\n",
      " visits: $hdata[2]->{visits}\n";

The above code would print

        keyword: convert soap to rest
         visits: 81

Here’s the code:

#!/usr/bin/perl
use Text::ParseWords;

my @data= map {s/\s$//g; +[quotewords ',', 0, $_]} <>;
my @fields= map {lc $_} @{shift @data};
my @hdata= map {my $v= $_; +{map {($_ => (shift @{$v}))} @fields}} @data;

# Print the top 3 keywords and their visits
print "$_->{keyword} => $_->{visits}\n" for @hdata[0..2];

Here’s the sample CSV file:

keyword,visits,pages_per_visit,avg_time_on_site,new_visits,bounce_rate
"perl thread pool",210,1.00,00:00:00,23.81%,100.00%
"soap vs rest web services",152,1.00,00:00:00,100.00%,100.00%
"convert soap to rest",81,1.00,00:00:00,12.50%,100.00%
"perl threads writing to the same socket",63,1.00,00:00:00,0.00%,100.00%
"object oriented perl resumes.",52,1.00,00:00:00,20.00%,100.00%
"perl thread::queue thread pool",54,1.00,00:00:00,0.00%,100.00%,
"donnie knows marklogic install",43,1.00,00:00:00,25.00%,100.00%
"perl threaded server",45,2.00,00:01:28,75.00%,25.00%
"slava akhmechet xpath",44,1.50,00:08:46,0.00%,75.00%
"donnie knows",36,6.67,00:02:56,66.67%,0.00%

And here’s an example of how the code stores a line of that CSV file in memory:

{
   keyword => "perl thread pool",
   visits => 210,
   pages_per_visit => 1.00,
   avg_time_on_site => "00:00:00",
   new_visits => "23.81%",
   bounce_rate => "100.00%"
}

The program doesn’t use any modules that don’t ship with Perl, so you don’t have to install anything beyond Perl itself to make this program work. Also, once you learn a few standard Perl functions for processing lists and a little about Perl data structures, the code is actually very readable.

Let’s take a detailed look at the code to see how so little of it can accomplish so much. We’ll start with the definition of @data.

my @data= map {s/\s$//g; +[quotewords ',', 0, $_]} <>;

List processing and filtering functions are often best read backwards. Let’s start with the <> operator, which in list context reads all the lines from the files named on the command line (or from standard input if none are given). If you save the program as read.pl, for example, and you run it like this:

./read.pl file.csv

Then the <> operator will read the contents of file.csv and return them as a list of lines. The map function accepts a block of code and a list and returns a new list. Each element of the new list is the result of applying the code to the corresponding element of the original list, which the code sees as $_. Here’s an example of how to use the map function:

@n_plus_2 = map {$_ + 2} qw/1 2 3/;

@n_plus_2 ends up with the values 3, 4, and 5.
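
The "read it backwards" advice is easier to see when two list functions are chained. Here is a minimal sketch, assuming a made-up list of numbers (the variable names are illustrative and not part of the program above): grep keeps the elements for which its block is true, and map then transforms whatever is left.

```perl
use strict;
use warnings;

# Hypothetical input list, just for illustration.
my @numbers = (1 .. 10);

# Read right to left: take @numbers, keep the even ones, then square each.
my @even_squares = map { $_ * $_ } grep { $_ % 2 == 0 } @numbers;

print "@even_squares\n";   # prints "4 16 36 64 100"
```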

The function we pass to map in the @data assignment strips the line ending from each line of text (the newline, plus one whitespace character before it, such as a carriage return), then splits the line into values at the commas, ignoring commas inside quotes and allowing backslash-escaped quotes within a value. So @data ends up looking like this (only the first three lines of the example CSV file are included here):

(
   ["keyword", "visits", "pages_per_visit", "avg_time_on_site",
     "new_visits", "bounce_rate"],

   ["perl thread pool", 210, 1.00, "00:00:00", "23.81%", "100.00%"],

   ["soap vs rest web services", 152, 1.00, "00:00:00", "100.00%",
    "100.00%"]
   .
   .
   .
)
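
To see the splitting step in isolation, here is a small sketch that runs the same cleanup and the same quotewords call on two lines held in variables instead of read from a file. The first line comes from the sample file; the second, with commas inside the quotes, is made up for illustration.

```perl
use strict;
use warnings;
use Text::ParseWords;   # ships with Perl; exports quotewords

# One line from the sample file, with the newline that <> would leave on it.
my $line = qq{"convert soap to rest",81,1.00,00:00:00,12.50%,100.00%\n};

# A hypothetical line whose quoted value contains commas.
my $tricky = qq{"soap, rest, and friends",7\n};

# Same cleanup as the program above: drop the line ending.
s/\s$//g for $line, $tricky;

# Split at commas; 0 means "do not keep the quote characters".
my @values        = quotewords(',', 0, $line);
my @tricky_values = quotewords(',', 0, $tricky);

print scalar(@values), " values, first is '$values[0]'\n";
print scalar(@tricky_values), " values, first is '$tricky_values[0]'\n";
```

The quoted commas stay inside the first value, so the second line yields two values rather than four.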

The @fields assignment simply pulls the first row out of the @data array, lower-cases each column title, and assigns the resulting array to @fields.
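
Here is a sketch of that step on its own, using a cut-down stand-in for @data. The mixed-case column titles are an assumption, chosen only to make the effect of lc visible.

```perl
use strict;
use warnings;

# A small stand-in for @data: a header row followed by one data row.
my @data = (
    [ 'Keyword', 'Visits' ],
    [ 'perl thread pool', 210 ],
);

# shift removes and returns the first element (an array reference),
# @{ ... } dereferences it, and map lower-cases each title.
my @fields = map { lc $_ } @{ shift @data };

print "fields: @fields\n";                  # prints "fields: keyword visits"
print "rows left: ", scalar(@data), "\n";   # prints "rows left: 1"
```

Because shift removes the header row, @data holds only data rows from this point on.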

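Finally, the @hdata assignment converts each array reference in @data into a hash reference. It does so by pairing each key (a column title from @fields) with the value at the same position in the row. The inner map shifts the values off the row one at a time, so the rows in @data are left empty afterward. The resulting @hdata array contains one hash reference per data row.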
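
If the nested map is hard to follow, a long-hand version may help. This is a sketch of one equivalent way to build the same structure, not the code from the program: it uses an explicit loop and a hash slice, and the two-column @fields and @data are trimmed-down stand-ins for the real ones.

```perl
use strict;
use warnings;

# Trimmed-down stand-ins for the real @fields and @data.
my @fields = qw(keyword visits);
my @data   = (
    [ 'perl thread pool',          210 ],
    [ 'soap vs rest web services', 152 ],
);

# Build one hash per row.
my @hdata;
for my $row (@data) {
    my %record;
    # A hash slice pairs each field name with the value in the same position.
    @record{@fields} = @{$row};
    push @hdata, \%record;
}

print "$hdata[1]->{keyword} => $hdata[1]->{visits}\n";
# prints "soap vs rest web services => 152"
```

Unlike the one-line version, which empties each row with shift, the hash slice copies the values and leaves @data intact.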

How many lines of code does it take to do this kind of stuff in your favorite language?
