Predicting a User’s Movie Preferences

One thing that online video distributors (such as Netflix) don’t do yet, though they most certainly will in the future (some of the tech is already in place), is take into account the content of the movies themselves, so that they can learn a user’s likes in a way that is independent from, and complementary to, the likes of other users with similar preferences.

It’s been possible for a while, for example, to automatically identify “types” of documents and to tag or cluster those documents with a high degree of accuracy. Early last year, using a Euclidean-distance algorithm that worked on trigram-vector representations of text, I created an example of just such a thing, document affinity, using Wikipedia articles. In the example, you could choose a Wikipedia article and the system would list the other Wikipedia articles in order of similarity to the one you chose. Even though this method is rather well known, the results of that experiment were nothing short of astonishing: they provided a quality of links among Wikipedia articles that exceeds anything that exists today, better even than the human-made links within the articles. Moreover, the bag-of-trigrams method I just described is archaic compared with some of the algorithms that research groups are exploring today, like recurrent neural networks using word vectors (my favorite).
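
To make that concrete, here’s a minimal sketch in Common Lisp of the two pieces involved. This is illustrative only, not the code from the experiment, and both function names are mine: one function builds a trigram-count vector from a text, and the other measures the Euclidean distance between two such vectors.

;; A minimal sketch: represent a text as a table of trigram counts.
(defun trigram-vector (text)
  (let ((counts (make-hash-table :test #'equal)))
    (loop for i from 0 to (- (length text) 3)
          do (incf (gethash (subseq text i (+ i 3)) counts 0)))
    counts))

;; Euclidean distance between two trigram-count tables. A trigram
;; missing from one table counts as zero in that table.
(defun euclidean-distance (a b)
  (let ((keys (make-hash-table :test #'equal)) (sum 0))
    (loop for k being the hash-keys of a do (setf (gethash k keys) t))
    (loop for k being the hash-keys of b do (setf (gethash k keys) t))
    (loop for k being the hash-keys of keys
          do (incf sum (expt (- (gethash k a 0) (gethash k b 0)) 2)))
    (sqrt sum)))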

Imagine if, in addition to the standard preference algorithms, an online video distributor (such as Eros Now or Netflix) were to analyze the script of a movie (from the subtitles, for example), identify movies with similar scripts, and account for that similarity when choosing other movies that you might like. The distributor’s predictions would improve dramatically.

The technology to analyze the text of movies exists already. But, if you look into the future, you’ll find that soon computers will be able to analyze the visual part of a movie (the frames themselves and the motion) along with the sounds. Computers will be able to combine that type of analysis with script analysis for a full-content analysis that will improve preference predictions even more markedly. Google and others have already started to do some work in related areas, largely based on the contributions to deep learning that Geoffrey Hinton and his students have made in the last decade, including rectified linear units, dropout, ways of combining convolutional neural-network layers with fully connected layers for higher abstraction, and word-vector and phrase-vector training.

We’re going to see far better preference predictions in the near future, probably from smaller companies at first. In the long run, this kind of prediction, along with the more advanced kinds I described above, which encroach on the realm of human cognitive abilities, will become widespread.

My Wishlist for the Future

I’m waiting for the following developments, and contributing in any way that I can to their fruition:

Technology

  • All languages to converge toward a Lisp-like language
  • Trillion-fold increase in computer processing power
  • Superhuman-level artificial general intelligence
  • Human mind uploads
  • Indefinite human lifespans

Government

  • Dissolution of all international borders
  • Fully distributed, internet-like, free-market, and non-invasive economy, with multiple distributed-ledger based currencies
  • Consolidation of all taxes, including income tax, sales tax, gas tax, local taxes, and property taxes into a single, monthly bill, at the local level
  • Completely open government process that collects and incorporates feedback from citizens, uses revision control, and uses open-source software to automate all aspects of government
  • Replacement of all government-issued business licenses (including medical licenses, legal licenses, engineering licenses, education licenses, and driving licenses) with commercial rating systems as well as legal and enforcement systems that can encourage honesty and deter wrongdoers
  • A one-year sunset on all regulations

I fully expect all of these developments to happen in my lifetime. They will bring tremendous wealth, health, equality, meaning, and worthiness to all of humanity.

Cleaning House by Open Sourcing

I’m going to start open sourcing some of the code that I currently have in private repos. It’s not going to be much at first, because I tend to collaborate with others when writing code, or I write it for companies that would never consider open sourcing their code. However, I’m going to start now and I’m going to try to talk collaborators into letting me open source stuff that I write in the future.

Open sourcing some of this code might encourage me to make it more usable.

I’m starting with some code that I use to tinker with neural networks: dc-ann

It’s disorganized right now but, in time, I’ll put up some examples of how to use it, convert it to a Quicklisp project, and make it easier to use. Furthermore, the code, as released, leaves out the transfer-function and weight-initialization changes that make the neural network a deep-learning neural network. I’ll get those in as soon as I clean up that code in the private repo.

Here are some other bits that I’m tossing into the open-source community now:

  • dc-affinity (Lisp code that demonstrates the effects of measuring Euclidean distance between trigram vectors)
  • Utyls (Perl utilities that I use frequently)

More bits that I plan to release soon include a ton of Emacs Lisp code and more Perl utilities.

API for a Mobile App in Common Lisp

Not too long ago, I released a mobile app called Lil’ Baby Names, for iOS and Android. I had a few friends help me: Mark Edelsberg designed the UI, Liu Xinjun built the Android front end, and Pritesh Shah built the iOS front end. I built the API that the devices use. What’s interesting about the API is that it’s written in Common Lisp. Why? Most of all because I love Lisp code. To my eye, when the same algorithm is written in Lisp and in any other language, the Lisp version always looks prettier. But I’ll get to that in another post. Here are some other reasons why I chose Lisp:

  • The code compilation process is fast and transparent
  • The code compiles directly into machine code
  • I can recompile functions in a running program, so I don’t have to restart the production server if I change any code
  • The REPL allows me to test pieces of my code as I go, providing continuous feedback on the work that I’m doing
  • Lisp macros and other Lisp features make the code concise and easy to follow
  • The resulting service is really fast

For example, the Lil’ Baby Names API has a single endpoint and here’s how I define that endpoint:

(map-routes
  (get "/api/names" api-names))

Here’s another example. This code reads a small map of genders to gender IDs from the database:

(flatten (db-cmd *db* :query
                 (:select 'gender 'id :from 'genders)))

Note how I write my SQL in Common Lisp. That’s possible thanks to Lisp macros. This is not just beautiful, but also convenient: now, my editor can properly highlight, indent, and detect syntax errors in my SQL code.
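
For instance, if the genders table held just two rows, (“female”, 1) and (“male”, 2) (made-up data, for illustration), the expression above would evaluate like this:

;; The :query form returns the rows as nested lists...
;;   (("female" 1) ("male" 2))
;; ...and flatten collapses the nesting into a single list:
;;   ("female" 1 "male" 2)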

As a final example, take a look at code that formats the results into JSON or plain text, depending on the value of the format variable:

(case format
  (:json
   (ds-to-json
    (ds `(:map :time ,(format nil "~ss" (read-time event))
               :total-matches ,results-length
               :page ,page
               :page-size ,page-size
               :regex ,regex
               :starts-with ,starts-with
               :ends-with ,ends-with
               :contains-string ,(get-parameter "contains-string")
               :contains-letters ,(get-parameter "contains-letters")
               :min-length ,min-length
               :max-length ,max-length
               :format ,format
               :sort ,sort
               :results ,(ds-list sorted-paged-results)))))
  (otherwise
   (format nil "~{~a~^~%~}"
           (mapcar (λ (x) (gethash :name x))
                   sorted-paged-results))))

Note the use of the Common Lisp format function, and how concise it can make the code. Also, notice how easy it is to transform a Lisp object or nested list into JSON.

Sure, there are some negative aspects of the language. For one, Common Lisp is a big language with a number of fairly powerful functions and macros (think format and loop). It can take a little time to get the hang of those. Furthermore, the best Common Lisp editor is probably Emacs, and who wants to learn that? However, for those brave programmers who choose to overcome those obstacles, Common Lisp provides a real glimpse into the future of programming.

Valentines for Nina and Lola

Nina and Lola,

Whenever I’m away from you and I spend a lot of time writing computer code, I think about the pretty patterns that your names make in binary code. Your names, Nina and Lola, are wonderful in so many ways that I could never describe them all. Here’s one of those ways.

If you define the following Common Lisp function, you can produce a graphical representation of your names in binary code.

(defun binary-name (name &optional (zero #\Space) (one #\O))
  (map 'string (lambda (c) (case c (#\1 one) (#\0 zero) (t c)))
       (format nil "~%~{~b~%~}" (map 'list #'char-code name))))

After defining a function like that, you can call the function with your names, like this, for example:

(binary-name "Nina")

Which returns the following pattern:

    O  OOO 
    OO O  O
    OO OOO 
    OO    O


That pattern represents your name! The letters are in rows. So, ‘N’ is ‘O  OOO’ and ‘i’ is ‘OO O  O’. Here’s the same pattern with letters in front of each row:

    N => O  OOO 
    i => OO O  O
    n => OO OOO 
    a => OO    O


One interesting thing is that, in your name at least, the capital letters start with ‘O ’ and the lower-case letters start with ‘OO’. (That happens because the computer codes for all the capital letters start with the binary digits 10, and the codes for all the lower-case letters start with 11.)

Lola, here’s the pattern for your name:

    O  OO  
    OO OOOO
    OO OO  
    OO    O


Here’s the pattern for ‘Daddy’:

    O   O  
    OO    O
    OO  O  
    OO  O  
    OOOO  O


Notice how the pattern for ‘Daddy’ is bigger than the patterns for ‘Lola’ and ‘Nina’? That’s just because ‘Daddy’ has more letters.

Here’s the pattern for ‘loves’:

    OO OO  
    OO OOOO
    OOO OO 
    OO  O O
    OOO  OO


Here’s the pattern for ‘&’:

    O  OO


So, what do you think this means:

    O   O      OO OO   
    OO    O    OO OOOO     O  OOO               O  OO  
    OO  O      OOO OO      OO O  O              OO OOOO
    OO  O      OO  O O     OO OOO               OO OO  
    OOOO  O    OOO  OO     OO    O    O  OO     OO    O

Happy Valentine’s Day to you both! I love you more than anything else in the whole world and I’ll always be there for you.

–Daddy

Catching the Human Mind

I was benchmarking PCs to try to determine where to put some Chattermancy code and I started to wonder once more where we are in the unintentional technological race to build a cheap computer that can outperform the human brain. I’ve done this exercise a half-dozen times over the years. Here’s what I found this time.

Calculating how long it’s going to take for PCs to become as powerful as the human brain is easy if you can make a few assumptions. Here are my assumptions:

  • Processing power of the human brain: 100 teramips
  • Processing power of an inexpensive PC today: 0.025 teramips
  • Rate of increase of computer processing power: doubling every year

With those assumptions and a tool like a simple spreadsheet, a small computer program, or a pencil and some paper, anyone can calculate more or less when an inexpensive PC will have the processing power of the human brain. I chose the computer program path. Here’s an example in Common Lisp:

(defun forecast (&key (today-teramips 0.025) (target-teramips 100))
  (let ((power (- (/ (log target-teramips) (log 2))
                  (/ (log today-teramips) (log 2)))))
    (format nil "~a~a~a~%~a~3$~a~%~a~$"
            "Human brain throughput: " target-teramips " teramips"
            "PC throughput: " today-teramips " teramips"
            "Years until inexpensive PC outperforms human brain: " power)))

If you run that program like this

(forecast :today-teramips 0.025 :target-teramips 100)

the program will spit out the following text:

Human brain throughput: 100 teramips
PC throughput: 0.025 teramips
Years until inexpensive PC outperforms human brain: 11.97

If your assumption about the processing power of the human mind is off by an order of magnitude (say the human mind actually runs at 1,000 teramips rather than 100), the result differs little: about 15 years instead of 12, in this example.
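
You can check that claim with the forecast function itself; log2(1000 / 0.025) is about 15.29:

(forecast :today-teramips 0.025 :target-teramips 1000)

The last line of the output becomes:

Years until inexpensive PC outperforms human brain: 15.29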

If you make the same assumptions that I made, then the calculation itself is easy and probably not disputable. However, the assumptions are definitely quite disputable, so I will explain how I arrived at those.

Processing power of the human brain

This number is the hard one to pin down, and picking the right value for this variable is the key to the whole exercise. I used 100 teramips, which is in line with well-known estimates; Hans Moravec, for example, famously put the brain’s throughput at roughly 100 million MIPS, which is 100 teramips.

Processing power of an inexpensive PC today

To figure this out, I started by downloading the BYTEmark benchmarking software from http://www.tux.org/~mayer/linux/bmark.html. I then extracted, compiled, and ran the software on a quad-core Core 2 machine with 6GB of RAM, a machine that is worth about US$800.00 today. One number that I used from the results of that test was 81.153, the original BYTEmark integer index. This number means that the computer is roughly 81 times faster than the baseline for this test, a Pentium 90. I found information on the Internet indicating that the Pentium 90 ran at approximately 150 mips.


That information allowed me to calculate a mips rating for my computer:

81.153 * 150 mips ≈ 12,172 mips

But my computer has 4 processor cores and this test measures the throughput of a single core. I multiplied the result by 2 rather than by 4 to obtain a conservative rating of roughly 25,000 mips (0.025 teramips) for my computer.

Rate of increase of computer processing power

This was the easiest of all the assumptions to make. I simply checked the processing power of the fastest computer you can buy for US$1,000 today, and it turned out to be about double that of the fastest computer you could buy a year ago for the same amount of money. Even so, I believe that this assumption of computer power doubling every year is extremely conservative, because there are now desktop supercomputers (still expensive, in the ten-thousand-dollar range) that, for some tasks, are hundreds of times faster than a PC. That suggests that in the near future we’re going to see inexpensive computers with hundreds of processors, which will likely make the conservative estimate we’re using here seem far too conservative.

Common Plisp: List Processing in Perl

I was recently talking on the phone with a person who lives at least 2,200 miles away and whom I’d never met or spoken to before. This is a surprisingly common occurrence in this day and age. I was explaining some of the things that I like about Perl. When I got to the part about how I love writing 4 or 5 lines of code where programmers of other languages have to write 20 or 30, my new friend hinted that he thought I was talking about completely unreadable Perl code. I was quick to point out that I wasn’t talking about the Perl golf that looks like encrypted text, but rather about using higher-order functions like map and grep. I’ve written about this sort of thing before. See, for example, the article titled Concise Programming and Super General Functions.

But, here’s another example of how Perl can be very concise. I’m going to present a program with 4 lines of code that can read a quoted CSV file (with a column-headings row) and parse it into an array of hash references, so that you can access a piece of data by line number and field name, like this:

print "keyword: $hdata[2]->{keyword}\n",
      " visits: $hdata[2]->{visits}\n";

The above code would print

        keyword: convert soap to rest
         visits: 81

Here’s the code:

#!/usr/bin/perl
use Text::ParseWords;

my @data= map {s/\s$//g; +[quotewords ',', 0, $_]} <>;
my @fields= map {lc $_} @{shift @data};
my @hdata= map {my $v= $_; +{map {($_ => (shift @{$v}))} @fields}} @data;

# Print the top 3 keywords and their visits
print "$_->{keyword} => $_->{visits}\n" for @hdata[0..2];

Here’s the sample CSV file:

keyword,visits,pages_per_visit,avg_time_on_site,new_visits,bounce_rate
"perl thread pool",210,1.00,00:00:00,23.81%,100.00%
"soap vs rest web services",152,1.00,00:00:00,100.00%,100.00%
"convert soap to rest",81,1.00,00:00:00,12.50%,100.00%
"perl threads writing to the same socket",63,1.00,00:00:00,0.00%,100.00%
"object oriented perl resumes.",52,1.00,00:00:00,20.00%,100.00%
"perl thread::queue thread pool",54,1.00,00:00:00,0.00%,100.00%
"donnie knows marklogic install",43,1.00,00:00:00,25.00%,100.00%
"perl threaded server",45,2.00,00:01:28,75.00%,25.00%
"slava akhmechet xpath",44,1.50,00:08:46,0.00%,75.00%
"donnie knows",36,6.67,00:02:56,66.67%,0.00%

And here’s an example of how the code stores a line of that CSV file in
memory:

{
   keyword => "perl thread pool",
   visits => 210,
   pages_per_visit => 1.00,
   avg_time_on_site => "00:00:00",
   new_visits => "23.81%",
   bounce_rate => "100.00%"
}

The program doesn’t use any modules that don’t ship with Perl, so you don’t have to install anything beyond Perl itself to make this program work. Also, once you learn a few standard Perl functions for processing lists and a little about Perl data structures, the code is actually very readable.

Let’s take a detailed look at the code to see how so little of it can accomplish so much. We’ll start with the definition of @data.

my @data= map {s/\s$//g; +[quotewords ',', 0, $_]} <>;

List processing and filtering functions are often best read backwards. Let’s start with the <> operator, which in list context reads all the lines from the file whose name you provide to the program. If you save the program as read.pl, for example, and you run it like this:

./read.pl file.csv

Then the <> operator will read the contents of file.csv and return it as an array of lines. The map function accepts some code and an array and returns a new array. Each element in the new array consists of the result of applying the given code to the corresponding element in the original array. Here’s an example of how to use the map function:

@n_plus_2 = map {$_ + 2} qw/1 2 3/;

@n_plus_2 ends up with the values 3, 4, and 5.

The function we pass to map in the @data assignment removes trailing whitespace from each line of text, then splits the line into values at the commas (excluding quoted commas, of course, and allowing escaped quotes within a value). So @data ends up looking like this (only the header row and the first two data rows are shown here):

(
   ["keyword", "visits", "pages_per_visit", "avg_time_on_site",
     "new_visits", "bounce_rate"],

   ["perl thread pool", 210, 1.00, "00:00:00", "23.81%", "100.00%"],

   ["soap vs rest web services", 152, 1.00, "00:00:00", "100.00%",
    "100.00%"]
   .
   .
   .
)

The @fields assignment simply pulls the first row out of the @data array, lower-cases each column title, and assigns the resulting array to @fields.
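
Given the sample CSV file above, @fields ends up holding the six column titles:

# Contents of @fields, given the sample CSV file:
("keyword", "visits", "pages_per_visit", "avg_time_on_site",
 "new_visits", "bounce_rate")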

Finally, the @hdata assignment converts each array reference in @data into a hash reference. It does so by associating a key (a column title) with each value in the array reference. The resulting @hdata array contains hash references.
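
If that nested map is hard to digest, here’s a more verbose equivalent of the @hdata assignment, written as a loop with a hash slice (a sketch, not part of the original program):

# A more verbose equivalent of the @hdata assignment
my @hdata;
for my $row (@data) {
    my %h;
    @h{@fields}= @$row;  # hash slice: pair each column title with a value
    push @hdata, \%h;    # (unlike the map version, this doesn't empty the rows)
}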

How many lines of code does it take to do this kind of stuff in your favorite language?

Perl Sockets Swimming in a Thread Pool

I’ve written a simple multithreaded TCP server that uses sockets and a thread pool to handle client requests. This server is packaged into a class (DcServer) that is trivial to use. Here’s how you might use DcServer to build a server that accepts text from clients and returns the same text with the characters in reverse order:

use DcServer;
my $server= DcServer->new(processor_cb => \&reverse_text);
$server->start;

sub reverse_text {
    my $data= shift;
    join('', reverse(split //, $data));
}

Run server.pl like this: perl server.pl

The server defaults to 10 threads, so 10 clients could be having their text reversed concurrently. And you could probably serve a great many of these clients per second.

I’ve also created code to help me build clients. You can, for example, build a client for the above server like this:

use DcClient;
my $message= shift;
die "Usage: perl client.pl string\n" unless defined($message);
my $c= DcClient->new;
print "$message => ", $c->query($message), "\n";

Run client.pl like this: perl client.pl "hello"

The code for the server is short and bare-bones and illustrates how to pass a socket from one thread to another.

Yes, I suffer from the NIH (not-invented-here) syndrome. I have no excuse for having written this. Let’s leave it at that. Nevertheless, I hope the code is useful to someone out there, directly or as a simple example of how to set up thread pools and how to share sockets among threads.

Here’s a little reference for how to use the DcServer module, which depends on these modules only:
– threads
– threads::shared
– Thread::Queue
– IO::Socket
– Time::HiRes

Overview

DcServer is a class that encapsulates a TCP server that implements a thread pool to handle client requests. The class allows the instantiator to provide callback functions for processing client requests, performing background work, and performing tasks after the server has shut down. The instantiator can shut down the server from within two of these callback functions. For example, the instantiator can shut down the server when certain arbitrary conditions are met (regardless of whether clients are connecting or not) or when a client explicitly requests it.

DcServer Methods

new
This method instantiates a server. It takes the following parameters.

  • host An IP address or host name that resolves to an IP address. This is to tell the server where it will be running. Clients should specify the same host name in order to connect. This value defaults to ‘localhost’ if you don’t specify it.
  • port The port on which the server will listen for connections. This value defaults to 8191 if you don’t provide it.
  • thread_count The number of threads that you want the server to start for its thread pool. The server uses the thread pool for handling client connections. This value defaults to 10 if you don’t specify it.
  • eom_marker This is the sequence of characters that the server and the client use to tell each other that they’re done transmitting data. This value defaults to “\n.\n”, which means “newline, period, newline”.
  • main_yield The server allows the instantiator to specify a callback function that the server will call in a loop. After calling this function, the server sleeps for main_yield seconds before calling the function again. The value of main_yield defaults to 5 seconds.
  • main_cb A reference to a function that the server calls over and over again, independently of accepting and processing client connections. The server calls the function specified via main_cb with two positional parameters, $counter and $fnstop. The first parameter, $counter, contains an integer with the number of times that the server has called the main_cb function. The second parameter, $fnstop, is a code reference that, when called as a function, causes the server to stop. You can call the function in $fnstop like this: $fnstop->()

    When you call the $fnstop function, the function returns immediately, but by then the call has already initiated a server shutdown.

  • done_cb A reference to a function that the server calls when it has completed its shutdown procedure.
  • processor_cb A reference to a function that the server calls to process the data that the client has sent. The function should accept a string (the client’s request) and it should return another string (the result of processing the client’s request). More specifically, the server calls the processor_cb function with the following positional parameters: $data, $ip, $tid, and $fnstop. The following list describes these parameters:
    • $data: The data that the client sent to the server. This amounts to the client’s request. It is up to the instantiator to interpret and process that request.
    • $ip: The IP address of the client.
    • $tid: The ID of the thread that is handling the request.
    • $fnstop: A reference to a function that you can call to stop the server. You can call this function like this: $fnstop->()

    The processor_cb function should return a string consisting of an answer to the client’s request. The processor_cb function can call the $fnstop function and still return data to the client. But if you do that, the call to $fnstop should occur immediately before the function returns, not before the function processes the client’s data.

start
This method takes no parameters. It simply starts the server that you previously instantiated with the new method.
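
Putting the method reference together, here’s a sketch of a server that uses several of these parameters at once. It upper-cases client requests and shuts itself down after 100 passes through the main loop (the callback signatures follow the descriptions above):

use DcServer;

my $server= DcServer->new(
  host => 'localhost',
  port => 8191,
  thread_count => 4,
  main_yield => 2,
  main_cb => sub {
    my ($counter, $fnstop)= @_;
    $fnstop->() if $counter >= 100;  # initiate a server shutdown
  },
  done_cb => sub {print "Server stopped.\n"},
  processor_cb => sub {
    my ($data, $ip, $tid, $fnstop)= @_;
    uc $data;  # the return value travels back to the client
  });

$server->start;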

Acknowledgements

While writing this code, I ran into some serious trouble with passing sockets from one thread to another. I was able to resolve the issue thanks to something about sockets and threads that BrowserUk posted back in 2006.

Thanks BrowserUk! (And thanks PerlMonks!) I am not worthy of your guidance.

Source Code

You can view or download the code for the DcServer and DcClient modules here.

You’re welcome also to add the code to CPAN or to POD it or to attempt to convince me to do any or all of this. I know that I should, but I’m such a sloth that I’ve never contributed even a single module to CPAN. The only thing I ask is that you let me know if you manage to improve the code in any significant way.

Concise Programming and Super General Functions

Some time ago, I started to develop programs using tiny, super general, higher-order functions and I began to borrow a little from functional programming. I noticed interesting things about these new programs. They called for special language features that I had until then heard of but never really needed. I’m referring to language features like support for first-class functions, closures, and lexical binding. The programs I produced were smaller and easier to debug than programs written in a traditional procedural style. Quite suddenly, I was able to write programs faster than ever before, I was delivering much higher-quality code, and I had found a whole new love of programming.

I work with data sets with millions of records, so often I need to write a program that I use only once. Sometimes, not by choice, I have to use PHP, Java, XQuery, C#, Visual Basic.NET, the old ASP, and even C or C++ to create or maintain applications. To deliver code or accomplish tasks at the highest possible rate, I use Perl most of all, together with a number of widely-employed programming techniques, such as object-oriented programming, extensive use of libraries (I try as much as possible to avoid the not-invented-here syndrome), and revision control. But I also use techniques that are not in such widespread use, despite the fact that they’ve been around for quite a while. I am going to describe one of those techniques here.

If you are a Lisp programmer, you probably already use this technique. For me, learning Lisp elevated the idea of abstraction and conciseness in code to a whole new level. I am a far better Perl programmer for having learned a tiny bit of Lisp. This is all the more impressive if you consider that I’ve never even put a Lisp program into production. Most of what I learned I learned by solving problems in projecteuler.net and by extending my text editor, Emacs.

If you are not a Lisp programmer, however, you might be wondering what the advantages of concise code are. By “concise code”, I don’t mean code that uses fewer characters or lines because of shorter variable names, less white space, longer lines, or other such economizing measures. Instead, I mean code that uses super general functions to achieve a high degree of abstraction and that borrows from functional programming to reduce the use of intermediate variables to generate simpler, more reliable modules.

Functional Programming

Functional programming means writing code such that the return value of a function depends completely on the function’s parameters and nothing else. Functions have no side effects. Coding in a functional style provides a number of advantages. Functional programs are trivial to parallelize or distribute among large clusters. They are easier to maintain, enhance, and debug. Large functional programs are easier to transform; if you ever need to diagram a program or convert it to another language, you’ll be lucky if it was written in a functional style. And functional programs require fewer intermediate variables, because functions are easy to chain together.
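
Here’s a tiny Perl illustration of the difference:

# A pure function: the result depends only on the arguments
sub fahrenheit {
  my $celsius= shift;
  $celsius * 9 / 5 + 32;
}

# Not a pure function: it reads and changes state outside itself
my $total= 0;
sub add_to_total {
  $total+= shift;
}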

Another interesting thing about functional programming is that the order in which functions are executed flows obviously from the interdependencies of the functions.

To grasp the concept of functional programming, just think of a spreadsheet. You enter functions in the cells, never really worrying about which function is going to execute first or how the cells that your function references arrive at their values. There’s a reason why spreadsheets evolved to use functional programming: they had to be easy to maintain. Users could not be made aware that they were actually programming. Everyone knows that programming is difficult, and few users would have adopted spreadsheets if it meant that the users had to program spreadsheets in a traditional procedural fashion.

The code I describe here is not strictly functional. It is multi-paradigm code. The functions that the code defines are all actually methods that sometimes change the value of attributes of the objects to which they’re attached. But the code borrows from functional-style programming.

Super General Functions

A super general function is one that performs a task that is general enough that the function can be used for a whole class of problems rather than just a specific kind of problem. Typically, a super general function is a higher-order function, which means that the function accepts one or more other functions as parameters.

Loop keywords are super general functions in my view, because they are like functions that accept code, in the form of the body of the loop. But when I think of super general functions in Perl, I’m thinking more about the built-in super general functions map, grep, and sort, and about the myriad other such functions and methods within the libraries of CPAN, like reduce and part. I will discuss several Perl functions that are super general and then I’ll write about how to code custom super general functions, because identifying the need for such functions and then coding them is the most effective way to make your code extremely concise.

sort

A good example of a super general function is the Perl sort function. It sorts. That’s what makes it general. It doesn’t just sort string values alphabetically, in ascending order. And it doesn’t just sort numbers either. It’s a super general function because it can sort anything, in any which way. You can provide code to the sort function to indicate how to sort the array of strange objects that you are giving to the function.

Consider the array of 6 elements in Listing 1.

my @list= (
  +{name => 'Donnie', age => 41},
  +{name => 'Jim', age => 41},
  +{name => 'Frank', age => 45},
  +{name => 'Charles', age => 40},
  +{name => 'Bill', age => 41},
  +{name => 'Samantha', age => 40}
);

The Perl sort function (like the sort functions of many other languages) allows you to specify if you want to sort the elements of the array by name or age, and provides a mechanism for specifying how to compare the elements of the array. Here’s how you could sort the array by age, in ascending numerical order:

my @sorted_list= sort {$a->{age} <=> $b->{age}} @list;

To sort the list by name, in descending alphabetical order, you could do this:

my @sorted_list= sort {$b->{name} cmp $a->{name}} @list;

Of course, you can use Perl’s flexible sort function in a more complex expression or combine it with other functions to perform a great many tasks. For example, the following code selects the name of the oldest and youngest persons on the list:

my ($o, $y)= map {$_->{name}} (sort {$b->{age} <=> $a->{age}} @list)[0, $#list];

While this code works well for small lists, I would never use such code to find the oldest and youngest persons if I had a very large list of records. Instead, I might use the reduce function or maybe a combination of the minmax function and a hash. But this example certainly shows how general Perl’s sort function can be.

Now imagine that you need a list of distinct ages with person counts next to each age and with the ages sorted by that count, like this one:

  41 => 3
  40 => 2
  45 => 1

This result shows that 41 is the prevailing age in the group. Results such as these are useful when graphing the age distribution of a group. To obtain such a result set from the array in Listing 1, you could do this:

# Create a hash where the keys are the ages of people in @list and the
# values are the count of people in @list with the given age.
my %x;
for (map {$_->{age}} @list) {$x{$_}= $x{$_} ? $x{$_} + 1 : 1}

# Take the keys from the new hash, sort them by the people counts
# associated with those keys, then create strings showing the keys (ages)
# and their corresponding values (counts)
my @age_counts= map {"$_ => $x{$_}"} sort {$x{$b} <=> $x{$a}} keys %x;
print join("\n", @age_counts), "\n";

Think about how many lines of Java code, for example, would be required to implement the functionality of that Perl code. Yet that functionality is trivial to follow if you understand what the sort and map functions are doing.

In the same fashion, understanding a program with custom super general functions can be trivial if you first take the time to understand what the custom super general functions do. With that understanding, the rest of the program will seem short and simple.

By providing sort functionality in the form of a higher-order function, Perl has given you the ability to perform sorts of arbitrary complexity. You could sort a list of strange objects by some field first, then another, and then another. And you could make the sort function assume a default value when it encounters a field with a null value or an object that doesn’t even have the field you’re sorting on.
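
For example, here’s a sketch of a two-key sort of the Listing 1 records, by age first (ascending) and then by name (descending), where a record that lacks an age field sorts as if its age were 0:

# Sort by age, then by name; a missing age defaults to 0
my @sorted= sort {
  ((exists $a->{age} ? $a->{age} : 0) <=> (exists $b->{age} ? $b->{age} : 0))
    || ($b->{name} cmp $a->{name})
} @list;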

The flexibility that the sort function provides by virtue of being a higher-order function comes with the added benefit of allowing you to write concise, readable code. You can sort things in complex ways using a small fraction of the code that would be required in other languages and your code will also be easier to follow.

grep

Like the sort function, the grep function performs a simple task in a very general way that allows you to apply the function to a great variety of problems. The grep function is like the Lisp remove-if-not function. It provides the simplest mechanism for choosing some of the items in a list. Given a list and a test expression (in the form of some code), the grep function returns a new list that contains the elements from the first list that pass the given test.

Let’s take a look at an example. EAN stands for European Article Number. It’s like a UPC, the number that you see in bar codes on packaged cheese and DVD players. EANs are not so European any more, but rather an international standard. Any standard 12-digit UPC in the US can be represented as an EAN by prepending a 0 (zero) to the UPC. Most bar code readers these days understand EANs and see UPCs as EANs.

An ISBN (International Standard Book Number), which you’ll find on the cover of most books in print, is also an EAN. ISBNs start with the digits 978 or 979. So if you see an EAN that starts with those digits, you know that you’re looking at an ISBN.

If you needed to choose the ISBNs from a given list of EANs, you could accomplish the task with procedural code like this:

my @isbns;
foreach (@eans) {
  push @isbns, $_ if /^(978|979)/;
}

Those who like Perl like it in part because of how clearly and succinctly most tasks can be represented in code. The above code is a good example of this. But Perl allows you to improve on that greatly. I could use the grep function to perform the same task, like this:

my @isbns= grep {/^(978|979)/} @eans;

There are two really nice things about using the grep function. Typically, it is faster than the equivalent loop. And, the code is much shorter and clearer, which results in code that is easier to follow, to embed into or combine with other code, and to debug. The only requirement it places on the reader of your code is that the reader must understand what the grep function does.

Listing 2 shows a couple of more contrived examples of selecting specific EANs.

# Collect each EAN in the list where the sum of the EAN's last 2 
# digits is 10.
@list= grep {substr($_, -2, 1) + substr($_, -1, 1) == 10} @eans;

# Get a list of the EANs that are prime numbers
@list= grep {is_prime($_)} @eans;

# Quick and dirty is_prime function (tests odd divisors from 3 up to
# the square root of $n)
sub is_prime {
  my ($n, $a, $b)= (shift, 3, 1);
  $n == 2 ? 1 : $n < 2 || !($n % 2) ? 0 :
    do {while($a <= int(sqrt($n))) {unless($n % $a) {$b= 0; last} $a+= 2} $b}
}

As you can see, the grep function can make the task of coding filters feel almost like not coding at all. And the resulting code is clear, clearer than almost any custom code that could accomplish the same task.

reduce

The reduce function will serve as my last example of a super general function that is distributed with Perl (it’s in the List::Util Perl module).

Most people I’ve worked with have never used the reduce function, which is surprising because it applies to such a large and commonplace set of problems. I didn’t use it for a large portion of my career. The reduce function is useful any time you need to arrive at a single value given an array of values. For example, you can use the reduce function if you need to find the largest or smallest value in an array, or if you need to add all the values in an array.

Suppose you need to find the average of all the numbers in array @numbers. Here’s some procedural code to do that:

my $sum= 0;
foreach my $n (@numbers) {
  $sum+= $n
}
my $average= $sum / @numbers;

And here’s a solution using the reduce function:

my $average= (reduce {$a + $b} @numbers) / @numbers;

Don’t forget to include the following use line at the top of your file if you want to use the reduce function:

use List::Util qw/reduce/;

What if you wanted to find the largest value in the @numbers array?

$largest= reduce {$a > $b ? $a : $b} @numbers;

The reduce function, like the other super general functions I’ve described so far, tells the programmer reading your code exactly what you’re trying to do. And it does so at first glance. You’ll have shorter, faster, clearer code at once if the problem you’re tackling falls into the broad category that the reduce function can solve. And the same is true for many other super general functions accessible to Perl, especially those that are implemented in C.

Custom Super General Functions

Though using super general functions can definitely make your code significantly more succinct, it is writing super general functions that can shorten, simplify, and enhance the quality of your code the most. When a super general function contains a bug, that bug is more likely to surface than a bug that some lone loop in the program might contain. The loop will probably run once and will perform one specific task. The super general function, on the other hand, will likely have to run more than once and might perform slightly different functions during each run. Many different parts of the program might call on the super general function to perform slightly different tasks. Accordingly, in the end, the stress on the code in the super general function is higher and a program is less tolerant of faults in the super general function than of faults that exist in other parts of the program.

You might tend to reuse super general functions. Once you’ve written one, you’ll typically start viewing more problems in terms of that super general function. So you might end up using it in more than just one program. When you write a super general function, you’re solving a much wider range of problems than when you write a regular function, so you’re more likely to need the super general function again and again. This creates an effective natural pressure to make the super general function work correctly, without bugs.

Let’s create a program that demonstrates the usefulness of custom super general functions. We’ll start with a simple program that sends a file to an FTP server in a reliable fashion. Then we’ll add the ability to reliably retrieve and delete files from the FTP server. We’ll create put, get, and delete methods to perform those functions. Once all of that is working, we’re going to factor out the code that makes the put, get, and delete methods reliable. The factored-out code will go in a super general function that we’ll be able to use to make any operation more reliable.

Our example program is going to start as two files. We’re going to create a Perl module called ReliableFtp that will work much like the CPAN Net::FTP module works, except that it will retry an operation in a reasonable fashion when the operation fails. The other file is going to be ftp.pl, which will simply instantiate the ReliableFtp module and call its methods.

Later, we’ll change things around a bit and we’ll end up with a simple with_retries function that will allow us to perform any operation reliably. The put, get, and delete methods will become very simple when they start using the with_retries function.

Listing 3 shows the ftp.pl program.

#!/usr/bin/perl

use lib '.';
use ReliableFtp;

die "Usage: ftp.pl put filename.ext\n" unless @ARGV == 2;

# Fetch parameters
my ($function, $file)= @ARGV;

# Connect to FTP server
my $ftp= ReliableFtp->new(
  host => 'ftp.donnieknows.com',
  username => 'username',
  password => 'password',
  remote_dir => 'ftp-test');

# Execute the requested FTP function
print $ftp->$function($file) ? "Done!\n" : "Failed!\n";

The program accepts two command-line arguments: a function and a file name. If the program doesn’t see two arguments, it displays a usage message and exits.

The ReliableFtp module’s put method returns a true value upon success.

Listing 4 shows the first version of the ReliableFtp module.

package ReliableFtp;

# FTP capabilities from CPAN
use Net::FTP;

use strict;
use warnings;

# Constructor
sub new {
  # This constructor accepts the named parameters host, username, and
  # password.
  my ($proto, %params)= @_;
  my $class= ref($proto) || $proto;
  bless +{

    # Convert key/value pairs in %params into attributes of this class
    (
      map {$_ => exists $params{$_} ? $params{$_} : ''}
        qw/host username password remote_dir/
    ),

    ftp => ''

  } => $class
}

sub AUTOLOAD {
  # Perl calls this code whenever a program calls a method that
  # ReliableFtp doesn't define. The code checks the requested method
  # name against the list of methods that ReliableFtp supports and
  # dies with a helpful message when the method is invalid.
  my $self= shift;
  my $function= $1 if our $AUTOLOAD =~ /^.+\:\:(.+)$/;
  my %valid_function= map {$_ => 1}

    # This is a list of the methods that ReliableFtp supports
    qw/put/;

  unless($valid_function{$function}) {
    die "Invalid function $function. Supported functions: ",
      join(', ', keys %valid_function), ".\n"}

  $self->$function(@_);
}

sub DESTROY {}

sub connect {
  # This function returns a Net::FTP object if it is able to connect
  # to an FTP server. If there's a problem, the function returns a false
  # value.
  my ($self, $ftp, $error)= shift;

  # Reuse the existing connection if it's still alive
  return $self->{ftp} if $self->{ftp} && eval {$self->{ftp}->pwd};

  # Establish a new connection and remember it
  eval {
    $ftp= Net::FTP->new($self->{host});
    $self->{ftp}= $ftp if $ftp};

  # Check for any errors that might have occurred in the eval above
  if($@ || !$ftp) {
    $error= $@ || "Unknown reason.";
    print STDERR "Failed to connect to $self->{host}: $error\n";
    return 0}

  # Login
  unless($ftp->login($self->{username}, $self->{password})) {
    print STDERR "Failed to log in to $self->{host}: ", $ftp->message;
    return 0;
  }

  # Change to the default remote directory
  if($self->{remote_dir} && !$ftp->cwd($self->{remote_dir})) {
    print STDERR "Failed to change to remote directory $self->{remote_dir}: ",
      $ftp->message;
    return 0;
  }

  return $ftp;
}

sub put {
  # Accepts the absolute path to a file and uploads the file to the
  # current directory in the FTP host.
  my ($self, $file)= @_;
  my ($tries, $sleep_time, $sleep_factor, $status)= (3, 3, 10);
 TRY: while(1) {
    $status= eval {$self->connect->put($file)};
    last TRY if $status || $tries-- == 0;
    print STDERR "Retrying in $sleep_time seconds...\n";
    sleep $sleep_time; $sleep_time*= $sleep_factor;
  }
  return $status;
}

1;

The ReliableFtp module consists of a constructor, two methods, and the AUTOLOAD and DESTROY functions. The connect method connects to the FTP server, logs in, and changes to the default remote directory, all using values that you specify when you instantiate the ReliableFtp class with the new method.

The put method tries to send a file to the server (using NET::FTP’s put method). If the send fails, the method waits a little while and then tries to send the file again. Each time the send fails, the method waits a little longer before trying again. But the method gives up altogether after a few tries, returning a false value. If the send succeeds at any time, however, the method returns a true value.

The AUTOLOAD function ensures that any method the caller invokes actually exists. If it doesn’t, ReliableFtp exits with a message that includes a list of valid methods. (The empty DESTROY function is there so that object destruction doesn’t get routed to AUTOLOAD.)

At this stage, there’s very little functionality that you could factor out. But let’s add a get method to the ReliableFtp module. This get method should work much like the put method, trying and retrying to fetch a file, returning a true value when successful and a false value after a number of repeated failures.

We’ll add a delete method too, for good measure.

Listing 5 shows the additional code for the ReliableFtp module, to support the new get and delete methods.

sub get {
  # Accepts the name of a file and downloads the file from the current
  # directory in the FTP host to the current directory in the local
  # machine.
  my ($self, $file)= @_;
  my ($tries, $sleep_time, $sleep_factor, $status)= (3, 3, 10);
 TRY: while(1) {
    $status= eval {$self->connect->get($file)};
    last TRY if $status || $tries-- == 0;
    print STDERR "Retrying in $sleep_time seconds...\n";
    sleep $sleep_time; $sleep_time*= $sleep_factor;
  }
  return $status;
}

sub delete {
  # Accepts the name of a file and deletes the file from the current
  # directory in the FTP host.
  my ($self, $file)= @_;
  my ($tries, $sleep_time, $sleep_factor, $status)= (3, 3, 10);
 TRY: while(1) {
    $status= eval {$self->connect->delete($file)};
    last TRY if $status || $tries-- == 0;
    print STDERR "Retrying in $sleep_time seconds...\n";
    sleep $sleep_time; $sleep_time*= $sleep_factor;
  }
  return $status;
}

We’ll need to tell the AUTOLOAD function about these new methods, but doing so is trivial. Just locate the line with the list of methods that ReliableFtp supports, then add the new methods to that list. The modified line in the AUTOLOAD function should end up looking like this:

# This is a list of the methods that ReliableFtp supports
qw/put get delete/;

And that’s it! Our program is now able to reliably put, get, and delete files on a remote FTP server. It works just as advertised. If we had to add another method, we could simply copy the code of one of the existing methods, put, get, or delete, and then change the code slightly to implement the new method. Or, we could factor out some code. Let’s think about that.

In the ReliableFtp module, put, get, and delete all have calls to the connect method. This makes sense because otherwise we’d have to include the code from the connect method in each of put, get, and delete, and then the program would be significantly longer and harder to follow and it would just look silly. Instead, the code to connect to a remote server exists in a separate function.

Still, the put, get, and delete methods retain code that is quite similar among the three methods. The only difference in the code between the methods is that each method calls a different method of the Net::FTP object. The put method calls Net::FTP’s put method, the get method calls Net::FTP’s get method, and so on. There’s no other difference. To factor out the code in the put, get, and delete methods, you’d have to create a function that contains all the code of one of the methods, but that accepts a parameter indicating the specific function to run. The idea is that with the help of such a function (we’ll call it with_retries), the put method would look a lot simpler. Listing 6 provides an example.

Listing 6: put with the aid of the with_retries function

sub put {
  my ($self, $file)= @_;
  with_retries(sub {$self->connect->put($file)});
}

The put function in Listing 6 is only 4 lines long. That’s a tremendous improvement over the original, already succinct put function in Listing 4, which required 13 lines. But let’s take a closer look at the definition of put in Listing 6. Several subtle but interesting things happen in that tiny function. To start, the with_retries function is a super general function. Also, note that put passes a parameter-less function to with_retries, so that with_retries is able to execute the function like this: $function->(). The $self and $file variables resolve to values at the time the definition of the anonymous sub executes and the anonymous sub doesn’t look at any parameters (@_) that might be passed to it later, when it is called.

The definition of the with_retries function appears in Listing 7.

sub with_retries {
  my $function= shift;
  my ($tries, $sleep_time, $sleep_factor, $status)= (3, 3, 10);
 TRY: while(1) {
    $status= eval {$function->()};
    last TRY if $status || $tries-- == 0;
    print STDERR "Retrying in $sleep_time seconds...\n";
    sleep $sleep_time; $sleep_time*= $sleep_factor;
  }
  return $status;
}

The with_retries function looks a lot like the original put method from Listing 4, but with_retries accepts a function as a parameter and executes that function, returning true upon success.

If the with_retries function encounters a problem executing the function, with_retries simply waits a little while and then tries executing the function again. Repeated failures will eventually cause with_retries to fail by returning a false value.

The with_retries function requires two things from the function that you pass in. The function must be parameter-less. And, the function must return a true value upon success and a false value upon failure. With a little code, almost any function can be made to behave like that, so with_retries should work with an enormous variety of functions.
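
For example, a function that takes parameters and signals failure by dying can be wrapped in a parameter-less anonymous sub that returns a true value only if the wrapped call survives (copy_file here is a made-up function, for illustration):

# copy_file is hypothetical; imagine that it dies when the copy fails
my $status= with_retries(sub {
  copy_file($source, $destination);
  1;  # reached only if copy_file didn't die
});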

Now that we have the with_retries method, we can easily rewrite the put, get, and delete methods. If we need to add other methods in the future, we can do so with minimal code (probably 4 lines) and a reduced chance of introducing a bug.

Listing 8 shows the complete code for the ReliableFtp module.

package ReliableFtp;

# FTP capabilities from CPAN
use Net::FTP;

use strict;
use warnings;

# Constructor
sub new {
  # This constructor accepts the named parameters host, username, and
  # password.
  my ($proto, %params)= @_;
  my $class= ref($proto) || $proto;
  bless +{

    # Convert key/value pairs in %params into attributes of this class
    (
      map {$_ => exists $params{$_} ? $params{$_} : ''}
        qw/host username password remote_dir/
    ),

    ftp => ''

  } => $class
}

sub AUTOLOAD {
  # Perl calls this code whenever a program calls a method that
  # ReliableFtp doesn't define. The code checks the requested method
  # name against the list of methods that ReliableFtp supports and
  # dies with a helpful message when the method is invalid.
  my $self= shift;
  my $function= $1 if our $AUTOLOAD =~ /^.+\:\:(.+)$/;
  my %valid_function= map {$_ => 1}

    # This is a list of the methods that ReliableFtp supports
    qw/put get delete/;

  unless($valid_function{$function}) {
    die "Invalid function $function. Supported functions: ",
      join(', ', sort keys %valid_function), ".\n"}

  $self->$function(@_);
}

sub DESTROY {}

sub connect {
  # This function returns a Net::FTP object if it is able to connect
  # to an FTP server. If there's a problem, the function returns a false
  # value.
  my ($self, $ftp, $error)= shift;

  # Reuse the existing connection if it's still alive
  return $self->{ftp} if $self->{ftp} && eval {$self->{ftp}->pwd};

  # Establish a new connection and remember it
  eval {
    $ftp= Net::FTP->new($self->{host});
    $self->{ftp}= $ftp if $ftp};

  # Check for any errors that might have occurred in the eval above
  if($@ || !$ftp) {
    $error= $@ || "Unknown reason.";
    print STDERR "Failed to connect to $self->{host}: $error\n";
    return 0}

  # Login
  unless($ftp->login($self->{username}, $self->{password})) {
    print STDERR "Failed to log in to $self->{host}: ", $ftp->message;
    return 0;
  }

  # Change to the default remote directory
  if($self->{remote_dir} && !$ftp->cwd($self->{remote_dir})) {
    print STDERR "Failed to change to remote directory $self->{remote_dir}: ",
      $ftp->message;
    return 0;
  }

  return $ftp;
}

sub with_retries {
  my $function= shift;
  my ($tries, $sleep_time, $sleep_factor, $status)= (3, 3, 10);
 TRY: while(1) {
    $status= eval {$function->()};
    last TRY if $status || $tries-- == 0;
    print STDERR "Retrying in $sleep_time seconds...\n";
    sleep $sleep_time; $sleep_time*= $sleep_factor;
  }
  return $status;
}

sub put {
  # Accepts the absolute path to a file and uploads the file to the
  # current directory in the FTP host.
  my ($self, $file)= @_;
  with_retries(sub {$self->connect->put($file)});
}

sub get {
  # Accepts the name of a file and downloads the file from the current
  # directory in the FTP host to the current directory in the local
  # machine.
  my ($self, $file)= @_;
  with_retries(sub {$self->connect->get($file)});
}

sub delete {
  # Accepts the name of a file and deletes the file from the current
  # directory in the FTP host.
  my ($self, $file)= @_;
  with_retries(sub {$self->connect->delete($file)});
}

1;

The natural thing to do now might be to move the with_retries function to some kind of utility module, where it could co-exist with other super general functions.
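
Such a module might look like this (DcUtils is a made-up name for this sketch):

package DcUtils;

use strict;
use warnings;
use Exporter 'import';

our @EXPORT_OK= qw/with_retries/;

sub with_retries {
  # ... exactly as defined above ...
}

1;

Then ReliableFtp, and any other module, could simply say:

use DcUtils qw/with_retries/;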

Languages such as Perl, with full support for first-class functions and lexical closures, make the sort of factoring work described here much easier than it is in other, more rigid languages. These language features can help you to build truly elegant programs and to avoid the typical boilerplate code that you see in less flexible programming languages.

Conclusion

It can take a little while to become accustomed to thinking in terms of super general functions and writing programs in a functional style. But once you start doing so, you begin to simplify otherwise complex tasks. And you start to feel like you’re capable of tackling more complex problems more elegantly, with less code that is more readable and reliable.