Monday, August 9, 2010

Perl light-fast basic tutorial

Got a good Perl job and needed to brush up on some Perl. Based on "A Quick, Painless Introduction to the Perl Scripting Language" with my own additional contributions and corrections. There's much more to Perl of course; this is meant either to whet your appetite or to help you survive. It'll grow in time.

If you are curious Perl stands for Pathologically Eclectic Rubbish Lister or more seriously Practical Extraction and Report Language.

An IDE for Perl


This is optional: you shouldn't be scared by the command line if you are interested in Perl, but you may want to go graphical anyway, especially for debugging. I recommend Padre (Perl Application Development and Refactoring Environment), whose name is also an Italian word meaning father :)

How to install? First check if there's an up-to-date working package for your Linux distro. If not, this should work on any Linux version:

# cd /usr/local
# wget http://padre-perl-ide.googlecode.com/files/perl-5.11.4-xl-0.03.tar.bz2
# tar jxf perl-5.11.4-xl-0.03.tar.bz2
# rm perl-5.11.4-xl-0.03.tar.bz2

$ /usr/local/perl-5.11.4-xl-0.03/perl/bin/padre.sh

Prerequisites are: libpng12 and wxgtk. Without these installed Padre won't run.

File template, execution


Compilation check (syntax only):

$ perl -c file

-w enables many useful warnings e.g. for using variables not previously defined
use strict among other things requires variables to be declared
-n assumes "while (<>) { ... }" loop around program, useful for replacing Sed with Perl. It means "do this code for every line of input".
-Mmodule list same as "use Module LIST"
-e one line of program; since the semicolon is a statement separator, not a terminator it can be omitted after last command
See perlrun(1) for other options.

Help on a built-in function, e.g. push:

$ perldoc -f push

Help on a module, e.g.:

$ perldoc Test::More   # or: man Test::More

Help on predefined variables:

$ perldoc perlvar

Search into the FAQ:

$ perldoc -q package

OO tutorial:

$ perldoc perltoot

#!/usr/bin/perl -w
use strict;

perl file.pl arg1 arg2 ...

$ARGV[0], $ARGV[1], ...

Getting help on builtin functions:

$ man perlfunc

Data types and variables


variable type                                                            prefix
filehandle                                                               (none)
scalar (integer, floating-point number, string or reference)             $
array                                                                    @
hash aka associative array                                               %
subroutine (DEPRECATED in calls, use only with subroutine references)    &

In Perl, the following evaluate to false in conditionals:

0
'0'
undef
''  # Empty string
()  # Empty list
('')

The rest are true. There are no barewords for true or false.
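
A small sketch to check the rule above by hand. The list of candidate values is my own choice for illustration; note that strings such as '0.0' or '00' are true even though they are numerically zero.

```perl
use strict;
use warnings;

# Each candidate value is tested in boolean context.
my @candidates = (0, '0', '', undef, '0.0', '00', ' ', 'false');
my @report;
for my $value (@candidates) {
    my $label = defined $value ? "'$value'" : 'undef';
    push @report, sprintf("%-8s is %s", $label, $value ? 'true' : 'false');
}
print "$_\n" for @report;
```

The first four values are reported as false, the last four as true.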

# Declaration - optional
$var;  # Note this is a global variable, even if this statement appears in a subroutine

$var = value;

$var ||= default value;

# variable local to subroutine or block
my $var;
$var = value;
# simpler
my $var = value;

# Difference between "my" and "local".
# "my" declares a lexical variable: it lives only within the block (or file)
# where it was declared and is NOT visible in subroutines called from that block.
# "local" does not declare a new variable: it temporarily gives a new value to a
# global (package) variable. The new value is visible within the block AND in
# the subroutines called from it; the old value is restored when the block exits.

# Without "use strict", p() below prints "l=l" and an empty "m=":
# the $m it sees is the (unset) global $m, not the lexical one.
{
  local $l = 'l';
  my $m = 'm';
  p();
}

sub p {
  print "l=$l\nm=$m";
}
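
Here is a minimal runnable sketch of the same idea that also works under "use strict". The names show and @seen are mine, chosen only for illustration; "our" is used to declare the global that "local" then overrides.

```perl
use strict;
use warnings;

our $l = 'global l';   # package (global) variable
my  $m = 'file m';     # lexical variable, visible to the sub defined below

sub show { return "l=$l m=$m" }

my @seen;
{
    local $l = 'localized l';   # temporary value, seen by called subs
    my $m = 'inner m';          # new lexical, invisible to show()
    push @seen, show();         # l=localized l m=file m
}
push @seen, show();             # l=global l m=file m
print "$_\n" for @seen;
```

Inside the block, show() sees the localized $l but still the outer $m; after the block the original value of $l is back.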

my $x;
my $y;
# simpler
my ($x,$y);

if (defined($var)) { ... }

# the ‘.’ is used to concatenate strings
$s .= " thousand";

# break $line into an array of tokens separated by " ", using split()
# (array names must begin with @)
@words_on_this_line = split(" ",$line);

$i++;

print "some string ",$var," another string\n"; # you can give print a list of values
print "some string $var another string\n"; # or just use interpolation of scalars into strings
print "some string ${var}another string\n";

# case-shifting backslash operators
$n = 'antonio';
print "\u$n"; # Antonio
print "\U$n"; # ANTONIO
print "\U$n\E is cool"; # ANTONIO is cool
$n = 'ANTONIO';
print "\u\L$n"; # Antonio

Built-in variables


$^O Operating system Perl was built on.

Conditionals

condition   numbers   strings
=           ==        eq
≠           !=        ne
<           <         lt
≤           <=        le
>           >         gt
≥           >=        ge
# test equality of numbers
if ($x == 5) ...

# for strings
if ($s eq "five") ...

Arrays and lists

Perl arrays can grow dynamically, and some elements can remain unset. Indexing starts from 0. An array can be assigned a list, which is like an array without a name.

@a = (1,(0,"3"),4);
is the same as
@a = (1,0,"3",4);
since array elements can only be scalars

$a[1] =  2;

# Negative indexes select the element from the end
print $a[-1];

Treat an array as a queue data structure:

$x = shift @a; # "output" of shift is the element shifted out (1)
push @a,1; # returns the new number of elements in the array

... or as a stack:

$x = pop @a; # pops and returns the last value of the array
push(@a,$x); # @a is back to what it was before the pop

print @a, "\n"; # prints all the array elements with no space
print "@a\n"; # prints all the array elements with a space separator due to quotes
print "$_\n" for @a; # prints all the array elements each on a line

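# The length of an array is obtained by calling scalar() or by
# simply using the array name in a scalar context.
$k = scalar(@a);
$k = @a;
# This does NOT work for a list: in scalar context the comma operator
# returns its last operand, not the number of elements.
$k = (4,5,6); # $k = 6, not 3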

# empty an array
@a = ();

delete $a[2]; # $a[2] now = undef

# list assignment: $x gets the first element, same as $x=$a[0]; @a is unchanged
($x) = @a;

# array slicing, access a subset of an array (note the @ sigil on the slice)
@b = @a[1..2];
@b = @a[0,2..3]; # can mix "," and ".."
@a[0,2] = @a[2,0]; # swaps elements 0 and 2

# how to get the last element of an array into a scalar variable:
$s = $a[-1]; # or $a[$#a]; note that "$s = @a" would give the number of elements
# and the first element:
($s) = @a;
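
A short sketch that puts counting, indexing and slicing side by side. The sample values and variable names are arbitrary.

```perl
use strict;
use warnings;

my @a = (10, 20, 30, 40);

my $count   = @a;          # array in scalar context: number of elements
my $last    = $a[-1];      # last element
my $top     = $#a;         # index of the last element
my ($first) = @a;          # list assignment: first element
my @middle  = @a[1..2];    # slice: note the @ sigil
my $list_count = () = (7, 8, 9);   # idiom to count the elements of a list

print "$count $last $top $first @middle $list_count\n";
```

This prints "4 40 3 10 20 30 3".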

The built-in special 'quote word' function can be used to save typing quotes when defining a list of constant words. You can use any non-alphanumeric, non-whitespace delimiter to surround the qw() string argument and your list gets constructed at compile time. You'll often see qw//, for example.

@p = ('Martin',"O'Toole",'Duffy');
@p = qw(Martin O'Toole Duffy);

Note that quoting is not mandatory for single words, but it's good practice to avoid clashes with present and future reserved words.

Here's an array of array references (a.k.a. bidimensional array or matrix). Note that generally an array can contain references to arrays of differing sizes!

my @sator_square = (
  ['S','A','T','O','R'],
  ['A','R','E','P','O'],
  ['T','E','N','E','T'],
  ['O','P','E','R','A'],
  ['R','O','T','A','S']
);

Here's how to avoid quoting:

my @sator_square = (
  [qw/S A T O R/],
  [qw/A R E P O/],
  [qw/T E N E T/],
  [qw/O P E R A/],
  [qw/R O T A S/]
);

The element 'E' in the middle of the second row can be indexed as $sator_square[1]->[2] or $sator_square[1][2].

Hashes


$h{"key"} = "value";
%h = (name => "Antonio",
      age => 33);

print %h, "\n"; # the output will be: nameAntonioage33 (or age33nameAntonio, the order is not guaranteed)

print keys %h; # nameage
print values %h; # Antonio33

# testing for key existence, whatever the value is
if (exists $h{"key"}) { ... }

# looping over a hash
keys %h; # Reset the internal iterator so a prior each() doesn't affect the loop
while (my($key,$value) = each %h) { ... }

# sorting a hash by the hash key
print "h{$_}=$h{$_}\n" foreach (sort keys %h);

# sorting by the hash values
# "<=>" is a binary op which returns -1, 0, or 1 depending on whether the left
# argument is numerically less than, equal to, or greater than the right argument.
# It is commonly known with the funny name of spaceship operator.
# There's a version of it for comparing strings as well: the infix cmp operator.
print "$_ is the key of $h{$_}\n" foreach (sort {$h{$b} <=> $h{$a}} keys %h);

# Constant hash keys are strings and should be single or double quoted. You could leave off
# quotes when your key does not contain hyphens, spaces, or other special characters.
# But for consistency and to enable syntax highlighting it is better to always quote hash keys.
$h{hash-key} = 'value'; # no quotes: this will set the '0' key to 'value', not what you wanted!

Another example: converting an array into a hash for the purpose of de-duplicating:

my @list = (1, 2, 1, 3, 2, 5, 7);
my %list;
$list{$_}++ for @list;  # we count duplicates
print "$_ " for keys %list;  # e.g. 1 3 7 2 5 (the order of keys is arbitrary)

The only problem with this code is that original order is lost.
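
If the original order matters, one common approach is to filter with grep while recording values already seen in a hash. This is a sketch of that idiom; the names %seen and @unique are my own.

```perl
use strict;
use warnings;

my @list = (1, 2, 1, 3, 2, 5, 7);
my %seen;
# $seen{$_}++ is 0 (false) the first time a value appears, so grep keeps
# only first occurrences, in their original order.
my @unique = grep { !$seen{$_}++ } @list;
print "@unique\n";   # 1 2 3 5 7
```

As a side effect, %seen ends up holding the number of occurrences of each value, just like %list in the example above.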

Selection, Loops


Braces can NOT be omitted in single-statement blocks.

if (...) {...;}
elsif (...) {...;}
else { ...;}

... if ...;
... unless ...; # same as "... if not ...;"

# case or switch statement, available since Perl 5.10 (enable it with: use feature 'switch';)
given (...) {
  when (...) { ...; }
  when (...) { ...; }
  ...
  default { ...; } # optional fallback
}

if (! cond) { command; }
# can be shortened to:
command unless cond;

while (...) {
  ...
  # break out of the loop if the specified condition holds
  last if ...;
  ...
}

do {
  ...
  # last cannot be used here.
} while (...);

# Old labels and gotos are supported by Perl. They are considered harmful :)
# Here is how to implement a repeat-until loop with a label:
LOOP_START:
  ...
  goto LOOP_START unless ...;

# To "break out" of an if block (without a goto) just add another set of braces
# and use last. This is because any block (except for the pseudoblocks which
# are part of if, else, do etc.) is treated as a loop which loops once
if (...) {{
  ...
  last if ...;
  ...
}}

# C-style for loops
for ($count = 0; $count < 10; $count++) {
  print "$count ";
}

# loop over a list or array
for $i ((1,2,3)) { ... }
for $i (@a)  { ... }
for (@a) { ... $_ ... }
foreach (@a) { ... $_ ... }

String functions

chop $line; Removes and returns the last character of $line (default variable: $_). It's actually a function. It is typically used to remove an end-of-line character, but chomp is safer for that purpose because it removes the last character only if it is a newline.

Exceptions

To throw an exception (e.g. in a subroutine) just use die
sub my_method {
  ...
  # In case of errors.
  die 'Error message here' if condition;
  ...
  # normal return value
  return ...;
}
The eval() function takes the place of the try block. Catching exceptions is done by checking the predefined variable $@ which contains the error message from the last eval() operator or the null string if all went well.
eval { ... $obj->my_method(...) ... };
... if $@;
A common idiom is to catch the "Can't locate Module::Name in @INC..." error and do nothing if a module is not available rather than allow the program to fail:
eval 'use Module::Name';
if (!$@) {
   ...
   # code that uses the optional module goes here
   ...
}
Note how we must eval a string, not a block, so that the use statement is not executed (and the error raised) by Perl at compile time.
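
The die/eval pattern described in this section can be tried with the following self-contained sketch. The safe_divide subroutine is a hypothetical example of mine; the "eval { ...; 1 }" form is a widespread convention that tells success from failure without relying on $@ alone.

```perl
use strict;
use warnings;

# Hypothetical helper: throws an exception on division by zero.
sub safe_divide {
    my ($num, $den) = @_;
    die "Division by zero\n" if $den == 0;
    return $num / $den;
}

my @log;
for my $den (2, 0) {
    # eval returns the block's last value, or undef if the block died;
    # the trailing 1 guarantees a true value on success.
    my $result;
    my $ok = eval { $result = safe_divide(10, $den); 1 };
    if ($ok) {
        push @log, "10/$den = $result";
    }
    else {
        my $error = $@;   # copy $@ early, before anything can reset it
        chomp $error;
        push @log, "error: $error";
    }
}
print "$_\n" for @log;
```

The first iteration logs "10/2 = 5", the second one "error: Division by zero".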

References

$i = 3;
$r = \$i; # as r = &i in C
# $r is a reference to a scalar, so $$r denotes that scalar
print $$r; # 3
$i = 6;
print $$r; # 6
print ref $r; # 'ref' operator tells you what type of reference your variable is (SCALAR)

# A constant can be referenced; this is not allowed with pointers in C:
$r = \3;

$r = \@a;
print $$r[0]; # first element
print scalar(@$r); # number of elements
print ref($r); # ARRAY
Anonymous data is allocated to the heap and referred using references but with the -> operator:
$a = [1,2,3]; # Anonymous arrays use brackets instead of parentheses
print $a->[1]; # prints 2

$h = {name => "Antonio", age => 33}; # Anonymous hashes use braces
print $h->{age}; # prints 33. $$h{age} will also work
print $h; # prints something like HASH(0x98b9c20)

Files

open(INFILEHANDLE, '<', "filename") or die "Cannot open filename: $!";

# <> construct means read one line; undefined response (undef) signals EOF
while ($line = <INFILEHANDLE>) { ... }

# Reads a line from STDIN (i.e. keyboard).
$line = <>; # short for <STDIN> when no file names are given on the command line
# get rid of the newline; note the assignment returns $line itself (an lvalue),
# so chop operates on the variable just assigned
chop($line = <>);

# cat(1) in Perl. Each line read is saved into the default implicit variable
# $_ which is the default for many function arguments, print included
while (<>) { print }

# List all files in current directory. Same output as 'ls -1'.
print "$_\n" for (<*>);
From perlretut: ...shift saves the first command line argument as the regexp to be used, leaving the rest of the command line arguments to be treated as files. while (<>) loops over all the lines in all the files. For each line, print if /$regexp/; prints the line if the regexp matches the line. In this line, both print and /$regexp/ use the default variable $_ implicitly...
#!/usr/bin/perl
# Syntax: grep pattern file

$regexp = shift;
while (<>) {
  print if /$regexp/;
}

Subroutines

Perhaps unfortunately Perl uses call by reference by default. If you want to pass-by-value it is your responsibility to make a copy.
sub var_in {
    my $v = $_[0];  # $v is 5

    $v += 1;  # $v is now 6
}

sub var_inout {
    # $_[0] is 5
    $_[0] += 1;  # $_[0] is now 6
}

my $a=5;
var_in($a);  # $a is unchanged
var_inout($a); # $a is now 6!
sub array_in {
    my @v = @_; # @v is (1,2,3)

    $v[0] += 1;  # @v is now (2,2,3)
}

sub array_inout {
    # @_ is (1,2,3)
    $_[0] += 1;  # @_ is now (2,2,3)
}

my @a=(1,2,3);
array_in(@a);  # @a is unchanged
array_inout(@a); # @a is now (2,2,3) as well!
Since it is quite difficult to tell, looking only at the calls above, whether a certain parameter could be modified by a subroutine call, it is better to explicitly pass a reference when pass-by-reference is wanted. This makes clear that the value could be modified and that a variable should be passed, not a constant. Prepending a variable with a backslash (\) creates a reference.
# Better version, we don't use $_[0] directly.
sub var_inout {
    my $v = $_[0];  # $v is a reference to a variable

    $$v += 1;  # referenced value is now 6
}

my $a=5;
# More readable call.
var_inout(\$a);  # $a is now 6 as intended.
A common idiom to switch from the default call-by-ref to a safer call-by-value is to copy values from the @_ array which holds the arguments passed to a subroutine. Indexing this array is rarely used. Shifting and multiple variable assignments are the preferred ways.
sub routine {
  # @_ is the argument array
  # $_[0], $_[1], ... are the arguments
  # But it is discouraged to use them directly to avoid possible unwanted
  # changes to the caller's variables.

  $f = shift @_; # get first argument; @_ is the default and can be omitted
  $s = shift @_; # get second argument

  # Alternative to the two shifts above: list assignment.
  # Suitable when the number of arguments is fixed and known.
  ($f,$s) = @_;

  # The return value from a subroutine can also be a list.
  return ($s,$f);
  # Any subroutine will return the last value computed. So the return keyword can be omitted:
  ($s,$f);
}

# If any of the arguments in a call are arrays,
# all of the arguments are flattened into one huge array, @_
sub f {
  for my $i (@_) {
    print "$i ";
  }
}

$n = 0;
@x = (1,2,3);
f($n,@x,4);  # output: 0 1 2 3 4 

# parentheses enclosing arguments are optional, that is subroutines
# aka functions can be considered as operators or commands.
f $n,@x,4;

# This may not be what you wanted:
f fp1, g gp1, gp2;
# To make the last arg belong to subroutine f, round brackets must be used.
f fp1, g(gp1), fp2;

# also when calling subroutines with no arguments, as long as they are defined
# before so that Perl knows f is a subroutine and not the bareword "f"
f;

Modules and OOP

Perl OOP implementation was added later to the language, trying to fit it in existing Perl constructs as modules. See perlmod(1).
# File X.pm defining class X.
# Namespaces can also be nested, e.g. Project::Module::Class corresponds to the relative path
# Project/Module/Class.pm or Project/Module/Class/somefile.pm. You can add a new directory
# to the Perl search path by using the -I option when invoking Perl from the command line.
package X;

# Here's how to inherit from another class.
use base 'Package::Class';

# Class variables, if any, are stored in freestanding variables in the package file.
@array = (); # E.g. an empty array.
$scalar = undef; # An undefined scalar, for example.

# Constructor, class and instance methods are simply subroutines in the package file.

# Constructor. It's just like any other class method. You can give
# any name to it. It is common to name it new(), though.
sub new {
  # Note the repeated use of "my" to keep things local.

  # Here you can change/initialize class variables like @array, $scalar, if needed.

  # Class name (X) is first argument.
  my ($classname,$f1,$f2,...) = @_; # Get the other arguments aka fields/properties.

  # Set up an object of class X that is simply an anonymous hash and point a reference
  # variable to it. The instance variables of the class will be the elements in this hash.
  my $r = {var1 => $f1, var2 => $f2};

  # Perform a bless operation to associate the hash with this class name,
  # i.e. this package file and return the reference.
  bless($r,$classname);
  return $r; # Optional since the "bless" operation returns the same reference.
}

# If a method is invoked on the class, $_[0] or first
# shift result, will point to the class, that is will be the classname "X".
sub class_method {
  ... $_[0] ...
}

# If a method is invoked on the object, $_[0] will
# point to the object that is an anonymous hash.
sub instance_method {
  my $r = shift; # object is first argument, corresponding to this in C++
  ... $r->{objvar} ....

  # $r equates to the this pointer in C++ and can be used
  # to call other instance methods from inside this method.
  ... $r->other_instance_method(...) ...

  # unlike C++ $r can also be used to call static methods (see note below)
  ... $r->class_method(...) ...
}

# Any package which contains subroutines must return a value.
1;  # Therefore you must include dummy return value as last line
Note there's no difference in the declaration of class and object methods. As far as Perl is concerned the same method can serve (i.e. be invoked) as either a class method or an instance method!
# Load module X at compile time and import the symbols it exports, if any,
# into the current package. Note that this does not import any methods, so
# the calling namespace and library namespace will remain cleanly separated.
use X;

# $X::scalar, @X::array

# inside new @_ will consist of X and the actual arguments "..."
$x = X->new(...);
print ref $x; # X

# ditto
X->class_method(...);

# this call would set @_ inside instance_method to consist of
# a reference to the object (i.e. $x) and then the other arguments
$x->instance_method(...);

# Also allowed, but may not work depending on method implementation
$x->class_method(...);
X->instance_method(...);
Instance methods can be called with the -> operator only on blessed references because without the bless operation, the interpreter would not know the type of the referred object.
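
To see the pieces working together, here is a minimal runnable sketch that keeps the class and the calling code in a single file. The class name Counter, its fields and its methods are hypothetical, invented only for this illustration.

```perl
use strict;
use warnings;

package Counter;    # hypothetical class name

my $instances = 0;  # class-level data shared by all objects

sub new {                    # constructor
    my ($class, %args) = @_;
    my $self = { count => $args{start} || 0 };
    $instances++;
    return bless $self, $class;
}

sub increment {              # instance method
    my $self = shift;
    return ++$self->{count};
}

sub instances { return $instances }   # class method

package main;

my $c = Counter->new(start => 5);
$c->increment;
$c->increment;
print ref($c), " count=$c->{count} instances=", Counter->instances, "\n";
```

This prints "Counter count=7 instances=1". In a real project the Counter package would normally live in its own Counter.pm file, end with "1;" and be loaded with "use Counter;".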

Regular expressions, pattern matching and grepping

# a pattern matching condition. Can be used whenever
# a conditional can appear, e.g. in if, while, etc.
... $str =~ /.../ ...

# match and substitute as a side effect of condition evaluation
... $str =~ s/.../.../ ...

# [charlist] in a regexp matches any of the characters listed
@oarray = grep { expression } @iarray;
# equivalent shorter syntax, but less flexible and not as easy to read
@oarray = grep expression, @iarray;
is equivalent to
@oarray = ();
foreach (@iarray) {
  push @oarray, $_ if expression;
}
E.g.
@text = grep { !/^$/ } @text; # weed out empty lines
@teens = grep { 13 <= $_ && $_ <= 19 } @ages; # filter out non-teenagers
@array = grep defined, @array; # filter out all the undefined elements from array
@array = grep { defined } @array; # ditto; "defined" is short for "defined $_"
In a scalar context, you can get a count of how many items match the condition, e.g.:
$empty_lines = grep /^$/, @text;

# true if @items does not contain any undef elements
# an empty array yields true as well
if (!grep !defined, @items) { ... }

Subroutine references

sub x {
  print "this is x\n";
}

sub y {
  print "this is y\n";
}

sub w {
  $r = shift;
  &$r();
}

w \&x;  # prints "this is x"
w \&y;  # prints "this is y"
w sub { print "this is anonymous\n"; }

Debugging

Invoking the built-in Perl debugger:
$ perl -d file.pl
Type the "h" command for help on debugger commands. You can also use the debugger as an interactive console for Perl. Just use a trivial program, like so:
$ perl -de 1
or
$ perl -d -e 0
etc. The value "1" or "0" doesn't matter in this case, it's just a valid statement that does nothing. Here for example you can find out which module files have been loaded (%INC; the include paths themselves are in @INC):
main::(-e:1):   1
  DB<1> x values %INC 
0  '/usr/lib/perl5/5.8.5/warnings/register.pm'
1  '/usr/lib/perl5/5.8.5/attributes.pm'
2  '/usr/lib/perl5/5.8.5/i386-linux-thread-multi/XSLoader.pm'
3  '/usr/lib/perl5/5.8.5/i386-linux-thread-multi/IO/Handle.pm'
4  '/usr/lib/perl5/5.8.5/Term/Cap.pm'
5  '/usr/lib/perl5/5.8.5/SelectSaver.pm'
6  '/usr/lib/perl5/5.8.5/warnings.pm'
7  '/usr/lib/perl5/5.8.5/Carp/Heavy.pm'
8  '/usr/lib/perl5/5.8.5/i386-linux-thread-multi/Config.pm'
9  '/usr/lib/perl5/5.8.5/Symbol.pm'
10  '/usr/lib/perl5/5.8.5/i386-linux-thread-multi/IO.pm'
11  '/usr/lib/perl5/5.8.5/Carp.pm'
12  '/usr/lib/perl5/5.8.5/Term/ReadLine.pm'
13  '/usr/lib/perl5/5.8.5/strict.pm'
14  '/usr/lib/perl5/5.8.5/Exporter.pm'
15  '/usr/lib/perl5/5.8.5/vars.pm'
16  '/usr/lib/perl5/5.8.5/perl5db.pl'
You can also inspect data structures from programs themselves using the data dumper. E.g.:
$ perl
@a=('Antonio',33,1.75);
use Data::Dumper;
print Dumper \@a;
$VAR1 = [
          'Antonio',
          33,
          '1.75'
        ];

Installing new modules

First check with your Unix distribution as it may provide ready-made packages for the most common modules. If a package exists it would be preferable to manual installation.
# cpan ...
# cpan upgrade ...
or:
# cpan
cpan> install ...
or:
cpan> force install ...
To avoid answering a lot of "yes":
cpan> o conf prerequisites_policy follow
cpan> o conf commit
Install a specific version of a module, e.g. XML::LibXSLT though tests fail:
cpan> force install PAJAS/XML-LibXSLT-1.62.tar.gz
Have a look at http://www.cpan.org to search for module names.

Uninstall a module

Change to the directory where a Module is kept, e.g.:
# cd /usr/lib/perl5/site_perl/5.8.5/i386-linux-thread-multi/Module
# rm -fr Module Module.pm

POD

Plain Old Documentation (POD) is a simple markup language to document your code that is ignored by the Perl parser.

You can either mix POD with code or put the POD at the beginning or end of the file. In the latter case you can use a __END__ marker to signal the interpreter that there is no more code to parse and help syntax highlighting in some editors as well.

Paragraphs are separated by an empty line. A paragraph is:

verbatim
    if it begins with whitespace; it is left completely unformatted
ordinary
    if it starts with something other than an = or whitespace
a command paragraph
    if it begins with an equal sign, followed by the name of a pod directive that usually formats only the rest of the paragraph

Pod directives must come at the beginning of a line, and all begin with an equal sign.

=cut
signals the end of a POD section

=headX text...
heading of level X, X=1..4

Bulleted and numbered lists definitions start with =over followed by a number of indent spaces and end with =back:

=over 4

=item * First item of a bulleted list.

=item * Second item.

=back


=over 12

=item 1 First item of a numbered list.

=item 2 Second item.

=back

There are also commands to mark sections as being in another language. The latter feature allows for special formatting to be given to parsers that support it. E.g. to embed an HTML document:

=begin html

<html>
<body>

<h1>Heading</h1>

<p>Text paragraph.</p>

</body>
</html>

=end html

Formatting codes are recognized in both ordinary and command paragraphs and follow the format X<text...>, where X is a single capital letter. If the text itself contains a >, use doubled delimiters with inner spaces, as in C<< text... >>, or tripled ones, C<<< text... >>>, if it contains >>, etc. The most common codes are B for bold, I for italic, C for monospaced (code) and L for linking to a manpage, to a specific section within a manpage or to the same document.

Conversion filters are available on the command line: pod2html, pod2latex, pod2man and pod2text. The full official documentation is in perlpod(1).

Lambda functions and closures

These are features taken from functional languages such as LISP. Lambda functions have no name. They can be called as soon as they are created:

(sub { ... })->();
assigned into a variable:

my $callback = sub { ... };
$callback->();
returned from other functions:

sub named_subroutine {
    ...
    return sub { ... };
}
my $callback = named_subroutine();
$callback->();
or passed as a parameter - see subroutine references.

What are they useful for? Mainly for implementing callbacks.

If we had no lambda functions or references in Perl we could use strings and the eval function. But callback code will need to be compiled at runtime, slowing your program and without providing you with syntax checking before program execution, thus making debugging harder.

my $callback = '...';
...
# When written this way, exception trap is atomic and thus thread-safe.
die("The custom code died with error: $@") if !defined eval "{ $callback }" && $@;

Anyway, plain lambda functions are nothing new compared to the subroutine references which Perl also provides: you can use those as callbacks too, although you have to dereference them explicitly:

sub callback { ... };

my $c = \&callback;
&$c();

Closures are something more powerful: they are lambda functions that reference variables from the scopes enclosing (*) the one in which they were declared. These variables are preserved until calling time, even if the scopes that contained them no longer exist. For example, this allows a callback set up in a subroutine to access some of its parent subroutine's local data even though the subroutine will have returned by the time the callback is actually executed. Of course, when there are no more references to a certain scope, the Perl garbage collector will delete it, freeing precious memory space.

(*) When you enter a new function, a new lexical scope is created. Each scope is something like an association table containing the local variables (name and value) and links back to its containing/outer/parent scope. When code needs to lookup an identifier, it will search that lexical environment, by iterating up that chain until it finds it.
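
A classic way to see a closure at work is a counter factory. The sketch below is only an illustration; make_counter and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical factory: each call creates a fresh $count that only the
# returned anonymous sub can see and modify.
sub make_counter {
    my $count = shift || 0;
    return sub { return $count++ };
}

my $c1 = make_counter(10);
my $c2 = make_counter();

# $c1 and $c2 keep independent copies of $count, even though
# make_counter() has already returned.
my @values = ($c1->(), $c1->(), $c2->(), $c1->());
print "@values\n";   # 10 11 0 12
```

Each closure keeps its own $count alive for as long as the closure itself is referenced.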
