Perl Tutorial
Charles D. Cavanaugh, Ph.D.
2012-03-14

Contents
· Basic syntax and semantics
· Searching and replacing text
· Summarizing data
· Working with CSV files 
· Accessing databases
· Executing system commands

I. Basic syntax and semantics
    A. Useful perldoc sections
        1. perl & perlintro
            a. What is Perl?
                i. Practical Extraction and Report Language by Larry Wall
            b. Running Perl programs
                i. perl progname.pl
                ii. #! (shebang) on first line
                    #!/usr/bin/perl
                    $chmod 755 script.pl
                    $./script.pl
            c. Safety net
               #!/usr/bin/perl
               use strict;
               use warnings;
            d. Basic syntax overview (more at perlsyn)
                i. Perl statements end in a semi-colon:
                   print "Hello, world";
                ii. Whitespace is irrelevant except inside quoted strings:
                    print
                        "Hello, world"
                        ;
                    print "Hello
                        world";
                iii. Double quotes or single quotes may be used around literal
                     strings:
                     print "Hello, world";
                     print 'Hello, world';
                iv. Only double quotes "interpolate" variables and special
                    characters (e.g. \n):
                    print "Hello, $name\n"; # works fine
                    print 'Hello, $name\n'; # prints $name\n literally
                v.  Numbers don't need quotes around them:
                    print 42;
                vi. You can use parentheses for functions' arguments or omit
                    them according to your personal taste. They are only
                    required occasionally to clarify issues of precedence.
                    print("Hello, world\n");
                    print "Hello, world\n";
            e. Perl variable types (more at perldata & perlvar)
                i. Scalars ($)
                   Single value:
                   my $animal = "camel";
                   my $answer = 42;
                   print $animal;
                   print "The animal is $animal\n";
                   print "The square of $answer is ", $answer * $answer, "\n";
                   Special variable: default variable ($_)
                   print; #prints contents of $_ by default
                ii. Arrays (@)
                   List of values:
                   my @animals = ("camel", "llama", "owl");
                   my @numbers = (23, 42, 69);
                   my @mixed = ("camel", 42, 1.23);
                   Zero-indexed:
                   print $animals[0]; # prints "camel"
                   print $animals[1]; # prints "llama"
                   Special variable: index of last element ($#array)
                   Length of array: $#array + 1, or simply @array in scalar
                   context
                   Array slices:
                   @animals[0,1]; # gives ("camel", "llama");
                   @animals[0..2]; # gives ("camel", "llama", "owl");
                   @animals[1..$#animals]; # gives all except the first element
                   Useful array functions and special array variables:
                   my @sorted = sort @animals;
                   my @backwards = reverse @numbers;
                   @ARGV (the command line arguments to your script)
                   @_ (the arguments passed to a subroutine)
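A minimal sketch tying together the array notes above: last index, length, slices, and sorting. The helper name array_facts is hypothetical and used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (last index, length, tail slice) for a list.
sub array_facts {
    my @items = @_;
    my $last_index = $#items;               # index of the last element
    my $length     = scalar @items;         # array in scalar context = count
    my @tail       = @items[1 .. $#items];  # everything except the first
    return ($last_index, $length, \@tail);
}

my @animals = ("camel", "llama", "owl");
my ($last, $len, $tail) = array_facts(@animals);
print "last index: $last, length: $len, tail: @$tail\n";

# sort compares as strings by default; use <=> for numeric order
my @numbers = sort { $a <=> $b } (23, 100, 4);
print "@numbers\n";   # 4 23 100
```

Without the { $a <=> $b } block, sort would order these values as strings and put 100 first.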
                iii. Hashes (%)
                   A hash represents a set of key/value pairs:
                   my %fruit_color = ("apple", "red", "banana", "yellow");
                   The same hash, laid out more readably with =>:
                   my %fruit_color = (
                       apple => "red",
                       banana => "yellow",
                   );
                   $fruit_color{"apple"};           # gives "red"
                   You can get at lists of keys and values with keys() and
                   values().
                   my @fruits = keys %fruit_color;
                   my @colors = values %fruit_color;
                   Lists and hashes within lists and hashes:
                   my $variables = {
                       scalar  =>  {
                                    description => "single item",
                                    sigil => '$',
                                   },
                       array   =>  {
                                    description => "ordered list of items",
                                    sigil => '@',
                                   },
                       hash    =>  {
                                    description => "key/value pairs",
                                    sigil => '%',
                                   },
                   };
                   print "Scalars begin with a ",
                         "$variables->{'scalar'}->{'sigil'}\n";
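A short sketch of walking a hash, assuming the %fruit_color data from above. The helper name describe_hash is hypothetical. Since keys() returns keys in no guaranteed order, the sketch sorts them to get stable output.

```perl
use strict;
use warnings;

my %fruit_color = (
    apple  => "red",
    banana => "yellow",
);

# Hypothetical helper: format a hash as "key=value" strings in key order.
sub describe_hash {
    my ($hash_ref) = @_;
    return map { "$_=$hash_ref->{$_}" } sort keys %$hash_ref;
}

print "$_\n" for describe_hash(\%fruit_color);

# exists() tests for a key without creating it
print "no cherry\n" unless exists $fruit_color{cherry};
```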
            f. Variable scoping
               my $var = "value"; # creates block-scoped variables - good
               $var = "value"; # creates globals - not good
               my $x = "foo";
               my $some_condition = 1;
               if ($some_condition) {
                   my $y = "bar";
                   print $x; # prints "foo"
                   print $y; # prints "bar"
               }
               print $x; # prints "foo"
               print $y; # prints nothing; $y has fallen out of scope
                         # (a compile-time error under "use strict")
            g. Conditional and looping constructs
                i. if
                   if ( condition ) {
                       ...
                   } elsif ( other condition ) {
                       ...
                   } else {
                       ...
                   }
                   
                   # This:                     |# is the same as this:
                   unless ( condition ) {      |if ( !condition ) {
                       ...                     |    ...
                   }                           |}
                   
                   Traditional:
                   if ($zippy) {
                       print "Yow!";
                   }
                   Perlish post-condition:
                   print "Yow!" if $zippy;
                   print "We have no bananas" unless $bananas;
                ii. while
                    while ( condition ) {
                        ...
                    }
                   
                    # This:                    |# is the same as this:
                    until ( condition ) {      |while ( !condition ) {
                        ...                    |    ...
                    }                          |}
                iii. for
                     C style:
                     for (my $i = 0; $i <= $max; $i++) {
                         ...
                     }
                iv. foreach
                    Perl style:
                    foreach (@array) {
                        print "This element is $_\n";
                    }
                
                    print $list[$_] foreach 0 .. $max;

                    # you don't have to use the default $_ either...
                    foreach my $key (keys %hash) {
                        print "The value of $key is $hash{$key}\n";
                    }
            h. Builtin operators and functions (more at perlop & perlfunc)
                i. Arithmetic
                   + addition
                   - subtraction
                   * multiplication
                   / division
                ii. Numeric comparison
                    == equality
                    != inequality
                    < less than
                    > greater than
                    <= less than or equal
                    >= greater than or equal
                iii. String comparison
                     eq equality
                     ne inequality
                     lt less than
                     gt greater than
                     le less than or equal
                     ge greater than or equal
                iv. Boolean logic
                    && and
                    || or
                    ! not
                v. Miscellaneous
                   = assignment
                   . string concatenation
                   x string multiplication
                   .. range operator (creates a list of numbers)
                vi. Combining with =
                    $a += 1; # same as $a = $a + 1
                    $a -= 1; # same as $a = $a - 1
                    $a .= "\n"; # same as $a = $a . "\n";
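The split between numeric and string operators matters because Perl converts values according to the operator used. A minimal sketch (the helper name compare_both is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: compare two values both numerically and as strings.
sub compare_both {
    my ($x, $y) = @_;
    my $numeric = ($x == $y) ? 1 : 0;   # numeric equality
    my $string  = ($x eq $y) ? 1 : 0;   # string equality
    return ($numeric, $string);
}

my ($num_eq, $str_eq) = compare_both("10", "10.0");
print "numeric: $num_eq, string: $str_eq\n";   # numeric: 1, string: 0

# x repeats a string; . concatenates; .. builds a list
print "-" x 10 . "\n";
print join(",", 1 .. 5), "\n";                 # 1,2,3,4,5
```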
            i. Files and I/O (more at perlfunc & perlopentut)
               You can open a file for input or output using the open()
               function.
               open(my $in, "<", "input.txt") 
                   or die "Can't open input.txt: $!"; # $! = last error
               open(my $out, ">", "output.txt") 
                   or die "Can't open output.txt: $!";
               open(my $log, ">>", "my.log") or die "Can't open my.log: $!";
               You can read from an open filehandle using the <> operator.
               In scalar context it reads a single line from the filehandle:
               my $line  = <$in>;
               In list context it reads the whole file in,
               assigning each line to an element of the list:
               my @lines = <$in>; # slurping (a memory-intensive task)
               Typical line-by-line reading using a while loop:
               while (<$in>) { # assigns each line in turn to $_
                   print "Just read in this line: $_";
               }
               Printing to STDERR or a file:
               print STDERR "This is your final warning.\n";
               print $out $record;
               print $log $logmessage;
               Closing a file (good practice but not necessary):
               close $in or die "Can't close input.txt: $!";
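The open/read/close steps above can be sketched as one round trip. This assumes a scratch file created with the core File::Temp module; the helper name count_nonblank_lines is hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: count the non-blank lines in a file.
sub count_nonblank_lines {
    my ($path) = @_;
    open(my $in, "<", $path) or die "Can't open $path: $!";
    my $count = 0;
    while (my $line = <$in>) {
        chomp $line;                 # remove the trailing newline
        next if $line =~ /^\s*$/;    # skip blank lines
        $count++;
    }
    close $in or die "Can't close $path: $!";
    return $count;
}

# Write a small scratch file, then read it back.
my ($out, $path) = tempfile(UNLINK => 1);
print $out "first\n\nsecond\n";
close $out or die "Can't close $path: $!";

print count_nonblank_lines($path), "\n";   # 2
```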
            j. Regular expressions (more at perlrequick and perlretut)
                i. Simple matching
                   if (/foo/) { ... } # true if $_ contains "foo"
                   if ($a =~ /foo/) { ... } # true if $a contains "foo"
                ii. Simple substitution
                    s/foo/bar/;        # replaces foo with bar in $_
                    $a =~ s/foo/bar/;  # replaces foo with bar in $a
                    $a =~ s/foo/bar/g; # replaces ALL INSTANCES of foo
                                       # with bar in $a
                iii. More complex regular expressions (more in perlre)
                     Special characters:
                     . a single character
                     \s a whitespace character (space, tab, newline, ...)
                     \S non-whitespace character
                     \d a digit (0-9)
                     \D a non-digit
                     \w a word character (a-z, A-Z, 0-9, _)
                     \W a non-word character
                     [aeiou] matches a single character in the given set
                     [^aeiou] matches a single character outside the given set
                     (foo|bar|baz) matches any of the alternatives specified
                     ^ start of string
                     $ end of string
                     Quantifiers:
                     * zero or more of the previous thing
                     + one or more of the previous thing
                     ? zero or one of the previous thing
                     {3} matches exactly 3 of the previous thing
                     {3,6} matches between 3 and 6 of the previous thing
                     {3,} matches 3 or more of the previous thing
                     Examples:
                     /^\d+/ string starts with one or more digits
                     /^$/ nothing in the string (start and end are adjacent)
                     /(\d\s){3}/ three digits, each followed by a whitespace
                                 character (eg "3 4 5 ")
                     /(a.)+/ matches a string in which every odd-numbered letter
                             is a (eg "abacadaf")
                     # This loop reads from STDIN, and prints non-blank lines:
                     while (<>) {
                         next if /^$/; # skips to next iteration
                         print;
                     }
                iv. Parentheses for capturing
                    Parentheses can also be used to capture the results of
                    parts of the regexp match for later use. The results end up
                    in $1, $2 and so on:
                    # a quick way to break an email address up into parts
                    if ($email =~ /([^@]+)@(.+)/) {
                        print "Username is $1\n";
                        print "Hostname is $2\n";
                    }
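$1 and $2 are only meaningful after a successful match, so it is safest to read them inside the if. A sketch that wraps the email example in a subroutine (the name split_email and the sample address are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: split an address into (user, host), or return
# an empty list when the pattern does not match.
sub split_email {
    my ($email) = @_;
    if ($email =~ /^([^@]+)@(.+)$/) {
        return ($1, $2);
    }
    return;
}

my ($user, $host) = split_email('larry@example.org');
print "Username is $user\n";
print "Hostname is $host\n";
```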
                v. Other regexp features documented in perlrequick, perlretut,
                   & perlre
            k. Writing subroutines (more at perlsub)
                i. Defining a subroutine
                   sub logger { # args are in @_
                       my $logmessage = shift; # shifts first item off arg list
                                               # into $logmessage
                       open my $logfile, ">>", "my.log"
                            or die "Could not open my.log: $!";
                       print $logfile $logmessage;
                   }
                ii. Calling a subroutine
                    logger("We have a logger subroutine!");
                iii. Other ways of using arg list (@_)
                     my ($logmessage, $priority) = @_; # common
                     my $logmessage = $_[0]; # uncommon, and ugly
                iv. Returning values
                    Definition: 
                    sub square {
                        my $num = shift;
                        my $result = $num * $num;
                        return $result;
                    }
                    Use:
                    my $sq = square(8);
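A subroutine can also take several arguments and return a list. A minimal sketch (the helper name min_max is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: return the smallest and largest of its arguments.
sub min_max {
    my @numbers = @_;
    return unless @numbers;          # empty list for no input
    my ($min, $max) = ($numbers[0], $numbers[0]);
    for my $n (@numbers) {
        $min = $n if $n < $min;
        $max = $n if $n > $max;
    }
    return ($min, $max);             # a sub can return a list
}

my ($low, $high) = min_max(23, 42, 8);
print "min $low, max $high\n";       # min 8, max 42
```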
            l. OO Perl at perlboot, perltoot, perltooc, and perlobj
            m. Using Perl modules (perlmod, perlmodlib, perlmodinstall)   
                i. CPAN (www.cpan.org)
                ii. Read about module:
                    $perldoc Module::Name
                iii. Install module:
                     Easier if this is installed first (once):
                     $cpan App::cpanminus
                     Now to install a module:
                     $cpanm Module::Name
                     Another way:
                     $perl -MCPAN -e 'install Module::Name'
                     Yet another way:
                     $perl -MCPAN -e shell
                     cpan> install Module::Name
                     Do-it-yourself way:
                     Download the (tar.gz) file and build it yourself.
                     $tar -zxvf HTML-Template-2.8.tar.gz
                     $cd HTML-Template-2.8
                     $perl Makefile.PL
                     $make
                     $make test
                     $make install  
                iv. Use module in Perl script:
                    use Module::Name;
        2. perlstyle
            a. 4-Column indent.
            b. Opening curly on same line as keyword, if possible, otherwise 
               line up.
            c. Space before the opening curly of a multi-line BLOCK.
            d. One-line BLOCK may be put on one line, including curlies.
            e. No space before the semicolon.
            f. Semicolon omitted in "short" one-line BLOCK.
            g. Space around most operators.
            h. Space around a "complex" subscript (inside brackets).
            i. Blank lines between chunks that do different things.
            j. Uncuddled elses.
               # cuddled "else"        |# uncuddled "else"
               if ($x > 0) {           |if ($x > 0) {
                   $x += $y;           |    $x += $y;
               } else {                |}
                   $y += $x;           |else {
               }                       |    $y += $x;
                                       |}
            k. No space between function name and its opening parenthesis.
            l. Space after each comma.
            m. Long lines broken after an operator (except "and" and "or").
            n. Space after last parenthesis matching on current line.
            o. Line up corresponding items vertically.
            p. Omit redundant punctuation as long as clarity doesn't suffer.
II. Searching and replacing text
    A. Useful perldoc sections
        1. perlrequick
            a. Simple word matching
                i. "Hello World" =~ /World/;  # matches
                ii. print "It matches\n" if "Hello World" =~ /World/;
                iii. print "It doesn't match\n" if "Hello World" !~ /World/;
                iv. $greeting = "World";
                    print "It matches\n" if "Hello World" =~ /$greeting/;
                v. If you're matching against $_ , the $_ =~ part can be
                   omitted:
                   $_ = "Hello World";
                   print "It matches\n" if /World/;
                vi. "Hello World" =~ m!World!; # matches, delimited by '!'
                    "Hello World" =~ m{World}; # matches, note the matching '{}'
                    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                                 # '/' becomes an ordinary char
                vii. "Hello World" =~ /world/; # doesn't match, case sensitive
                     "Hello World" =~ /o W/; # matches, ' ' is an ordinary char
                     "Hello World" =~ /World /; # doesn't match, no ' ' at end
                viii. "Hello World" =~ /o/; # matches 'o' in 'Hello'
                      "That hat is red" =~ /hat/; # matches 'hat' in 'That'
                ix. The metacharacters: {}[]()^$.|*+?\
                x. A metacharacter can be matched by putting a backslash before it:
                   "2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter
                   "2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary +
                   'C:\WIN32' =~ /C:\\WIN/; # matches
                   "/usr/bin/perl" =~ /\/usr\/bin\/perl/; # matches
                   Now do you see why we use other delimiters?
                xi. Non-printable ASCII characters are represented by escape
                    sequences.
                    Common examples are \t for a tab, \n for a newline, and \r
                    for a carriage return. Arbitrary bytes are represented by
                    octal escape sequences, e.g., \033 , or hexadecimal escape
                    sequences, e.g., \x1B :
                    "1000\t2000" =~ m(0\t2); # matches
                    "cat" =~ /\143\x61\x74/; # matches in ASCII, but a weird way
                                            # to spell cat
                xii. Regexes are treated mostly as double-quoted strings, so
                     variable substitution works:
                     $foo = 'house';
                     'cathouse' =~ /cat$foo/; # matches
                     'housecat' =~ /${foo}cat/; # matches
                xiii. Anchor metacharacters (match at beginning) ^ and (match
                      at end) $:
                      "housekeeper" =~ /keeper/; # matches
                      "housekeeper" =~ /^keeper/; # doesn't match
                      "housekeeper" =~ /keeper$/; # matches
                      "housekeeper\n" =~ /keeper$/; # matches
                      "housekeeper" =~ /^housekeeper$/; # matches
            b. Using character classes
                i. A character class allows a set of possible characters to
                   match at a particular point in a regex. Character classes 
                   are denoted by brackets [...] , with the set of characters
                   to be possibly matched inside. 
                   /cat/; # matches 'cat'
                   /[bcr]at/; # matches 'bat', 'cat', or 'rat'
                   "abc" =~ /[cab]/; # matches 'a'
                   /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                                   # 'yes', 'Yes', 'YES', etc.
                   /yes/i; # also match 'yes' in a case-insensitive way using 
                           # i modifier
                   The special characters for a character class are -]\^$
                   and are matched using an escape:
                   /[\]c]def/; # matches ']def' or 'cdef'
                   $x = 'bcr';
                   /[$x]at/; # matches 'bat', 'cat', or 'rat'
                   /[\$x]at/; # matches '$at' or 'xat'
                   /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'
                   The special character '-' acts as a range operator
                   within character classes:
                   /item[0-9]/;  # matches 'item0' or ... or 'item9'
                   /[0-9a-fA-F]/; # matches a hexadecimal digit
                   Note: If '-' is the first or last character in a character 
                   class, it is treated as an ordinary character.
                   The special character ^ in the first position of a character
                   class denotes a negated character class, which matches any 
                   character but those in the brackets. 
                   /[^a]at/; # doesn't match 'aat' or 'at', but matches
                             # 'bat', 'cat', '0at', '%at', etc.
                   /[^0-9]/; # matches a non-numeric character
                   /[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary
                   Abbreviations for common character classes:
                   \d is a digit and represents [0-9]
                   \s is a whitespace character and represents [\ \t\r\n\f]
                   \w is a word character (alphanumeric or _) and represents
                      [0-9a-zA-Z_]
                   \D is a negated \d; it represents any character but a digit
                      [^0-9]
                   \S is a negated \s; it represents any non-whitespace 
                      character [^\s]
                   \W is a negated \w; it represents any non-word character
                      [^\w]
                   The period '.' matches any character but "\n"
                   The \d\s\w\D\S\W abbreviations can be used both inside and
                       outside of character classes. Here are some in use:
                   /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
                   /[\d\s]/; # matches any digit or whitespace character
                   /\w\W\w/; # matches a word char, followed by a
                             # non-word char, followed by a word char
                   /..rt/; # matches any two chars, followed by 'rt'
                   /end\./; # matches 'end.'
                   /end[.]/; # same thing, matches 'end.'
                   The word anchor \b matches a boundary between a word 
                      character and a non-word character \w\W or \W\w :
                   $x = "Housecat catenates house and cat";
                   $x =~ /\bcat/; # matches cat in 'catenates'
                   $x =~ /cat\b/; # matches cat in 'Housecat'
                   $x =~ /\bcat\b/; # matches 'cat' at end of string
            c. Matching this or that using alternation metacharacter '|'
               "cats and dogs" =~ /cat|dog|bird/; # matches "cat"
               "cats and dogs" =~ /dog|cat|bird/; # matches "cat"
               "cats" =~ /c|ca|cat|cats/; # matches "c"
               "cats" =~ /cats|cat|ca|c/; # matches "cats"
            d. Grouping things and hierarchical matching
               Grouping metacharacters () allow a part of a regex to be treated
                  as a single unit.
               /(a|b)b/; # matches 'ab' or 'bb'
               /(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere
               /house(cat|)/; # matches either 'housecat' or 'house'
               /house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or
                                  # 'house'. Note groups can be nested.
               "20" =~ /(19|20|)\d\d/; # matches the null alternative '()\d\d',
                                       # because '20\d\d' can't match
            e. Extracting matches
               For each grouping, the part that matched inside goes into the 
                  special variables
               $1 , $2 , etc. 
               # extract hours, minutes, seconds
               $time =~ /(\d\d):(\d\d):(\d\d)/; # match hh:mm:ss format
               $hours = $1;
               $minutes = $2;
               $seconds = $3;
               ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
               /(ab(cd|ef)((gi)|j))/;
                1  2      34
               Backreferences \g1 , \g2 , ... are matching variables that can
                   be used inside a regex:
               /(\w\w\w)\s\g1/; # find sequences like 'the the' in string
            f. Matching repetitions using quantifier metacharacters ?, * , + ,
                   and {}
               a? = match 'a' 1 or 0 times
               a* = match 'a' 0 or more times, i.e., any number of times
               a+ = match 'a' 1 or more times, i.e., at least once
               a{n,m} = match at least n times, but not more than m times.
               a{n,} = match at least n times
               a{n} = match exactly n times
               /[a-z]+\s+\d*/; # match a lowercase word, at least some space, 
                               # and any number of digits
               /(\w+)\s+\g1/; # match doubled words of arbitrary length
               $year =~ /^\d{2,4}$/; # make sure year is at least 2 but not more
                                     # than 4 digits
               $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3 digit
                                           # dates
               These quantifiers will try to match as much of the string as
                   possible, while still allowing the regex to match.
               $x = 'the cat in the hat';
               $x =~ /^(.*)(at)(.*)$/; # matches,
                                       # $1 = 'the cat in the h'
                                       # $2 = 'at'
                                       # $3 = '' (0 matches)
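To see the greedy behavior next to its opposite, the sketch below compares .* with the non-greedy form .*? on the same string. The non-greedy quantifier is not covered in the notes above and is brought in only for contrast. The helper name first_capture is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return what the first group captures, or undef.
sub first_capture {
    my ($text, $regex) = @_;
    return $text =~ $regex ? $1 : undef;
}

my $x = 'the cat in the hat';
print first_capture($x, qr/^(.*)at/),  "\n";   # the cat in the h
print first_capture($x, qr/^(.*?)at/), "\n";   # the c
```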
            g. More matching
               The global modifier //g
               $x = "cat dog house"; # 3 words
               while ($x =~ /(\w+)/g) {
                   print "Word is $1, ends at position ", pos $x, "\n";
               }
               Word is cat, ends at position 3
               Word is dog, ends at position 7
               Word is house, ends at position 13
               @words = ($x =~ /(\w+)/g); # matches,
                                          # $words[0] = 'cat'
                                          # $words[1] = 'dog'
                                          # $words[2] = 'house'
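One common use of //g in list context is counting words. A minimal sketch (the helper name word_counts is hypothetical, and folding case with lc is an added assumption):

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each word occurs in a string.
sub word_counts {
    my ($text) = @_;
    my %count;
    # In list context, //g returns every match of the group at once.
    $count{lc $_}++ for $text =~ /(\w+)/g;
    return \%count;
}

my $counts = word_counts("the cat and the hat");
print "$_: $counts->{$_}\n" for sort keys %$counts;
```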
            h. Search and replace
               $x = "Time to feed the cat!";
               $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!"
               $y = "'quoted words'";
               $y =~ s/^'(.*)'$/$1/; # strip single quotes,
                                     # $y contains "quoted words"
               $x = "I batted 4 for 4";
               $x =~ s/4/four/; # $x contains "I batted four for 4"
               $x = "I batted 4 for 4";
               $x =~ s/4/four/g; # $x contains "I batted four for four"
               $x = "I like dogs.";
               $y = $x =~ s/dogs/cats/r; # /r (Perl 5.14+) returns the new
                                         # string and leaves $x unchanged
               print "$x $y\n"; # prints "I like dogs. I like cats."
               $x = "Cats are great.";
               print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r
                   =~ s/Frogs/Hedgehogs/r, "\n";
               # prints "Hedgehogs are great."
               @foo = map { s/[a-z]/X/r } qw(a b c 1 2 3);
               # @foo is now qw(X X X 1 2 3)
               # reverse all the words in a string
               $x = "the cat in the hat";
               $x =~ s/(\w+)/reverse $1/ge; # $x contains "eht tac ni eht tah"
               # convert percentage to decimal
               $x = "A 39% hit rate";
               $x =~ s!(\d+)%!$1/100!e; # $x contains "A 0.39 hit rate"
            i. The split operator
               $x = "Calvin and Hobbes";
               @word = split /\s+/, $x; # $word[0] = 'Calvin'
                                        # $word[1] = 'and'
                                        # $word[2] = 'Hobbes'
               To extract a comma-delimited list of numbers, use
               $x = "1.618,2.718, 3.142";
               @const = split /,\s*/, $x; # $const[0] = '1.618'
                                          # $const[1] = '2.718'
                                          # $const[2] = '3.142'
               If the regex has groupings, then the list produced contains the
               matched substrings from the groupings as well:
               $x = "/usr/bin";
               @parts = split m!(/)!, $x; # $parts[0] = ''
                                          # $parts[1] = '/'
                                          # $parts[2] = 'usr'
                                          # $parts[3] = '/'
                                          # $parts[4] = 'bin'
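split also accepts an optional third argument that limits the number of fields, which helps when the separator can appear inside the last field. A sketch under an assumed "name:value" input format (the helper name parse_pairs and the sample data are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: parse "name:value" pairs separated by commas.
sub parse_pairs {
    my ($text) = @_;
    my %pairs;
    for my $item (split /,\s*/, $text) {
        # A limit of 2 keeps any later ':' inside the value.
        my ($name, $value) = split /:/, $item, 2;
        $pairs{$name} = $value;
    }
    return \%pairs;
}

my $p = parse_pairs("pi:3.142, e:2.718, time:12:30");
print join("; ", map { "$_ => $p->{$_}" } sort keys %$p), "\n";
# e => 2.718; pi => 3.142; time => 12:30
```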
    B. Perl pie (perl -p -i -e)
        1. Source
            |           UNIX GURU UNIVERSE 
            |              UNIX HOT TIP
            |     Unix Tip 2177 - December 17, 2002
            | http://www.ugu.com/sui/ugu/show?tip.today
            |           EAT YOUR PERL PIE
            |    Mom always sed, "eat your Perl pie"!
        2. Format
            a. Global search and replace
                i. $perl -p -i -e 's/original text string/replacement/g' foo
            b. Replace first instance on each line only
                i. $perl -p -i -e 's/original text string/replacement/' foo
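For use inside a script, the one-liner can be approximated by setting $^I (the in-place edit extension) and @ARGV, then looping with <>. This is a sketch, not the only way to do it. It assumes a backup suffix of .bak, as in perl -p -i.bak -e '...', and the helper name replace_in_file is hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: in-place global replace of a literal string,
# roughly what  perl -p -i.bak -e 's/FROM/TO/g' FILE  does.
sub replace_in_file {
    my ($path, $from, $to) = @_;
    local ($^I, @ARGV) = ('.bak', $path);   # in-place editing, keep a backup
    while (<>) {
        s/\Q$from\E/$to/g;                  # \Q...\E escapes metacharacters
        print;                              # goes to the rewritten file
    }
    return;
}

# Try it on a scratch file.
my ($fh, $file) = tempfile(UNLINK => 1);
print $fh "one cat, two cats\n";
close $fh or die "Can't close $file: $!";

replace_in_file($file, "cat", "dog");

open(my $in, "<", $file) or die "Can't open $file: $!";
print <$in>;                                # one dog, two dogs
close $in;
unlink "$file.bak";                         # remove the backup copy
```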
III. Summarizing data
     A. Reading through a multiline file
     B. Parsing strings
         1. Alphanumeric
         2. Numeric
     C. String manipulation
     D. Numeric manipulation
     E. Date manipulation
     F. Writing to a file
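As one possible shape for this section, the sketch below reads lines, parses a label and a number from each, and averages per label. The "<label> <number>" input format, the sample data, and the helper name summarize are all assumptions for illustration. An in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical input format: "<label> <number>" per line; lines that
# do not fit the pattern are skipped.
sub summarize {
    my ($fh) = @_;
    my (%sum, %count);
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /^(\w+)\s+(-?\d+(?:\.\d+)?)$/;
        $sum{$1}   += $2;
        $count{$1} += 1;
    }
    my %average = map { $_ => $sum{$_} / $count{$_} } keys %sum;
    return \%average;
}

my $data = "cpu 10\ncpu 30\nmem 4.5\n# comment\n";
open(my $in, "<", \$data) or die "Can't open in-memory file: $!";
my $avg = summarize($in);
close $in;

printf "%s %.2f\n", $_, $avg->{$_} for sort keys %$avg;
```

To write the summary to a file instead, open an output handle with ">" and pass it to printf as its first argument, e.g. printf $out "%s %.2f\n", ...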
IV. Working with CSV files
     A. Text::CSV - comma-separated values manipulator
     B. DBD::CSV - DBI driver for CSV files 
V. Accessing databases using DBD::mysql
     A. MySQL driver for the Perl5 Database Interface (DBI)
     B. Query
     C. Insert
     D. Delete
     E. Update  
VI. Executing system commands
     A. system 
     B. open (http://www.tek-tips.com/faqs.cfm?fid=5198)
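A minimal sketch of both approaches: system() for running a command and checking its exit status, and a piped open() for capturing output. To stay platform-neutral, the running perl itself ($^X) stands in for an external command. The helper names run_command and capture_output are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command (list form, no shell involved)
# and return its exit status.
sub run_command {
    my @cmd = @_;
    system(@cmd);
    die "Failed to run $cmd[0]: $!" if $? == -1;
    return $? >> 8;    # the exit status is in the high byte of $?
}

# Hypothetical helper: capture a command's output through a pipe.
sub capture_output {
    my @cmd = @_;
    open(my $pipe, "-|", @cmd) or die "Can't run $cmd[0]: $!";
    my @lines = <$pipe>;
    close $pipe;
    return @lines;
}

print "exit status: ", run_command($^X, "-e", "exit 3"), "\n";   # 3
print capture_output($^X, "-e", "print qq{hello\\n}");          # hello
```

Passing the command as a list avoids shell quoting problems. A single-string form goes through the shell when it contains shell metacharacters.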
