Lesson 11 | Perl split function |
Objective | Given a set of text data, write a Perl program that uses the split function to get just the browser name and version. |
Perl Split Function
The `split` function is, in some ways, the inverse of the `join` function. However, it is more powerful than that description may imply, because it uses a regular expression to specify where the splits occur. The `split` function returns a list comprising all the resulting elements from splitting a string. Any of these forms are acceptable for invoking `split`:
split /PATTERN/, EXPR, LIMIT
split /PATTERN/, EXPR
split /PATTERN/
split
The `/PATTERN/` parameter dictates where the string is split.
The `EXPR` parameter is the string to be split.
The `LIMIT` parameter is a number specifying the maximum number of pieces to split the string into (however, the split may result in fewer pieces).
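As a minimal sketch of these parameters in action, the following splits a made-up record (the `$record` string is hypothetical sample data, not part of the lesson) on a regex pattern, then repeats the split with a `LIMIT`, and finally reverses it with `join`:

```perl
use strict;
use warnings;

# Hypothetical sample data: comma-separated fields with uneven spacing.
my $record = 'alpha , beta,gamma ,  delta';

# A regex pattern absorbs the whitespace around each comma.
my @fields = split /\s*,\s*/, $record;

# LIMIT caps the number of pieces; the last piece keeps the unsplit remainder.
my @first_two = split /\s*,\s*/, $record, 2;

# join reverses the split, but with a fixed separator string, not a pattern.
my $rejoined = join ',', @fields;

print "field: $_\n" for @fields;
print "limited: $_\n" for @first_two;
print "rejoined: $rejoined\n";
```

Note that `join` cannot restore the original spacing, which is one reason `split` is only loosely its inverse.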
If `LIMIT` is omitted, the result contains as many pieces as the pattern produces; if `EXPR` is omitted, the special `$_` variable is used as the input string; and if `/PATTERN/` is omitted, the string is split on whitespace, after skipping any leading whitespace. Here are several examples that illustrate the power of the `split` function. The fact that `split` uses a regex to split strings makes it far more powerful than if it just split on a fixed delimiter string. There are many circumstances in which you will probably choose to use the pattern match (`m//`) operator instead, but this little function is not to be underestimated.
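The defaults described above can be sketched as follows. The input string is made up for illustration; the comparison with an explicit `/\s+/` pattern shows why the bare form is usually preferred for whitespace-separated words:

```perl
use strict;
use warnings;

my @words;

# With EXPR omitted, split works on $_; "for" aliases $_ to the string.
for ("  the quick  brown fox ") {
    # With /PATTERN/ omitted as well, split breaks on runs of whitespace
    # and ignores any leading whitespace.
    @words = split;
}

# For comparison: an explicit /\s+/ pattern produces a leading empty field
# when the string starts with whitespace.
my @with_regex = split /\s+/, "  the quick  brown fox ";

print scalar(@words), " words from bare split\n";
print scalar(@with_regex), " fields from /\\s+/\n";
```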
Splitting a Mail header using Perl
Here is an example of a legacy mail header:
From wew@bearnet.com Fri Jun 27 20:31:48 1997
Return-Path: <wew@bearnet.com>
Received: looloo.bearnet.com (207.55.144.29)
    by luna.bearnet.com with SMTP;
    27 Jun 1997 20:31:47 -0000
Date: Fri, 27 Jun 1997 13:31:42 -0700 (PDT)
From: wew@bearnet.com
To: You There <you@overthere.com>
Message-Id:
    <199706272031.NAA08953@luna.bearnet.com>
Subject: Welcome to the world of email!
What is wrong with the above mail header?
- Outdated Date Format: The date is still readable and valid, but it carries the time zone abbreviation "(PDT)" (Pacific Daylight Time) in addition to the numeric offset. Modern systems that follow RFC 5322 rely on the numeric offset (e.g., `-0700`), which avoids the ambiguity of abbreviations.
- Message-Id: The `Message-Id` is properly formatted but simpler than those generated by modern email systems, which often contain more unique identifiers based on the domain or on specific tracking purposes.
- Return-Path: The `Return-Path` header, which stores the email address for bounces, is still valid today, but it is now typically handled alongside anti-spam technologies like SPF (Sender Policy Framework) and DKIM (DomainKeys Identified Mail). The simple address in this header suggests a time before widespread spam concerns required more sophisticated validation.
- Received Header: The `Received` header, detailing how the email was processed between servers, remains a critical part of mail transmission logs even today. However, a bare IP address (like `207.55.144.29`) in the `Received` header, with no accompanying results from checks like SPF, DKIM, and DMARC, is typical of a less secure time in email history.
- Overall Format: The format of the headers is still compliant with the email RFCs (RFC 822 and its successors, such as RFC 5322). However, the general approach in this header, from the `Return-Path` to the `Message-Id` structure, reflects a time before the security challenges and mass scale of email required stricter handling, authentication, and spam prevention technologies.
Summary:
This header can be seen as a historical artifact representing an era when email was less regulated and the protocols were simpler. Today, many additional headers and security measures (e.g., DKIM, SPF, and DMARC) are part of a modern email's standard, reflecting the evolving challenges of spam, phishing, and security concerns.
Historical Artifact
The mail header above can be viewed as a historical artifact. While it reflects a valid format for email headers, certain aspects of it are outdated when compared to modern email standards and practices.
The legacy Perl code at the end of this section (under "Perl Legacy version to split Mail Header") can be used to split a mail header, but some aspects of it could be improved or clarified for better robustness and readability.
How the Code Works:
- `while(<>)`: Reads input line by line, either from standard input (if no file is provided) or from files specified on the command line.
- `chomp`: Removes the trailing newline from each line.
- `last unless $_;`: Terminates the loop when it encounters an empty line (which typically separates the mail header from the body in an email).
- `next unless /^\w*:/;`: Skips lines that do not appear to be header fields (i.e., lines that don't start with a run of word characters followed by a colon, which is the format of headers like `From:`, `To:`, `Subject:`, etc.).
- `split /:\s*/;`: Splits `$_` at each colon followed by optional whitespace. The first piece becomes the header name (`$lhs`) and the second the value (`$rhs`). Because no `LIMIT` is given, a value that itself contains a colon is cut short at that colon.
- `$headers{uc $lhs} = $rhs;`: Stores the header name (converted to uppercase for consistency) and its corresponding value in the `%headers` hash.
Possible Improvements:
- Handle Multiline Headers: Some email headers can span multiple lines, with subsequent lines beginning with whitespace. The code as written doesn't handle this case. To address it, you could concatenate lines that are part of the same header before processing.
- Whitespace Handling: The `split /:\s*/` removes the whitespace that follows each colon, but it doesn't trim trailing whitespace on the `rhs`. Without a `LIMIT` it also splits at every colon rather than only the first, so a value that contains colons (such as the time in a `Date:` header) is truncated. You might want to pass a `LIMIT` of 2 and add a step to remove stray whitespace from both `lhs` and `rhs` for cleaner parsing.
- Empty Headers: The code assumes that each header has a value. Some headers may be present without values (like `Cc:` with no recipients). You may want to handle such cases to avoid undefined values.
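The effect of `LIMIT` on header lines can be checked directly. The sketch below uses the `Date:` line from the sample header and a hypothetical empty `Cc:` line. It collects the pieces in arrays so that the number of fields is easy to inspect:

```perl
use strict;
use warnings;

my $date_line = 'Date: Fri, 27 Jun 1997 13:31:42 -0700 (PDT)';

# Without LIMIT, every colon is a split point, so the time is broken up.
my @all_parts = split /:\s*/, $date_line;

# With LIMIT 2, only the first colon splits; the value stays whole.
my ($name, $value) = split /:\s*/, $date_line, 2;

# An empty header: without LIMIT the trailing empty field is dropped...
my @cc_default = split /:\s*/, 'Cc:';

# ...while a positive LIMIT keeps it, so the value is an empty string.
my @cc_limited = split /:\s*/, 'Cc:', 2;

print scalar(@all_parts), " pieces without LIMIT\n";
print "$name => $value\n";
print scalar(@cc_default), " field(s) without LIMIT, ",
      scalar(@cc_limited), " with LIMIT 2\n";
```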
Updated Version with Improvements:
use strict;
use warnings;

my %headers;
my $current_header = '';

while (<>) {
    chomp;

    # Stop at the first empty line (end of headers)
    last if $_ eq '';

    # Handle multiline headers (lines starting with whitespace)
    if (/^\s+/ && $current_header) {
        # Continuation of the previous header: drop the indentation, then append
        s/^\s+//;
        $headers{$current_header} .= " $_";
        next;
    }

    # Skip lines that don't look like headers. Header names may contain
    # hyphens (Return-Path, Message-Id), which \w alone does not match.
    next unless /^[\w-]+:/;

    # Split on the first colon only and drop the whitespace after it
    my ($lhs, $rhs) = split /:\s*/, $_, 2;

    # Trim any leading/trailing whitespace
    $lhs =~ s/^\s+|\s+$//g;
    $rhs =~ s/^\s+|\s+$//g;

    # Store the header, uppercased for consistency
    $current_header = uc $lhs;
    $headers{$current_header} = $rhs;
}

# Now %headers contains the mail headers with their values
Key Changes:
- Handling Multiline Headers: Added a check to handle headers that span multiple lines by detecting lines starting with whitespace.
- Trimming Whitespace: Trimmed both the `lhs` (header name) and `rhs` (header value) for more accurate parsing.
- Improved Split: Used `split /:\s*/, $_, 2` to ensure that only the first colon is used to split, allowing the value to contain additional colons (e.g., the time in `Date: Fri, 27 Jun 1997 13:31:42 -0700`).
This code will work better for real-world email headers and provide more reliable results when handling various edge cases in email parsing.
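To try the approach without piping a mail file into the script, the same logic can be wrapped in a function and fed a string. This is only a sketch: `parse_headers` is a hypothetical helper name, the header-name pattern `[\w-]+` is an assumption made so that hyphenated names such as `Return-Path` are accepted, and the sample text is abbreviated from the header shown earlier, with the continuation lines indented.

```perl
use strict;
use warnings;

# parse_headers() is a hypothetical helper, not part of the lesson's code.
sub parse_headers {
    my ($text) = @_;
    my (%headers, $current);
    for my $line (split /\n/, $text) {
        last if $line eq '';                     # blank line ends the headers
        if ($line =~ /^\s+(.*)/ && defined $current) {
            $headers{$current} .= " $1";         # unfold a continuation line
            next;
        }
        next unless $line =~ /^[\w-]+:/;         # allow hyphens in header names
        my ($name, $value) = split /:\s*/, $line, 2;
        $value =~ s/\s+$//;                      # trim trailing whitespace
        $current = uc $name;
        $headers{$current} = $value;
    }
    return %headers;
}

my $sample = join "\n",
    'From wew@bearnet.com Fri Jun 27 20:31:48 1997',
    'Return-Path: <wew@bearnet.com>',
    'Received: looloo.bearnet.com (207.55.144.29)',
    '    by luna.bearnet.com with SMTP;',
    '    27 Jun 1997 20:31:47 -0000',
    'Date: Fri, 27 Jun 1997 13:31:42 -0700 (PDT)',
    'From: wew@bearnet.com',
    'Subject: Welcome to the world of email!',
    '',
    'Body text: not a header';

my %h = parse_headers($sample);
print "$_ => $h{$_}\n" for sort keys %h;
```

The leading `From ` line has no colon after its first word, so it is skipped, and the line after the blank separator is never examined.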
Perl Legacy version to split Mail Header
while(<>) {
    chomp;
    last unless $_;
    next unless /^\w*:/;
    ($lhs, $rhs) = split /:\s*/;
    $headers{uc $lhs} = $rhs;
}
- The line `last unless $_;` ends the loop at the first blank line. The `last` statement exits the loop immediately, skipping the rest of the current iteration. We will be looking at `last` in more detail in Module 5.
- The line `next unless /^\w*:/;` skips old Unix-style headers that do not have a colon, such as the leading `From ` line. Because `\w` does not match a hyphen, it also skips hyphenated header names such as `Return-Path:` and `Message-Id:`. The `next` statement tells a looping structure to skip the remaining steps in the current iteration and go on to the next one. We will be looking at `next` in more detail in Module 5.
- The `%headers` hash will get the mail headers. When a header line is duplicated, the later value overwrites the earlier one, so only one copy is kept.
Now you can easily do something like this:
print "on $headers{DATE}, $headers{FROM} said: . . . \n";
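If duplicate headers matter (a message usually passes through several servers, each adding its own `Received:` line), one option is to store a list of values per header name instead of a single string. The sketch below assumes made-up sample lines and is a variation on the lesson's code, not a replacement for it:

```perl
use strict;
use warnings;

# Hypothetical sample lines with a repeated header name.
my @lines = (
    'Received: from a.example.com by b.example.com; 27 Jun 1997 20:31:40 -0000',
    'Received: from b.example.com by c.example.com; 27 Jun 1997 20:31:47 -0000',
    'Subject: Welcome to the world of email!',
);

my %headers;
for my $line (@lines) {
    # LIMIT 2 keeps the colons inside the timestamps intact.
    my ($name, $value) = split /:\s*/, $line, 2;
    # Each key holds an array reference, so repeated headers accumulate.
    push @{ $headers{uc $name} }, $value;
}

printf "%d Received headers\n", scalar @{ $headers{RECEIVED} };
print "first hop: $headers{RECEIVED}[0]\n";
```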
The following simple program prints the name, home directory, and login shell of all the users on a Unix system:
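#!/usr/bin/perl -w
use strict;

# Open the password file for reading and stop with a message on failure
open(my $passwd_fh, '<', '/etc/passwd')
    or die "Cannot open /etc/passwd: $!";

while (<$passwd_fh>) {
    chomp;
    # Each record has seven colon-separated fields
    my ($login, $passwd, $uid, $gid, $gcos, $home, $shell) = split /:/;
    print "$login ($gcos): UID: $uid, HOME: $home, SHELL: $shell\n";
}

close $passwd_fh;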
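The `/etc/passwd` file is a colon-delimited list of all the user-related information for each user on a Unix system (the password is only one component of that information, and it is one-way encoded so it cannot be read anyway; on most modern systems the field holds only a placeholder such as `x`, with the encoded password kept in `/etc/shadow`).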
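One detail worth knowing when splitting colon-delimited records is that `split` drops trailing empty fields unless a negative `LIMIT` is given. The sketch below uses two made-up records in the `/etc/passwd` layout (the second has an empty shell field) rather than reading the real file, and stores each record in a hash keyed by field name:

```perl
use strict;
use warnings;

# Hypothetical /etc/passwd-style records; the second has an empty shell field.
my @records = (
    'alice:x:1000:1000:Alice Example:/home/alice:/bin/bash',
    'svc:x:999:999:Service Account:/var/lib/svc:',
);

my @users;
for my $record (@records) {
    # LIMIT -1 keeps trailing empty fields, so every record yields 7 fields.
    my @f = split /:/, $record, -1;
    my %user;
    @user{qw(login passwd uid gid gcos home shell)} = @f;
    push @users, \%user;
}

for my $u (@users) {
    my $shell = $u->{shell} eq '' ? '(none)' : $u->{shell};
    print "$u->{login} ($u->{gcos}): UID: $u->{uid}, HOME: $u->{home}, SHELL: $shell\n";
}
```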
Perl Split Function - Exercise