Lesson 11 Perl split function
Objective: Given a set of text data, write a Perl program that uses the split function to get just the browser name and version.

Perl Split Function

The split function is, in some ways, the inverse of the join function. However, it is more powerful than that description may imply, because it uses a regular expression to specify where the splits occur. The split function returns a list of all the elements that result from splitting a string. Any of these forms is acceptable for invoking split:
  1. split /PATTERN/, EXPR, LIMIT
  2. split /PATTERN/, EXPR
  3. split /PATTERN/
  4. split

The /PATTERN/ parameter dictates where the string is split.
The EXPR parameter is the string from which the split is derived.
The LIMIT parameter is a number specifying the maximum number of pieces to split from the string (however, the split may result in fewer pieces).
If LIMIT is omitted, the string is split into as many pieces as the pattern allows, and trailing empty pieces are discarded; if EXPR is omitted, the special variable $_ is used as the input string; and if /PATTERN/ is omitted, the string is split on whitespace, after skipping any leading whitespace.

Because split uses a regex to find the split points, it is far more powerful than a function that splits only on a fixed delimiter string. There are many circumstances in which you will probably choose to use the pattern match (m//) operator instead, but this little function is not to be underestimated. The examples below illustrate what it can do.
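The four call forms and the effect of LIMIT can be tried out in a small sketch. The sample strings and variable names here are illustrative assumptions, not part of the lesson's data.

```perl
use strict;
use warnings;

my $record = "alpha, beta,gamma ,  delta";

# Form 2: the pattern is a regex, so optional whitespace around
# each comma is consumed as part of the delimiter.
my @all = split /\s*,\s*/, $record;

# Form 1: a LIMIT of 2 stops after the first split; the rest of
# the string is returned intact as the last piece.
my @two = split /\s*,\s*/, $record, 2;

# Forms 3 and 4 operate on $_.
my (@on_dash, @on_space);
for ("  one-two   three  ") {
    @on_dash  = split /-/;   # splits $_ on "-"
    @on_space = split;       # splits $_ on whitespace, skipping leading whitespace
}

# join reverses a split when the delimiter is a fixed string.
my $rejoined = join ",", @all;

print scalar(@all), " pieces: @all\n";
print "limited: [$two[0]] [$two[1]]\n";
print "words: @on_space\n";
print "rejoined: $rejoined\n";
```

Note that the limited split leaves the remaining commas and spaces untouched in its second piece, which is what makes LIMIT useful for "name, then everything else" records.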

Splitting a Mail header using Perl

Here is an example of a legacy mail header:
From wew@bearnet.com Fri Jun 27 20:31:48 1997
Return-Path: <wew@bearnet.com>
Received: looloo.bearnet.com (207.55.144.29)
 by luna.bearnet.com with SMTP;
 27 Jun 1997 20:31:47 -0000
Date: Fri, 27 Jun 1997 13:31:42 -0700 (PDT)
From: wew@bearnet.com
To: You There <you@overthere.com>
Message-Id:
 <199706272031.NAA08953@luna.bearnet.com>
Subject: Welcome to the world of email!

What is wrong with the above mail header?
  1. Outdated Date Format:
    • The date format is still readable and largely matches what RFC 5322 specifies. The header gives the numeric offset (-0700) and repeats the zone as the comment "(PDT)"; modern systems rely on the numeric offset, because zone abbreviations can be ambiguous.
  2. Message-Id:
    • The Message-Id is properly formatted (its value is folded onto a continuation line), but it is simpler than the identifiers modern email systems generate, which often contain more unique, domain-based components or serve specific tracking purposes.
  3. Return-Path:
    • The Return-Path header, which is used to store the email address for bounces, is still valid today but has since been standardized to include more complex handling with anti-spam technologies like SPF (Sender Policy Framework) and DKIM (DomainKeys Identified Mail). The simple address in this header suggests a time before widespread spam concerns required more sophisticated validation.
  4. Received Header:
    • The Received header, detailing how the email was processed between servers, remains a critical part of mail transmission logs even today. However, the use of IP addresses (like 207.55.144.29) directly in the Received header without further security details or checks like SPF, DKIM, and DMARC is typical of a less secure time in email history.
  5. Overall Format:
    • The format of headers is still compliant with modern email RFCs (such as RFC 822 or its successors like RFC 5322). However, the general approach in this header, from the Return-Path to the Message-Id structure, reflects a time before the security challenges and mass scale of email required stricter handling, authentication, and spam prevention technologies.

Summary: This header can be seen as a historical artifact representing an era when email was less regulated and the protocols were simpler. Today, many additional headers and security measures (e.g., DKIM, SPF, and DMARC) are part of a modern email's standard, reflecting the evolving challenges of spam, phishing, and security concerns.


Historical Artifact

The mail header above can be viewed as a historical artifact. While it reflects a valid format for email headers, certain aspects of it are outdated when compared to modern email standards and practices, as the points above describe.

The legacy Perl code shown later in this lesson, under "Perl Legacy version to split Mail Header", can be used to split such a mail header, but some aspects of it could be improved or clarified for better robustness and readability.

How the Code Works:
  1. `while(<>)`:
    • Reads input line by line, either from standard input (if no file is provided) or from files specified on the command line.
  2. `chomp`:
    • Removes the trailing newline from each line.
  3. `last unless $_;`:
    • Terminates the loop when it encounters an empty line (which typically separates the mail header from the body in an email).
  4. `next unless /^\w*:/;`:
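    • Skips lines that do not appear to be header fields (i.e., lines that don't start with word characters followed by a colon, which is the format of headers like `From:`, `To:`, `Subject:`, etc.). Because `\w` does not match a hyphen, this also skips headers such as `Return-Path:` and `Message-Id:`.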
  5. `split /:\s*/;`:
    • Splits the line in `$_` wherever `:` followed by optional whitespace occurs. The first two pieces are assigned to the header name (`lhs`) and its value (`rhs`); any further pieces are discarded.
  6. `$headers{uc $lhs} = $rhs;`:
    • Stores the header name (converted to uppercase for consistency) and its corresponding value in the `%headers` hash.


Possible Improvements:
  1. Handle Multiline Headers:
    • Some email headers can span multiple lines, with subsequent lines beginning with whitespace. The code as written doesn't handle this case.
      To address it, you could concatenate lines that are part of the same header before processing.
  2. Whitespace Handling:
    • Called without a LIMIT, split /:\s*/ splits at every colon, not just the first, so a value that contains a colon (such as the time in a Date: header) is cut short; pass a LIMIT of 2 to split only at the first colon. The \s* consumes whitespace after the colon, but trailing whitespace on the rhs is left in place, so you might want to add a step that removes it.
  3. Empty Headers:
    • The code assumes that each header has a value. Some headers may be present without values (like Cc: with no recipients). You may want to handle such cases to avoid undefined values.
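How split treats colons and empty values can be checked directly. The following is a minimal sketch; the `Date:` line is taken from the sample header, while the bare `Cc:` line and all variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $line = "Date: Fri, 27 Jun 1997 13:31:42 -0700 (PDT)";

# No LIMIT: every ":" (plus any following whitespace) is a split point,
# so the time of day is broken apart as well.
my @pieces = split /:\s*/, $line;

# LIMIT of 2: only the first ":" is used, so the value stays whole.
my ($name, $value) = split /:\s*/, $line, 2;

# A header with no value: without a LIMIT the trailing empty field is
# dropped, which leaves the second variable undefined.
my ($cc_name, $cc_value) = split /:\s*/, "Cc:";
$cc_value = '' unless defined $cc_value;   # substitute an empty string

print "pieces without LIMIT: ", scalar(@pieces), "\n";
print "name=[$name] value=[$value]\n";
print "cc=[$cc_value]\n";
```

Without a LIMIT the `Date:` line yields four pieces, and the second one ends at "13"; with a LIMIT of 2 the full date and time survive in `$value`.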

Updated Version with Improvements:
use strict;
use warnings;

my %headers;
my $current_header = '';

while (<>) {
    chomp;

    # Stop at the first empty line (end of headers)
    last if $_ eq '';

    # Handle multiline headers (continuation lines start with whitespace)
    if (/^\s+(.*)/ && $current_header) {
        # Append the continuation to the previous header's value
        $headers{$current_header} .= length $headers{$current_header} ? " $1" : $1;
        next;
    }

    # Skip lines that don't look like headers; names may contain hyphens
    next unless /^[\w-]+:/;

    # Split on the first colon only (LIMIT of 2)
    my ($lhs, $rhs) = split /:\s*/, $_, 2;

    # A header with no value gets an empty string
    $rhs = '' unless defined $rhs;

    # Trim trailing whitespace from the value
    $rhs =~ s/\s+$//;

    # Store the header, uppercased for consistency
    $current_header = uc $lhs;
    $headers{$current_header} = $rhs;
}

# Now %headers contains the mail headers with their values

Key Changes:
  1. Handling Multiline Headers: Added a check to handle headers that span multiple lines by detecting lines starting with whitespace and appending them to the previous header's value.
  2. Trimming Whitespace: The \s* in the split pattern consumes whitespace after the colon, and trailing whitespace is removed from the rhs (header value) for more accurate parsing.
  3. Improved Split: Used split /:\s*/, $_, 2 to ensure that only the first colon is used to split, allowing the value to contain additional colons (e.g., the time in Date: Fri, 27 Jun 1997 13:31:42 -0700).
  4. Hyphenated Header Names: Changed the filter to /^[\w-]+:/ so that headers such as Return-Path: and Message-Id: are no longer skipped.
  5. Empty Headers and End of Headers: A header without a value is stored as an empty string, and the loop ends only on a truly empty line rather than on any false value such as "0".

This code will work better for real-world email headers and provide more reliable results when handling various edge cases in email parsing.



Perl Legacy version to split Mail Header

while(<>) {
 chomp;
 last unless $_; 
 next unless /^\w*:/; 
 ($lhs, $rhs) = split /:\s*/;          
 $headers{uc $lhs}  = $rhs;
}

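  1. The line last unless $_; ends the loop at the first blank line. The last statement exits the enclosing loop immediately, skipping the rest of the current iteration and all remaining iterations. We will be looking at last in more detail in Module 5.
  2. The line next unless /^\w*:/; skips any line that does not begin with word characters followed by a colon, such as the old Unix-style "From " line. Because \w does not match a hyphen, headers such as Return-Path: and Message-Id: are skipped too, as are continuation lines. The next statement skips the remaining statements in the current iteration of the loop and starts the next iteration. We will be looking at next in more detail in Module 5.
  3. The %headers hash will hold the mail headers, keyed by uppercased name. If a header appears more than once, only its last value is kept. Because split is called without a LIMIT, a value that itself contains a colon (such as the time in Date:) is cut off at that colon.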
Now you can easily do something like this:
print "on $headers{DATE}, $headers{FROM} said: . . . \n";
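To see what the legacy loop actually captures, the following sketch feeds it the sample header from an in-memory string rather than a file. The `$sample_header` variable, the in-memory filehandle, the body line, and the `my` declarations are additions for illustration; the loop logic is otherwise the lesson's.

```perl
use strict;
use warnings;

# The sample header from this lesson, held in a string (illustrative setup).
my $sample_header = <<'END_HEADER';
From wew@bearnet.com Fri Jun 27 20:31:48 1997
Return-Path: <wew@bearnet.com>
Received: looloo.bearnet.com (207.55.144.29)
 by luna.bearnet.com with SMTP;
 27 Jun 1997 20:31:47 -0000
Date: Fri, 27 Jun 1997 13:31:42 -0700 (PDT)
From: wew@bearnet.com
To: You There <you@overthere.com>
Message-Id:
 <199706272031.NAA08953@luna.bearnet.com>
Subject: Welcome to the world of email!

Body text starts here.
END_HEADER

# Read the string line by line through an in-memory filehandle.
open(my $fh, '<', \$sample_header) or die "Cannot open in-memory file: $!";

my %headers;
while (<$fh>) {
    chomp;
    last unless $_;                    # blank line ends the header
    next unless /^\w*:/;               # same filter as the legacy loop
    my ($lhs, $rhs) = split /:\s*/;    # no LIMIT, as in the legacy loop
    $headers{uc $lhs} = $rhs;
}
close $fh;

# List what was captured, in a stable order.
for my $name (sort keys %headers) {
    print "$name => $headers{$name}\n";
}
```

With the lesson's header this stores DATE, FROM, RECEIVED, SUBJECT, and TO. The hyphenated headers and the continuation lines are skipped, and the DATE value ends at "13" because the colons in the time are also treated as split points.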

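The following simple program prints the login name, full-name (GECOS) field, UID, home directory, and login shell of every user on a Unix system:

#!/usr/bin/perl
use strict;
use warnings;

# Open the password file, stopping with a message if it cannot be read
open(my $passwd_fh, '<', '/etc/passwd')
    or die "Cannot open /etc/passwd: $!";

while (<$passwd_fh>) {
    chomp;
    next if /^\s*(#|$)/;    # skip comment lines and blank lines

    # Seven colon-separated fields; a LIMIT of -1 keeps trailing empty ones
    my ($login, $passwd, $uid, $gid, $gcos, $home, $shell) = split /:/, $_, -1;
    print "$login ($gcos): UID: $uid, HOME: $home, SHELL: $shell\n";
}
close $passwd_fh;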

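The /etc/passwd file is a colon-delimited list of the user-related information for each user on a Unix system. The password is only one component of that information, and it is one-way hashed so it cannot be read back; on modern systems the hash is usually kept in a separate shadow file, and the password field holds only a placeholder such as x.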
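When the fields are needed by name, the result of split can be assigned to a hash slice instead of seven separate variables. This is a minimal sketch that uses made-up sample records in /etc/passwd format rather than the real file; the account data and the field names in `@fields` are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample records in /etc/passwd format (not real accounts).
my @sample = (
    "alice:x:1001:1001:Alice Example:/home/alice:/bin/bash",
    "svc:x:998:998::/var/lib/svc:",
);

my @fields = qw(login passwd uid gid gcos home shell);
my @users;

for my $line (@sample) {
    my %user;
    # A negative LIMIT keeps trailing empty fields, so a record with an
    # empty shell still yields seven values.
    @user{@fields} = split /:/, $line, -1;
    push @users, \%user;
}

for my $u (@users) {
    print "$u->{login} ($u->{gcos}): UID: $u->{uid}, HOME: $u->{home}, SHELL: $u->{shell}\n";
}
```

The hash slice keeps the field order in one place (`@fields`), which is easier to maintain than a long list of scalar variables.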


Perl Split Function - Exercise

Click the Exercise link below to use the split function with a specified set of data.
Perl Split Function - Exercise
