CGI Forms

Lesson 6: Form-data encoding
Objective: How form data is encoded for transmission.

Form-data Encoding in Perl

When the browser sends the form data to the server, it does so in a slightly encoded format. The reason for that encoding is to ensure that the data maintains its integrity across whatever platforms and transmission methods may stand between the client and the server. There are two major levels of encoding involved in creating the query string:
  1. Encoding of the name/value pairs that represent the data in the form.
  2. Encoding of any special or "unsafe" characters for the benefit of the transmission media.

We will consider both of these levels in turn.


Name/value pairs

The name/value pairs are the values of the FORM fields, and their associated names for identification. The name/value pairs are encoded as:
name=value&name=value&name=value . . .

The ampersand character (&) separates one name/value pair from another, so you can separate them easily with Perl's split function like this:
@pairs = split(/&/, $query_string);

Likewise, the equal sign (=) is used to separate the name from the value within each pair. So you can separate the name from the value like this:
($name, $value) = split(/=/, $pair);

This allows you to use this sort of construct in Perl to extract the name/value pairs from the query string:
foreach $pair (split(/&/, $query_string)) {
 ($_qsname, $_qsvalue) = split(/=/, $pair);
 . . . 
}
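
The construct above can be wrapped into a small routine. The following is a minimal sketch, not a definitive implementation: the name parse_pairs is hypothetical, and it assumes that a name may appear more than once (as with multi-select fields), so each name maps to a list of values. Values are left in their encoded form here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any library): split a raw query string
# into name/value pairs. Values stay encoded; repeated names keep their order.
sub parse_pairs {
    my ($query_string) = @_;
    my %form;
    foreach my $pair ( split /&/, $query_string ) {
        # A limit of 2 keeps everything after the first "=" in the value.
        my ( $name, $value ) = split /=/, $pair, 2;
        next unless defined $name && length $name;    # skip empty pairs
        $value = '' unless defined $value;            # "name" with no "=value"
        push @{ $form{$name} }, $value;
    }
    return \%form;
}

my $form = parse_pairs('user=Ovid&user=Sally&timeout=30');
print "$_ = @{ $form->{$_} }\n" for sort keys %$form;
# prints:
# timeout = 30
# user = Ovid Sally
```

Storing every value in an array reference avoids having to check whether a name has already been seen.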


Unsafe-character encoding

According to the URL specification (RFC 1738), any character that is not alphanumeric or one of these special characters:
$-_.+!*'(),

is considered unsafe, and must be encoded unless it's being used for a designated special purpose.
In practice, the only characters that are not commonly encoded are alphanumerics and these characters:
-_.

The encoding scheme for unsafe characters replaces each one with a percent sign (%) followed by the character's two-digit hexadecimal value. This means that a literal percent sign must itself be encoded as %25 when it is not introducing an encoded character. Additionally, any spaces in the original string are replaced with the + (plus) sign, so a literal plus sign must be encoded as %2B. What does this look like? Here are a few strings and their encoded equivalents:
String               Encoded equivalent
wew@bearnet.com      wew%40bearnet.com
Big light in sky     Big+light+in+sky
25^3 & 14% of 12.    25%5E3+%26+14%25+of+12.
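
The encoding rules described above can be sketched in Perl as a pair of substitutions. This is an illustration only: the names url_encode and url_decode are hypothetical, and the code assumes the text contains only single-byte characters (character values 0 through 255).

```perl
use strict;
use warnings;

# Hypothetical helpers; the names are illustrative, not from a library.
# Assumption: the input holds single-byte characters only.
sub url_encode {
    my ($text) = @_;
    # Encode everything except alphanumerics, "-", "_", "." and space...
    $text =~ s/([^A-Za-z0-9\-_. ])/sprintf('%%%02X', ord $1)/ge;
    # ...then turn the remaining spaces into plus signs.
    $text =~ tr/ /+/;
    return $text;
}

sub url_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;                              # plus signs back to spaces first
    $text =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;    # then %XX back to the character
    return $text;
}

print url_encode('25^3 & 14% of 12.'), "\n";    # 25%5E3+%26+14%25+of+12.
print url_decode('wew%40bearnet.com'), "\n";    # wew@bearnet.com
```

The order of the two decoding steps matters: converting + to a space before expanding %2B ensures that an encoded literal plus sign is not mistaken for a space.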


DATA as a File

Perl has two special tokens, __END__ and __DATA__, which, if on a line by themselves, tell Perl that it’s reached the end of the program and to stop compiling. However, the __DATA__ token also tells Perl that it can read the data after said token. (__END__ can sometimes do this too, but read perldoc perldata for the details and pretend you never knew you could do this.) Listing 3-2 (code file listing_3_2_reading_from_data.pl) has an example.
Listing 3-2: Reading DATA
use strict;
use warnings;
use diagnostics;
use Data::Dumper;
my %config;
while (<DATA>) {
  next if /^\s*#/; # skip comments
  next unless /(\w+)\s*=\s*(\w+)/; # key = value
  my ( $key, $value ) = ( $1, $2 );
  if ( exists $config{$key} ) { # convert the value to an array reference
    # Does $config{$key} currently store a scalar or an aref?
    if( ! ref $config{$key} ) {
      $config{$key} = [ $config{$key} ];
    }
    push @{ $config{$key} } => $value;
  } # end if
  else {
    $config{$key} = $value;
  }
}
print Dumper(\%config);
__DATA__
# max_tries = 3
max_tries = 2 
timeout = 30
# only these people are OK
user = Ovid
user = Sally
user = Bob

Running the code in Listing 3-2 prints something similar to the following:
$VAR1 = {
          'max_tries' => '2',
          'timeout' => '30',
          'user' => [
                      'Ovid',
                      'Sally',
                      'Bob'
                    ]
        };
In this case, you used the DATA section of your code to embed a tiny config file. As a general rule, you can read from the DATA section only once, but if you need to read from it more than once, use the following code:


# Find the start of the __DATA__ section
my $data_start = tell DATA;
while ( <DATA> ) {
  #do something
}
# Reset DATA filehandle to start of __DATA__
seek DATA, $data_start, 0;
In case you are wondering, yes, you can also write to the DATA section if you have the correct permission, but this is generally a bad idea. (Hint: If you get it wrong, you can overwrite your program.)

Encode - character encodings in Perl

use Encode qw(decode encode);
$characters = decode('UTF-8', $octets,     Encode::FB_CROAK);
$octets     = encode('UTF-8', $characters, Encode::FB_CROAK);

Encode consists of a collection of modules whose details are too extensive to fit in one document. This one itself explains the top-level APIs and general topics at a glance. For other topics and more details, see the documentation for the individual Encode modules.
The Encode module provides the interface between Perl strings and the rest of the system. Perl strings are sequences of characters. The set of characters that Perl can represent is a superset of those defined by the Unicode Consortium. On most platforms the ordinal value of a character as returned by ord(S) is the Unicode codepoint for that character. The exceptions are platforms where the legacy encoding is some variant of EBCDIC rather than a superset of ASCII; see perlebcdic.
In recent history, data has been moved around computers in 8-bit chunks, often called "bytes" but also known as "octets" in standards documents. Perl is widely used to manipulate data of many types: not only strings of characters representing human or computer languages, but also "binary" data, being the machine's representation of numbers, pixels in an image, etc.
When Perl is processing binary data, the programmer wants Perl to process "sequences of bytes". This is not a problem for Perl: because a byte has 256 possible values, it easily fits in Perl's much larger "logical character".
This document mostly explains the how. perlunitut and perlunifaq explain the why.
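
The difference between characters and octets is easier to see with a round trip. The following is a minimal sketch using the encode and decode calls shown above; the sample string and variable names are chosen only for illustration. Note that the sketch combines Encode::FB_CROAK with Encode::LEAVE_SRC, because the Encode documentation warns that when a check value is given without LEAVE_SRC, the source string may be overwritten in place.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "\x{e9}" is the single character e-acute (U+00E9).
my $characters = "caf\x{e9}";

# LEAVE_SRC keeps encode/decode from modifying their input in place.
my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

# Encoding turns 4 characters into 5 octets: U+00E9 takes two bytes in UTF-8.
my $octets = encode( 'UTF-8', $characters, $check );
printf "%d characters, %d octets\n", length($characters), length($octets);
# prints: 4 characters, 5 octets

# Decoding reverses the step and gives back the original characters.
my $round_trip = decode( 'UTF-8', $octets, $check );
print "round trip ok\n" if $round_trip eq $characters;

# FB_CROAK makes malformed input fatal, so wrap the call in eval to handle it.
my $bad = "\xFF";    # a lone 0xFF byte is never valid UTF-8
my $ok  = eval { decode( 'UTF-8', $bad, $check ); 1 };
print "malformed input rejected\n" unless $ok;
```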

Data Encoding - Exercise

Click the Exercise link below to examine and analyze the decoding in a subroutine.
Data Encoding - Exercise
