You are viewing the version of this documentation from Perl 5.12.1.

NAME

Encode - character encodings

SYNOPSIS

use Encode;

Table of Contents

Encode consists of a collection of modules whose details are too big to fit in one document. This POD itself explains the top-level APIs and general topics at a glance. For other topics and more details, see the PODs below:

Name                  Description
--------------------------------------------------------
Encode::Alias         Alias definitions to encodings
Encode::Encoding      Encode Implementation Base Class
Encode::Supported     List of Supported Encodings
Encode::CN            Simplified Chinese Encodings
Encode::JP            Japanese Encodings
Encode::KR            Korean Encodings
Encode::TW            Traditional Chinese Encodings
--------------------------------------------------------

DESCRIPTION

The Encode module provides the interfaces between Perl's strings and the rest of the system. Perl strings are sequences of characters.

The repertoire of characters that Perl can represent is at least that defined by the Unicode Consortium. On most platforms the ordinal values of the characters (as returned by ord(ch)) are the "Unicode codepoints" for the characters. The exceptions are those platforms where the legacy encoding is some variant of EBCDIC rather than a super-set of ASCII (see perlebcdic).

Traditionally, computer data has been moved around in 8-bit chunks often called "bytes". These chunks are also known as "octets" in networking standards. Perl is widely used to manipulate data of many types: not only strings of characters representing human or computer languages, but also "binary" data, which is the machine's representation of numbers, pixels in an image, or just about anything.

When Perl is processing "binary data", the programmer wants Perl to process "sequences of bytes". This is not a problem for Perl: because a byte has 256 possible values, it easily fits in Perl's much larger "logical character".
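As a minimal sketch of the point above, the following shows byte values and a larger code point held in the same kind of Perl string. The variable names are illustrative only, and U+263A is an arbitrary choice of a character above 255.

```perl
use strict;
use warnings;

# Three bytes of "binary" data; each element of the string is in 0..255.
my $binary = pack("C*", 0x00, 0x7f, 0xff);
my @values = map { ord } split //, $binary;
print "byte values: @values\n";

# A single logical character whose ordinal does not fit in a byte.
my $smiley = chr(0x263A);
printf "U+%04X is %d character(s) long\n", ord($smiley), length($smiley);
```

Both strings are ordinary Perl scalars; only the range of the ordinal values differs.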

TERMINOLOGY

PERL ENCODING API

$octets = encode(ENCODING, $string [, CHECK])

Encodes a string from Perl's internal form into ENCODING and returns a sequence of octets. ENCODING can be either a canonical name or an alias. For encoding names and aliases, see "Defining Aliases". For CHECK, see "Handling Malformed Data".

For example, to convert a string from Perl's internal format to iso-8859-1 (also known as Latin1):

$octets = encode("iso-8859-1", $string);

CAVEAT: When you run $octets = encode("utf8", $string), then $octets may not be equal to $string. Though they both contain the same data, the UTF8 flag for $octets is always off. When you encode anything, the UTF8 flag of the result is always off, even when it contains a completely valid utf8 string. See "The UTF8 flag" below.

If the $string is undef then undef is returned.
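The following sketch illustrates encode() and the caveat about the UTF8 flag. The sample string and variable names are assumptions made for illustration. utf8::is_utf8() is used only to inspect the flag.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string with one non-ASCII character, U+00E9.
my $string = "caf\x{e9}";

my $latin1 = encode("iso-8859-1", $string);   # U+00E9 becomes 1 octet
my $utf8   = encode("UTF-8", $string);        # U+00E9 becomes 2 octets

printf "%d characters\n",           length $string;
printf "%d octets as iso-8859-1\n", length $latin1;
printf "%d octets as UTF-8\n",      length $utf8;

# The result of encode() does not carry the UTF8 flag.
print "UTF8 flag is off\n" unless utf8::is_utf8($utf8);
```

The same four characters yield four octets in iso-8859-1 and five in UTF-8, which is why the result should be treated as octets rather than as text.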

$string = decode(ENCODING, $octets [, CHECK])

Decodes a sequence of octets assumed to be in ENCODING into Perl's internal form and returns the resulting string. As in encode(), ENCODING can be either a canonical name or an alias. For encoding names and aliases, see "Defining Aliases". For CHECK, see "Handling Malformed Data".

For example, to convert ISO-8859-1 data to a string in Perl's internal format:

$string = decode("iso-8859-1", $octets);

CAVEAT: When you run $string = decode("utf8", $octets), then $string may not be equal to $octets. Though they both contain the same data, the UTF8 flag for $string is on unless $octets consists entirely of ASCII data (or EBCDIC on EBCDIC machines). See "The UTF8 flag" below.

If $octets is undef then undef is returned.
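A matching sketch for decode(), assuming the input octets are valid UTF-8. The sample data and variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Five octets: "caf" in ASCII followed by the UTF-8 form of U+00E9.
my $octets = "caf\xc3\xa9";

my $string = decode("UTF-8", $octets);

printf "%d octets in\n",      length $octets;
printf "%d characters out\n", length $string;
printf "last character is U+%04X\n", ord substr($string, -1);

# Non-ASCII input leaves the UTF8 flag on in the decoded string.
print "UTF8 flag is on\n" if utf8::is_utf8($string);
```

After decoding, length() and substr() count characters, so the two-octet sequence is seen as the single character U+00E9.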

[$obj =] find_encoding(ENCODING)

Returns the encoding object corresponding to ENCODING. Returns undef if no matching ENCODING is found.

This object is what does the actual (en|de)coding.

$utf8 = decode($name, $bytes);

is in fact

$utf8 = do{
  $obj = find_encoding($name);
  croak qq(encoding "$name" not found) unless ref $obj;
  $obj->decode($bytes)
};

with more error checking.

Therefore you can save time by reusing this object as follows:

my $enc = find_encoding("iso-8859-1");
while(<>){
   my $utf8 = $enc->decode($_);
   # and do something with $utf8;
}

Besides ->decode and ->encode, other methods are available as well. For instance, ->name returns the canonical name of the encoding object.

find_encoding("latin1")->name; # iso-8859-1
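Putting these pieces together, here is a small sketch that looks up an encoding object once, checks that the lookup succeeded, and uses it in both directions. The name "x-no-such-encoding" is a made-up placeholder that is assumed not to be defined on the system.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Resolve the alias once and reuse the object.
my $enc = find_encoding("latin1");
die "encoding not found" unless ref $enc;
print "canonical name: ", $enc->name, "\n";

# Round trip through the object's own methods.
my $string  = "caf\x{e9}";
my $octets  = $enc->encode($string);
my $decoded = $enc->decode($octets);
print "round trip ok\n" if $decoded eq $string;

# An unknown name yields undef rather than an exception.
my $missing = find_encoding("x-no-such-encoding");
print "unknown encoding returns undef\n" unless defined $missing;
```

Checking the return value before calling methods on it avoids a "Can't call method on an undefined value" error when the name is misspelled.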

See