This is the version of this documentation from Perl 5.24.2.

NAME

perlunicode - Unicode support in Perl

DESCRIPTION

If you haven't already, before reading this document, you should become familiar with both perlunitut and perluniintro.

Unicode aims to UNI-fy the en-CODE-ings of all the world's character sets into a single Standard. For quite a few of the various coding standards that existed when Unicode was first created, converting from each to Unicode essentially meant adding a constant to each code point in the original standard, and converting back meant just subtracting that same constant. For ASCII and ISO-8859-1, the constant is 0. For ISO-8859-5 (Cyrillic), the constant is 864; for Hebrew (ISO-8859-8), it's 1264; Thai (ISO-8859-11), 3424; and so forth. This made it easy to do the conversions, and facilitated the adoption of Unicode.
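The constant-offset relationship can be checked with the core Encode module. The following is a minimal sketch, not part of the original text: the helper name offset_for is hypothetical, and the sample bytes (0xB0 for CYRILLIC CAPITAL LETTER A, 0xA1 for THAI CHARACTER KO KAI) were picked for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode one legacy byte and return the difference
# between the resulting Unicode code point and the original byte value.
sub offset_for {
    my ($encoding, $byte) = @_;
    my $char = decode($encoding, chr $byte);
    return ord($char) - $byte;
}

# 0xB0 in ISO-8859-5 is U+0410; 0xA1 in ISO-8859-11 is U+0E01.
printf "%-12s %d\n", 'iso-8859-5',  offset_for('iso-8859-5',  0xB0);   # 864
printf "%-12s %d\n", 'iso-8859-11', offset_for('iso-8859-11', 0xA1);   # 3424
printf "%-12s %d\n", 'iso-8859-1',  offset_for('iso-8859-1',  0xE9);   # 0
```

The offset holds for the letters of each script; a few positions in these encodings (punctuation and symbols) map elsewhere, which is why the text says "essentially".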

And it worked; nowadays, those legacy standards are rarely used. Most everyone uses Unicode.

Unicode is a comprehensive standard. It specifies many things outside the scope of Perl, such as how to display sequences of characters. For a full discussion of all aspects of Unicode, see http://www.unicode.org.

Important Caveats

Even though some of this section may not be understandable to you on first reading, we think it's important enough to highlight some of the gotchas before delving further, so here goes:

Unicode support is an extensive requirement. While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features.

Also, the use of Unicode may present security issues that aren't obvious. Read Unicode Security Considerations (Unicode Technical Report #36, http://www.unicode.org/reports/tr36).

Safest if you use feature 'unicode_strings'

In order to preserve backward compatibility, Perl does not turn on full internal Unicode support unless the pragma use feature 'unicode_strings' is specified. (This feature is enabled automatically by use 5.012 or higher.) Failure to do this can trigger surprises. See "The 'Unicode Bug'" below.

This pragma doesn't affect I/O. Nor does it change the internal representation of strings, only their interpretation. There are still several places where Unicode isn't fully supported, such as in filenames.
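As a rough illustration of "only their interpretation", the sketch below applies the same one-byte string to a regular expression and to uc with the feature off and on. The variable names are hypothetical, and it assumes no locale pragma is in effect.

```perl
use strict;
use warnings;

# U+00E9 (e with acute) and U+00E0 (a with grave). Both strings are stored
# internally as a single byte because no character exceeds 0xFF.
my $e_acute = "\xE9";
my $a_grave = "\xE0";

my ($word_without, $word_with, $uc_without, $uc_with);

{
    no feature 'unicode_strings';
    # Native (ASCII-only) rules: the character is not a word character
    # and has no uppercase mapping.
    $word_without = $e_acute =~ /\w/ ? 1 : 0;
    $uc_without   = sprintf '%02X', ord uc $a_grave;
}
{
    use feature 'unicode_strings';
    # Unicode rules: same string, same bytes, different interpretation.
    $word_with = $e_acute =~ /\w/ ? 1 : 0;
    $uc_with   = sprintf '%02X', ord uc $a_grave;
}

print "matches \\w without feature: $word_without\n";   # 0
print "matches \\w with feature:    $word_with\n";      # 1
print "uc without feature: $uc_without\n";              # E0
print "uc with feature:    $uc_with\n";                 # C0
```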

Input and Output Layers

Use the :encoding(...) layer to read from and write to filehandles using the specified encoding. (See the open pragma.)
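A minimal sketch of the layer in use, assuming UTF-8 as the encoding and a temporary file as the target (both chosen only for illustration): the text is written through an :encoding(UTF-8) layer, read back through the same layer, and compared.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file; it is removed automatically at program exit.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Six characters: c, a, f, U+00E9, space, U+263A.
my $text = "caf\x{E9} \x{263A}";

# Writing: the layer encodes characters into UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} $text;
close $out or die "Cannot close $path: $!";

# Reading: the layer decodes UTF-8 bytes back into characters.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $round_trip = do { local $/; <$in> };
close $in;

my $bytes_on_disk = -s $path;

printf "characters: %d, bytes on disk: %d\n",
    length $round_trip, $bytes_on_disk;    # characters: 6, bytes on disk: 9
```

The character count and the byte count differ because U+00E9 takes two bytes and U+263A takes three in UTF-8.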

You should convert your non-ASCII, non-UTF-8 Perl scripts to be UTF-8. See the encoding pragma.

use utf8 still needed to enable UTF-8 in scripts

If your Perl script is itself encoded in UTF-8, the use utf8 pragma must be explicitly included to enable recognition of that (in string or regular expression literals, or in identifier names). This is the only case in which an explicit use utf8 is needed. (See utf8.)
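The effect on string literals can be sketched as follows. This assumes the example itself is saved as UTF-8; the literal is U+263A (WHITE SMILING FACE), which occupies three bytes in that encoding, and the variable names are hypothetical.

```perl
use strict;
use warnings;

my ($length_with, $length_without);

{
    use utf8;
    # The three source bytes are read as one character.
    $length_with = length "☺";
}
{
    no utf8;
    # The same three source bytes are read as three separate characters.
    $length_without = length "☺";
}

print "with use utf8:    $length_with\n";      # 1
print "without use utf8: $length_without\n";   # 3
```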

BOM-marked scripts and UTF-16 scripts autodetected

However, if a Perl script begins with the Unicode BOM (UTF-16LE, UTF-16BE, or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either endianness, Perl will correctly read in the script as the appropriate Unicode encoding. (BOM-less UTF-8 cannot be effectively recognized or differentiated from ISO-8859-1 or other eight-bit encodings.)
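For reference, the BOM is the character U+FEFF placed at the very start of the file, and its byte pattern differs per encoding. The sketch below only prints those byte patterns using the core Encode module; it does not reproduce Perl's own detection logic, and the helper name bom_hex is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode U+FEFF in the given encoding and return
# the resulting bytes as space-separated hexadecimal.
sub bom_hex {
    my ($encoding) = @_;
    my $bytes = encode($encoding, "\x{FEFF}");
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

for my $encoding ('UTF-8', 'UTF-16LE', 'UTF-16BE') {
    printf "%-8s %s\n", $encoding, bom_hex($encoding);
}
# UTF-8    EF BB BF
# UTF-16LE FF FE
# UTF-16BE FE FF
```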