perlunicode - Unicode support in Perl
If you haven't already, before reading this document, you should become familiar with both perlunitut and perluniintro.
Unicode aims to UNI-fy the en-CODE-ings of all the world's character sets into a single Standard. For quite a few of the various coding standards that existed when Unicode was first created, converting from each to Unicode essentially meant adding a constant to each code point in the original standard, and converting back meant just subtracting that same constant. For ASCII and ISO-8859-1, the constant is 0. For ISO-8859-5 (Cyrillic), the constant is 864; for Hebrew (ISO-8859-8), it's 1264; Thai (ISO-8859-11), 3424; and so forth. This made it easy to do the conversions, and facilitated the adoption of Unicode.
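The constant-offset relationship can be checked with the core Encode module. The following is a minimal sketch, not a conversion routine: the sample bytes (0xB0 for Cyrillic capital A, 0xA1 for Thai KO KAI) and the %sample and %offset names are chosen here purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One sample byte per legacy encoding (illustrative choices).
my %sample = (
    'iso-8859-5'  => 0xB0,   # CYRILLIC CAPITAL LETTER A
    'iso-8859-11' => 0xA1,   # THAI CHARACTER KO KAI
);

# Decode each byte and record how far the Unicode code point lies
# from the original byte value.
my %offset;
for my $enc (sort keys %sample) {
    my $byte   = $sample{$enc};
    my $octets = chr $byte;
    my $char   = decode($enc, $octets);
    $offset{$enc} = ord($char) - $byte;
    printf "%-12s 0x%02X -> U+%04X (offset %d)\n",
        $enc, $byte, ord($char), $offset{$enc};
}
```

The offset holds for the script-specific letters; a few punctuation and symbol positions in these encodings map elsewhere, so real conversions should go through Encode rather than arithmetic.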
And it worked; nowadays, those legacy standards are rarely used. Most everyone uses Unicode.
Unicode is a comprehensive standard. It specifies many things outside the scope of Perl, such as how to display sequences of characters. For a full discussion of all aspects of Unicode, see http://www.unicode.org.
Even though some of this section may not be understandable to you on first reading, we think it's important enough to highlight some of the gotchas before delving further, so here goes:
Unicode support is an extensive requirement. While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features.
Also, the use of Unicode may present security issues that aren't obvious. Read Unicode Security Considerations.
use feature 'unicode_strings'
In order to preserve backward compatibility, Perl does not turn on full internal Unicode support unless the pragma use feature 'unicode_strings' is specified. (This is automatically selected if you use 5.012 or higher.) Failure to do this can lead to surprising results. See "The 'Unicode Bug'" below.
This pragma doesn't affect I/O. Nor does it change the internal representation of strings, only their interpretation. There are still several places where Unicode isn't fully supported, such as in filenames.
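As a rough illustration of "interpretation, not representation", the sketch below applies the same one-character string to \w and uc with the feature off and then on. The variable names are illustrative, and the example assumes no locale pragma is in effect.

```perl
use strict;
use warnings;

# Code point 0xE9 (LATIN SMALL LETTER E WITH ACUTE), held in Perl's
# native, non-UTF-8 internal representation.
my $e_acute = chr(0xE9);

my ($word_without, $uc_without, $word_with, $uc_with);

{
    no feature 'unicode_strings';
    $word_without = $e_acute =~ /\w/ ? 1 : 0;   # not treated as a word character
    $uc_without   = ord uc $e_acute;            # left unchanged
}
{
    use feature 'unicode_strings';
    $word_with = $e_acute =~ /\w/ ? 1 : 0;      # treated as a word character
    $uc_with   = ord uc $e_acute;               # uppercased to 0xC9
}

printf "without: \\w match=%d, uc gives 0x%02X\n", $word_without, $uc_without;
printf "with:    \\w match=%d, uc gives 0x%02X\n", $word_with, $uc_with;

# The string itself is stored the same way in both blocks.
print "UTF-8 flag: ", (utf8::is_utf8($e_acute) ? "on" : "off"), "\n";
```

The pragma is lexically scoped, so the two blocks can coexist in one file.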
Use the :encoding(...) layer to read from and write to filehandles using the specified encoding. (See open.)
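A minimal sketch of a round trip through the layer follows. It writes to an in-memory buffer instead of a real file so that it is self-contained; the sample text and variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "caf\x{E9} \x{263A}";   # 6 characters, two of them non-ASCII

# Write through an encoding layer. The target is an in-memory buffer,
# which ends up holding the UTF-8 encoded bytes.
my $bytes = '';
open(my $out, '>:encoding(UTF-8)', \$bytes) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

# Read the same bytes back through the matching layer to get characters.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open for read: $!";
my $round_trip = do { local $/; <$in> };
close($in);

printf "characters: %d, encoded bytes: %d\n", length($text), length($bytes);
print "round trip ", ($round_trip eq $text ? "ok" : "FAILED"), "\n";
```

With a real file, the scalar reference would simply be replaced by a filename.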
See encoding.
use utf8 still needed to enable UTF-8 in scripts
If your Perl script is itself encoded in UTF-8, the use utf8 pragma must be explicitly included to enable recognition of that (in string or regular expression literals, or in identifier names). This is the only time when an explicit use utf8 is needed. (See utf8.)
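The effect on string literals can be seen by measuring the same literal with and without the pragma. This sketch assumes the script file itself is saved as UTF-8; the variable names are illustrative.

```perl
use strict;
use warnings;

my ($len_with, $len_without);

{
    use utf8;                      # the literal is decoded as UTF-8 text
    $len_with = length("é");       # one character
}
{
    no utf8;                       # the literal is taken byte by byte
    $len_without = length("é");    # the two bytes of its UTF-8 encoding
}

print "with use utf8: $len_with, without: $len_without\n";
```

Like other pragmas, use utf8 is lexically scoped, which is what lets both blocks live in the same file.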
BOM-marked scripts and UTF-16 scripts autodetected
However, if a Perl script begins with the Unicode BOM (UTF-16LE, UTF-16BE, or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either endianness, Perl will correctly read in the script as the appropriate Unicode encoding. (BOM-less UTF-8 cannot be effectively recognized or differentiated from ISO 8859-1 or other eight-bit encodings.)
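For reference, the byte signatures involved are easy to test for. The following is only a sketch of BOM sniffing in general: bom_encoding is a hypothetical helper written for this example, not a Perl built-in, and it does not claim to reproduce how perl itself inspects a script.

```perl
use strict;
use warnings;

# Hypothetical helper: name the encoding implied by a leading BOM,
# or return nothing if the bytes do not start with one.
sub bom_encoding {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

# Three sample script beginnings: UTF-8 BOM, UTF-16LE BOM, and no BOM.
for my $start ("\xEF\xBB\xBFprint 1;", "\xFF\xFEp\x00", "print 1;") {
    my $enc = bom_encoding($start);
    print defined $enc ? $enc : 'no BOM', "\n";
}
```

The last case shows why BOM-less UTF-8 is a problem: with no signature, the bytes alone do not say which eight-bit encoding was intended.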