| 1 | =head1 NAME
|
|---|
| 2 |
|
|---|
| 3 | perlunicode - Unicode support in Perl
|
|---|
| 4 |
|
|---|
| 5 | =head1 DESCRIPTION
|
|---|
| 6 |
|
|---|
| 7 | =head2 Important Caveats
|
|---|
| 8 |
|
|---|
| 9 | Unicode support is an extensive requirement. While Perl does not
|
|---|
| 10 | implement the Unicode standard or the accompanying technical reports
|
|---|
| 11 | from cover to cover, Perl does support many Unicode features.
|
|---|
| 12 |
|
|---|
| 13 | =over 4
|
|---|
| 14 |
|
|---|
| 15 | =item Input and Output Layers
|
|---|
| 16 |
|
|---|
| 17 | Perl knows when a filehandle uses Perl's internal Unicode encodings
|
|---|
| 18 | (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
|
|---|
| 19 | the ":utf8" layer. Other encodings can be converted to Perl's
|
|---|
| 20 | encoding on input or from Perl's encoding on output by use of the
|
|---|
| 21 | ":encoding(...)" layer. See L<open>.
|
|---|
| 22 |
|
|---|
| 23 | To indicate that Perl source itself is using a particular encoding,
|
|---|
| 24 | see L<encoding>.
|
|---|
| 25 |
|
|---|
| 26 | =item Regular Expressions
|
|---|
| 27 |
|
|---|
| 28 | The regular expression compiler produces polymorphic opcodes. That is,
|
|---|
| 29 | the pattern adapts to the data and automatically switches to the Unicode
|
|---|
| 30 | character scheme when presented with Unicode data--or instead uses
|
|---|
| 31 | a traditional byte scheme when presented with byte data.
|
|---|
| 32 |
|
|---|
| 33 | =item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
|
|---|
| 34 |
|
|---|
| 35 | As a compatibility measure, the C<use utf8> pragma must be explicitly
|
|---|
| 36 | included to enable recognition of UTF-8 in the Perl scripts themselves
|
|---|
| 37 | (in string or regular expression literals, or in identifier names) on
|
|---|
| 38 | ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
|
|---|
| 39 | machines. B<These are the only times when an explicit C<use utf8>
|
|---|
| 40 | is needed.> See L<utf8>.
|
|---|
| 41 |
|
|---|
| 42 | You can also use the C<encoding> pragma to change the default encoding
|
|---|
| 43 | of the data in your script; see L<encoding>.
|
|---|
| 44 |
|
|---|
| 45 | =item BOM-marked scripts and UTF-16 scripts autodetected
|
|---|
| 46 |
|
|---|
| 47 | If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
|
|---|
| 48 | or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
|
|---|
| 49 | endianness, Perl will correctly read in the script as Unicode.
|
|---|
| 50 | (BOMless UTF-8 cannot be effectively recognized or differentiated from
|
|---|
| 51 | ISO 8859-1 or other eight-bit encodings.)
|
|---|
| 52 |
|
|---|
| 53 | =item C<use encoding> needed to upgrade non-Latin-1 byte strings
|
|---|
| 54 |
|
|---|
| 55 | By default, there is a fundamental asymmetry in Perl's Unicode model:
|
|---|
| 56 | implicit upgrading from byte strings to Unicode strings assumes that
|
|---|
| 57 | they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
|
|---|
| 58 | downgraded with UTF-8 encoding. This happens because the first 256
|
|---|
| 59 | codepoints in Unicode happen to agree with Latin-1.
|
|---|
| 60 |
|
|---|
| 61 | If you wish to interpret byte strings as UTF-8 instead, use the
|
|---|
| 62 | C<encoding> pragma:
|
|---|
| 63 |
|
|---|
| 64 | use encoding 'utf8';
|
|---|
| 65 |
|
|---|
| 66 | See L</"Byte and Character Semantics"> for more details.
|
|---|
| 67 |
|
|---|
| 68 | =back
|
|---|
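As an illustration of the input and output layers item above, the following minimal sketch writes a string through an C<:encoding(UTF-8)> layer and reads the result back both with and without the layer. The in-memory filehandle and the variable names are assumptions made for the example; a real program would normally open a file on disk.

```perl
use strict;
use warnings;

# A string with one character above 255 (U+263A, a smiley face).
my $text = "smiley: \x{263A}";

# Hypothetical stand-in for a file: an in-memory buffer.
my $buffer = '';

# On output, the layer converts Perl's characters to UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

# Reading without a decoding layer yields the raw bytes.
open(my $raw, '<:raw', \$buffer) or die "open raw: $!";
my $undecoded = do { local $/; <$raw> };
close($raw);

# Reading with the layer converts the bytes back to characters.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $decoded = do { local $/; <$in> };
close($in);

printf "characters with the layer: %d\n", length $decoded;     # 9
printf "bytes without the layer:   %d\n", length $undecoded;   # 11
```

The smiley counts as one character once decoded but occupies three bytes in the UTF-8 output, which accounts for the difference between the two lengths.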
| 69 |
|
|---|
| 70 | =head2 Byte and Character Semantics
|
|---|
| 71 |
|
|---|
| 72 | Beginning with version 5.6, Perl uses logically-wide characters to
|
|---|
| 73 | represent strings internally.
|
|---|
| 74 |
|
|---|
| 75 | In future, Perl-level operations will be expected to work with
|
|---|
| 76 | characters rather than bytes.
|
|---|
| 77 |
|
|---|
| 78 | However, as an interim compatibility measure, Perl aims to
|
|---|
| 79 | provide a safe migration path from byte semantics to character
|
|---|
| 80 | semantics for programs. For operations where Perl can unambiguously
|
|---|
| 81 | decide that the input data are characters, Perl switches to
|
|---|
| 82 | character semantics. For operations where this determination cannot
|
|---|
| 83 | be made without additional information from the user, Perl decides in
|
|---|
| 84 | favor of compatibility and chooses to use byte semantics.
|
|---|
| 85 |
|
|---|
| 86 | This behavior preserves compatibility with earlier versions of Perl,
|
|---|
| 87 | which allowed byte semantics in Perl operations only if
|
|---|
| 88 | none of the program's inputs were marked as being a source of Unicode
|
|---|
| 89 | character data. Such data may come from filehandles, from calls to
|
|---|
| 90 | external programs, from information provided by the system (such as %ENV),
|
|---|
| 91 | or from literals and constants in the source text.
|
|---|
| 92 |
|
|---|
| 93 | The C<bytes> pragma will always, regardless of platform, force byte
|
|---|
| 94 | semantics in a particular lexical scope. See L<bytes>.
|
|---|
| 95 |
|
|---|
| 96 | The C<utf8> pragma is primarily a compatibility device that enables
|
|---|
| 97 | recognition of UTF-8 or UTF-EBCDIC in literals encountered by the parser.
|
|---|
| 98 | Note that this pragma is only required while Perl defaults to byte
|
|---|
| 99 | semantics; when character semantics become the default, this pragma
|
|---|
| 100 | may become a no-op. See L<utf8>.
|
|---|
| 101 |
|
|---|
| 102 | Unless explicitly stated, Perl operators use character semantics
|
|---|
| 103 | for Unicode data and byte semantics for non-Unicode data.
|
|---|
| 104 | The decision to use character semantics is made transparently. If
|
|---|
| 105 | input data comes from a Unicode source--for example, if a character
|
|---|
| 106 | encoding layer is added to a filehandle or a literal Unicode
|
|---|
| 107 | string constant appears in a program--character semantics apply.
|
|---|
| 108 | Otherwise, byte semantics are in effect. The C<bytes> pragma should
|
|---|
| 109 | be used to force byte semantics on Unicode data.
|
|---|
| 110 |
|
|---|
| 111 | If strings operating under byte semantics and strings with Unicode
|
|---|
| 112 | character data are concatenated, the new string will be created by
|
|---|
| 113 | decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
|
|---|
| 114 | old Unicode string used EBCDIC. This translation is done without
|
|---|
| 115 | regard to the system's native 8-bit encoding. To change this for
|
|---|
| 116 | systems with non-Latin-1 and non-EBCDIC native encodings, use the
|
|---|
| 117 | C<encoding> pragma. See L<encoding>.
|
|---|
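The Latin-1 rule for mixed concatenation can be seen in a short sketch. It assumes an ASCII-based platform, and it uses the core C<Encode> module's C<decode> function to show the explicit alternative; that choice is an assumption of this example rather than the mechanism described above, which is the C<encoding> pragma.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A byte string holding the two UTF-8 bytes of U+00E9 (e with acute).
my $bytes   = "\xC3\xA9";
# A string that already contains a character above 255.
my $unicode = "\x{263A}";

# Implicit upgrade: each byte is taken as one Latin-1 character,
# so the result is U+00C3, U+00A9, U+263A.
my $joined = $bytes . $unicode;
printf "implicit: %d characters (%s)\n", length $joined,
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $joined;

# Explicit decoding first: the two bytes become the single character U+00E9.
my $decoded_joined = decode('UTF-8', $bytes) . $unicode;
printf "explicit: %d characters (%s)\n", length $decoded_joined,
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $decoded_joined;
```

The first result has three characters and the second has two, because only the explicit decode treats the byte pair as UTF-8.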
| 118 |
|
|---|
| 119 | Under character semantics, many operations that formerly operated on
|
|---|
| 120 | bytes now operate on characters. A character in Perl is
|
|---|
| 121 | logically just a number ranging from 0 to 2**31 or so. Larger
|
|---|
| 122 | characters may encode into longer sequences of bytes internally, but
|
|---|
| 123 | this internal detail is mostly hidden from Perl code.
|
|---|
| 124 | See L<perluniintro> for more.
|
|---|
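A minimal sketch of the difference between the logical character and its internal byte sequence, assuming an ASCII-based platform where the internal encoding is UTF-8. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $char = "\x{263A}";              # one character, code point 0x263A

# Character semantics: the string is one character long.
my $characters = length $char;

my $internal_bytes;
{
    use bytes;                      # force byte semantics in this scope only
    $internal_bytes = length $char; # length of the internal byte sequence
}

printf "code point: %d, characters: %d, internal bytes: %d\n",
    ord($char), $characters, $internal_bytes;
# On a UTF-8 platform this reports 9786, 1 and 3.
```

Outside the block that enables C<bytes>, the same string is again treated as a single character.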
| 125 |
|
|---|
| 126 | =head2 Effects of Character Semantics
|
|---|
| 127 |
|
|---|
| 128 | Character semantics have the following effects:
|
|---|
| 129 |
|
|---|
| 130 | =over 4
|
|---|
| 131 |
|
|---|
| 132 | =item *
|
|---|
| 133 |
|
|---|
| 134 | Strings--including hash keys--and regular expression patterns may
|
|---|
| 135 | contain characters that have an ordinal value larger than 255.
|
|---|
| 136 |
|
|---|
| 137 | If you use a Unicode editor to edit your program, Unicode characters
|
|---|
| 138 | may occur directly within the literal strings in one of the various
|
|---|
| 139 | Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
|
|---|
| 140 | as such and converted to Perl's internal representation only if the
|
|---|
| 141 | appropriate L<encoding> is specified.
|
|---|
| 142 |
|
|---|
| 143 | Unicode characters can also be added to a string by using the
|
|---|
| 144 | C<\x{...}> notation. The Unicode code for the desired character, in
|
|---|
| 145 | hexadecimal, should be placed in the braces. For instance, a smiley
|
|---|
| 146 | face is C<\x{263A}>. This notation works for all characters, but for
|
|---|
| 147 | characters under 0x100 Perl may use an 8-bit encoding internally.
|
|---|
| 148 |
|
|---|
| 149 | Additionally, if you
|
|---|
| 150 |
|
|---|
| 151 | use charnames ':full';
|
|---|
| 152 |
|
|---|
| 153 | you can use the C<\N{...}> notation and put the official Unicode
|
|---|
| 154 | character name within the braces, such as C<\N{WHITE SMILING FACE}>.
|
|---|
| 155 |
|
|---|
| 156 |
|
|---|
| 157 | =item *
|
|---|
| 158 |
|
|---|
| 159 | If an appropriate L<encoding> is specified, identifiers within the
|
|---|
| 160 | Perl script may contain Unicode alphanumeric characters, including
|
|---|
| 161 | ideographs. Perl does not currently attempt to canonicalize variable
|
|---|
| 162 | names.
|
|---|
| 163 |
|
|---|
| 164 | =item *
|
|---|
| 165 |
|
|---|
| 166 | Regular expressions match characters instead of bytes. "." matches
|
|---|
| 167 | a character instead of a byte. The C<\C> pattern is provided to force
|
|---|
| 168 | a match of a single byte--a C<char> in C, hence C<\C>.
|
|---|
| 169 |
|
|---|
| 170 | =item *
|
|---|
| 171 |
|
|---|
| 172 | Character classes in regular expressions match characters instead of
|
|---|
| 173 | bytes and match against the character properties specified in the
|
|---|
|
|---|