perl 5.8.8

=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

=over 4

=item Input and Output Layers

Perl knows that a filehandle uses Perl's internal Unicode encoding
(UTF-8, or UTF-EBCDIC on EBCDIC platforms) if the filehandle is opened
with the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is using a particular encoding,
see L<encoding>.

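The layer behavior described in this item can be sketched as follows.
This is a minimal illustration, not a recommended I/O pattern: it
assumes an ASCII-based platform, and it uses in-memory filehandles
(a reference to a scalar in place of a file name) purely so that the
example needs no files on disk. All variable names are made up for
the example.

```perl
use strict;
use warnings;

# WHITE SMILING FACE, U+263A: one character, three bytes in UTF-8.
my $smiley = "\x{263A}";

# Write through the ":utf8" layer into an in-memory buffer
# (the buffer stands in for a real file).
my $buffer = '';
open(my $out, '>:utf8', \$buffer) or die "cannot open for writing: $!";
print {$out} $smiley;
close($out) or die "cannot close: $!";

# The buffer holds raw bytes, so length() counts bytes here.
my $byte_count = length($buffer);

# Read the same bytes back through the ":utf8" layer; Perl now
# treats the data as characters.
open(my $in, '<:utf8', \$buffer) or die "cannot open for reading: $!";
my $round_trip = <$in>;
close($in) or die "cannot close: $!";
my $char_count = length($round_trip);

# ":encoding(...)" converts from Perl's encoding on output; here
# U+00E9 is written as a single ISO 8859-1 byte.
my $latin1 = '';
open(my $l1, '>:encoding(iso-8859-1)', \$latin1)
    or die "cannot open for writing: $!";
print {$l1} "\x{E9}";
close($l1) or die "cannot close: $!";
my $latin1_bytes = length($latin1);

print "$byte_count $char_count $latin1_bytes\n";
```

With a real file, the same layer strings are passed as the mode
argument of a three-argument C<open>.
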
=item Regular Expressions

The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with Unicode data--or instead uses
a traditional byte scheme when presented with byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
codepoints in Unicode happen to agree with Latin-1.

If you wish to interpret byte strings as UTF-8 instead, use the
C<encoding> pragma:

    use encoding 'utf8';

See L</"Byte and Character Semantics"> for more details.

=back

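The Latin-1 assumption made during implicit upgrading, described in
the last item above, can be observed with a short sketch. It assumes
an ASCII-based platform. Instead of the C<encoding> pragma it uses
the C<decode> function of the core C<Encode> module to interpret the
bytes as UTF-8 explicitly; that substitution is a choice made for
this example, not something the text above prescribes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two bytes that happen to be the UTF-8 encoding of U+00E9.
my $bytes = "\xC3\xA9";

# A string that is certainly Unicode (code point above 255).
my $wide = "\x{263A}";

# Implicit upgrade: each byte is taken as one Latin-1 character,
# so the result has 2 + 1 characters.
my $implicit = $bytes . $wide;
my $implicit_length = length($implicit);

# Explicit decoding: the two bytes become the single character
# U+00E9, so the result has 1 + 1 characters.
my $explicit = decode('UTF-8', $bytes) . $wide;
my $explicit_length = length($explicit);

printf "implicit: %d characters, explicit: %d characters\n",
    $implicit_length, $explicit_length;
```
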
=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics. For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
be used to force byte semantics on Unicode data.

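As a small sketch of the lexical effect of the C<bytes> pragma, the
following compares C<length> on the same string inside and outside a
C<use bytes> block. It assumes an ASCII-based platform, where the
internal encoding is UTF-8; the variable names are illustrative only.

```perl
use strict;
use warnings;

# A literal Unicode string constant: character semantics apply.
my $string = "\x{263A}";

# Outside "use bytes", length() counts characters.
my $char_length = length($string);

my $byte_length;
{
    # Inside this block only, length() counts the bytes of the
    # internal UTF-8 representation.
    use bytes;
    $byte_length = length($string);
}

print "characters: $char_length, bytes: $byte_length\n";
```

The pragma changes how operations in its scope view the string; the
string itself is not modified.
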
If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC. This translation is done without
regard to the system's native 8-bit encoding. To change this for
systems with non-Latin-1 and non-EBCDIC native encodings, use the
C<encoding> pragma. See L<encoding>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

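The idea that a character is logically just a number can be
illustrated with C<chr> and C<ord>. The sketch below assumes an
ASCII-based platform and uses C<encode> from the core C<Encode>
module only to make the longer byte sequence visible. The code point
chosen is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Any code point will do; this one lies well above 0xFFFF.
my $code_point = 0x1F600;

# chr() turns the number into a one-character string, and ord()
# recovers the number.
my $char    = chr($code_point);
my $length  = length($char);
my $ordinal = ord($char);

# Encoded as UTF-8, this single character occupies four bytes.
my $octets      = encode('UTF-8', $char);
my $octet_count = length($octets);

printf "U+%04X: %d character, %d bytes in UTF-8\n",
    $ordinal, $length, $octet_count;
```
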
=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters
may occur directly within the literal strings in one of the various
Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
as such and converted to Perl's internal representation only if the
appropriate L<encoding> is specified.

Unicode characters can also be added to a string by using the
C<\x{...}> notation. The Unicode code for the desired character, in
hexadecimal, should be placed in the braces. For instance, a smiley
face is C<\x{263A}>. This encoding scheme only works for characters
with a code of 0x100 or above.

Additionally, if you

    use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.

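The two notations in this item can be checked against each other
with a short sketch. The call to C<charnames::viacode>, which looks
up a name from a code point, is added here for illustration and is
not mentioned in the text above.

```perl
use strict;
use warnings;
use charnames ':full';

# The same character written two ways.
my $by_code = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";

# Both notations produce an identical one-character string.
my $same = ($by_code eq $by_name) ? 1 : 0;

# Look the official name up again from the code point.
my $name = charnames::viacode(0x263A);

print "same: $same, name: $name\n";
```
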
=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte. The C<\C> pattern is provided to force
a match of a single byte--a C<char> in C, hence C<\C>.

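A minimal sketch of C<.> matching whole characters on Unicode data
follows. It covers only the C<.> case and leaves C<\C> out. The
sample string is arbitrary.

```perl
use strict;
use warnings;

# Three characters; the middle one needs three bytes in UTF-8.
my $text = "a\x{263A}b";

# Each "." consumes one character, so three matches are captured.
my @chars  = $text =~ /(.)/g;
my $count  = scalar @chars;
my $middle = ord($chars[1]);

# A lone wide character is matched completely by a single ".".
my $single = ("\x{263A}" =~ /\A.\z/) ? 1 : 0;

printf "matches: %d, middle: U+%04X, single: %d\n",
    $count, $middle, $single;
```
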
=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the