perlunicode.pod@ 3310

Last change on this file since 3310 was 3181, checked in by bird, 19 years ago
perl 5.8.8
File size: 49.4 KB

=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

=over 4

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is using a particular encoding,
see L<encoding>.

=item Regular Expressions

The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with Unicode data--or instead uses
a traditional byte scheme when presented with byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
codepoints in Unicode happen to agree with Latin-1.

If you wish to interpret byte strings as UTF-8 instead, use the
C<encoding> pragma:

    use encoding 'utf8';

See L</"Byte and Character Semantics"> for more details.

=back
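
As a rough illustration of the layer and regular expression items
above, the following sketch writes one character through an
C<:encoding(UTF-8)> layer, reads it back through the same layer, and
applies one pattern to both the raw bytes and the decoded string. The
in-memory buffer stands in for a real file, the variable names are
hypothetical, and the counts in the final comment assume an
ASCII-based platform.

```perl
use strict;
use warnings;

# Hypothetical sample data: one character, U+263A WHITE SMILING FACE.
my $text = "\x{263A}";

# Write through an encoding layer into an in-memory buffer (a stand-in
# for a real file). The layer converts characters to UTF-8 bytes.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

# Read the same data back through the layer, which decodes the bytes
# into characters again.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $chars = do { local $/; <$in> };
close($in) or die "close: $!";

# The same pattern counts bytes in the byte string and characters in
# the decoded string.
my $units_in_bytes = () = $buffer =~ /./gs;
my $units_in_chars = () = $chars  =~ /./gs;

# On an ASCII-based platform this prints "bytes: 3, characters: 1".
printf "bytes: %d, characters: %d\n", $units_in_bytes, $units_in_chars;
```

The pattern itself is unchanged between the two matches; only the
kind of data it is applied to differs.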

=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics. For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.
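
A minimal sketch of that lexical scoping, assuming an ASCII-based
platform where the internal encoding is UTF-8 (the variable names are
illustrative only):

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";    # one character with a code above 0xFF

# Outside the block, character semantics apply to this string.
my $char_length = length $smiley;

# Inside the block, the bytes pragma forces byte semantics, so length
# reports the size of the internal encoding instead.
my $byte_length;
{
    use bytes;
    $byte_length = length $smiley;
}

# With UTF-8 as the internal encoding: "characters: 1, bytes: 3".
print "characters: $char_length, bytes: $byte_length\n";
```

Because the pragma is lexically scoped, code after the closing brace
sees character semantics again.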

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
be used to force byte semantics on Unicode data.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC. This translation is done without
regard to the system's native 8-bit encoding. To change this for
systems with non-Latin-1 and non-EBCDIC native encodings, use the
C<encoding> pragma. See L<encoding>.
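
The concatenation rule can be sketched as follows, assuming an
ASCII-based platform and no C<encoding> pragma in effect. The
variable names are hypothetical; C<utf8::is_utf8> is used here only
to show which strings carry Unicode character data.

```perl
use strict;
use warnings;

my $byte_string = "\xE9";       # one byte; e-acute when read as Latin-1
my $wide_string = "\x{263A}";   # Unicode character data (code above 0xFF)

my $joined = $byte_string . $wide_string;

# The byte 0xE9 is treated as the Latin-1 character U+00E9, so the
# result has two characters, not a reinterpreted UTF-8 sequence.
my $first_code = ord($joined);
my $is_unicode = utf8::is_utf8($joined) ? 1 : 0;

# Prints "length 2, first code point U+00E9, unicode flag 1".
printf "length %d, first code point U+%04X, unicode flag %d\n",
    length($joined), $first_code, $is_unicode;
```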

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.
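
To make the "character is a number" view concrete, here is a small
sketch using C<chr> and C<ord>. The code points are arbitrary sample
values chosen for illustration.

```perl
use strict;
use warnings;

# Arbitrary sample code points: ASCII, Latin-1, BMP, and beyond the BMP.
my @codes = (0x41, 0xE9, 0x263A, 0x1F600);

# chr turns each number into a character; the string has one character
# per number, however many bytes each needs internally.
my $string = join '', map { chr } @codes;

# ord recovers the original numbers, one per character.
my @round_trip = map { ord } split //, $string;

# Prints "4 characters: U+0041 U+00E9 U+263A U+1F600".
printf "%d characters: %s\n", length($string),
    join ' ', map { sprintf 'U+%04X', $_ } @round_trip;
```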

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters
may occur directly within the literal strings in one of the various
Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
as such and converted to Perl's internal representation only if the
appropriate L<encoding> is specified.

Unicode characters can also be added to a string by using the
C<\x{...}> notation. The Unicode code for the desired character, in
hexadecimal, should be placed in the braces. For instance, a smiley
face is C<\x{263A}>. This encoding scheme only works for characters
with a code of 0x100 or above.

Additionally, if you

    use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.
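
A short sketch comparing the two notations. The variable names are
illustrative; C<charnames::viacode> is used only to look the official
name back up from the code.

```perl
use strict;
use warnings;
use charnames ':full';

my $by_number = "\x{263A}";                 # hexadecimal code in braces
my $by_name   = "\N{WHITE SMILING FACE}";   # official Unicode name

# Both notations produce the same one-character string.
my $same = ($by_number eq $by_name) ? 1 : 0;

# Look the name back up from the numeric code.
my $name = charnames::viacode(ord $by_number);

# Prints "same: 1, U+263A is WHITE SMILING FACE".
printf "same: %d, U+%04X is %s\n", $same, ord($by_number), $name;
```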

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte. The C<\C> pattern is provided to force
a match of a single byte--a C<char> in C, hence C<\C>.

=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the