| 1 | =head1 NAME
|
|---|
| 2 |
|
|---|
| 3 | perlunicode - Unicode support in Perl
|
|---|
| 4 |
|
|---|
| 5 | =head1 DESCRIPTION
|
|---|
| 6 |
|
|---|
| 7 | =head2 Important Caveats
|
|---|
| 8 |
|
|---|
| 9 | Unicode support is an extensive requirement. While Perl does not
|
|---|
| 10 | implement the Unicode standard or the accompanying technical reports
|
|---|
| 11 | from cover to cover, Perl does support many Unicode features.
|
|---|
| 12 |
|
|---|
| 13 | =over 4
|
|---|
| 14 |
|
|---|
| 15 | =item Input and Output Layers
|
|---|
| 16 |
|
|---|
| 17 | Perl knows when a filehandle uses Perl's internal Unicode encodings
|
|---|
| 18 | (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
|
|---|
| 19 | the ":utf8" layer. Other encodings can be converted to Perl's
|
|---|
| 20 | encoding on input or from Perl's encoding on output by use of the
|
|---|
| 21 | ":encoding(...)" layer. See L<open>.
|
|---|
| 22 |
|
|---|
| 23 | To indicate that Perl source itself is using a particular encoding,
|
|---|
| 24 | see L<encoding>.
|
|---|
| 25 |
|
|---|
| 26 | =item Regular Expressions
|
|---|
| 27 |
|
|---|
| 28 | The regular expression compiler produces polymorphic opcodes. That is,
|
|---|
| 29 | the pattern adapts to the data and automatically switches to the Unicode
|
|---|
| 30 | character scheme when presented with Unicode data--or instead uses
|
|---|
| 31 | a traditional byte scheme when presented with byte data.
|
|---|
| 32 |
|
|---|
| 33 | =item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
|
|---|
| 34 |
|
|---|
| 35 | As a compatibility measure, the C<use utf8> pragma must be explicitly
|
|---|
| 36 | included to enable recognition of UTF-8 in the Perl scripts themselves
|
|---|
| 37 | (in string or regular expression literals, or in identifier names) on
|
|---|
| 38 | ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
|
|---|
| 39 | machines. B<These are the only times when an explicit C<use utf8>
|
|---|
| 40 | is needed.> See L<utf8>.
|
|---|
| 41 |
|
|---|
| 42 | You can also use the C<encoding> pragma to change the default encoding
|
|---|
| 43 | of the data in your script; see L<encoding>.
|
|---|
| 44 |
|
|---|
| 45 | =item BOM-marked scripts and UTF-16 scripts autodetected
|
|---|
| 46 |
|
|---|
| 47 | If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
|
|---|
| 48 | or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
|
|---|
| 49 | endianness, Perl will correctly read in the script as Unicode.
|
|---|
| 50 | (BOMless UTF-8 cannot be effectively recognized or differentiated from
|
|---|
| 51 | ISO 8859-1 or other eight-bit encodings.)
|
|---|
| 52 |
|
|---|
| 53 | =item C<use encoding> needed to upgrade non-Latin-1 byte strings
|
|---|
| 54 |
|
|---|
| 55 | By default, there is a fundamental asymmetry in Perl's Unicode model:
|
|---|
| 56 | implicit upgrading from byte strings to Unicode strings assumes that
|
|---|
| 57 | they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
|
|---|
| 58 | downgraded with UTF-8 encoding. This happens because the first 256
|
|---|
| 59 | code points in Unicode happen to agree with Latin-1.
|
|---|
| 60 |
|
|---|
| 61 | If you wish to interpret byte strings as UTF-8 instead, use the
|
|---|
| 62 | C<encoding> pragma:
|
|---|
| 63 |
|
|---|
| 64 | use encoding 'utf8';
|
|---|
| 65 |
|
|---|
| 66 | See L</"Byte and Character Semantics"> for more details.
|
|---|
| 67 |
|
|---|
| 68 | =back
|
|---|
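As a minimal sketch of the input and output layers described above, the following round-trips a string through a temporary file. The temporary file (created with the core C<File::Temp> module) and all variable names are assumptions made for illustration only; the three-argument form of C<open> is used to attach the layers.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, removed automatically at exit.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

my $text = "\x{263A} caf\x{E9}";    # a smiley, a space, and "cafe" with e-acute

# Encode Perl's characters to UTF-8 on output.
open(my $out, '>:encoding(UTF-8)', $path) or die "open for write: $!";
print {$out} $text;
close $out or die "close: $!";

# Read the same file back as raw bytes, with no decoding at all.
open(my $raw, '<:raw', $path) or die "open raw: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Read it back again, decoding UTF-8 into characters on input.
open(my $in, '<:encoding(UTF-8)', $path) or die "open for read: $!";
my $chars = do { local $/; <$in> };
close $in;

printf "characters: %d, bytes on disk: %d\n", length($chars), length($bytes);
```

The decoded string compares equal to the original, while the raw read shows how many bytes the encoding layer actually wrote.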
| 69 |
|
|---|
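The effect of C<use utf8> on literals can be sketched as follows. This example assumes the source file itself is saved as UTF-8; the variable names are illustrative only.

```perl
use strict;
use warnings;

# NOTE: assumes this file is saved as UTF-8.
# With the pragma, the parser decodes the literal into one character.
my $as_chars = do { use utf8; "é" };

# Without it, the same literal stays as the two bytes of its UTF-8 encoding.
my $as_bytes = do { no utf8; "é" };

printf "with use utf8: %d, without: %d\n",
    length($as_chars), length($as_bytes);
```

The pragma affects only how the parser reads the source text; it does not change how data read at run time is interpreted.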
| 70 | =head2 Byte and Character Semantics
|
|---|
| 71 |
|
|---|
| 72 | Beginning with version 5.6, Perl uses logically-wide characters to
|
|---|
| 73 | represent strings internally.
|
|---|
| 74 |
|
|---|
| 75 | In future, Perl-level operations will be expected to work with
|
|---|
| 76 | characters rather than bytes.
|
|---|
| 77 |
|
|---|
| 78 | However, as an interim compatibility measure, Perl aims to
|
|---|
| 79 | provide a safe migration path from byte semantics to character
|
|---|
| 80 | semantics for programs. For operations where Perl can unambiguously
|
|---|
| 81 | decide that the input data are characters, Perl switches to
|
|---|
| 82 | character semantics. For operations where this determination cannot
|
|---|
| 83 | be made without additional information from the user, Perl decides in
|
|---|
| 84 | favor of compatibility and chooses to use byte semantics.
|
|---|
| 85 |
|
|---|
| 86 | This behavior preserves compatibility with earlier versions of Perl,
|
|---|
| 87 | which allowed byte semantics in Perl operations only if
|
|---|
| 88 | none of the program's inputs were marked as being a source of Unicode
|
|---|
| 89 | character data. Such data may come from filehandles, from calls to
|
|---|
| 90 | external programs, from information provided by the system (such as %ENV),
|
|---|
| 91 | or from literals and constants in the source text.
|
|---|
| 92 |
|
|---|
| 93 | The C<bytes> pragma will always, regardless of platform, force byte
|
|---|
| 94 | semantics in a particular lexical scope. See L<bytes>.
|
|---|
| 95 |
|
|---|
| 96 | The C<utf8> pragma is primarily a compatibility device that enables
|
|---|
| 97 | recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
|
|---|
| 98 | Note that this pragma is only required while Perl defaults to byte
|
|---|
| 99 | semantics; when character semantics become the default, this pragma
|
|---|
| 100 | may become a no-op. See L<utf8>.
|
|---|
| 101 |
|
|---|
| 102 | Unless explicitly stated, Perl operators use character semantics
|
|---|
| 103 | for Unicode data and byte semantics for non-Unicode data.
|
|---|
| 104 | The decision to use character semantics is made transparently. If
|
|---|
| 105 | input data comes from a Unicode source--for example, if a character
|
|---|
| 106 | encoding layer is added to a filehandle or a literal Unicode
|
|---|
| 107 | string constant appears in a program--character semantics apply.
|
|---|
| 108 | Otherwise, byte semantics are in effect. The C<bytes> pragma should
|
|---|
| 109 | be used to force byte semantics on Unicode data.
|
|---|
| 110 |
|
|---|
| 111 | If strings operating under byte semantics and strings with Unicode
|
|---|
| 112 | character data are concatenated, the new string will be created by
|
|---|
| 113 | decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
|
|---|
| 114 | old Unicode string used EBCDIC. This translation is done without
|
|---|
| 115 | regard to the system's native 8-bit encoding. To change this for
|
|---|
| 116 | systems with non-Latin-1 and non-EBCDIC native encodings, use the
|
|---|
| 117 | C<encoding> pragma. See L<encoding>.
|
|---|
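The Latin-1 assumption during concatenation can be seen in a small sketch. Here the byte string happens to hold the UTF-8 encoding of U+00E9; instead of the C<encoding> pragma, this sketch decodes explicitly with the core C<Encode> module, which is a substitution made for illustration and not the approach described above.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA9";    # two bytes: the UTF-8 encoding of U+00E9
my $wide  = "\x{263A}";    # a character above 255 forces character semantics

# Implicit upgrade: each byte is treated as one Latin-1 character.
my $joined = $bytes . $wide;

# Explicit decoding first: the two bytes become a single character.
my $decoded = decode('UTF-8', $bytes) . $wide;

printf "implicit: %d characters, first is U+%04X\n",
    length($joined), ord(substr($joined, 0, 1));
printf "decoded:  %d characters, first is U+%04X\n",
    length($decoded), ord(substr($decoded, 0, 1));
```

Decoding at the point where bytes enter the program keeps the later concatenation free of surprises.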
| 118 |
|
|---|
| 119 | Under character semantics, many operations that formerly operated on
|
|---|
| 120 | bytes now operate on characters. A character in Perl is
|
|---|
| 121 | logically just a number ranging from 0 to 2**31 or so. Larger
|
|---|
| 122 | characters may encode into longer sequences of bytes internally, but
|
|---|
| 123 | this internal detail is mostly hidden for Perl code.
|
|---|
| 124 | See L<perluniintro> for more.
|
|---|
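A short sketch of the difference between character and byte semantics, using C<length()> on a single wide character. The variable names are illustrative.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";

# Character semantics: one logical character.
my $chars = length($smiley);

# The bytes pragma forces byte semantics inside this block only,
# exposing the length of the internal multi-byte representation.
my $octets = do { use bytes; length($smiley) };

printf "characters: %d, internal bytes: %d\n", $chars, $octets;
```

Outside the lexical scope of C<use bytes>, the same scalar is again treated as one character.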
| 125 |
|
|---|
| 126 | =head2 Effects of Character Semantics
|
|---|
| 127 |
|
|---|
| 128 | Character semantics have the following effects:
|
|---|
| 129 |
|
|---|
| 130 | =over 4
|
|---|
| 131 |
|
|---|
| 132 | =item *
|
|---|
| 133 |
|
|---|
| 134 | Strings--including hash keys--and regular expression patterns may
|
|---|
| 135 | contain characters that have an ordinal value larger than 255.
|
|---|
| 136 |
|
|---|
| 137 | If you use a Unicode editor to edit your program, Unicode characters
|
|---|
| 138 | may occur directly within the literal strings in one of the various
|
|---|
| 139 | Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
|
|---|
| 140 | as such and converted to Perl's internal representation only if the
|
|---|
| 141 | appropriate L<encoding> is specified.
|
|---|
| 142 |
|
|---|
| 143 | Unicode characters can also be added to a string by using the
|
|---|
| 144 | C<\x{...}> notation. The Unicode code for the desired character, in
|
|---|
| 145 | hexadecimal, should be placed in the braces. For instance, a smiley
|
|---|
| 146 | face is C<\x{263A}>. This notation works for any code point, although
|
|---|
| 147 | values below 0x100 may be stored internally in an eight-bit encoding.
|
|---|
| 148 |
|
|---|
| 149 | Additionally, if you
|
|---|
| 150 |
|
|---|
| 151 | use charnames ':full';
|
|---|
| 152 |
|
|---|
| 153 | you can use the C<\N{...}> notation and put the official Unicode
|
|---|
| 154 | character name within the braces, such as C<\N{WHITE SMILING FACE}>.
|
|---|
| 155 |
|
|---|
| 156 |
|
|---|
| 157 | =item *
|
|---|
| 158 |
|
|---|
| 159 | If an appropriate L<encoding> is specified, identifiers within the
|
|---|
| 160 | Perl script may contain Unicode alphanumeric characters, including
|
|---|
| 161 | ideographs. Perl does not currently attempt to canonicalize variable
|
|---|
| 162 | names.
|
|---|
| 163 |
|
|---|
| 164 | =item *
|
|---|
| 165 |
|
|---|
| 166 | Regular expressions match characters instead of bytes. "." matches
|
|---|
| 167 | a character instead of a byte. The C<\C> pattern is provided to force
|
|---|
| 168 | a match of a single byte--a C<char> in C, hence C<\C>.
|
|---|
| 169 |
|
|---|
| 170 | =item *
|
|---|
| 171 |
|
|---|
| 172 | Character classes in regular expressions match characters instead of
|
|---|
| 173 | bytes and match against the character properties specified in the
|
|---|
| 174 | Unicode properties database. C<\w> can be used to match a Japanese
|
|---|
| 175 | ideograph, for instance.
|
|---|
| 176 |
|
|---|
| 177 | (However, and as a limitation of the current implementation, using
|
|---|
| 178 | C<\w> or C<\W> I<inside> a C<[...]> character class will still match
|
|---|
| 179 | with byte semantics.)
|
|---|
| 180 |
|
|---|
| 181 | =item *
|
|---|
| 182 |
|
|---|
| 183 | Named Unicode properties, scripts, and block ranges may be used like
|
|---|
| 184 | character classes via the C<\p{}> "matches property" construct and
|
|---|
| 185 | the C<\P{}> negation, "doesn't match property".
|
|---|
| 186 |
|
|---|
| 187 | For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
|
|---|
| 188 | (Letter, uppercase) property, while C<\p{M}> matches any character
|
|---|
| 189 | with an "M" (mark--accents and such) property. Brackets are not
|
|---|
| 190 | required for single letter properties, so C<\p{M}> is equivalent to
|
|---|
| 191 | C<\pM>. Many predefined properties are available, such as
|
|---|
| 192 | C<\p{Mirrored}> and C<\p{Tibetan}>.
|
|---|
| 193 |
|
|---|
| 194 | The official Unicode script and block names have spaces and dashes as
|
|---|
| 195 | separators, but for convenience you can use dashes, spaces, or
|
|---|
| 196 | underbars, and case is unimportant. It is recommended, however, that
|
|---|
| 197 | for consistency you use the following naming: the official Unicode
|
|---|
| 198 | script, property, or block name (see below for the additional rules
|
|---|
| 199 | that apply to block names) with whitespace and dashes removed, and the
|
|---|
| 200 | words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
|
|---|
| 201 | becomes C<Latin1Supplement>.
|
|---|
| 202 |
|
|---|
| 203 | You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
|
|---|
| 204 | (^) between the first brace and the property name: C<\p{^Tamil}> is
|
|---|
| 205 | equal to C<\P{Tamil}>.
|
|---|
| 206 |
|
|---|
| 207 | B<NOTE: the properties, scripts, and blocks listed here are as of
|
|---|
| 208 | Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002. Unicode 4.0.0
|
|---|
| 209 | came out in April 2003, and Perl 5.8.1 in September 2003.>
|
|---|
| 210 |
|
|---|
| 211 | Here are the basic Unicode General Category properties, followed by their
|
|---|
| 212 | long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
|
|---|
| 213 | for instance, are identical.
|
|---|
| 214 |
|
|---|
| 215 | Short Long
|
|---|
| 216 |
|
|---|
| 217 | L Letter
|
|---|
| 218 | LC CasedLetter
|
|---|
| 219 | Lu UppercaseLetter
|
|---|
| 220 | Ll LowercaseLetter
|
|---|
| 221 | Lt TitlecaseLetter
|
|---|
| 222 | Lm ModifierLetter
|
|---|
| 223 | Lo OtherLetter
|
|---|
| 224 |
|
|---|
| 225 | M Mark
|
|---|
| 226 | Mn NonspacingMark
|
|---|
| 227 | Mc SpacingMark
|
|---|
| 228 | Me EnclosingMark
|
|---|
| 229 |
|
|---|
| 230 | N Number
|
|---|
| 231 | Nd DecimalNumber
|
|---|
| 232 | Nl LetterNumber
|
|---|
| 233 | No OtherNumber
|
|---|
| 234 |
|
|---|
| 235 | P Punctuation
|
|---|
| 236 | Pc ConnectorPunctuation
|
|---|
| 237 | Pd DashPunctuation
|
|---|
| 238 | Ps OpenPunctuation
|
|---|
| 239 | Pe ClosePunctuation
|
|---|
| 240 | Pi InitialPunctuation
|
|---|
| 241 | (may behave like Ps or Pe depending on usage)
|
|---|
| 242 | Pf FinalPunctuation
|
|---|
| 243 | (may behave like Ps or Pe depending on usage)
|
|---|
| 244 | Po OtherPunctuation
|
|---|
| 245 |
|
|---|
| 246 | S Symbol
|
|---|
| 247 | Sm MathSymbol
|
|---|
| 248 | Sc CurrencySymbol
|
|---|
| 249 | Sk ModifierSymbol
|
|---|
| 250 | So OtherSymbol
|
|---|
| 251 |
|
|---|
| 252 | Z Separator
|
|---|
| 253 | Zs SpaceSeparator
|
|---|
| 254 | Zl LineSeparator
|
|---|
| 255 | Zp ParagraphSeparator
|
|---|
| 256 |
|
|---|
| 257 | C Other
|
|---|
| 258 | Cc Control
|
|---|
| 259 | Cf Format
|
|---|
| 260 | Cs Surrogate (not usable)
|
|---|
| 261 | Co PrivateUse
|
|---|
| 262 | Cn Unassigned
|
|---|
| 263 |
|
|---|
| 264 | Single-letter properties match all characters in any of the
|
|---|
| 265 | two-letter sub-properties starting with the same letter.
|
|---|
| 266 | C<LC> and C<L&> are special cases, which are aliases for the set of
|
|---|
| 267 | C<Ll>, C<Lu>, and C<Lt>.
|
|---|
| 268 |
|
|---|
| 269 | Because Perl hides the need for the user to understand the internal
|
|---|
| 270 | representation of Unicode characters, there is no need to implement
|
|---|
| 271 | the somewhat messy concept of surrogates. C<Cs> is therefore not
|
|---|
| 272 | supported.
|
|---|
| 273 |
|
|---|
| 274 | Because scripts differ in their directionality--Hebrew is
|
|---|
| 275 | written right to left, for example--Unicode supplies these properties in
|
|---|
| 276 | the BidiClass class:
|
|---|
| 277 |
|
|---|
| 278 | Property Meaning
|
|---|
| 279 |
|
|---|
| 280 | L Left-to-Right
|
|---|
| 281 | LRE Left-to-Right Embedding
|
|---|
| 282 | LRO Left-to-Right Override
|
|---|
| 283 | R Right-to-Left
|
|---|
| 284 | AL Right-to-Left Arabic
|
|---|
| 285 | RLE Right-to-Left Embedding
|
|---|
| 286 | RLO Right-to-Left Override
|
|---|
| 287 | PDF Pop Directional Format
|
|---|
| 288 | EN European Number
|
|---|
| 289 | ES European Number Separator
|
|---|
| 290 | ET European Number Terminator
|
|---|
| 291 | AN Arabic Number
|
|---|
| 292 | CS Common Number Separator
|
|---|
| 293 | NSM Non-Spacing Mark
|
|---|
| 294 | BN Boundary Neutral
|
|---|
| 295 | B Paragraph Separator
|
|---|
| 296 | S Segment Separator
|
|---|
| 297 | WS Whitespace
|
|---|
| 298 | ON Other Neutrals
|
|---|
| 299 |
|
|---|
| 300 | For example, C<\p{BidiClass:R}> matches characters that are normally
|
|---|
| 301 | written right to left.
|
|---|
| 302 |
|
|---|
| 303 | =back
|
|---|
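The string notations from the list above can be compared side by side. This is a minimal sketch; the hash and variable names are made up for the example.

```perl
use strict;
use warnings;
use charnames ':full';

my $by_code = "\x{263A}";                  # hexadecimal code point
my $by_name = "\N{WHITE SMILING FACE}";    # official Unicode name
my $by_chr  = chr(0x263A);                 # built at run time

# Hash keys may contain characters above 255 as well.
my %seen = ($by_code => 'smiley');

printf "code point: U+%04X\n", ord($by_code);
print "all three spellings agree\n"
    if $by_code eq $by_name and $by_name eq $by_chr;
```

All three spellings produce the same one-character string, so they can be used interchangeably as hash keys or in comparisons.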
| 304 |
|
|---|
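The property constructs from the list above can be exercised with a few one-character strings. This is only a sketch: the helper C<matches> and the sample characters are chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string matches, 0 otherwise.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? 1 : 0;
}

my $combining_acute = "\x{0301}";    # COMBINING ACUTE ACCENT, a mark
my $tamil_a         = "\x{0B85}";    # TAMIL LETTER A
my $ideograph       = "\x{65E5}";    # a CJK ideograph

my @checks = (
    [ 'A is \p{Lu}',               matches("A", qr/^\p{Lu}$/) ],
    [ 'A is \p{UppercaseLetter}',  matches("A", qr/^\p{UppercaseLetter}$/) ],
    [ 'a is \p{Lu}',               matches("a", qr/^\p{Lu}$/) ],
    [ 'accent is \pM',             matches($combining_acute, qr/^\pM$/) ],
    [ 'Tamil letter is \p{Tamil}', matches($tamil_a, qr/^\p{Tamil}$/) ],
    [ 'A is \p{^Tamil}',           matches("A", qr/^\p{^Tamil}$/) ],
    [ 'A is \P{Tamil}',            matches("A", qr/^\P{Tamil}$/) ],
    [ 'ideograph is \w',           matches($ideograph, qr/^\w$/) ],
);

printf "%-30s %s\n", @$_ for @checks;
```

Anchoring each pattern with C<^> and C<$> keeps every check focused on a single character.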
| 305 | =head2 Scripts
|
|---|
| 306 |
|
|---|
| 307 | The script names which can be used by C<\p{...}> and C<\P{...}>,
|
|---|
| 308 | such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:
|
|---|
| 309 |
|
|---|
| 310 | Arabic
|
|---|
| 311 | Armenian
|
|---|
| 312 | Bengali
|
|---|
| 313 | Bopomofo
|
|---|
| 314 | Buhid
|
|---|
| 315 | CanadianAboriginal
|
|---|
| 316 | Cherokee
|
|---|
| 317 | Cyrillic
|
|---|
| 318 | Deseret
|
|---|
| 319 | Devanagari
|
|---|
| 320 | Ethiopic
|
|---|
| 321 | Georgian
|
|---|
| 322 | Gothic
|
|---|
| 323 | Greek
|
|---|
| 324 | Gujarati
|
|---|
| 325 | Gurmukhi
|
|---|
| 326 | Han
|
|---|
| 327 | Hangul
|
|---|
| 328 | Hanunoo
|
|---|
| 329 | Hebrew
|
|---|
| 330 | Hiragana
|
|---|
| 331 | Inherited
|
|---|
| 332 | Kannada
|
|---|
| 333 | Katakana
|
|---|
| 334 | Khmer
|
|---|
| 335 | Lao
|
|---|
| 336 | Latin
|
|---|
| 337 | Malayalam
|
|---|
| 338 | Mongolian
|
|---|
| 339 | Myanmar
|
|---|
| 340 | Ogham
|
|---|
| 341 | OldItalic
|
|---|
| 342 | Oriya
|
|---|
| 343 | Runic
|
|---|
| 344 | Sinhala
|
|---|
| 345 | Syriac
|
|---|
| 346 | Tagalog
|
|---|
| 347 | Tagbanwa
|
|---|
| 348 | Tamil
|
|---|
| 349 | Telugu
|
|---|
| 350 | Thaana
|
|---|
| 351 | Thai
|
|---|
| 352 | Tibetan
|
|---|
| 353 | Yi
|
|---|
| 354 |
|
|---|
| 355 | Extended property classes can supplement the basic
|
|---|
| 356 | properties, defined by the F<PropList> Unicode database:
|
|---|
| 357 |
|
|---|
| 358 | ASCIIHexDigit
|
|---|
| 359 | BidiControl
|
|---|
| 360 | Dash
|
|---|
| 361 | Deprecated
|
|---|
| 362 | Diacritic
|
|---|
| 363 | Extender
|
|---|
| 364 | GraphemeLink
|
|---|
| 365 | HexDigit
|
|---|
| 366 | Hyphen
|
|---|
| 367 | Ideographic
|
|---|
| 368 | IDSBinaryOperator
|
|---|
| 369 | IDSTrinaryOperator
|
|---|
| 370 | JoinControl
|
|---|
| 371 | LogicalOrderException
|
|---|
| 372 | NoncharacterCodePoint
|
|---|
| 373 | OtherAlphabetic
|
|---|
| 374 | OtherDefaultIgnorableCodePoint
|
|---|
| 375 | OtherGraphemeExtend
|
|---|
| 376 | OtherLowercase
|
|---|
| 377 | OtherMath
|
|---|
| 378 | OtherUppercase
|
|---|
| 379 | QuotationMark
|
|---|
| 380 | Radical
|
|---|
| 381 | SoftDotted
|
|---|
| 382 | TerminalPunctuation
|
|---|
| 383 | UnifiedIdeograph
|
|---|
| 384 | WhiteSpace
|
|---|
| 385 |
|
|---|
| 386 | and there are further derived properties:
|
|---|
| 387 |
|
|---|
| 388 | Alphabetic Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
|
|---|
| 389 | Lowercase Ll + OtherLowercase
|
|---|
| 390 | Uppercase Lu + OtherUppercase
|
|---|
| 391 | Math Sm + OtherMath
|
|---|
| 392 |
|
|---|
| 393 | ID_Start Lu + Ll + Lt + Lm + Lo + Nl
|
|---|
| 394 | ID_Continue ID_Start + Mn + Mc + Nd + Pc
|
|---|
| 395 |
|
|---|
| 396 | Any Any character
|
|---|
| 397 | Assigned Any non-Cn character (i.e. synonym for \P{Cn})
|
|---|
| 398 | Unassigned Synonym for \p{Cn}
|
|---|
| 399 | Common Any character (or unassigned code point)
|
|---|
| 400 | not explicitly assigned to a script
|
|---|
| 401 |
|
|---|
| 402 | For backward compatibility (with Perl 5.6), all properties mentioned
|
|---|
| 403 | so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
|
|---|
| 404 | example, is equal to C<\P{Lu}>.
|
|---|
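A brief sketch of script and extended property tests follows. The helper C<matches> and the sample characters are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string matches, 0 otherwise.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? 1 : 0;
}

my $zhe = "\x{0416}";    # CYRILLIC CAPITAL LETTER ZHE

my @checks = (
    [ 'Z is \p{Latin}',      matches("Z",  qr/^\p{Latin}$/) ],
    [ 'ZHE is \p{Cyrillic}', matches($zhe, qr/^\p{Cyrillic}$/) ],
    [ 'ZHE is \p{Latin}',    matches($zhe, qr/^\p{Latin}$/) ],
    [ '5 is \p{Common}',     matches("5",  qr/^\p{Common}$/) ],
    [ 'F is \p{HexDigit}',   matches("F",  qr/^\p{HexDigit}$/) ],
    [ 'G is \p{HexDigit}',   matches("G",  qr/^\p{HexDigit}$/) ],
    [ 'A is \p{Alphabetic}', matches("A",  qr/^\p{Alphabetic}$/) ],
    [ 'A is \p{IsLu}',       matches("A",  qr/^\p{IsLu}$/) ],
);

printf "%-24s %s\n", @$_ for @checks;
```

The last check uses the C<Is> prefix, which is accepted for backward compatibility.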
| 405 |
|
|---|
| 406 | =head2 Blocks
|
|---|
| 407 |
|
|---|
| 408 | In addition to B<scripts>, Unicode also defines B<blocks> of
|
|---|
| 409 | characters. The difference between scripts and blocks is that the
|
|---|
| 410 | concept of scripts is closer to natural languages, while the concept
|
|---|
| 411 | of blocks is more of an artificial grouping based on groups of 256
|
|---|
| 412 | Unicode characters. For example, the C<Latin> script contains letters
|
|---|
| 413 | from many blocks but does not contain all the characters from those
|
|---|
| 414 | blocks. It does not, for example, contain digits, because digits are
|
|---|
| 415 | shared across many scripts. Digits and similar groups, like
|
|---|
| 416 | punctuation, are in a category called C<Common>.
|
|---|
| 417 |
|
|---|
| 418 | For more about scripts, see UTR #24:
|
|---|
| 419 |
|
|---|
| 420 | http://www.unicode.org/unicode/reports/tr24/
|
|---|
| 421 |
|
|---|
| 422 | For more about blocks, see:
|
|---|
| 423 |
|
|---|
| 424 | http://www.unicode.org/Public/UNIDATA/Blocks.txt
|
|---|
| 425 |
|
|---|
| 426 | Block names are given with the C<In> prefix. For example, the
|
|---|
| 427 | Katakana block is referenced via C<\p{InKatakana}>. The C<In>
|
|---|
| 428 | prefix may be omitted if there is no naming conflict with a script
|
|---|
| 429 | or any other property, but it is recommended that C<In> always be used
|
|---|
| 430 | for block tests to avoid confusion.
|
|---|
| 431 |
|
|---|
| 432 | These block names are supported:
|
|---|
| 433 |
|
|---|
| 434 | InAlphabeticPresentationForms
|
|---|
| 435 | InArabic
|
|---|
| 436 | InArabicPresentationFormsA
|
|---|
| 437 | InArabicPresentationFormsB
|
|---|
| 438 | InArmenian
|
|---|
| 439 | InArrows
|
|---|
| 440 | InBasicLatin
|
|---|
| 441 | InBengali
|
|---|
| 442 | InBlockElements
|
|---|
| 443 | InBopomofo
|
|---|
| 444 | InBopomofoExtended
|
|---|
| 445 | InBoxDrawing
|
|---|
| 446 | InBraillePatterns
|
|---|
| 447 | InBuhid
|
|---|
| 448 | InByzantineMusicalSymbols
|
|---|
| 449 | InCJKCompatibility
|
|---|
| 450 | InCJKCompatibilityForms
|
|---|
| 451 | InCJKCompatibilityIdeographs
|
|---|
| 452 | InCJKCompatibilityIdeographsSupplement
|
|---|
| 453 | InCJKRadicalsSupplement
|
|---|
| 454 | InCJKSymbolsAndPunctuation
|
|---|
| 455 | InCJKUnifiedIdeographs
|
|---|
| 456 | InCJKUnifiedIdeographsExtensionA
|
|---|
| 457 | InCJKUnifiedIdeographsExtensionB
|
|---|
| 458 | InCherokee
|
|---|
| 459 | InCombiningDiacriticalMarks
|
|---|
| 460 | InCombiningDiacriticalMarksforSymbols
|
|---|
| 461 | InCombiningHalfMarks
|
|---|
| 462 | InControlPictures
|
|---|
| 463 | InCurrencySymbols
|
|---|
| 464 | InCyrillic
|
|---|
| 465 | InCyrillicSupplementary
|
|---|
| 466 | InDeseret
|
|---|
| 467 | InDevanagari
|
|---|
| 468 | InDingbats
|
|---|
| 469 | InEnclosedAlphanumerics
|
|---|
| 470 | InEnclosedCJKLettersAndMonths
|
|---|
| 471 | InEthiopic
|
|---|
| 472 | InGeneralPunctuation
|
|---|
| 473 | InGeometricShapes
|
|---|
| 474 | InGeorgian
|
|---|
| 475 | InGothic
|
|---|
| 476 | InGreekExtended
|
|---|
| 477 | InGreekAndCoptic
|
|---|
| 478 | InGujarati
|
|---|
| 479 | InGurmukhi
|
|---|
| 480 | InHalfwidthAndFullwidthForms
|
|---|
| 481 | InHangulCompatibilityJamo
|
|---|
| 482 | InHangulJamo
|
|---|
| 483 | InHangulSyllables
|
|---|
| 484 | InHanunoo
|
|---|
| 485 | InHebrew
|
|---|
| 486 | InHighPrivateUseSurrogates
|
|---|
| 487 | InHighSurrogates
|
|---|
| 488 | InHiragana
|
|---|
| 489 | InIPAExtensions
|
|---|
| 490 | InIdeographicDescriptionCharacters
|
|---|
| 491 | InKanbun
|
|---|
| 492 | InKangxiRadicals
|
|---|
| 493 | InKannada
|
|---|
| 494 | InKatakana
|
|---|
| 495 | InKatakanaPhoneticExtensions
|
|---|
| 496 | InKhmer
|
|---|
| 497 | InLao
|
|---|
| 498 | InLatin1Supplement
|
|---|
| 499 | InLatinExtendedA
|
|---|
| 500 | InLatinExtendedAdditional
|
|---|
| 501 | InLatinExtendedB
|
|---|
| 502 | InLetterlikeSymbols
|
|---|
| 503 | InLowSurrogates
|
|---|
| 504 | InMalayalam
|
|---|
| 505 | InMathematicalAlphanumericSymbols
|
|---|
| 506 | InMathematicalOperators
|
|---|
| 507 | InMiscellaneousMathematicalSymbolsA
|
|---|
| 508 | InMiscellaneousMathematicalSymbolsB
|
|---|
| 509 | InMiscellaneousSymbols
|
|---|
| 510 | InMiscellaneousTechnical
|
|---|
| 511 | InMongolian
|
|---|
| 512 | InMusicalSymbols
|
|---|
| 513 | InMyanmar
|
|---|
| 514 | InNumberForms
|
|---|
| 515 | InOgham
|
|---|
| 516 | InOldItalic
|
|---|
| 517 | InOpticalCharacterRecognition
|
|---|
| 518 | InOriya
|
|---|
| 519 | InPrivateUseArea
|
|---|
| 520 | InRunic
|
|---|
| 521 | InSinhala
|
|---|
| 522 | InSmallFormVariants
|
|---|
| 523 | InSpacingModifierLetters
|
|---|
| 524 | InSpecials
|
|---|
| 525 | InSuperscriptsAndSubscripts
|
|---|
| 526 | InSupplementalArrowsA
|
|---|
| 527 | InSupplementalArrowsB
|
|---|
| 528 | InSupplementalMathematicalOperators
|
|---|
| 529 | InSupplementaryPrivateUseAreaA
|
|---|
| 530 | InSupplementaryPrivateUseAreaB
|
|---|
| 531 | InSyriac
|
|---|
| 532 | InTagalog
|
|---|
| 533 | InTagbanwa
|
|---|
| 534 | InTags
|
|---|
| 535 | InTamil
|
|---|
| 536 | InTelugu
|
|---|
| 537 | InThaana
|
|---|
| 538 | InThai
|
|---|
| 539 | InTibetan
|
|---|
| 540 | InUnifiedCanadianAboriginalSyllabics
|
|---|
| 541 | InVariationSelectors
|
|---|
| 542 | InYiRadicals
|
|---|
| 543 | InYiSyllables
|
|---|
| 544 |
|
|---|
| 545 | =over 4
|
|---|
| 546 |
|
|---|
| 547 | =item *
|
|---|
| 548 |
|
|---|
| 549 | The special pattern C<\X> matches any extended Unicode
|
|---|
| 550 | sequence--"a combining character sequence" in Standardese--where the
|
|---|
| 551 | first character is a base character and subsequent characters are mark
|
|---|
| 552 | characters that apply to the base character. C<\X> is equivalent to
|
|---|
| 553 | C<(?:\PM\pM*)>.
|
|---|
| 554 |
|
|---|
| 555 | =item *
|
|---|
| 556 |
|
|---|
| 557 | The C<tr///> operator translates characters instead of bytes. Note
|
|---|
| 558 | that the C<tr///CU> functionality has been removed. For similar
|
|---|
| 559 | functionality see pack('U0', ...) and pack('C0', ...).
|
|---|
| 560 |
|
|---|
| 561 | =item *
|
|---|
| 562 |
|
|---|
| 563 | Case translation operators use the Unicode case translation tables
|
|---|
| 564 | when character input is provided. Note that C<uc()>, or C<\U> in
|
|---|
| 565 | interpolated strings, translates to uppercase, while C<ucfirst>,
|
|---|
| 566 | or C<\u> in interpolated strings, translates to titlecase in languages
|
|---|
| 567 | that make the distinction.
|
|---|
| 568 |
|
|---|
| 569 | =item *
|
|---|
| 570 |
|
|---|
| 571 | Most operators that deal with positions or lengths in a string will
|
|---|
| 572 | automatically switch to using character positions, including
|
|---|
| 573 | C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
|
|---|
| 574 | C<sprintf()>, C<write()>, and C<length()>. Operators that
|
|---|
| 575 | specifically do not switch include C<vec()>, C<pack()>, and
|
|---|
| 576 | C<unpack()>. Operators that really don't care include
|
|---|
| 577 | operators that treat strings as a bucket of bits such as C<sort()>,
|
|---|
| 578 | and operators dealing with filenames.
|
|---|
| 579 |
|
|---|
| 580 | =item *
|
|---|
| 581 |
|
|---|
| 582 | The C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change,
|
|---|
| 583 | since they are often used for byte-oriented formats. Again, think
|
|---|
| 584 | C<char> in the C language.
|
|---|
| 585 |
|
|---|
| 586 | There is a new C<U> specifier that converts between Unicode characters
|
|---|
| 587 | and code points.
|
|---|
| 588 |
|
|---|
| 589 | =item *
|
|---|
| 590 |
|
|---|
| 591 | The C<chr()> and C<ord()> functions work on characters, similar to
|
|---|
| 592 | C<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and
|
|---|
| 593 | C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
|
|---|
| 594 | emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
|
|---|
| 595 | While these methods reveal the internal encoding of Unicode strings,
|
|---|
| 596 | that is not something one normally needs to care about at all.
|
|---|
| 597 |
|
|---|
| 598 | =item *
|
|---|
| 599 |
|
|---|
| 600 | The bit string operators, C<& | ^ ~>, can operate on character data.
|
|---|
| 601 | However, for backward compatibility, such as when using bit string
|
|---|
| 602 | operations when characters are all less than 256 in ordinal value, one
|
|---|
| 603 | should not use C<~> (the bit complement) with characters of both
|
|---|
| 604 | values less than 256 and values greater than 255. Most importantly,
|
|---|
| 605 | DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
|
|---|
| 606 | will not hold. The reason for this mathematical I<faux pas> is that
|
|---|
| 607 | the complement cannot return B<both> the 8-bit (byte-wide) bit
|
|---|
| 608 | complement B<and> the full character-wide bit complement.
|
|---|
| 609 |
|
|---|
| 610 | =item *
|
|---|
| 611 |
|
|---|
| 612 | lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
|
|---|
| 613 |
|
|---|
| 614 | =over 8
|
|---|
| 615 |
|
|---|
| 616 | =item *
|
|---|
| 617 |
|
|---|
| 618 | the case mapping is from a single Unicode character to another
|
|---|
| 619 | single Unicode character, or
|
|---|
| 620 |
|
|---|
| 621 | =item *
|
|---|
| 622 |
|
|---|
| 623 | the case mapping is from a single Unicode character to more
|
|---|
| 624 | than one Unicode character.
|
|---|
| 625 |
|
|---|
| 626 | =back
|
|---|
| 627 |
|
|---|
| 628 | Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
|
|---|
| 629 | since Perl does not understand the concept of Unicode locales.
|
|---|
| 630 |
|
|---|
| 631 | See the Unicode Technical Report #21, Case Mappings, for more details.
|
|---|
| 632 |
|
|---|
| 633 | =back
|
|---|
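Several of the items above (C<\X>, C<tr///>, and the case translation operators) can be tried out in a short sketch. The sample characters are chosen for illustration: U+01C6 has distinct uppercase and titlecase forms, and U+FB00 uppercases to two characters.

```perl
use strict;
use warnings;

# Uppercase versus titlecase: U+01C6 is the small "dz with caron" digraph.
my $dz_small = "\x{01C6}";
my $upper    = uc($dz_small);         # uppercase form
my $title    = ucfirst($dz_small);    # titlecase form
my $lower    = lc($upper);            # back to the small form

# One-to-many mapping: the "ff" ligature uppercases to two letters.
my $ligature_upper = uc("\x{FB00}");

# \X matches a base character together with its combining marks.
my $accented = "e\x{0301}a";
my @clusters = $accented =~ /(\X)/g;

# tr/// translates characters, including those above 255.
(my $swapped = "\x{263A}-\x{263B}") =~ tr/\x{263A}\x{263B}/\x{263B}\x{263A}/;

printf "uc: U+%04X, ucfirst: U+%04X\n", ord($upper), ord($title);
printf "ligature uppercases to %d characters\n", length($ligature_upper);
printf "%d characters form %d combining sequences\n",
    length($accented), scalar @clusters;
```

Locale-specific rules such as the Turkish dotless i are outside what this sketch covers, in line with the limitation stated above.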
| 634 |
|
|---|
| 635 | =over 4
|
|---|
| 636 |
|
|---|
| 637 | =item *
|
|---|
| 638 |
|
|---|
| 639 | And finally, C<scalar reverse()> reverses by character rather than by byte.
|
|---|
| 640 |
|
|---|
| 641 | =back
|
|---|
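The position and length operators, C<chr()>, C<ord()>, C<pack("U")>, and C<scalar reverse()> can be checked together in a small sketch. The sample string is made up for the example.

```perl
use strict;
use warnings;

my $string = "ab\x{263A}cd";    # five characters, one of them wide

my $length   = length($string);            # counted in characters
my $position = index($string, "c");        # character offset, not byte offset
my $middle   = substr($string, 2, 1);      # the wide character itself
my $reversed = scalar reverse($string);    # reversed character by character

# chr() and pack("U") both build a character from a code point.
my $from_chr  = chr(0x263A);
my $from_pack = pack("U", 0x263A);

printf "length %d, 'c' at offset %d, middle is U+%04X\n",
    $length, $position, ord($middle);
```

If these operators counted bytes instead, the offset of C<c> would shift by the extra bytes of the wide character's internal encoding.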
| 642 |
|
|---|
| 643 | =head2 User-Defined Character Properties
|
|---|
| 644 |
|
|---|
| 645 | You can define your own character properties by defining subroutines
|
|---|
| 646 | whose names begin with "In" or "Is". The subroutines can be defined in
|
|---|
| 647 | any package. The user-defined properties can be used in the regular
|
|---|
| 648 | expression C<\p> and C<\P> constructs; if you are using a user-defined
|
|---|
| 649 | property from a package other than the one you are in, you must specify
|
|---|
| 650 | its package in the C<\p> or C<\P> construct.
|
|---|
| 651 |
|
|---|
| 652 | # assuming property IsForeign defined in Lang::
|
|---|
| 653 | package main; # property package name required
|
|---|
| 654 | if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
|
|---|
| 655 |
|
|---|
| 656 | package Lang; # property package name not required
|
|---|
| 657 | if ($txt =~ /\p{IsForeign}+/) { ... }
|
|---|
| 658 |
|
|---|
| 659 |
|
|---|
| 660 | Note that the effect is compile-time and immutable once defined.
|
|---|
| 661 |
|
|---|
| 662 | The subroutines must return a specially-formatted string, with one
|
|---|
| 663 | or more newline-separated lines. Each line must be one of the following:
|
|---|
| 664 |
|
|---|
| 665 | =over 4
|
|---|
| 666 |
|
|---|
| 667 | =item *
|
|---|
| 668 |
|
|---|
| 669 | Two hexadecimal numbers separated by horizontal whitespace (space or
|
|---|
| 670 | tab characters) denoting a range of Unicode code points to include.
|
|---|
| 671 |
|
|---|
| 672 | =item *
|
|---|
| 673 |
|
|---|
| 674 | Something to include, prefixed by "+": a built-in character
|
|---|
| 675 | property (prefixed by "utf8::") or a user-defined character property,
|
|---|
| 676 | to represent all the characters in that property; two hexadecimal code
|
|---|
| 677 | points for a range; or a single hexadecimal code point.
|
|---|
| 678 |
|
|---|
| 679 | =item *
|
|---|
| 680 |
|
|---|
| 681 | Something to exclude, prefixed by "-": an existing character
|
|---|
| 682 | property (prefixed by "utf8::") or a user-defined character property,
|
|---|
| 683 | to represent all the characters in that property; two hexadecimal code
|
|---|
| 684 | points for a range; or a single hexadecimal code point.
|
|---|
| 685 |
|
|---|
| 686 | =item *
|
|---|
| 687 |
|
|---|
| 688 | Something to negate, prefixed by "!": an existing character
|
|---|
| 689 | property (prefixed by "utf8::") or a user-defined character property,
|
|---|
| 690 | to represent all the characters in that property; two hexadecimal code
|
|---|
| 691 | points for a range; or a single hexadecimal code point.
|
|---|
| 692 |
|
|---|
| 693 | =item *
|
|---|
| 694 |
|
|---|
| 695 | Something to intersect with, prefixed by "&": an existing character
|
|---|
| 696 | property (prefixed by "utf8::") or a user-defined character property,
|
|---|
| 697 | to keep only the characters that are also in that property; two
|
|---|
| 698 | hexadecimal code points for a range; or a single hexadecimal code point.
|
|---|
| 699 |
|
|---|
| 700 | =back
|
|---|
| 701 |
|
|---|
| 702 | For example, to define a property that covers both the Japanese
|
|---|
| 703 | syllabaries (hiragana and katakana), you can define
|
|---|
| 704 |
|
|---|
| 705 | sub InKana {
|
|---|
| 706 | return <<END;
|
|---|
| 707 | 3040\t309F
|
|---|
| 708 | 30A0\t30FF
|
|---|
| 709 | END
|
|---|
| 710 | }
|
|---|
| 711 |
|
|---|
| 712 | Imagine that the here-doc end marker is at the beginning of the line.
|
|---|
| 713 | Now you can use C<\p{InKana}> and C<\P{InKana}>.
|
|---|
| 714 |
|
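As a minimal sketch of how such a property can be exercised, the script below repeats the C<InKana> ranges from above and tests three sample code points against C<\p{InKana}> and C<\P{InKana}>. The variable names and the choice of sample characters are illustrative assumptions, not part of the original text.

```perl
use strict;
use warnings;

# Same ranges as the InKana definition above: tab-separated hexadecimal
# start and end code points, one range per line.
sub InKana {
    return "3040\t309F\n"
         . "30A0\t30FF\n";
}

# Sample characters (chosen for illustration only).
my $hiragana_a  = "\x{3042}";   # HIRAGANA LETTER A
my $katakana_ka = "\x{30AB}";   # KATAKANA LETTER KA
my $cjk_one     = "\x{4E00}";   # a CJK ideograph, outside both blocks

# \p{InKana} matches characters inside the ranges, \P{InKana} those outside.
my $hira_is_kana = $hiragana_a  =~ /^\p{InKana}$/ ? 1 : 0;
my $kata_is_kana = $katakana_ka =~ /^\p{InKana}$/ ? 1 : 0;
my $cjk_not_kana = $cjk_one     =~ /^\P{InKana}$/ ? 1 : 0;

print "U+3042 matches \\p{InKana}: $hira_is_kana\n";
print "U+30AB matches \\p{InKana}: $kata_is_kana\n";
print "U+4E00 matches \\P{InKana}: $cjk_not_kana\n";
```

The sample strings all contain code points above 0xFF, so they are Unicode strings and the property test applies to them.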
|---|
| 715 | You could also have used the existing block property names:
|
|---|
| 716 |
|
|---|
| 717 | sub InKana {
|
|---|
| 718 | return <<'END';
|
|---|
| 719 | +utf8::InHiragana
|
|---|
| 720 | +utf8::InKatakana
|
|---|
| 721 | END
|
|---|
| 722 | }
|
|---|
| 723 |
|
|---|
| 724 | Suppose you wanted to match only the allocated characters,
|
|---|
| 725 | not the raw block ranges: in other words, you want to remove
|
|---|
| 726 | the unassigned code points (general category Cn):
|
|---|
| 727 |
|
|---|
| 728 | sub InKana {
|
|---|
| 729 | return <<'END';
|
|---|
| 730 | +utf8::InHiragana
|
|---|
| 731 | +utf8::InKatakana
|
|---|
| 732 | -utf8::IsCn
|
|---|
| 733 | END
|
|---|
| 734 | }
|
|---|
| 735 |
|
|---|
| 736 | The negation is useful for defining (surprise!) negated classes.
|
|---|
| 737 |
|
|---|
| 738 | sub InNotKana {
|
|---|
| 739 | return <<'END';
|
|---|
| 740 | !utf8::InHiragana
|
|---|
| 741 | -utf8::InKatakana
|
|---|
| 742 | +utf8::IsCn
|
|---|
| 743 | END
|
|---|
| 744 | }
|
|---|
| 745 |
|
|---|
| 746 | Intersection is useful for getting the common characters matched by
|
|---|
| 747 | two (or more) classes.
|
|---|
| 748 |
|
|---|
| 749 | sub InFooAndBar {
|
|---|
| 750 | return <<'END';
|
|---|
| 751 | +main::Foo
|
|---|
| 752 | &main::Bar
|
|---|
| 753 | END
|
|---|
| 754 | }
|
|---|
| 755 |
|
|---|
| 756 | It's important to remember not to use "&" for the first set -- that
|
|---|
| 757 | would be intersecting with nothing (resulting in an empty set).
|
|---|
| 758 |
|
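The intersection rule can be sketched with two overlapping ranges. The property names C<InHiraganaRange>, C<InOverlapRange>, and C<InBothRanges>, and the second range itself, are hypothetical and chosen only so that the overlap is easy to see (U+3090 through U+309F).

```perl
use strict;
use warnings;

# Two hypothetical user-defined properties with overlapping ranges.
sub InHiraganaRange { return "3040\t309F\n"; }
sub InOverlapRange  { return "3090\t30AF\n"; }

# "+" seeds the set with the first property; "&" then keeps only the
# characters that are also in the second one.
sub InBothRanges {
    return "+main::InHiraganaRange\n"
         . "&main::InOverlapRange\n";
}

my $in_both     = "\x{3095}" =~ /^\p{InBothRanges}$/ ? 1 : 0;  # in both ranges
my $only_first  = "\x{3042}" =~ /^\p{InBothRanges}$/ ? 1 : 0;  # first range only
my $only_second = "\x{30A5}" =~ /^\p{InBothRanges}$/ ? 1 : 0;  # second range only

print "U+3095 in intersection: $in_both\n";
print "U+3042 in intersection: $only_first\n";
print "U+30A5 in intersection: $only_second\n";
```

Only a code point that lies in both ranges should match; the "+" line comes first so that the "&" line has a non-empty set to intersect with.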
|---|
| 759 | You can also define your own mappings to be used by lc(),
|
|---|
| 760 | lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
|
|---|
| 761 | The principle is the same: define subroutines in the C<main> package
|
|---|
| 762 | with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
|
|---|
| 763 | the first character in ucfirst()), and C<ToUpper> (for uc(), and the
|
|---|
| 764 | rest of the characters in ucfirst()).
|
|---|
| 765 |
|
|---|
| 766 | The string returned by the subroutines must now consist of three
|
|---|
| 767 | hexadecimal numbers separated by tabs: start of the source
|
|---|
| 768 | range, end of the source range, and start of the destination range.
|
|---|
| 769 | For example:
|
|---|
| 770 |
|
|---|
| 771 | sub ToUpper {
|
|---|
| 772 | return <<END;
|
|---|
| 773 | 0061\t0063\t0041
|
|---|
| 774 | END
|
|---|
| 775 | }
|
|---|
| 776 |
|
|---|
| 777 | defines a uc() mapping that maps only the characters "a", "b", and
|
|---|
| 778 | "c" to "A", "B", and "C"; all other characters remain
|
|---|
| 779 | unchanged.
|
|---|
| 780 |
|
|---|
| 781 | If there is no source range to speak of, that is, the mapping is from
|
|---|
| 782 | a single character to another single character, leave the end of the
|
|---|
| 783 | source range empty, but the two tab characters are still needed.
|
|---|
| 784 | For example:
|
|---|
| 785 |
|
|---|
| 786 | sub ToLower {
|
|---|
| 787 | return <<END;
|
|---|
| 788 | 0041\t\t0061
|
|---|
| 789 | END
|
|---|
| 790 | }
|
|---|
| 791 |
|
|---|
| 792 | defines an lc() mapping that maps only "A" to "a"; all
|
|---|
| 793 | other characters remain unchanged.
|
|---|
| 794 |
|
|---|
| 795 | (For serious hackers only) If you want to introspect the default
|
|---|
| 796 | mappings, you can find the data in the directory
|
|---|
| 797 | C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as
|
|---|
| 798 | the here-document, and the C<utf8::ToSpecFoo> are special exception
|
|---|
| 799 | mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
|
|---|
| 800 | The C<Digit> and C<Fold> mappings that one can see in the directory
|
|---|
| 801 | are not directly user-accessible; instead, one can use either the
|
|---|
| 802 | C<Unicode::UCD> module, or just match case-insensitively (that's when
|
|---|
| 803 | the C<Fold> mapping is used).
|
|---|
| 804 |
|
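As a hedged illustration of the C<Unicode::UCD> route, the sketch below looks up the default case, digit, and fold data for a few code points using the module's C<charinfo()> and C<casefold()> functions. The sample code points and variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo casefold);

# charinfo() returns a hash reference describing one code point; the
# case mappings are given as hexadecimal strings.
my $lower_of_A = charinfo(0x41)->{lower};   # lowercase mapping of "A"
my $upper_of_a = charinfo(0x61)->{upper};   # uppercase mapping of "a"

# The decimal digit value of U+0663 ARABIC-INDIC DIGIT THREE.
my $digit_value = charinfo(0x0663)->{decimal};

# casefold() exposes the case-folding data used by case-insensitive matching.
my $fold_of_A = casefold(0x41)->{mapping};

print "lower(U+0041)   = U+$lower_of_A\n";
print "upper(U+0061)   = U+$upper_of_a\n";
print "decimal(U+0663) = $digit_value\n";
print "fold(U+0041)    = U+$fold_of_A\n";
```

This reads the mapping data through the module rather than from the files under F<unicore/To/>.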
|---|
| 805 | A final note on the user-defined property tests and mappings: they
|
|---|
| 806 | will be used only if the scalar has been marked as having Unicode
|
|---|
| 807 | characters. Old byte-style strings will not be affected.
|
|---|
| 808 |
|
|---|
| 809 | =head2 Character Encodings for Input and Output
|
|---|
| 810 |
|
|---|
| 811 | See L<Encode>.
|
|---|
| 812 |
|
|---|
| 813 | =head2 Unicode Regular Expression Support Level
|
|---|
| 814 |
|
|---|
| 815 | The following list of Unicode support for regular expressions describes
|
|---|
| 816 | all the features currently supported. The references to "Level N"
|
|---|
| 817 | and the section numbers refer to the Unicode Technical Report 18,
|
|---|
| 818 | "Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
|
|---|
| 819 | Perl 5.8.0).
|
|---|
| 820 |
|
|---|
| 821 | =over 4
|
|---|
| 822 |
|
|---|
| 823 | =item *
|
|---|
| 824 |
|
|---|
| 825 | Level 1 - Basic Unicode Support
|
|---|
| 826 |
|
|---|
| 827 | 2.1 Hex Notation - done [1]
|
|---|
| 828 | Named Notation - done [2]
|
|---|
| 829 | 2.2 Categories - done [3][4]
|
|---|
| 830 | 2.3 Subtraction - MISSING [5][6]
|
|---|
| 831 | 2.4 Simple Word Boundaries - done [7]
|
|---|
| 832 | 2.5 Simple Loose Matches - done [8]
|
|---|
| 833 | 2.6 End of Line - MISSING [9][10]
|
|---|
| 834 |
|
|---|
| 835 | [ 1] \x{...}
|
|---|
| 836 | [ 2] \N{...}
|
|---|
| 837 | [ 3] . \p{...} \P{...}
|
|---|
| 838 | [ 4] support for scripts (see UTR#24 Script Names), blocks,
|
|---|
| |
|---|