perlunicode.pod@ 3609

Last change on this file since 3609 was 3181, checked in by bird, 19 years ago
perl 5.8.8
File size: 49.4 KB

Rev	Line
[3181]	1	=head1 NAME
	2
	3	perlunicode - Unicode support in Perl
	4
	5	=head1 DESCRIPTION
	6
	7	=head2 Important Caveats
	8
	9	Unicode support is an extensive requirement. While Perl does not
	10	implement the Unicode standard or the accompanying technical reports
	11	from cover to cover, Perl does support many Unicode features.
	12
	13	=over 4
	14
	15	=item Input and Output Layers
	16
	17	Perl knows when a filehandle uses Perl's internal Unicode encodings
	18	(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
	19	the ":utf8" layer. Other encodings can be converted to Perl's
	20	encoding on input or from Perl's encoding on output by use of the
	21	":encoding(...)" layer. See L<open>.
	22
	23	To indicate that Perl source itself is using a particular encoding,
	24	see L<encoding>.
	25
	26	=item Regular Expressions
	27
	28	The regular expression compiler produces polymorphic opcodes. That is,
	29	the pattern adapts to the data and automatically switches to the Unicode
	30	character scheme when presented with Unicode data--or instead uses
	31	a traditional byte scheme when presented with byte data.
	32
	33	=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
	34
	35	As a compatibility measure, the C<use utf8> pragma must be explicitly
	36	included to enable recognition of UTF-8 in the Perl scripts themselves
	37	(in string or regular expression literals, or in identifier names) on
	38	ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
	39	machines. B<These are the only times when an explicit C<use utf8>
	40	is needed.> See L<utf8>.
	41
	42	You can also use the C<encoding> pragma to change the default encoding
	43	of the data in your script; see L<encoding>.
	44
	45	=item BOM-marked scripts and UTF-16 scripts autodetected
	46
	47	If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
	48	or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
	49	endianness, Perl will correctly read in the script as Unicode.
	50	(BOMless UTF-8 cannot be effectively recognized or differentiated from
	51	ISO 8859-1 or other eight-bit encodings.)
	52
	53	=item C<use encoding> needed to upgrade non-Latin-1 byte strings
	54
	55	By default, there is a fundamental asymmetry in Perl's Unicode model:
	56	implicit upgrading from byte strings to Unicode strings assumes that
	57	they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
	58	downgraded with UTF-8 encoding. This happens because the first 256
	59	code points in Unicode happen to agree with Latin-1.
	60
	61	If you wish to interpret byte strings as UTF-8 instead, use the
	62	C<encoding> pragma:
	63
	64	use encoding 'utf8';
	65
	66	See L</"Byte and Character Semantics"> for more details.
	67
	68	=back
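
As a rough illustration of the input and output layers described above, the
following sketch writes one character through an C<:encoding(UTF-8)> layer and
reads it back. The in-memory filehandle and the variable names are assumptions
made for this example; a real program would normally open a file instead.

```perl
use strict;
use warnings;

# Write one character (U+263A) through an encoding layer into an
# in-memory buffer; the layer converts it to UTF-8 bytes on output.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
print {$out} "\x{263A}";
close($out) or die "close failed: $!";

# The buffer now holds encoded bytes, not characters.
my $byte_count = length($buffer);

# Read the same bytes back through the layer; they are decoded into
# a single character again.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open failed: $!";
my $text = <$in>;
close($in);
my $char_count = length($text);
```

Here C<$byte_count> counts the encoded bytes in the buffer, while
C<$char_count> counts the characters Perl sees after decoding.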
	69
	70	=head2 Byte and Character Semantics
	71
	72	Beginning with version 5.6, Perl uses logically-wide characters to
	73	represent strings internally.
	74
	75	In future, Perl-level operations will be expected to work with
	76	characters rather than bytes.
	77
	78	However, as an interim compatibility measure, Perl aims to
	79	provide a safe migration path from byte semantics to character
	80	semantics for programs. For operations where Perl can unambiguously
	81	decide that the input data are characters, Perl switches to
	82	character semantics. For operations where this determination cannot
	83	be made without additional information from the user, Perl decides in
	84	favor of compatibility and chooses to use byte semantics.
	85
	86	This behavior preserves compatibility with earlier versions of Perl,
	87	which allowed byte semantics in Perl operations only if
	88	none of the program's inputs were marked as being a source of Unicode
	89	character data. Such data may come from filehandles, from calls to
	90	external programs, from information provided by the system (such as %ENV),
	91	or from literals and constants in the source text.
	92
	93	The C<bytes> pragma will always, regardless of platform, force byte
	94	semantics in a particular lexical scope. See L<bytes>.
	95
	96	The C<utf8> pragma is primarily a compatibility device that enables
	97	recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
	98	Note that this pragma is only required while Perl defaults to byte
	99	semantics; when character semantics become the default, this pragma
	100	may become a no-op. See L<utf8>.
	101
	102	Unless explicitly stated, Perl operators use character semantics
	103	for Unicode data and byte semantics for non-Unicode data.
	104	The decision to use character semantics is made transparently. If
	105	input data comes from a Unicode source--for example, if a character
	106	encoding layer is added to a filehandle or a literal Unicode
	107	string constant appears in a program--character semantics apply.
	108	Otherwise, byte semantics are in effect. The C<bytes> pragma should
	109	be used to force byte semantics on Unicode data.
	110
	111	If strings operating under byte semantics and strings with Unicode
	112	character data are concatenated, the new string will be created by
	113	decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
	114	old Unicode string used EBCDIC. This translation is done without
	115	regard to the system's native 8-bit encoding. To change this for
	116	systems with non-Latin-1 and non-EBCDIC native encodings, use the
	117	C<encoding> pragma. See L<encoding>.
	118
	119	Under character semantics, many operations that formerly operated on
	120	bytes now operate on characters. A character in Perl is
	121	logically just a number ranging from 0 to 2**31 or so. Larger
	122	characters may encode into longer sequences of bytes internally, but
	123	this internal detail is mostly hidden for Perl code.
	124	See L<perluniintro> for more.
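
The difference between the two semantics can be sketched as follows. This is
only an illustration: the variable names are invented for the example, and the
byte count shown by C<use bytes> assumes an ASCII-based platform where the
internal encoding is UTF-8.

```perl
use strict;
use warnings;

# A string holding one character above 255 uses character semantics.
my $smiley = "\x{263A}";
my $chars  = length($smiley);    # counted in characters

# The bytes pragma forces byte semantics within its lexical scope,
# exposing the length of the internal encoding instead.
my $bytes;
{
    use bytes;
    $bytes = length($smiley);
}

# Concatenating a byte string with a Unicode string upgrades the byte
# string as Latin-1: byte 0xE9 becomes the character U+00E9.
my $byte_string = "\xE9";
my $combined    = $byte_string . $smiley;
my $first_ord   = ord(substr($combined, 0, 1));
my $total_chars = length($combined);
```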
	125
	126	=head2 Effects of Character Semantics
	127
	128	Character semantics have the following effects:
	129
	130	=over 4
	131
	132	=item *
	133
	134	Strings--including hash keys--and regular expression patterns may
	135	contain characters that have an ordinal value larger than 255.
	136
	137	If you use a Unicode editor to edit your program, Unicode characters
	138	may occur directly within the literal strings in one of the various
	139	Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
	140	as such and converted to Perl's internal representation only if the
	141	appropriate L<encoding> is specified.
	142
	143	Unicode characters can also be added to a string by using the
	144	C<\x{...}> notation. The Unicode code for the desired character, in
	145	hexadecimal, should be placed in the braces. For instance, a smiley
	146	face is C<\x{263A}>. This encoding scheme only works for characters
	147	with a code of 0x100 or above.
	148
	149	Additionally, if you
	150
	151	use charnames ':full';
	152
	153	you can use the C<\N{...}> notation and put the official Unicode
	154	character name within the braces, such as C<\N{WHITE SMILING FACE}>.
	155
	156
	157	=item *
	158
	159	If an appropriate L<encoding> is specified, identifiers within the
	160	Perl script may contain Unicode alphanumeric characters, including
	161	ideographs. Perl does not currently attempt to canonicalize variable
	162	names.
	163
	164	=item *
	165
	166	Regular expressions match characters instead of bytes. "." matches
	167	a character instead of a byte. The C<\C> pattern is provided to force
	168	a match of a single byte--a C<char> in C, hence C<\C>.
	169
	170	=item *
	171
	172	Character classes in regular expressions match characters instead of
	173	bytes and match against the character properties specified in the
	174	Unicode properties database. C<\w> can be used to match a Japanese
	175	ideograph, for instance.
	176
	177	(However, and as a limitation of the current implementation, using
	178	C<\w> or C<\W> I<inside> a C<[...]> character class will still match
	179	with byte semantics.)
	180
	181	=item *
	182
	183	Named Unicode properties, scripts, and block ranges may be used like
	184	character classes via the C<\p{}> "matches property" construct and
	185	the C<\P{}> negation, "doesn't match property".
	186
	187	For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
	188	(Letter, uppercase) property, while C<\p{M}> matches any character
	189	with an "M" (mark--accents and such) property. Brackets are not
	190	required for single letter properties, so C<\p{M}> is equivalent to
	191	C<\pM>. Many predefined properties are available, such as
	192	C<\p{Mirrored}> and C<\p{Tibetan}>.
	193
	194	The official Unicode script and block names have spaces and dashes as
	195	separators, but for convenience you can use dashes, spaces, or
	196	underbars, and case is unimportant. It is recommended, however, that
	197	for consistency you use the following naming: the official Unicode
	198	script, property, or block name (see below for the additional rules
	199	that apply to block names) with whitespace and dashes removed, and the
	200	words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
	201	becomes C<Latin1Supplement>.
	202
	203	You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
	204	(^) between the first brace and the property name: C<\p{^Tamil}> is
	205	equal to C<\P{Tamil}>.
	206
	207	B<NOTE: the properties, scripts, and blocks listed here are as of
	208	Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002. Unicode 4.0.0
	209	came out in April 2003, and Perl 5.8.1 in September 2003.>
	210
	211	Here are the basic Unicode General Category properties, followed by their
	212	long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
	213	for instance, are identical.
	214
	215	    Short       Long
	216
	217	    L           Letter
	218	    LC          CasedLetter
	219	    Lu          UppercaseLetter
	220	    Ll          LowercaseLetter
	221	    Lt          TitlecaseLetter
	222	    Lm          ModifierLetter
	223	    Lo          OtherLetter
	224
	225	    M           Mark
	226	    Mn          NonspacingMark
	227	    Mc          SpacingMark
	228	    Me          EnclosingMark
	229
	230	    N           Number
	231	    Nd          DecimalNumber
	232	    Nl          LetterNumber
	233	    No          OtherNumber
	234
	235	    P           Punctuation
	236	    Pc          ConnectorPunctuation
	237	    Pd          DashPunctuation
	238	    Ps          OpenPunctuation
	239	    Pe          ClosePunctuation
	240	    Pi          InitialPunctuation
	241	                (may behave like Ps or Pe depending on usage)
	242	    Pf          FinalPunctuation
	243	                (may behave like Ps or Pe depending on usage)
	244	    Po          OtherPunctuation
	245
	246	    S           Symbol
	247	    Sm          MathSymbol
	248	    Sc          CurrencySymbol
	249	    Sk          ModifierSymbol
	250	    So          OtherSymbol
	251
	252	    Z           Separator
	253	    Zs          SpaceSeparator
	254	    Zl          LineSeparator
	255	    Zp          ParagraphSeparator
	256
	257	    C           Other
	258	    Cc          Control
	259	    Cf          Format
	260	    Cs          Surrogate   (not usable)
	261	    Co          PrivateUse
	262	    Cn          Unassigned
	263
	264	Single-letter properties match all characters in any of the
	265	two-letter sub-properties starting with the same letter.
	266	C<LC> and C<L&> are special cases, which are aliases for the set of
	267	C<Ll>, C<Lu>, and C<Lt>.
	268
	269	Because Perl hides the need for the user to understand the internal
	270	representation of Unicode characters, there is no need to implement
	271	the somewhat messy concept of surrogates. C<Cs> is therefore not
	272	supported.
	273
	274	Because scripts differ in their directionality--Hebrew is
	275	written right to left, for example--Unicode supplies these properties in
	276	the BidiClass class:
	277
	278	    Property    Meaning
	279
	280	    L           Left-to-Right
	281	    LRE         Left-to-Right Embedding
	282	    LRO         Left-to-Right Override
	283	    R           Right-to-Left
	284	    AL          Right-to-Left Arabic
	285	    RLE         Right-to-Left Embedding
	286	    RLO         Right-to-Left Override
	287	    PDF         Pop Directional Format
	288	    EN          European Number
	289	    ES          European Number Separator
	290	    ET          European Number Terminator
	291	    AN          Arabic Number
	292	    CS          Common Number Separator
	293	    NSM         Non-Spacing Mark
	294	    BN          Boundary Neutral
	295	    B           Paragraph Separator
	296	    S           Segment Separator
	297	    WS          Whitespace
	298	    ON          Other Neutrals
	299
	300	For example, C<\p{BidiClass:R}> matches characters that are normally
	301	written right to left.
	302
	303	=back
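
The notations described in this list can be tried out with a short script
such as the one below. It is a sketch only: the sample characters and the
variable names are chosen for illustration and do not come from the original
text.

```perl
use strict;
use warnings;
use charnames ':full';

# The same character written by code point and by official name.
my $by_code = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";
my $same    = $by_code eq $by_name ? 1 : 0;

# General Category and script properties.
my $cyr_a     = "\x{0410}";    # CYRILLIC CAPITAL LETTER A
my $is_upper  = $cyr_a =~ /\p{Lu}/              ? 1 : 0;
my $long_form = $cyr_a =~ /\p{UppercaseLetter}/ ? 1 : 0;
my $is_cyr    = $cyr_a =~ /\p{Cyrillic}/        ? 1 : 0;

# \pM is the brace-less form of \p{M}; U+0301 is a combining accent.
my $is_mark = "\x{0301}" =~ /\pM/ ? 1 : 0;

# A caret inside the braces negates: \p{^Tamil} equals \P{Tamil}.
my $not_tamil = ($cyr_a =~ /\p{^Tamil}/ && $cyr_a =~ /\P{Tamil}/) ? 1 : 0;
```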
	304
	305	=head2 Scripts
	306
	307	The script names which can be used by C<\p{...}> and C<\P{...}>,
	308	such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:
	309
	310	Arabic
	311	Armenian
	312	Bengali
	313	Bopomofo
	314	Buhid
	315	CanadianAboriginal
	316	Cherokee
	317	Cyrillic
	318	Deseret
	319	Devanagari
	320	Ethiopic
	321	Georgian
	322	Gothic
	323	Greek
	324	Gujarati
	325	Gurmukhi
	326	Han
	327	Hangul
	328	Hanunoo
	329	Hebrew
	330	Hiragana
	331	Inherited
	332	Kannada
	333	Katakana
	334	Khmer
	335	Lao
	336	Latin
	337	Malayalam
	338	Mongolian
	339	Myanmar
	340	Ogham
	341	OldItalic
	342	Oriya
	343	Runic
	344	Sinhala
	345	Syriac
	346	Tagalog
	347	Tagbanwa
	348	Tamil
	349	Telugu
	350	Thaana
	351	Thai
	352	Tibetan
	353	Yi
	354
	355	Extended property classes can supplement the basic
	356	properties, defined by the F<PropList> Unicode database:
	357
	358	ASCIIHexDigit
	359	BidiControl
	360	Dash
	361	Deprecated
	362	Diacritic
	363	Extender
	364	GraphemeLink
	365	HexDigit
	366	Hyphen
	367	Ideographic
	368	IDSBinaryOperator
	369	IDSTrinaryOperator
	370	JoinControl
	371	LogicalOrderException
	372	NoncharacterCodePoint
	373	OtherAlphabetic
	374	OtherDefaultIgnorableCodePoint
	375	OtherGraphemeExtend
	376	OtherLowercase
	377	OtherMath
	378	OtherUppercase
	379	QuotationMark
	380	Radical
	381	SoftDotted
	382	TerminalPunctuation
	383	UnifiedIdeograph
	384	WhiteSpace
	385
	386	and there are further derived properties:
	387
	388	    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
	389	    Lowercase       Ll + OtherLowercase
	390	    Uppercase       Lu + OtherUppercase
	391	    Math            Sm + OtherMath
	392
	393	    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
	394	    ID_Continue     ID_Start + Mn + Mc + Nd + Pc
	395
	396	    Any             Any character
	397	    Assigned        Any non-Cn character (i.e. synonym for \P{Cn})
	398	    Unassigned      Synonym for \p{Cn}
	399	    Common          Any character (or unassigned code point)
	400	                    not explicitly assigned to a script
	401
	402	For backward compatibility (with Perl 5.6), all properties mentioned
	403	so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
	404	example, is equal to C<\P{Lu}>.
	405
	406	=head2 Blocks
	407
	408	In addition to B<scripts>, Unicode also defines B<blocks> of
	409	characters. The difference between scripts and blocks is that the
	410	concept of scripts is closer to natural languages, while the concept
	411	of blocks is more of an artificial grouping based on groups of 256
	412	Unicode characters. For example, the C<Latin> script contains letters
	413	from many blocks but does not contain all the characters from those
	414	blocks. It does not, for example, contain digits, because digits are
	415	shared across many scripts. Digits and similar groups, like
	416	punctuation, are in a category called C<Common>.
	417
	418	For more about scripts, see UTR #24:
	419
	420	http://www.unicode.org/unicode/reports/tr24/
	421
	422	For more about blocks, see:
	423
	424	http://www.unicode.org/Public/UNIDATA/Blocks.txt
	425
	426	Block names are given with the C<In> prefix. For example, the
	427	Katakana block is referenced via C<\p{InKatakana}>. The C<In>
	428	prefix may be omitted if there is no naming conflict with a script
	429	or any other property, but it is recommended that C<In> always be used
	430	for block tests to avoid confusion.
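
The script/block distinction can be seen with the digit example mentioned
above. The following is a minimal sketch; the variable names are invented for
the example.

```perl
use strict;
use warnings;

# The digit "1" lies in the Basic Latin block ...
my $in_block = "1" =~ /\p{InBasicLatin}/ ? 1 : 0;

# ... but it is not part of the Latin script, because digits are
# shared across scripts and fall under Common instead.
my $in_script = "1" =~ /\p{Latin}/  ? 1 : 0;
my $in_common = "1" =~ /\p{Common}/ ? 1 : 0;

# A letter belongs to both the block and the script.
my $letter_block  = "A" =~ /\p{InBasicLatin}/ ? 1 : 0;
my $letter_script = "A" =~ /\p{Latin}/        ? 1 : 0;
```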
	431
	432	These block names are supported:
	433
	434	InAlphabeticPresentationForms
	435	InArabic
	436	InArabicPresentationFormsA
	437	InArabicPresentationFormsB
	438	InArmenian
	439	InArrows
	440	InBasicLatin
	441	InBengali
	442	InBlockElements
	443	InBopomofo
	444	InBopomofoExtended
	445	InBoxDrawing
	446	InBraillePatterns
	447	InBuhid
	448	InByzantineMusicalSymbols
	449	InCJKCompatibility
	450	InCJKCompatibilityForms
	451	InCJKCompatibilityIdeographs
	452	InCJKCompatibilityIdeographsSupplement
	453	InCJKRadicalsSupplement
	454	InCJKSymbolsAndPunctuation
	455	InCJKUnifiedIdeographs
	456	InCJKUnifiedIdeographsExtensionA
	457	InCJKUnifiedIdeographsExtensionB
	458	InCherokee
	459	InCombiningDiacriticalMarks
	460	InCombiningDiacriticalMarksforSymbols
	461	InCombiningHalfMarks
	462	InControlPictures
	463	InCurrencySymbols
	464	InCyrillic
	465	InCyrillicSupplementary
	466	InDeseret
	467	InDevanagari
	468	InDingbats
	469	InEnclosedAlphanumerics
	470	InEnclosedCJKLettersAndMonths
	471	InEthiopic
	472	InGeneralPunctuation
	473	InGeometricShapes
	474	InGeorgian
	475	InGothic
	476	InGreekExtended
	477	InGreekAndCoptic
	478	InGujarati
	479	InGurmukhi
	480	InHalfwidthAndFullwidthForms
	481	InHangulCompatibilityJamo
	482	InHangulJamo
	483	InHangulSyllables
	484	InHanunoo
	485	InHebrew
	486	InHighPrivateUseSurrogates
	487	InHighSurrogates
	488	InHiragana
	489	InIPAExtensions
	490	InIdeographicDescriptionCharacters
	491	InKanbun
	492	InKangxiRadicals
	493	InKannada
	494	InKatakana
	495	InKatakanaPhoneticExtensions
	496	InKhmer
	497	InLao
	498	InLatin1Supplement
	499	InLatinExtendedA
	500	InLatinExtendedAdditional
	501	InLatinExtendedB
	502	InLetterlikeSymbols
	503	InLowSurrogates
	504	InMalayalam
	505	InMathematicalAlphanumericSymbols
	506	InMathematicalOperators
	507	InMiscellaneousMathematicalSymbolsA
	508	InMiscellaneousMathematicalSymbolsB
	509	InMiscellaneousSymbols
	510	InMiscellaneousTechnical
	511	InMongolian
	512	InMusicalSymbols
	513	InMyanmar
	514	InNumberForms
	515	InOgham
	516	InOldItalic
	517	InOpticalCharacterRecognition
	518	InOriya
	519	InPrivateUseArea
	520	InRunic
	521	InSinhala
	522	InSmallFormVariants
	523	InSpacingModifierLetters
	524	InSpecials
	525	InSuperscriptsAndSubscripts
	526	InSupplementalArrowsA
	527	InSupplementalArrowsB
	528	InSupplementalMathematicalOperators
	529	InSupplementaryPrivateUseAreaA
	530	InSupplementaryPrivateUseAreaB
	531	InSyriac
	532	InTagalog
	533	InTagbanwa
	534	InTags
	535	InTamil
	536	InTelugu
	537	InThaana
	538	InThai
	539	InTibetan
	540	InUnifiedCanadianAboriginalSyllabics
	541	InVariationSelectors
	542	InYiRadicals
	543	InYiSyllables
	544
	545	=over 4
	546
	547	=item *
	548
	549	The special pattern C<\X> matches any extended Unicode
	550	sequence--"a combining character sequence" in Standardese--where the
	551	first character is a base character and subsequent characters are mark
	552	characters that apply to the base character. C<\X> is equivalent to
	553	C<(?:\PM\pM*)>.
	554
	555	=item *
	556
	557	The C<tr///> operator translates characters instead of bytes. Note
	558	that the C<tr///CU> functionality has been removed. For similar
	559	functionality see pack('U0', ...) and pack('C0', ...).
	560
	561	=item *
	562
	563	Case translation operators use the Unicode case translation tables
	564	when character input is provided. Note that C<uc()>, or C<\U> in
	565	interpolated strings, translates to uppercase, while C<ucfirst>,
	566	or C<\u> in interpolated strings, translates to titlecase in languages
	567	that make the distinction.
	568
	569	=item *
	570
	571	Most operators that deal with positions or lengths in a string will
	572	automatically switch to using character positions, including
	573	C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
	574	C<sprintf()>, C<write()>, and C<length()>. Operators that
	575	specifically do not switch include C<vec()>, C<pack()>, and
	576	C<unpack()>. Operators that really don't care include
	577	operators that treat strings as a bucket of bits, such as C<sort()>,
	578	and operators dealing with filenames.
	579
	580	=item *
	581
	582	The C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change,
	583	since they are often used for byte-oriented formats. Again, think
	584	C<char> in the C language.
	585
	586	There is a new C<U> specifier that converts between Unicode characters
	587	and code points.
	588
	589	=item *
	590
	591	The C<chr()> and C<ord()> functions work on characters, similar to
	592	C<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and
	593	C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
	594	emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
	595	While these methods reveal the internal encoding of Unicode strings,
	596	that is not something one normally needs to care about at all.
	597
	598	=item *
	599
	600	The bit string operators, C<& | ^ ~>, can operate on character data.
	601	However, for backward compatibility, such as when using bit string
	602	operations when characters are all less than 256 in ordinal value, one
	603	should not use C<~> (the bit complement) with characters of both
	604	values less than 256 and values greater than 256. Most importantly,
	605	DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
	606	will not hold. The reason for this mathematical I<faux pas> is that
	607	the complement cannot return B<both> the 8-bit (byte-wide) bit
	608	complement B<and> the full character-wide bit complement.
	609
	610	=item *
	611
	612	lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
	613
	614	=over 8
	615
	616	=item *
	617
	618	the case mapping is from a single Unicode character to another
	619	single Unicode character, or
	620
	621	=item *
	622
	623	the case mapping is from a single Unicode character to more
	624	than one Unicode character.
	625
	626	=back
	627
	628	Locale-specific case mappings (Lithuanian, Turkish, Azeri) do B<not> work,
	629	since Perl does not understand the concept of Unicode locales.
	630
	631	See the Unicode Technical Report #21, Case Mappings, for more details.
	632
	633	=back
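
The C<\X> pattern and the case translation behavior described in this list can
be sketched as follows. The sample characters were picked for illustration and
are not taken from the original text.

```perl
use strict;
use warnings;

# "e" followed by U+0301 (combining acute) is two characters but one
# combining character sequence, so a single \X consumes both.
my $sequence   = "e\x{0301}";
my $char_count = length($sequence);
my $one_x      = $sequence =~ /^\X$/ ? 1 : 0;

# U+01C6 (small dz with caron) has distinct upper- and titlecase forms:
# uc() yields U+01C4, ucfirst() yields the titlecase U+01C5.
my $small = "\x{01C6}";
my $upper = ord(uc($small));
my $title = ord(ucfirst($small));

# A one-to-many mapping: the ligature U+FB00 uppercases to "FF".
my $expanded = uc("\x{FB00}");
```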
	634
	635	=over 4
	636
	637	=item *
	638
	639	And finally, C<scalar reverse()> reverses by character rather than by byte.
	640
	641	=back
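
The position, C<chr()>/C<ord()>, and C<reverse()> items above might be
exercised like this. It is a small sketch under the assumption that the string
contains a character above 255; all variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "a\x{263A}b";    # three characters; the middle one is above 255

# Lengths and positions are counted in characters.
my $len    = length($text);
my $pos    = index($text, "b");
my $middle = ord(substr($text, 1, 1));

# chr() and ord() work on characters, like pack("U") and unpack("U").
my $same_char = chr(0x263A) eq pack("U", 0x263A) ? 1 : 0;
my $code      = unpack("U", "\x{263A}");

# In scalar context, reverse() reverses by character.
my $reversed = scalar reverse($text);
my $rev_ok   = $reversed eq "b\x{263A}a" ? 1 : 0;
```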
	642
	643	=head2 User-Defined Character Properties
	644
	645	You can define your own character properties by defining subroutines
	646	whose names begin with "In" or "Is". The subroutines can be defined in
	647	any package. The user-defined properties can be used in the regular
	648	expression C<\p> and C<\P> constructs; if you are using a user-defined
	649	property from a package other than the one you are in, you must specify
	650	its package in the C<\p> or C<\P> construct.
	651
	652	# assuming property IsForeign defined in Lang::
	653	package main; # property package name required
	654	if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
	655
	656	package Lang; # property package name not required
	657	if ($txt =~ /\p{IsForeign}+/) { ... }
	658
	659
	660	Note that the effect is compile-time and immutable once defined.
	661
	662	The subroutines must return a specially-formatted string, with one
	663	or more newline-separated lines. Each line must be one of the following:
	664
	665	=over 4
	666
	667	=item *
	668
	669	Two hexadecimal numbers separated by horizontal whitespace (space or
	670	tab characters) denoting a range of Unicode code points to include.
	671
	672	=item *
	673
	674	Something to include, prefixed by "+": a built-in character
	675	property (prefixed by "utf8::") or a user-defined character property,
	676	to represent all the characters in that property; two hexadecimal code
	677	points for a range; or a single hexadecimal code point.
	678
	679	=item *
	680
	681	Something to exclude, prefixed by "-": an existing character
	682	property (prefixed by "utf8::") or a user-defined character property,
	683	to represent all the characters in that property; two hexadecimal code
	684	points for a range; or a single hexadecimal code point.
	685
	686	=item *
	687
	688	Something to negate, prefixed by "!": an existing character
	689	property (prefixed by "utf8::") or a user-defined character property,
	690	to represent all the characters in that property; two hexadecimal code
	691	points for a range; or a single hexadecimal code point.
	692
	693	=item *
	694
	695	Something to intersect with, prefixed by "&": an existing character
	696	property (prefixed by "utf8::") or a user-defined character property,
	697	to keep only the characters that are also in that property; two
	698	hexadecimal code points for a range; or a single hexadecimal code point.
	699
	700	=back
	701
	702	For example, to define a property that covers both the Japanese
	703	syllabaries (hiragana and katakana), you can define
	704
	705	sub InKana {
	706	return <<END;
	707	3040\t309F
	708	30A0\t30FF
	709	END
	710	}
	711
	712	Imagine that the here-doc end marker is at the beginning of the line.
	713	Now you can use C<\p{InKana}> and C<\P{InKana}>.
	714
	715	You could also have used the existing block property names:
	716
	717	sub InKana {
	718	return <<'END';
	719	+utf8::InHiragana
	720	+utf8::InKatakana
	721	END
	722	}
	723
	724	Suppose you wanted to match only the allocated characters,
	725	not the raw block ranges: in other words, you want to remove
	726	the non-characters:
	727
	728	sub InKana {
	729	return <<'END';
	730	+utf8::InHiragana
	731	+utf8::InKatakana
	732	-utf8::IsCn
	733	END
	734	}
	735
	736	The negation is useful for defining (surprise!) negated classes.
	737
	738	sub InNotKana {
	739	return <<'END';
	740	!utf8::InHiragana
	741	-utf8::InKatakana
	742	+utf8::IsCn
	743	END
	744	}
	745
	746	Intersection is useful for getting the common characters matched by
	747	two (or more) classes.
	748
	749	sub InFooAndBar {
	750	return <<'END';
	751	+main::Foo
	752	&main::Bar
	753	END
	754	}
	755
	756	It's important to remember not to use "&" for the first set -- that
	757	would be intersecting with nothing (resulting in an empty set).
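
Putting the pieces together, a complete script using a user-defined property
might look like the sketch below. The property name C<InAssignedKana> and the
test characters are hypothetical, chosen only for this example; the test
strings all contain characters above 255 so that they are treated as Unicode
data.

```perl
use strict;
use warnings;

# Hypothetical property name: both kana blocks, minus unassigned
# code points (the "-utf8::IsCn" line).
sub InAssignedKana {
    return join "\n",
        "+utf8::InHiragana",
        "+utf8::InKatakana",
        "-utf8::IsCn",
        "";
}

# U+3042 and U+30A2 are assigned kana letters.
my $hiragana_a = "\x{3042}" =~ /\p{InAssignedKana}/ ? 1 : 0;
my $katakana_a = "\x{30A2}" =~ /\p{InAssignedKana}/ ? 1 : 0;

# U+3040 lies inside the Hiragana block but is unassigned.
my $reserved = "\x{3040}" =~ /\p{InAssignedKana}/ ? 1 : 0;

# A Cyrillic letter is matched by the negated form \P{...}.
my $non_kana = "\x{0410}" =~ /\P{InAssignedKana}/ ? 1 : 0;
```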
	758
	759	You can also define your own mappings to be used by lc(),
	760	lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
	761	The principle is the same: define subroutines in the C<main> package
	762	with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
	763	the first character in ucfirst()), and C<ToUpper> (for uc(), and the
	764	rest of the characters in ucfirst()).
	765
	766	The string returned by these subroutines must consist of three
	767	hexadecimal numbers separated by tabs: start of the source
	768	range, end of the source range, and start of the destination range.
	769	For example:
	770
	771	sub ToUpper {
	772	return <<END;
	773	0061\t0063\t0041
	774	END
	775	}
	776
	777	defines a uc() mapping that causes only the characters "a", "b", and
	778	"c" to be mapped to "A", "B", and "C"; all other characters remain
	779	unchanged.
	780
	781	If there is no source range to speak of, that is, the mapping is from
	782	a single character to another single character, leave the end of the
	783	source range empty, but keep both tab characters.
	784	For example:
	785
	786	sub ToLower {
	787	return <<END;
	788	0041\t\t0061
	789	END
	790	}
	791
	792	defines a lc() mapping that causes only "A" to be mapped to "a"; all
	793	other characters remain unchanged.
	794
	795	(For serious hackers only) If you want to introspect the default
	796	mappings, you can find the data in the directory
	797	C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as
	798	the here-document, and the C<utf8::ToSpecFoo> are special exception
	799	mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
	800	The C<Digit> and C<Fold> mappings that one can see in the directory
	801	are not directly user-accessible; instead, one can use either the
	802	C<Unicode::UCD> module, or just match case-insensitively (that's when
	803	the C<Fold> mapping is used).
	804
	805	A final note on the user-defined property tests and mappings: they
	806	will be used only if the scalar has been marked as having Unicode
	807	characters. Old byte-style strings will not be affected.
	808
	809	=head2 Character Encodings for Input and Output
	810
	811	See L<Encode>.
	812
	813	=head2 Unicode Regular Expression Support Level
	814
	815	The following list of Unicode support for regular expressions describes
	816	all the features currently supported. The references to "Level N"
	817	and the section numbers refer to the Unicode Technical Report 18,
	818	"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
	819	Perl 5.8.0).
	820
	821	=over 4
	822
	823	=item *
	824
	825	Level 1 - Basic Unicode Support
	826
	827	2.1 Hex Notation - done [1]
	828	Named Notation - done [2]
	829	2.2 Categories - done [3][4]
	830	2.3 Subtraction - MISSING [5][6]
	831	2.4 Simple Word Boundaries - done [7]
	832	2.5 Simple Loose Matches - done [8]
	833	2.6 End of Line - MISSING [9][10]
	834
	835	[ 1] \x{...}
	836	[ 2] \N{...}
	837	[ 3] . \p{...} \P{...}
	838	[ 4] support for scripts (see UTR#24 Script Names), blocks,