You are viewing the version of this documentation from Perl 5.12.0.

NAME

Unicode::UCD - Unicode character database

SYNOPSIS

use Unicode::UCD 'charinfo';
my $charinfo   = charinfo($codepoint);

use Unicode::UCD 'casefold';
my $casefold = casefold(0xFB00);

use Unicode::UCD 'casespec';
my $casespec = casespec(0xFB00);

use Unicode::UCD 'charblock';
my $charblock  = charblock($codepoint);

use Unicode::UCD 'charscript';
my $charscript = charscript($codepoint);

use Unicode::UCD 'charblocks';
my $charblocks = charblocks();

use Unicode::UCD 'charscripts';
my $charscripts = charscripts();

use Unicode::UCD qw(charscript charinrange);
my $range = charscript($script);
print "looks like $script\n" if charinrange($range, $codepoint);

use Unicode::UCD qw(general_categories bidi_types);
my $categories = general_categories();
my $types = bidi_types();

use Unicode::UCD 'compexcl';
my $compexcl = compexcl($codepoint);

use Unicode::UCD 'namedseq';
my $namedseq = namedseq($named_sequence_name);

my $unicode_version = Unicode::UCD::UnicodeVersion();

DESCRIPTION

The Unicode::UCD module offers a series of functions that provide a simple interface to the Unicode Character Database.

code point argument

Some of the functions take a code point argument. This is either a decimal or hexadecimal scalar designating a Unicode code point, or a string consisting of U+ followed by hexadecimal digits. To have a value interpreted as hexadecimal, you must prefix it with either 0x or U+; a bare string such as 123 is interpreted as a decimal code point. Also note that Unicode is not limited to 16 bits (the number of Unicode code points is open-ended, in theory unlimited): a code point may have more than four hexdigits.
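As an illustration of the accepted forms, the following sketch passes the same code point to charinfo() in several notations and then contrasts a bare decimal with a U+ string. The variable names are chosen here for the example only.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# All four arguments designate U+0041.
my @forms = (65, 0x41, "0x41", "U+0041");
my @names = map { charinfo($_)->{name} } @forms;
print "$_\n" for @names;

# Without a prefix, 123 is read as decimal (U+007B), not as U+0123.
my $decimal = charinfo(123);
my $hex     = charinfo("U+123");
printf "%s vs %s\n", $decimal->{code}, $hex->{code};   # 007B vs 0123
```

Note that the Perl literal 0x41 is already the number 65 by the time the function sees it; only the string forms rely on the module's own prefix handling.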

charinfo()

use Unicode::UCD 'charinfo';

my $charinfo = charinfo(0x41);

This returns information about the input "code point argument" as a reference to a hash of fields as defined by the Unicode standard. If the "code point argument" is not assigned in the standard (i.e., has the general category Cn meaning Unassigned) or is a non-character (meaning it is guaranteed to never be assigned in the standard), undef is returned.

Fields that aren't applicable to the particular code point argument exist in the returned hash, and are empty.

The keys in the hash with the meanings of their values are:

code

the input "code point argument" expressed in hexadecimal, with leading zeros added if necessary to make it contain at least four hexdigits

name

name of code, all IN UPPER CASE. Some control-type code points do not have names. This field will be empty for Surrogate and Private Use code points, and for the others without a name, it will contain a description enclosed in angle brackets, like <control>.

category

The short name of the general category of code. This will match one of the keys in the hash returned by "general_categories()".

combining

the combining class number for code used in the Canonical Ordering Algorithm. For Unicode 5.1, this is described in Section 3.11 Canonical Ordering Behavior available at http://www.unicode.org/versions/Unicode5.1.0/

bidi

bidirectional type of code. This will match one of the keys in the hash returned by "bidi_types()".

decomposition

is empty if code has no decomposition; or is one or more codes (separated by spaces) that taken in order represent a decomposition for code. Each has at least four hexdigits. The codes may be preceded by a word enclosed in angle brackets and then a space, like <compat>, giving the type of decomposition.

decimal

if code is a decimal digit this is its integer numeric value

digit

if code represents a whole number, this is its integer numeric value

numeric

if code represents a whole or rational number, this is its numeric value. Rational values are expressed as a string like 1/4.

mirrored

Y or N designating if code is mirrored in bidirectional text

unicode10

name of code in the Unicode 1.0 standard if one existed for this code point and is different from the current name

comment

ISO 10646 comment field. It appears in parentheses in the ISO 10646 names list, or contains an asterisk to indicate there is a note for this code point in Annex P of that standard.

upper

is empty if there is no single code point uppercase mapping for code; otherwise it is that mapping expressed as at least four hexdigits. ("casespec()" should be used in addition to charinfo() for case mappings when the calling program can cope with multiple code point mappings.)

lower

is empty if there is no single code point lowercase mapping for code; otherwise it is that mapping expressed as at least four hexdigits. ("casespec()" should be used in addition to charinfo() for case mappings when the calling program can cope with multiple code point mappings.)

title

is empty if there is no single code point titlecase mapping for code; otherwise it is that mapping expressed as at least four hexdigits. ("casespec()" should be used in addition to charinfo() for case mappings when the calling program can cope with multiple code point mappings.)

block

block code belongs to (used in \p{In...}). See "Blocks versus Scripts".

script

script code belongs to. See "Blocks versus Scripts".

Note that you cannot do (de)composition and casing based solely on the decomposition, combining, lower, upper, and title fields; you will also need the "compexcl()" and "casespec()" functions.
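The fields described above can be inspected as in the following sketch. It assumes only the behavior documented in this section; the chosen code points and variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

my $info = charinfo(0x41) or die "U+0041 should be assigned";
printf "U+%s %s: category=%s lower=%s block=%s script=%s\n",
    @{$info}{qw(code name category lower block script)};

# Inapplicable fields exist but are empty: 'A' is not a decimal digit.
my $decimal_is_empty = exists $info->{decimal} && $info->{decimal} eq '';

# decimal, digit and numeric are progressively more inclusive.
my %numeric;
for my $cp (0x32, 0xB2, 0xBC) {    # DIGIT TWO, SUPERSCRIPT TWO, ONE QUARTER
    my $ci = charinfo($cp);
    $numeric{$cp} = join '|', @{$ci}{qw(decimal digit numeric)};
}

# A non-character such as U+FFFF yields undef rather than a hash.
my $nonchar = charinfo(0xFFFF);
print "U+FFFF has no charinfo\n" unless defined $nonchar;
```

Because charinfo() may return undef, callers should check the result before dereferencing it, as the first line of the sketch does.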

charblock()

use Unicode::UCD 'charblock';

my $charblock = charblock(0x41);
my $charblock = charblock(1234);
my $charblock = charblock(0x263a);
my $charblock = charblock("U+263a");

my $range     = charblock('Armenian');

With a "code point argument" charblock() returns the block the code point belongs to, e.g. Basic Latin. If the code point is unassigned, this returns the block it would belong to if it were assigned (which it may be in future versions of the Unicode Standard).

See also "Blocks versus Scripts".

If supplied with an argument that can't be a code point, charblock() tries to do the opposite and interpret the argument as a code point block. The return value is a range: an anonymous list of lists that contain start-of-range, end-of-range code point pairs. You can test whether a code point is in a range using the "charinrange()" function. If the argument is not a known code point block, undef is returned.
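Both directions of the lookup can be sketched as follows. The example assumes that each element of the returned range begins with the start and end code points, as described above; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock charinrange);

# Forward lookup: code point to block name.
my $block = charblock("U+0531");    # ARMENIAN CAPITAL LETTER AYB

# Reverse lookup: block name to a reference to a list of ranges.
my $range = charblock('Armenian');
my ($start, $end) = @{ $range->[0] };    # first start/end pair
printf "%s spans U+%04X..U+%04X\n", $block, $start, $end;

# Membership tests against the range.
my $inside  = charinrange($range, 0x0531) ? 1 : 0;
my $outside = charinrange($range, 0x0041) ? 1 : 0;
print "U+0531 inside: $inside, U+0041 inside: $outside\n";
```

A block name made up entirely of hexadecimal digits would be taken as a code point, so the reverse lookup is only reliable for names that cannot be parsed as one.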

charscript()

use Unicode::UCD 'charscript';

my $charscript = charscript(0x41);
my $charscript = charscript(1234);
my $charscript = charscript("U+263a");

my $range      = charscript('Thai');

With a "code point argument" charscript() returns the script the code point belongs to, e.g. Latin, Greek, Han. If the code point is unassigned, it returns undef.

If supplied with an argument that can't be a code point, charscript() tries to do the opposite and interpret the argument as a code point script. The return value is a range: an anonymous list of lists that contain start-of-range, end-of-range code point pairs. You can test whether a code point is in a range using the "charinrange()" function. If the argument is not a known code point script, undef is returned.

See also "Blocks versus Scripts".
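The following sketch shows charscript() in both directions and uses one code point, the Thai baht sign, to illustrate how a block and a script can disagree. The variable names are chosen for the example.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript charblock charinrange);

# Forward lookup: code point to script name.
my $script = charscript(0x0E01);    # THAI CHARACTER KO KAI

# Reverse lookup: script name to a reference to a list of ranges.
my $thai = charscript('Thai');
my $ko_kai_in_thai = charinrange($thai, 0x0E01) ? 1 : 0;

# U+0E3F THAI CURRENCY SYMBOL BAHT lies in the Thai block,
# but its script is Common, so it is not in the Thai script ranges.
my $baht_block   = charblock(0x0E3F);
my $baht_script  = charscript(0x0E3F);
my $baht_in_thai = charinrange($thai, 0x0E3F) ? 1 : 0;

print "U+0E3F: block=$baht_block script=$baht_script\n";
```

A script is usually spread over several ranges, so membership should be tested with charinrange() rather than by comparing against the first pair only.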

charblocks()

use Unicode::UCD 'charblocks';

my $charblocks = charblocks();

charblocks() returns a reference to a hash with the known block names as the keys, and the code point ranges (see "charblock()") as the values.

See also "Blocks versus Scripts".
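As a small illustration of the returned structure, this sketch scans every block's ranges to find the one containing a given code point. That is what charblock() does directly, so the scan is only a way to show how the hash is laid out. The variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblocks charblock charinrange);

my $blocks     = charblocks();
my $code_point = 0x263A;    # WHITE SMILING FACE

# Each value is a range usable with charinrange().
my ($found) = grep { charinrange($blocks->{$_}, $code_point) }
              sort keys %$blocks;

printf "U+%04X is in %s\n", $code_point, $found;
```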

charscripts()

use Unicode::UCD 'charscripts';

my $charscripts = charscripts();

charscripts() returns a reference to a hash with the known script names as the keys, and the code point ranges (see "charscript()") as the values.

See also "Blocks versus Scripts".
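One possible use of the hash is to total the code points covered by each script. This is a minimal sketch: it assumes each range element starts with the start and end code points, and the %size hash and @largest list are names introduced for the example. The resulting figures depend on the Unicode version bundled with the running Perl.

```perl
use strict;
use warnings;
use Unicode::UCD 'charscripts';

my $scripts = charscripts();

# Sum the size of every range belonging to each script.
my %size;
for my $name (keys %$scripts) {
    $size{$name} += $_->[1] - $_->[0] + 1 for @{ $scripts->{$name} };
}

# Report the three scripts covering the most code points.
my @largest = (sort { $size{$b} <=> $size{$a} || $a cmp $b } keys %size)[0 .. 2];
print "$_: $size{$_} code points\n" for @largest;
```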