You are viewing the version of this documentation from Perl 5.16.0.

NAME

Unicode::UCD - Unicode character database

SYNOPSIS

use Unicode::UCD 'charinfo';
my $charinfo   = charinfo($codepoint);

use Unicode::UCD 'casefold';
my $casefold = casefold(0xFB00);

use Unicode::UCD 'casespec';
my $casespec = casespec(0xFB00);

use Unicode::UCD 'charblock';
my $charblock  = charblock($codepoint);

use Unicode::UCD 'charscript';
my $charscript = charscript($codepoint);

use Unicode::UCD 'charblocks';
my $charblocks = charblocks();

use Unicode::UCD 'charscripts';
my $charscripts = charscripts();

use Unicode::UCD qw(charscript charinrange);
my $range = charscript($script);
print "looks like $script\n" if charinrange($range, $codepoint);

use Unicode::UCD qw(general_categories bidi_types);
my $categories = general_categories();
my $types = bidi_types();

use Unicode::UCD 'prop_aliases';
my @space_names = prop_aliases("space");

use Unicode::UCD 'prop_value_aliases';
my @gc_punct_names = prop_value_aliases("Gc", "Punct");

use Unicode::UCD 'prop_invlist';
my @puncts = prop_invlist("gc=punctuation");

use Unicode::UCD 'prop_invmap';
my ($list_ref, $map_ref, $format, $missing)
                                  = prop_invmap("General Category");

use Unicode::UCD 'compexcl';
my $compexcl = compexcl($codepoint);

use Unicode::UCD 'namedseq';
my $namedseq = namedseq($named_sequence_name);

my $unicode_version = Unicode::UCD::UnicodeVersion();

my $convert_to_numeric =
          Unicode::UCD::num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");

DESCRIPTION

The Unicode::UCD module offers a series of functions that provide a simple interface to the Unicode Character Database.

code point argument

Some of the functions take a code point argument: either a decimal or hexadecimal scalar designating a Unicode code point, or the string U+ followed by hexadecimal digits. To have a value interpreted as hexadecimal, you must prefix it with either 0x or U+; a string such as 123 is interpreted as a decimal code point. Note that the largest code point in Unicode is U+10FFFF.
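The following is a minimal sketch of how the accepted forms are interpreted. It uses charinfo() (documented below) only to show which code point was actually looked up; the variable names and the choice of U+0041 are illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# Three ways to designate U+0041; all refer to the same code point.
my $from_hex_number = charinfo(0x41)->{name};      # Perl hex literal (the number 65)
my $from_decimal    = charinfo(65)->{name};        # decimal number
my $from_u_plus     = charinfo("U+0041")->{name};  # "U+" string form

# Without a 0x or U+ prefix, the string "41" is read as decimal 41 (U+0029).
my $from_bare_string = charinfo("41")->{name};

print "$_\n"
    for $from_hex_number, $from_decimal, $from_u_plus, $from_bare_string;
```

The last call is the case to watch for: digits copied from a code chart without a prefix silently select a different character.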

charinfo()

use Unicode::UCD 'charinfo';

my $charinfo = charinfo(0x41);

This returns information about the input "code point argument" as a reference to a hash of fields as defined by the Unicode standard. If the "code point argument" is not assigned in the standard (i.e., has the general category Cn meaning Unassigned) or is a non-character (meaning it is guaranteed to never be assigned in the standard), undef is returned.

Fields that aren't applicable to the particular code point argument exist in the returned hash, and are empty.

The keys in the hash with the meanings of their values are:

code

the input "code point argument" expressed in hexadecimal, with leading zeros added if necessary to make it contain at least four hexdigits

name

name of code, all IN UPPER CASE. Some control-type code points do not have names. This field will be empty for Surrogate and Private Use code points, and for the others without a name, it will contain a description enclosed in angle brackets, like <control>.

category

The short name of the general category of code. This will match one of the keys in the hash returned by "general_categories()".

The "prop_value_aliases()" function can be used to get all the synonyms of the category name.

combining

the combining class number for code used in the Canonical Ordering Algorithm. For Unicode 5.1, this is described in Section 3.11 Canonical Ordering Behavior available at http://www.unicode.org/versions/Unicode5.1.0/

The "prop_value_aliases()" function can be used to get all the synonyms of the combining class number.

bidi

bidirectional type of code. This will match one of the keys in the hash returned by "bidi_types()".

The "prop_value_aliases()" function can be used to get all the synonyms of the bidi type name.

decomposition

is empty if code has no decomposition; or is one or more codes (separated by spaces) that, taken in order, represent a decomposition for code. Each has at least four hexdigits. The codes may be preceded by a word enclosed in angle brackets, then a space, like <compat>, giving the type of decomposition.

This decomposition may be an intermediate one whose components are also decomposable. Use Unicode::Normalize to get the final decomposition.

decimal

if code is a decimal digit this is its integer numeric value

digit

if code represents some other digit-like number, this is its integer numeric value

numeric

if code represents a whole or rational number, this is its numeric value. Rational values are expressed as a string like 1/4.

mirrored

Y or N designating if code is mirrored in bidirectional text

unicode10

name of code in the Unicode 1.0 standard if one existed for this code point and is different from the current name

comment

As of Unicode 6.0, this is always empty.

upper

is empty if there is no single code point uppercase mapping for code (its uppercase mapping is itself); otherwise it is that mapping expressed as at least four hexdigits. ("casespec()" should be used in addition to charinfo() for case mappings when the calling program can cope with multiple code point mappings.)

lower

is empty if there is no single code point lowercase mapping for code (its lowercase mapping is itself); otherwise it is that mapping expressed as at least four hexdigits. ("casespec()" should be used in addition to charinfo() for case mappings when the calling program can cope with multiple code point mappings.)

title

is empty if there is no single code point titlecase mapping for code (its titlecase mapping is itself); otherwise it is that mapping expressed as at least four hexdigits. ("casespec()" should be used in addition to charinfo() for case mappings when the calling program can cope with multiple code point mappings.)

block

the block code belongs to (used in \p{Blk=...}). See "Blocks versus Scripts".

script

the script code belongs to. See "Blocks versus Scripts".

Note that you cannot do (de)composition and casing based solely on the decomposition, combining, lower, upper, and title fields; you will also need the "compexcl()" and "casespec()" functions.
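As a rough illustration of the hash described above, the sketch below reads a few fields for two code points and checks the undef return for a non-character. The code points (U+00BC, U+0041, U+FFFF) are chosen only as examples.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# U+00BC VULGAR FRACTION ONE QUARTER has a rational numeric value
# and a decomposition tagged with its type.
my $info = charinfo(0xBC);
printf "%s %s (%s)\n", $info->{code}, $info->{name}, $info->{category};
print "numeric:       $info->{numeric}\n";
print "decomposition: $info->{decomposition}\n";

# Inapplicable fields exist but are empty: "A" is its own uppercase.
my $cap_a = charinfo(0x41);
print "lower of A: $cap_a->{lower}\n";
print "upper of A is empty\n"
    if exists $cap_a->{upper} && $cap_a->{upper} eq '';

# Non-characters such as U+FFFF yield undef rather than a hash reference.
my $nonchar = charinfo(0xFFFF);
print "U+FFFF: no information\n" unless defined $nonchar;
```

Because charinfo() can return undef, code that handles arbitrary input should check the return value before dereferencing it.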

charblock()

use Unicode::UCD 'charblock';

my $charblock = charblock(0x41);        # hexadecimal number
$charblock    = charblock(1234);        # decimal number
$charblock    = charblock(0x263a);      # hexadecimal number
$charblock    = charblock("U+263a");    # "U+" string

my $range     = charblock('Armenian');

With a "code point argument" charblock() returns the block the code point belongs to, e.g. Basic Latin. The old-style block name is returned (see "Old-style versus new-style block names"). If the code point is unassigned, this returns the block it would belong to if it were assigned.

See also "Blocks versus Scripts".

If supplied with an argument that can't be a code point, charblock() does the opposite and tries to interpret the argument as an old-style block name. The return value is a range set with one range: an anonymous array with a single element, itself an anonymous array whose first element is the first code point in the block and whose second (and final) element is the last code point in the block. (The extra one-element array lets the same program logic handle both this return value and the one from "charscript()", which can have multiple ranges.) You can test whether a code point is in a range set using the "charinrange()" function. If the argument is not a known block, undef is returned.
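A small sketch of both directions of charblock(), assuming only the behavior described above. The variable names are illustrative, and only the first two elements of the inner array are read.

```perl
use strict;
use warnings;
use Unicode::UCD 'charblock';

# Code point to block name.
my $block_of_a = charblock(0x41);
print "U+0041 is in: $block_of_a\n";

# Block name to range set: one [first, last] pair inside an outer array.
my $range = charblock('Armenian');
my ($first, $last) = @{ $range->[0] }[0, 1];
printf "Armenian spans U+%04X..U+%04X\n", $first, $last;

# An unassigned code point still reports the block that contains it.
print "U+0378 would be in: ", charblock(0x378), "\n";
```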

charscript()

use Unicode::UCD 'charscript';

my $charscript = charscript(0x41);       # hexadecimal number
$charscript    = charscript(1234);       # decimal number
$charscript    = charscript("U+263a");   # "U+" string

my $range      = charscript('Thai');

With a "code point argument" charscript() returns the script the code point belongs to, e.g. Latin, Greek, Han. If the code point is unassigned, it returns "Unknown".

If supplied with an argument that can't be a code point, charscript() does the opposite and tries to interpret the argument as a script name. The return value is a range set: an anonymous array of arrays, each containing a start-of-range, end-of-range code point pair. You can test whether a code point is in a range set using the "charinrange()" function. If the argument is not a known script, undef is returned.

See also "Blocks versus Scripts".
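The sketch below shows both uses of charscript() together with charinrange(). The Thai examples are chosen because the baht sign lies inside the Thai block but is assigned to the Common script, so it makes the block/script difference visible; variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript charinrange);

# Code point to script name.
my $script_of_a = charscript(0x41);
print "U+0041 is written in: $script_of_a\n";

# Unassigned code points report "Unknown".
my $unassigned = charscript(0x378);
print "U+0378 is: $unassigned\n";

# Script name to range set; a script may cover several disjoint ranges.
my $thai = charscript('Thai');
printf "U+%04X..U+%04X\n", $_->[0], $_->[1] for @$thai;

# U+0E01 THAI CHARACTER KO KAI is Thai; U+0E3F THAI CURRENCY SYMBOL BAHT
# is in the Thai block but belongs to the Common script.
my $ko_kai_is_thai = charinrange($thai, 0x0E01) ? 1 : 0;
my $baht_is_thai   = charinrange($thai, 0x0E3F) ? 1 : 0;
print "ko kai: $ko_kai_is_thai, baht: $baht_is_thai\n";
```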

charblocks()

use Unicode::UCD 'charblocks';

my $charblocks = charblocks();

charblocks() returns a reference to a hash with the known block names as the keys, and the code point ranges (see "charblock()") as the values.

The names are in the old-style (see "Old-style versus new-style block names").

prop_invmap("block") can be used to get this same data in a different type of data structure.

See also "Blocks versus Scripts".
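As an illustration of the returned structure, this sketch looks up one block directly and then lists the first few blocks in code point order. It assumes each value holds a single range, as described for "charblock()"; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD 'charblocks';

my $blocks = charblocks();

# Each value is a range set like the one charblock() returns for a name.
my ($first, $last) = @{ $blocks->{'Basic Latin'}[0] }[0, 1];
printf "Basic Latin: U+%04X..U+%04X\n", $first, $last;

# List the first few blocks, ordered by their starting code point.
my @by_start =
    sort { $blocks->{$a}[0][0] <=> $blocks->{$b}[0][0] } keys %$blocks;
print "$_\n" for @by_start[0 .. 4];
```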

charscripts()

use Unicode::UCD 'charscripts';

my $charscripts = charscripts();

charscripts() returns a reference to a hash with the known script names as the keys, and the code point ranges (see "charscript()") as the values.

prop_invmap("script") can be used to get this same data in a different type of data structure.

See also "Blocks versus Scripts".
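A short sketch of walking the hash returned by charscripts(). The scan at the end is only meant to show the layout of the data; calling charscript() with the code point would be the simpler way to get the same answer. The script names and the probe code point are arbitrary examples.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscripts charinrange);

my $scripts = charscripts();

# Count how many disjoint ranges make up each of a few scripts.
for my $name (qw(Latin Greek Hiragana)) {
    my $ranges = $scripts->{$name};
    printf "%-8s %d range(s)\n", $name, scalar @$ranges;
}

# Scan every range set to find the script containing U+3042
# (HIRAGANA LETTER A).
my ($found) =
    grep { charinrange($scripts->{$_}, 0x3042) } sort keys %$scripts;
print "U+3042 is in: $found\n";
```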

charinrange()

In addition to using the \p{Blk=...} and \P{Blk=...} constructs, you can test whether a code point is in a range set returned by "charblock()" or "charscript()", or in one of the values of the hashes returned by "charblocks()" and "charscripts()", by using charinrange():

use Unicode::UCD qw(charscript charinrange);

my $range = charscript('Hiragana');
print "looks like hiragana\n" if charinrange($range, $codepoint);
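To make the shape of a range set concrete, the sketch below compares charinrange() with a hand-written linear scan. The helper in_range_set() is hypothetical and not part of Unicode::UCD; it assumes only that each inner array starts with a first and last code point. The probe values are arbitrary.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript charinrange);

# Hypothetical helper (not part of Unicode::UCD): linear scan of a range set.
sub in_range_set {
    my ($range_set, $code) = @_;
    for my $range (@$range_set) {
        return 1 if $code >= $range->[0] && $code <= $range->[1];
    }
    return 0;
}

my $hiragana = charscript('Hiragana');

# Keep the probes on which the module and the helper give the same answer.
my @probes = (0x3040, 0x3041, 0x3042, 0x3096, 0x30A2);
my @agree  = grep {
    (charinrange($hiragana, $_) ? 1 : 0) == in_range_set($hiragana, $_)
} @probes;
printf "%d of %d probes agree\n", scalar @agree, scalar @probes;
```

The result of charinrange() is normalized to 1 or 0 here because only its truth value is documented; in practice, prefer charinrange() over a hand-rolled scan.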

general_categories()

use Unicode::UCD 'general_categories';

my $categories = general_categories();

This returns a reference to a hash which has short general category names (such as Lu, Nd, Zs, S) as keys and long names (such as UppercaseLetter, DecimalNumber, SpaceSeparator, Symbol) as values. The hash is reversible in case you need to go from the long names to the short names. The general category is the one returned from "charinfo()" under the category key.

The "prop_value_aliases()" function can be used to get all the synonyms of the category name.
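A minimal sketch of using the hash in both directions, starting from the category field of "charinfo()". The variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo general_categories);

my $categories = general_categories();

# Short name to long name, starting from charinfo()'s category field.
my $short = charinfo(0x41)->{category};
my $long  = $categories->{$short};
print "$short => $long\n";

# Reverse the hash to go from long names back to short names.
my %short_for = reverse %$categories;
print "DecimalNumber => $short_for{DecimalNumber}\n";
```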