Unicode::UCD - Unicode character database
use Unicode::UCD 'charinfo';
my $charinfo = charinfo($codepoint);
use Unicode::UCD 'casefold';
my $casefold = casefold(0xFB00);
use Unicode::UCD 'casespec';
my $casespec = casespec(0xFB00);
use Unicode::UCD 'charblock';
my $charblock = charblock($codepoint);
use Unicode::UCD 'charscript';
my $charscript = charscript($codepoint);
use Unicode::UCD 'charblocks';
my $charblocks = charblocks();
use Unicode::UCD 'charscripts';
my $charscripts = charscripts();
use Unicode::UCD qw(charscript charinrange);
my $range = charscript($script);
print "looks like $script\n" if charinrange($range, $codepoint);
use Unicode::UCD qw(general_categories bidi_types);
my $categories = general_categories();
my $types = bidi_types();
use Unicode::UCD 'prop_aliases';
my @space_names = prop_aliases("space");
use Unicode::UCD 'prop_value_aliases';
my @gc_punct_names = prop_value_aliases("Gc", "Punct");
use Unicode::UCD 'prop_invlist';
my @puncts = prop_invlist("gc=punctuation");
use Unicode::UCD 'prop_invmap';
my ($list_ref, $map_ref, $format, $missing)
    = prop_invmap("General Category");
use Unicode::UCD 'compexcl';
my $compexcl = compexcl($codepoint);
use Unicode::UCD 'namedseq';
my $namedseq = namedseq($named_sequence_name);
my $unicode_version = Unicode::UCD::UnicodeVersion();
my $convert_to_numeric =
    Unicode::UCD::num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");
The Unicode::UCD module offers a series of functions that provide a simple interface to the Unicode Character Database.
Some of the functions are called with a code point argument, which is either a decimal or a hexadecimal scalar designating a Unicode code point, or U+ followed by hexadecimal digits designating a Unicode code point. In other words, if you want a code point to be interpreted as a hexadecimal number, you must prefix it with either 0x or U+, because a string such as 123 will be interpreted as a decimal code point. Note that the largest code point in Unicode is U+10FFFF.
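The argument forms described above can be compared side by side. This is a minimal sketch, assuming the charinfo() function documented below and its code field; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# Three spellings of the same code point, U+007B (decimal 123).
my $from_decimal = charinfo(123);         # plain digits are read as decimal
my $from_hex     = charinfo(0x7B);        # Perl hex literal, the number 123
my $from_uplus   = charinfo("U+007B");    # U+ prefix means hexadecimal

# With a prefix, the digits 123 are read as hexadecimal instead.
my $prefixed = charinfo("U+123");

print "$_->{code}\n" for $from_decimal, $from_hex, $from_uplus, $prefixed;
```

The first three calls should report the same code field, while the last one refers to a different code point, U+0123.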
use Unicode::UCD 'charinfo';
my $charinfo = charinfo(0x41);
This returns information about the input "code point argument" as a reference to a hash of fields as defined by the Unicode standard. If the "code point argument" is not assigned in the standard (i.e., has the general category Cn, meaning Unassigned) or is a non-character (meaning it is guaranteed never to be assigned in the standard), undef is returned.
Fields that aren't applicable to the particular code point argument exist in the returned hash, and are empty.
The keys in the hash with the meanings of their values are:
code - the input "code point argument" expressed in hexadecimal, with leading zeros added if necessary to make it contain at least four hexdigits
name - name of code, all IN UPPER CASE. Some control-type code points do not have names. This field will be empty for Surrogate and Private Use code points, and for the others without a name, it will contain a description enclosed in angle brackets, like <control>.
category - the short name of the general category of code. This will match one of the keys in the hash returned by "general_categories()". The "prop_value_aliases()" function can be used to get all the synonyms of the category name.
combining - the combining class number for code used in the Canonical Ordering Algorithm. For Unicode 5.1, this is described in Section 3.11 Canonical Ordering Behavior, available at http://www.unicode.org/versions/Unicode5.1.0/. The "prop_value_aliases()" function can be used to get all the synonyms of the combining class number.
bidi - bidirectional type of code. This will match one of the keys in the hash returned by "bidi_types()". The "prop_value_aliases()" function can be used to get all the synonyms of the bidi type name.
decomposition - is empty if code has no decomposition; or is one or more codes (separated by spaces) that, taken in order, represent a decomposition for code. Each has at least four hexdigits. The codes may be preceded by a word enclosed in angle brackets then a space, like <compat>, giving the type of decomposition. This decomposition may be an intermediate one whose components are also decomposable. Use Unicode::Normalize to get the final decomposition.
decimal - if code is a decimal digit this is its integer numeric value
digit - if code represents some other digit-like number, this is its integer numeric value
numeric - if code represents a whole or rational number, this is its numeric value. Rational values are expressed as a string like 1/4.
mirrored - Y or N, designating if code is mirrored in bidirectional text
unicode10 - name of code in the Unicode 1.0 standard if one existed for this code point and is different from the current name
comment - As of Unicode 6.0, this is always empty.
upper - is empty if there is no single code point uppercase mapping for code (its uppercase mapping is itself); otherwise it is that mapping expressed as at least four hexdigits. ("casespec()" should be used in addition to charinfo() for case mappings when the calling program can cope with multiple code point mappings.)
lower - is empty if there is no single code point lowercase mapping for code (its lowercase mapping is itself); otherwise it is that mapping expressed as at least four hexdigits. ("casespec()" should be used in addition to charinfo() for case mappings when the calling program can cope with multiple code point mappings.)
title - is empty if there is no single code point titlecase mapping for code (its titlecase mapping is itself); otherwise it is that mapping expressed as at least four hexdigits. ("casespec()" should be used in addition to charinfo() for case mappings when the calling program can cope with multiple code point mappings.)
block - the block code belongs to (used in \p{Blk=...}). See "Blocks versus Scripts".
script - the script code belongs to. See "Blocks versus Scripts".
Note that you cannot do (de)composition and casing based solely on the decomposition, combining, lower, upper, and title fields; you will also need the "compexcl()" and "casespec()" functions.
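As a rough illustration of the returned hash, the following sketch reads a few fields for some well-known code points. The choice of code points and the variable names are assumptions made for the example, not part of the module.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

my $info = charinfo(0x41);               # LATIN CAPITAL LETTER A
printf "U+%s %s (%s), block %s, script %s\n",
    @{$info}{qw(code name category block script)};

# Single code point case mappings; an empty field means "maps to itself".
my $lower = $info->{lower};              # hex string of the lowercase mapping
my $upper = $info->{upper};              # empty: 'A' is already uppercase

# Other fields are filled in only where they apply.
my $decomposition = charinfo(0xE9)->{decomposition};  # U+00E9, e with acute
my $numeric       = charinfo(0xBC)->{numeric};        # U+00BC, one quarter
my $mirrored      = charinfo(0x28)->{mirrored};       # U+0028, left parenthesis

# A non-character is never assigned, so no hash is returned for it.
my $none = charinfo(0xFFFF);
print "U+FFFF has no charinfo\n" unless defined $none;
```

Testing the return value with defined before dereferencing it is a reasonable habit whenever the input may be unassigned.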
use Unicode::UCD 'charblock';
my $charblock = charblock(0x41);
my $charblock = charblock(1234);
my $charblock = charblock(0x263a);
my $charblock = charblock("U+263a");
my $range = charblock('Armenian');
With a "code point argument" charblock() returns the block the code point belongs to, e.g. Basic Latin. The old-style block name is returned (see "Old-style versus new-style block names"). If the code point is unassigned, this returns the block it would belong to if it were assigned.
See also "Blocks versus Scripts".
If supplied with an argument that can't be a code point, charblock() tries to do the opposite and interpret the argument as an old-style block name. The return value is a range set with one range: an anonymous list with a single element that consists of another anonymous list whose first element is the first code point in the block, and whose second (and final) element is the final code point in the block. (The extra list consisting of just one element is so that the same program logic can be used to handle both this return, and the return from "charscript()" which can have multiple ranges.) You can test whether a code point is in a range using the "charinrange()" function. If the argument is not a known block, undef is returned.
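Both directions of charblock() can be sketched as follows. The example only assumes the behaviour described above; the variable names are illustrative, and only the first two elements of the inner list are relied upon.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock charinrange);

# Forward lookup: code point to (old-style) block name.
my $block = charblock(0x41);

# Reverse lookup: block name to a range set holding a single range.
my $range = charblock('Armenian');
my ($first, $last) = @{ $range->[0] }[0, 1];
printf "%s; Armenian spans U+%04X..U+%04X\n", $block, $first, $last;

# U+0531 is ARMENIAN CAPITAL LETTER AYB; U+0041 lies outside the block.
my $inside  = charinrange($range, 0x0531);
my $outside = charinrange($range, 0x41);
```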
use Unicode::UCD 'charscript';
my $charscript = charscript(0x41);
my $charscript = charscript(1234);
my $charscript = charscript("U+263a");
my $range = charscript('Thai');
With a "code point argument" charscript() returns the script the code point belongs to, e.g. Latin, Greek, Han. If the code point is unassigned, it returns "Unknown".
If supplied with an argument that can't be a code point, charscript() tries to do the opposite and interpret the argument as a script name. The return value is a range set: an anonymous list of lists that contain start-of-range, end-of-range code point pairs. You can test whether a code point is in a range set using the "charinrange()" function. If the argument is not a known script, undef is returned.
See also "Blocks versus Scripts".
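The following sketch walks a script's range set and tests membership with charinrange(). How many ranges a script is split into may depend on the Unicode version shipped with your Perl, so the example does not assume a particular count.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript charinrange);

my $script = charscript(0x41);           # script of LATIN CAPITAL LETTER A

# Reverse lookup: a script can cover several disjoint ranges.
my $thai = charscript('Thai');
printf "Thai: U+%04X..U+%04X\n", @{$_}[0, 1] for @$thai;

my $is_thai  = charinrange($thai, 0x0E01);   # THAI CHARACTER KO KAI
my $not_thai = charinrange($thai, 0x41);
```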
use Unicode::UCD 'charblocks';
my $charblocks = charblocks();
charblocks() returns a reference to a hash with the known block names as the keys, and the code point ranges (see "charblock()") as the values.
The names are in the old-style (see "Old-style versus new-style block names").
prop_invmap("block") can be used to get this same data in a different type of data structure.
See also "Blocks versus Scripts".
use Unicode::UCD 'charscripts';
my $charscripts = charscripts();
charscripts() returns a reference to a hash with the known script names as the keys, and the code point ranges (see "charscript()") as the values.
prop_invmap("script") can be used to get this same data in a different type of data structure.
See also "Blocks versus Scripts".
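The hashes returned by charblocks() and charscripts() can be searched with the same range-set logic. The sketch below is illustrative only; scanning every script with charinrange() is done here to show the data structure, not as a recommended way to find a code point's script (charscript() does that directly).

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblocks charscripts charinrange);

my $blocks  = charblocks();
my $scripts = charscripts();
printf "%d blocks, %d scripts\n",
    scalar(keys %$blocks), scalar(keys %$scripts);

# Each value is a range set, as described under charblock() and charscript().
my ($lo, $hi) = @{ $blocks->{'Basic Latin'}[0] }[0, 1];

# Find every script whose range set contains U+3042 (HIRAGANA LETTER A).
my @found = grep { charinrange($scripts->{$_}, 0x3042) } sort keys %$scripts;
print "U+3042: @found\n";
```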
In addition to using the \p{Blk=...} and \P{Blk=...} constructs, you can also test whether a code point is in the range as returned by "charblock()" and "charscript()" or as the values of the hash returned by "charblocks()" and "charscripts()" by using charinrange():
use Unicode::UCD qw(charscript charinrange);
my $range = charscript('Hiragana');
print "looks like hiragana\n" if charinrange($range, $codepoint);
use Unicode::UCD 'general_categories';
my $categories = general_categories();
This returns a reference to a hash which has short general category names (such as Lu, Nd, Zs, S) as keys and long names (such as UppercaseLetter, DecimalNumber, SpaceSeparator, Symbol) as values. The hash is reversible in case you need to go from the long names to the short names. The general category is the one returned from "charinfo()" under the category key.
The "prop_value_aliases()" function can be used to get all the synonyms of the category name.
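To connect this with charinfo(), the sketch below maps a code point's short category name to its long name and back again. The %short_for hash is a hypothetical helper built for the example, not something the module provides.

```perl
use strict;
use warnings;
use Unicode::UCD qw(general_categories charinfo);

my $categories = general_categories();

# Short name (as found in charinfo()'s category field) to long name.
my $short = charinfo(0x41)->{category};
my $long  = $categories->{$short};
print "$short => $long\n";

# Inverting the hash gives the reverse lookup, long name to short name.
my %short_for = reverse %$categories;
my $back = $short_for{$long};
```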