Unicode::UCD - Unicode character database
use Unicode::UCD 'charinfo';
my $charinfo = charinfo($codepoint);
use Unicode::UCD 'casefold';
my $casefold = casefold(0xFB00);
use Unicode::UCD 'casespec';
my $casespec = casespec(0xFB00);
use Unicode::UCD 'charblock';
my $charblock = charblock($codepoint);
use Unicode::UCD 'charscript';
my $charscript = charscript($codepoint);
use Unicode::UCD 'charblocks';
my $charblocks = charblocks();
use Unicode::UCD 'charscripts';
my $charscripts = charscripts();
use Unicode::UCD qw(charscript charinrange);
my $range = charscript($script);
print "looks like $script\n" if charinrange($range, $codepoint);
use Unicode::UCD qw(general_categories bidi_types);
my $categories = general_categories();
my $types = bidi_types();
use Unicode::UCD 'compexcl';
my $compexcl = compexcl($codepoint);
use Unicode::UCD 'namedseq';
my $namedseq = namedseq($named_sequence_name);
my $unicode_version = Unicode::UCD::UnicodeVersion();
The Unicode::UCD module offers a series of functions that provide a simple interface to the Unicode Character Database.
Some of the functions are called with a code point argument, which is either a decimal or a hexadecimal scalar designating a Unicode code point, or U+ followed by hexadecimals designating a Unicode code point. In other words, if you want a code point to be interpreted as a hexadecimal number, you must prefix it with either 0x or U+, because a string such as 123 will be interpreted as a decimal code point. Also note that Unicode is not limited to 16 bits: code points run up to U+10FFFF, so a code point may have more than 4 hexdigits.
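The following is a minimal sketch of the accepted argument forms, using charinfo() only as a convenient way to see which code point was understood. The variable names are illustrative and not part of the module.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# Three spellings of U+0041: hexadecimal literal, decimal number, "U+" string.
my @names = map { charinfo($_)->{name} } (0x41, 65, "U+0041");
print "$_\n" for @names;

# A bare digit string is read as decimal; a 0x or U+ prefix makes it hexadecimal.
my $decimal = charinfo("123")->{code};      # decimal 123 is U+007B
my $hex     = charinfo("0x123")->{code};    # hexadecimal 123 is U+0123
print "$decimal $hex\n";
```

All three entries in @names refer to the same character, while "123" and "0x123" resolve to different code points.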
use Unicode::UCD 'charinfo';
my $charinfo = charinfo(0x41);
This returns information about the input "code point argument" as a reference to a hash of fields as defined by the Unicode standard. If the "code point argument" is not assigned in the standard (i.e., has the general category Cn, meaning Unassigned) or is a non-character (meaning it is guaranteed to never be assigned in the standard), undef is returned.
Fields that aren't applicable to the particular code point argument exist in the returned hash, and are empty.
The keys in the hash with the meanings of their values are:
code - the input "code point argument" expressed in hexadecimal, with leading zeros added if necessary to make it contain at least four hexdigits
name - name of code, all IN UPPER CASE. Some control-type code points do not have names. This field will be empty for Surrogate and Private Use code points, and for the others without a name, it will contain a description enclosed in angle brackets, like <control>.
category - the short name of the general category of code. This will match one of the keys in the hash returned by "general_categories()".
combining - the combining class number for code used in the Canonical Ordering Algorithm. For Unicode 5.1, this is described in Section 3.11 Canonical Ordering Behavior, available at http://www.unicode.org/versions/Unicode5.1.0/
bidi - bidirectional type of code. This will match one of the keys in the hash returned by "bidi_types()".
decomposition - is empty if code has no decomposition; or is one or more codes (separated by spaces) that taken in order represent a decomposition for code. Each has at least four hexdigits. The codes may be preceded by a word enclosed in angle brackets then a space, like <compat>, giving the type of decomposition.
decimal - if code is a decimal digit this is its integer numeric value
digit - if code represents a whole number, this is its integer numeric value
numeric - if code represents a whole or rational number, this is its numeric value. Rational values are expressed as a string like 1/4.
mirrored - Y or N designating if code is mirrored in bidirectional text
unicode10 - name of code in the Unicode 1.0 standard if one existed for this code point and is different from the current name
comment - ISO 10646 comment field. It appears in parentheses in the ISO 10646 names list, or contains an asterisk to indicate there is a note for this code point in Annex P of that standard.
upper - is empty if there is no single code point uppercase mapping for code; otherwise it is that mapping expressed as at least four hexdigits. ("casespec()" should be used in addition to charinfo() for case mappings when the calling program can cope with multiple code point mappings.)
lower - is empty if there is no single code point lowercase mapping for code; otherwise it is that mapping expressed as at least four hexdigits. ("casespec()" should be used in addition to charinfo() for case mappings when the calling program can cope with multiple code point mappings.)
title - is empty if there is no single code point titlecase mapping for code; otherwise it is that mapping expressed as at least four hexdigits. ("casespec()" should be used in addition to charinfo() for case mappings when the calling program can cope with multiple code point mappings.)
block - block code belongs to (used in \p{In...}). See "Blocks versus Scripts".
script - script code belongs to. See "Blocks versus Scripts".
Note that you cannot do (de)composition and casing based solely on the decomposition, combining, lower, upper, and title fields; you will also need the "compexcl()" and "casespec()" functions.
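As a rough illustration of reading the returned hash, the sketch below prints a few fields for U+0041 and the decomposition of U+00E9. The choice of code points and the variable names are only for the example.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

my $info = charinfo(0x41);    # LATIN CAPITAL LETTER A
for my $key (qw(code name category lower block script)) {
    printf "%-9s %s\n", $key, $info->{$key};
}

# U+00E9 (e with acute) has a canonical decomposition; the field lists
# the constituent code points in hexadecimal, separated by spaces.
my $decomposition = charinfo(0xE9)->{decomposition};
print "decomposition: $decomposition\n";
```

Because charinfo() returns undef for unassigned code points and non-characters, code that handles arbitrary input should check the return value before dereferencing it.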
use Unicode::UCD 'charblock';
my $charblock = charblock(0x41);
my $charblock = charblock(1234);
my $charblock = charblock(0x263a);
my $charblock = charblock("U+263a");
my $range = charblock('Armenian');
With a "code point argument" charblock() returns the block the code point belongs to, e.g. Basic Latin. If the code point is unassigned, this returns the block it would belong to if it were assigned (which it may in future versions of the Unicode Standard).
See also "Blocks versus Scripts".
If supplied with an argument that can't be a code point, charblock() tries to do the opposite and interpret the argument as a code point block. The return value is a range: an anonymous list of lists that contain start-of-range, end-of-range code point pairs. You can test whether a code point is in a range using the "charinrange()" function. If the argument is not a known code point block, undef is returned.
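Both directions of charblock() can be sketched as follows. The code points tested are arbitrary picks for the example, and the variables are illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock charinrange);

# Code point to block name.
my $block = charblock(0x263A);
print "U+263A is in block: $block\n";

# Block name to range, then membership tests with charinrange().
my $armenian = charblock('Armenian');
my $inside   = charinrange($armenian, 0x0531) ? 1 : 0;   # ARMENIAN CAPITAL LETTER AYB
my $outside  = charinrange($armenian, 0x41)   ? 1 : 0;   # LATIN CAPITAL LETTER A
print "U+0531 in Armenian: $inside, U+0041 in Armenian: $outside\n";
```

The range reference can be reused for many lookups, which avoids calling charblock() once per code point.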
use Unicode::UCD 'charscript';
my $charscript = charscript(0x41);
my $charscript = charscript(1234);
my $charscript = charscript("U+263a");
my $range = charscript('Thai');
With a "code point argument" charscript() returns the script the code point belongs to, e.g. Latin, Greek, Han. If the code point is unassigned, it returns undef.
If supplied with an argument that can't be a code point, charscript() tries to do the opposite and interpret the argument as a code point script. The return value is a range: an anonymous list of lists that contain start-of-range, end-of-range code point pairs. You can test whether a code point is in a range using the "charinrange()" function. If the argument is not a known code point script, undef is returned.
See also "Blocks versus Scripts".
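A short sketch of charscript() in both directions follows. It also checks U+0E3F (the baht sign), which sits in the Thai block but is not part of the Thai script, as a small illustration of why blocks and scripts differ. The variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript charinrange);

# Code point to script name.
my $latin = charscript(0x41);
my $greek = charscript("U+03B1");    # GREEK SMALL LETTER ALPHA
print "$latin $greek\n";

# Script name to range, then membership tests.
my $thai       = charscript('Thai');
my $ko_kai     = charinrange($thai, 0x0E01) ? 1 : 0;   # THAI CHARACTER KO KAI
my $baht       = charinrange($thai, 0x0E3F) ? 1 : 0;   # THAI CURRENCY SYMBOL BAHT
my $baht_script = charscript(0x0E3F);
print "U+0E01 in Thai: $ko_kai, U+0E3F in Thai: $baht (script $baht_script)\n";
```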
use Unicode::UCD 'charblocks';
my $charblocks = charblocks();
charblocks() returns a reference to a hash with the known block names as the keys, and the code point ranges (see "charblock()") as the values.
See also "Blocks versus Scripts".
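For illustration, the hash returned by charblocks() might be inspected like this. The sketch assumes only that each value is a list of ranges whose first two elements are the start and end code points, as described for charblock().

```perl
use strict;
use warnings;
use Unicode::UCD 'charblocks';

my $blocks = charblocks();
my $count  = keys %$blocks;
print "$count blocks known\n";

# Each value is a list of ranges; each range begins with its first and
# last code points.
my ($start, $end) = @{ $blocks->{'Basic Latin'}[0] };
printf "Basic Latin: U+%04X..U+%04X\n", $start, $end;
```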
use Unicode::UCD 'charscripts';
my $charscripts = charscripts();
charscripts() returns a reference to a hash with the known script names as the keys, and the code point ranges (see "charscript()") as the values.
See also "Blocks versus Scripts".
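A script usually covers several disjoint ranges, which the following sketch makes visible by counting the ranges of two scripts. The script names chosen are arbitrary examples.

```perl
use strict;
use warnings;
use Unicode::UCD 'charscripts';

my $scripts = charscripts();
for my $name (qw(Latin Thai)) {
    my $ranges = $scripts->{$name};
    # Print the number of ranges and the bounds of the first one.
    printf "%s: %d range(s), first is U+%04X..U+%04X\n",
        $name, scalar @$ranges, @{ $ranges->[0] }[0, 1];
}
```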