Unicode::Normalize - Unicode Normalization Forms
(1) using function names exported by default:
use Unicode::Normalize;
$NFD_string = NFD($string); # Normalization Form D
$NFC_string = NFC($string); # Normalization Form C
$NFKD_string = NFKD($string); # Normalization Form KD
$NFKC_string = NFKC($string); # Normalization Form KC
(2) using function names exported on request:
use Unicode::Normalize 'normalize';
$NFD_string = normalize('D', $string); # Normalization Form D
$NFC_string = normalize('C', $string); # Normalization Form C
$NFKD_string = normalize('KD', $string); # Normalization Form KD
$NFKC_string = normalize('KC', $string); # Normalization Form KC
Parameters:
$string
is used as a string under character semantics (see perlunicode).
$code_point
should be an unsigned integer representing a Unicode code point.
Note: The XSUB and pure Perl implementations differ in how they interpret a $code_point given as a decimal (non-integer) number. XSUB converts $code_point to an unsigned integer, but pure Perl does not. Do not pass a floating-point number or a negative value as $code_point.
$NFD_string = NFD($string)
It returns the Normalization Form D (formed by canonical decomposition).
$NFC_string = NFC($string)
It returns the Normalization Form C (formed by canonical decomposition followed by canonical composition).
$NFKD_string = NFKD($string)
It returns the Normalization Form KD (formed by compatibility decomposition).
$NFKC_string = NFKC($string)
It returns the Normalization Form KC (formed by compatibility decomposition followed by canonical composition).
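The four functions above can be compared on small inputs. The following is a minimal sketch, not part of the module: the helper code_points() is a hypothetical name introduced here only to print results as code points, and the sample characters are illustrative choices.

```perl
use strict;
use warnings;
use Unicode::Normalize;    # exports NFD, NFC, NFKD, NFKC by default

# Hypothetical helper (not part of Unicode::Normalize):
# render a string as a space-separated list of code points.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $str);
}

my $e_acute  = "\x{E9}";      # LATIN SMALL LETTER E WITH ACUTE
my $ligature = "\x{FB01}";    # LATIN SMALL LIGATURE FI

print code_points(NFD($e_acute)),   "\n";    # U+0065 U+0301
print code_points(NFC("e\x{301}")), "\n";    # U+00E9
print code_points(NFC($ligature)),  "\n";    # U+FB01 (no canonical decomposition)
print code_points(NFKC($ligature)), "\n";    # U+0066 U+0069
```

The ligature survives NFC because it has only a compatibility decomposition; the KD and KC forms replace it with the plain letters.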
$FCD_string = FCD($string)
If the given string is in FCD ("Fast C or D" form; cf. UTN #5), it returns the string without modification; otherwise it returns an FCD string.
Note: FCD is not always unique; several distinct FCD strings may be canonically equivalent to each other. FCD() will return one of these equivalent forms.
$FCC_string = FCC($string)
It returns the FCC form ("Fast C Contiguous"; cf. UTN #5).
Note: FCC is unique, as are the four normalization forms (NF*).
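A short sketch of how FCD() and FCC() relate to the NF* forms, under the assumption that FCD, FCC and checkFCD are imported by name (checkFCD is a function of this module that is not described in this section). The sample strings are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC FCD FCC checkFCD);

my $composed  = "\x{E9}";             # a single precomposed character: already FCD
my $unordered = "a\x{301}\x{323}";    # acute (class 230) before dot below (class 220)

my $fcd = FCD($unordered);            # some FCD string equivalent to $unordered
my $fcc = FCC($unordered);            # the unique FCC string

# A string that is already FCD comes back unchanged.
print "unchanged\n" if FCD($composed) eq $composed;

# FCD and FCC results are canonically equivalent to the input,
# so normalizing them further gives the same NFD/NFC as the input does.
print "FCD equivalent\n" if NFD($fcd) eq NFD($unordered);
print "FCC equivalent\n" if NFC($fcc) eq NFC($unordered);
```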
$normalized_string = normalize($form_name, $string)
It returns $string normalized to the form named by $form_name.
As $form_name, one of the following names must be given.
'C' or 'NFC' for Normalization Form C (UAX #15)
'D' or 'NFD' for Normalization Form D (UAX #15)
'KC' or 'NFKC' for Normalization Form KC (UAX #15)
'KD' or 'NFKD' for Normalization Form KD (UAX #15)
'FCD' for "Fast C or D" Form (UTN #5)
'FCC' for "Fast C Contiguous" (UTN #5)
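As a quick check of the name table above, the sketch below compares normalize() under both spellings of each NF* name with the corresponding dedicated function. The sample string and the %by_name table are assumptions made for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(normalize NFC NFD NFKC NFKD);

# fi ligature, "ance", then a combining acute accent
my $string = "\x{FB01}ance\x{301}";

# Short form name => the dedicated function it should agree with.
my %by_name = (
    C  => \&NFC,
    D  => \&NFD,
    KC => \&NFKC,
    KD => \&NFKD,
);

my $mismatches = 0;
for my $short (sort keys %by_name) {
    my $expected = $by_name{$short}->($string);
    # Both 'C' and 'NFC' (and so on) are accepted as $form_name.
    for my $name ($short, "NF$short") {
        $mismatches++ if normalize($name, $string) ne $expected;
    }
}
print "mismatches: $mismatches\n";    # mismatches: 0
```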
$decomposed_string = decompose($string [, $useCompatMapping])
It returns the concatenation of the decomposition of each character in the string.
If the second parameter (a boolean) is omitted or false, the decomposition is canonical decomposition; if it is true, the decomposition is compatibility decomposition.
The string returned is not always in NFD/NFKD. Reordering may be required.
$NFD_string = reorder(decompose($string)); # eq. to NFD()
$NFKD_string = reorder(decompose($string, 1)); # eq. to NFKD()
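To see why reordering may be required, consider a string whose combining marks are not in canonical order. This is an illustrative sketch; the sample string is an assumption chosen so that decompose() alone does not yield NFD.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD decompose reorder);

# fi ligature, then "a" with acute (class 230) placed before dot below (class 220)
my $string = "\x{FB01}a\x{301}\x{323}";

my $canonical = decompose($string);       # canonical decomposition only
my $compat    = decompose($string, 1);    # compatibility decomposition

# Nothing here has a canonical decomposition, so the string is unchanged,
# yet it is not NFD because the marks are still out of order.
print "decompose() alone is not NFD\n" if $canonical ne NFD($string);

# Reordering completes the job in both cases.
print "canonical ok\n" if reorder($canonical) eq NFD($string);
print "compat ok\n"    if reorder($compat)    eq NFKD($string);
```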
$reordered_string = reorder($string)
It returns the result of reordering the combining characters according to Canonical Ordering Behavior.
For example, when you have a list of NFD/NFKD strings, you can get the concatenated NFD/NFKD string from them, by saying
$concat_NFD = reorder(join '', @NFD_strings);
$concat_NFKD = reorder(join '', @NFKD_strings);
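A small sketch of the concatenation case, with sample strings assumed for illustration: each piece is in NFD by itself, but the joined string is not until it is reordered.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD reorder);

# Each element is in NFD on its own.
my @NFD_strings = ("a\x{301}", "\x{323}");    # "a" + acute, then a lone dot below

my $joined     = join '', @NFD_strings;       # acute now precedes dot below
my $concat_NFD = reorder($joined);            # dot below (220) moves before acute (230)

print "plain join is not NFD\n" if $joined ne NFD($joined);
print "reordered join is NFD\n" if $concat_NFD eq NFD($joined);
```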
$composed_string = compose($string)
It returns the result of canonical composition without applying any decomposition.
For example, when you have an NFD/NFKD string, you can get its NFC/NFKC string, by saying
$NFC_string = compose($NFD_string);
$NFKC_string = compose($NFKD_string);
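The relationship can be checked on a small input. This is a minimal sketch with an assumed sample string; it only exercises compose() on strings that are already decomposed, as the text describes.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD NFKC NFKD compose);

my $string = "\x{FB01}anc\x{E9}";    # fi ligature, "anc", e with acute

my $NFD_string  = NFD($string);      # ligature kept, e + combining acute
my $NFKD_string = NFKD($string);     # "f", "i", ..., e + combining acute

# Composing an already decomposed string yields the matching composed form.
print "NFC ok\n"  if compose($NFD_string)  eq NFC($string);
print "NFKC ok\n" if compose($NFKD_string) eq NFKC($string);
```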
($processed, $unprocessed) = splitOnLastStarter($normalized)
It returns two strings: the first one, $processed, is the part before the last starter, and the second one, $unprocessed, is the rest of the string (the last starter and everything after it). A starter is a character having a combining class of zero (see UAX #15).
Note that $processed may be empty (when $normalized contains no starter or starts with the last starter), and then $unprocessed should be equal to the entire $normalized.
When you have a $normalized string and an $unnormalized string following it, a simple concatenation is wrong:
$concat = $normalized . normalize($form, $unnormalized); # wrong!
Instead, do this:
($processed, $unprocessed) = splitOnLastStarter($normalized);
$concat = $processed . normalize($form, $unprocessed . $unnormalized);
splitOnLastStarter() should be called with a pre-normalized parameter $normalized that is in the same form as the $form you want.
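The difference between the two approaches shows up as soon as the second string begins with a combining character. The following sketch uses assumed sample strings and form 'C'.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC normalize splitOnLastStarter);

my $form         = 'C';
my $normalized   = NFC("cafe");    # already in the target form
my $unnormalized = "\x{301}s";     # begins with a combining acute accent

# Wrong: the accent is never composed with the preceding "e".
my $wrong = $normalized . normalize($form, $unnormalized);

# Right: re-normalize from the last starter ("e") onward.
my ($processed, $unprocessed) = splitOnLastStarter($normalized);
my $concat = $processed . normalize($form, $unprocessed . $unnormalized);

my $expected = normalize($form, $normalized . $unnormalized);
print "simple concatenation differs\n" if $wrong  ne $expected;
print "split approach matches\n"       if $concat eq $expected;
```

Here $processed is "caf" and $unprocessed is "e", so only the tail of the already normalized string is normalized again.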
If you have an array @string whose elements should be concatenated and then normalized, you can do this:
my $result = "";
my $unproc = "";
foreach my $str (@string) {
    $unproc .= $str;
    my $n = normalize($form, $unproc);
    my($p, $u) = splitOnLastStarter($n);
    $result .= $p;
    $unproc = $u;
}
$result .= $unproc;
# instead of normalize($form, join('', @string))
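The loop can be wrapped in a subroutine and compared against normalizing the joined string in one step. This is only a sketch: normalize_chunks() is a hypothetical name, not a function of the module, and the chunk boundaries are chosen so that combining characters fall at the start of a chunk.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(normalize splitOnLastStarter);

# Hypothetical wrapper around the loop above (not part of Unicode::Normalize).
sub normalize_chunks {
    my ($form, @chunks) = @_;
    my $result = "";
    my $unproc = "";
    foreach my $str (@chunks) {
        $unproc .= $str;
        my $n = normalize($form, $unproc);
        my ($p, $u) = splitOnLastStarter($n);
        $result .= $p;    # everything before the last starter is final
        $unproc  = $u;    # the tail may still combine with the next chunk
    }
    return $result . $unproc;
}

my @string = ("cafe", "\x{301} au", " lait", "\x{323}\x{301}");

my $incremental = normalize_chunks('C', @string);
my $all_at_once = normalize('C', join('', @string));
print "same result\n" if $incremental eq $all_at_once;
```

The incremental version only re-normalizes the short tail carried over from the previous chunk, which matters when the chunks arrive one at a time.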