Unicode::Normalize - Unicode Normalization Forms
(1) using function names exported by default:
use Unicode::Normalize;
$NFD_string = NFD($string); # Normalization Form D
$NFC_string = NFC($string); # Normalization Form C
$NFKD_string = NFKD($string); # Normalization Form KD
$NFKC_string = NFKC($string); # Normalization Form KC
(2) using function names exported on request:
use Unicode::Normalize 'normalize';
$NFD_string = normalize('D', $string); # Normalization Form D
$NFC_string = normalize('C', $string); # Normalization Form C
$NFKD_string = normalize('KD', $string); # Normalization Form KD
$NFKC_string = normalize('KC', $string); # Normalization Form KC
Parameters:
$string
is used as a string under character semantics (see perlunicode).
$code_point
should be an unsigned integer representing a Unicode code point.
Note: XSUB and pure Perl are incompatible in how they interpret $code_point as a decimal number. XSUB converts $code_point to an unsigned integer, but pure Perl does not. Do not use a floating-point value or a negative sign in $code_point.
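One way to stay clear of this incompatibility is to validate a value before using it as $code_point. The following is only a sketch: checked_code_point() is a hypothetical helper that is not part of this module, and getCombinClass() is assumed to be imported on request as one of the functions that take a $code_point.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getCombinClass);

# Hypothetical helper (not part of Unicode::Normalize): accept only a plain
# run of decimal digits, so neither a fraction nor a minus sign gets through.
sub checked_code_point {
    my ($value) = @_;
    die "not an unsigned integer\n"
        unless defined $value && $value =~ /\A[0-9]+\z/;
    return $value + 0;
}

# 0x0301 is COMBINING ACUTE ACCENT; its canonical combining class is 230.
my $class = getCombinClass(checked_code_point(0x0301));
printf "U+0301 has combining class %d\n", $class;

# Values with a fraction or a sign are rejected before they reach the module.
for my $bad (1.5, -1) {
    my $ok = eval { checked_code_point($bad); 1 };
    print "rejected: $bad\n" unless $ok;
}
```

With such a check in place, the XSUB and pure Perl implementations should see the same integer.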
$NFD_string = NFD($string)
It returns the Normalization Form D (formed by canonical decomposition).
$NFC_string = NFC($string)
It returns the Normalization Form C (formed by canonical decomposition followed by canonical composition).
$NFKD_string = NFKD($string)
It returns the Normalization Form KD (formed by compatibility decomposition).
$NFKC_string = NFKC($string)
It returns the Normalization Form KC (formed by compatibility decomposition followed by canonical composition).
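As a small illustration of how the four forms differ, the sketch below normalizes two sample characters. The helper code_points() is a hypothetical display routine and not part of the module; the sample characters were chosen only for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize;    # NFD, NFC, NFKD and NFKC are exported by default

# Display helper (hypothetical): render a string as a list of code points.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $string;
}

my $e_acute = "\x{E9}";      # LATIN SMALL LETTER E WITH ACUTE
my $fi_lig  = "\x{FB01}";    # LATIN SMALL LIGATURE FI

# Canonical forms: U+00E9 and "e" + U+0301 are canonically equivalent.
print 'NFD  of U+00E9:     ', code_points(NFD($e_acute)),    "\n"; # U+0065 U+0301
print 'NFC  of e + U+0301: ', code_points(NFC("e\x{301}")), "\n"; # U+00E9

# Compatibility forms also replace the ligature; canonical forms keep it.
print 'NFC  of U+FB01:     ', code_points(NFC($fi_lig)),  "\n";    # U+FB01
print 'NFKD of U+FB01:     ', code_points(NFKD($fi_lig)), "\n";    # U+0066 U+0069
print 'NFKC of U+FB01:     ', code_points(NFKC($fi_lig)), "\n";    # U+0066 U+0069
```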
$FCD_string = FCD($string)
If the given string is in FCD ("Fast C or D" form; cf. UTN #5), it returns the string without modification; otherwise it returns an FCD string.
Note: FCD is not always unique, so several distinct forms may be equivalent to each other. FCD() will return one of these equivalent forms.
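The non-uniqueness can be seen in a short sketch. It assumes FCD() is imported on request; the sample strings are chosen only for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(FCD NFD);

my $composed   = "\x{E9}";      # e with acute, precomposed
my $decomposed = "e\x{301}";    # e followed by COMBINING ACUTE ACCENT

# Both strings are already in FCD, so each comes back unmodified,
# although the two are canonically equivalent to each other.
my $both_unchanged = FCD($composed) eq $composed
                  && FCD($decomposed) eq $decomposed;
my $equivalent     = NFD($composed) eq NFD($decomposed);

# Marks in non-canonical order (class 230 before class 220) are not FCD,
# so FCD() returns a different, canonically equivalent string.
my $misordered = "a\x{301}\x{323}";
my $fixed      = FCD($misordered);

print "unchanged\n"  if $both_unchanged;
print "equivalent\n" if $equivalent;
print "rewritten\n"  if $fixed ne $misordered;
```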
$FCC_string = FCC($string)
It returns the FCC form ("Fast C Contiguous"; cf. UTN #5).
Note: FCC is unique, as are the four normalization forms (NF*).
$normalized_string = normalize($form_name, $string)
It returns the string normalized to the form named by $form_name.
$form_name must be one of the following names:
'C' or 'NFC' for Normalization Form C (UAX #15)
'D' or 'NFD' for Normalization Form D (UAX #15)
'KC' or 'NFKC' for Normalization Form KC (UAX #15)
'KD' or 'NFKD' for Normalization Form KD (UAX #15)
'FCD' for "Fast C or D" Form (UTN #5)
'FCC' for "Fast C Contiguous" (UTN #5)
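normalize() is convenient when the form is chosen at run time. The sketch below wraps it in a hypothetical normalize_as() routine that checks the name against the list above before calling the module; the wrapper and its error message are assumptions made for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(normalize);

# The form names listed above, used to validate a caller-supplied name.
my %is_form_name = map { $_ => 1 } qw(C NFC D NFD KC NFKC KD NFKD FCD FCC);

# Hypothetical wrapper: normalize $string to a form chosen at run time.
sub normalize_as {
    my ($form_name, $string) = @_;
    die "unknown form name: $form_name\n" unless $is_form_name{$form_name};
    return normalize($form_name, $string);
}

my $short_name = normalize_as('C',   "e\x{301}");   # short name
my $long_name  = normalize_as('NFC', "e\x{301}");   # long name, same form
my $compat     = normalize_as('KD',  "\x{FB01}");   # ligature fi

print "same result\n" if $short_name eq $long_name;
print "$compat\n";
```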
$decomposed_string = decompose($string [, $useCompatMapping])
It returns the concatenation of the decomposition of each character in the string.
If the second parameter (a boolean) is omitted or false, canonical decomposition is used; if it is true, compatibility decomposition is used.
The string returned is not always in NFD/NFKD. Reordering may be required.
$NFD_string = reorder(decompose($string)); # eq. to NFD()
$NFKD_string = reorder(decompose($string, 1)); # eq. to NFKD(); any true value selects compatibility decomposition
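A concrete case where decompose() alone does not yield NFD is sketched below. It assumes decompose() and reorder() are imported on request; the sample string is chosen only for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(decompose reorder NFD NFKD);

# U+00E1 (a with acute) followed by U+0323 (COMBINING DOT BELOW).
my $string = "\x{E1}\x{323}";

# Each character is decomposed in place: "a", U+0301, U+0323.
# The acute (class 230) now precedes the dot below (class 220).
my $decomposed = decompose($string);

# Reordering puts the marks into canonical order: "a", U+0323, U+0301.
my $reordered = reorder($decomposed);

print "decompose alone is not NFD here\n" if $decomposed ne NFD($string);
print "reorder(decompose) matches NFD\n"  if $reordered  eq NFD($string);

# With a true second argument, compatibility mappings are applied as well.
my $compat = reorder(decompose("\x{FB01}", 1));    # ligature fi
print "compatibility result: $compat\n";
```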
$reordered_string = reorder($string)
It returns the result of reordering the combining characters according to Canonical Ordering Behavior.
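For example, when you have a list of NFD/NFKD strings, simply joining them does not always give an NFD/NFKD string, because combining characters at the start of one string may need to be reordered with those at the end of the previous one. You can get the concatenated NFD/NFKD string as follows: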
$concat_NFD = reorder(join '', @NFD_strings);
$concat_NFKD = reorder(join '', @NFKD_strings);
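The following sketch shows a case where the plain join is not in NFD but the reordered join is. The two sample strings are assumptions chosen for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(reorder NFD);

# Each piece is in NFD on its own: "a" + acute, and a lone dot below.
my @NFD_strings = ("a\x{301}", "\x{323}");

# Joined naively, the acute (class 230) precedes the dot below (class 220).
my $joined = join '', @NFD_strings;

# reorder() restores canonical order across the former boundary.
my $concat_NFD = reorder($joined);

print "plain join is not NFD\n"  if $joined     ne NFD($joined);
print "reordered join is NFD\n" if $concat_NFD eq NFD($joined);
```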