You are viewing the version of this documentation from Perl 5.41.6. This is a development version of Perl.

NAME

Unicode::Normalize - Unicode Normalization Forms

SYNOPSIS

(1) using function names exported by default:

use Unicode::Normalize;

$NFD_string  = NFD($string);  # Normalization Form D
$NFC_string  = NFC($string);  # Normalization Form C
$NFKD_string = NFKD($string); # Normalization Form KD
$NFKC_string = NFKC($string); # Normalization Form KC

(2) using function names exported on request:

use Unicode::Normalize 'normalize';

$NFD_string  = normalize('D',  $string);  # Normalization Form D
$NFC_string  = normalize('C',  $string);  # Normalization Form C
$NFKD_string = normalize('KD', $string);  # Normalization Form KD
$NFKC_string = normalize('KC', $string);  # Normalization Form KC

DESCRIPTION

Parameters:

$string is used as a string under character semantics (see perlunicode).

$code_point should be an unsigned integer representing a Unicode code point.

Note: The XSUB and pure Perl implementations differ in how they interpret $code_point as a decimal number. XSUB converts $code_point to an unsigned integer, but pure Perl does not. Do not pass a floating-point number or a negative number as $code_point.

Normalization Forms

$NFD_string = NFD($string)

It returns the Normalization Form D (formed by canonical decomposition).

$NFC_string = NFC($string)

It returns the Normalization Form C (formed by canonical decomposition followed by canonical composition).

$NFKD_string = NFKD($string)

It returns the Normalization Form KD (formed by compatibility decomposition).

$NFKC_string = NFKC($string)

It returns the Normalization Form KC (formed by compatibility decomposition followed by canonical composition).
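The four forms above can be compared on two small inputs. The following is a minimal sketch, not part of the module: the helper cps() and the sample strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize;    # exports NFD, NFC, NFKD and NFKC by default

# Hypothetical helper for this sketch: render a string as code points.
sub cps {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
}

my $e_acute = "\x{E9}";      # LATIN SMALL LETTER E WITH ACUTE
my $fi_lig  = "\x{FB01}";    # LATIN SMALL LIGATURE FI

my $nfd     = NFD($e_acute);   # canonical decomposition: "e" + U+0301
my $nfc     = NFC($nfd);       # recomposed into the single code point U+00E9
my $nfkd    = NFKD($fi_lig);   # compatibility decomposition: "f" then "i"
my $nfkc    = NFKC($fi_lig);   # "fi" again; nothing to recompose
my $nfd_lig = NFD($fi_lig);    # unchanged: the ligature has no canonical decomposition

print cps($nfd),     "\n";     # U+0065 U+0301
print cps($nfc),     "\n";     # U+00E9
print cps($nfkd),    "\n";     # U+0066 U+0069
print cps($nfd_lig), "\n";     # U+FB01
```

The last line shows the difference between the canonical forms (D, C) and the compatibility forms (KD, KC): only the latter replace the ligature.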

$FCD_string = FCD($string)

If the given string is in FCD ("Fast C or D" form; cf. UTN #5), it returns the string without modification; otherwise it returns an FCD string.

Note: FCD is not always unique, so several different FCD strings may be canonically equivalent to each other. FCD() returns one of these equivalent forms.

$FCC_string = FCC($string)

It returns the FCC form ("Fast C Contiguous"; cf. UTN #5).

Note: FCC is unique, as are the four normalization forms (NF*).
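The two notes above can be illustrated with a pair of canonically equivalent strings. This is a minimal sketch; the sample strings are assumptions, and it relies on both a precomposed "e with acute" and its decomposed spelling already being in FCD.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD FCD FCC);    # FCD and FCC are exported on request

my $composed   = "\x{E9}";      # LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "e\x{301}";    # "e" + COMBINING ACUTE ACCENT

# Both inputs are already FCD, so FCD() returns each one unmodified,
# even though the two strings are canonically equivalent.
my $fcd_a = FCD($composed);
my $fcd_b = FCD($decomposed);

# FCC() maps both spellings to one and the same string.
my $fcc_a = FCC($composed);
my $fcc_b = FCC($decomposed);

print "FCD results differ\n" if $fcd_a ne $fcd_b;
print "FCC results agree\n"  if $fcc_a eq $fcc_b;
```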

$normalized_string = normalize($form_name, $string)

It returns the normalization form of $form_name.

$form_name must be one of the following names:

'C'  or 'NFC'  for Normalization Form C  (UAX #15)
'D'  or 'NFD'  for Normalization Form D  (UAX #15)
'KC' or 'NFKC' for Normalization Form KC (UAX #15)
'KD' or 'NFKD' for Normalization Form KD (UAX #15)

'FCD'          for "Fast C or D" Form  (UTN #5)
'FCC'          for "Fast C Contiguous" (UTN #5)
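As a minimal sketch of the table above, the short and long form names can be passed interchangeably. The sample string and the %result hash are assumptions made for this illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(normalize NFC NFD);

my $string = "\x{212B}";    # ANGSTROM SIGN, canonically equivalent to U+00C5

# Normalize the same string under every accepted form name.
my %result;
for my $form (qw(C NFC D NFD KC NFKC KD NFKD FCD FCC)) {
    $result{$form} = normalize($form, $string);
}

# The short and long names select the same form.
for my $pair ([qw(C NFC)], [qw(D NFD)], [qw(KC NFKC)], [qw(KD NFKD)]) {
    my ($short, $long) = @$pair;
    printf "%-2s and %-4s agree\n", $short, $long
        if $result{$short} eq $result{$long};
}
```

Selecting the form by name is convenient when the form comes from configuration or user input rather than being fixed in the code.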

Decomposition and Composition

$decomposed_string = decompose($string [, $useCompatMapping])

It returns the concatenation of the decomposition of each character in the string.

If the second parameter (a boolean) is omitted or false, the decomposition is canonical decomposition; if the second parameter (a boolean) is true, the decomposition is compatibility decomposition.

The string returned is not always in NFD/NFKD. Reordering may be required.

$NFD_string  = reorder(decompose($string));    # equivalent to NFD()
$NFKD_string = reorder(decompose($string, 1)); # equivalent to NFKD()

$reordered_string = reorder($string)

It returns the result of reordering the combining characters according to Canonical Ordering Behavior.
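The following minimal sketch shows a case where decompose() alone does not yield NFD and reorder() is needed. The sample strings are assumptions: an "e with acute" followed by COMBINING DOT BELOW, whose combining class (220) is lower than that of the acute accent (230).

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD decompose reorder);

my $string = "\x{E9}\x{323}";    # e-acute followed by COMBINING DOT BELOW

# Each character is decomposed in place, so the acute accent (class 230)
# ends up before the dot below (class 220).
my $decomposed = decompose($string);       # "e", U+0301, U+0323

# Canonical ordering sorts the combining marks by combining class.
my $reordered = reorder($decomposed);      # "e", U+0323, U+0301

print "decompose() alone is not NFD here\n" if $decomposed ne NFD($string);
print "reorder(decompose()) equals NFD\n"   if $reordered  eq NFD($string);

# The same applies when NFD strings are concatenated: each piece is in
# NFD, but the joined string needs reordering.
my @NFD_strings = ("e\x{301}", "\x{323}");
my $concat_NFD  = reorder(join '', @NFD_strings);
```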

For example, when you have a list of NFD/NFKD strings, you can get the concatenated NFD/NFKD string from them, by saying

$concat_NFD  = reorder(join '', @NFD_strings);
$concat_NFKD = reorder(join '', @NFKD_strings);

$composed_string = compose($string)

It returns the result of canonical composition without applying any decomposition.
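Because compose() skips the decomposition step, its result can differ from NFC() when the input is not already decomposed. A minimal sketch, with a sample string assumed for illustration:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD compose);

my $string = "\x{E9}\x{323}";    # e-acute followed by COMBINING DOT BELOW

# Applied to an NFD string, compose() produces the NFC string.
my $from_nfd = compose(NFD($string));

# Applied directly, compose() only tries to combine the characters as
# they stand, so the precomposed e-acute is never taken apart first.
my $direct = compose($string);

print "compose(NFD(\$string)) equals NFC(\$string)\n"
    if $from_nfd eq NFC($string);
print "compose(\$string) differs from NFC(\$string)\n"
    if $direct ne NFC($string);
```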

For example, when you have a NFD/NFKD string, you can get its NFC/NFKC string, by saying

$NFC_string  = compose($NFD_string);
$NFKC_string = compose($NFKD_string);

($processed, $unprocessed) = splitOnLastStarter($normalized)

It returns two strings: the first, $processed, is the part before the last starter, and the second, $unprocessed, is the remainder, beginning with the last starter. A starter is a character having a combining class of zero (see UAX #15).

Note that $processed may be empty (when $normalized contains no starter or starts with the last starter), and then $unprocessed should be equal to the entire $normalized.

When you have a $normalized string and an $unnormalized string following it, a simple concatenation is wrong:

$concat = $normalized . normalize($form, $unnormalized); # wrong!

Instead, do this:

($processed, $unprocessed) = splitOnLastStarter($normalized);
$concat = $processed . normalize($form, $unprocessed . $unnormalized);

splitOnLastStarter() should be called with a pre-normalized parameter $normalized that is already in the form named by $form.
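A minimal sketch of why the split matters, using NFC and sample strings assumed for illustration: the normalized part ends with a starter, and the unnormalized part begins with a combining mark that should combine with it.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC splitOnLastStarter);

my $normalized   = NFC("cafe");    # in NFC; its last starter is "e"
my $unnormalized = "\x{301}s";     # begins with COMBINING ACUTE ACCENT

# Naive concatenation leaves "e" and U+0301 uncomposed across the seam.
my $naive = $normalized . NFC($unnormalized);

# Splitting keeps the last starter together with what follows it.
my ($processed, $unprocessed) = splitOnLastStarter($normalized);
my $concat = $processed . NFC($unprocessed . $unnormalized);

my $expected = NFC($normalized . $unnormalized);
print "naive concatenation is not NFC\n" if $naive  ne $expected;
print "split concatenation is NFC\n"     if $concat eq $expected;
```

Here $processed is "caf" and $unprocessed is "e", so the accent is normalized together with the letter it modifies.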

If you have an array @string whose elements should be concatenated and then normalized, you can do this:

my $result = "";
my $unproc = "";
foreach my $str (@string) {
    $unproc .= $str;
    my $n = normalize($form, $unproc);
    my($p, $u) = splitOnLastStarter($n);
    $result .= $p;
    $unproc  = $u;
}
$result .= $unproc;
# instead of normalize($form, join('', @string))
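The loop above can be wrapped in a subroutine and checked against normalizing the joined string. This is a minimal sketch: normalize_chunks() is a hypothetical helper name, not something the module provides, and the sample chunks are assumptions chosen so that combining marks fall on chunk boundaries.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(normalize splitOnLastStarter);

# Hypothetical helper: normalize a list of chunks incrementally, holding
# back everything from the last starter onward until the next chunk.
sub normalize_chunks {
    my ($form, @chunks) = @_;
    my $result = "";
    my $unproc = "";
    for my $chunk (@chunks) {
        $unproc .= $chunk;
        my $n = normalize($form, $unproc);
        my ($p, $u) = splitOnLastStarter($n);
        $result .= $p;    # safe to emit: later input cannot change it
        $unproc  = $u;    # may still combine with the next chunk
    }
    return $result . $unproc;
}

my @string  = ("e", "\x{301}a", "\x{30A}");    # marks split across chunks
my $chunked = normalize_chunks('C', @string);
my $whole   = normalize('C', join('', @string));

print(($chunked eq $whole) ? "same result\n" : "different result\n");
```

This shape may be useful when the input arrives piece by piece, for example when reading from a stream, since only the tail after the last starter has to be kept between iterations.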