Unicode 4.0.1
            
            
              
              Version 4.0.1 has been superseded by the
              latest version
              of the Unicode Standard.
            
            
              
            Version 4.0.1 of the Unicode Standard consists of the core
            specification, The Unicode Standard, 
            Version 4.0, the additional specifications on this page, 
            the delta and archival code charts for this version, the Unicode Standard Annexes,
            and the Unicode Character Database (UCD). The core specification gives the general principles, 
            requirements for conformance, and guidelines for implementers. The 
            code charts show representative glyphs for all the Unicode 
            characters. The Unicode Standard Annexes supply detailed normative 
            information about particular aspects of the standard. The Unicode 
            Character Database supplies normative and informative data for 
            implementers to allow them to implement the Unicode Standard.
            
            Version 4.0.1 of the Unicode Standard should be referenced as:
            
              The Unicode Consortium. The Unicode Standard, Version 4.0.1, 
              defined by: The Unicode Standard, Version 4.0 (Reading, MA, 
              Addison-Wesley, 2003. ISBN 0-321-18578-1), as amended by 
              Unicode 4.0.1 (http://www.unicode.org/versions/Unicode4.0.1/).
            
            A complete specification of the contributory files for Unicode 
            4.0.1 is found on the page
            
            Components for Version 4.0.1.
            
            
            
            Online Edition
            The text of The Unicode Standard, Version 4.0, as well as the 
            delta and archival code charts, is available online via the navigation links 
            on this page. These files may not be printed. The 
            Unicode 4.0 
            Web Bookmarks page has links to all sections of the online text.
            
            Overview
            Unicode 4.0.1 is an
            update version 
            of the Unicode Standard. It adds no new characters.
            The main new features in Unicode 4.0.1 are the following:
            
- The first significant update of the Unihan Database (Unihan.txt) since Unicode 3.2.0, including a large number of fixes and additional data items.
- Significant clarifications in four definitions used in conformance.
- Unicode Character Database:
  - New character properties: STerm and Variation_Selector
  - Updated significantly: Terminal_Punctuation, Math, Script, and Line_Break
  - Changed: general category of U+200B ZERO WIDTH SPACE
  - Changed: bidi class of some characters including: +, -, / and FRACTION SLASH
  - Added: property value aliases
  - Revised: formats in some of the data files
- Changes in the recommended loose comparison of character name values. See Property and Property Value Matching.
- Clearer definition of the encoding of Bengali Reph and Ya-phalaa
Changes to Definitions D13, D14, and D17
            Unicode 4.0, 
            Chapter 3 
            section 6 [page 70] contains the following definitions:
            
D13 Base character: A character that does not graphically combine with preceding characters, and that is neither a control nor a format character.
D14 Combining character: A character that graphically combines with a preceding base character. The combining character is said to apply to that base character.
D17 Combining character sequence: A character sequence consisting of either a base character followed by a sequence of one or more combining characters, or a sequence of one or more combining characters.
            
These definitions are modified as follows in Unicode 4.0.1 for greater 
clarity and to allow U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH NON-JOINER 
to be used in combining character sequences. (Definition D13 has been split into two 
parts, D13a and D13b. The bullet items, not formally 
parts of the definitions, are also modified for clarity. See the above-cited reference for 
details.)
D13a Graphic character: A character with the General Categories of
Letter (L), Combining Mark (M), Number (N), Punctuation (P), Symbol (S), or
Space Separator (Zs).
- Graphic characters specifically exclude the line and paragraph separators (Zl, Zp) and exclude the characters with the General Categories of Other (Cn, Cs, Cc, Cf).
- The interpretation of private use characters (Co) as graphic characters or not is determined by the implementation.
- For more information, see Chapter 2, especially Section 2.4 Code Points and Characters and Table 2-2 Types of Code Points.
D13b Base character: Any graphic character except for those with the
General Category of Combining Mark (M).
- Most Unicode characters are base characters. In terms of General Category values, a base character is any code point that has one of the categories: Letter (L), Number (N), Punctuation (P), Symbol (S), or Space Separator (Zs).
- Base characters do not include control characters or format controls.
- Base characters are independent graphic characters, but this does not preclude the presentation of base characters from adopting different contextual forms or participating in ligatures.
- The interpretation of private use characters (Co) as base characters or not is determined by the implementation. However, the default interpretation of private use characters should be as base characters, in the absence of other information.
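As a rough illustration of D13a and D13b, the classification can be sketched with Python's `unicodedata` module. This is only a sketch: `unicodedata` reflects whatever Unicode version ships with the Python interpreter, not Unicode 4.0.1, and the function names and the `private_use` parameter are hypothetical names chosen for this example.

```python
import unicodedata

# Major General Category classes that D13a lists as graphic.
GRAPHIC_MAJOR_CLASSES = {"L", "M", "N", "P", "S"}


def is_graphic(ch: str, private_use: bool = True) -> bool:
    """D13a sketch: L, M, N, P, S, or Zs. Co is left to the implementation."""
    gc = unicodedata.category(ch)
    if gc == "Co":
        # Assumption: the caller decides how private use characters are treated.
        return private_use
    return gc[0] in GRAPHIC_MAJOR_CLASSES or gc == "Zs"


def is_base(ch: str, private_use: bool = True) -> bool:
    """D13b sketch: any graphic character that is not a Combining Mark (M)."""
    gc = unicodedata.category(ch)
    if gc == "Co":
        # The text's default: treat private use characters as base characters.
        return private_use
    return is_graphic(ch) and gc[0] != "M"
```

Under this sketch, `is_base("a")` and `is_base(" ")` are true, while `is_base("\u0301")` (a combining mark) and `is_graphic("\u200d")` (a format character, Cf) are false.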
D14 Combining character: A character with the General Category of Combining Mark (M).
- Combining characters consist of all characters with the General Category values of Spacing Combining Mark (Mc), Non-Spacing Mark (Mn), and Enclosing Mark (Me).
- All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero canonical combining class.
- The interpretation of Private Use characters (Co) as combining characters or not is determined by the implementation.
- These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
- The graphic positioning of a combining character depends on the last preceding base character, unless they are separated by a character that is neither a combining character nor one of ZERO WIDTH JOINER or ZERO WIDTH NON-JOINER. The combining character is said to apply to that base character.
- There may be no such base character, such as when a combining character is at the start of text or follows a control or format character, such as a carriage return, tab, or RIGHT-TO-LEFT MARK. In such cases, the combining characters are called isolated combining characters.
- With isolated combining characters, or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
- The representative images of combining characters are depicted with a dotted circle in the code charts; when presented in graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle.
- Combining characters generally take on the properties of their base character, while retaining their combining property.
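The relationship between General Category and canonical combining class noted in the D14 bullets can be checked with a small sketch. The sample characters and the helper name `mark_info` are choices made for this example, and the values come from the Unicode data bundled with Python rather than from the 4.0.1 data files.

```python
import unicodedata


def mark_info(ch: str) -> tuple[str, int]:
    """Return (General Category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)


def is_combining(ch: str) -> bool:
    """D14 sketch: General Category of Combining Mark (Mn, Mc, or Me)."""
    return unicodedata.category(ch).startswith("M")


# U+0301 COMBINING ACUTE ACCENT: non-spacing mark with a non-zero class.
# U+0903 DEVANAGARI SIGN VISARGA: spacing mark with class zero.
# U+20DD COMBINING ENCLOSING CIRCLE: enclosing mark with class zero.
for sample in ("\u0301", "\u0903", "\u20dd"):
    print(f"U+{ord(sample):04X}", mark_info(sample), is_combining(sample))
```

All three samples are combining characters, but only the first has a non-zero canonical combining class.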
D17 Combining character sequence: A maximal character sequence consisting of either a base character followed by a sequence of one or more characters where each is a combining character, ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER; or a sequence of one or more characters where each is a combining character, ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER.
- When identifying a combining character sequence in Unicode text, the definition of the combining character sequence is applied maximally. Thus, for example, in the sequence <c, dot-below, caron, acute, a>, the entire sequence <c, dot-below, caron, acute> is identified as the combining character sequence, rather than the alternative of identifying <c, dot-below> as a combining character sequence followed by a separate (defective) combining character sequence <caron, acute>.
(The changes to D14 and D17 do not imply that any particular sequence is
automatically meaningful or interoperable; sequences must still be documented and used in
conventional ways to convey specific meanings.)
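The maximal matching described for D17 can be sketched as a simple scanner. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, private use characters are assumed to be base characters (the default suggested under D13b), and the General Category data is that of the running Python version rather than Unicode 4.0.1.

```python
import unicodedata

ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def is_base(ch: str) -> bool:
    """Base character per D13b; Co is assumed to be base for this sketch."""
    gc = unicodedata.category(ch)
    return gc[0] in "LNPS" or gc in ("Zs", "Co")


def is_extender(ch: str) -> bool:
    """A combining character (M), ZWJ, or ZWNJ, as allowed by D17."""
    return ch in (ZWJ, ZWNJ) or unicodedata.category(ch).startswith("M")


def combining_sequences(text: str) -> list[str]:
    """Return the maximal combining character sequences found in text."""
    sequences = []
    i, n = 0, len(text)
    while i < n:
        start = i
        if is_base(text[i]):
            i += 1  # optional leading base character
        end = i
        while end < n and is_extender(text[end]):
            end += 1  # take as many extenders as possible (maximal match)
        if end > i:
            sequences.append(text[start:end])
            i = end
        elif i == start:
            i += 1  # neither a base nor an extender: skip it
    return sequences
```

For the example in the text, `combining_sequences("c\u0323\u030c\u0301a")` yields the single sequence <c, dot-below, caron, acute>; a leading `"\u0301"` with no base character is returned as a sequence on its own.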
            Change to Definition D9b
          Unicode 4.0.1 explicitly acknowledges that provisional properties 
          are not maintained.  Unicode 4.0 contains this definition in 
          Chapter 3, 
          section 5 [page 67]:
	        
	
          D9b Provisional property: A Unicode character property whose 
          values are unapproved and tentative, and which may be incomplete or 
          otherwise not in a usable state.
            
          This has been modified by addition of a bullet item, as follows:
            
      
          D9b Provisional property: A Unicode character property whose 
          values are unapproved and tentative, and which may be incomplete or 
          otherwise not in a usable state.
- Provisional properties may be removed from future versions of the standard, without prior notice.
      Clarification of Bengali Reph and Ya-phalaa
    The formation of the Reph form is defined in The Unicode Standard,
    Version 4.0, Section 9.1, Rules for Rendering, R2. In short, the Reph is formed when a Ra whose inherent vowel has been killed by the virama/halant
      begins a syllable.
      [Figure: example of Reph formation]
      
      
      The Ya-phalaa is a post-base form of Ya and is formed when the Ya is the
      final consonant of a syllable cluster. In this case, the previous
      consonant retains its base shape and the virama/halant is combined with the
      following Ya.
      [Figure: example of Ya-phalaa formation]
      
      
      The combination Ra + virama/halant + Ya is ambiguous: it could be
      rendered either as a Reph applied to the Ya or as a Ra followed by a
      Ya-phalaa.
      
      
      To resolve this ambiguity and to have consistent behavior, the processing
      order of the Bengali script is taken into account. When parsing the
      text, the ability to form the Reph is identified first and therefore the
      Reph form should have priority in processing. Thus, to obtain the
      Ya-phalaa instead, it is necessary to insert a U+200C ZERO WIDTH
      NON-JOINER character into the stream between the Ra and virama/halant
      to allow the virama/halant and Ya to be grouped together during
      processing.
      
      
      In this sequence, the ZWNJ is used because two characters that would join by default
      are intended to remain as
      separate entities. In cases other than where the Ra is the first character
      in the cluster, the ZWNJ is not required for the formation of the Ya-phalaa.
      However, so that the Ya-phalaa can easily be entered with a single keystroke, it
      should be permissible for the Ya-phalaa to be consistently formed by “ZWNJ +
      VIRAMA + YA” (U+200C + U+09CD + U+09AF).
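
      The code point sequences discussed above can be written out as a small sketch. The variable and function names are hypothetical. U+09B0 for Ra is taken from the Bengali code chart rather than from this page, which lists only U+200C, U+09CD, and U+09AF. How each sequence renders depends on the font and shaping engine, so the sketch only builds and labels the sequences.

```python
import unicodedata

RA = "\u09b0"      # BENGALI LETTER RA (assumed from the Bengali code chart)
VIRAMA = "\u09cd"  # BENGALI SIGN VIRAMA (the virama/halant)
YA = "\u09af"      # BENGALI LETTER YA
ZWNJ = "\u200c"    # ZERO WIDTH NON-JOINER

# Default reading: the Reph takes priority, so Ra + virama applies to the Ya.
reph_on_ya = RA + VIRAMA + YA

# ZWNJ between Ra and virama lets virama + Ya group as a Ya-phalaa instead.
ra_with_ya_phalaa = RA + ZWNJ + VIRAMA + YA

# Sequence suggested for entering the Ya-phalaa as a single key input.
ya_phalaa_key = ZWNJ + VIRAMA + YA


def describe(sequence: str) -> list[str]:
    """List each code point in the sequence with its character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in sequence]


for label, seq in (("Reph", reph_on_ya), ("Ya-phalaa", ra_with_ya_phalaa)):
    print(label, describe(seq))
```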
      
      Unicode Character Database
             The updated 
            Unicode Character Database files for this version are available in 
            the 4.0.1 Update 
            directory. For the unchanged files, see the
            Components for
            Version 4.0.1. For more detailed information about the changes in the Unicode 
            Character Database, see the file 
            UCD.html in the Unicode Character 
            Database.