Defining and Encoding the Set of Lithuanian Accented Letters and Other Written Characters

Title: Proposal for Unique Sequence Identifiers (USIs) and repertoire specifications that include these USIs

This document proposes to add a new identifier called Unique Sequence Identifier, and its use in repertoire specifications, as an enhanced response to the proposal for a composition identifier in document SC2/WG2 N2189, and similar other requirements.

Outside the context of the Unicode (and ISO/IEC 10646) standard, there is a need for expressing collections of entities that are required by a specific application -- for example, to state all the letters (including accented letters), digits, symbols etc. of a given national language such as Lithuanian.

While most of the entities of such a collection may have a single code position allocated to them, others can be represented only as sequences of code positions. Such sequences could be either combining character sequences (for example, accented Latin letters), or sequences of coded characters (such as Filipino NG, or Swiss Ch). At present, such sequences do not have a standardized unique identifier in the same sense as the characters that are encoded in the standard.

This requirement is expressed in the contribution from Finland and Germany (in document L2/00-089). When one examines the reasons behind the Lithuanian proposal, again, one of the driving requirements is the desire to be able to state uniquely the repertoire required by Lithuania. While the elements of such sequences -- such as the combining accent marks that are needed -- do have coded representations (and hence unique identifiers) that could be referenced in a repertoire, the specific sequences cannot at present be assigned a standardized unique identifier.

One of the principles of encoding fully composed characters in the Unicode Standard (and in ISO/IEC 10646) has been to include such a character only when it can be shown that a decomposed representation is not acceptable. A set of fully composed characters that could be decomposed was included in the first version of the standard for reasons of compatibility with the then-existing international, national or industry standards. There have been some recent proposals for adding fully composed characters, for example from Lithuania (see L2/99-349). These have not been accepted by the UTC or by ISO/IEC JTC1/SC2/WG2, for several reasons -- the primary reason being the implications of Normalization (see UTR #22, and L2/00-078). Clause 6.5 of ISO/IEC 10646-1: 2000 contains a short identification mechanism to reference characters that are encoded in the standard. It can

Annex A of 10646-1 contains identified collections of graphic characters for subsets of 10646. This identification is done as an enumeration of individual code positions or range(s) of code positions (one form of the short identifier) within the standard, or as a union of enumerated individual collections or range(s) of identified collections, for example:

However, such enumerations are at present constrained to characters defined in the standard.

Paragraphs along the following lines are proposed to be added to an appropriate clause (such as clause 6.5), to Annex A, or to a separate Annex of the standard.

Note: This proposal is worded towards amending 10646-1: 2000. However, equivalent paragraphs should be considered for the Unicode standard also.

An entity that is represented by a sequence of 'n' code positions from the standard is identified in the following form:

<UID1, UID2, ..., UIDn>

where UID1, UID2, etc. represent the unique identifiers of the corresponding characters from the standard, in the same sequence as needed to represent the identified entity. The syntax for UID1, UID2, … is specified in clause 6.5. A Comma (optionally followed by a Space character) separates the UIDs, and a pair of Angle Brackets encloses the whole sequence of UIDs.

a composite sequence containing a base character plus one or more combining characters

When there are multiple sequences that may be used to represent the same entity, each such sequence will be considered a separate USI. Either a choice has to be made as to which one of these is needed, or distinguishing entity names should be assigned to differentiate between the sequences.

Latin Small Letter U With Macron And Tilde (u, combining macron, combining tilde): <0075, 0304, 0303>
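The identifier form described above can be illustrated with a short Python sketch. It assumes that each UID is written as a bare four-digit hexadecimal code position, as in the example; clause 6.5 allows other forms of the short identifier, which are not modelled here. The function names `usi` and `element_names` are hypothetical and exist only for this illustration.

```python
import unicodedata


def usi(text):
    """Format a character sequence as <UID1, UID2, ...>.

    Assumption: each UID is a bare 4-digit hexadecimal code position,
    as in the example in the text.
    """
    return "<" + ", ".join(f"{ord(ch):04X}" for ch in text) + ">"


def element_names(text):
    """Return the standardized names of the elements of the sequence."""
    return [unicodedata.name(ch) for ch in text]


# u + combining macron + combining tilde
u_macron_tilde = "\u0075\u0304\u0303"

print(usi(u_macron_tilde))            # <0075, 0304, 0303>
print(element_names(u_macron_tilde))
```

The second call lists the names of the individual elements, which is the kind of "sequence of standardized names" a referencing document could use alongside the USI.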

In addition to the unique character identifiers from the standard, a repertoire definition may include entities represented by unique sequence identifiers as defined above -- for example to specify a Lithuanian repertoire. Such a repertoire can be defined in any document, for example in a National Standard, or a standard that defines all the possible sequences to represent all of Devanagari or Thai (including the specific valid conjuncts). When sufficient justification exists, such a repertoire may be proposed to be included in ISO/IEC 10646 as "an identified collection". To be able to accommodate such a request, the definition of "collections" in the standard should be enhanced to specifically recognize the possibility of inclusion of Uniquely Identified Sequences in a collection.

Note: We have to keep in mind that 10646 collections are the only current standardized means of being able to identify repertoires which are subsets of 10646.

The above proposals should meet the stated requirements for repertoire definitions in document L2/00-089, and other such requirements.

Single names (as opposed to a sequence of names) of entities which are represented by sequences will remain outside the scope of Unicode (and ISO/IEC 10646). A sequence of standardized names, corresponding to the elements of a Unique Sequence Identifier, may be used to reference a single name that may be assigned (by the referencing document) to make the correspondence unique.

Note: Situations may arise when an entity may be represented using more than one sequence -- for example, a multiple-accented character may be expressed as a sequence of an already encoded composed character and another combining accent, or as a completely decomposed sequence. The USIs for these sequences will be different. Different entity names will be necessary to be able to reference the correct USI.
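The situation in the note can be made concrete with a small Python sketch. It takes the u-with-macron-and-tilde entity and writes it two ways: fully decomposed, and as the already encoded U+016B followed by a combining tilde. The helper `usi` is hypothetical and assumes bare four-digit hexadecimal UIDs; normalization is done with the standard `unicodedata` module.

```python
import unicodedata


def usi(text):
    """Format a sequence as <UID1, UID2, ...> (assumed 4-digit hex UIDs)."""
    return "<" + ", ".join(f"{ord(ch):04X}" for ch in text) + ">"


# Two representations of the same entity:
decomposed = "\u0075\u0304\u0303"   # u + combining macron + combining tilde
partly_composed = "\u016B\u0303"    # u-with-macron + combining tilde

print(usi(decomposed))        # <0075, 0304, 0303>
print(usi(partly_composed))   # <016B, 0303>

# The two USIs differ, although the sequences are canonically equivalent:
same_entity = (
    unicodedata.normalize("NFD", partly_composed)
    == unicodedata.normalize("NFD", decomposed)
)
print(same_entity)            # True
```

A repertoire specification would therefore have to list one of the two USIs, or give each a distinguishing entity name, as the note says.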

L2/99-349 Proposal to add Lithuanian accented letters to ISO/IEC 10646-1, SC2/WG2 N2075R, 1999-09-09

L2/00-089 Identification of decomposed characters in ISO/IEC 10646-1, Kolehmainen, Küster, SC2/WG2 N2189, 2000-03-14

L2/00-078 Implications of Normalization on Character Encoding (for addition to principles and procedures), Mark Davis, SC2/WG2 N2176, 2000-03-07

Appendix 2

L2/01-191R

Dotting the i’s

Kent Karlsson and Vladas Tumasonis

2001-05-05

This is a proposal to update the SpecialCasing.txt data file in the Unicode Character Database. The current handling of dots above for lowercase i’s and j’s in SpecialCasing.txt for case mapping is not sufficient, in particular for Lithuanian where an explicit dot above sometimes needs to be introduced. This proposal also attempts a somewhat more systematic treatment of dots above lowercase i’s and j’s for other languages too.

The dot above lowercase i and lowercase j is 'soft' in the sense that it usually disappears upon uppercasing as well as when accents are placed above the i or j. There are, however, exceptions to this. For these exceptions, where the dot is not 'soft', a 'hard dot above' (U+0307) is the best way to deal with the matter. For Turkish, the soft dot must be “hardened” for uppercasing (when there are no accents above; otherwise the soft dot is already gone), but for Lithuanian it must be “hardened” before accenting above, but not for uppercasing.

The tables in the exposition are not complete. The formal tables in the update to SpecialCasing.txt are, however, intended to be complete.

to upper and to title

Normal

Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot, then uppercase. This removes any spurious dot above, a dot that is not recommended to be there in the first place.

	i+dot (no more accents above)	I
	i-ogonek+dot (no more accents above)	I-ogonek [etc.]
	j+dot (no more accents above)	J

Lithuanian

Any lowercase variant of i or j with an unblocked extra dot above, even if there are more accents above on that base letter: remove the extra dot, then uppercase.

	i+dot	I
	j+dot	J

Turkish

An i with an unblocked extra dot above, if there are no more accents above on that base letter: keep the extra dot, but don’t add another one (for the cases below), then uppercase. This, again, takes care of the spurious case where

	i (no more accents above)	I-dot
	i+dot (no more accents above)	I-dot
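The uppercasing rules in the three tables above can be sketched in Python as follows. This is only an illustration under stated assumptions: it works on the NFD form of the input, handles only sequences whose decomposed base is plain "i" or "j" (which covers i-ogonek and similar letters after decomposition), and uses Python's built-in `str.upper()` for everything else. The names `split_sequences`, `upper_sequence` and `to_upper` and the `lang` parameter are hypothetical, not part of any library or of the proposal.

```python
import unicodedata

DOT_ABOVE = "\u0307"


def split_sequences(text):
    """Split text into combining sequences (a base plus its marks)."""
    seqs = []
    for ch in text:
        if seqs and unicodedata.combining(ch) != 0:
            seqs[-1] += ch
        else:
            seqs.append(ch)
    return seqs


def upper_sequence(seq, lang=None):
    base, marks = seq[0], list(seq[1:])
    if base not in ("i", "j"):
        return seq.upper()
    # Marks of combining class 230 are the accents "above".
    above = [m for m in marks if unicodedata.combining(m) == 230]
    has_dot = bool(above) and above[0] == DOT_ABOVE   # unblocked extra dot
    more_above = len(above) > (1 if has_dot else 0)
    if lang in ("tr", "az") and base == "i" and not more_above:
        # Turkish/Azeri: keep the dot, or add exactly one if it is missing.
        if not has_dot:
            marks.append(DOT_ABOVE)
    elif has_dot and (lang == "lt" or not more_above):
        # Normal: drop the dot only if no more accents follow.
        # Lithuanian: drop it even if more accents follow.
        marks.remove(DOT_ABOVE)
    return base.upper() + "".join(marks)


def to_upper(text, lang=None):
    nfd = unicodedata.normalize("NFD", text)
    out = "".join(upper_sequence(s, lang) for s in split_sequences(nfd))
    return unicodedata.normalize("NFC", out)


print(to_upper("i\u0307") == "I")                      # True (normal)
print(to_upper("i\u0307\u0301", "lt") == "\u00CD")     # True (I-acute)
print(to_upper("i", "tr") == "\u0130")                 # True (I-dot)
```

Working on NFD keeps the rule set small, because precomposed letters such as i-ogonek reduce to a plain "i" followed by marks; the result is recomposed to NFC only at the end.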

to lower

Normal

Any lowercase or uppercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot.

	i+dot (no more accents above)	i
	i-ogonek+dot (no more accents above)	i-ogonek
	...	...
	j+dot (no more accents above)	j
	I-dot (if more accents above)	i -dot
	I-dot (if no more accents above)	i (already in UniData.txt)
	I -dot (if more accents above)	i -dot (for NFD—NFC consistency; already in UniData)
	I -dot (if no more accents above)	i (for NFD—NFC consistency)
	J -dot (if no more accents above)	j (some degree of systematic...)

Lithuanian

Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot. Uppercase I’s and J’s that have extra accents above must get an extra dot above inserted.

	I (if more accents above)	i -dot
	J (if more accents above)	j -dot
	I-ogonek (if more accents above)	i-ogonek -dot
	I-grave	i -dot -grave
	I-acute	i -dot -acute
	I-tilde	i -dot -tilde

For NFD—NFC consistency a number of “I-letters” that are not used in Lithuanian must be handled too.

Turkish

Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot. Turkish and Azeri (at least) use a dotless i as the lowercase of I. It should not be used if there are more accents above (then use an ordinary i, which then loses the dot...).

	I (no more accents above)	i-dotless
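The lowercasing rules in the tables above can be sketched in the same way. The sketch makes the following assumptions: input is processed in NFD, only sequences whose decomposed base is "i", "j", "I" or "J" get special treatment, and all other sequences go through Python's `str.lower()` one sequence at a time (so context-dependent rules for other scripts are not modelled). The function names and the `lang` parameter are hypothetical.

```python
import unicodedata

DOT_ABOVE = "\u0307"
DOTLESS_I = "\u0131"


def split_sequences(text):
    """Split text into combining sequences (a base plus its marks)."""
    seqs = []
    for ch in text:
        if seqs and unicodedata.combining(ch) != 0:
            seqs[-1] += ch
        else:
            seqs.append(ch)
    return seqs


def lower_sequence(seq, lang=None):
    base, marks = seq[0], list(seq[1:])
    if base not in ("i", "j", "I", "J"):
        return seq.lower()
    above = [m for m in marks if unicodedata.combining(m) == 230]
    has_dot = bool(above) and above[0] == DOT_ABOVE
    more_above = len(above) > (1 if has_dot else 0)
    new_base = base.lower()
    if has_dot and not more_above:
        # i+dot -> i, and I-dot -> i (all languages).
        marks.remove(DOT_ABOVE)
    elif lang in ("tr", "az") and base == "I" and not above:
        # Turkish/Azeri: plain capital I lowercases to dotless i.
        new_base = DOTLESS_I
    elif lang == "lt" and base in ("I", "J") and more_above and not has_dot:
        # Lithuanian: insert an explicit dot before the first accent above.
        marks.insert(marks.index(above[0]), DOT_ABOVE)
    return new_base + "".join(marks)


def to_lower(text, lang=None):
    nfd = unicodedata.normalize("NFD", text)
    out = "".join(lower_sequence(s, lang) for s in split_sequences(nfd))
    return unicodedata.normalize("NFC", out)


print(to_lower("\u00CC", "lt") == "i\u0307\u0300")   # True (I-grave)
print(to_lower("I", "tr") == "\u0131")               # True (dotless i)
print(to_lower("\u0130") == "i")                     # True (I-dot)
```

Note that the last result differs from Python's built-in `"\u0130".lower()`, which keeps the combining dot; the sketch follows the "Normal" table, which removes it when no more accents follow.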

Suggested changes to SpecialCasing.txt regarding dotting i’s and j’s

The exposition tables above were not intended to be complete. The formal tables below are intended to be complete enough to cover the orthographic requirements and also be such that NFD and NFC are handled consistently. Cases like barred i or j-crosstail are not covered. Review and comments are welcome. The intent is for these modifications to be included in Unicode 3.2, or if possible, in an update to Unicode 3.1.

Old lines (to remove)

1st-------------------
# characters where they are 1-1, and does not have locale-specific mappings.)
2nd-------------------
# The <condition_list> is optional. Where present, it consists of one or more locales or contexts,
# separated by spaces.
3rd-------------------
# A locale is defined as:
# <locale> := <ISO_639_code> ( "_" <ISO_3166_code> ( "_" <variant> )? )?
# <ISO_3166_code> := 2-letter ISO country code,
# <ISO_639_code> := 2-letter ISO language code
4th-------------------
# A context is one of the following choices:
5th-------------------
# AFTER_i: The last base character was "i" 0069
6th-------------------
7th-------------------
# ================================================================================
# Locale-sensitive mappings
# ================================================================================
# Lithuanian
0307; 0307; ; ; lt AFTER_i; # Remove DOT ABOVE after "i" with upper or titlecase
# Turkish, Azeri
0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0049; 0131; 0049; 0049; az; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
# Note: the following cases are already in the UnicodeData file.
# 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
# 0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
end-------------------

New lines (to insert, replacing the old ones listed above)

1st-------------------
# characters where they are 1-1, and does not have language-specific mappings.)
#
# Note that when case mapping a string in a normal form,
# the result need not be in any normal form.
#
2nd-------------------
# The <condition_list> is optional. Where present, it consists of one or more
# contexts, one of which may be a language code, separated by spaces.
3rd-------------------
# A _subset_ of RFC 3066 conforming language codes, _sufficient for this file_,
# can be described as:
# <langcode> := two-letter ISO 639-1 language code
4th-------------------
# A context is a <langcode> or one of the following choices (test on original string):
5th-------------------
# AFTER_i: The last preceding base character was "i" (0069), "j" (006A),
# or has a canonical decomposition that begins with an "i" or "j" but has no
# combining characters above (i.e., i-ogonek (012F), i-tilde-below (1E2D),
# or i-dot-below (1ECB)); AND no combining character class 230 (above) has
# intervened. (Neither i-stroke (0268) or j-crosstailed (029D) need be
# specially handled below, while they also have a soft dot above that
# is lost on normal uppercase or accenting above.)
#
# AFTER_CAP_I: The last preceding base character was "I" (0049), "J" (004A),
# or has a canonical decomposition that begins with an "I" or "J" but has no
# combining characters above (i.e., I-ogonek (012E), I-tilde-below (1E2C),
# or I-dot-below (1ECA)); AND no combining character class 230 (above) has
# intervened. (I-stroke (0197) need not be specially handled below, while
# it also has a soft dot above in lowercase form.)
#
# MORE_ACCENTS_ABOVE: The current combining sequence has at least one class 230
# (above) combining character after the currently considered character.
6th-------------------[no old text]
#-----
# Normal dotting/undotting of i's and j's (capital and small):
#-----
# Remove spurious explicit dot above small i or j when case mapping,
# if no more accents above:
0307; ; ; ; AFTER_i NON_MORE_ACCENTS_ABOVE # COMBINING DOT ABOVE
# Remove explicit dot above capital i or j when lowercasing,
# if no more accents above (mainly for NFC-NFD consistency for i--I-dot):
0307; ; 0307; 0307; AFTER_CAP_I NON_MORE_ACCENTS_ABOVE # COMBINING DOT ABOVE
# For NFC-NFD consistency for I-dot--i:
0130; 0069 0307; 0130; 0130; MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I WITH DOT
# Note: the following cases are already in the UnicodeData file.
# 0131; 0131; 0049; 0049; # LATIN SMALL LETTER DOTLESS I
# 0130; 0069; 0130; 0130; [NON_MORE_ACCENTS_ABOVE] # LATIN CAPITAL LETTER I WITH DOT ABOVE
7th-------------------
# ================================================================================
# Language-sensitive mappings
# ================================================================================
#
# Lithuanian:
#
# Remove dot above small i's or j's when uppercasing,
# even if there are more accents above:
0307; 0307; ; ; lt AFTER_i # COMBINING DOT ABOVE
# Introduce an explicit dot above when lowercasing capital I's and J's
# if there are more accents above (grave, acute, tilde above, and ogonek
# occur in Lithuanian; the rest are just for consistency between NFC and NFD):
0049; 0069 0307; 0049; 0049; lt MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I
004A; 006A 0307; 004A; 004A; lt MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER J
012E; 012F 0307; 012E; 012E; lt MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I WITH OGONEK
00CC; 0069 0307 0300; 00CC; 00CC; lt # LATIN CAPITAL LETTER I WITH GRAVE
00CD; 0069 0307 0301; 00CD; 00CD; lt # LATIN CAPITAL LETTER I WITH ACUTE
0128; 0069 0307 0303; 0128; 0128; lt # LATIN CAPITAL LETTER I WITH TILDE
1E2C; 1E2D 0307; 1E2C; 1E2C; lt MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I WITH TILDE BELOW
1ECA; 1ECB 0307; 1ECA; 1ECA; lt MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I WITH DOT BELOW
00CE; 0069 0307 0302; 00CE; 00CE; lt # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
0134; 006A 0307 0302; 0134; 0134; lt # LATIN CAPITAL LETTER J WITH CIRCUMFLEX
012A; 0069 0307 0304; 012A; 012A; lt # LATIN CAPITAL LETTER I WITH MACRON
012C; 0069 0307 0306; 012C; 012C; lt # LATIN CAPITAL LETTER I WITH BREVE
01CF; 0069 0307 030C; 01CF; 01CF; lt # LATIN CAPITAL LETTER I WITH CARON
0208; 0069 0307 030F; 0208; 0208; lt # LATIN CAPITAL LETTER I WITH DOUBLE GRAVE
020A; 0069 0307 0311; 020A; 020A; lt # LATIN CAPITAL LETTER I WITH INVERTED BREVE
1E2E; 0069 0307 0308 0301; 1E2E; 1E2E; lt # LATIN CAPITAL LETTER I WITH DIAERESIS AND ACUTE
1EC8; 0069 0307 0309; 1EC8; 1EC8; lt # LATIN CAPITAL LETTER I WITH HOOK ABOVE
#
# Turkish, Azeri:
#
# Remove spurious dot above small i's when lowercasing, if no more accents above:
0307; ; 0307; 0307; tr AFTER_i NON_MORE_ACCENTS_ABOVE # COMBINING DOT ABOVE
0307; ; 0307; 0307; az AFTER_i NON_MORE_ACCENTS_ABOVE # COMBINING DOT ABOVE
# I—i-dotless and I-dot--i-with-soft-dot are case pairs in Turkish and Azeri,
# when there are no more accents above (otherwise use the ordinary casing rules):
0069; 0069; 0130; 0130; tr NON_MORE_ACCENTS_ABOVE # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az NON_MORE_ACCENTS_ABOVE # LATIN SMALL LETTER I
0049; 0131; 0049; 0049; tr NON_MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az NON_MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I
end-------------
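To make the line format of the listing above easier to follow, here is a minimal Python sketch that parses one data line into its fields and evaluates the MORE_ACCENTS_ABOVE context as it is defined in the proposed text. It assumes the field order code; lower; title; upper; optional condition list, with the comment after "#", and it accepts lines with or without a semicolon after the condition list, since both forms appear above. The names `Mapping`, `parse_line` and `more_accents_above` are hypothetical.

```python
import unicodedata
from typing import NamedTuple, Tuple


class Mapping(NamedTuple):
    code: str
    lower: str
    title: str
    upper: str
    conditions: Tuple[str, ...]


def _chars(field):
    """Convert space-separated hex code positions to a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_line(line):
    """Parse one data line; return None for comment or blank lines."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    code, lower, title, upper = fields[:4]
    conditions = tuple(fields[4].split()) if len(fields) > 4 else ()
    return Mapping(_chars(code), _chars(lower), _chars(title),
                   _chars(upper), conditions)


def more_accents_above(text, index):
    """True if a class 230 mark follows text[index] in its sequence."""
    for ch in text[index + 1:]:
        ccc = unicodedata.combining(ch)
        if ccc == 0:        # next base character: the sequence has ended
            return False
        if ccc == 230:
            return True
    return False


entry = parse_line(
    "0049; 0069 0307; 0049; 0049; lt MORE_ACCENTS_ABOVE "
    "# LATIN CAPITAL LETTER I"
)
print(entry.conditions)                    # ('lt', 'MORE_ACCENTS_ABOVE')
print(more_accents_above("I\u0301", 0))    # True
```

An empty mapping field, as in the lines that remove COMBINING DOT ABOVE, parses to an empty string, which a case-mapping routine would treat as deletion of the character.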

Appendix 3. HTML document with accented letters

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<BODY>

<H1>Lithuanian USI</H1>

<TBODY>

<TR>

<TH>Graphic symbol</TH>

<TH>Composed glyph</TH>

<TH>Without COMBINING DOT ABOVE (U+0307)</TH></TR>

<TR>

<TD>LATIN CAPITAL LETTER A WITH OGONEK AND ACUTE</TD>

<TR>

<TD>LATIN SMALL LETTER A WITH OGONEK AND ACUTE</TD>

<TR>

<TD>LATIN CAPITAL LETTER A WITH OGONEK AND TILDE</TD>

<TR>

<TD>LATIN SMALL LETTER A WITH OGONEK AND TILDE</TD>

<TR>

<TD>LATIN CAPITAL LETTER E WITH OGONEK AND ACUTE</TD>

<TR>

<TD>LATIN SMALL LETTER E WITH OGONEK AND ACUTE</TD>

<TR>

<TD>LATIN CAPITAL LETTER E WITH OGONEK AND TILDE</TD>

<TR>

<TD>LATIN SMALL LETTER E WITH OGONEK AND TILDE</TD>

<TR>

<TD>LATIN CAPITAL LETTER E WITH DOT ABOVE AND ACUTE</TD>

<TR>

<TD>LATIN SMALL LETTER E WITH DOT ABOVE AND ACUTE</TD>

<TR>

<TD>LATIN CAPITAL LETTER E WITH DOT ABOVE AND TILDE</TD>

<TR>

<TD>LATIN SMALL LETTER E WITH DOT ABOVE AND TILDE</TD>

<TR>

<TD>LATIN SMALL LETTER I WITH DOT ABOVE AND GRAVE</TD>

<TR>

<TD>LATIN SMALL LETTER I WITH DOT ABOVE AND ACUTE</TD>

<TR>

<TD>LATIN SMALL LETTER I WITH DOT ABOVE AND TILDE</TD>

<TR>

<TD>LATIN CAPITAL LETTER I WITH OGONEK AND ACUTE</TD>

<TR>

<TD>LATIN SMALL LETTER I WITH OGONEK AND DOT ABOVE AND ACUTE</TD>

<TR>

<TD>LATIN CAPITAL LETTER I WITH OGONEK AND TILDE</TD>

<TR>

<TD>LATIN SMALL LETTER I WITH OGONEK AND DOT ABOVE AND TILDE</TD>

<TR>

<TD>LATIN CAPITAL LETTER J WITH TILDE</TD>

<TR>

<TD>LATIN SMALL LETTER J WITH TILDE</TD>

<TR>

<TD>LATIN CAPITAL LETTER L WITH TILDE</TD>

<TR>

<TD>LATIN SMALL LETTER L WITH TILDE</TD>

<TR>

<TD>LATIN CAPITAL LETTER M WITH TILDE</TD>

<TR>

<TD>LATIN SMALL LETTER M WITH TILDE</TD>

<TR>

<TD>LATIN CAPITAL LETTER R WITH TILDE</TD>

<TR>

<TD>LATIN SMALL LETTER R WITH TILDE</TD>

<TR>

<TD>LATIN CAPITAL LETTER U WITH OGONEK AND ACUTE</TD>

<TR>

<TD>LATIN SMALL LETTER U WITH OGONEK AND ACUTE</TD>

<TR>

<TD>LATIN CAPITAL LETTER U WITH OGONEK AND TILDE</TD>

<TR>

<TD>LATIN SMALL LETTER U WITH OGONEK AND TILDE</TD>

<TR>

<TD>LATIN CAPITAL LETTER U WITH MACRON AND ACUTE</TD>

<TR>

<TD>LATIN SMALL LETTER U WITH MACRON AND ACUTE</TD>

<TR>

<TD>LATIN CAPITAL LETTER U WITH MACRON AND TILDE</TD>

<TR>

<TD>LATIN SMALL LETTER U WITH MACRON AND TILDE</TD>

<P><FONT size=2><I><BR>Last updated on 2001.12.09 <BR>By Vladas Tumasonis

<BR>Email: </I></FONT><A href="mailto:vladas.tumasonis@maf.vu.lt"><FONT

size=2><I>vladas.tumasonis@maf.vu.lt</I></FONT></A><FONT size=2><I>

</I></FONT></P></BODY></HTML>
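The entity names listed in the table can be turned into combining sequences mechanically. The Python sketch below does this under two assumptions that are not stated in the source: the accents named after "WITH" are applied in the order they are named, and each accent name maps to the combining character shown in the `MARKS` table. The function names are hypothetical; `unicodedata.lookup` is used only to find the base letter.

```python
import unicodedata

# Assumed mapping from accent names used in the table to combining marks.
MARKS = {
    "GRAVE": "\u0300",
    "ACUTE": "\u0301",
    "TILDE": "\u0303",
    "MACRON": "\u0304",
    "DOT ABOVE": "\u0307",
    "OGONEK": "\u0328",
}


def sequence_for(name):
    """Build the decomposed sequence for an entity name from the table."""
    base_name, _, accents = name.partition(" WITH ")
    base = unicodedata.lookup(base_name)
    marks = "".join(MARKS[a] for a in accents.split(" AND ")) if accents else ""
    return base + marks


def usi(text):
    """Format a sequence as <UID1, UID2, ...> (assumed 4-digit hex UIDs)."""
    return "<" + ", ".join(f"{ord(ch):04X}" for ch in text) + ">"


name = "LATIN SMALL LETTER A WITH OGONEK AND ACUTE"
seq = sequence_for(name)
print(usi(unicodedata.normalize("NFD", seq)))   # <0061, 0328, 0301>
print(usi(unicodedata.normalize("NFC", seq)))   # <0105, 0301>
```

The two printed identifiers show why a repertoire specification has to state which normalization form, if any, its USIs are given in.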

[1] The ISO/IEC 10646 standard is a superset of Unicode. It defines a character encoding of 32 bits (4 bytes). All Unicode characters are present in ISO/IEC 10646; the first 16 bits of all their codes are zero, and the remaining 16 bits of the Unicode codes coincide with the ISO/IEC 10646 codes. There are therefore no essential differences between the two encodings. Unicode is developed by the Unicode Consortium, which is not part of the International Organization for Standardization, so Unicode is not regarded as an international standard. The two organizations cooperate closely, however, and as a result the character encoding in ISO/IEC 10646 and in Unicode is harmonized.