Unicode

1. General Remarks

This discussion of Unicode® is not intended to be general or comprehensive. It is merely a set of notes to myself (these are my Notebooks, after all) of various things of use to me in my study of ancient Greek and the origins of Western writing systems.

Unicode is a standard promulgated by a private consortium for the encoding of all characters in all languages as multi-byte numbers. The nuances and details of this are of course subtle and complex; for further discussion refer to the Unicode Consortium's website. The character mappings of Unicode and the ISO® standard 10646, "Universal Character Set," are now equivalent (the Unicode 4.0 Standard says "Version 4.0 of the Unicode Standard is code-for-code identical to ISO/IEC 10646:2003"). The Unicode standard, however, specifies more than just the character mapping.

Unicode (as of version 4.0) has a "code space" in the range of integers from zero to 0x10FFFF (10,FFFF hexadecimal). It is "convenient" (says the Standard) to think of these as divided into a series of seventeen sequences of 65,536 (0x10000) code points each, or "planes," numbered 0 through 16 (0x10). Plane 0, code points 0 through 0x00FFFF, is the "Basic Multilingual Plane," which contains most of the world's modern alphabetic and syllabic scripts, as well as much else. Plane 1, code points 0x010000 through 0x01FFFF, is the "Supplementary Multilingual Plane." Of the contents of this plane presently of interest to me, Linear B and the Cypriot Syllabary stand out. Plane 2, code points 0x020000 through 0x02FFFF, is the Supplementary Ideographic Plane, which contains many of the ideographic characters used in languages such as Chinese and Japanese. A small number of special-purpose characters (tags and variation selectors) are defined in Plane 14 (0x0Ennnn), and Planes 15 (0x0Fnnnn) and 16 (0x10nnnn) are reserved as supplementary private use areas (in addition to a smaller private use area within Plane 0). This leaves 11 other planes entirely reserved for future use.
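The plane arithmetic described above is simple enough to sketch in a few lines of Python. This is only an illustration under my own naming (plane_of and plane_bounds are hypothetical helpers, not anything defined by the Standard); the one fact it relies on is that each plane holds 0x10000 code points.

```python
# Sketch only: the names below are illustrative, not from the Standard.
PLANE_SIZE = 0x10000        # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF   # last code point in the code space


def plane_of(code_point):
    """Return the number (0 through 16) of the plane holding code_point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode code space: %#x" % code_point)
    return code_point >> 16   # same as code_point // PLANE_SIZE


def plane_bounds(plane):
    """Return the (first, last) code points of the given plane."""
    if not 0 <= plane <= 16:
        raise ValueError("no such plane: %d" % plane)
    first = plane * PLANE_SIZE
    return first, first + PLANE_SIZE - 1
```

With these, plane_of(0x03B1) is 0 (the Basic Multilingual Plane) and plane_of(0x10800) is 1 (the Supplementary Multilingual Plane).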

Unicode may be represented in any number of ways. Perhaps the simplest would be to use four bytes per character. This encoding is called "UTF-32" (Unicode Transformation Format - 32 bit). However, this representation is not space-efficient (Unicode could be represented in three integral bytes per character) and it is not directly compatible with traditional single-byte encoded ASCII. By way of contrast, UTF-8 is a particularly ingenious way of re-encoding the multi-byte characters of Unicode/ISO 10646 into a variable number of bytes (from one to four). It takes advantage of the fact that ASCII is a 7-bit code while (all modern) computers use 8-bit bytes. Basically, it uses the otherwise idle high-order bit to mark a byte as part of a multi-byte sequence, and so to "chain" on further bytes when necessary. Since the first 128 character positions of Unicode are, effectively, ASCII, this means that the UTF-8 encoding of the first 128 codes of Unicode is simply ASCII (with the high-order bit always 0). UTF-8 is thus fully backward-compatible with ASCII for these single-byte codes.
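To make the "chaining" concrete, here is a minimal sketch of a UTF-8 encoder for a single code point, written out by hand in Python. The function name utf8_encode is my own; in practice one would simply call the built-in str.encode("utf-8"), which the sketch is meant to agree with.

```python
def utf8_encode(cp):
    """Encode one code point as UTF-8 bytes (hand-rolled, for illustration)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %#x" % cp)
    if cp < 0x80:
        # 7 bits: a single byte with the high-order bit 0, i.e. plain ASCII.
        return bytes([cp])
    if cp < 0x800:
        # 11 bits: lead byte 110xxxxx, then one continuation byte 10xxxxxx.
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 16 bits: lead byte 1110xxxx, then two continuation bytes.
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits: lead byte 11110xxx, then three continuation bytes.
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

A lowercase Greek alpha (code point 0x03B1) comes out as the two bytes CE B1, while any code point below 0x80 comes out as the same single byte it has in ASCII.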

For further information, see Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux, as well as the Unicode standard itself.

As the current Linux®-based tools support all of this, I can see no reason not to use the UTF-8 encoding of Unicode/ISO-10646 for all texts for which there exist Unicode characters.

Note: When it is necessary to refer to a Unicode character by its code number (so that, for example, the display program won't try to show it as the glyph of the character itself), the convention is "U+XXXX", where "XXXX" is the character's hexadecimal code (or U+XXXXX or U+XXXXXX, if five or six hex digits are required, of course). Thus, a lowercase Greek alpha is U+03B1. The first character in the Cypriot Syllabary, in the Supplementary Multilingual Plane, is U+10800. Sometimes I'll drop leading 0s, sometimes not.
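As a small illustration of this convention, the following Python sketch converts between a character and its "U+XXXX" label. The helper names u_plus and from_u_plus are my own inventions; the only convention assumed is the one described above (at least four hexadecimal digits, more when required).

```python
def u_plus(char):
    """Return the U+XXXX label for a single character."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    # At least four hex digits; five or six appear only when needed.
    return "U+%04X" % ord(char)


def from_u_plus(label):
    """Return the character named by a U+XXXX label."""
    if not label.upper().startswith("U+"):
        raise ValueError("not a U+XXXX label: %r" % label)
    return chr(int(label[2:], 16))
```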

The Unicode Consortium's website, http://www.unicode.org/, has the Standard online, together with illustrative charts of the various code point ranges. It all makes for very interesting reading. Really.

2. Combining vs. Precomposed Characters

2.1. The Issue

Since I'm studying Greek, I'll discuss this issue in terms of Greek. I'm sure it comes up in other languages as well.

The major issue in representing ancient (that is, "polytonic" or multi-accented) Greek in Unicode is that Unicode allows the handling of diacritical marks in two ways. For example, a lowercase alpha with an acute accent may be represented as two characters (U+03B1 ("GREEK SMALL LETTER ALPHA") followed by U+0301 ("COMBINING ACUTE ACCENT")), in which case it is up to the displaying program to "combine" or "compose" these two characters into a single visual presentation. Alternatively, it may be represented as a single "precomposed" character (U+1F71 ("GREEK SMALL LETTER ALPHA WITH OXIA")).
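The two representations can be inspected with Python's standard unicodedata module. This is only a sketch (the describe helper is hypothetical), but it shows that the two strings differ as sequences of code points even though canonical decomposition (normalization form NFD) maps the precomposed character onto the combining sequence.

```python
import unicodedata

COMBINING = "\u03b1\u0301"   # alpha followed by combining acute accent
PRECOMPOSED = "\u1f71"       # alpha with oxia, as one character


def describe(text):
    """List each character of text as 'U+XXXX NAME'."""
    return ["U+%04X %s" % (ord(c), unicodedata.name(c, "<unnamed>"))
            for c in text]


# The raw strings are different: two code points versus one.
raw_equal = COMBINING == PRECOMPOSED

# After canonical decomposition they are the same sequence.
decomposed_equal = unicodedata.normalize("NFD", PRECOMPOSED) == COMBINING
```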

The advantage of using precomposed characters is that, given that they display at all, they should display correctly. The following image shows on the left a two-character "combining" representation of alpha with an acute accent, as displayed in the Mozilla® browser, version 1.7.11, under SuSE® Linux 10.0. On the right it shows the single-character "precomposed" version of the same.

Combining vs Precomposed alpha with acute, Mozilla 1.7.11, SuSE 10.0


Mozilla simply overprints the accent on top of the character. As the Unicode combining diacritical marks are not specific to Greek but may be used with many other characters, and as Unicode doesn't encode visual forms (glyphs) in any case, it is not surprising that the result is disappointing.

It also happens that one of the tools I use, the "vim" version of the vi text editor, only supports two combining diacritical marks. At times in Greek it is necessary to have three (e.g., rough breathing, circumflex, and iota subscript).

Further, vim as installed supports a keyboard mapping for UTF-8 Greek which makes the entry of precomposed characters easy, but which does not address entering combining characters.

From the presentation and data entry points of view, in the absence of more sophisticated typographic software (a nontrivial issue) or other vim keyboard mappings (perhaps easier than I think), precomposed characters seem to have the advantage.

The disadvantage of using precomposed characters is that searches (with relatively simple software, at least) for the underlying character (e.g., alpha, U+03B1) won't find characters precomposed with accents (e.g., alpha acutely accented, precomposed as U+1F71). Representations using combining characters show both of these literally (U+03B1 U+0301), and so searches should work more easily.
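One way around the search problem, sketched here in Python, is to decompose the text and discard the combining marks before searching. The function names are my own, and the approach is an assumption about what "relatively simple software" might do, not a description of any particular tool; it deliberately ignores all accents and breathings.

```python
import unicodedata


def strip_marks(text):
    """Decompose text and drop every combining mark, leaving base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


def contains_base(text, fragment):
    """True if fragment occurs in text once accents are ignored on both."""
    return strip_marks(fragment) in strip_marks(text)
```

A plain substring test for alpha (U+03B1) fails on a word that begins with a precomposed accented alpha, whereas contains_base finds it.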

2.2. Programming Solutions

Fortunately, the conversion between combining and precomposed characters is mechanical and lossless; it can therefore be reduced to a computer program (translating in either direction).

The following program takes a UTF-8 encoded file on the standard input, detects all Unicode Greek combining character sequences, and writes the file to the standard output with these transformed into precomposed characters.

[WRITE PROGRAM]
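Pending a purpose-built program, a minimal sketch is possible with the normalization support in Python's standard unicodedata module. Two assumptions should be stated loudly. First, normalization form NFC composes every composable sequence, not only Greek ones. Second, NFC yields the standard's preferred precomposed forms, so alpha plus acute becomes U+03AC (alpha with tonos) rather than U+1F71 (alpha with oxia), which the standard treats as canonically equivalent. The name compose_stream is my own.

```python
import unicodedata


def compose_stream(source, sink):
    """Copy text from source to sink, composing combining sequences (NFC).

    source and sink are text-mode file objects; decoding and encoding as
    UTF-8 is assumed to be handled by whoever opened them.
    """
    for line in source:
        # Normalizing line by line is safe here: a newline never combines
        # with a following mark.
        sink.write(unicodedata.normalize("NFC", line))
```

Calling compose_stream(sys.stdin, sys.stdout), with both streams opened as UTF-8, gives the filter described above.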

The following program does the opposite.

[WRITE PROGRAM]
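The opposite direction can be sketched the same way with normalization form NFD, again using only Python's standard unicodedata module and again affecting all scripts rather than Greek alone. The name decompose_stream is hypothetical.

```python
import unicodedata


def decompose_stream(source, sink):
    """Copy text from source to sink, splitting precomposed characters (NFD).

    Each precomposed character is replaced by its base letter followed by
    its combining marks in canonical order.
    """
    for line in source:
        sink.write(unicodedata.normalize("NFD", line))
```

Because U+1F71 and U+03AC both decompose to alpha plus U+0301, decomposing a text and then recomposing it may replace "oxia" characters with their "tonos" equivalents; a search on the decomposed form is unaffected.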

3. The Issue of Unification

Unicode employs a principle of "unification" whereby characters from different domains which are "equivalent" (in some semantic sense, not necessarily visually) are given the same number ("code point"). This can become a political issue when people of one group find that they have to share characters with people of another group. I certainly don't want to get into that. However, the principle of unification does mean that finding the Unicode characters to represent a particular domain can be more complex than one might at first think. I'll illustrate this here with what is, I hope, the relatively neutral domain of the International Phonetic Alphabet (IPA).

The IPA quite deliberately uses symbols drawn from other alphabets, such as Latin and Greek. For example, it uses the Latin letter "a" and the Greek letter beta. It also invents symbols of its own (or adapts them so completely that they become purely IPA symbols). For example, it distinguishes regular and "script" versions of "a", it rotates the Latin letter "m", and it contains a symbol for "ram's horns" (from astrological notations?). It also uses some symbols which are very much like "regular" ones but which are in some way distinguishable when used as IPA symbols. For example, an IPA "colon" (its "long mark") might be typed on an ordinary typewriter as a colon (how many people nowadays have actually seen an "ordinary" typewriter? a manual ordinary typewriter?), but is generally typeset with more triangular dots and isn't quite the same thing as a "real" colon.

There are two obvious solutions here.

One solution is to simply lay out a separate range which has all of the IPA symbols. This would be straightforward, but would involve duplicating symbols which "really" are also in other domains (the lowercase "a" in IPA is, really, just a lowercase letter "a"). Unicode does not do this.

The other solution, suggested by the principle of "unification" in Unicode, is to use symbols from other (preexisting, presumably) domains when possible and to add special symbols only when necessary. This makes perfect sense, but it can become complex in a situation such as that of the IPA, which draws its symbols from "basic" Latin letters (the ASCII ones), the many accented languages of Europe, the diacritical marks of many languages (considered as separate characters), and specially positioned characters (e.g., superscripts) which appear in other domains as well. The end result of this is that it takes, so far as I can identify them, characters from eleven (yes, 11) Unicode ranges to represent the IPA.

Whew.

Oh, and there's also a Uralic Phonetic Alphabet, range U+1D00, but that's not IPA.

This is no problem from a computer's point of view (characters are just numbers, after all). It can, however, make things difficult for a writer trying to locate a character. (It can also make things difficult when fonts which represent the less commonly used ranges of characters are not installed. I very much hope that this is a transitional issue which will soon go away.)

In the next section, therefore, I'll identify those groups of Unicode ranges which are relevant to my own language studies. General scholarly writing in the Western tradition, for example, takes at least five to seven ranges. Typing ordinary ancient Greek takes three ranges, and there are five more ranges relevant to ancient Greek scholarship (without even getting into Linear B and the Cypriot Syllabary).

4. My Unicode Use by Ranges

4.1. About the Ranges

It's surprising how various ranges of characters necessary in a single field are scattered throughout the standard. Discussing the scholarship on the phonetics of Greek, for example, might involve characters from well over a dozen Unicode ranges. Here, I'll organize some (not all!) of the ranges by the topics in which I use them.

For Basic Western Scholarship:

Phonetics

and, separately

Greek

Other Ancient Languages

Other Modern Languages

Other Linguistics

Not in Unicode 4.1

Not Likely to Be In Unicode

4.2. For Basic Western Scholarship

4.2.1. Basic: General Remarks

I can't imagine getting along without these ranges when doing almost any Western scholarship, even though I work almost entirely in US English.

4.2.2. Basic: Range U+0000 to U+007F "C0 Control and Basic Latin"

This is basically 7-bit US ASCII.

4.2.3. Basic: Range U+0080 to U+00FF "C1 Controls and Latin-1 Supplement"

These are additional Western European characters with precomposed diacritical marks and punctuation. The Copyright and Registered Trademark symbols are here. These are all "spacing" (vs. "combining") characters, so even when they're logically and visually superscripts, they occupy their own space (semantically, at least; the visual always depends on the display system).

These get used all the time, so here they are (with my identifications, not their official names).

As noted above, these are spacing characters. Thus, the accents as they appear in this range differ from those which appear in the U+0300 "Combining Diacritical Marks" range. Here they are independent characters. There they are intended to combine with the character that precedes them. So, for example, if I type an a and then a U+00B4, my editor (vi) shows two characters, an a and then an acute accent: a´ . If instead I type an a and then a U+0301, vi moves the acute accent leftwards so that it sits atop the a: á . (Your browser may or may not show this behavior here.)
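The difference between the spacing and the combining acute is also visible in the character properties that Python's unicodedata module exposes. The following is a sketch only (the helper is_mark is my own name): the spacing accent has a "symbol" category and never composes, while the combining accent is a "mark" and composes with a preceding a under normalization.

```python
import unicodedata

SPACING_ACUTE = "\u00b4"     # ACUTE ACCENT, Latin-1 Supplement
COMBINING_ACUTE = "\u0301"   # COMBINING ACUTE ACCENT


def is_mark(char):
    """True if char belongs to one of the general categories for marks."""
    return unicodedata.category(char).startswith("M")


# The spacing accent stays a separate character after normalization.
spaced = unicodedata.normalize("NFC", "a" + SPACING_ACUTE)

# The combining accent merges with the a into a single precomposed letter.
combined = unicodedata.normalize("NFC", "a" + COMBINING_ACUTE)
```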

4.2.4. Basic: Other Western Letters

My own work has not yet called for these, but I look forward to the day when it does - they're fascinating.

4.2.5. Basic: Range U+2000 to U+206F "General Punctuation"

This includes much fun stuff.

The dagger (U+2020, †) and double-dagger (U+2021, ‡) are here, as are single guillemets, double-bangs, and other editorial wonderments.

I'm not sure of the utility of a separate ellipsis symbol (U+2026 …).

This range also contains less common (obviously) and archaic punctuation, and even some IPA stuff.

There's also an invisible sign (U+2062, INVISIBLE TIMES) which indicates multiplication where you don't generally (as a mathematician) write a multiplication sign but might otherwise (in the text) wish to indicate it explicitly. This is thoughtful.

4.2.6. Basic: Range U+02B0 to U+02FF "Spacing Modifier Letters"

Note that some of these (e.g., the macron, which is both U+02C9 here and U+00AF in the Latin-1 Supplement) duplicate symbols elsewhere.

4.2.7. Basic: Range U+0300 to U+036F "Combining Diacritical Marks"

This range includes the combining versions of the basic diacritical marks. It also includes combining versions of diacritical marks used more specifically (e.g., for me, those used in ancient Greek).

4.2.8. Basic: Range U+2070 to U+209F "Superscripts and Subscripts"

Numbers, parentheses, and a few symbols as superscripts and subscripts.

4.2.9. Basic: Range U+2100 to U+214F "Letterlike Symbols"

This is where the non-registered TM symbol (™, U+2122) hides. (The Registered Trademark symbol (®) is U+00AE in the "C1 Controls and Latin-1 Supplement" range.) The Service Mark symbol (℠) U+2120 is also here.

And lurking at U+2117 is the Sound Recording / Phonorecord copyright symbol (℗), far from the regular copyright symbol (©) at U+00A9 in the "C1 Controls and Latin-1 Supplement" range.

There are other fun things here, too, including the drafting centerline symbol (U+2104, ℄), degrees Celsius (U+2103, ℃), degrees Fahrenheit (U+2109, ℉), and the (not degrees) Kelvin sign (U+212A, K), for when you wish to distinguish units of measure from Kafka's protagonists, but no degrees Réaumur sign, alas. It also has "Care Of" as a symbol (U+2105, ℅), the prescription symbol (U+211E, ℞), the reduced Planck constant (U+210F, ℏ, officially "PLANCK CONSTANT OVER TWO PI"), the Angstrom sign (U+212B, Å), the i for information sign (U+2139, ℹ), "No." as a sign ("numero sign," U+2116, №), and the ounce sign (U+2125, ℥). I never knew that there was an ounce sign.
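A caution about some of these letterlike symbols, sketched in Python with the standard unicodedata module: the Kelvin and Angstrom signs are canonically equivalent to ordinary letters, so any software that normalizes text will quietly turn them into a plain K and an Å. The degrees Celsius sign survives canonical normalization and is only split up by the looser "compatibility" forms. The variable names are my own.

```python
import unicodedata

KELVIN_SIGN = "\u212a"
ANGSTROM_SIGN = "\u212b"
DEGREE_CELSIUS = "\u2103"

# Canonical composition (NFC) replaces the two unit signs with letters.
kelvin_nfc = unicodedata.normalize("NFC", KELVIN_SIGN)       # LATIN CAPITAL LETTER K
angstrom_nfc = unicodedata.normalize("NFC", ANGSTROM_SIGN)   # A WITH RING ABOVE

# The Celsius sign is untouched by NFC but expanded by NFKC.
celsius_nfc = unicodedata.normalize("NFC", DEGREE_CELSIUS)
celsius_nfkc = unicodedata.normalize("NFKC", DEGREE_CELSIUS)
```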

4.2.10. Basic: Prosody

The markup of prosodic features (accents, length markings, and related diacritical marks) requires characters from several ranges. In addition to the "ordinary" situations, I also have need of the diacritical marks used by W. Sidney Allen in his Accent and Rhythm, and at times use completely ad hoc conventions of my own.

For conventional scansion:

For the scansion of text in Latinate characters, precombined versions of the acute and grave accents over the vowels are present in the U+0080 to U+00FF "C1 Controls and Latin-1 Supplement" range, and versions of the breve accent precombined over the vowels are present in the U+0100 to U+017F "Latin Extended-A" Range.

Though it isn't a part of conventional scansion, I find the musical "fermata" symbol to be of use - I use it to indicate a syllable held indefinitely which thus stands apart from the regular scansion. The combining fermata is in the relatively ordinary range U+0300 to U+036F "Combining Diacritical Marks," but the spacing version is up in the Supplementary Multilingual Plane in Range U+1D100 to U+1D1FF "Musical Symbols." The chance that either of these displays on most computers at the present time (2006) is, alas, slight. (No, they don't display on my system at present; I use them infrequently, and so don't mind simply reading the numeric code point value displayed instead of a real glyph.)

My own arbitrary convention for scansion when the regular symbols will be used for something else. This is not standard!

For the scansion of Greek meter by "weight" as "light" (inverted breve below) and "heavy" (macron below) syllables. Allen's notation also uses the (regular, not inverted) breve above and macron above to indicate short and long vowel length or syllable-length-with-short-or-long-vowel. Unicode allows this using combining diacritical marks, but its Greek wasn't really designed to accommodate these as precombined characters. Some (not all) of the vowels have precombined versions with breve above and macron above.

Allen's notation for prosodic analysis also requires superscript and subscript numbers (0, 1, 2) and the superscript plus sign. The superscripts and subscripts are generally in range U+2070 to U+209F, "Superscripts and Subscripts." However, in this range the position where one would expect the superscript "1" (U+2071) is instead "SUPERSCRIPT LATIN SMALL LETTER I"; no alternative is suggested, but there is a superscript numeral 1 in the Latin-1 Supplement (U+00B9). Also, the positions for superscript "2" and "3" are "reserved" and instead code points from the Latin-1 Supplement (U+00B2 and U+00B3) are suggested.
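Because the superscript digits are split between two ranges, a small lookup table is handier than arithmetic on code points. The following Python sketch uses my own helper names and covers only the digits and the plus sign; the code points themselves are those of the Latin-1 Supplement and "Superscripts and Subscripts" ranges.

```python
# Superscript 1, 2 and 3 live in the Latin-1 Supplement; the rest are in
# the "Superscripts and Subscripts" range.
SUPERSCRIPTS = str.maketrans({
    "0": "\u2070", "1": "\u00b9", "2": "\u00b2", "3": "\u00b3",
    "4": "\u2074", "5": "\u2075", "6": "\u2076", "7": "\u2077",
    "8": "\u2078", "9": "\u2079", "+": "\u207a",
})

# The subscript digits, by contrast, are contiguous from U+2080.
SUBSCRIPTS = str.maketrans(
    "0123456789",
    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089",
)


def superscript(text):
    """Replace digits and '+' in text with their superscript forms."""
    return text.translate(SUPERSCRIPTS)


def subscript(text):
    """Replace digits in text with their subscript forms."""
    return text.translate(SUBSCRIPTS)
```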

Note that the superscript "0" U+2070 (⁰) is not the same as the "masculine ordinal indicator" U+00BA (º) (and in some fonts they look quite different; e.g., the ordinal indicator may have both a round part and a line under it).

4.3. Phonetics

4.3.1. Phonetics: Use of Other Ranges

Unicode IPA uses letters from other ranges when possible:

4.3.2. Phonetics: Range U+0250 to U+02AF: IPA Extensions

See Chapter 7 of the Unicode 4.1 standard, section 7.1, "Latin."

4.3.3. Phonetics: Range U+02B0 to U+02FF: (Spacing) Modifier Letters

These are modifier symbols which are like diacritical marks, but are separately "spaced" characters (they don't combine with other characters), and so aren't called "diacritical marks" in Unicode. Mostly they're phonetic modifiers - both IPA and non-IPA.

See Chapter 7 of the Unicode 4.1 standard, section 7.6 "Modifier Letters."

4.3.4. Phonetics: Range U+1D80 to U+1DBF: Phonetic Extensions Supplement

[WHERE IS THIS IN THE STANDARD?]

Letters with palatal and retroflex hooks, and small modifier letters.

4.3.5. Phonetics: Other

Range U+1D00 to U+1D43 is "mostly for the Uralic Phonetic Alphabet (UPA)."

4.4. Greek

4.4.1. Greek: Range U+0370 to U+03FF "Greek and Coptic"

This range works for modern "monotonic" Greek and for those characters in "polytonic" (that is, written with accents) Greek which do not have accents.

4.4.2. Greek: Range U+1F00 to U+1FFF "Greek Extended"

Polytonic Greek as precomposed characters.

4.4.3. Greek: Range U+0300 to U+036F "Combining Diacritical Marks"

This range includes "combining" diacritical marks for many languages, including Greek. I'll note here only the subsets relevant to Greek.

These combining diacritical marks are to be combined with ordinary Greek letters from the range U+0370 to U+03FF, "Greek and Coptic."

(Shown combined with α (U+03B1); they may of course be combined with all appropriate letters.)

The standard prefers U+0342 COMBINING GREEK PERISPOMENI over U+0303 COMBINING TILDE (which I would not use for a Greek circumflex, though many do) and does not mention U+0302 COMBINING CIRCUMFLEX ACCENT (which is the character I would tend to use for a circumflex). I suppose that this means it is best to use U+0342 for the combining circumflex.

The standard includes U+0343 (ἀ) COMBINING GREEK KORONIS ("comma above") "for compatibility reasons," but prefers U+0313.

The standard discourages the use of U+0344 COMBINING GREEK DIALYTIKA TONOS, which is a diaeresis (¨) with an acute accent (΄) piled on top of it (α̈́), in favor of U+0308 (α̈) plus U+0301 (ά): α̈́ or ά̈ . Note also that this mark has a spacing counterpart, U+1FEE (΅), in the "Greek Extended" range, where it is paired with a spacing diaeresis-with-grave as well (U+1FED (῭)); the combining U+0344 in the "regular" range has no such grave partner. The "Greek Extended" range also includes iota and upsilon precomposed with diaeresis and acute/grave.
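The standard's preferences here are built into its canonical decompositions, which can be checked with Python's unicodedata module. In this sketch (the constant names are mine), normalization replaces U+0344 with U+0308 plus U+0301 and U+0343 with U+0313, and composition then produces a precomposed letter where one exists (iota) and leaves the combining marks in place where none does (alpha).

```python
import unicodedata

DIALYTIKA_TONOS = "\u0344"   # discouraged combining mark
KORONIS = "\u0343"           # compatibility combining mark

# Decomposition yields the preferred marks.
dialytika_parts = unicodedata.normalize("NFD", DIALYTIKA_TONOS)
koronis_parts = unicodedata.normalize("NFD", KORONIS)

# Iota has a precomposed form with diaeresis and acute; alpha does not.
iota_composed = unicodedata.normalize("NFC", "\u03b9" + DIALYTIKA_TONOS)
alpha_composed = unicodedata.normalize("NFC", "\u03b1" + DIALYTIKA_TONOS)
```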

4.4.4. Range U+1DC0 to U+1DFF "Combining Diacritical Marks Supplement"

This includes U+1DC0 (᷀) "COMBINING DOTTED GRAVE ACCENT" and U+1DC1 (᷁) "COMBINING DOTTED ACUTE ACCENT." Both of these are noted as "Used for Ancient Greek."

4.4.5. Greek: Range U+2E00 to U+2E7F "Supplemental Punctuation"

This includes Ancient Greek textual/editorial symbols.

4.4.6. Greek: Range U+10140 to U+1018F "Ancient Greek Numbers"

Ancient Greek acrophonic numerals and papyrological numbers. [WHERE ARE THESE IN THE STANDARD?]

4.4.7. Greek: Range U+1D000 to U+1D0FF "Byzantine Musical Symbols"

No, I haven't used these, but I suspect they might come in handy when studying Greek prosody.

4.4.8. Greek: Range U+1D200 to U+1D24F "Ancient Greek Musical Notation"

As with the Byzantine Musical Notation, I haven't used these, but I suspect they might come in handy when studying Greek prosody.

4.5. Other Ancient Scripts

4.5.1. OAS: Linear B

Devant les siècles son oeuvre est faite. ("Before the centuries, his work is done.")

4.5.2. OAS: Range U+10100 to U+1013F "Aegean Numbers"

Used with Linear B and the Cypriot Syllabary.

4.5.3. OAS: Range U+10380 to U+1039F "Ugaritic"

Ugaritic is an alphabetic script written with cuneiform characters. This is thought to be the only "other" instance of the development of an alphabet. It's a cautionary tale about the value of compatible technology.

4.5.4. OAS: Range U+10800 to U+1083F "Cypriot Syllabary"

See especially Roger Woodard's Greek Writing from Knossos to Homer, which argues for a special place for the Cypriot Syllabary in the adoption by the Greeks of the Phoenician alphabet.

4.6. Other Modern Scripts

4.6.1. OMS: General Remarks

These are by no means the only other modern scripts in Unicode, of course, or even the only other modern scripts of interest to me. They're simply the other modern scripts that it is likely I might have to write something in while researching the linguistics of ancient Greek. Devanagari is obvious here, as it is the script in which Sanskrit is written (Sanskrit being the seminal language in Indo-European linguistics, if no longer the oldest attested IE language). Hebrew and Arabic are obvious as well, as Semitic languages with distinctive scripts which occupy important places in the history of writing systems.

4.6.2. OMS: Arabic

4.6.3. OMS: Range U+0900 to U+097F "Devanagari"

This is such a beautiful script.

4.6.4. OMS: Hebrew

4.7. Other Linguistics

U+2E17 DOUBLE OBLIQUE HYPHEN (in Range U+2E00 to U+2E7F "Supplemental Punctuation") is noted as used in ancient Near-Eastern linguistics.

4.8. Not in Unicode 4.1

See http://www.unicode.org/roadmaps/smp/ for the "roadmap" of the Supplementary Multilingual Plane, with links to proposals.

4.9. Not Likely to Be In Unicode

The ConScript Unicode Registry http://www.evertype.com/standards/csur/ organizes, unofficially, various scripts which for one reason or another aren't in, or aren't likely to be in, Unicode. It employs the Unicode Private Use Areas.

The scripts it organizes include that of the "Phaistos disk" (U+E6D0 - U+E6FF), as well as invented scripts such as Tolkien's Tengwar and Cirth.
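Whether a given code point falls in a Private Use Area is easy to test. The Python sketch below hard-codes the three private use ranges (one in Plane 0, and Planes 15 and 16 less their final two code points) under a helper name of my own, and cross-checks against the "Co" (private use) general category reported by the standard unicodedata module.

```python
import unicodedata

# (first, last) code points of the three Private Use Areas.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),       # within the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),     # Plane 15
    (0x100000, 0x10FFFD),   # Plane 16
)


def is_private_use(code_point):
    """True if code_point lies in one of the Private Use Areas."""
    return any(first <= code_point <= last
               for first, last in PRIVATE_USE_RANGES)


def category_is_private_use(code_point):
    """The same test, using the character database instead of ranges."""
    return unicodedata.category(chr(code_point)) == "Co"
```

The unofficial "Phaistos disk" assignment at U+E6D0, for example, passes both tests, which is exactly why no standard font or tool can be expected to know what it means.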

