UTF8: fix a couple of number ctype definitions During testing of the new number ctype, I found a typo in one of the CJK number definitions and two Roman Numeral characters that were set as numbers but should not be (according to the equivalent python check).
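The kind of Python cross-check mentioned above could look roughly like the sketch below. It is an assumption about the approach, not the script actually used: it scans the Unicode Number Forms block (U+2150-U+218F, which holds the Roman numerals) and reports assigned code points whose general category is not a number category. The helper name `non_numbers_in_range` is made up for illustration, and the characters it reports are not necessarily the two this commit refers to.

```python
import unicodedata

# Number Forms block; the Roman numerals live at U+2160 and up.
NUMBER_FORMS = (0x2150, 0x218F)


def non_numbers_in_range(start, end):
    """Return assigned code points in [start, end] whose general
    category is not one of the number categories (Nd, Nl, No)."""
    found = []
    for cp in range(start, end + 1):
        category = unicodedata.category(chr(cp))
        if category == "Cn":
            continue  # unassigned, nothing to classify
        if not category.startswith("N"):
            found.append(cp)
    return found


if __name__ == "__main__":
    for cp in non_numbers_in_range(*NUMBER_FORMS):
        ch = chr(cp)
        print(f"U+{cp:04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```

Any code point printed here that the locale source marks as a number is a candidate for review.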
UTF8 locales: Fully consider "CIRCLED_" set as alphabet This means defining the "A"-"Z" and "a"-"z" circled versions of the Enclosed Alphanumerics block (0x2460-0x24FF) as hexadecimal digits and defining the to-upper and to-lower conversions between the upper case and lower case circled alphabets.
UTF-8: Multiple improvements (and detection of possible issue)
This commit started out intending to fix the "digit" definition in Unicode, which it mostly does, but a lot more happened in the end, namely:
* Digits apparently are not part of the CLDR definition. I added a section in the manual portion of the UTF-8 source file that defines digit classes for generated sections.
* Add numbers classification for the entire UTF-8. Currently DragonFly and all BSDs do not support the "number" type. However, localedef understands it (it's supported on Illumos), but currently the number flag value is zero, so it's a no-op. A short-term goal is to have DragonFly be the first BSD with proper number ctype handling.
* Redefine "special" ctype once and for all. There is no definitive agreement on what "special" characters are. According to the wiki, which got it from Unicode, it starts with 33 characters (0x20 - 0x2F, 0x3A - 0x40, 0x5B - 0x60, 0x7B - 0x7E). However, localedef objects to <space> because the class sets the "graph" and "print" flags, and <space> can't be graph. As a result, <space> is not considered "special" here. Moreover, the punctuation in the Latin-1 Supplement is "special". The division and multiplication signs are ambiguous, so I set them to special (since the plus and minus signs are special). Finally, and with the most doubt, the punctuation of the "General Punctuation" block is also considered special, although I couldn't find convincing evidence either way. Given the lack of definition, I don't think the "special" classification is really used, especially not in Unicode.
* Fix NON-BREAK_SPACE classification (set as graph and space in the previous commit).
* The MICRO character was also warning due to being classified as both lower (in the Greek section) and punctuation, so remove the punct class.
* When possible, don't define graph if digit is defined, and similarly with graph and punct. Both digit and punct also set the graph flag, so having both is redundant.
* Add several new block definitions:
  - Syloti Nagri
  - Common Indic Number Forms
  - Phags-pa
  - Saurashtra
  - Kayah Li
  - Rejang
  - Javanese
  - Cham
  - Tai Viet
  - Meetei Mayek & extension
* Detection of possible bug in localedef: The Tai Tham definitions are producing the wrong code, but there's nothing wrong with the definitions. The 6 unused characters between the two digit definitions should not be graphable, but as soon as one "digit" is defined after the first digit range is defined, all the characters between are marked as graphable and digits. There are similar "fill-ins", but so far only with Tai Tham. It was detected while comparing the output of all "digit" types against a python program that does the same, and this error was revealed. It requires further investigation into exactly what is causing it (and thus where the bug is), but right now it's either a bad definition elsewhere that affects Tai Tham or localedef has a bug somewhere (avl lookup?).
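The Tai Tham observation can be reproduced on the reference side with a short Python sketch. This is not the program used for the comparison, only an assumed equivalent: it collects runs of consecutive decimal digits (general category "Nd") in the Tai Tham block (U+1A20-U+1AAF) and reports the category of each code point in the gap between them. The function name `digit_runs` is hypothetical.

```python
import unicodedata

# Tai Tham block boundaries as assigned by Unicode.
TAI_THAM = (0x1A20, 0x1AAF)


def digit_runs(start, end):
    """Return (first, last) pairs for each run of consecutive code
    points in [start, end] whose general category is 'Nd'."""
    runs = []
    run_start = None
    for cp in range(start, end + 1):
        is_digit = unicodedata.category(chr(cp)) == "Nd"
        if is_digit and run_start is None:
            run_start = cp
        elif not is_digit and run_start is not None:
            runs.append((run_start, cp - 1))
            run_start = None
    if run_start is not None:
        runs.append((run_start, end))
    return runs


if __name__ == "__main__":
    runs = digit_runs(*TAI_THAM)
    for first, last in runs:
        print(f"digits: U+{first:04X}-U+{last:04X}")
    # Anything between two runs should not be reported as a digit.
    for (_, prev_last), (next_first, _) in zip(runs, runs[1:]):
        for cp in range(prev_last + 1, next_first):
            print(f"gap:    U+{cp:04X} {unicodedata.category(chr(cp))}")
```

Comparing this output with what the compiled LC_CTYPE reports for the same range shows whether the gap code points are being filled in as digits.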
UTF8 locales: Refine Latin supplement more The multiplication and division signs were missing, and the control characters were not outlined. Also set superscript 1,2,3 as digits. They are not showing up with the iswdigit() function, so that requires further investigation (iswdigit does work for '0','1',...'9' however).
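As a point of comparison only, Python also distinguishes between the superscripts and the ASCII digits. The sketch below shows how Python classifies them; it does not explain the libc iswdigit() result, which still needs the investigation noted above. The helper name `classify` is made up for illustration.

```python
import unicodedata


def classify(ch):
    """Summarise how Python's Unicode database classifies one character."""
    return {
        "category": unicodedata.category(ch),
        "isdecimal": ch.isdecimal(),  # decimal digits only (category Nd)
        "isdigit": ch.isdigit(),      # also accepts digit-like forms such as superscripts
    }


if __name__ == "__main__":
    # Superscript one, two, three, then a plain ASCII digit.
    for ch in ("\u00b9", "\u00b2", "\u00b3", "5"):
        print(f"U+{ord(ch):04X}", classify(ch))
```

The superscripts come back as category "No" (other number) rather than "Nd" (decimal digit), so a check based strictly on decimal digits would reject them.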
UTF8 locales: Complete implementation of Latin-1 Supplement The Latin-1 Supplement block of UTF-8 (U0080-U00FF) was not fully implemented. Specifically, it was missing U00A1 (inverted exclamation) through U00BF (inverted question mark). Some popular characters this affected were the cent sign, pound sign, Yen sign, broken bar, copyright symbol and superscripts. On international keyboards, AltGr + number key wouldn't output correctly. This addition to the manual ctype input definitions (and subsequent regenerations) will fix these issues. Reported by: profmakx, ivadasz Diagnostics: YRabbit
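For reference, the range that was missing can be listed with a few lines of Python. This is an illustrative sketch only; it uses Python's Unicode database, not the locale source files, and the function name `latin1_gap` is hypothetical.

```python
import unicodedata

# The range named in the commit: U+00A1 through U+00BF inclusive.
FIRST, LAST = 0x00A1, 0x00BF


def latin1_gap():
    """Return (code point, general category, name) for each character
    in the previously missing part of the Latin-1 Supplement."""
    rows = []
    for cp in range(FIRST, LAST + 1):
        ch = chr(cp)
        rows.append((cp, unicodedata.category(ch), unicodedata.name(ch)))
    return rows


if __name__ == "__main__":
    for cp, category, name in latin1_gap():
        print(f"U+{cp:04X} {category} {name}")
```

The general categories in the second column give a quick hint as to which ctype class each character is likely to need in the manual definitions.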
Add locale tool to generate "rollup" UTF-8 src file The first version of the "common" UTF-8 file was hand-assembled by me. This is obviously prone to error and is very hard to maintain (the previous incarnation was never maintained; not once after it was added). To address these issues, create a new tool (using cldr2def as inspiration) to create a composite UTF-8 source file using all available POSIX input from CLDR. What can't be generated still comes from a manual fragment that is added to the common source file at the end. This allows periodic maintenance when CLDR issues new releases. We are converging on using this composite (aka "rollup") file for all UTF-8 locales.
cldr2def: Add 6 Arabic locales: AE EG JO MA QA SA The lack of Arabic support on BSD seems to be a glaring omission, so let's help rectify that by adding 6 locales to the cldr2def tool: ar_AE: United Arab Emirates, ar_EG: Egypt, ar_JO: Jordan, ar_MA: Morocco, ar_QA: Qatar, ar_SA: Saudi Arabia. There are obviously more (e.g. Iraq, Kuwait, Algeria, Libya, Tunisia, Syria, etc.) but I selected these as being the most likely to be used in my limited opinion. They are UTF-8 locales only.
cldr2def: Modify tool to create a "common" UTF-8 locale With the switch to CLDR definitions, each locale has a specific encoding. E.g. en_US and fr_FR have Latin characters, el_GR has Greek characters, etc. Seeing a filename written in Chinese characters while in en_US.UTF-8 will result in a bunch of "???" because the characters are illegal for the locale's CTYPE definition. Previously there was one shared CTYPE for all locales. This was handy but incorrect. It was also not updated and became slightly bitrotted. I am going to create a "fake" locale called "common.UTF-8". It is the same as en_US.UTF-8, but with an LC_CTYPE that is practically universal in that all common character sets (and symbols) are defined. To use it, one can do either: 1) Set the LANG env. variable to "common.UTF-8" 2) Set the LC_CTYPE env. variable to "common.UTF-8", e.g. "env LANG=fr_FR.UTF-8 LC_CTYPE=common.UTF-8 ls". This commit only updates the tool. The locale comes later. In addition to making changes that support the "common" base, the tool was also tweaked to favor en_US.UTF-8 as the "LOCALES+=" value. This should reduce a lot of the renaming churn we've been seeing on new generations.
locales polish: Remove ISO-8859-1 encoding from 27 locales
The ISO-8859-1 encoding is considered obsolete compared to ISO-8859-15 (which in itself is obsolete compared to UTF-8). To help avoid some potential mismatches, remove the ISO-8859-1 encoding in the following cases:
1) European localities that already have ISO-8859-15 (25 of 27)
* ca_(AD|ES|FR|IT)
* da_DK
* de_(AT|CH|DE)
* en_GB
* (es|eu)_ES
* fi_FI
* fr_(BE|CH|FR)
* is_IS
* it_(CH|IT)
* (nb|nn)_NO
* nl_(BE|NL)
* pt_PT
* sv_(FI|SE)
2) Localities where ISO-8859-1 can't represent the local currency symbol
* en_PH (PH Peso)
* es_CR (Colon)
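The 25-of-27 count can be double-checked by expanding the grouped locale patterns. The sketch below is only an illustration: the pattern strings are copied from the commit message, while the `expand` helper is made up for the purpose and is not part of the locale tools.

```python
import re

# Case 1: European locales that already have ISO-8859-15.
EUROPEAN_PATTERNS = [
    "ca_(AD|ES|FR|IT)", "da_DK", "de_(AT|CH|DE)", "en_GB", "(es|eu)_ES",
    "fi_FI", "fr_(BE|CH|FR)", "is_IS", "it_(CH|IT)", "(nb|nn)_NO",
    "nl_(BE|NL)", "pt_PT", "sv_(FI|SE)",
]
# Case 2: locales whose currency symbol ISO-8859-1 cannot represent.
CURRENCY_LOCALES = ["en_PH", "es_CR"]


def expand(pattern):
    """Expand the first '(a|b|...)' group in pattern, recursing until
    no groups remain, and return the resulting locale names."""
    match = re.search(r"\(([^()]*)\)", pattern)
    if match is None:
        return [pattern]
    names = []
    for alternative in match.group(1).split("|"):
        replaced = pattern[:match.start()] + alternative + pattern[match.end():]
        names.extend(expand(replaced))
    return names


def european_locales():
    """Return every locale name covered by case 1."""
    names = []
    for pattern in EUROPEAN_PATTERNS:
        names.extend(expand(pattern))
    return names


if __name__ == "__main__":
    european = european_locales()
    print(len(european), "European locales")
    print(len(european) + len(CURRENCY_LOCALES), "locales in total")
```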
locale polishing: lt_LT, et_EE, and en_IE changes lt_LT: Remove ISO8859-4. If an ISO encoding is needed, -13 would be used instead. et_EE: Estonia could use ISO8859-4 (not present), the newer ISO8859-13 (not present) or ISO8859-15 (present). To be uniform with other European countries that use -15, change the default from UTF-8 to ISO8859-15. Do not bring in -4 or -13. en_IE: Ireland only had UTF-8. To be uniform with the UK and Western Europe, bring in ISO8859-15 and set that as the alias for the shortname. Do not bring in -1.
locales: Update ctype charmaps with CLDR 27 data
cldr2def: Update Makefile to generate new POSIX source files We've been using CLDR v2.0.1 because it was the last version that provided generated POSIX source files. (The next release, v21, only provided the java tool to generate them.) The latest release is version 27.0.1. This alteration to the Makefile allows the generation of the desired 72 base locales with a single command (after the tool is downloaded and installed in the same CLDR release directory). Note that the kk_KZ locale is changing to kk_Cyrl_KZ. While here, make the necessary updates to charmaps.xml to generate the locale source files.
Add 17 new locales and really remove Latin
Now that locale definitions are generated, it's easy to add more. I've added the following new locale definitions:
* en_HK ISO-8859-1 (Hong Kong/English)
* en_HK UTF-8
* en_PH ISO-8859-1 (Philippines/English)
* en_PH UTF-8
* en_SG ISO-8859-1 (Singapore/English)
* en_SG UTF-8
* es_AR ISO-8859-1 (Argentina/Spanish)
* es_AR UTF-8
* es_CR ISO-8859-1 (Costa Rica/Spanish)
* es_CR UTF-8
* es_MX ISO-8859-1 (Mexico/Spanish)
* es_MX UTF-8
* se_FI UTF-8 (Finland/Northern Sami)
* se_NO UTF-8 (Norway/Northern Sami)
* sv_FI ISO-8859-1 (Finland/Swedish)
* sv_FI ISO-8859-15
* sv_FI UTF-8
There were a few places where la_LN (Latin) was hidden, so I've really removed it now.
Activate kk_KZ, lv_LV, and pt_BR locales These three locales already existed, but they weren't being updated. I've adjusted the xml configuration to get them building (minus the kk_KZ.PT154 locale which will be removed). While here, update the tools/tools/locale Makefile to replace all six LC categories in /usr/src/share with the "make install" target, and then install them on a live system with "make post-install" target.
Pregenerate maps for LC_CTYPE generation These are the products of the new convert_map.pl which localedef will use to generate LC_CTYPE. To avoid duplication, these maps will be used where they are (in tools/tools/locale/etc/final-maps). It turns out that the widths.txt file is only used for LC_CTYPE, so it's being moved there as well and removed from share/colldef.