locale gen tools: Set all UTF-8 to same rollup CTYPE

The CLDR CTYPE definitions are still used for non-UTF8 encodings, but
all UTF-8 locales now share a single "master" CTYPE that knows all
reasonable character sets.

Add locale tool to generate "rollup" UTF-8 src file

I assembled the first version of the "common" UTF-8 file by hand. This
is obviously prone to error and very hard to maintain (the previous
incarnation was never maintained; not once after it was added).

To address these issues, create a new tool (using cldr2def as
inspiration) that builds a composite UTF-8 source file from all
available POSIX input from CLDR. What can't be generated still comes
from a manual fragment that is appended to the end of the common source
file. This allows periodic maintenance when CLDR issues new releases.

We are converging on using this composite (aka "rollup") file for all
UTF-8 locales.

cldr2def: Add 6 Arabic locales: AE EG JO MA QA SA

The lack of Arabic support on BSD seems to be a glaring omission, so
let's help rectify that by adding six locales to the cldr2def tool:

* ar_AE: United Arab Emirates
* ar_EG: Egypt
* ar_JO: Jordan
* ar_MA: Morocco
* ar_QA: Qatar
* ar_SA: Saudi Arabia

There are obviously more (e.g. Iraq, Kuwait, Algeria, Libya, Tunisia,
Syria, etc.) but I selected these as being the most likely to be used,
in my limited opinion. They are UTF-8 locales only.

cldr2def: Slim down ctype src files

I originally modified the tool to use "ranges" to define the CTYPE,
e.g. "<a>;...;<z>" rather than "<a>;<b>;<c> ... <z>". This worked great
on UTF-8, but converting to other encodings is not supported because
part of the range may not exist, or the upper boundary may come before
the lower boundary in the target encoding.

Thus I had to remove that work, but I was able to retain the removal of
the now redundant "print" section. I confirmed that the output without
the "print" section was identical to before, and then I added back a
"print" section with a single element: NO-BREAK_SPACE. This character
is used in quite a few monetary definitions, but it was never mapped to
CTYPE, which I believe is a mistake. NO-BREAK_SPACE is also defined as
a blank, which the localedef tool considers a space as well (so there's
no need to also define a "space" section).

The net change is that multibyte encodings now have non-breaking spaces
1) recognized and 2) defined as printable.

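The range problem described above can be illustrated with a small Python sketch. It is not part of cldr2def; the helper name `range_survives` is hypothetical, and it uses Python's own codec tables as a stand-in for the charmaps localedef would consult. It checks whether every code point in a Unicode range exists in a target encoding and whether the encoded values stay in ascending order.

```python
def range_survives(first: str, last: str, encoding: str) -> bool:
    """Return True if the Unicode range first..last can be expressed as a
    single ascending range in `encoding` (hypothetical helper)."""
    previous = None
    for code_point in range(ord(first), ord(last) + 1):
        try:
            encoded = chr(code_point).encode(encoding)
        except UnicodeEncodeError:
            # Part of the range does not exist in the target encoding.
            return False
        if previous is not None and encoded <= previous:
            # The target encoding orders these characters differently.
            return False
        previous = encoded
    return True


if __name__ == "__main__":
    checks = [
        ("a", "z", "utf-8"),
        ("\u0391", "\u03a9", "utf-8"),      # Greek capitals in UTF-8
        ("\u0391", "\u03a9", "iso8859-7"),  # U+03A2 has no ISO 8859-7 slot
        ("\u0430", "\u044f", "koi8-r"),     # Cyrillic order differs in KOI8-R
    ]
    for first, last, encoding in checks:
        print(encoding, hex(ord(first)), hex(ord(last)),
              range_survives(first, last, encoding))
```

The Greek range fails in ISO 8859-7 because of a hole in the range, and the Cyrillic range fails in KOI8-R because the byte order does not follow Unicode order, which matches the two failure modes named in the commit message.
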
cldr2def: Modify tool to create a "common" UTF-8 locale

With the switch to CLDR definitions, each locale has a specific
encoding. E.g. en_US and fr_FR have Latin characters, el_GR has Greek
characters, etc. Seeing a filename written in Chinese characters while
in en_US.UTF-8 will result in a bunch of "???" because the characters
are illegal for the locale's CTYPE definition.

Previously there was one shared CTYPE for all locales. This was handy
but incorrect. It was also not updated and became slightly bitrotted.

I am going to create a "fake" locale called "common.UTF-8". It is the
same as en_US.UTF-8, but with an LC_CTYPE that is practically universal
in that all common character sets (and symbols) are defined. To use it,
one can do either:

1) Set the LANG environment variable to "common.UTF-8"
2) Set the LC_CTYPE environment variable to "common.UTF-8",
   e.g. "env LANG=fr_FR.UTF-8 LC_CTYPE=common.UTF-8 ls"

This commit only updates the tool. The locale comes later.

In addition to making changes that support the "common" base, the tool
was also tweaked to favor en_US.UTF-8 as the "LOCALES+=" value. This
should reduce a lot of the renaming churn we've been seeing on new
generations.

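Both usage options above rely on the standard POSIX precedence for locale environment variables: LC_ALL overrides LC_CTYPE, which overrides LANG. The following Python sketch only models that lookup order; `effective_ctype` is a hypothetical helper, and the fallback to "C" when nothing is set is an assumption (POSIX leaves the default implementation-defined).

```python
def effective_ctype(env: dict) -> str:
    """Return the locale name that would govern LC_CTYPE for `env`.

    Follows the POSIX precedence LC_ALL > LC_CTYPE > LANG. Empty values
    are treated as unset. Falling back to "C" is an assumption.
    """
    for variable in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(variable)
        if value:
            return value
    return "C"


if __name__ == "__main__":
    # Option 1: LANG alone selects the common locale for every category.
    print(effective_ctype({"LANG": "common.UTF-8"}))
    # Option 2: French messages and formats, universal character types.
    print(effective_ctype({"LANG": "fr_FR.UTF-8", "LC_CTYPE": "common.UTF-8"}))
```

Note that a set LC_ALL would defeat option 2, since it takes priority over LC_CTYPE.
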
locales polish: Remove ISO-8859-1 encoding from 27 locales

The ISO-8859-1 encoding is considered obsolete compared to ISO-8859-15
(which is itself obsolete compared to UTF-8). To help avoid some
potential mismatches, remove the ISO-8859-1 encoding in the following
cases:

1) European localities that already have ISO-8859-15 (25 of 27)
   * ca_(AD|ES|FR|IT)
   * da_DK
   * de_(AT|CH|DE)
   * en_GB
   * (es|eu)_ES
   * fi_FI
   * fr_(BE|CH|FR)
   * is_IS
   * it_(CH|IT)
   * (nb|nn)_NO
   * nl_(BE|NL)
   * pt_PT
   * sv_(FI|SE)

2) Localities where ISO-8859-1 can't represent the local currency symbol
   * en_PH (PH Peso)
   * es_CR (Colon)

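The second case above can be checked quickly in Python. This is only a sketch: it assumes the currency symbols in question are U+20B1 (PESO SIGN) and U+20A1 (COLON SIGN), which the commit message does not spell out, and it uses Python's codecs rather than the project's charmaps. The euro sign is included because it is the usual reason to prefer ISO-8859-15 over ISO-8859-1.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Assumed code points; the commit message names only the currencies.
SYMBOLS = {
    "euro": "\u20ac",
    "peso (en_PH)": "\u20b1",
    "colon (es_CR)": "\u20a1",
}

if __name__ == "__main__":
    for name, symbol in SYMBOLS.items():
        for encoding in ("iso8859-1", "iso8859-15", "utf-8"):
            print(f"{name:14} {encoding:11} {can_encode(symbol, encoding)}")
```
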
locale polishing: lt_LT, et_EE, and en_IE changes

lt_LT: Remove ISO8859-4. If an ISO encoding is needed, -13 would be
used instead.

et_EE: Estonia could use ISO8859-4 (not present), the newer ISO8859-13
(not present) or ISO8859-15 (present). To be uniform with other
European countries that use -15, change the default from UTF-8 to
ISO8859-15. Do not bring in -4 or -13.

en_IE: Ireland only had UTF-8. To be uniform with the UK and Western
Europe, bring in ISO8859-15 and set that as the alias for the
shortname. Do not bring in -1.

Add AT&T Research regex(3) regression testsuite

Before we replace our ancient regex, we need to baseline it. This is a
well-known and maintained regex testsuite from AT&T Research.

The following commands from tools/regression/lib/libc-regex can be
used:

    make full-test-run    (this runs all tests consecutively)
    make test-basic
    make test-categorize
    make test-nullsubexpr
    make test-leftassoc
    make test-rightassoc
    make test-forcedassoc
    make test-repetition

To change the locale, set LOCALE
(e.g. make LOCALE=en_US.UTF-8 test-basic).

These are the baseline results:

    basic       : TEST testregex, 539 tests, 0 errors
    categorize  : TEST testregex, 20 tests, 0 errors
    nullsubexpr : TEST testregex, 84 tests, 31 errors
    leftassoc   : TEST testregex, 12 tests, 12 errors
    rightassoc  : TEST testregex, 24 tests, 0 errors
    forcedassoc : TEST testregex, 48 tests, 8 errors
    repetition  : TEST testregex, 129 tests, 37 errors

UNSUPPORTED: AUGMENTED,SHELL,CLASS_ESCAPE,COMMENT,DELIMITED,DISCIPLINE,
ESCAPE,LEFT,LENIENT,LITERAL,MINIMAL,MULTIPLE,MULTIREF,MUSTDELIM,NULL,
RIGHT,SHELL_DOT,SHELL_ESCAPED,SHELL_GROUP,SHELL_PATH,SPAN,regnexec,
regsubcomp,redecomp

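Since the point of the baseline is to compare later runs against it, a small script can parse the summary lines and flag suites whose error count has grown. The sketch below is not part of the testsuite; `parse_summary` and `regressions` are hypothetical helpers, and it assumes each suite ends with a summary line in the "TEST testregex, N tests, M errors" shape quoted above.

```python
import re

BASELINE = """\
basic       : TEST testregex, 539 tests, 0 errors
categorize  : TEST testregex, 20 tests, 0 errors
nullsubexpr : TEST testregex, 84 tests, 31 errors
leftassoc   : TEST testregex, 12 tests, 12 errors
rightassoc  : TEST testregex, 24 tests, 0 errors
forcedassoc : TEST testregex, 48 tests, 8 errors
repetition  : TEST testregex, 129 tests, 37 errors
"""

SUMMARY = re.compile(r"^(\w+)\s*:\s*TEST testregex, (\d+) tests, (\d+) errors$")


def parse_summary(text: str) -> dict:
    """Map suite name -> (tests, errors) for every summary line in `text`."""
    results = {}
    for line in text.splitlines():
        match = SUMMARY.match(line.strip())
        if match:
            results[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return results


def regressions(baseline: dict, current: dict) -> list:
    """Return the suites whose error count is higher than in the baseline."""
    return sorted(
        name
        for name, (_tests, errors) in current.items()
        if name in baseline and errors > baseline[name][1]
    )


if __name__ == "__main__":
    baseline = parse_summary(BASELINE)
    total_tests = sum(tests for tests, _errors in baseline.values())
    total_errors = sum(errors for _tests, errors in baseline.values())
    print(f"{total_tests} tests, {total_errors} errors in baseline")
```

A replacement regex implementation would be fed through the same parser and compared with `regressions(baseline, current)`.
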
locales: Update ctype charmaps with CLDR 27 data

cldr2def: Update Makefile to generate new POSIX source files

We've been using CLDR v2.0.1 because it was the last version that
provided generated POSIX source files. (The next release, v21, only
provided the Java tool to generate them.) The latest release is version
27.0.1.

This alteration to the Makefile allows the generation of the desired 72
base locales with a single command (after the tool is downloaded and
installed in the same CLDR release directory).

Note that the kk_KZ locale is changing to kk_Cyrl_KZ.

While here, make the necessary updates to charmaps.xml to generate the
locale source files.

Add 17 new locales and really remove Latin

Now that locale definitions are generated, it's easy to add more. I've
added the following new locale definitions:

* en_HK ISO-8859-1 (Hong Kong/English)
* en_HK UTF-8
* en_PH ISO-8859-1 (Philippines/English)
* en_PH UTF-8
* en_SG ISO-8859-1 (Singapore/English)
* en_SG UTF-8
* es_AR ISO-8859-1 (Argentina/Spanish)
* es_AR UTF-8
* es_CR ISO-8859-1 (Costa Rica/Spanish)
* es_CR UTF-8
* es_MX ISO-8859-1 (Mexico/Spanish)
* es_MX UTF-8
* se_FI UTF-8 (Finland/Northern Sami)
* se_NO UTF-8 (Norway/Northern Sami)
* sv_FI ISO-8859-1 (Finland/Swedish)
* sv_FI ISO-8859-15
* sv_FI UTF-8

There were a few places where la_LN (Latin) was hidden, so I've really
removed it now.

Activate kk_KZ, lv_LV, and pt_BR locales

These three locales already existed, but they weren't being updated.
I've adjusted the xml configuration to get them building (minus the
kk_KZ.PT154 locale which will be removed).

While here, update the tools/tools/locale Makefile to replace all six
LC categories in /usr/src/share with the "make install" target, and
then install them on a live system with "make post-install" target.

Fix cldr2def tool and regenerate 2 makefiles as a result

I accidentally changed the wrong makefile, so both the colldef and
ctype makefiles were wrong. I adjusted the tool and regenerated the
makefiles.

While here, remove the unused map.UTF-8 that should have been removed
earlier but wasn't due to a git misuse.

Pregenerate maps for LC_CTYPE generation

These are the products of the new convert_map.pl which localedef will
use to generate LC_CTYPE. To avoid duplication, these maps will be used
where they are (in tools/tools/locale/etc/final-maps).

It turns out that the widths.txt file is only used for LC_CTYPE, so
it's being moved as well as being removed from share/colldef.

Fix three cldr2def character maps

The localedef(1) tool does not allow two symbols to be mapped to the
same Unicode character. I actually don't know if this is really
"wrong", but I had to adjust a couple of character sets that violated
this rule: ARMSCII-8 and Big5HKSCS. Neither is present on Illumos, so
that may explain why localedef(1) wasn't prepared to do anything except
throw an error.

The CP866 charset had trailing garbage at the end of the file that
localedef didn't like, so I removed it.

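A check for the rule above could look like the following Python sketch. It is not the actual fix and does not parse the real charmap syntax: it assumes a simplified, hypothetical input in which each line holds a symbol name followed by its target, separated by whitespace, with `#` starting a comment line. `duplicate_targets` is an illustrative name.

```python
from collections import defaultdict


def duplicate_targets(lines) -> dict:
    """Return {target: [symbols...]} for targets claimed by several symbols.

    Assumes a simplified "SYMBOL TARGET" line format (hypothetical), not
    the full charmap grammar accepted by localedef(1).
    """
    symbols_by_target = defaultdict(list)
    for raw_line in lines:
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            continue  # not a mapping line; ignored in this sketch
        symbol, target = fields[0], fields[1]
        symbols_by_target[target].append(symbol)
    return {
        target: symbols
        for target, symbols in symbols_by_target.items()
        if len(symbols) > 1
    }


if __name__ == "__main__":
    sample = [
        "# hypothetical sample data",
        "<FIRST>  U+0041",
        "<SECOND> U+0042",
        "<THIRD>  U+0041",
    ]
    for target, symbols in duplicate_targets(sample).items():
        print(target, "is claimed by", ", ".join(symbols))
```

Running such a check before invoking localedef would point at the offending entries directly instead of leaving them to be found from the tool's error output.
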
cldr2def: Add LC_CTYPE source file generation support

I added the capability to generate LC_CTYPE source files (really this
is basically extracting a section from the POSIX files), but some logic
was needed to work out how to use the fewest files, because some of
them are large.

I compromised on a scheme that makes two reductions. The first
eliminates true duplicates and uses the SAME+= mechanism to create
symlinks. However, this still leaves some duplicates because while the
output is distinct, the source files are the same (e.g. en_US.ISO8859*
uses the same input file as the en_US.UTF-8 locale, but the LC_CTYPE
products differ). The script identifies those and replaces them with
symlinks. So it looks like a lot of files, but really it's only about
12 or so.

During the actual LC_CTYPE generation, character maps are needed. I
added an Illumos tool to do this, which I had to modify. Unlike
Illumos, we will pregenerate the maps that the tool (convert_map.pl)
produces. I had to spend hours troubleshooting various "invalid" inputs,
so this is definitely something that should not be repeated in the
build.

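The duplicate-elimination idea can be sketched in Python as follows. This is an illustration of grouping identical source files by content and planning symlinks for the duplicates, not the logic cldr2def actually uses; `plan_links` and the file names in the sample are hypothetical.

```python
import hashlib


def plan_links(files: dict) -> tuple:
    """Group identical files and plan symlinks for the duplicates.

    `files` maps a file name to its content as bytes. Returns
    (kept, links): `kept` lists the names that remain real files, and
    `links` maps each duplicate name to the kept file it should point to.
    Names are processed in sorted order so the result is deterministic.
    """
    first_by_digest = {}
    links = {}
    for name in sorted(files):
        digest = hashlib.sha256(files[name]).hexdigest()
        if digest in first_by_digest:
            links[name] = first_by_digest[digest]
        else:
            first_by_digest[digest] = name
    return sorted(first_by_digest.values()), links


if __name__ == "__main__":
    # Hypothetical sample: two locales share identical source content.
    sample = {
        "en_US.UTF-8.src": b"LC_CTYPE latin\n",
        "en_GB.UTF-8.src": b"LC_CTYPE latin\n",
        "el_GR.UTF-8.src": b"LC_CTYPE greek\n",
    }
    kept, links = plan_links(sample)
    print("real files:", kept)
    for duplicate, target in links.items():
        print(f"{duplicate} -> {target}")
```
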