| # UTS #39 |
| |
| ## Modifying |
| |
| To add or fix xidmodifications, look at source/removals.txt. |
| |
| To add or fix confusables, there are multiple source files. Many were |
| machine-generated, then tweaked. They have names like |
| source/confusables-winFonts.txt. The main file is confusables-source.txt. |
| |
| ***There is fairly complex processing for the confusables, so carefully diff the |
| results. Sometimes you may get an unexpected union of two equivalence sets. |
| Look at Testing below for help.*** |
| |
| Look at the following spreadsheets / bugs to see if there are any additional |
| suggestions. |
| |
| * **[Confusable |
| Suggestions](https://docs.google.com/spreadsheet/ccc?key=0ArRWBHdd5mx-dHRXelRVbXRYSVp2QTNDdTBlV1I5X1E&usp=drive_web#gid=0)** |
| * **[Identifier Restriction |
| Suggestions](https://docs.google.com/spreadsheet/ccc?key=0ArRWBHdd5mx-dEJJWkdzZzk4cDRYbEVLTmhraGN0Q3c&usp=drive_web#gid=0)** |
| * *[Sample PRs](https://github.com/unicode-org/unicodetools/pull/841) |
| |
| If so, assess and add to unicodetools/data/security/{version}/data/source/confusables-source.txt — *if needed.* |
| Then in the spreadsheets, move the "new stuff" line to the end. |
| |
| ### File Format |
| There is a brief description of the file format at the top. |
| Each line represents a mapping from a code point or set of code points to a sequence of one or more code points. |
| |
| For example: |
| ``` |
| 0021 ; 01C3 # ( ! → ǃ) EXCLAMATION MARK → LATIN LETTER RETROFLEX CLICK |
| ``` |
| |
| The ordering of characters doesn't matter. |
| So it doesn't matter whether you have the above line, or |
| ``` |
| 01C3 ; 0021 # ( ǃ → !) LATIN LETTER RETROFLEX CLICK → EXCLAMATION MARK |
| ``` |
| It also doesn't matter if you have identical lines; the second one will be a NOOP. |
| |
| The mappings are used to generate equivalence classes. |
| From each equivalence class, one representative member will be chosen, |
| and in the resulting data file, all the other characters will map to that representative. |
| Because of transitivity, the equivalence class will tend to be somewhat looser than expected. |
| |
| We've discussed possible future enhancements: |
| - Have a second, narrower mapping that is more exact. |
| - Allow for mappings from sequences to sequences (instead of just code points to sequences). |
| - Provide for context, perhaps like the Transform rules. |
| Eg [x { a } y → A](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Aarabic_type%3A%5D&g=&i=) |
| |
| ## Before generating |
| |
| First, in CLDR, update the script metadata: |
| http://cldr.unicode.org/development/updating-codes/updating-script-metadata |
| |
| The identifier type & status take this data into account. |
| |
| ## Generating |
| |
| Fix the version string (which will appear inside GenerateConfusables.java) and |
| the REVISION (which will match the new directory). |
| |
| The version/revision strings are shared with other tools; no need to set them separately. |
| |
| Run GenerateConfusables -c -b to generate the files. They will appear in two places. |
| |
| * *for posting, after review*: |
| * {Generated}/security/11.0.0/* |
| * reformatted source, log |
| * $UNICODETOOLS_DIR/data/security/11.0.0/* *including log.txt* |
| |
| The TestSecurity.java test is part of the unit test suite, run by a github CI. |
| It verifies that the confusable mappings are idempotent. |
| |
| Copy the following from the output directory to the top level of the revision directory, and check in. |
| |
| * confusables.txt |
| * confusablesSummary.txt |
| * confusablesWholeScript.txt |
| * intentional.txt |
| * ReadMe.txt |
| * xidmodifications.txt |
| |
| ### Review |
| |
| Review the mappings to make sure that there are no surprises. |
| The biggest issue is if two equivalence classes are mistakenly joined. |
| For example, if you map b to d, then that will join the equivalence class for b with that of d. |
| |
| ### IdentifierStatus.txt & IdentifierType.txt |
| |
| Markus 2020-feb-07 for Unicode 13.0: |
| |
| * Mostly same as above for GenerateConfusables but for these files I ran |
| IdentifierInfo.java. |
| * org.unicode.text.UCD.IdentifierInfo |
| * :point_right: **Note**: When you run GenerateConfusables it also invokes |
| IdentifierInfo. |
| * The version/revision strings are shared with other tools; no need to set |
| them separately. |
| |
| ### Common problems |
| |
| You may see Identifier_Type=Recommended for characters/scripts/blocks that should not be recommended. |
| For example, the initial generation for Unicode 14 "recommended" Znamenny combining marks. |
| Add these to unicodetools/data/security/{version}/data/source/removals.txt. |
| You can use block properties like |
| ``` |
| \p{block=Znamenny_Musical_Notation} ; technical |
| ``` |
| |
| ## Stability |
| |
| We should preserve the target from old versions wherever possible. For example, |
| when the 6.3.0 files were first done, the following reversed order: |
| ``` |
| 0259 ; 01DD ; MA # ( ə → ǝ ) LATIN SMALL LETTER SCHWA → LATIN SMALL LETTER TURNED E # |
| ``` |
| |
| That was because the LATIN SMALL LETTER TURNED E changed identifier status (to |
| become better). Since stability of the ordering is important, that was fixed |
| with the following change. |
| ``` |
| // EXCEPTIONAL CASES |
| // added to preserve source-target ordering in output. |
| |
| lowerIsBetter.put('\u0259', MARK_NFC); |
| ``` |
| |
| Where `Mark_NFC` was the former status. At some point, the code should be modified |
| to read the older version of the file, and favor characters that were there as |
| targets, but for now there are few enough of these that it is simple enough to |
| just add them to this list. |
| |
| ## Testing |
| |
| After making any changes: |
| |
| 1. Look at the log.txt file to see if there are any problems recorded there. |
| 2. Examine the formatted-xxx.txt for the confusables-xxx.txt that you modified. |
| 3. Review the differences between the generated files and the old versions. |
| 4. The summary file is often the most useful. |
| |
| Because of transitive closure, it is sometimes tricky to track down why two |
| items are marked as confusable. The transitive closure not only does x ~ y, y ~ |
| z, therefore x ~ z, but also handled substrings. So if x ~ y, then ax ~ ay. You |
| can end up with conflicts, like if you have x => "", and someplace else x => y, |
| or if you have x ~ xy (and y !~ ""). |
| |
| In confusables.txt, each line that is the product of transitive closure shows |
| you a path after a second #. |
| ``` |
| 248F ; 0038 005F ; SA #\* ( ⒏ → 8_ ) DIGIT EIGHT FULL STOP → DIGIT EIGHT, LOW LINE # →8.→ |
| ``` |
| |
| Find the link in the chain that shouldn't be there. Sometimes that is because of |
| a substring mapping. In the above case, it is mapping _ to . |
| |
| Then search through the source/ for that character, to see what is happening. |
| Sometimes the formatted-xxx.txt is easier to search, since it has both the hex |
| and the character. |
| |
| Searching for a regex expression that contains both the literal characters and |
| the hex is useful. For example, if you see the line: |
| ``` |
| # ර ? |
| |
| ( ර ) 0DBB SINHALA LETTER RAYANNA |
| |
| ← ( ? ) 0DEE SINHALA LITH DIGIT EIGHT |
| ``` |
| |
| Then do a regex search in /data/source on `[ර?]|ODBB|ODEE` |
| |
| Some problems can arise when the NFKC form is very different, like for: |
| ``` |
| || cp == '﬩' || cp == '︒' |
| ``` |
| |
| In those cases, modify getSkipNFKD. |
| |
| Other problems can arise from: |
| |
| 1. Incorrect syntax, eg `[1234 1235]` instead of `[\u1234 \u1235]` |
| 2. Illegal containment. If you have a~ab, you'd get a circular closure, or if |
| you have x => "", and someplace else x => y. |
| 3. Lowercase or too-short hex codes. 1a is interpreted as `\u0031\u0061`, not |
| as `\u001A`. |
| |
| Illegal containment: U+0645 ARABIC LETTER MEEM, U+062D ARABIC LETTER HAH, U+005F |
| LOW LINE overlaps U+0645 ARABIC LETTER MEEM, U+062D ARABIC LETTER HAH |
| |
| from U+0645 ARABIC LETTER MEEM, U+062D ARABIC LETTER HAH, U+0640 ARABIC TATWEEL |
| |
| with reason `[[arabic]] plus [[arabic]→𞻰, [arabic]]` |
| |
| ## Posting |
| |
| Once you've resolved all the problems, copy certain generated files to |
| https://www.unicode.org/Public/security/{version}/ |
| |
| * confusables.txt |
| * confusablesSummary.txt |
| * confusablesWholeScript.txt |
| * intentional.txt |
| * ReadMe.txt |
| * xidmodifications.txt |
| |
| Check that the files are copied to https://www.unicode.org/Public/security/{version}/. |