docs/inputdata.md - external/github.com/unicode-org/unicodetools - Git at Google

 # Input data setup

 ## Unicode 15+ workflow

 Starting with Unicode 15, we are developing most of the Unicode data files
 in this Unicode Tools project, and publish them to the Public folder
 only for alpha/beta/final releases.
 That is, we are reversing the flow of files.

 See [data workflow](data-workflow.md). (Based on
 [issue #144](https://github.com/unicode-org/unicodetools/issues/144).)

 We are also no longer generating and posting files with version suffixes.

 Except: Some files, such as Unihan and ucdxml data files, are developed elsewhere,
 and we continue to ingest them as before.

 ## Source Files

 *Starting with Unicode 15.1, the “source of truth” for most data files is in the repo,
 and most of this section is obsolete. See [data workflow](data-workflow.md).
 The biggest exception is Unihan.zip, which we don't track in the repo; see the Unihan section below.
 Also, it's still useful to delete the BIN files/folders after changing data files.*

 ### Unihan

 You may need to manually change the "Unihan-8.0.0d2 Folder" to "Unihan".

 Unzip the Unihan.zip file into a "Unihan" subfolder.

 Starting with Unicode 13, we split the Unihan data into single-property files
 and parse those.

 Run the script that is checked in at
 [py/splitunihan.py](../py/splitunihan.py)
 with one argument, the path to the Unihan folder.

 Ignore or delete the Unihan\*.txt files now. Do not check them into the tools
 any more.

 Check for new and no-longer-present files (Unihan properties).
 `git add` and `git rm` as necessary.

 ### Fetching files from Public

 Only for Unicode 15.0 and earlier:

 The source files that you will need for a release such as 8.0.0 are in:

 *   [ftp://unicode.org/Public/8.0.0/ucd](ftp://unicode.org/Public/8.0.0/ucd)
 *   [ftp://unicode.org/Public/UCA/8.0.0/](ftp://unicode.org/Public/UCA/8.0.0/)
 *   [ftp://unicode.org/Public/idna/8.0.0/](ftp://unicode.org/Public/idna/8.0.0/)
 *   [ftp://unicode.org/Public/security/8.0.0/](ftp://unicode.org/Public/security/8.0.0/)
 *   [ftp://unicode.org/Public/emoji/1.0/](ftp://unicode.org/Public/emoji/1.0/)
     *— note version*
 *   [ftp://unicode.org/Public/8.0.0/ucdxml/](ftp://unicode.org/Public/8.0.0/ucdxml/)
     *— but NOT:*
     *   ucd.\*.flat.zip or
     *   ucd.all.\*.zip

 You will go to each one of these and download the contents into the appropriate
 local git directory, such as:

 *   <https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd>
 *   <https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/uca>
 *   ...

 **Important:** if the files in a directory are not yet released, and may have
 changed, then you'll need to delete these cached files (the exact location may
 depend on your configuration):

 *   {workspace}/Generated/BIN/8.0.0.0/\*
 *   {workspace}/Generated/BIN/UCD_Data8.0.0.bin

 Below are the original instructions for how to create these data files. Note
 that they do not 100% reflect the files published on
 <https://www.unicode.org/Public/>

 Historical note about svn (Subversion):
 In svn, each file has version history, as much as possible. For each Unicode
 version, the entire previous version's file folder was copied via `svn cp`, then
 files removed (`svn rm`) that are no longer part of the release, files renamed
 (`svn mv`) that had their name changed, files moved into other folders
 (`svn mkdir` + `svn mv`) as the organization changed.

 In git, this is optional: git tries to figure out file moves on its own.
 This works well if the file contents has not changed much.
 However, once files are checked into git, it is still convenient to use
 `git add` for new files, `git rm` for ones that are going away, and
 `git mv` for ones that are renamed or move to a different folder.

 In the unicodetools repo, we only use filenames without version suffixes, to support the version
 history without `git mv` renaming for each file each time. Exception: In some
 UCD versions there were both CamelCase `ReadMe.txt` and uppercase `README.TXT` which
 does not work on case-insensitive file systems (Mac & Windows). The uppercase
 files have the version suffix.

 ### Removing Suffixes

 Only for Unicode 14 and earlier:

 For the ucd and uca files, you will have to remove the suffixes.

 Tip: On Linux, you can remove version suffixes on the command line like this:
 ```
 data/ucd/6.3.0-Update$ rename "s/-\d(\.\d\.\d)?//" *.txt *.html
 ```

 On other platforms, you can use a script like this Python script:

 `desuffixucd.py`

 ```python
 #!/usr/bin/python3
 #
 # desuffixucd.py
 #
 # 2013-05-02 Carved out of ICU's preparseucd.py.
 # Markus Scherer

 """Takes a tree of files and renames each file if necessary,
 removing Unicode-data-file-style version number suffixes."""

 import os
 import os.path
 import re
 import shutil
 import sys

 # Get the standard basename from a versioned filename.
 # For example, match "UnicodeData-6.1.0d8.txt"
 # so we can turn it into "UnicodeData.txt".
 _file_version_re = re.compile("([a-zA-Z0-9_-]+)" +
                               "-[0-9]+(?:\\.[0-9]+)*(?:d[0-9]+)?" +
                               "(\\.[a-z]+)$")

 def main():
   if len(sys.argv) < 2:
     print("Usage: %s  path/to/UCD/root" % sys.argv[0])
     return
   ucd_root = sys.argv[1]
   source_files = []
   for root, dirs, files in os.walk(ucd_root):
     for file in files:
       source_files.append(os.path.join(root, file))
   for source_file in source_files:
     (folder, basename) = os.path.split(source_file)
     match = _file_version_re.match(basename)
     if match:
       new_basename = match.group(1) + match.group(2)
       if new_basename != basename:
         print("Removing version suffix from " + source_file)
         # ... so that we can easily compare UCD files.
         new_source_file = os.path.join(folder, new_basename)
         shutil.move(source_file, new_source_file)


 if __name__ == "__main__":
   main()
 ```

 This is checked into /data, so you can clean a target directory (such as "staging") with:
 ```
 $ cd {workspace}/unicodetools/data/ucd/staging
 $ ../../desuffixucd.py .
 ```

 ## Original data file setup instructions

 ### 2. Download all of the UnicodeData files for each version into UCD_DIR.

 TODO(Markus): Adjust notes about data files when they are in svn.

 The folder names must be of the form: "3.2.0-Update", so rename the folders on
 the
 Unicode site to this format. If the folder contains /ucd/, then make the
 contents of that directory be the contents of the x.x.x-Update directory. That
 is, each directory will directly contain files like PropList....txt

 #### 2a Ensure Complete Release

 If you are downloading any "incomplete" release (one that does not contain a
 complete set of data files for that release, you need to also download the
 previous complete release). Most of the N.M-Update directories are complete,
 *except*:

 *   4.0-Update, which does not contain a copy of Unihan.txt and some other files
 *   3.1-Update, which does not contain a copy of BidiMirroring.txt

 Also, make the following changes to UnicodeData for 1.1.5:

 **Delete**

 ```
 3400;HANGUL SYLLABLE KIYEOK A;Lo;0;L;1100 1161;;;;N;;;;;
 ...
 4DFF;HANGUL SYLLABLE MIEUM WEO RIEUL-THIEUTH;Lo;0;L;1106 116F 11B4;;;;N;;;;;
 4E00;;Lo;0;L;;;;;N;;;;;
 ```

 **Add:**

 ```
 4E00;;Lo;0;L;;;;;N;;;;;
 9FA5;;Lo;0;L;;;;;N;;;;;
 E000;;Co;0;L;;;;;N;;;;;
 F8FF;;Co;0;L;;;;;N;;;;;
 ```

 **And from a later version of Unicode, add:**

 ```
 F900;CJK COMPATIBILITY IDEOGRAPH-F900;Lo;0;L;8C48;;;;N;;;;;
 ...
 FA2D;CJK COMPATIBILITY IDEOGRAPH-FA2D;Lo;0;L;9DB4;;;;N;;;;;
 ```

 #### 2b. UCA data

 If you are building any of the UCA tools, you need to get a copy of the UCA data file
 from https://www.unicode.org/reports/tr10/#AllKeys. The default location for this is:
 ```
 BASE_DIR + "Collation\allkeys" + VERSION + ".txt".
 ```

 If you have it in a different location, change that value for KEYS in UCA.java, and the value for BASE_DIR

 #### 2c. Here is an example of the default directory structure with files.

 *   workspace/DATA/
     *   uca/
         *   4.0.0/
             *   allkeys-4.0.0.txt
         *   ...
         *   6.0.0
             *   allkeys-6.0.0d1.txt
     *   UCD
         *   1.1.0-Update
             *   UnicodeData-1.1.5.txt
         *   ...
         *   6.0.0-Update/
             *   auxiliary/
             *   extracted/
             *   Unihan/ (see Unihan section above)
             *   ArabicShaping-6.0.0d2.txt
             *   ...
             *   UnicodeData-6.0.0d6.txt
	# Input data setup

	## Unicode 15+ workflow

	Starting with Unicode 15, we are developing most of the Unicode data files
	in this Unicode Tools project, and publish them to the Public folder
	only for alpha/beta/final releases.
	That is, we are reversing the flow of files.

	See [data workflow](data-workflow.md). (Based on
	[issue #144](https://github.com/unicode-org/unicodetools/issues/144).)

	We are also no longer generating and posting files with version suffixes.

	Except: Some files, such as Unihan and ucdxml data files, are developed elsewhere,
	and we continue to ingest them as before.

	## Source Files

	*Starting with Unicode 15.1, the “source of truth” for most data files is in the repo,
	and most of this section is obsolete. See [data workflow](data-workflow.md).
	The biggest exception is Unihan.zip, which we don't track in the repo; see the Unihan section below.
	Also, it's still useful to delete the BIN files/folders after changing data files.*

	### Unihan

	You may need to manually change the "Unihan-8.0.0d2 Folder" to "Unihan".

	Unzip the Unihan.zip file into a "Unihan" subfolder.

	Starting with Unicode 13, we split the Unihan data into single-property files
	and parse those.

	Run the script that is checked in at
	[py/splitunihan.py](../py/splitunihan.py)
	with one argument, the path to the Unihan folder.

	Ignore or delete the Unihan\*.txt files now. Do not check them into the tools
	any more.

	Check for new and no-longer-present files (Unihan properties).
	`git add` and `git rm` as necessary.

	### Fetching files from Public

	Only for Unicode 15.0 and earlier:

	The source files that you will need for a release such as 8.0.0 are in:

	* [ftp://unicode.org/Public/8.0.0/ucd](ftp://unicode.org/Public/8.0.0/ucd)
	* [ftp://unicode.org/Public/UCA/8.0.0/](ftp://unicode.org/Public/UCA/8.0.0/)
	* [ftp://unicode.org/Public/idna/8.0.0/](ftp://unicode.org/Public/idna/8.0.0/)
	* [ftp://unicode.org/Public/security/8.0.0/](ftp://unicode.org/Public/security/8.0.0/)
	* [ftp://unicode.org/Public/emoji/1.0/](ftp://unicode.org/Public/emoji/1.0/)
	— note version
	* [ftp://unicode.org/Public/8.0.0/ucdxml/](ftp://unicode.org/Public/8.0.0/ucdxml/)
	— but NOT:
	* ucd.\*.flat.zip or
	* ucd.all.\*.zip

	You will go to each one of these and download the contents into the appropriate
	local git directory, such as:

	* <https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd>
	* <https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/uca>
	* ...

	Important: if the files in a directory are not yet released, and may have
	changed, then you'll need to delete these cached files (the exact location may
	depend on your configuration):

	* {workspace}/Generated/BIN/8.0.0.0/\*
	* {workspace}/Generated/BIN/UCD_Data8.0.0.bin

	Below are the original instructions for how to create these data files. Note
	that they do not 100% reflect the files published on
	<https://www.unicode.org/Public/>

	Historical note about svn (Subversion):
	In svn, each file has version history, as much as possible. For each Unicode
	version, the entire previous version's file folder was copied via `svn cp`, then
	files removed (`svn rm`) that are no longer part of the release, files renamed
	(`svn mv`) that had their name changed, files moved into other folders
	(`svn mkdir` + `svn mv`) as the organization changed.

	In git, this is optional: git tries to figure out file moves on its own.
	This works well if the file contents has not changed much.
	However, once files are checked into git, it is still convenient to use
	`git add` for new files, `git rm` for ones that are going away, and
	`git mv` for ones that are renamed or move to a different folder.

	In the unicodetools repo, we only use filenames without version suffixes, to support the version
	history without `git mv` renaming for each file each time. Exception: In some
	UCD versions there were both CamelCase `ReadMe.txt` and uppercase `README.TXT` which
	does not work on case-insensitive file systems (Mac & Windows). The uppercase
	files have the version suffix.

	### Removing Suffixes

	Only for Unicode 14 and earlier:

	For the ucd and uca files, you will have to remove the suffixes.

	Tip: On Linux, you can remove version suffixes on the command line like this:
	```
	data/ucd/6.3.0-Update$ rename "s/-\d(\.\d\.\d)?//" .txt .html
	```

	On other platforms, you can use a script like this Python script:

	`desuffixucd.py`

	```python
	#!/usr/bin/python3
	#
	# desuffixucd.py
	#
	# 2013-05-02 Carved out of ICU's preparseucd.py.
	# Markus Scherer

	"""Takes a tree of files and renames each file if necessary,
	removing Unicode-data-file-style version number suffixes."""

	import os
	import os.path
	import re
	import shutil
	import sys

	# Get the standard basename from a versioned filename.
	# For example, match "UnicodeData-6.1.0d8.txt"
	# so we can turn it into "UnicodeData.txt".
	_file_version_re = re.compile("([a-zA-Z0-9_-]+)" +
	"-[0-9]+(?:\\.[0-9]+)*(?:d[0-9]+)?" +
	"(\\.[a-z]+)$")

	def main():
	if len(sys.argv) < 2:
	print("Usage: %s path/to/UCD/root" % sys.argv[0])
	return
	ucd_root = sys.argv[1]
	source_files = []
	for root, dirs, files in os.walk(ucd_root):
	for file in files:
	source_files.append(os.path.join(root, file))
	for source_file in source_files:
	(folder, basename) = os.path.split(source_file)
	match = _file_version_re.match(basename)
	if match:
	new_basename = match.group(1) + match.group(2)
	if new_basename != basename:
	print("Removing version suffix from " + source_file)
	# ... so that we can easily compare UCD files.
	new_source_file = os.path.join(folder, new_basename)
	shutil.move(source_file, new_source_file)


	if __name__ == "__main__":
	main()
	```

	This is checked into /data, so you can clean a target directory (such as "staging") with:
	```
	$ cd {workspace}/unicodetools/data/ucd/staging
	$ ../../desuffixucd.py .
	```

	## Original data file setup instructions

	### 2. Download all of the UnicodeData files for each version into UCD_DIR.

	TODO(Markus): Adjust notes about data files when they are in svn.

	The folder names must be of the form: "3.2.0-Update", so rename the folders on
	the
	Unicode site to this format. If the folder contains /ucd/, then make the
	contents of that directory be the contents of the x.x.x-Update directory. That
	is, each directory will directly contain files like PropList....txt

	#### 2a Ensure Complete Release

	If you are downloading any "incomplete" release (one that does not contain a
	complete set of data files for that release, you need to also download the
	previous complete release). Most of the N.M-Update directories are complete,
	except:

	* 4.0-Update, which does not contain a copy of Unihan.txt and some other files
	* 3.1-Update, which does not contain a copy of BidiMirroring.txt

	Also, make the following changes to UnicodeData for 1.1.5:

	Delete

	```
	3400;HANGUL SYLLABLE KIYEOK A;Lo;0;L;1100 1161;;;;N;;;;;
	...
	4DFF;HANGUL SYLLABLE MIEUM WEO RIEUL-THIEUTH;Lo;0;L;1106 116F 11B4;;;;N;;;;;
	4E00;;Lo;0;L;;;;;N;;;;;
	```

	Add:

	```
	4E00;;Lo;0;L;;;;;N;;;;;
	9FA5;;Lo;0;L;;;;;N;;;;;
	E000;;Co;0;L;;;;;N;;;;;
	F8FF;;Co;0;L;;;;;N;;;;;
	```

	And from a later version of Unicode, add:

	```
	F900;CJK COMPATIBILITY IDEOGRAPH-F900;Lo;0;L;8C48;;;;N;;;;;
	...
	FA2D;CJK COMPATIBILITY IDEOGRAPH-FA2D;Lo;0;L;9DB4;;;;N;;;;;
	```

	#### 2b. UCA data

	If you are building any of the UCA tools, you need to get a copy of the UCA data file
	from https://www.unicode.org/reports/tr10/#AllKeys. The default location for this is:
	```
	BASE_DIR + "Collation\allkeys" + VERSION + ".txt".
	```

	If you have it in a different location, change that value for KEYS in UCA.java, and the value for BASE_DIR

	#### 2c. Here is an example of the default directory structure with files.

	* workspace/DATA/
	* uca/
	* 4.0.0/
	* allkeys-4.0.0.txt
	* ...
	* 6.0.0
	* allkeys-6.0.0d1.txt
	* UCD
	* 1.1.0-Update
	* UnicodeData-1.1.5.txt
	* ...
	* 6.0.0-Update/
	* auxiliary/
	* extracted/
	* Unihan/ (see Unihan section above)
	* ArabicShaping-6.0.0d2.txt
	* ...
	* UnicodeData-6.0.0d6.txt