Found issue with space encoding in Description file in RF2 for SNOMED CT release 20210331

Kelly Davison

3 years 7 months ago #7144 by Kelly Davison

Thank you all for contributing to this discussion. We appreciate your collaboration and partnership on this issue.
Summary: Anibal from QC reported a space-encoding issue in the RF2 files of the March 2021 CA Edition release package. Upon investigation, it appears that non-breaking spaces (NBSP, U+00A0, which UTF-8 encodes as the two bytes <0xc2 0xa0>) are present in the files instead of the expected regular space <0x20>. The NBSPs were introduced into the files via copy/paste from HTML-based tools. Our technical team found “…32 [instances] in the delta, 75 in the snapshot, and 112 in the full. That means that 32 are new in this release, there are 43 existing ones from previous releases, and 37 that were once published as NBSPs and replaced later.”
This issue will be corrected in releases going forward, and all spaces will be encoded as the regular ASCII space <0x20>. Anibal has confirmed that preprocessing has addressed the issue. Given the imminent release of the September 2021 CA Edition, and because re-issuing the March release would have a much larger impact than simply correcting the issue in subsequent releases, new March 2021 CA release files will not be generated, and the NBSPs in them will remain. Implementers are advised to use the September 2021 CA Edition release.
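
For anyone who wants to check their own copy before preprocessing, here is a minimal Python sketch of how NBSPs could be counted. The function name `count_nbsp` is hypothetical. The first sample row is the Description row quoted elsewhere in this thread, with tab separators assumed (the hexdump shows byte 09 between fields); the second row is made up for contrast. This is not necessarily how the counts quoted above were produced.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE


def count_nbsp(lines):
    """Return (rows containing an NBSP, total NBSP characters) for text lines."""
    rows_with_nbsp = 0
    total_nbsp = 0
    for line in lines:
        found = line.count(NBSP)
        if found:
            rows_with_nbsp += 1
            total_nbsp += found
    return rows_with_nbsp, total_nbsp


sample_rows = [
    # Row quoted in this thread; the tab separators are an assumption.
    "148561000087114\t20210331\t1\t20621000087109\t28881000087108\ten\t"
    "900000000000013009\tAchaetomium\u00a0species\t900000000000017005\r\n",
    # Made-up row that uses a regular space, for contrast.
    "1\t20210331\t1\t2\t3\ten\t4\tExample term\t5\r\n",
]

rows, total = count_nbsp(sample_rows)
print(rows, total)
```

For a real file, the same function can be given an open text file, for example one opened with `open(path, encoding="utf-8", newline="")`, since iterating over a text file yields its lines.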

Peter Humphries

3 years 7 months ago - 3 years 7 months ago #7140 by Peter Humphries

anibal wrote: Do you have a simple one-liner that works in Mac OS X Big Sur without having to install anything more complex?

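My family macOS expert tells me that you could just use "sed 's/<nbsp>/ /g'", where <nbsp> is the literal, copy-pasted character (the single quotes keep the shell from splitting the expression at the space, and the trailing g replaces every occurrence on a line rather than only the first). Let the tool deal with it; it works with fancy Unicode characters (his example included replacing the "thumbs up" emoji with the "stranded on a deserted island" emoji :lol:).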

It is a bit tricky to illustrate, since we are dealing with white space characters, but you would copy and paste the non-breaking space from between the two words into your terminal where "<nbsp>" appears in the command string. Then, save that in your script (perhaps with a note that there is a hidden NBSP in there, for the next coder who has to support it). This could be the safer way to do the replacement, too, because UTF-8 is a variable-length encoding: a byte that you replace by its hexadecimal value could be part of another, valid and required, character. For example, à is encoded as 0xC3 0xA0, so replacing every lone 0xA0 byte would corrupt it.
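
The difference between replacing a character and replacing a single byte can be sketched in Python as follows. The sample string is made up for illustration; only the byte values of à and NBSP are facts about UTF-8.

```python
NBSP = "\u00a0"

# Made-up sample: "voilà", then an NBSP, then "tout".
text = "voil\u00e0" + NBSP + "tout"
raw = text.encode("utf-8")  # b'voil\xc3\xa0\xc2\xa0tout'

# Character-level replacement: only the NBSP is touched.
by_character = text.replace(NBSP, " ")

# Byte-level replacement of a lone 0xA0: this also hits the second byte of "à".
by_single_byte = raw.replace(b"\xa0", b" ")

try:
    by_single_byte.decode("utf-8")
    single_byte_result_is_valid = True
except UnicodeDecodeError:
    single_byte_result_is_valid = False

print(by_character)
print(single_byte_result_is_valid)
```

Replacing the full two-byte sequence 0xC2 0xA0 does not have this problem in valid UTF-8, because 0xC2 only ever appears as a lead byte and never as the trailing byte of another character.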

But, in general, I agree that the source file should be cleaned up. :whistle:

Last edit: 3 years 7 months ago by Peter Humphries. Reason: One more thought about UTF-8's variable-length encoding and its implications for replacing hexadecimal values.

Anibal Jodorcovsky
Topic Author

3 years 7 months ago #7139 by Anibal Jodorcovsky

I found out the reason, it's UTF-8 encoding:

www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=c2+a0&mode=bytes

So the actual NBSP (no-break space, U+00A0) is encoded in UTF-8 as \xc2\xa0, which is why my gsed command from before fixes the issue.
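
The same lookup can be reproduced locally. This is a small Python sketch that decodes the two bytes and asks the standard `unicodedata` module for the character name; it is offered as a cross-check, not as a description of how the linked tool works.

```python
import unicodedata

raw = bytes.fromhex("c2 a0")   # the two bytes from the hexdump
text = raw.decode("utf-8")     # decodes to a single character

name = unicodedata.name(text)  # official Unicode character name
code_point = f"U+{ord(text):04X}"

print(len(text), code_point, name)
```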

CHI, please fix this for the next release.

Guillermo Reynoso

3 years 7 months ago - 3 years 7 months ago #7138 by Guillermo Reynoso

>> Look for a0 and you'll see the issue. Look at the two previous bytes, 6d and c2. 6d is the 'm' in Achaetomium, c2 is the Â, which you see once you do the gsed command to get rid of the a0.

Aníbal, that would be the UTF-8 representation of NBSP (encoded as C2 A0 in UTF-8); the character between Achaetomium and species is also an NBSP. The encoding of code point 0xA0 in UTF-8 is two bytes: 11000010 10100000 (0xC2 0xA0).
However, I think your general point is valid. I will run broader testing on other potential non-printable characters and get back to you on Monday.
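
The two-byte pattern quoted above can be derived by hand. The sketch below, with the hypothetical helper name `encode_two_byte_utf8`, applies the standard UTF-8 layout for code points from 0x80 to 0x7FF (110xxxxx 10xxxxxx) to 0xA0.

```python
def encode_two_byte_utf8(code_point):
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    lead = 0b11000000 | (code_point >> 6)                  # 110xxxxx: upper 5 bits
    continuation = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: lower 6 bits
    return bytes([lead, continuation])


encoded = encode_two_byte_utf8(0xA0)
bits = " ".join(f"{byte:08b}" for byte in encoded)

print(encoded.hex(), bits)
```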

We don't use sed for this processing, so I am afraid I don't have a one-liner, but you have managed to find a workaround.

Have a nice weekend. I will update you with my findings early next week.
Guillermo

Last edit: 3 years 7 months ago by Guillermo Reynoso. Reason: Forum software replaced an expression with an emoji, not my original intention

Anibal Jodorcovsky
Topic Author

3 years 7 months ago #7137 by Anibal Jodorcovsky

It seems I need to replace both \xc2 and \xa0 with \x20, so I did this:

gsed 's/\xc2\xa0/\x20/g' sct2_Description_Snapshot_CanadianEdition_20210331.txt > sct2_Description_Snapshot_CanadianEdition_20210331.txt-cleanup.txt

and the result seems OK. Not exactly sure why though.
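
For readers without gsed, the same replacement can be sketched in Python. The likely reason it works is that in UTF-8 the NBSP is the two-byte sequence \xc2\xa0, so both bytes have to be replaced together, as other replies in this thread explain. The function name `clean_file` and the temporary file names are hypothetical; the demo writes a tiny made-up file rather than touching a real release file.

```python
import os
import tempfile

NBSP_UTF8 = b"\xc2\xa0"  # UTF-8 encoding of U+00A0


def clean_file(src_path, dst_path):
    """Copy src to dst, replacing each UTF-8 NBSP with a regular space.

    Works on raw bytes, like the gsed command, and returns the number of
    NBSPs that were replaced.
    """
    replaced = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            replaced += line.count(NBSP_UTF8)
            dst.write(line.replace(NBSP_UTF8, b" "))
    return replaced


# Demo on a throwaway file with one made-up line.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "in.txt")
    dst_path = os.path.join(tmp, "out.txt")
    with open(src_path, "wb") as f:
        f.write(b"Achaetomium\xc2\xa0species\r\n")
    n = clean_file(src_path, dst_path)
    with open(dst_path, "rb") as f:
        cleaned = f.read()

print(n, cleaned)
```

Binary mode is used so that the CR LF line endings and every other byte pass through unchanged.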

Anibal Jodorcovsky
Topic Author

3 years 7 months ago #7136 by Anibal Jodorcovsky

Well, it turns out that there are more hidden characters in the RF2 Snapshot Description file. Even after I was able to clean up the 0xa0, I now see a 0xc2, which is an Â.

Look, I took the file:

/Users/anibal/Downloads/SnomedCT_Canadian_EditionRelease_PRODUCTION_20210331T120000Z/Snapshot/Terminology/sct2_Description_Snapshot_CanadianEdition_20210331.txt

I opened that file and extracted only line # 147423 which is this:

148561000087114 20210331 1 20621000087109 28881000087108 en 900000000000013009 Achaetomium species 900000000000017005

Between Achaetomium and species there are two bytes. Look at what hexdump gives for that file:

Anibals-New-MacBook-Air:Terminology anibal$ hexdump -v filetest.txt
0000000 31 34 38 35 36 31 30 30 30 30 38 37 31 31 34 09
0000010 32 30 32 31 30 33 33 31 09 31 09 32 30 36 32 31
0000020 30 30 30 30 38 37 31 30 39 09 32 38 38 38 31 30
0000030 30 30 30 38 37 31 30 38 09 65 6e 09 39 30 30 30
0000040 30 30 30 30 30 30 30 30 30 31 33 30 30 39 09 41
0000050 63 68 61 65 74 6f 6d 69 75 6d c2 a0 73 70 65 63
0000060 69 65 73 09 39 30 30 30 30 30 30 30 30 30 30 30
0000070 30 31 37 30 30 35 0d 0a
0000078

Look for a0 and you'll see the issue. Look at the two previous bytes, 6d and c2. 6d is the 'm' in Achaetomium, c2 is the Â, which you see once you do the gsed command to get rid of the a0.
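
A short Python sketch of what seems to be happening here. The bytes are copied from the hexdump above (offsets 0x4f to 0x62, the term field); the assumption is that the viewer showing Â interprets the leftover byte as Latin-1, where 0xC2 is Â.

```python
# Bytes of the term field, copied from the hexdump above.
raw = bytes.fromhex(
    "41 63 68 61 65 74 6f 6d 69 75 6d c2 a0 73 70 65 63 69 65 73"
)

# Replacing only the a0 byte leaves a stray c2 behind.
stripped = raw.replace(b"\xa0", b" ")

# Assumption: a Latin-1 view of the result, where byte 0xC2 is the letter Â.
as_latin1 = stripped.decode("latin-1")

# The stray c2 is no longer followed by a continuation byte,
# so the result is not valid UTF-8 any more.
try:
    stripped.decode("utf-8")
    still_valid_utf8 = True
except UnicodeDecodeError:
    still_valid_utf8 = False

print(as_latin1)
print(still_valid_utf8)
```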

So, it seems there is more weirdness in these files. Every tool out there asks that we clean up the input files (snap2SNOMED just mentioned that in their latest presentation), and TermManager is also quite unforgiving of unclean files, so I'd have to say that the core source SNOMED files should then also be as clean as possible.

In the meantime, can you give me a hand in understanding what's happening with these files?
