Found issue with space encoding in Description file in RF2 for SNOMED CT release 20210331

  • Posts: 40
2 years 7 months ago #7128 by Peter Humphries
This space encoding issue (a regular space versus a non-breaking space) is fairly common when HTML source is cut and pasted into Microsoft Excel and the cell text is then cut and pasted from, or exported out of, Excel. (I am guessing that every description in SNOMED CT went through a spreadsheet at some point in its life cycle :D, and the HTML could come from a web page, an email, or a word-processing document, including an online word processor.)

If there is no reason to force a non-breaking space in the description of a SNOMED CT term, then it would make sense to screen the input for <0xA0> and replace every instance with <0x20>, because Microsoft has known about this problem for many years and has made no effort to fix it. Relying on the user to notice that the text encoding is UTF-16 instead of UTF-8 is not a reliable mitigation, especially if UTF-16 is actually permitted for other extended characters.
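
A minimal sketch of that screening step in Python, assuming the term has already been decoded to a Unicode string (so <0xA0> is the character U+00A0). The function name `replace_nbsp` is my own and is not part of any SNOMED CT tooling.

```python
NBSP = "\u00a0"  # non-breaking space (U+00A0)

def replace_nbsp(term: str) -> str:
    """Return the term with every non-breaking space replaced by a regular space."""
    return term.replace(NBSP, " ")

# The term from the reported row, with the hidden character written explicitly.
cleaned = replace_nbsp("Achaetomium\u00a0species")
print(cleaned == "Achaetomium species")
```

This only swaps the one character discussed here; other unusual whitespace (thin spaces, zero-width spaces and so on) would need its own rule.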

As was noted, a regular space (U+0020, decimal 32) sorts differently from a non-breaking space (U+00A0, decimal 160, which strictly speaking is outside 7-bit ASCII). That comes on top of whatever rules your sorting algorithm has about word breaks, which vary between systems and applications. So allowing non-breaking spaces in the descriptions could be an issue all the way down to the end user.
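
To illustrate the sorting effect, here is a small Python sketch using plain code-point ordering, which is only one of many possible collations. Apart from "Achaetomium species", the terms are illustrative strings and were not checked against the release.

```python
# One term contains a non-breaking space (U+00A0); the others use a regular space.
terms = [
    "Achaetomium\u00a0species",
    "Achaetomium strumarium",
    "Achaetomium globosum",
]

# Python's default string sort compares code points, so U+00A0 (160)
# sorts after U+0020 (32) and after every ASCII letter.
by_code_point = sorted(terms)

# Replacing the non-breaking space first restores the expected alphabetical order.
normalized = sorted(t.replace("\u00a0", " ") for t in terms)

print(by_code_point)
print(normalized)
```

With code-point ordering the term containing the non-breaking space drops to the end of the list, while after normalization it sits between the other two. A locale-aware collation may behave differently.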

  • Posts: 50
2 years 7 months ago #7127 by Anibal Jodorcovsky
Good point.

I just checked: the issue still occurs in the Snapshot file:

148561000087114 20210331 1 20621000087109 28881000087108 en 900000000000013009 Achaetomium species 900000000000017005

You don't see it here, of course, but between Achaetomium and species there's an <0xa0> rather than a <0x20>.

See screenshot here: drive.google.com/file/d/11RauQ0y95aRR5LP5b1LrqJEGjevIpq_n/view?usp=sharing
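
For anyone who wants to make the hidden character visible without a hex editor, a rough Python sketch follows. The helper name `describe_non_ascii` is made up for this post, and it assumes the term was read as UTF-8.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

term = "Achaetomium\u00a0species"  # the term from the row above
print(describe_non_ascii(term))    # position and name of the odd character
print(term.encode("utf-8"))        # raw bytes as stored in a UTF-8 file
```

In a UTF-8 file the non-breaking space is stored as the two bytes C2 A0, so a byte-level viewer may show more than a single <0xa0>.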

  • Posts: 13
2 years 7 months ago #7126 by Jon Zammit
Something else to consider is that the non-standard space character may appear in descriptions which have since been updated or inactivated.

So it might be practical to review the SNAPSHOT description file and check whether the character appears in any active rows.

Note that FULL files contain historical information.
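
A possible way to run that check in Python, assuming the usual RF2 description layout (tab-delimited, UTF-8, with a header row containing `id`, `active` and `term`). The function name and the two extra sample rows with `HYPOTHETICAL_*` ids are invented for illustration; only the first data row comes from this thread.

```python
import csv
import io

NBSP = "\u00a0"  # non-breaking space (U+00A0)

def active_terms_with_nbsp(stream):
    """Return (id, term) for active description rows whose term contains U+00A0."""
    # RF2 files are tab-delimited and do not use quoting.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        (row["id"], row["term"])
        for row in reader
        if row["active"] == "1" and NBSP in row["term"]
    ]

header = "id\teffectiveTime\tactive\tmoduleId\tconceptId\tlanguageCode\ttypeId\tterm\tcaseSignificanceId\n"
sample = io.StringIO(
    header
    # Row reported in this thread, with the hidden character written explicitly.
    + "148561000087114\t20210331\t1\t20621000087109\t28881000087108\ten\t900000000000013009\tAchaetomium\u00a0species\t900000000000017005\n"
    # Hypothetical inactive row that also contains the character.
    + "HYPOTHETICAL_1\t20210331\t0\t20621000087109\t28881000087108\ten\t900000000000013009\tOld\u00a0term\t900000000000017005\n"
    # Hypothetical active row with a regular space.
    + "HYPOTHETICAL_2\t20210331\t1\t20621000087109\t28881000087108\ten\t900000000000013009\tClean term\t900000000000017005\n"
)
hits = active_terms_with_nbsp(sample)
print(hits)
```

For a real Snapshot file, the same function could be given a file opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever your copy of the description file lives.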

  • Posts: 432
2 years 7 months ago #7124 by Linda Parisien
Hi Debbie,
Thank you for the feedback, this is much appreciated!

If you have concrete issues, please let us know asap since the files will be regenerated while we are still investigating the cause of the current issue reported by Anibal.
The module ID 11000241103 is "module de la traduction française commune (concept de métadonnées de base)", that is, the common French translation module (core metadata concept).

  • Posts: 22
2 years 7 months ago #7123 by Debbie Onos
Good morning!

Our toolset in Alberta does not seem to have an issue loading the content with hidden characters, but the characters do seem to affect the term's placement in alphabetic search results, even though I am searching and displaying in English. This seems a bit odd to me, but it definitely isn't causing any load issues.

Our system gets one very consistent error from the CA release.
Canadian concepts register to the Canada Health Infoway English module. The French descriptions seem to be linked to the Canada Health Infoway French module, which causes my toolset to give me an error indicating the component module is not consistent with the concept module.
When the concept is from the SCT Core module, the French descriptions seem to be linked to a module ID that appears without a name, "11000241103"; interestingly, these concepts do NOT cause the same error.
We have also gotten some small validation errors, but I had not reported anything yet as this is a new toolset and I wanted to see if the same issues would arise with the next release load.
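
For reference, the kind of consistency check described above could look roughly like this in Python. This is only a guess at the rule, not what our toolset actually runs; the function name and the `MODULE_*`, `C*` and `D*` values are placeholders, not real identifiers.

```python
def module_mismatches(concept_modules, descriptions):
    """List descriptions whose moduleId differs from the moduleId of their concept.

    concept_modules: dict mapping conceptId -> moduleId
    descriptions: iterable of dicts with "id", "conceptId" and "moduleId" keys
    """
    mismatches = []
    for desc in descriptions:
        concept_module = concept_modules.get(desc["conceptId"])
        # Skip descriptions whose concept is not in the lookup at all.
        if concept_module is not None and desc["moduleId"] != concept_module:
            mismatches.append((desc["id"], desc["moduleId"], concept_module))
    return mismatches

# Placeholder data: one concept in an English module, with one English
# and one French description attached to it.
concepts = {"C1": "MODULE_EN"}
descs = [
    {"id": "D1", "conceptId": "C1", "moduleId": "MODULE_EN"},
    {"id": "D2", "conceptId": "C1", "moduleId": "MODULE_FR"},
]
print(module_mismatches(concepts, descs))
```

Whether such a mismatch should count as an error at all depends on the rules a given toolset applies.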


Debbie

  • Posts: 275
2 years 7 months ago #7122 by Kelly Davison
Good morning Anibal, and thanks to you and Jon for your posts. I think this forum is a very good place to start, as the community might have people who have run into a similar issue. Jon is correct - the files you have identified belong to the CA Edition, which is not generated by SNOMED International. It is an interesting problem and the first time that it has been raised. We will investigate and provide a response just as soon as we can follow up with our technical folks and have the information we need.

Many thanks, and have a good day.

Kelly Davison, CTSS
Senior Specialist, Standards
