Found issue with space encoding in Description file in RF2 for SNOMED CT release 20210331
- Peter Humphries
- Offline
- Posts: 40
3 years 2 months ago #7128
by Peter Humphries
Reply from Peter Humphries on the topic Found issue with space encoding in Description file in RF2 for SNOMED CT release 20210331
This space encoding issue (a regular space, U+0020, versus a non-breaking space, U+00A0) is fairly common when HTML source is cut and pasted into Microsoft Excel and the cell text is then cut and pasted from, or exported out of, Excel. (I am guessing that every description in SNOMED CT went through a spreadsheet at some point in its life cycle, and the HTML could have come from a web page, an email, or a word-processing document, including one from an online word processor.)
If there is no reason to force a non-breaking space in the description of a SNOMED CT term, it would make sense to screen the input for <0xA0> and replace every instance with <0x20>, because Microsoft has known about this problem for many years and has made no effort to fix it. Relying on the user to notice that the text encoding is UTF-16 instead of UTF-8 is not a reliable mitigation, especially if UTF-16 is actually permitted for other extended characters.
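The screening step suggested here could look something like the following minimal Python sketch. The function name `normalize_spaces` is made up for illustration, and the sketch assumes the term has already been decoded to a Unicode string (in a UTF-8 file the non-breaking space is stored as the two bytes C2 A0, not a lone A0).

```python
NBSP = "\u00a0"   # non-breaking space: byte A0 in Latin-1, bytes C2 A0 in UTF-8
SPACE = "\u0020"  # regular space

def normalize_spaces(term: str) -> str:
    """Replace every non-breaking space in a term with a regular space."""
    return term.replace(NBSP, SPACE)

raw = "Achaetomium\u00a0species"
clean = normalize_spaces(raw)

print(raw == clean)          # False: the two strings only look identical
print(raw.encode("utf-8"))   # b'Achaetomium\xc2\xa0species'
print(clean.encode("utf-8")) # b'Achaetomium species'
```

Whether every U+00A0 can safely be replaced is a content decision for the release owners; the sketch only shows that the replacement itself is a one-liner.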
As was noted, a regular space (U+0020, ASCII 32) sorts differently from a non-breaking space (U+00A0, code point 160, which is outside the ASCII range), in addition to any rules your sorting algorithm might have about word breaks, and the result can vary across systems and applications. So allowing non-breaking spaces in the descriptions could be an issue all the way down to the end user.
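To make the sorting effect concrete, here is a small Python sketch. The two strings are illustrative only, not rows taken from the release, and the sketch uses Python's default code-point ordering; locale-aware collation in a database or terminology server may behave differently.

```python
# Two illustrative terms: the first contains U+00A0, the second a regular space.
terms = ["Achaetomium\u00a0species", "Achaetomium strumarium"]

# Default string comparison orders by code point, so U+00A0 (160)
# sorts after U+0020 (32) and after every ASCII letter.
by_code_point = sorted(terms)

# Sorting on a normalized key restores the expected alphabetical order.
by_normalized = sorted(terms, key=lambda t: t.replace("\u00a0", " "))

def show(items):
    """Make the hidden character visible when printing."""
    return [t.replace("\u00a0", "<NBSP>") for t in items]

print(show(by_code_point))  # ['Achaetomium strumarium', 'Achaetomium<NBSP>species']
print(show(by_normalized))  # ['Achaetomium<NBSP>species', 'Achaetomium strumarium']
```

Alphabetically "species" comes before "strumarium", yet the term with the non-breaking space lands last in the plain sort.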
- Anibal Jodorcovsky
- Topic Author
- Offline
- Posts: 50
3 years 2 months ago #7127
by Anibal Jodorcovsky
Reply from Anibal Jodorcovsky on the topic Found issue with space encoding in Description file in RF2 for SNOMED CT release 20210331
Good point.
I just checked; the issue still occurs in the Snapshot file:
148561000087114 20210331 1 20621000087109 28881000087108 en 900000000000013009 Achaetomium species 900000000000017005
You don't see it here, of course, but between Achaetomium and species there's an <0xa0> rather than an <0x20>.
See screenshot here: drive.google.com/file/d/11RauQ0y95aRR5LP5b1LrqJEGjevIpq_n/view?usp=sharing
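For anyone who wants to confirm this without a hex viewer, a rough Python sketch follows. It rebuilds the row quoted above as a tab-delimited RF2 Description record with the non-breaking space written as an explicit escape. The helper name `unusual_whitespace` is made up for illustration.

```python
import unicodedata

# RF2 Description file columns, in order.
COLUMNS = ["id", "effectiveTime", "active", "moduleId", "conceptId",
           "languageCode", "typeId", "term", "caseSignificanceId"]

# The row quoted above, rebuilt with tabs and an explicit NBSP escape.
row = "\t".join([
    "148561000087114", "20210331", "1", "20621000087109", "28881000087108",
    "en", "900000000000013009", "Achaetomium\u00a0species", "900000000000017005",
])

def unusual_whitespace(text):
    """List (index, code point, Unicode name) for whitespace other than U+0020."""
    return [(i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for i, ch in enumerate(text)
            if ch.isspace() and ch != " "]

# Split on tabs only; a bare split() would also split on the NBSP.
record = dict(zip(COLUMNS, row.split("\t")))
print(unusual_whitespace(record["term"]))  # [(11, 'U+00A0', 'NO-BREAK SPACE')]
```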
- Jon Zammit
- Offline
- Posts: 13
3 years 2 months ago #7126
by Jon Zammit
Reply from Jon Zammit on the topic Found issue with space encoding in Description file in RF2 for SNOMED CT release 20210331
Something else to consider is that the non-standard space character may appear in descriptions which have since been updated or inactivated.
So it might be practical to review the SNAPSHOT description file and investigate if the character appears in any active rows.
Note that FULL files contain historical information.
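A review of the Snapshot file along these lines might be sketched in Python as below. This is only an illustration: the function name `active_rows_with_nbsp` is made up, and apart from the row reported earlier in this thread, the sample rows (ids 111 and 333) are invented placeholders, not release content.

```python
import csv
import io

def active_rows_with_nbsp(lines):
    """Yield (id, term) for active description rows whose term contains U+00A0."""
    # RF2 files are tab-delimited with a header row and no quoting.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["active"] == "1" and "\u00a0" in row["term"]:
            yield row["id"], row["term"]

# In-memory stand-in for a Snapshot description file.
sample = (
    "id\teffectiveTime\tactive\tmoduleId\tconceptId\tlanguageCode\ttypeId\tterm\tcaseSignificanceId\n"
    # Row reported in this thread (active, contains NBSP).
    "148561000087114\t20210331\t1\t20621000087109\t28881000087108\ten\t900000000000013009\tAchaetomium\u00a0species\t900000000000017005\n"
    # Invented placeholder: inactive row with NBSP, should be skipped.
    "111\t20200930\t0\t20621000087109\t222\ten\t900000000000013009\tRetired\u00a0term\t900000000000017005\n"
    # Invented placeholder: active row with a regular space, should be skipped.
    "333\t20210331\t1\t20621000087109\t444\ten\t900000000000013009\tClean term\t900000000000017005\n"
)

hits = list(active_rows_with_nbsp(io.StringIO(sample)))
print(len(hits), "active row(s) contain a non-breaking space")
```

For a real file, the same function could be given a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the Snapshot description file lives locally.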
- Linda Parisien
- Offline
- Posts: 437
3 years 2 months ago #7124
by Linda Parisien
Reply from Linda Parisien on the topic Found issue with space encoding in Description file in RF2 for SNOMED CT release 20210331
Hi Debbie,
Thank you for the feedback, this is much appreciated!
If you have concrete issues, please let us know as soon as possible, since the files will be regenerated while we are still investigating the cause of the issue reported by Anibal.
The module ID 11000241103 is "module de la traduction française commune (concept de métadonnées de base)", that is, the common French translation module (core metadata concept).
- Debbie Onos
- Offline
- Posts: 24
3 years 2 months ago #7123
by Debbie Onos
Reply from Debbie Onos on the topic Found issue with space encoding in Description file in RF2 for SNOMED CT release 20210331
Good morning!
Our toolset in Alberta does not seem to have an issue loading the content with hidden characters, but they do seem to affect the term's placement in alphabetic search results, even though I am searching and displaying in English. This seems a bit odd to me, but it definitely isn't causing any load issues.
Our system gets one very consistent error from the CA release.
Canadian concepts register to the Canada Health Infoway English module. The French descriptions seem to be linked to the Canada Health Infoway French module, which causes my toolset to give me an error indicating the component module is not consistent with the concept module.
When the concept is from the SCT Core module, the French descriptions seem to be linked to a module ID that appears without a name, "11000241103"; interestingly, these concepts do NOT cause the same error.
We have also gotten some small validation errors, but I had not reported anything yet as this is a new toolset and I wanted to see if the same issues would arise with the next release load.
Debbie
- Kelly Davison
- Offline
- Posts: 275
3 years 2 months ago #7122
by Kelly Davison
Reply from Kelly Davison on the topic Found issue with space encoding in Description file in RF2 for SNOMED CT release 20210331
Good morning Anibal, and thanks to you and Jon for your posts. I think this forum is a very good place to start, as the community might have people who have run into a similar issue. Jon is correct - the files you have identified belong to the CA Edition, which is not generated by SNOMED International. It is an interesting problem and the first time that it has been raised. We will investigate and provide a response just as soon as we can follow up with our technical folks and have the information we need.
Many thanks, and have a good day.
Kelly Davison, CTSS
Senior Specialist, Standards