Found issue with space encoding in Description file in RF2 for SNOMED CT release 20210331

2 years 7 months ago #7135 by Anibal Jodorcovsky
Guillermo,

I've been trying to clean the Snapshot Description file so that I replace all the <0xa0> with <0x20> and I'm having a heck of a hard time.

I want something simple. I'm running on Mac OS X Big Sur. I tried:

sed 's/\xa0/\x20/g' filetest.txt > clean.txt

where filetest.txt is this: drive.google.com/file/d/1Ve-I9yT99YDTWoVEydRWE4zcJoXj9GKF/view?usp=sharing

It's basically just one line out of the whole SNOMED Description RF2 so that I can do quick tests.

The above sed doesn't work since sed in Mac OS X comes from BSD and this sed doesn't support hex codes! So, I ended up installing gsed (gnu-sed) and that kind of works, but still not completely.

See my comment in this post where I explain what's happening:

www.markhneedham.com/blog/2015/06/11/mac-os-x-gnu-sed-hex-string-replacement-replacing-new-line-characters/

Do you have a simple one-liner that works in Mac OS X Big Sur without having to install anything more complex?
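
One likely reason the byte-level substitution only partly works, assuming the RF2 file is UTF-8 encoded: in UTF-8 the non-breaking space U+00A0 is stored as the two bytes 0xC2 0xA0, so replacing only 0xA0 leaves a stray 0xC2 behind. A minimal Python sketch of the difference, assuming Python 3 is available on the machine (the function name is illustrative, not from any SNOMED tooling):

```python
# UTF-8 encoding of U+00A0 (NO-BREAK SPACE) is two bytes: 0xC2 0xA0.
NBSP_UTF8 = "\u00a0".encode("utf-8")

def replace_nbsp_bytes(data: bytes) -> bytes:
    """Replace every UTF-8 encoded NBSP with a plain ASCII space."""
    return data.replace(NBSP_UTF8, b" ")

sample = "Anaplasma\u00a0marginale".encode("utf-8")

# Replacing only the 0xA0 byte leaves the lead byte 0xC2 in place,
# which is no longer valid UTF-8 on its own.
naive = sample.replace(b"\xa0", b"\x20")

# Replacing the full two-byte sequence gives clean output.
fixed = replace_nbsp_bytes(sample)
```

The same idea as a terminal one-liner, under the same assumptions (it reads the whole file into memory, so it is only a sketch for files that fit in RAM):

python3 -c 'import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().replace(b"\xc2\xa0", b" "))' < filetest.txt > clean.txt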

2 years 7 months ago #7134 by Peter Humphries

greynoso wrote: For example, with a regular space between “5” and “ml” the text “5 ml” could be split automatically into two lines, while with a non-breaking space it would either fit, or both components moved to the next line.


Keeping measurement values and units together is a great example of why we might want to permit the non-breaking space (NBSP), but to force Latin names or multi-word trade names to stay together does not seem like a reasonable use of NBSP -- it is not standard practice and it breaks sorting and counting algorithms. I agree that eliminating the NBSP is the easiest solution. A sophisticated import rule could allow NBSP only immediately after a numeral, replacing all other instances, but there could be other ways of forcing values and units to stay together.
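
The "NBSP only immediately after a numeral" import rule could be sketched with a regular expression. This is a hedged illustration in Python, not an existing import rule: the function name is hypothetical, and "numeral" is assumed to mean the ASCII digits 0-9.

```python
import re

NBSP = "\u00a0"

# Match an NBSP that is NOT immediately preceded by an ASCII digit.
# (Assumption: only 0-9 count as numerals for this rule.)
NBSP_NOT_AFTER_DIGIT = re.compile(r"(?<![0-9])" + NBSP)

def normalize_nbsp(term: str) -> str:
    """Keep NBSP after a digit (e.g. between value and unit); replace all others."""
    return NBSP_NOT_AFTER_DIGIT.sub(" ", term)

kept = normalize_nbsp("5\u00a0ml")                      # NBSP follows "5": kept
replaced = normalize_nbsp("Anaplasma\u00a0marginale")   # NBSP follows "a": replaced
```

A rule like this keeps values and units together while restoring plain spaces everywhere else, at the cost of still leaving some NBSPs for downstream tools to handle.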

2 years 7 months ago #7133 by Guillermo Reynoso
Debbie,

Components like descriptions can be referencing concepts in other modules, by design. For example, French or Spanish descriptions that represent core concepts are published in their own module (they have their own module because they are maintained by another organization, are published independently or with different periodicity, etc.)

If you create your own additional descriptions for International core or Canadian extension concepts in the Alberta extension, they will have the Alberta module id, and that would be correct. While the descriptions file might contain several module ids depending on the component origin (Canadian French, Common French, International Edition, etc.), the relevant descriptions for any implementation or setting are defined in the language reference set.

Thanks for sharing your validation results, they are opportunities for improvement.

2 years 7 months ago #7132 by Anibal Jodorcovsky
Gracias Guillermo.

I'll go ahead and pre-process the RF2 files before importing into my DB. sigh....

2 years 7 months ago #7131 by Guillermo Reynoso
Hello everyone...

As correctly noted by several posts, the characters causing anomalous behaviour for Anibal's toolset were non-breaking spaces.

Non-breaking spaces prevent word processors from inserting an automatic line break in place of a space character. For example, with a regular space between “5” and “ml” the text “5 ml” could be split automatically into two lines, while with a non-breaking space it would either fit, or both components moved to the next line. So from that perspective, there might be a case for using it (for example, between scientific organism names separating genus and species, or in pharmaceutical products) in terminologies and health records. However, the difference is usually not visible to writers, and it is not easy to type it instead of the plain space. It is frequently used in HTML for several reasons beyond this discussion, and as commented before, the likely origins of these characters in SNOMED CT were copy-paste operations from HTML sources.

It seems that some of the sources used by SNOMED for organism naming conventions have the non-breaking space between organism names, as this has happened in the past in the International Edition core (those descriptions were inactivated and replaced with plain spacing, so they are only present in the full description file):

3030901012 20190731 0 900000000000207008 707630002 en 900000000000013009 Ribosomal ribonucleic acid of Anaplasma marginale 900000000000020002

That description has a non-breaking space (in the original file) between Anaplasma and marginale
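
To locate such rows, a small scan over the description file is enough. The sketch below is in Python and assumes the usual tab-delimited RF2 description layout (id, effectiveTime, active, moduleId, conceptId, languageCode, typeId, term, caseSignificanceId); the function name is illustrative, and the sample row is the one quoted above with the NBSP restored.

```python
from typing import Iterable, Iterator, Tuple

NBSP = "\u00a0"
TERM_COLUMN = 7  # zero-based index of "term" in the assumed RF2 description layout

def find_nbsp_terms(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description id, term) for every row whose term contains an NBSP."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) > TERM_COLUMN and NBSP in fields[TERM_COLUMN]:
            yield fields[0], fields[TERM_COLUMN]

row = "\t".join([
    "3030901012", "20190731", "0", "900000000000207008", "707630002",
    "en", "900000000000013009",
    "Ribosomal ribonucleic acid of Anaplasma\u00a0marginale",
    "900000000000020002",
])
hits = list(find_nbsp_terms([row]))
```

With a real file, the same function can be fed an open file object, for example one opened with open(path, encoding="utf-8", newline=""); the header row is skipped naturally because its term column contains no NBSP.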

Recently, there has been active editing of microorganism concepts in the Canadian Edition, and it seems some copy-pasting from authoritative HTML sources has brought in those intrusive characters. While not expected by some tools, they are still valid UTF-8 characters and not prohibited explicitly by SNOMED CT editorial guidelines. However, for the sake of consistency, they should be avoided.

Most users and tools would not notice the difference; it matters mainly where it affects tokenization (the separation of a term into smaller "words" or units, for example for indexing purposes).
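
A small Python illustration of the tokenization effect, offered as a sketch rather than a description of any particular tool: splitting on the plain space character treats the NBSP-joined name as a single token, while splitting on any Unicode whitespace separates it.

```python
term = "Anaplasma\u00a0marginale"

# Splitting only on U+0020 does not see the NBSP as a separator.
tokens_plain_space = term.split(" ")

# str.split() with no argument splits on any Unicode whitespace, NBSP included.
tokens_any_whitespace = term.split()
```

A tool that indexes on the first kind of split would never match a search for "marginale" against this description, which is the kind of anomaly reported in this thread.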

While the Canadian Edition microorganism descriptions can be inactivated in the September 2021 release (and therefore in the snapshot that would become available in the short term), they would stay in the full file, together with those already present in the core. So it would probably be best for tooling to have a way of handling the NBSP character as an acceptable token/word separator, or to process the file, replacing NBSPs with plain spaces, before importing it into the SQL database.
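
The pre-processing option could look roughly like the following Python sketch, which copies a file line by line and replaces each NBSP with a plain space. It assumes UTF-8 input; the function name is hypothetical, and the in-memory streams stand in for real files.

```python
import io
from typing import TextIO

NBSP = "\u00a0"

def copy_replacing_nbsp(src: TextIO, dst: TextIO) -> int:
    """Copy src to dst line by line, replacing NBSP with a plain space.

    Returns the number of NBSP characters replaced.
    """
    replaced = 0
    for line in src:
        replaced += line.count(NBSP)
        dst.write(line.replace(NBSP, " "))
    return replaced

# In-memory stand-ins for the original and cleaned description files.
src = io.StringIO("id\tterm\r\n1\tAnaplasma\u00a0marginale\r\n")
dst = io.StringIO()
count = copy_replacing_nbsp(src, dst)
```

For real files, opening both with open(path, encoding="utf-8", newline="") (write mode for the output) keeps the original line endings untouched, and processing line by line avoids loading the whole description file into memory.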

2 years 7 months ago #7130 by Debbie Onos
Hi Linda,

I sent you an email with the 3 issues our toolset flags, beyond the error message we get for the modules noted below.
Our toolset went live right around the time of the last CA release and some issues were pushed through so that we could get to functioning. I will update Infoway if the next release causes any unforeseen issues or if our tool picks up some errors.

Thank you
Debbie
