Hi all,
Not sure if this is the right place to post this, but given the group description I thought it'd be worth a shot.
I'm trying to automate a whole bunch of tasks that were done by hand previously within our group.
To this end, I’m writing several scripts and SQL queries against an MS Access DB that houses the RF2 SNOMED CT CAD release.
One of our tools is not working as expected when doing comparisons, and after a lot of digging I discovered that the source files within the RF2 release encode the space between words differently in some cases.
The file in question is this:
C:\Users\aniba\Desktop\SnomedCT_Canadian_EditionRelease_PRODUCTION_20210331T120000Z\Full\Terminology\sct2_Description_Full_CanadianEdition_20210331.txt
I wanted to attach a screenshot, taken in Sublime Text, that shows the issue, but I can't find a way to upload an attachment to a topic.
So I uploaded the screenshot to a public Google Drive folder instead; here it is:
drive.google.com/file/d/1XtIPvxFXQNHFPyHdzIlwXRxNsKFQ7lFI/view?usp=sharing
That’s the screen where I’m seeing “hidden” characters in some of the terms.
Notice how the space between words in several terms is encoded as <0xa0> rather than <0x20>, which is what it should be and what all the other terms use.
<0xa0> is the non-breaking space in the extended ASCII character sets (e.g. Latin-1 / Windows-1252), and it should not be used in txt files like this, particularly when we need to be consistent. Either all spaces should be <0xa0>, or all should be <0x20>.
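For anyone who wants to check their own copy, here is a minimal Python sketch that reports which lines contain a non-breaking space. It assumes the file is meant to be UTF-8; the function name `find_nbsp` is made up for illustration. It decodes the text first rather than counting raw 0xa0 bytes, because in UTF-8 that byte also occurs inside ordinary accented letters such as "à" (0xc3 0xa0), which would give false hits on French terms.

```python
NBSP = "\u00a0"    # a properly encoded non-breaking space (bytes C2 A0 in UTF-8)
RAW_A0 = "\udca0"  # what a lone, invalid 0xA0 byte becomes under surrogateescape


def find_nbsp(raw: bytes):
    """Return (line_number, count) for each line containing a non-breaking space.

    Assumption: the data is intended to be UTF-8. Bytes that are not valid
    UTF-8 are kept as surrogate escapes so a stray 0xA0 is still detected.
    """
    text = raw.decode("utf-8", errors="surrogateescape")
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        count = line.count(NBSP) + line.count(RAW_A0)
        if count:
            hits.append((line_no, count))
    return hits
```

To run it against the release file, pass in the bytes from `open(path, "rb").read()`. The Full description file is large, so reading it in one go may take a while.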
This is causing our tools to break: they are unable to compare terms automatically.
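A possible stopgap, sketched in Python, is to normalize the terms before comparing them. The helper names `normalize_spaces` and `same_term` are hypothetical and not part of any existing tool; the sketch assumes the terms have already been read into Python strings, and it only touches the non-breaking space, not other whitespace differences.

```python
def normalize_spaces(term: str) -> str:
    """Replace every non-breaking space (U+00A0) with a regular space (U+0020)."""
    return term.replace("\u00a0", " ")


def same_term(a: str, b: str) -> bool:
    """Compare two terms while ignoring the NBSP vs. regular space difference."""
    return normalize_spaces(a) == normalize_spaces(b)
```

This only hides the inconsistency on the consuming side; the source file itself stays as released.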
Is this something that comes from SNOMED International or is this something that comes from CHI?