Okay, I finally decided to tackle the problem of non-standard characters in the journal data. I put off doing so because of the ginormous headache I knew this would be.
I want to standardize journal titles for several reasons:
- To make them legible
- To make them searchable using a standard keyboard
- To prevent them from “breaking” data with field marks
- To enable the Journal List data prep programs to run without intervention (right now, when they hit a non-standard character, they pause, waiting for me to make some decision before proceeding)
So I downloaded all of the titles in our online catalog data (not just journals) knowing I would find problems. I was thinking the list would serve as a good set of data to troubleshoot from.
I was right.
In case you’re reading this and wondering what the big deal is, consider this: There is not only the Standard ASCII, but also the Extended ASCII set of characters. Since Unicode first came along, there have been at least 8 changes in what is “standard” Unicode. In addition, just in the “current” standard alone, there are 108 Unicode “sets” of characters (see Unicode Tables)!!! And each of these tables may have hundreds of individual characters.
Now, in the good ol’ days, everything was “normalized” into ASCII. Then Extended ASCII came along. 256 characters. Plenty! Also, English was the standard for programming in English-speaking countries. American English in the United States.
In my work, it is apparent that there has been no institutional will to stick to any standard when it comes to record sets and characters used. Hence, our problem.
Our records date back to the late 1970s, when we first started automating. Since then, every imaginable error has been made. Incorrect data entry. Poorly standardized “purchased” records. Overlays of “new” records over old records, even if they do not conform to the previous record’s standards. MARC formatting codes and symbols accidentally (or maybe intentionally… who knows?) used as characters in titles. Non-English character sets imported and used that cannot be typed on a standard QWERTY keyboard (even with ctrl/alt/shift combinations).
This leaves us with a veritable “anything goes” soup of possible problem characters that may crop up at any moment to kill, maim, or mutate not only the record they sit in, but other records around it.
THAT’s my problem. The challenge is to find them and “normalize” them. Since no “normalization standard” exists, I’m trying to come up with one that will work for MY applications. And, sorry world, ASCII is MY standard. It is compatible with any machine that my applications will be used on. So there.
So, I’m running an analysis program that spits out all non-ASCII anomalies for me to make a decision about. Not only am I finding the expected non-English characters, but I’m also finding the most ridiculous “Unicode” characters used in place of standard intelligible characters. Why, for instance, would you put a Unicode dash (ORD 226 128 147, an en dash, or ORD 226 129 187, a superscript minus, both given as UTF-8 byte values) when you could use a good ol’ ASCII dash (chr 45), especially when all of the rest of the characters are ASCII? ’Cause, guess what? That Unicode dash is not interpreted as such by many data processing programs and machines. Just an example.
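For anyone curious what a scan like that might look like, here is a rough Python sketch. It is not the actual analysis program; the function names and the sample titles are made up for illustration. It counts every character above code 127 and prints its code point, its UTF-8 byte values (the “ORD” numbers quoted above), and its official Unicode name.

```python
import unicodedata
from collections import Counter

def find_non_ascii(titles):
    """Count every non-ASCII character seen across an iterable of title strings."""
    counts = Counter()
    for title in titles:
        for ch in title:
            if ord(ch) > 127:          # anything past chr(127) is outside ASCII
                counts[ch] += 1
    return counts

def describe(ch):
    """Return (code point, UTF-8 byte values, Unicode name) for one character."""
    utf8_bytes = " ".join(str(b) for b in ch.encode("utf-8"))
    name = unicodedata.name(ch, "UNKNOWN")
    return f"U+{ord(ch):04X}", utf8_bytes, name

# Hypothetical titles, invented for this example.
sample_titles = [
    "Annals of Surgery",
    "Cancer \u2013 Research and Treatment",   # en dash where chr(45) would do
    "Acta Orthop\u00e6dica",
]

for ch, n in find_non_ascii(sample_titles).most_common():
    print(n, *describe(ch))
```

Sorting by frequency means the most common offenders get a decision first, which is where most of the cleanup payoff tends to be.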
So, even though my programming is saving a lot of time, it is very tedious. My character “replacement” choices are imperfect. Some are obviously wrong (in terms of the base language), but I’m trying to stick to characters that EXIST on a standard keyboard!!
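One way to get a first guess at a replacement, sketched here in Python as an assumption rather than a description of the program in use: decompose the character with the standard library’s `unicodedata.normalize` and keep only the ASCII pieces. The helper name `ascii_fallback` is hypothetical. It also shows why the choices are imperfect: an accented letter simply loses its accent, and some characters have no decomposition at all, so a human still has to decide.

```python
import unicodedata

def ascii_fallback(ch):
    """Best-effort ASCII stand-in for one character, or None if none is found."""
    # NFKD splits a character into a base letter plus combining marks
    # (and expands compatibility forms such as ligatures).
    decomposed = unicodedata.normalize("NFKD", ch)
    kept = "".join(c for c in decomposed if ord(c) < 128)
    return kept or None

print(ascii_fallback("\u00e9"))   # e with acute accent
print(ascii_fallback("\u00f6"))   # o with diaeresis: the accent is just dropped
print(ascii_fallback("\u2013"))   # en dash: nothing to decompose, needs a manual rule
```

A suggestion from this kind of helper is only a starting point; anything it returns still deserves a look before it goes into a replacement table.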
I’ve been at this for over a week with very few breaks.
When I wrap this project up, I hope I’ll be able to use the algorithms in this program as some sort of subroutine for checking entries in the Journal List. Right now, the Journal List prep program stops when it encounters a non-standard character in a title. It then waits for me to add it to a reference file before proceeding. So obviously, I’d like for the Journal List prep program to run independently of my intervention, hence all this work.
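A subroutine along those lines might look something like the Python sketch below. Everything in it is an assumption for illustration: the real reference file’s format isn’t shown here, so a plain dictionary stands in for it, and the `?` placeholder and the names `normalize_title` and `REPLACEMENTS` are invented. The point is the shape: known characters get swapped, unknown ones get collected for later review, and the run never pauses.

```python
def normalize_title(title, replacements, unknown):
    """Return an ASCII version of title; record unmapped characters in `unknown`."""
    out = []
    for ch in title:
        if ord(ch) < 128:
            out.append(ch)                    # already ASCII, keep as-is
        elif ch in replacements:
            out.append(replacements[ch])      # known character, use the stored choice
        else:
            unknown.add(ch)                   # note it for review instead of stopping
            out.append("?")                   # assumed placeholder so the run continues
    return "".join(out)

# Stand-in for the reference file: character -> chosen ASCII replacement.
REPLACEMENTS = {"\u2013": "-", "\u207b": "-", "\u00e9": "e"}

unknown = set()
titles = ["Revue m\u00e9dicale \u2013 Li\u00e8ge"]   # hypothetical title
cleaned = [normalize_title(t, REPLACEMENTS, unknown) for t in titles]

print(cleaned)
print(sorted(unknown))   # characters still waiting for a decision
```

After a batch finishes, the contents of `unknown` can be reviewed in one sitting and added to the replacement table, rather than interrupting the run once per character.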
I will also try to report back how many titles are in our online catalog that cannot be found with a standard keyboard search.
Until then…
eof