First a definition:
Qwerty: 1. Made up of characters that one may type using a standard QWERTY keyboard. 2. Made up of the printable ASCII characters, codes 32 through 126, that represent language/expression characters (this excludes machine commands such as Delete, Backspace, Escape, etc.).
I may elaborate on this definition later, but it works well enough for the purposes of this discussion.
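As a rough illustration, the second sense of the definition boils down to a simple range test. The function name `is_qwerty` below is made up for this sketch; it is not taken from the Journal List Updater programs.

```python
def is_qwerty(text: str) -> bool:
    """Return True if every character is printable ASCII (codes 32 to 126).

    Hypothetical helper: control characters such as Backspace (8),
    Escape (27), and Delete (127) fall outside the range and fail the test,
    as does anything beyond plain ASCII.
    """
    return all(32 <= ord(ch) <= 126 for ch in text)


if __name__ == "__main__":
    for sample in ["Journal of Irreproducible Results", "Café Society", "two\nlines"]:
        print(repr(sample), is_qwerty(sample))
```

Note that an empty string passes the test, since it contains no offending characters.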
Okay.
Well, this Unicode thing is a bit tougher than expected. I’ve been able to account for and convert all of the Unicode (that is, non-Qwerty) characters within our title list (from our bib records), but the SFX/Aggregator titles are a whole different matter.
For one, some have no Western or Arabic characters at all. That is, they are completely Chinese, Thai, Vietnamese, etc., with no alternate titles. This means that there cannot be any equivalent, or even close, character substitutions. That leaves only the previous method of transliteration: using the old “agg_weird.dat” file that I’ve built over time.
This is frustrating because these titles do not display as Qwerty titles. Some of them do not even have an ISSN or ISBN! This means I must do some research each time one comes up and either copy or invent some Qwerty version of the title.
In short, 100% conversion of every title that may possibly exist (now or in the future) is not possible using any approach that is now within my ability or resources (including time).
So-o-o, I’ve decided to develop a “hybrid” conversion process.
Here’s how it will work (I hope).
It starts when one of the Journal List Updater programs encounters a special character within a string.
- First, an attempt is made to “normalize” the string to a Qwerty form using a series of algorithms. If this works, all is well and the program proceeds with its other tasks, skipping the steps below.
- Second, if the above fails, the program will further examine the string to determine whether the presence of the special character might cause a break in the sequence of data (big danger!).
- >>> If a potential data break (such as a line break) IS detected, the program will SKIP that record, RECORD the suspect data in a separate file to safely isolate it from other records, then SEND a notice to the Administrator (me) that a problem was encountered.
- Third, if a potential data break IS NOT detected, the “agg_weird.dat” file is consulted using a key from the problem data (such as an ISSN, ISBN, or Object_ID number).
- >>> If a match IS found, a substitution will take place based on the record within the file. This is more or less a “dictionary” lookup.
- >>> If a match IS NOT found, the entry will be recorded as a problem and a notice about it will be sent to the Administrator.
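The steps above could be sketched in Python roughly as follows. This is only an illustration under several assumptions, not the actual Journal List Updater code: the “series of algorithms” is stood in for by Unicode NFKD decomposition with accents stripped, the contents of “agg_weird.dat” are represented by a plain dictionary keyed on ISSN/ISBN/Object_ID (the real file format isn't described here), the set of “data break” characters is a guess, and the notice to the Administrator is reduced to appending to a list. All function and variable names are hypothetical.

```python
import unicodedata

# Assumed set of characters that could split a record across lines.
BREAK_CHARS = {"\n", "\r", "\x0b", "\x0c", "\x85", "\u2028", "\u2029"}


def is_qwerty(text):
    """True if every character is printable ASCII (codes 32 to 126)."""
    return all(32 <= ord(ch) <= 126 for ch in text)


def normalize_to_qwerty(title):
    """Try to reduce a title to Qwerty form; return None if that fails.

    Stand-in for the real normalization algorithms: decompose accented
    letters (NFKD) and drop the combining marks.
    """
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped if is_qwerty(stripped) else None


def convert_title(title, key, weird_table, problems):
    """Run the hybrid process; return a (status, qwerty_title_or_None) pair.

    weird_table: dict standing in for agg_weird.dat, keyed by ISSN, ISBN,
                 or Object_ID.
    problems:    list collecting records the Administrator must review.
    """
    if is_qwerty(title):
        return "ok", title

    # First: attempt normalization.
    normalized = normalize_to_qwerty(title)
    if normalized is not None:
        return "normalized", normalized

    # Second: check for a potential data break; skip and isolate the record.
    if any(ch in BREAK_CHARS for ch in title):
        problems.append(("data break", key, title))
        return "skipped", None

    # Third: dictionary lookup by key.
    if key in weird_table:
        return "substituted", weird_table[key]

    problems.append(("no match", key, title))
    return "unresolved", None


if __name__ == "__main__":
    table = {"1234-5678": "Zhongwen Qikan"}  # made-up entry for illustration
    problems = []
    print(convert_title("Revista de Biolog\u00eda", "0000-0001", table, problems))
    print(convert_title("\u4e2d\u6587\u671f\u520a", "1234-5678", table, problems))
    print(convert_title("\u4e2d\u6587\u671f\u520a", "9999-9999", table, problems))
    print(convert_title("Bad\u2028Title", "0000-0002", table, problems))
    print(problems)
```

One limitation of this stand-in: letters such as “ø” or “ß” have no Unicode decomposition, so they are not normalized here and the title falls through to the lookup step. A fuller version would need its own substitution rules for those.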
I think that, while inelegant, this scheme will work. It’s the one I’ll be going with, anyway. At least this method will permit the data processing programs to proceed (right now, everything halts until I can intervene) without “supervision.” So if this works, I’ll be satisfied.
I suppose.