Where angels fear to tread

Roger Costello's excellent XML Schema Tutorial includes a detailed breakdown of the ISBN. I've excerpted the documentation (along with Roger's GPL) here. The example also includes a complete ISBN schema, which involves a huge pile of regular expressions. The hyphens, which most book-related Web services ignore, are meant to carve up the address space in a very TCP/IP-like way:


The format of an ISBN is:
1 -- it is always 10 characters long
2 -- it's broken into 4 parts, and these four parts   
     always appear separated with hyphens or spaces.
3 -- the four parts are:
     - group/country identifier
     - publisher identifier
     - number assigned to a specific title in one format 
       (formally called the title identifier)
     - a check digit
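
A minimal Python sketch of that four-part structure. The function names are made up for illustration. The check-digit rule used here (weights 10 down to 1, sum divisible by 11, with X standing for 10) is the standard ISBN-10 rule and is not spelled out in the excerpt above.

```python
import re


def split_isbn10(isbn: str) -> tuple[str, str, str, str]:
    """Split a hyphen- or space-separated ISBN-10 into its four parts."""
    parts = re.split(r"[-\s]+", isbn.strip())
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts, got {len(parts)}: {isbn!r}")
    if sum(len(p) for p in parts) != 10 or len(parts[3]) != 1:
        raise ValueError(f"not a 10-character ISBN: {isbn!r}")
    group, publisher, title, check = parts
    return group, publisher, title, check


def has_valid_check_digit(isbn: str) -> bool:
    """Apply the standard ISBN-10 weighted-sum test (assumption: not from the excerpt)."""
    chars = re.sub(r"[-\s]", "", isbn).upper()
    if not re.fullmatch(r"[0-9]{9}[0-9X]", chars):
        return False
    # Weights run 10, 9, ..., 1 from left to right; X counts as 10.
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(chars))
    return total % 11 == 0
```

For example, `split_isbn10("0-306-40615-2")` yields the group `0`, the publisher `306`, the title identifier `40615`, and the check digit `2`.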

For English-speaking countries, part one is 0 or 1, and the length of the publisher ID varies like so:



Country Publisher ID  If number ranges Insert hyphen Block
                      are between:     after the:    Size
----------------------------------------------------------------
0       00.......19         00-19      3rd digit     1,000,000
0       200......699        20-69      4th digit     100,000 
0       7000.....8499       70-84      5th digit     10,000 
0       85000....89999      85-89      6th digit     1,000 
0       900000...949999     90-94      7th digit     100 
0       9500000..9999999    95-99      8th digit     10 
1       55000....86979      5500-8697  6th digit     1,000 
1       869800...998999     8698-9989  7th digit     100
1       9990000..9999999    9990-9999  8th digit     10 
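
To see how the table lets the hyphens be recovered from a bare 10-character ISBN, here is a rough Python sketch. `RANGES` and `hyphenate` are hypothetical names, and the sketch covers only the group 0 and group 1 rows shown above, so anything outside those ranges is rejected.

```python
# (low, high, publisher_length) per row of the table. For group 0 the range is
# tested against the 2 digits after the group digit; for group 1, against 4.
RANGES = {
    "0": (2, [(0, 19, 2), (20, 69, 3), (70, 84, 4),
              (85, 89, 5), (90, 94, 6), (95, 99, 7)]),
    "1": (4, [(5500, 8697, 5), (8698, 9989, 6), (9990, 9999, 7)]),
}


def hyphenate(isbn: str) -> str:
    """Re-insert hyphens into an unhyphenated ISBN-10 for groups 0 and 1."""
    if len(isbn) != 10 or not isbn[:9].isdigit():
        raise ValueError(f"not a bare 10-character ISBN: {isbn!r}")
    group = isbn[0]
    if group not in RANGES:
        raise ValueError(f"group {group!r} is not covered by this table")
    prefix_len, rows = RANGES[group]
    prefix = int(isbn[1:1 + prefix_len])
    for low, high, publisher_len in rows:
        if low <= prefix <= high:
            end = 1 + publisher_len  # index just past the publisher ID
            return "-".join([group, isbn[1:end], isbn[end:9], isbn[9]])
    raise ValueError(f"no publisher range in the table covers {isbn!r}")
```

With these rows, `hyphenate("0306406152")` gives `0-306-40615-2` (hyphen after the 4th digit) and `hyphenate("1565922840")` gives `1-56592-284-0` (hyphen after the 6th digit).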


Costello's complete ISBN schema runs to about 180K, all stuff like this:



<xsd:pattern value="951\s\d([0-9]|\s){5}\d\s[0-9x]">
    <xsd:annotation>
        <xsd:documentation>
            group/country ID = 951 (space after the 3rd digit)
            Country = Finland
            check digit is 0-9 or 'x'
        </xsd:documentation>
    </xsd:annotation>
</xsd:pattern>
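
As a quick way to see what one of these patterns accepts, the Finland pattern can be tried as a Python regular expression. This is only an approximation: XSD patterns must match the whole value, so `re.fullmatch` stands in for that, and Python's `\s` and `\d` are slightly broader than their XSD counterparts. The sample strings are made up and are not checked as real ISBNs.

```python
import re

# The pattern value copied from the schema fragment above.
FINLAND = re.compile(r"951\s\d([0-9]|\s){5}\d\s[0-9x]")


def matches_finland_pattern(value: str) -> bool:
    """True if the whole string fits the space-separated Finland pattern."""
    return FINLAND.fullmatch(value) is not None


print(matches_finland_pattern("951 0 11369 7"))  # space-separated form
print(matches_finland_pattern("951-0-11369-7"))  # hyphens are not covered by this pattern
```

Note that this one pattern handles only the space-separated form and a lowercase check character, which hints at why the full schema needs so many variants.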


Fascinating, but formidable. The inventors of this scheme must have been chagrined to see Amazon and the rest of the book sites discard this carefully designed information architecture. Can't blame them, though. 180K of regular expressions is a lot of overhead. And even if the hyphens were preserved, there would still be a big problem: a fragmented address space in need of some means of coalescence.

Lorcan Dempsey, who is VP for research at OCLC, wrote to let me know that there is an initiative to achieve that coalescence. From the abstract:

OCLC is investigating how best to implement IFLA's Functional Requirements for Bibliographic Records (FRBR). As part of that work, we have undertaken a series of experiments with algorithms to group existing bibliographic records into works and expressions. Working with both subsets of records and the whole WorldCat database, the algorithm we developed achieved reasonable success identifying all manifestations of a work.

Cool!


Former URL: http://weblog.infoworld.com/udell/2003/01/07.html#a567