Roger Costello's excellent XML Schema Tutorial includes a detailed breakdown of the ISBN. I've excerpted the documentation (along with Roger's GPL) here. The example also includes a complete ISBN schema, which involves a huge pile of regular expressions. The hyphens, which most book-related Web services ignore, are meant to carve up the address space in a very TCP/IP-like way:
The format of an ISBN is:
1 -- it is always 10 characters long
2 -- it's broken into 4 parts, and these four parts always appear separated with hyphens or spaces.
3 -- the four parts are:
     - group/country identifier
     - publisher identifier
     - number assigned to a specific title in one format (formally called the title identifier)
     - a check digit
For English-speaking countries, part one is 0 or 1, and the publisher ID varies in length, like so:
Country  Publisher ID        If number ranges  Insert hyphen  Block Size
                             are between:      after the:
-------------------------------------------------------------------------
0        00.......19         00-19             3rd digit      1,000,000
0        200......699        20-69             4th digit      100,000
0        7000.....8499       70-84             5th digit      10,000
0        85000....89999      85-89             6th digit      1,000
0        900000...949999     90-94             7th digit      100
0        9500000..9999999    95-99             8th digit      10
1        55000....86979      5500-8697         6th digit      1,000
1        869800...998999     8698-9989         7th digit      100
1        9990000..9999999    9990-9999         8th digit      10
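To make the table concrete, here is a minimal Python sketch of how it could drive hyphenation of a bare 10-character ISBN. The names `RANGES`, `hyphenate`, and `block_size` are mine, not Costello's, and the sketch covers only the rows shown above: anything outside those ranges (including other group identifiers) returns `None`, and the check digit is not verified.

```python
# Rows transcribed from the table above: (group, low, high, publisher_length).
# low/high are compared against the leading digits of the publisher portion.
RANGES = [
    ("0", "00", "19", 2),
    ("0", "20", "69", 3),
    ("0", "70", "84", 4),
    ("0", "85", "89", 5),
    ("0", "90", "94", 6),
    ("0", "95", "99", 7),
    ("1", "5500", "8697", 5),
    ("1", "8698", "9989", 6),
    ("1", "9990", "9999", 7),
]


def hyphenate(isbn):
    """Return the ISBN split as group-publisher-title-check, or None
    if it is malformed or falls outside the ranges in the table."""
    raw = isbn.replace("-", "").replace(" ", "").upper()
    if len(raw) != 10 or not raw[:9].isdigit():
        return None
    if not (raw[9].isdigit() or raw[9] == "X"):
        return None
    group, rest, check = raw[0], raw[1:9], raw[9]
    for g, low, high, pub_len in RANGES:
        # Equal-length digit strings compare correctly as plain strings.
        if g == group and low <= rest[:len(low)] <= high:
            return "-".join([group, rest[:pub_len], rest[pub_len:], check])
    return None


def block_size(pub_len):
    """Titles available to a publisher: the group takes 1 digit and the
    check digit 1, leaving 8 digits shared by publisher and title."""
    return 10 ** (8 - pub_len)
```

For instance, `hyphenate("0596002815")` lands in the 20-69 row, so the hyphen goes after the 4th digit and the result is `0-596-00281-5`; a three-digit publisher ID leaves five title digits, so `block_size(3)` is 100,000, matching the table.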
Costello's complete ISBN schema runs to about 180K, all stuff like this:
<xsd:pattern value="951\s\d([0-9]|\s){5}\d\s[0-9x]">
  <xsd:annotation>
    <xsd:documentation>
      group/country ID = 951 (space after the 3rd digit)
      Country = Finland
      check digit is 0-9 or 'x'
    </xsd:documentation>
  </xsd:annotation>
</xsd:pattern>
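To see what one of these patterns actually accepts, here is a small Python sketch that runs the Finnish pattern above through the standard `re` module. XML Schema patterns are implicitly anchored to the whole value, so I'm assuming `re.fullmatch` is the closest equivalent; the helper name `matches_finland_spaced` and the sample strings are made up for illustration.

```python
import re

# The pattern value copied verbatim from the schema fragment above.
FINLAND_SPACED = re.compile(r"951\s\d([0-9]|\s){5}\d\s[0-9x]")


def matches_finland_spaced(value):
    """True if the whole string matches the space-separated Finnish pattern."""
    # fullmatch mimics XML Schema's implicit anchoring at both ends.
    return FINLAND_SPACED.fullmatch(value) is not None


print(matches_finland_spaced("951 0 11369 7"))   # space-separated, group 951
print(matches_finland_spaced("951-0-11369-7"))   # hyphens are not covered here
print(matches_finland_spaced("952 0 11369 7"))   # different group identifier
```

Only the first string matches. This one pattern handles the space-separated form for a single group identifier, which hints at how the full schema gets so large.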
Fascinating, but formidable. The inventors of this scheme must have been chagrined to see Amazon and the rest of the book sites discard this carefully designed information architecture. Can't blame them, though. 180K of regular expressions is a lot of overhead. And even if the hyphens were preserved, there would still be a big problem: a fragmented address space in need of some means of coalescence.
Lorcan Dempsey, who is VP for research at OCLC, wrote to let me know that there is an initiative to achieve that coalescence. From the abstract:
OCLC is investigating how best to implement IFLA's Functional Requirements for Bibliographic Records (FRBR). As part of that work, we have undertaken a series of experiments with algorithms to group existing bibliographic records into works and expressions. Working with both subsets of records and the whole WorldCat database, the algorithm we developed achieved reasonable success identifying all manifestations of a work.
Cool!
Former URL: http://weblog.infoworld.com/udell/2003/01/07.html#a567