I'm Dan Scott: barista, library geek, and free-as-in-freedom software developer.
I hack on projects such as the Evergreen open-source ILS project and PEAR's File_MARC package.
In June of 2009 I was moaning about how Evergreen, by default, has no identifier index for limiting searches by ISBN / ISSN / LCCN / OCLCnum and that if [fixing this problem] requires work from me, it will probably be 2010 before any of it happens. Due to some of the tools our consortium relies on, we really needed a solution for identifier searches in Z39.50 that was better than just a general keyword search: we were returning too many false positives, which caused extra work and frustration for everyone.
Well, here it is, 2010, and as of today Conifer's Evergreen server now has a very handy identifier index. Most of the required pieces were already there, in one form or another, but they all needed to be brought together. This blog post is going to try to do that (and serve as documentation for my ever-decaying brain, too). At the time of this post, we're running a 1.6.0.4-ish Evergreen system; you'll need to be running 1.6.0.4 to get ISSN searching to work properly, too.
First, we need to create the identifier index. Evergreen comes with the following indexes out of the box:
author
title
series
subject
keyword
Pretty standard. With the exception of keyword, each of these indexes is composed of more granular indexes. For example, the title index is composed of the following specific indexes, each listed with the XML format that the MARCXML is converted to, followed by the XPath expression that extracts the text from that format:
abbreviated - MODS32 - //mods32:mods/mods32:titleInfo[mods32:title and (@type='abbreviated')]
translated - MODS32 - //mods32:mods/mods32:titleInfo[mods32:title and (@type='translated')]
alternative - MODS32 - //mods32:mods/mods32:titleInfo[mods32:title and (@type='alternative')]
uniform - MODS32 - //mods32:mods/mods32:titleInfo[mods32:title and (@type='uniform')]
proper - MODS32 - //mods32:mods/mods32:titleInfo[mods32:title and (@type='proper')]
Aside: You can search against these more granular indexes in the Evergreen OPAC, by the way, by appending the granular index name to the index class name with a | as a delimiter. For example, a search query of title|uniform: canada will search only the uniform titles for the term "canada". Okay, sorry for that detour, but I bet you weren't aware of that - we haven't done a good job of exposing some of the magic that has been there for a long time in Evergreen in the OPAC interface.
Back to understanding the configuration - as you can see above, the conversion to MODS does the heavy lifting in pulling out the fields of interest to us from the MARCXML. The full set of indexed fields and their definitions is visible in the database via the query:
SELECT * FROM config.metabib_field;
For our purposes, we're interested in pulling the raw subfield a values of the 010 (LCCN), 020 (ISBN), and 022 (ISSN) fields directly from the MARCXML source. Our first step is to add an entry to the config.metabib_field table defining our new index. We'll create a new granular index under the "keyword" index class and call it "identifier", because that's what it is, right? That's as easy as:
INSERT INTO config.metabib_field (field_class, name, xpath, weight, format, search_field, facet_field)
VALUES ('keyword', 'identifier',
'//marcxml:datafield[@tag="010" or @tag="020" or @tag="022"]/marcxml:subfield[@code="a"]',
1, 'marcxml', true, false
);
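To get a feel for what that XPath expression selects, here is a minimal Python sketch that runs the same selection outside Evergreen. The sample record and the extract_identifiers helper are made up for illustration. Python's ElementTree only supports a subset of XPath (no "or" in predicates), so the sketch queries each tag in turn, which means values come back grouped by tag rather than in document order.

```python
import xml.etree.ElementTree as ET

# Namespace prefix mapping for MARCXML ("MARC21 slim").
MARC_NS = {"marcxml": "http://www.loc.gov/MARC21/slim"}

# Made-up sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="010" ind1=" " ind2=" ">
    <subfield code="a">2009123456</subfield>
  </datafield>
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">0596000278</subfield>
    <subfield code="c">$39.95</subfield>
  </datafield>
  <datafield tag="022" ind1=" " ind2=" ">
    <subfield code="a">2434-561X</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title</subfield>
  </datafield>
</record>"""


def extract_identifiers(marcxml):
    """Return the subfield a values of the 010, 020, and 022 fields."""
    root = ET.fromstring(marcxml)
    values = []
    # ElementTree has no "or" in predicates, so look up one tag at a time.
    for tag in ("010", "020", "022"):
        path = f".//marcxml:datafield[@tag='{tag}']/marcxml:subfield[@code='a']"
        values.extend(node.text for node in root.findall(path, MARC_NS))
    return values


print(extract_identifiers(SAMPLE))
```

Note that the 020 subfield c and the 245 title are ignored; only the identifier values make it through.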
Next, we need to restart the open-ils.storage and open-ils.ingest services to make them aware of this new entry. Go ahead, I'll wait while you run osrf_ctl.sh -a restart_perl or use opensrf-perl.pl to restart the services individually. Done? Good.
We have to make up for lost time, now, as none of the bibliographic records in your system had this definition in place when they were first ingested. The easiest thing to do is to just pull the pertinent data directly from the metabib.full_rec view (which is a shredded version of the source MARCXML from your bibliographic records, with one tag/subfield value per row). Ergo:
-- Get the ID from the row that you just inserted for the new index;
-- we'll use this in the INSERT statement
SELECT id
FROM config.metabib_field
WHERE field_class = 'keyword' AND name = 'identifier'
;
-- Let's say the ID was 18; we'll use that to identify the index in the SELECT statement
INSERT INTO metabib.keyword_field_entry (field, source, value)
SELECT 18, record, agg_text(value)
FROM metabib.full_rec
WHERE tag IN ('010', '020', '022')
AND subfield = 'a'
GROUP BY 1, 2
;
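If the GROUP BY in that backfill statement is hard to picture, here is a rough Python sketch of the same reshaping: filter the shredded rows down to the identifier subfields, then collapse them into one entry per record. The rows and the backfill_entries helper are hypothetical, and I'm assuming that agg_text joins the grouped values with a single space.

```python
from collections import defaultdict

# Hypothetical rows shaped like (record, tag, subfield, value),
# loosely mirroring what metabib.full_rec exposes.
ROWS = [
    (101, "020", "a", "0596000278"),
    (101, "020", "c", "$39.95"),
    (101, "245", "a", "an example title"),
    (102, "022", "a", "2434-561X"),
    (101, "010", "a", "2009123456"),
]


def backfill_entries(rows, field_id):
    """Build (field, source, value) tuples, one per bibliographic record."""
    grouped = defaultdict(list)
    for record, tag, subfield, value in rows:
        # Same filter as the WHERE clause: identifier tags, subfield a only.
        if tag in ("010", "020", "022") and subfield == "a":
            grouped[record].append(value)
    # Assumption: agg_text concatenates the values with a single space.
    return [
        (field_id, record, " ".join(values))
        for record, values in sorted(grouped.items())
    ]


for entry in backfill_entries(ROWS, 18):
    print(entry)
```

Each record ends up with a single keyword entry holding all of its identifiers, which is what lets one keyword|identifier search match any of them.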
All right! Now you can run some test searches in the OPAC for ISSNs, ISBNs, and LCCNs using the keyword|identifier: some_identifier prefix. Cool. So that's part one, mostly lifted from the "magic spell" in the Evergreen wiki.
Part two is configuring SRU to use the new identifier index. The bulk of the Evergreen SRU implementation is contained in the Perl module OpenILS::WWW::SuperCat (located in your install directory at /openils/lib/perl5/OpenILS/WWW/SuperCat.pm). Get out your patch tool or open up the Perl module in a text editor; we're going to make a few changes. The pertinent diff follows:
Essentially, we've defined a new qualifier (eg.identifier) and pointed it and the dc.identifier indexes at the new, more specific keyword|identifier index. Once the updated file is in place, reload your Apache configuration (/etc/init.d/apache reload) and SRU requests using those qualifiers will now point at the identifier index. FABULOUS.
Our last step is to teach our simple2zoom-based Z39.50 configuration about the new index by mapping the corresponding BIB-1 attributes to the new eg.identifier qualifier, like so:
Kill your simple2zoom processes and restart simple2zoom and you should be in heaven - farewell, false positive matches! Oh, and about that SFX target parser for Evergreen; now you can remove all of the gimmickry around exact searches and worrying about ISSNs that contain an 'X' and just point at the identifier index. For example:
if (defined($ISSN)) {
    $searchString .= "keyword|identifier: $ISSN";
}
elsif (defined($ISBN)) {
    $ISBN =~ s/-//g; # Most of our ISBNs are normalized to no hyphens
    $searchString .= "keyword|identifier: $ISBN";
}
Things still aren't perfect in Evergreen identifier-land: we still need to do some work to normalize hyphenation of our ISBNs, for example, and ensure we have 10-digit & 13-digit ISBN equivalents. But we're a lot closer to perfection now - and with the work that Mike Rylander is doing in trunk, normalization of that kind should be relatively straightforward to implement on both the indexing and query-parsing side.
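To make the normalization problem concrete, here is a small Python sketch of the two steps mentioned above: stripping hyphens, and deriving the 13-digit equivalent of a 10-digit ISBN using the standard 978 prefix and the ISBN-13 check digit rule. This is not Evergreen code, and the function names are my own invention for the example.

```python
def normalize_isbn(raw):
    """Strip hyphens and spaces, and upper-case a trailing 'x' check digit."""
    return raw.replace("-", "").replace(" ", "").upper()


def isbn10_to_isbn13(isbn10):
    """Return the 13-digit equivalent of a 10-digit ISBN."""
    digits = normalize_isbn(isbn10)
    if len(digits) != 10 or not digits[:9].isdigit():
        raise ValueError(f"not a 10-digit ISBN: {isbn10!r}")
    # Drop the old check digit and prepend the 978 "Bookland" prefix.
    core = "978" + digits[:9]
    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ... from the left.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


print(normalize_isbn("0-596-00027-8"))
print(isbn10_to_isbn13("0-596-00027-8"))
```

Indexing both forms of each ISBN (or converting on the query side) would let a search for either form find the record.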
UPDATE 2010-03-05 I just backported Warren's patch for sorting Z39.50 servers to rel_1_6_0 (it counts as a bug fix), so expect to see it in the Evergreen 1.6.0.4 release. Yay!
In Evergreen 1.6, Z39.50 target server configuration (for copy-cataloguing targets) moves into the database. This makes it pretty easy for sites to share their Z39.50 target servers with one another.
I recently added a number of target servers to our configuration, and thought that other academic Evergreen sites might be interested in our set (because we're primarily pointing at other academic libraries) - particularly if they haven't added many of their own yet. You can find a PostgreSQL dump of our current configuration in the ILS-Contrib repository at conifer/branches/rel_1_6_0/tools/config/config_z3950.sql.
I generated this dump of the data using the following command:
(where evergreen is the name of the Evergreen database, naturally!). You should be able to load the data into a clean Evergreen database via psql inside a transaction as follows:
BEGIN;
\i config_z3950.sql
COMMIT;
If you already have other Z39.50 servers in your database configuration, you might need to adjust the ID values in the config.z3950_attr rows. Just prepending a 1 to them ought to do the trick, unless you have masses of Z39.50 servers. In which case, you probably don't need ours!
Oh, one final tip: when you start adding a bunch of Z39.50 target servers, you'll notice that the order in the Import from Z39.50 screen is random; it will drive your cataloguers crazy. Quite some time ago, Warren Layton from Natural Resources Canada submitted a patch for sorting the servers alphabetically that has been committed to trunk and the 1.6 branch, but which hasn't made its way into a 1.6.0 release yet. If, at the time you're reading this, you're on a 1.6 release but your list isn't sorted, get the file and drop it into /openils/var/web/xul/server/cat/z3950.js - your cataloguers will thank you. You, in turn, can thank Warren.
Let's pretend your national library asked you to submit a set of records with holdings representing all of the various formats in your library system. Let's also pretend that you're really lucky and you're running Evergreen. Here's what you would do to get one example of each combination of item type, item form, bibliographic level, literary form, cataloguing form, and video recording format into a scratch table for a given library (ID = 103) in your system:
CREATE TABLE scratchpad.osul_export (record BIGINT);
INSERT INTO scratchpad.osul_export
    SELECT record FROM (
        SELECT DISTINCT ON (mrd.item_type, mrd.item_form, mrd.bib_level, mrd.lit_form, mrd.cat_form, mrd.vr_format)
            mrd.record, mrd.item_type, mrd.item_form, mrd.bib_level, mrd.lit_form, mrd.cat_form, mrd.vr_format
        FROM biblio.record_entry bre
            INNER JOIN asset.call_number acn ON acn.record = bre.id
            INNER JOIN asset.copy ac ON ac.call_number = acn.id
            INNER JOIN metabib.rec_descriptor mrd ON mrd.record = bre.id
        WHERE bre.deleted IS FALSE AND acn.deleted IS FALSE AND ac.deleted IS FALSE AND acn.owning_lib = 103
        ORDER BY mrd.item_type, mrd.item_form, mrd.bib_level, mrd.lit_form, mrd.cat_form, mrd.vr_format
    ) AS formats
    ORDER BY record;
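The DISTINCT ON clause is doing the interesting work in that query: it keeps exactly one row per combination of the six descriptor columns. A rough Python equivalent, using made-up descriptor rows, looks like the sketch below. One difference worth noting: the SQL only orders by the DISTINCT ON columns, so which record survives for each combination is not guaranteed, whereas this sketch simply keeps the first one it sees.

```python
# Hypothetical rows shaped like metabib.rec_descriptor:
# (record, item_type, item_form, bib_level, lit_form, cat_form, vr_format)
# The coded values are made up for the example.
DESCRIPTORS = [
    (501, "a", " ", "m", "0", "a", " "),
    (502, "a", " ", "m", "0", "a", " "),  # same combination as record 501
    (503, "g", " ", "m", " ", "a", "v"),
    (504, "a", "s", "s", " ", "a", " "),
]


def one_record_per_combination(descriptors):
    """Keep the first record seen for each distinct descriptor combination."""
    chosen = {}
    for record, *combination in descriptors:
        # setdefault only stores the record if the combination is new.
        chosen.setdefault(tuple(combination), record)
    return sorted(chosen.values())


print(one_record_per_combination(DESCRIPTORS))
```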
And then, because you were asked to provide a total of 2000 records for this representative sample, you might fill up the remaining 1800 records as follows:
INSERT INTO scratchpad.osul_export
    SELECT bre.id
    FROM biblio.record_entry bre
        INNER JOIN asset.call_number acn ON acn.record = bre.id
        INNER JOIN asset.copy ac ON ac.call_number = acn.id
        INNER JOIN reporter.super_simple_record rsr ON rsr.id = bre.id
    WHERE bre.deleted IS FALSE AND acn.deleted IS FALSE AND ac.deleted IS FALSE AND acn.owning_lib = 103
        AND bre.id NOT IN (
            SELECT record
            FROM scratchpad.osul_export
        ) AND substring(bre.id::text from (length(bre.id::text)) for 1)::int = 8
        AND bre.id % 17 = 0
    ORDER BY rsr.author DESC
    LIMIT 1800;
... which, of course, gives you the records with a record ID ending in '8' and (to whittle it down further) records where record ID modulo 17 is 0 - and finally, just the first 1800 records ordered by author name in descending order.
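If you're curious how aggressive that pair of filters is, here is a quick Python sketch of the same test applied to a plain range of integers (the in_sample name is just for illustration). A record ID has to end in 8 and be divisible by 17, which works out to one ID in every 170.

```python
def in_sample(record_id):
    """Mirror the two SQL filters: last digit is 8, and divisible by 17."""
    ends_in_8 = str(record_id)[-1] == "8"
    return ends_in_8 and record_id % 17 == 0


matches = [record_id for record_id in range(1, 500) if in_sample(record_id)]
print(matches)  # [68, 238, 408]
```

So for the LIMIT 1800 to actually be reached, the library needs a healthy number of qualifying records; with a smaller collection you would loosen one of the two conditions.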
All of this will give you 2000 record IDs in scratchpad.osul_export that you can then extract into a text file and feed into Evergreen's Open-ILS/src/support-scripts/marc_export script to dump the MARC records with holdings in the 852 field from your system. Beautiful, eh?
To summarize the results of the Evergreen developer workshop at the OLA SuperConference, I think things went pretty well. The primary focus this time was on the nuts and bolts of building a minimal OpenSRF service and I saw the lights go on in a number of faces as I broke it down. Things got a little hand-wavy in the final half-hour when I leapt into the Dojo JavaScript widgets that have been custom-built for Evergreen interfaces such as the administration and acquisitions functionality. In retrospect, each half of the session deserves its own half-day, so something had to give this time around.
I focused on getting hands-on, and for the most part I think it was a success, although even with a packaged virtual image we still ran into some problems getting it running on some laptops. And due to some communications problems, about half of the participants weren't ready for a hands-on session (read: no laptop, or a netbook that couldn't handle a virtual image). I have real hopes that we'll see some contributions in the next few months from some of the participants, which would be a huge win for Evergreen.