Tuesday, November 17. 2009Doing useful things with periodical holdings, part 2: comparing with print holdings in EvergreenDoing interesting things with Evergreen serials dataUpdate: 2010-05-31 Running through the process again, I found a few typos in the pg_dump commands, so I fixed those up. I'm working on a project to compare our electronic journal holdings with our print journal holdings. This is probably a task that most academic libraries have been working on over the past few years, as collection space dwindles, the duplication of holdings in electronic and print formats increases, and electronic delivery and 24/7 access becomes the default expectation of our patrons. In my previous post, I worked through the hoops required to get our SFX holdings into a usable database for query purposes. In this post, I'll walk through the steps required to get the serials holdings from Evergreen into the same database so that we can generate reports based on the authoritative sources for both our electronic and print holdings. We'll start by dumping the schema for the biblio.record_entry and serial.record_entry tables from our Evergreen database. In the previous post, we could have added the tables from the SFX export to the Evergreen database, but I don't like mixing these more experimental projects with our production system - so we'll work with the a database named periodicals instead.
pg_dump --no-owner --schema-only --table biblio.record_entry \
--table serial.record_entry evergreen > bre_sre_schema.dump
We have to munge the schema to not create the indexes on the tables - should lead to faster loads. Also, remove any triggers that point at other stuff that doesn't exist in this limited subset of data. Then create the schema in our periodicals holdings database: psql -f bre_sre_schema.ddl -d periodicals Now dump the data for those tables from the Evergreen database. If you have a large set of bibliographic records like we do, make sure you have a few gigabytes of space available in the output location.
pg_dump --no-owner --data-only --table biblio.record_entry \
--table serial.record_entry evergreen \
> bre_sre_data.ddl
Okay, now you can load the data into your serials holdings database: psql -f bre_sre_data.dump -d periodicals And now we add the indexes that we previously culled from the schema. You can be more selective in the indexes you create, if you know what you're doing. For some reason, I opted to play with PostgreSQL's support for XML as a native column type and converted the plain text marc column into an XML column:
ALTER TABLE biblio.record_entry
ALTER COLUMN marc
SET DATA TYPE XML USING marc::XML;
Now we add the Evergreen holdings to the holdings.conifer table. We use the xpath() function to retrieve the desired values from the MARC XML in biblio.record_entry, and wrap the results in the unnest() function to return the nodeset as a plain text string, rather than an array of values. The WHERE clause restricts the holdings to those owned by the library in which I am interested.
CREATE TABLE holdings.conifer (
record BIGINT,
issn TEXT,
coverage TEXT,
call_number TEXT
);
INSERT INTO holdings.conifer (record, issn)
SELECT bre.id, UNNEST(XPATH('//*[local-name()="datafield"][@tag="022"]' ||
'/*[local-name()="subfield"][@code="a"]/text()', bre.marc))
FROM biblio.record_entry bre INNER JOIN serial.record_entry sre
ON sre.record = bre.id
WHERE sre.owning_lib = 103
;
We'll populate the call number based on the 852 field in the serial record. We could pull this from the asset.call_number table, but this will be good enough for the first pass.
UPDATE holdings.conifer
SET call_number = UNNEST(
XPATH(
'//*[local-name()="datafield"][@tag="852"]/' ||
'*[local-name()="subfield"][@code="h"]/text()',
(
SELECT sre.marc::xml
FROM serial.record_entry sre
INNER JOIN holdings.conifer hc ON hc.record = sre.record
WHERE hc.record = holdings.conifer.record
LIMIT 1
)
)
)
;
Now we need to generate usable holdings statements for the print. Evergreen includes a great MFHD parsing library written in Perl, and PostgreSQL thankfully enables you to create functions written in Perl, but to get the following to work on a non-Evergreen machine, I had to copy Open-ILS/src/perlmods/OpenILS/Utils/MFHD/* to /usr/local/share/perl/5.10.0 and edit the occurrences of OpenILS::Utils::MFHD::* to *.
CREATE OR REPLACE FUNCTION holdings.parse_mfhd ( xml TEXT ) RETURNS TEXT AS $_$
use MARC::Record;
use MARC::File::XML;
use MFHD;
my $xml = shift;
my $text;
my $captions;
my $marc = MARC::Record->new_from_xml( $xml );
my $mfhd = MFHD->new($marc);
foreach my $field ($marc->field('866')) {
my $holdings = $field->subfield('a');
if ($holdings) {
my $public_note = $field->subfield('z');
if ($public_note) {
$text .= "$holdings - $public_note";
} else {
$text .= "$holdings";
}
}
}
foreach my $cap_id ($mfhd->captions('853')) {
my @curr_holdings = $mfhd->holdings('863', $cap_id);
next unless scalar @curr_holdings;
foreach (@curr_holdings) {
if ($captions) {
$captions .= ', ';
}
$captions .= $_->format();
}
}
if ($text and $captions) {
$text = "$text / $captions";
} else {
$text = "$text$captions";
}
return $text;
$_$ LANGUAGE PLPERLU;
And update the table:
UPDATE holdings.conifer SET coverage = (
SELECT holdings.parse_mfhd(marc)
FROM serial.record_entry
WHERE serial.record_entry.record = holdings.conifer.record
LIMIT 1
);
That almost works, but it only retrieves the coverage from a single serial holdings record for a given bibliographic record, even though there might be multiple serial holdings records. To amend that, we'll create a PL/pgSQL function that concatenates all of the coverage statements from all of the pertinent serial holdings records for a given bibliographic record:
CREATE OR REPLACE FUNCTION holdings.print_coverage(marc_record BIGINT)
RETURNS TEXT AS $$
DECLARE
r RECORD;
coverage TEXT;
BEGIN
-- If coverage is NULL to begin with, then concatenating to it results in NULL
coverage := '';
-- RAISE NOTICE 'marc_record = %', marc_record;
-- Loop over the serial records attached to the targeted bib record
FOR r IN SELECT marc FROM serial.record_entry
WHERE record = marc_record
ORDER BY id
LOOP
coverage := coverage || holdings.parse_mfhd(r.marc);
-- RAISE NOTICE 'r.marc = %', r.marc;
END LOOP;
-- RAISE NOTICE 'coverage = %', coverage;
RETURN coverage;
END
$$ LANGUAGE 'plpgsql';
And we'll use this fancy new function to update the print holdings statements again with the more complete coverage:
UPDATE holdings.conifer SET coverage = (
SELECT holdings.print_coverage(record)
FROM serial.record_entry
WHERE serial.record_entry.record = holdings.conifer.record
LIMIT 1
)
;
Now the payoff: generating a list of matching ISSNs from the electronic holdings and our print holdings, with the coverage statements for each, for a subset of the SFX collections to which we have access:
-- Set the display to expanded format for easy reading
\x
-- Basic report for perusal
SELECT hsfx.issn AS "ISSN", hsfx.title AS "Title",
hsfx.collection AS "SFX Collection",
hsfx.coverage AS "Electronic Coverage",
hc.coverage AS "Print Coverage", hc.call_number AS "Call Number"
FROM holdings.sfx hsfx
INNER JOIN holdings.conifer hc ON hsfx.issn = hc.issn
WHERE (hsfx.collection ILIKE '%JStor%' OR hsfx.collection LIKE '%Scholars%')
AND hc.coverage > ''
LIMIT 5;
That results in: -[ RECORD 1 ]-------+-------------------------------------------------------------------------- ISSN | 0142-2774 Title | Journal of Occupational Behavior SFX Collection | JSTOR Arts and Sciences 4 Electronic Coverage | Available from 1980 until 1987. Print Coverage | Vol. 1 No. - Vol. 8 No. 4 (1980-1987) Call Number | DESM-PER -[ RECORD 2 ]-------+-------------------------------------------------------------------------- ISSN | 0741-6261 Title | The Rand Journal of Economics SFX Collection | JSTOR Arts and Sciences 2 Electronic Coverage | Available from 1984 until 2006. Print Coverage | V.17 (1986) - v.23 (1992) Call Number | DESM-PER -[ RECORD 3 ]-------+-------------------------------------------------------------------------- ISSN | 0002-8614 Title | Journal of the American Geriatrics Society SFX Collection | Scholars Portal Electronic Coverage | Available from 2001 volume: 49 issue: 1 until 2009 volume: 57 issue: 10. Print Coverage | Vol. 1 - 37 (1953-1989) Call Number | DESM-PER -[ RECORD 4 ]-------+-------------------------------------------------------------------------- ISSN | 0023-7639 Title | Land Economics SFX Collection | JSTOR Arts and Sciences 7 Electronic Coverage | Available from 1948 until 2005. Print Coverage | v.62 (1986) - v.68 (1992) Call Number | DESM-PER -[ RECORD 5 ]-------+-------------------------------------------------------------------------- ISSN | 0090-2616 Title | Organizational dynamics SFX Collection | Scholars Portal Electronic Coverage | Available from 1995 volume: 23 issue: 3 until 2009 volume: 38 issue: 3. Print Coverage | Vol. 15 No. - Vol. 23 No. 5 (Summer 1986-Spring 1995) Call Number | DESM-PER Looks pretty good to these eyes. Okay, now we'll get serious and dump the output to a tab-delimited file so we can easily open it in OpenOffice.org Calc or another spreadsheet: -- Set delimiter to TAB (CTRL-V And that's it. It might seem complex, but I've found that investing the effort into learning how to lean on PostgreSQL to do the hard work pays plenty of dividends. This exploration should help me contribute more functionality to Evergreen core; for example, I hope to use my experiments with the pl/Perl function to start populating the serial.bib_summary tables using an INSERT/UPDATE/DELETE trigger on serial.record_entry so that we don't have to generate the summaries for every item details request in the catalogue. Tuesday, November 10. 2009Doing useful things with the TXT dump of SFX holdings, part 1: databaseThere must be other people who have much more intelligent things than me with the TXT dump of SFX holdings that you can generate via the Web administration interface, but as I've gone through this process at least twice and rediscovered it each time, perhaps I'll save myself an hour or two by writing this down. Maybe it will save you a bit of time, too. Or give you an idea of a day in the life of a systems librarian. In part 1, I turn the TXT dump into something that I can load back into a local database. Maybe I want to massage the data, write arbitrary SQL queries, join it against data that I load into other tables in the database... I'll talk about that in a subsequent post. For now, let's get munging!
Now I'm ready to load data from another source - say, from my ILS that includes print holdings - so that I can start matching on ISSN and identify candidates for pruning our print collection based on criteria such as the electronic collection. Perhaps I'll talk about that in part 2... Monday, November 9. 2009FSOSS 2009: Project Conifer updateUpdate: 2009-11-24 James Forrester of the Ontario Academy of Art and Design has posted a short video (Internet Archive) of the presentation. Thanks, James! On Friday, October 30th, I presented a status update on Project Conifer at the Free Software Open Source Symposium (FSOSS). This was a follow-up to the talk I gave with John Fink at last year's FSOSS, with the hopefully interesting twist that instead of talking about what we were going to do, I talked about what we had done, and the lessons learned along the way. This was a slightly modified version of the talk I gave at the Lyrasis/NELINET open source conference earlier in October, aimed at a more general audience. The talk was recorded and will be posted online at the FSOSS site at some point. Here are the slides in (ODP) and (PDF) format. The speaker notes on the slides will give you the meat of the content. Friday, November 6. 2009Changing the default run level in Ubuntu 9.10 Karmic KoalaDear Dan: You felt pretty pleased with yourself for jumping on the latest Ubuntu release back in the alpha stages on your virtual machine, then having it running on your laptop and Lynn's Asus EEE 701 the day after the final release came out. But did you ever have to change the default runlevel? NO, you didn't. You didn't anticipate that you would want to convert that alpha-6 server into a workstation. Nice thinking, hero. And now that you've figured it out, you're going to remind yourself that the place to change it is in /etc/init/rc-sysinit.conf. You might want to look at the env DEFAULT_RUNLEVEL setting. Oh sure, you could manually create an old-school /etc/inittab file, and it would get picked up from there. But, you know, throwing away a grand old tradition like that really feels liberating, doesn't it? So now you know. Don't forget, okay? But if you do, you might just find yourself coming back here.
(Page 1 of 1, totaling 4 entries)
|
QuicksearchAbout MeI'm Dan Scott: barista, library geek, and free-as-in-freedom software developer.
I hack on projects such as the Evergreen
open-source ILS project and PEAR's File_MARC package .
By day I'm the Systems Librarian for Laurentian University. You can reach me by email at dan@coffeecode.net. Identi.ca microblogging
LicenseCategories |
