Monday, July 19. 2010
As part of the informal partnership between the International Institute of Social History (IISH) and Project Conifer, I was pleased to be able to spend the last two weeks in Amsterdam, working side-by-side with one of the Institute's developers, Ole Kerpel, on augmenting the support for MARC21 authorities in Evergreen. To prepare for the work session, I had posted a blueprint for the authorities work on the Evergreen Launchpad instance and circulated the list of requirements we had been asked to address to the broader Evergreen development community. We were fortunate to have the attention of Mike Rylander on the proposal; he not only supplied suggestions for how to implement some of the items, but also committed significant code contributions that greatly assisted our work. Here is a summary of the goals we accomplished in the current development branch of Evergreen (targeted for the 2.0 release), followed by a list of the outstanding items and my finger-in-the-air estimate of how much more time it would take to accomplish each of them:
Accomplishments
- Controllable control numbers
While not, strictly speaking, a requirement for authority control in and of itself, the ability to ensure that the behaviour of the 001/003/035 fields all conformed to the MARC21 specifications was an important requirement for IISH. They plan to provide external access to their authority and bibliographic records, so making the official identifier fields linkable based on the underlying record ID was an important aspect of the work. We implemented this feature as an optional database-level trigger to ensure that the control numbers and control number identifiers are always perfectly in sync with the internal identifier of the particular system on which the records are stored.
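To make the idea concrete, here is a rough Python sketch of the kind of synchronization described above. It is not the actual trigger, which lives in the database; the function name, the simplified (tag, value) record model, and the choice to keep a previous 001/003 pair as an 035 value of the form "(003)001" are assumptions for illustration, following the usual MARC21 convention for system control numbers.

```python
def sync_control_numbers(fields, record_id, org_code):
    """Return a new field list whose 001/003 match the local record ID.

    `fields` is a simplified record: a list of (tag, value) tuples.
    This model and this function are hypothetical, for illustration only.
    """
    old_001 = next((v for t, v in fields if t == "001"), None)
    old_003 = next((v for t, v in fields if t == "003"), None)
    new_001 = str(record_id)

    # Drop the existing control number fields; they are rebuilt below.
    kept = [(t, v) for t, v in fields if t not in ("001", "003")]

    # Preserve a foreign identifier as an 035, e.g. "(OCoLC)ocm12345".
    if old_001 and (old_001, old_003) != (new_001, org_code):
        legacy = f"({old_003}){old_001}" if old_003 else old_001
        if ("035", legacy) not in kept:
            kept.append(("035", legacy))

    result = [("001", new_001), ("003", org_code)] + kept
    # Stable sort on the tag keeps fields that share a tag in their original order.
    return sorted(result, key=lambda field: field[0])
```

Running the function a second time on its own output changes nothing, which is the property you want from something that fires on every insert or update.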
- Links
Where having Mike Rylander participate in your review process pays off, part one... Before I even arrived in Amsterdam, Mike implemented a tricky database trigger that tracks the links between a given bibliographic record and the authority records to which it links. The links are tracked at the database level, as well as directly in one or more 0 subfields in each field that is controlled by an authority record. Yes, a given field in a bibliographic record can be controlled by two authority records and it all works. Nice, Mike!
- Syncs
Where having Mike Rylander participate in your review process pays off, part two... Mike also implemented the bulk of the logic for automatically updating bibliographic records that are linked to a given authority record when that authority record is modified. Yes, folks, when you add a death date to an authority record, it will automatically appear in the corresponding bib records.
- Control an uncontrolled set of bibliographic records
You may have dealt with library systems in the past that use some sort of string matching to implement authority support. As noted above, Evergreen is not like that. However, this means that many of us, when migrating to Evergreen, have bibliographic records lacking the 0 subfields that are required for full authority support. Towards that end, I wrote a script that will walk through a set of bibliographic records, search for matching authority records for each controllable field in each bibliographic record, and add the required 0 subfields to the bibliographic records. It certainly won't be a fast solution, but you should only need to do it once, and it worked on the limited test cases that we had ready at hand.
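The general shape of that walk can be sketched in a few lines of Python. This is not the script itself: the in-memory record model, the tiny tag mapping, the normalization rule, and the "(ORG)id" form of the 0 subfield value are all assumptions made for the sake of a self-contained example.

```python
# Bib tag -> authority heading tag (an illustrative subset, not a full mapping).
CONTROLLED_TAGS = {"100": "100", "600": "100", "700": "100", "650": "150"}


def normalize(subfields):
    """Build a crude match key from a field's subfields, ignoring subfield 0."""
    return " ".join(
        value.strip().rstrip(".,").lower()
        for code, value in subfields
        if code != "0"
    )


def link_record(bib_fields, authority_index, org_code):
    """Add a 0 subfield to each uncontrolled field that matches an authority.

    `bib_fields` is a list of {"tag": str, "subfields": [(code, value), ...]}.
    `authority_index` maps (authority tag, normalized heading) -> authority ID.
    Returns the number of links added. All names here are hypothetical.
    """
    added = 0
    for field in bib_fields:
        auth_tag = CONTROLLED_TAGS.get(field["tag"])
        if auth_tag is None:
            continue  # not a controllable field
        if any(code == "0" for code, _ in field["subfields"]):
            continue  # already controlled
        auth_id = authority_index.get((auth_tag, normalize(field["subfields"])))
        if auth_id is not None:
            field["subfields"].append(("0", f"({org_code}){auth_id}"))
            added += 1
    return added
```

Because fields that already carry a 0 subfield are skipped, the walk can be interrupted and restarted without creating duplicate links.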
- Teach the MARC editor about authority records
The MARC editor knew all about fixed fields for bibliographic records, and provided a handy grid for editing those fields. However, it didn't even know how to recognize authority records, and presented a fixed field grid that was absolutely meaningless. I spent a chunk of time laboriously transcribing the fixed field rules from MARC documentation into the MARC editor and now the MARC editor presents a reasonable fixed field grid for your editing convenience.
- Merge authority records
Something that often happens in a library is that two authority records are created that identify the same thing. Eventually somebody notices the problem and wants to merge the authority records together. Towards this end, I added a database-level stored procedure that supports the merging of authority records, such that the linked bibliographic records will automatically point to the winning authority record.
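The core of a merge is simply repointing links. As a minimal sketch, assuming the links are modelled as (bibliographic ID, authority ID) pairs (the function name and the data model are hypothetical, and the real stored procedure also has the records themselves to deal with):

```python
def merge_authorities(links, winner, loser):
    """Repoint every link from the losing authority to the winning one.

    `links` is an iterable of (bib_id, authority_id) pairs. Returning a set
    collapses the duplicate that appears when a bibliographic record was
    linked to both authorities before the merge.
    """
    return {
        (bib_id, winner if auth_id == loser else auth_id)
        for bib_id, auth_id in links
    }
```

For example, merging authority 11 into authority 10 turns the links {(1, 10), (2, 11), (3, 10), (3, 11)} into {(1, 10), (2, 10), (3, 10)}.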
- Authority browse interfaces
Where having Mike Rylander participate in your review process pays off, part the third... Mike also implemented basic browse interfaces that present a series of authority records in MARCXML format matching your requested authority type (author, title, subject, topic) and the matching substring at the /opac/extras/browse and /opac/extras/startwith URL entry points. While still raw at this point, these can provide the basis for classic authority browse interfaces for those who desperately desire them.
Remaining to-do items
Note that any estimates are based on how long I think it would take me to implement, based on my own familiarity with MARC and Evergreen and all things Perl and JavaScript and PostgreSQL, and provided with the granularity of no less than one day. Actual implementation times may vary, of course; if related work items are worked on consecutively, then it is likely to take less time to achieve than if the items are tackled sporadically.
- Add an authority in the flow
When you're working in the MARC Editor and you find that there is no match for an entry that you really think should be controlled, IISH wants to make it easy for a cataloguer to add an authority record for that entry. We thought that there might be two options that we would want to expose - a direct "create an authority record from this field" option that takes no further input, and a "create an authority record from this field and open it in another MARC editor to let me tweak it" option. Estimate: 2 person days
- Highlight controlled fields
This is really a two-part problem. First, for uncontrolled fields, we want to teach the Validate button to offer the kind of automatic matching that the script does and add the required 0 subfield. Second, we want to highlight fields that are explicitly controlled by authority records with a 0 subfield differently from fields that simply match an authority record, but which are not controlled by it. Estimate: 1 person day
- Simplify authority record selection
This two-part requirement would mask many of the fields that are currently offered as options when you right-click on an uncontrolled subfield to display matching authority records. For example, it is a little weird to offer a "See from" heading to a cataloguer; we're trying to avoid adding new records with those headings, right? Heh. Second, we want to introduce the ability to invoke the authority browse list in this interface so that the cataloguer can see a given set of headings in context and select the heading to apply from there. Estimate: 2 person days
- Delete authority record
There is currently no cataloguer-friendly way to delete authority records. We need to expose a list of authority records (probably reusing that browse list again) and make it possible for cataloguers to delete an authority record. When that record is deleted, all bibliographic records that link to it need to have their links removed - and ideally, the cataloguer would be able to tell how many bibliographic records link to that authority before the delete takes place. Estimate: 1 person day
- Edit and merge authority records
Although the database-level support now exists for merging authority records, we need to expose a means for cataloguers to select the authority records that they want to edit or merge. This could just be a slightly evolved version of the "Delete" interface. Estimate: 1 person day
- Expose authority records via SRU/Z39.50/crawlable interface
One of the goals of the IISH is to be able to share their authority records with other institutions. One of the standard methods is SRU + Z39.50 server support; we should be able to build on the SRU/Z39.50 server support for bibliographic records in Evergreen to provide a basic solution for authority records. Interest has also been expressed in having a crawlable implementation that would give the linked data crowd something to play with. Estimate: 2 person days for an SRU/Z39.50 server, 1 person day for a very basic crawlable linked-data implementation
In summary - hurray for Mike Rylander for helping us out to such an extent, and many thanks, again, to IISH for giving me an opportunity to focus on Evergreen development for an extended period of time, and to Laurentian University for supporting my efforts. I hope that between Ole and myself it will be possible to finish the rest of these work items prior to the Evergreen 2.0 release. It has been exhilarating to see how far Evergreen's authority support has come in less than a month, and given a little more time I suspect that Evergreen's authority support will be the envy of other library systems.
Friday, July 2. 2010
As an Evergreen developer, I believe our project has a few significant gaps that projects like RSCEL might be able to help address for the overall good of the community by bringing in outside resources to the project. Or perhaps there are people within the community whose skills haven't been called on yet; when I say that we lack skills, I'm basing that on the lack of patches and offers of assistance that I've seen in these areas. I would be delighted to be proven wrong! Either way, I submit this for the community's consideration.
- 3rd party security audit: Before Conifer adopted Evergreen, I had hoped that we would be able to fund a security audit of the code by a trusted and competent 3rd party like OmniTI (from a previous life, I believe that OmniTI employs some of the best people in the business, thus the plug - but there are certainly other options out there). As developers, we try our best to avoid vulnerabilities, but as the recently disclosed vulnerability in open-ils.pcrud attests, we're not experts in security. An audit of the public-facing interfaces (the catalogue, feeds, etc) would be a great help to the project. I would expect a prioritized list of areas that need to be addressed, along with recommendations on how to address those problems (whether they be cross-site scripting, session fixation attacks, authentication encryption attacks, etc). Our community's process (or lack thereof) for reporting and addressing security vulnerabilities might be an appropriate subject for an audit as well.
- Testing framework: Our project is woefully short on tests, either human-powered or automated, for determining the state of the code at any given point in a release cycle. Thus, we have put out release after release that either won't install cleanly, or won't upgrade successfully from a previous release. The trunk version of the code had an error that meant that Evergreen couldn't be compiled; that problem existed for three weeks before somebody noticed and fixed it. I'm not pointing fingers, here; if I did that, I wouldn't have enough fingers to point back at myself for all of the problems I've introduced that other people have had to fix. Johnathan Nightingale in The Most Important Thing … or How Mozilla Does Security and What You Can Steal provides a great overview of Mozilla's philosophy about and approach to testing. There is all kinds of goodness in this presentation, but one of the most interesting points is that "money can be exchanged for services" -- that is to say, if your existing development team doesn't have the skills or time to implement a testing infrastructure, there are companies that do have the ability to put together a test infrastructure for a given project. Once that infrastructure is in place, it tends to get extended and used by the existing development team because it makes their lives easier; they don't need to manually test a given code path every time in the future, or deal with regressions that aren't noticed until months in the future when the changes they were making are no longer fresh in their minds. It sometimes requires a culture change, though.
- Continuous integration: Hand in hand with a testing framework is a continuous integration server that provides testing feedback on every commit to the Evergreen repository for a given set of branches. Even without a testing framework, it is possible to have a continuous integration server run through the process of installing all prerequisites, configuring the code, building and installing the code, and creating the database schema to at least determine whether the basics can be accomplished successfully to confirm that a branch is ready for release. This arguably also goes hand in hand with a team's process for addressing a security vulnerability: if you have a continuous integration server that can tell you if a given fix does not introduce basic build and install errors, then you can get a new release out with much more confidence that you're not going to be encouraging your users to jump to a broken package. Note that Equinox ran a continuous integration server for OpenSRF and Evergreen trunk for a while, but that was killed and replaced by a call for volunteers to build a new continuous integration service (I can't find a more to-the-point call for volunteers, so perhaps it just hasn't been advertised widely enough - or again, perhaps we lack the skills in the community to get a standard CI service like Hudson running.)
- Packaging: To decrease the difficulty of installing and configuring Evergreen, we need more investment in packaging Evergreen and all of its unpackaged dependencies. The idea is that a user should be able to run "aptitude install evergreen" or "yum install evergreen" and have the entire system installed and configured, and then run "aptitude upgrade" or "yum upgrade" to have newer versions installed. Right now the process is still rather onerous and requires a great deal of manual effort, although it has improved significantly since the early days of 2007. Again, this requires a particular set of skills that the Evergreen community does not appear to possess in depth: autoconf, automake, APT and RPM packaging - and perhaps some redesign of elements like skins to make local customizations easier to incorporate and keep up to date. This would be a natural complement to a continuous integration service, but much of the effort could also be done on its own.
Occasionally I drop down to the database level to generate some reporting information. You could probably get the same information through the reporter but I like the precision of SQL. Here are a couple of queries that I've put together recently.
List titles for periodicals published by "Human Kinetics" with subscriptions owned by library ID "OSUL"
SELECT rsr.id, rsr.title
FROM metabib.full_rec mfr
INNER JOIN metabib.rec_descriptor mrd ON mfr.record = mrd.record
INNER JOIN asset.call_number acn ON acn.record = mrd.record
INNER JOIN reporter.super_simple_record rsr ON rsr.id = mrd.record
INNER JOIN actor.org_unit aou ON aou.id = acn.owning_lib
WHERE mfr.tag = '260'
AND mfr.subfield = 'b'
AND mfr.value ilike 'Human Kinetics%'
AND mrd.bib_level = 's'
AND aou.shortname = 'OSUL'
;
Strip out URLs for an online resource to which we no longer subscribe
Occasionally we drop subscriptions to an online resource that we happened to catalogue with an inline 856 field. Our new approach relies on just-in-time results from our link resolver to display accurate access to online resources (or at least consistent representations of what we have access to!), but our legacy records placed all of that information directly in the 856 field in the corresponding bibliographic record. The PostgreSQL regexp_replace() function lets you use regular expressions to match a subset of the MARC record and replace it with... well... nothing, in this case.
As we want to subsequently reingest the MARC records, and we're not running Evergreen trunk yet in which a reingest will automatically be triggered by an update to the biblio.record_entry table, I first push the list of affected IDs into a scratch table. This also lets me put limits on the MARC records that I'm going to touch, so that I don't inadvertently destroy content in another library's set of bibliographic records.
CREATE TABLE scratchpad.urls_to_delete (id BIGINT);
INSERT INTO scratchpad.urls_to_delete
SELECT acn.record
FROM asset.uri au
INNER JOIN asset.uri_call_number_map aucnm ON au.id = aucnm.uri
INNER JOIN asset.call_number acn ON aucnm.call_number = acn.id
INNER JOIN actor.org_unit aou ON acn.owning_lib = aou.id
WHERE au.href ILIKE '%/search.ebscohost.com/direct.asp?db=rch%'
AND aou.shortname = 'OSUL'
;
BEGIN;
UPDATE biblio.record_entry
SET marc = regexp_replace(
marc,
E'<datafield tag="856" ind1="4" ind2="0"><subfield code="z">Available online from Ebsco.*?search.ebscohost.com/direct.asp\\?db=rch.*?</datafield>',
''
)
WHERE id IN (SELECT id FROM scratchpad.urls_to_delete);
Note that the UPDATE statement is preceded by a BEGIN statement so that we can check our results and issue a ROLLBACK if we inadvertently changed too much, or created mangled records. Once you check your work with a SELECT statement or two, you can issue a COMMIT statement to make the changes take effect.
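Before touching the database at all, it can be reassuring to try the pattern against a sample record. Here is a rough Python equivalent of the regexp_replace() call above, run against a made-up MARCXML fragment. The sample record and the function name are invented for illustration, and the dots in the host name are escaped here for strictness, which the SQL version does not do.

```python
import re

# re.DOTALL lets "." match newlines, as it does by default in PostgreSQL.
PATTERN = re.compile(
    r'<datafield tag="856" ind1="4" ind2="0"><subfield code="z">'
    r"Available online from Ebsco.*?"
    r"search\.ebscohost\.com/direct\.asp\?db=rch.*?</datafield>",
    re.DOTALL,
)


def strip_ebsco_856(marc):
    """Remove the first matching 856 field from a MARCXML string.

    count=1 mirrors regexp_replace() without the 'g' flag, which only
    replaces the first match in each record.
    """
    return PATTERN.sub("", marc, count=1)


# A made-up record with one field to keep and one field to strip.
sample = (
    "<record>"
    '<datafield tag="245" ind1="0" ind2="0">'
    '<subfield code="a">Title</subfield></datafield>'
    '<datafield tag="856" ind1="4" ind2="0">'
    '<subfield code="z">Available online from Ebsco</subfield>'
    '<subfield code="u">http://search.ebscohost.com/direct.asp?db=rch&amp;jid=ABC</subfield>'
    "</datafield>"
    "</record>"
)

cleaned = strip_ebsco_856(sample)
```

One consequence worth knowing: if a record holds more than one matching 856 field, a single pass of the UPDATE as written removes only the first of them, so another SELECT after the UPDATE (and before the COMMIT) is a good way to spot stragglers.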
Sunday, June 27. 2010
This conversation on identi.ca has prompted me to publish the rough notes I had prepared for a proposed discussion on making the Android operating system experience more free-as-in-freedom at the Google I/O 2010 Conference Bootcamp "unconference". Unfortunately, my proposal was not one of the top vote-getters (it missed the cut by two votes), so we didn't get to have the discussion there, even though I'm sure we would have had an interesting discussion. But perhaps there's something worthwhile in the roughly formed thoughts that follow...
Making Android more "Free as in Freedom"
What do I mean?
- Not "zero cost", but:
- Free to run for any purpose
- Free to study the source (a critical means of learning how to build better applications)
- Free to redistribute verbatim copies
- Free to modify the source and redistribute the modified version
- Android the operating system may be FaiF, but Android the distribution is not
We have opportunities to win interesting development investments on Android over proprietary platforms; see the Wockets project, an open source effort to create very low cost motion measurement devices for hobbyists, researchers, and developers interested in creating software and devices that measure or respond to movement, which is developing with Windows Mobile first and Android second. It's a shame to see an "open" research project being built on a closed base, but there might be some clues in these researchers' rationale that suggest ways that the freedom of Android could be improved.
- Drivers (camera, GPS, etc) bundled as binary blobs are a problem for auditing, bug fixing, innovating
- Current phones get applications delivered out of the box:
- that sometimes suck (GTalk - no way of changing the Google account it uses)
- that you won't use and don't want (Facebook!)
- that you might not trust (this is your phone, +++)
- that you can't legally redistribute (Market?)
- that you can't remove (my precious space!) without installing a new firmware image
- Can be hard to determine what apps are even free software; we might need to combine these multiple, partially overlapping, sometimes contradictory sources: and the Android Market and SlideMe Market don't enable filtering by license
- Opportunities abound for new Free-as-in-Freedom applications to gain a significant foothold:
- No Skype = space for LinPhone / SipDroid to move in (given a quality contact mechanism)
- No good multi-protocol IM client (libpurple via NDK?)
- Boost the Replicant project
- It's in our best interests as Android users and developers to have a free platform - we developers can build on each others work to create a better user experience, rather than starting from scratch every time in our own jealously protected niches.