Backlog of Access 2006 notes

Posted on Mon 16 October 2006 in Libraries

Following on my plea for access to Access presentations, I'm in the process of posting the notes I took at the CARL institutional repository pre-conference and Access 2006. I probably should have posted these to a wiki so that others (like the presenters) could go ahead and make corrections/additions, but I've only got a few minutes before my training course starts.

Note that while a good chunk of the content was transcribed directly from the presenters' words or presentations (therefore all of this is posted under dubious copyright/licensing conditions), I can't guarantee that I actually recorded anything correctly. So, read on, and if something seems outrageously dumb, it's probably my brain acting as the filter for the speaker that introduced the dimness.

First up: the CARL Institutional Repository Preconference!

Leslie Weir, Chief Librarian, Ottawa University

  • CARL - 27 academic research libraries + various government libraries
  • CARL Institutional Repository Project was founded in 2003; several CARL members were planning an IR and decided to work together
  • 2004 - SFU created the harvester
  • Metadata working group -- to improve quality of metadata in repositories
  • Mark Jordan (SFU) highly involved in harvester and metadata initiative
  • Open Access News: many mandates either implemented or in the works for researchers around the globe; self-archiving is likely to become a normal part of research activities
  • 1/2 of CARL members have IRs up and running; 1/4 are planning one

David Moorman, Senior Policy Advisor at SSHRC, responsible for archiving policy at SSHRC

  • Recently submitted final draft of a 22-country OECD policy on access to research data
  • One way of moving into international arena is through open access to research data
  • SSHRC has embraced open access in principle
  • SSHRC serves 18,000 researchers on a regular basis; moving from principle to action is hard
  • Support for research journals vs. the role of the article as a unit in IRs
  • How do you take the publicly funded research in the universities and make it available to __anyone__, regardless of who they are or where they are?
  • Canada has one of the most highly educated populations with 1/4 of the population having a university degree
  • Issues about research data respecting anonymity of research subjects
  • Citation analysis does not work very well for social sciences or humanities -- SSHRC wants better measures of the effectiveness of their funding for research outcomes
  • In 1991, physical scientists already had access to full-text on disc, while humanities were far behind -- more a measure of institutional barriers than technology
  • David wants to ensure that SSHRC helps libraries
  • SSHRC currently has no policy on IR; some simplistic approaches suggest that IRs should be mandated (ignoring all backlash); however, the policy on open access is the measure that is required to help work out the details that may result in IRs or open access journals
  • Size of a grant is currently based on revenue, but that means that publication in an open access forum is a problem -- currently working on correcting that issue
  • SSHRC is not attached to an institution; they deal with researchers and the support system that researchers have; vs. CARL which is highly institutional, with universities that are provincially regulated and funded -- so SSHRC can't reach beyond its jurisdictions
  • Of 94 Canadian universities, there are fewer than half that have an IR running or in planning stage; until that number is closer to 80% the remaining resistance to providing resources will not fall

What can CARL do? Build on other experiences out there

  • California is building one common system (roughly comparable to all of Canada)
  • European system development -- the DRIVER project will build a test case for an IR across 10 universities in 8 countries to try solving the technology, political, and cultural issues for all of those at once
  • Public Knowledge Project -- peers of John Willinsky -- originally funded by SSHRC

Facing the social barriers

  • Real challenge is to face the social barriers; faculty are both highly conservative and radical; change to an IR messes with prestige, faculty advancement and tenure -- you're messing with the way things have been for decades
  • Use of longitudinal data sets in research data centers has been a remarkable success -- on any given day 1000 researchers are using the research data centers (up from 7 or 8 researchers at StatsCan a few years ago)
  • SSHRC is going to attempt to pull in the 5,000 papers per year that these researchers generate from the social statistics, beginning with the Universite de Montreal
  • No objection in principle, but highly skeptical that this is going to be successful because these are highly skilled researchers who know what signing an exclusive copyright transfer agreement means
  • Working with Copyright Canada to handle the copyright issues; after a first round of letters to researchers asking them to allow SSHRC to add their paper to the IR -- expect 80% to say they can't because they have signed the exclusive agreement; second round of letters to publishers asking for their permission -- if no success, SSHRC will put up a Creative Commons License on the front page of the RDC to place intense pressure on the journals

Two problems with SSHRC or CISTI creating their own IR

  • Treasury board regulations govern transfer payments from feds to individuals and groups
  • Grants can only be placed on eligibility and intent, not on post-grant conditions -- so SSHRC and CISTI cannot require funded researchers to make papers available via open access means; they're working on getting the government to fix this limitation but SSHRC or CISTI are relatively low priority
  • Under the Official Languages Act all federally published documents are supposed to be available in both languages -- but this would require translating approximately 15,000 objects per year -- in SSHRC's lawyers' opinion, it would be okay unless somebody complained, at which point it would be brought to the floor of the House of Commons for debate, with no predictable result
  • IDRC is not in the same boat for some reason

Can SSHRC support IRs directly with grants?

  • SSHRC used to support libraries, largely for special collection development projects, but hasn't since the early 90's
  • One problem is the direct funding of 300 million from feds to universities, which is meant to cover all of the direct costs of research at the universities
  • Could be funded directly from SSHRC budget, but demand outstrips supply (SSHRC can support 40% of its community, NSERC supports 60% of its community) and researchers would not support cutting their funding to build an IR instead
  • Best approach is to convince researchers to contribute to the virtuous circle to make it better for all

Conclusions

  • In David's opinion, IRs have the best opportunity to succeed, compared to open access journals -- given that Yahoo and Google understand that content is critical to their business models (think about the 1.6 billion for YouTube vs. the cost of something like Elsevier?)
  • Back-issue digitization project for Canadian Historical Association has been a struggle: finding the right people to do the project, the right technology, working out the loss of revenue from sales of print materials
  • SSHRC needs to hear from universities about the problems we face and where we want to go
  • Problems like University of Manitoba, who want to add confidential research data sets to their IR -- RDC enforces confidentiality through law (RDC users are considered StatsCan employees and face massive fines / jail time) -- but how do you enforce that for a university community? Came up as a requirement for the SSHRC grant to make the data set available in an IR
  • Since 1990, SSHRC has had a policy that as a funding requirement you must make your data set available in one of 10 IRs; however, outside of fellowships, about 4,000 data sets since then were funded but only 10 (no zeros missing!) have complied with that regulation

Beyond Institutional and Disciplinary Boundaries: The Role of Open Access Repositories

Leslie Chan, U of T @ Scarborough, Professor of New Media Studies, and on the steering committee of the American Anthropological Association's portal AnthroSource

  • Lessons learned from implementing an IR, working with colleagues on the benefits of participating in an Open Access Repository (OAR)
  • AnthroSource has gone purely digital thanks to a Mellon grant, offering a repository with value-added tools to maintain revenue from individual and institutional memberships rather than just content

Value added tools

  • Membership fee now offers access to all publications, rather than simply one print journal as it used to be
  • Adding third-party journals as well
  • AnthroSource repository will be "Web 2.0": wiki, blog, tagging folksonomies, RSS feeds, community reviewing tools, just-in-time publication tools (e.g. community repository will hold papers in the marginal sections of community until sufficient number of papers is met and a new issue is published), data sets, data mining texts that go back a hundred years now that retro-digitization project is complete
  • Why deposit in AnthroSource? One should deposit in home institution as well, but AnthroSource has members internationally and many researchers either don't have or don't trust their home institution
  • Oddly enough, AAA staff signed a letter against federal research open access proposal, while steering committee (academics) are laying out a strategy towards open access

Open source, open repositories

  • KMDI is an interdisciplinary "virtual institute" across the three campuses of U of T, considered an intellectual
  • Two years ago U of T provost funded Project OS|OR (open source open repositories) to discover and build linkages between OS and OR projects at U of T
    • Discovered many professors were editing open access journals
    • Open source shares many of the same goals as open access -- building community, not for profit, collaboration
    • Existing open source projects at U of T include:
      • ATutor - a free competitor to Blackboard available in 22 languages
      • ePresence - for archiving digital video, such as lectures by Michael Geist
      • TSpace - an instance of DSpace

Opportunities for increased access via electronic resources:

  • integrate knowledge from diverse locations and sources
  • infrastructure for an expanded view of scholarly communication
  • new services and business models
  • implementation of Semantic Web technologies
  • E-Science (UK), E-Research (Australia), Cyberinfrastructure (US) are large-scale compute clusters that also make research results available -- includes humanities and social sciences

OAI-PMH model separates data and service providers:

  • Content deposit -> Metadata generation -> Aggregation -> End-user services
  • An IR is a "place for people to put their stuff" -- search included with DSpace, for example, is rudimentary, but value-added services such as citation linking, use frequency statistics, etc -- which could be "deluxe" options for revenue generation
  • "In using the term 'open access', we mean the free availability of peer-reviewed literature on the public internet..." (http://www.soros.org/openaccess/read.shtml)
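
To make the data provider / service provider split a little more concrete, here is a minimal Python sketch of the request a service provider (such as a harvester) might send to a repository, and how it might read the Dublin Core titles out of the reply. The `ListRecords` verb, the `oai_dc` metadata prefix, and the XML namespaces come from the OAI-PMH and Dublin Core specifications; the base URL and the sample response are hypothetical, hand-written for illustration, and the code does not touch the network.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and the Dublin Core element set.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build the URL a service provider would fetch to harvest records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def extract_titles(response_xml):
    """Pull every dc:title out of a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.iterfind(".//dc:title", NS)]


# Hypothetical response fragment, written by hand for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:123</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample thesis</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://repository.example.org/oai")
titles = extract_titles(SAMPLE)
```

The repository only has to answer requests like this one; search, citation linking, and other "deluxe" services can then be built by whoever aggregates the harvested metadata.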

Open access enables:

  • New "business" and funding models (author or "input" pay model, which seems more suitable for hard sciences and not humanities -- subscription model effectively prevents readership)
  • Disaggregation of "content" and "services"
  • Post-publication evaluation and impact (we know that ISI analysis is bad for humanities, but continue to use it anyways; one hope is that with openly accessible content new and better models of impact analysis will arise)
  • Integration of open source software and open standards:
    • URI (new proposal available at http://uri.org to provide permanent URLs)
    • OpenURL vs DOI (DOI is a proprietary model that is a pay service: $10,000 to enter, $0.10/click, completely unacceptable for developing countries)
    • CrossRef (relies on DOI for analysis)
  • Linking of research and education: breaks down the wall imposed by federal funding for research vs. provincial funding for education

What authors need:

  • Support for new modes of authorship and dissemination of data (video ethnographies with GB of data, archaeological 3D scans with GB of data -- how do you store, make this accessible, and make it persistent through different formats over time?)
  • Database storage
  • Media integration
  • Long term archiving (one of Leslie's colleagues lost 30 years of research data when one building was being 'tidied up' and the boxes ended up in a Michigan landfill)
  • Linking research and teaching

TSpace:

  • UofT decided to make communities very loosely organized; departments, or collaborative groups of researchers (including researchers from other institutions) -- requirement for entry is showing institutional affiliation
  • Using their IR for purposes other than preprint/postprint:
  • Supporting local journals (Women's Health and Urban Life: separate presentation for the journal, but pulls the article from the repository)
  • Supporting international journals (archiving Bioline International, defunct journals)
  • Enabling digital scholarship (new forms of publications that are digital-only)
  • Easily found in Google Scholar (gives repositories priority ranking, also enables citation analysis -- especially with defunct journals that suddenly become available again)
  • Curriculum vitae (ability to link directly to full-text publications); University of Rochester builds online CVs for their faculty
  • http://romeo.eprints.org lists journals and associated policies that allow authors to post their publications; however, it tends to lag behind the actual journal policies -- good starting point though
  • International outreach: Ptolemy project for east African surgeons gives them access to UofT's e-journals, and these surgeons and their students are in turn being encouraged to write research papers that will be deposited in the IR

A review of content recruitment strategies

Tim Mark, Executive Director of CARL

  • http://www.ifla.org holds the original paper on which this presentation is based
  • International review of strategies performed in 2005
  • Researchers are enthusiastic in principle, but in practice the follow-through has not been there when voluntary compliance is relied upon

What excites researchers is not what excites librarians:

  • Impact
  • Visibility
  • Reputation

Therefore we need to speak the same language as the research community (talking about permanent URLs and digital preservation doesn't work)

CARL IR Project launched in 2002:

  • Purpose: To enhance the visibility and impact of Canadian scholarly output and Canadian research institutions by building a number of robust, content-rich university-based repositories that will form part of a larger federation of institutional repositories worldwide.
  • Assigned Kathleen Shearer as the part-time coordinator of the CARL IR Project

CARL IR Project Activities:

  • CARL Harvester (http://carl-abrc-oai.lib.sfu.ca)
  • Developing a metadata profile for Canadian IRs
  • Advocacy - presentations to faculty and administration, CIHR will be announcing a new open access policy
  • Meetings and conferences
  • Annual survey

CARL Members' implementation status as of June 2006:

  • 14 working
  • 2 pilot
  • 3 planning
  • -- so 19 out of a possible 30 members

Types of content recruitment strategies:

  • Carrot (voluntary)
  • Stick (mandatory policy)

Types of content recruitment strategies (II):

  • General promotional strategies
  • Depositing services (staff to assist deposit of materials)
  • Content harvesting (staff track down already-published materials and arrange for their deposit)
  • Researcher bibliographies (included in the repository)
  • University policies (either encouragement or mandatory)

Case studies:

DARE (Dutch)

  • Funded in 2002, in 2004 began content recruitment strategy across their 16 partner universities
  • Each researcher was given a personal page with photo, biography, awards, research publications, and a link to their own home page
  • All work done by DARENet staff
  • Outcome: 40000 articles, 26000 in full text
  • Very expensive, no indications yet whether faculty will deposit on their own

GLASGOW ePrints Service:

  • In the UK, all researchers must make bibliographic information available
  • Script developed to harvest full-text versions of those articles
  • Still requires staff time to check copyright

QUT ePRINTS (Queensland):

  • Policy 'formally requests' that all researchers deposit articles in the IR when copyright allows
  • Estimate that 45% of research output is being captured

CERN Document Server:

  • Mandatory policy for research staff to deposit publications IF the researcher allows it
  • Library works very hard to find and deposit publications
  • Only 1500 out of 60000/year are directly from the researcher

Conclusions:

  • Ego massaging works:
    • Bio page
    • Ensure researchers know that Google Scholar etc. index and rank repositories highly
  • Speak the language of the research faculty
  • Staff will have to find and deposit items, assign metadata, and check copyright agreements
  • CARL has done a large amount of work with Creative Commons Canada to make a Canadian version of the Creative Commons license available -- this may help assuage

Content recruitment is expensive:

  • QUT estimates $10,000 / researcher, $50 / document until process becomes standardized -- then estimate is $10 / document
  • Mandated archiving is most effective means, but even then staff time is required and compliance is still relatively low (~50%)
  • Sustainable solutions require long-term commitments: current CARL experience is one person a couple of days every week or so, but this is not expected to be a sustainable approach

Advocacy materials:

  • CARL will try to make theirs updated and available
  • SPARC web site has open-access presentation templates
  • University of Calgary repository contains the CARL community documents, should have fruitful results under a CC license

CARLCore Metadata Application Profile

Mark Jordan, Simon Fraser University, Head of Systems

  • Mark guides the development of the CARL Harvester, has conducted an analysis of the quality of the metadata that has been harvested (available in Library Hi-Tech)

Background on the Application Profile

  • CARL Harvester was launched June 2004 primarily as a search engine for the aggregated metadata from the participating repositories, uses the Public Knowledge Project (PKP) Metadata Harvester software
  • The CARL Harvester is open to all Canadian scholarly repositories, not just CARL members, as long as that repository implements OAI-PMH

The Problem

  • As more records have been added, reports of increased dissatisfaction with search capabilities (performance and reliability) -- metadata added to institutional records was not being found in CARL Harvester

The Solutions

  • Improve the software -- ended up being a complete rewrite of the harvester, now at version 2
  • Develop an application profile to ensure that the metadata that is harvested is actually consistent and useful

Application profile

  • "A set of metadata elements, policies, and guidelines defined for a particular application or implementation" (e.g. aggregation)
  • Defines best practices appropriate to the application -- goal was to define something both implementable and useful
  • CEN (European Committee for Standardization) CWA 14855

Goals:

  • Develop a profile that:
    • Improves quality of aggregated metadata
    • Is practical
    • Is voluntary
  • Benefits include:
    • Better centralized services
    • Streamlined local practices
    • Guidance for new repositories (avoiding some of the problems with defaults shipped in common repository software)

Development Process

  • Analyze the metadata from June 2005; the data was harvested with a Perl script to avoid quantum effect on the metadata that the Harvester introduces
  • Develop use cases and functional requirements
  • Survey other application profiles
    • ePrints UK "Using Simple Dublin Core to Describe ePrints"
    • "ARROW Discovery Service Harvesting Guide"
  • Proposal delivered October 2006
  • Deadline for comments: November 10, 2006
  • Final release (including French translation): January 31, 2007 -- will include IR platform-specific implementation guidelines
  • Ongoing: CARLCore Level 2 (extend qualified Dublin Core to offer richer services than unqualified Dublin Core)

CARLCore

  • Document is a standard application profile
    • Rationale
    • General principles and recommendations
    • Entries for each uDC element
    • Appendices:
      • Implementation guidelines
      • Sample records
      • Relationship between CARLCore and the CARL Harvester (Harvester 2 enables dynamic interface widgets for scoping searches with the side effect of showing a given institution's level of consistency)

CARLCore Level 1

  • Uses only unqualified Dublin Core
  • Goal is to make use of the DC elements in OAI as consistent as possible

Sample elements of CARLCore Level 1

  • Identifier - an unambiguous reference to the resource at the originating repository (must point to the metadata records at the originating repository, not the document itself)
  • Source - points back to the original source from which the digital version was derived
  • Type - genre of the work
    • Somewhat controversial, as it requires agreement on a controlled vocabulary
    • Equivalence for translations is necessary
    • One hope would be to provide a type mapping between institution's local types and the controlled vocabulary exposed by the CARL Harvester interface
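
The type mapping hoped for above could be as simple as a lookup table maintained per institution. The following Python sketch shows the idea under loud assumptions: the controlled vocabulary and the local labels (including the French one) are hypothetical, since the actual CARLCore vocabulary is not recorded in these notes.

```python
# Hypothetical controlled vocabulary; the real CARLCore list is not in these notes.
CONTROLLED_TYPES = {"article", "thesis", "conference paper", "other"}

# Hypothetical mapping from one institution's local type labels
# (including a French label) to the shared vocabulary.
LOCAL_TO_CONTROLLED = {
    "journal article": "article",
    "preprint": "article",
    "etd": "thesis",
    "thèse": "thesis",
    "presentation": "conference paper",
}


def normalize_type(local_type):
    """Map a local dc:type value to the controlled vocabulary.

    Values already in the vocabulary pass through unchanged; unknown
    values fall back to "other" so they can still be grouped together.
    """
    key = local_type.strip().lower()
    if key in CONTROLLED_TYPES:
        return key
    return LOCAL_TO_CONTROLLED.get(key, "other")
```

Keeping the table on the aggregator's side would let each repository go on using its own labels locally, which sidesteps some of the controversy about agreeing on a single vocabulary.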

CARLCore Level 2

  • Will add elements to CARLCore Level 1
  • Provide faceted discipline browsing (Using OAI sets? Using one or more non uDC elements?)
  • May focus on disciplinary archives
  • Other features leading to "added value" for users

Implementation issues

  • Legacy metadata -- thousands of existing records
  • Conflicts with local IR metadata practice
  • Inflexible OAI gateways in IR platforms
  • Lack of tools to test compliance; should be fairly simple to adapt Mark's Perl "raw" harvester as a feedback mechanism
  • Yes, using CARLCore is optional... but there is strength in numbers
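
As a rough idea of what a compliance-testing tool could look like, here is a small Python sketch that reports problems in one harvested record. It assumes the record has already been parsed into a dictionary of unqualified Dublin Core element names and values; the rules it checks (required elements, URL-shaped identifiers, a fixed type vocabulary) are illustrative stand-ins, not the actual CARLCore requirements.

```python
REQUIRED = ("title", "identifier", "type")       # assumed rule set
ALLOWED_TYPES = {"article", "thesis", "other"}   # assumed vocabulary


def check_record(record):
    """Return a list of human-readable problems for one harvested record.

    `record` maps an unqualified Dublin Core element name to a list of values.
    """
    problems = []
    for element in REQUIRED:
        if not record.get(element):
            problems.append(f"missing {element}")
    for value in record.get("identifier", []):
        if not value.startswith(("http://", "https://")):
            problems.append(f"identifier is not a URL: {value}")
    for value in record.get("type", []):
        if value not in ALLOWED_TYPES:
            problems.append(f"type not in vocabulary: {value}")
    return problems


# Hypothetical records for illustration.
good = {
    "title": ["A thesis"],
    "identifier": ["https://repository.example.org/handle/1/2"],
    "type": ["thesis"],
}
bad = {"title": [], "identifier": ["hdl-1-2"], "type": ["Thesis (PhD)"]}
```

Run over a whole harvest, a report like this would give each institution the feedback loop the notes describe.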

To do list

  • Take advantage of PKP Harvester's data normalization features to map between local and CARL ontologies
  • CARLCore Level 2
  • Stay current and collaborate with IR platforms

Panel discussion on encouraging increased submissions to IRs

Focus on solving the problems of researchers, and slide the institutional repository in the back door ("by the way, we would like to include a copy of your paper(s) in our IR")

UofT TSpace experience

  • 13,000 archived journal articles, 1,000 self-submitted items
  • Current human resources: .8 admin, 1 system programmer for open conference system, open journal system, and the DSpace instance

Some moves UofT made to make faculty happier:

  • Implemented full-text search instead of just metadata search
  • Making IR visible to Google (better ranking in
  • Addressing metadata quality concerns of subject librarians
  • Ability to restrict a document to UofT, though metadata access is always enforced (required a lot of work to implement, and not a single document has actually used it yet)
  • Batch importing
  • RefWorks import / export
  • Full support models for high-profile collections
  • "Home Depot" model for departmental hires for work-study students to assist with IR submissions (interview questions, regular meetings for training)
  • Attempt to help self-archiving using publisher's PDF (to avoid issues with getting the wrong version of the PDF in the repository)

Failures of the repository at U of T:

  • Inflexible modification (vs. Yahoo! "My Profile" model of being able to change info at any time)
  • Require collective agreement with publishers to use publisher's PDF (faculty very rarely reserve the author's version)
  • Statistics on a per-item basis in a scalable way

SFU:

  • Customized interface between open conference system and DSpace to enable batch imports (conference submissions required presenters to sign off on copyright clearance and permission to add paper to SFU repository)
  • ETDs: the DSpace URL is merged from a spreadsheet into the thesis metadata, generating MARC records that are then loaded into the library catalog
  • Student self-submission process for all metadata is foundering on their assignment of call numbers, so the process has been adjusted to include reference librarians; technical services is actually happier because theses are extremely specialized
  • Grad students paid by faculty member but working with library hasn't worked out well so far (conflicting priorities means that meeting the exacting metadata standards often doesn't happen) - in case where faculty member was retroactively adding his body of work to the IR
  • For theses, SFU students sign a license that includes the ability of SFU to add the thesis to the repository;
  • SFU faculty members sign a standard license as well

Reason for choosing DSpace:

SFU

  • DSpace is free
  • Cost for customization (if you do any) may be less than the cost of annual maintenance fee for commercial software

UofT:

  • when TSpace was set up, there weren't many commercial options that were viable;
  • ability to customize for user demands has been nice;
  • might choose a hosted solution given the expense involved in hosting their own if starting over but proprietary options don't offer anything startlingly better than TSpace
  • would never pay to have their own theses hosted

Waterloo

  • Examined DSpace capabilities vs. their own in-house ETD solution, found that DSpace offered everything for ETDs and the additional objects they were interested in making available

Preservation:

  • DSpace includes a checksum to enable you to confirm that the file now in the repository == the file that was originally ingested
  • DSpace allows you to configure "supported formats" indicating which formats you're prepared to support in your repository -- so you can specify PDF archival, and prevent PowerPoint
  • UofT has created practical scanning standards for their departments
  • UofT: "You can't guarantee that the PowerPoint file will be useful 500 years from now, but you can guarantee that it won't be useful if you don't submit it to the repository at all."

Including the IR in the realm of article abstracts and indexes:

  • Added the IR to the Library home page as an equal partner
  • Adding theses to the IR and MARC records in the ILS gave SFU major buy-in
  • UofT also adds records to the ILS (and there is some discussion about making those records available through Scholar's Portal)
  • UofT: Including the IR in the federated search (Ontario's take is that first step is local loading, second step is federated search)
  • Waterloo: quoting Lorcan Dempsey, federated search as it is does not work because it takes too long; indexing the sources first, then doing the search locally across those indexed sources provides the speed that is required for a good search
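
The "index first, then search locally" point can be illustrated with a toy inverted index in Python. This is only a sketch of the general technique: the record identifiers and text are hypothetical, and a real system would use a proper search engine. The index is built once at harvest time, so a query is an in-memory lookup rather than a wait on every remote source.

```python
from collections import defaultdict


def build_index(records):
    """Map each lowercased word to the set of record ids containing it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in text.lower().split():
            index[word].add(record_id)
    return index


def search(index, query):
    """Return ids of records containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical metadata harvested ahead of time from several sources.
harvested = {
    "ir:1": "Open access repositories in Canada",
    "ils:7": "Access to research data",
    "ir:2": "Metadata quality in repositories",
}
index = build_index(harvested)
```

Because records from the IR and the ILS sit in the same index, they can be returned in one result set without querying each system at search time.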