Coffee|Code: Dan Scott's blog - PostgreSQL
https://coffeecode.net/ · Librarian · Developer

A Flask of full-text search in PostgreSQL
2013-08-11T15:38:00-04:00 · dan@coffeecode.net (Dan Scott)

<p><strong>Update:</strong> More conventional versions of the slides are available from
<a class="reference external" href="https://docs.google.com/presentation/d/1xKwBDYQUjrwMIMh3N64lhKK5H60F3C8SWX49KWJBshM/edit?usp=sharing">Google
Docs</a>
or on <a class="reference external" href="https://speakerdeck.com/pyconca/a-flask-of-full-text-search-with-postgresql-dan-scott">Speakerdeck
(PDF)</a>.</p>
<p>On August 10, 2013, I gave the following talk at the <a class="reference external" href="http://2013.pycon.ca">PyCon Canada
2013</a> conference:</p>
<p><img alt="image0" class="serendipity-image-center" src="/uploads/talks/2013/postgresql_fts_0.png" style="width: 960px; height: 720px;" /></p>
<p><img alt="image1" class="serendipity-image-center" src="/uploads/talks/2013/postgresql_fts_1.png" style="width: 960px; height: 720px;" /></p>
<p>I’m a systems librarian at Laurentian University.</p>
<p>For the past six years, my day job and research have enabled me to
contribute pretty heavily to Evergreen, an open source library system
written largely in Perl and built on PostgreSQL.</p>
<p>But when I have the opportunity to create a project from scratch, for
work or play, Python is my go-to language.</p>
<p><img alt="image2" class="serendipity-image-center" src="/uploads/talks/2013/postgresql_fts_2.png" style="width: 960px; height: 720px;" /></p>
<p>I promised to provide an example of a full-text search engine built with
Flask and PostgreSQL written in under 200 lines of code; you can find
that at either
<a class="reference external" href="https://gitorious.org/postgresql-full-text-search-engine">Gitorious</a>
or
<a class="reference external" href="https://github.com/dbs/postgresql-full-text-search-engine">GitHub</a>.</p>
<p><img alt="image3" class="serendipity-image-center" src="/uploads/talks/2013/postgresql_fts_3.png" style="width: 960px; height: 720px;" /></p>
<p>Last summer, the Laurentian University Library digitized 50 years of
student newspapers - over 1,000 issues.</p>
<p>We posted them all to the Internet Archive and got OCR’d text as a
result.</p>
<p>But finding things within a particular collection on the Internet
Archive can be difficult for most people, so we felt the need to create
a local search solution.</p>
<p><img alt="image4" class="serendipity-image-center" src="/uploads/talks/2013/postgresql_fts_4.png" style="width: 960px; height: 720px;" /></p>
<p>We were already using PostgreSQL to track all of the individual issues,
with attributes like newspaper name, volume, edition, and publication date.</p>
<p>This gave us the ability to filter through the issues by year and issue
number, which was a good start. But we also wanted to be able to search
the full text for strings like “Pierre Trudeau” or “Mike Harris”.</p>
<p>A common approach is to feed the data into a dedicated search engine
like Solr or Sphinx, and build a search interface on top of that.</p>
<p>But PostgreSQL has featured full-text search support since the 8.3
release in 2008. So I opted to keep the moving parts to a minimum and
reuse my existing database as my search platform.</p>
<p><img alt="image5" class="serendipity-image-center" src="/uploads/talks/2013/postgresql_fts_5.png" style="width: 960px; height: 720px;" /></p>
<p>Our example table contains a “doc” column of type TEXT; that’s where we
store the text that we want to search and display. We also have a “tsv”
column of type TSVECTOR to store the normalized text.</p>
<p>The TSVECTOR column is typically maintained by a trigger that fires
whenever the corresponding row is created or updated. So... you just
insert TEXT into the doc column, and the trigger maintains the tsv
column for you.</p>
<p>PostgreSQL includes the tsvector_update_trigger() function for your
convenience, as well as tsvector_update_trigger_column(), which applies
different language-oriented normalizations based on the value of a
specified column. Naturally, you can also define your own trigger that
invokes the to_tsvector() function.</p>
<p>To provide good performance, as with any relational table, you need to
define appropriate indexes. In the case of full-text search, you want to
define a GIN (or GiST) index on the TSVECTOR column.</p>
<p>Note: GIN indexes take longer to build, but provide faster lookup than
GiST.</p>
<p>Finally, we INSERT the text documents into the database. The ID and
TSVECTOR columns are automatically generated for us.</p>
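Assembled, the steps above (table, trigger, index, insert) can be sketched as DDL kept in Python strings, ready to hand to a driver such as psycopg2. The doc and tsv column names come from the talk; the table name, driver choice, and exact statement details are my assumptions:

```python
# Reconstruction of the schema described above. The doc and tsv column
# names come from the talk; everything else is a plausible sketch.
CREATE_TABLE = """
CREATE TABLE docs (
    id  SERIAL PRIMARY KEY,
    doc TEXT,      -- the raw text we search and display
    tsv TSVECTOR   -- the normalized text, maintained by the trigger
);
"""

# tsvector_update_trigger() keeps tsv in sync with doc on INSERT/UPDATE.
CREATE_TRIGGER = """
CREATE TRIGGER docs_tsv_update BEFORE INSERT OR UPDATE ON docs
    FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(tsv, 'pg_catalog.english', doc);
"""

# A GIN index: slower to build than GiST, but faster to search.
CREATE_INDEX = "CREATE INDEX docs_tsv_idx ON docs USING GIN(tsv);"

# Only doc needs to be supplied; id and tsv are generated automatically.
INSERT_DOC = "INSERT INTO docs (doc) VALUES (%s);"
```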
<p><img alt="image6" class="serendipity-image-center" src="/uploads/talks/2013/postgresql_fts_6.png" style="width: 960px; height: 720px;" /></p>
<p>Each text document is:</p>
<ol class="arabic simple">
<li>Parsed into tokens (words, white space, URIs, file paths)</li>
<li>Each token is then normalized with one or more dictionaries (word
tokens may be stemmed and have stop words removed; URIs might have “www.”
stripped; everything gets folded to lower case)</li>
<li>Et voila: the text-search vector for our text document has been
created!</li>
</ol>
<p>PostgreSQL bundles a number of language-specific dictionaries to support
different stemming algorithms and default sets of stopwords.</p>
<p>In this example, we can see that PostgreSQL has stemmed “sketching” to
remove the verb suffix, removed “the” altogether as a stop word, and
stemmed “trees” to its singular form.</p>
<p>You can also see that the TSVECTOR stores the position where each token
occurs, to support relevancy ranking algorithms that boost terms
located close together.</p>
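The normalization above is easy to try for yourself; this sketch keeps the statement as a string a driver could run (using the 'english' configuration, which is an assumption):

```python
# The example from the talk: stemming, stop-word removal, and positions.
TSVECTOR_DEMO = "SELECT to_tsvector('english', 'Sketching the trees');"
# In psql this yields: 'sketch':1 'tree':3
# "the" is dropped as a stop word, and each surviving token keeps the
# position it held in the original text.
```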
<p><img alt="image7" class="serendipity-image-center" src="/uploads/talks/2013/postgresql_fts_7.png" style="width: 960px; height: 720px;" /></p>
<p>But it's a doozy of a query!</p>
<p><img alt="image8" class="serendipity-image-center" src="/uploads/talks/2013/postgresql_fts_8.png" style="width: 960px; height: 720px;" /></p>
<p>Yes, there is a <em>lot</em> going on here.</p>
<p>First, this is just a regular SQL statement that happens to use the WITH
clause to define named subqueries (“q” and “ranked”).</p>
<p>The to_tsquery() function takes an incoming full-text search query and
converts that into a parsed, normalized query.</p>
<p>The ts_rank_cd() function compares the TSVECTOR column against the
query to determine its relevancy score.</p>
<p>We need to restrict it to rows that match our query, so we use the @@
operator (PostgreSQL allows data-type specific operators like this) and
then take the top ten.</p>
<p><strong>Note</strong>: the query, limit, and offset are hardcoded here for
illustrative purposes, but in the actual application these are supplied
as parameters to the prepared SQL statement.</p>
<p>Finally, we use the ts_headline() function to give us a highlighted
snippet of the results.</p>
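The whole statement can be reconstructed as a Python string. The shape follows the slide's WITH-based query, but the docs/tsv/doc names and the hardcoded 'trudeau' query are illustrative assumptions:

```python
# Sketch of the ranked full-text search query described above; in the
# real application the query, LIMIT, and OFFSET arrive as parameters.
SEARCH_SQL = """
WITH q AS (
    SELECT to_tsquery('english', 'trudeau') AS query
), ranked AS (
    SELECT id, doc, ts_rank_cd(tsv, query) AS rank
      FROM docs, q
     WHERE tsv @@ query      -- keep only rows that match the query
     ORDER BY rank DESC
     LIMIT 10 OFFSET 0       -- take the top ten
)
SELECT id, rank,
       ts_headline(doc, q.query) AS snippet  -- highlighted excerpt
  FROM ranked, q
 ORDER BY rank DESC;
"""
```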
<p><img alt="image9" class="serendipity-image-center" src="/uploads/talks/2013/postgresql_fts_9.png" style="width: 960px; height: 720px;" /></p>
<p>The harvester is a Python 3 application that uses the ‘postgresql’
module to connect to the database.</p>
<p>The REST and Web applications in the IA Harvester application are Python
2, largely because they use Flask (which was Python 2 only at the time). But in
the demo application, I’ve converted them to Python 3.</p>
<p>While I could have simply written the Web and REST applications as a
single Flask Web app that talks directly to the database, I opted to
couple them via a JSON interface for a few reasons:</p>
<ul class="simple">
<li>Decoupling the web front end from the search back end means that I
can augment or swap out either piece entirely without affecting the
overall experience.</li>
<li>If performance becomes an issue, I could add caching where profiling
warrants it.</li>
<li>If the service ends up being incredibly popular and PostgreSQL turns
into a bottleneck, even with caching, I could set up a Solr or Sphinx
instance instead.</li>
</ul>
<p>I can change from Flask to another Web application framework on either
piece.</p>
<p>I can separate the hosts if I need to throw more hardware at the
service, and/or virtualize it on something like Google App Engine.</p>
<p><img alt="image10" class="serendipity-image-center" src="/uploads/talks/2013/postgresql_fts_10.png" style="width: 960px; height: 720px;" /></p>
<p>With so many Python web frameworks available, why Flask?</p>
<ul class="simple">
<li>Routes as decorators was admittedly the biggest draw, simplifying
routing enormously!</li>
<li>Solid Unicode support was a must as well, as our newspapers are
bilingual (English / French).</li>
</ul>
<p>At the time I opted to try out Flask, the project’s public stance
towards Python 3 was not warm. However, with the 0.10 release in June
2013, all that has (thankfully!) changed; Python 3.3 is supported.</p>
<p><img alt="image11" class="serendipity-image-center" src="/uploads/talks/2013/postgresql_fts_11.png" style="width: 960px; height: 720px;" /></p>
<p>Flexible format:</p>
<ul class="simple">
<li>query: gives us the original query so we can echo that back to the
user</li>
<li>results: a list of the highlighted snippets returned from
our full-text search</li>
<li>years: a facet for the years in which hits were found, including
the number of hits per year</li>
<li>collections: a facet for the collections in which hits were found,
including the number of hits per collection</li>
<li>meta:<ul>
<li>total: the total number of hits</li>
<li>page: the page we’re on</li>
<li>limit: the number of hits per page</li>
<li>results: the number of hits on this page of results</li>
</ul>
</li>
</ul>
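The response format listed above can be sketched as a plain Python dictionary serialized to JSON. All of the values here are made-up illustrations, not real search results:

```python
import json

# A sketch of the JSON interface between the REST and Web applications;
# every value below is invented for illustration.
response = {
    "query": "pierre trudeau",
    "results": [
        {"id": 42, "snippet": "... <b>Pierre</b> <b>Trudeau</b> spoke on campus ..."},
    ],
    "years": {"1968": 3, "1972": 5},          # hits per year
    "collections": {"Lambda": 7, "Reaction": 1},  # hits per collection
    "meta": {"total": 8, "page": 1, "limit": 10, "results": 1},
}

# The REST layer serializes the response; the Web layer decodes it.
payload = json.dumps(response)
decoded = json.loads(payload)
```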
<p><img alt="image12" class="serendipity-image-center" src="/uploads/talks/2013/postgresql_fts_12.png" style="width: 960px; height: 720px;" /></p>
<p>We’ve already seen the decorator-route pattern before, and of course we
need to quote, encode, and decode our search URL and results.</p>
<p>The Flask-specific parts are helper methods for getting GET params from
the query string, and rendering the template (in this case, via Jinja2),
by passing in values for the template placeholders.</p>
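The Flask helpers aside, the quote/encode/decode step mentioned above can be sketched with the standard library alone; the /search path is an assumption:

```python
from urllib.parse import urlencode, parse_qs

# Search terms (including accented French text, since the newspapers are
# bilingual) must be URL-encoded before the Web app calls the REST back end.
query = "fédération étudiante"
params = urlencode({"q": query, "page": 1})
url = "/search?" + params

# The REST side decodes the query string to recover the original terms.
decoded = parse_qs(params)["q"][0]
assert decoded == query
```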
<p><img alt="image13" class="serendipity-image-center" src="/uploads/talks/2013/postgresql_fts_13.png" style="width: 960px; height: 720px;" /></p>
<p><img alt="image14" class="serendipity-image-center" src="/uploads/talks/2013/postgresql_fts_14.png" style="width: 960px; height: 720px;" /></p>
<p>At this point, the UI is functional but spartan. I’m a database / code
guy, not a designer. Luckily, I have a student working on improving the
front end (hi Emily!)</p>
<p><strong>Further information:</strong></p>
<ul>
<li><p class="first">Demonstration application:</p>
<ul class="simple">
<li><a class="reference external" href="https://gitorious.org/postgresql-full-text-search-engine">Gitorious
repository</a>
/
<a class="reference external" href="https://github.com/dbs/postgresql-full-text-search-engine">Github</a></li>
</ul>
</li>
<li><p class="first">Internet Archive harvester project:
<a class="reference external" href="https://gitorious.org/ia-harvester">https://gitorious.org/ia-harvester</a></p>
</li>
<li><p class="first">PostgreSQL full-text search:</p>
<ul class="simple">
<li><a class="reference external" href="http://postgresql.org/docs/9.3/static/textsearch.html">PostgreSQL official
documentation</a></li>
<li><a class="reference external" href="/archives/260-Seek-and-ye-shall-find-full-text-search-in-PostgreSQL.html">Seek and ye shall find: PostgreSQL full-text search (presentation
at PostgresOpen
2012)</a></li>
</ul>
</li>
<li><p class="first"><a class="reference external" href="http://flask.pocoo.org/docs">Flask official docs</a></p>
</li>
<li><p class="first"><a class="reference external" href="http://jinja.pocoo.org/docs">Jinja2 official docs</a></p>
</li>
</ul>
PyCon Canada 2013 - PostgreSQL full-text search and Flask
2013-07-05T09:00:00-04:00 · dan@coffeecode.net (Dan Scott)

<p>On August 10, 2013, I'll be giving a twenty-minute talk at PyCon Canada
on <a class="reference external" href="https://2013.pycon.ca/en/schedule/presentation/19/">A Flask of full-text search with
PostgreSQL</a>. I'm
very excited to be talking about Python, at a Python conference, and to
be giving the Python audience a peek at PostgreSQL's full-text search
capabilities. With a twenty minute slot, I'll be leaning on my code4lib
experience to compress the right amount of technical information into an
entertaining package.</p>
<p>Setting aside my talk, the line-up for PyCon Canada looks fantastic; the
keynote speakers are <a class="reference external" href="https://2013.pycon.ca/en/speaker/profile/92/">Karen
Brennan</a>, <a class="reference external" href="https://2013.pycon.ca/en/speaker/profile/93/">Hilary
Mason</a>, and <a class="reference external" href="https://2013.pycon.ca/en/speaker/profile/91/">Jacob
Kaplan-Moss</a>, and there
are a <em>ton</em> of great talks. Did I mention that I'm really looking
forward to this conference?</p>
<p><strong>Update 2013-07-11:</strong> Now that the schedule is official, the
presentation URL needed to be updated. Also, the impetus for this
proposal came straight from PGCon 2013, where the PostgreSQL community
was urged to get the good word out about PostgreSQL to other
communities. Et voila!</p>
Introducing SQL to Evergreen administrators, round two
2013-02-16T02:32:00-05:00 · dan@coffeecode.net (Dan Scott)

<p><a class="reference external" href="/archives/212-Introduction-to-SQL-for-Evergreen-administrators.html">Three years ago</a> I was
asked to create and deliver a two-day course introducing SQL to Evergreen
users. Things went well and I was able to share the resulting materials with
the Evergreen and PostgreSQL community. Perhaps one of my happiest moments at
the Evergreen conference last year was when one of the participants in that
course told me that many of his fellow participants were still successfully
writing SQL queries and getting work done. Huzzah!</p>
<p>Time went by, and another group, <a class="reference external" href="http://www.ohionet.org">OHIONET</a>, was
running into difficulties getting started with PostgreSQL and Evergreen. They
asked me if I would be willing to give the same sort of training I had given a
few years back. "Sure", I said, thinking it would be a great opportunity to
polish the materials and add some updates to cover new features in PostgreSQL
and Evergreen. We also opted to skip the travel and do an entirely virtual
training session via Google Hangouts, which worked out rather nicely (but
that's a different story).</p>
<p>As it turned out, I probably ended up putting about four days' worth of effort
(crammed into lots of late nights, weekends, and vacation days) into
overhauling the instruction materials. But the results were worth it, in my
opinion; I'm rather proud of the content, and while I believe it stands up on
its own, the guidance that I was able to provide during the live instruction
sessions was well-received by the participants.</p>
<p>Thus, I am pleased to be able to offer to the broader community the latest
version of the Introduction to SQL for Evergreen Administrators, under a
Creative Commons Attribution-ShareAlike 3.0 (Unported) license.</p>
<ul class="simple">
<li>Reference documentation: 30 pages introducing SQL with examples drawn
from the Evergreen schema:
(<a class="reference external" href="http://bzr.coffeecode.net/intro_to_sql/v2/introduction_to_sql.html">HTML</a>)
(<a class="reference external" href="http://bzr.coffeecode.net/intro_to_sql/v2/introduction_to_sql.pdf">PDF</a>)
(<a class="reference external" href="http://bzr.coffeecode.net/intro_to_sql/v2/introduction_to_sql.epub">ePub</a>)
(<a class="reference external" href="http://bzr.coffeecode.net/intro_to_sql/introduction_to_sql.txt">AsciiDoc</a>)</li>
<li>Presentation:
(<a class="reference external" href="http://bzr.coffeecode.net/intro_to_sql/SQL_instruction.odp">LibreOffice Impress</a>)
(<a class="reference external" href="http://bzr.coffeecode.net/intro_to_sql/v2/SQL_instruction.pdf">PDF</a>)</li>
<li>Solutions to exercises:
(<a class="reference external" href="http://bzr.coffeecode.net/intro_to_sql/solutions_day_1.txt">Day 1</a>)
(<a class="reference external" href="http://bzr.coffeecode.net/intro_to_sql/solutions_day_2.txt">Day 2</a>)</li>
</ul>
<p>So, a huge thanks to OHIONET for giving me the impetus to overhaul this
material, and for giving me a chance to introduce them to the wonders of SQL
with PostgreSQL, and to the inner workings of the Evergreen schema. It was a
blast! And thanks for agreeing to let me share these materials with the broader
community.</p>
Triumph of the tiny brain: Dan vs. Drupal / Panels2012-10-18T21:48:00-04:002012-10-18T21:48:00-04:00dan@coffeecode.net (Dan Scott)tag:coffeecode.net,2012-10-18:/triumph-of-the-tiny-brain-dan-vs-drupal-panels.html<p>A while ago I inherited responsibility for a Drupal 6 instance and a
rather out-of-date server. (You know it's not good when your production
operating system is so old that it is no longer getting security
updates.)</p>
<p>I'm not a Drupal person. I dabbled with Drupal years and years ago …</p><p>A while ago I inherited responsibility for a Drupal 6 instance and a
rather out-of-date server. (You know it's not good when your production
operating system is so old that it is no longer getting security
updates).</p>
<p>I'm not a Drupal person. I dabbled with Drupal years and years ago when
I was heavily into PHP, but it never stuck with me. Every time I poked
around at the database schema, with serialized objects stuck inside
columns, I found something else that I wanted to work on instead. Thus,
inheriting a Drupal instance wasn't something I had been looking forward
to. As this production server was running a number of different services
that were in use by our library, I went through a number of trial runs
to ensure that the base packages wouldn't introduce regressions or
outages. Fast-forward past a reasonably successful early-morning upgrade
from Debian Lenny to Squeeze and I was able to start looking at
addressing the Drupal instance that was also approximately 18 months out
of date.</p>
<p>Initially, after I worked out the how-to of Drupal upgrades (in short:
upgrade just Drupal core, then upgrade the modules), I thought all was
well. I even got over the hump of realizing that our instance had had
all of the modules dumped into Drupal's core directory, rather than
<tt class="docutils literal">sites/all/modules</tt>, and (even more impressively) got over the problem
that the core <em>bluemarine</em> theme had been hacked directly rather than
having been separated out into a new custom theme. After working through
those learning pains, I realized that somewhere in all of the Drupal and
module upgrades, something got "more secure" and started truncating
IMG links to files with spaces in them at the first space. So
"foo%20bar.jpg" was becoming "foo.jpg" and we were getting 404s
everywhere.</p>
<p>Did I mention that I didn't notice this until I upgraded our production
instance? Oh yes, I went through iteration after iteration of upgrades
on the test server, and dutifully fixed up the problems that I found in
the subset of content that I was testing against. I discovered and fixed
problems like the production server content linking directly to the test
server (slight copy-and-paste errors on the part of the content
creators, I suppose). But I didn't notice all of the 404s, because who
uploads images with spaces in their filename?</p>
<p>Turns out, everyone else in my library does that. Of course! And from
what I was able to piece together via Google and browsing drupal.org,
there was supposed to be some sanitization of the incoming filenames so
that spaces would be normalized, etc. But either that wasn't introduced
until well after our content had been created, or my predecessor had
lightly hacked one of the modules, or Drupal itself, and hadn't bothered
to use a source code repository to track those customizations. So,
realizing that I needed to make some bulk changes, I went at it with a
two-step plan:</p>
<ol class="arabic simple">
<li>Create symbolic links for both the truncated filename and the
spaces-normalized-to-underscores filenames. Creating symlinks for the
truncated filenames would fix the 404s immediately, at the cost of
some clash in the intended targets; there were plenty of
<tt class="docutils literal">Foo illustration.JPG</tt> and <tt class="docutils literal">Foo info.JPG</tt> pairs of files, but
like the Highlander, there can be only one <tt class="docutils literal">Foo.JPG</tt>.</li>
<li>Munge the database entries so that all of those now apparently
insecure %20-containing filenames would become underscores.</li>
</ol>
<p>If you use Drupal, or Drupal with the Panels module, you might
know that the database schema suffers from some fairly horrible tricks
being played on it. In this case, the Panels module creates a
<tt class="docutils literal">panels_pane</tt> table with a <tt class="docutils literal">configuration</tt> TEXT column. Based on the
name alone, it might seem odd that this column is used to store the HTML
content of the corresponding panel. Even odder to me is that this is not
just a TEXT column, it's a column that expects a very particular
structure - something like:</p>
<pre class="literal-block">
a:5:{s:11:"admin_title";s:5:"RACER";s:5:"title";s:0:"";s:4:"body";s:639:"<p><img width="225" height [...]}
</pre>
<p>Ah, nothing like storing an object within a single database column. Of
particular interest was the result that I had when I tested updating the
column value with a basic "replace(configuration, '%20', '_')" - the
panel showed only <strong>n/a</strong>, presumably because the size (defined by the
<tt class="docutils literal">s</tt> properties in the object) for the "body" text property was no
longer a match. That would be an instance of
<a class="reference external" href="http://drupal.org/node/926448">http://drupal.org/node/926448</a> - so okay, clearly I had to change tactics
and update the entire object.</p>
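<p>To make that failure mode concrete, here is a minimal Python sketch of the fix for a single serialized string field. The <tt class="docutils literal">fix_body</tt> helper is hypothetical (it is not part of Drupal or of the Perl script later in this post), and it assumes ASCII content with no embedded <tt class="docutils literal">";</tt> sequences; PHP's serialize() actually counts bytes, not characters, so real data is messier:</p>

```python
import re

def fix_body(chunk: str) -> str:
    """Rewrite a PHP-serialized string chunk like s:13:"foo%20bar.jpg"
    and recompute the length prefix so unserialize() stays happy.
    Hypothetical sketch: assumes ASCII content with no embedded '";'."""
    m = re.match(r's:\d+:"(.*)"$', chunk, re.DOTALL)
    content = m.group(1).replace('%20', '_')
    # The s:<len> prefix must match the new content length, which is
    # exactly what a bare replace() on the column fails to do.
    return 's:%d:"%s"' % (len(content), content)

fix_body('s:13:"foo%20bar.jpg"')  # 's:11:"foo_bar.jpg"'
```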
<p>I tried quickly finding the Drupal way to do this: clearly there's an
API and there must be some simple way to retrieve an object, change it's
values, and update it so that the serialized object gets stored in the
database and Drupal is happy. However, I couldn't find a simple
tutorial, and trying #drupal on Freenode was unfortunately fruitless as
well (although some people did try to suggest running REPLACE() at the
database level, that was nice but they didn't recognize that that would
actually damage things significantly).</p>
<p>So... out came the Perl, and here's what I hacked together:</p>
<pre class="literal-block">
#!/usr/bin/perl
use strict;
use warnings;

foreach (<DATA>) {
    chomp();
    my $i = 0;
    my $body = 0;
    my @fixed;
    my @row = split /\t/;
    my $pid = $row[1];
    my $configuration = $row[0];
    my @chunks = split /";s:/, $configuration;
    foreach my $chunk (@chunks) {
        if (!$i++) {
            push @fixed, $chunk;
            next;
        }
        if ($chunk =~ m/"body/) {
            $body = 1;
            push @fixed, $chunk;
            next;
        }
        if ($body) {
            my ($length, $content) = $chunk =~ m/^(\d+):"(.+)$/;
            for (my $j = 0; $j < 50; $j++) {
                $content =~ s{(/pictures/[^\./]+?)%20}{$1_}g;
            }
            $content =~ s{%20}{+}g;
            $length = length($content);
            $chunk = "$length:\"$content";
            $body = 0;
        }
        push @fixed, $chunk;
    }
    print 'UPDATE panels_pane SET configuration = $ashes$'
        . join('";s:', @fixed)
        . '$ashes$' . " WHERE pid = $pid;\n";
}
__DATA__
</pre>
<p>Against the trusty database (I ♥ PostgreSQL!), I ran
<tt class="docutils literal">COPY (SELECT configuration, pid FROM panels_pane WHERE configuration ~ '%20') TO 'conf_pids.out';</tt>,
then slapped the Perl code on top and generated a load of UPDATE
statements. It's far from my best Perl code, but it worked and once I
gave up on doing things the Drupal way I was able to put it together in
a handful of minutes. I now have a functional Drupal 6 instance again,
updated such that there are no known security vulnerabilities with
either core or the modules we're using, and there are no broken image
links.</p>
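<p>The shape of that round trip (COPY out, rewrite, emit dollar-quoted UPDATEs) can be sketched in Python. This is a hedged illustration, not the code I ran: <tt class="docutils literal">emit_updates</tt> is a hypothetical helper that assumes the tab-separated layout produced by the COPY above, and it takes the actual rewriting logic as a callable, since a naive string replace would corrupt the serialized length prefixes:</p>

```python
def emit_updates(lines, rewrite_config):
    """Turn COPY output rows ('configuration<TAB>pid') into dollar-quoted
    UPDATE statements. rewrite_config must fix the %20 filenames *and*
    recompute the s:<len> prefixes, as the Perl script does."""
    for line in lines:
        configuration, pid = line.rstrip('\n').split('\t')
        fixed = rewrite_config(configuration)
        yield ('UPDATE panels_pane SET configuration = $ashes$%s$ashes$'
               ' WHERE pid = %s;' % (fixed, pid))

# Toy usage with a deliberately simplistic rewriter:
for stmt in emit_updates(['foo%20bar\t42\n'], lambda c: c.replace('%20', '_')):
    print(stmt)
```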
<p>And now I need to begin working towards either grokking Drupal, or
finding a content management system that my tiny brain can comprehend,
because I don't want to have to go through these kinds of contortions
again with future upgrades... Suggestions welcome!</p>
Seek and ye shall find: full-text search in PostgreSQL2012-09-18T22:19:00-04:002012-09-18T22:19:00-04:00dan@coffeecode.net (Dan Scott)tag:coffeecode.net,2012-09-18:/seek-and-ye-shall-find-full-text-search-in-postgresql.html<p>I'm at <a class="reference external" href="http://postgresopen.org/2012">PostgresOpen</a> in Chicago, and
just gave my talk on <a class="reference external" href="http://stuff.coffeecode.net/2012/pgopen_fulltext/pgsql-fulltext-intro.html">Implementing full-text search in
PostgreSQL</a>.
The goal was to give novice users the understanding and examples they
needed to build a workable search solution using PostgreSQL's full-text
search. And it went (in my opinion) well - an almost full room …</p><p>I'm at <a class="reference external" href="http://postgresopen.org/2012">PostgresOpen</a> in Chicago, and
just gave my talk on <a class="reference external" href="http://stuff.coffeecode.net/2012/pgopen_fulltext/pgsql-fulltext-intro.html">Implementing full-text search in
PostgreSQL</a>.
The goal was to give novice users the understanding and examples they
needed to build a workable search solution using PostgreSQL's full-text
search. And it went (in my opinion) well - an almost full room, lots of
audience interaction (thanks Bruce Momjian, Jonathan Scott, Jonathan
Katz, et al.), a lot of nodding heads, and nobody running out of the room
screaming. So... yay!</p>
<p>A few takeaways from prepping for the presentation:</p>
<ul class="simple">
<li>I suspect that some effort on making the full-text search parser
extensible would go a long way towards resolving problems that you
currently have to work around by manipulating the text before it hits
the parser. For example, if you pass in a string like <tt class="docutils literal">file/path</tt>,
PostgreSQL classifies that as a <tt class="docutils literal">file</tt> token and stores it as-is -
but you might want to be able to search against either "file" or
"path" as well as the concatenated form. Right now you have to
preparse that string to break it up yourself (via regexp_replace()
or the like), but it would be much nicer if you could teach the
parser new tricks (without having to modify the source and recompile
it).</li>
<li><tt class="docutils literal">ts_headline()</tt> might be a bottleneck for large documents - and (a)
solution might be to just bust the document up. *Note to self*: dig
into the underlying code to see if there's any chance of using
indexes to enable improvement.</li>
<li>Ran into a bug with <tt class="docutils literal">ts_rewrite()</tt> while building the tutorial, and
have not yet worked out whether that was due to my local
configuration or an actual bug... <strong>TO DO!</strong></li>
</ul>
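<p>The preparsing workaround from the first bullet can be illustrated with a small sketch. <tt class="docutils literal">preparse</tt> here is a hypothetical client-side stand-in; in PostgreSQL itself you would do the equivalent expansion with regexp_replace() before handing the text to to_tsvector():</p>

```python
import re

def preparse(text: str) -> str:
    """Expand slash-joined tokens so both halves become searchable while
    keeping the concatenated form: 'file/path' -> 'file/path file path'.
    Hypothetical sketch; only handles the simple two-part case."""
    return re.sub(r'(\w+)/(\w+)', r'\1/\2 \1 \2', text)

preparse('stored in file/path today')  # 'stored in file/path file path today'
```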
<p>Also - PostgresOpen has had a great vibe so far; a relatively small but
very high-quality conference with lots of knowledgeable, friendly
participants. Selena (one of the organizers) had a goal of creating an
environment similar to PgCon, and I would say from my limited experience
attending one PgCon and one PostgresOpen that she and the rest of the
conference team have done a great job!</p>
Running libraries on PostgreSQL: PGCon 2012 talk2012-05-20T17:57:00-04:002012-05-20T17:57:00-04:00dan@coffeecode.net (Dan Scott)tag:coffeecode.net,2012-05-20:/running-libraries-on-postgresql-pgcon-2012-talk.html<p>On Friday, May 18th I gave <a class="reference external" href="http://www.pgcon.org/2012/schedule/events/465.en.html">a talk</a> at the PGCon 2012
conference on the use of PostgreSQL by the Evergreen project. My talk fell in
the <em>case study</em> track, which meant that I had been asked to describe to
PostgreSQL developers what Evergreen was, why it was a project …</p><p>On Friday, May 18th I gave <a class="reference external" href="http://www.pgcon.org/2012/schedule/events/465.en.html">a talk</a> at the PGCon 2012
conference on the use of PostgreSQL by the Evergreen project. My talk fell in
the <em>case study</em> track, which meant that I had been asked to describe to
PostgreSQL developers what Evergreen was, why it was a project they might want
to care about, enumerate the advantages that Evergreen gets from using
PostgreSQL, and where our project has some difficulties with PostgreSQL.</p>
<p>I have given a lot of talks before, but I’m used to being on the developer side
of the discussion. In this case, the tables were turned; with noted PostgreSQL
contributors like Josh Berkus, Chris Brown, Simon Riggs, and Robert Treat in
the audience, I was a user talking to the developers of something that I was
very much dependent on and which I understood at a much more basic level than
they did. This was both liberating <em>and</em> humbling; it definitely adds some
perspective to my experiences as a developer in the Evergreen project.</p>
<p>Along with my slides, the whole talk has been professionally recorded - both
video and audio - thanks to Heroku’s sponsorship, so you will be able to relive
each and every word if you really want to. I’ll summarize the main points that
I wanted to convey to the PostgreSQL developers:</p>
<ul class="simple">
<li>I was quite candid that most libraries can’t afford dedicated database
administrators, and that therefore the more that PostgreSQL can provide
reasonable out-of-the-box configuration settings, the better. For example,
results from <a class="reference external" href="http://evergreen-ils.org/~denials/postgresql_survey.html">the survey that I sent out at the last minute</a> (THANK YOU to
the nine sites that responded!) showed many sites running with a default
statistics target of 50, whereas the default had been increased to 100 back
in PostgreSQL 8.1 and much higher settings are often recommended to help the
planner make its decisions. That said, my survey didn’t ask for table-level
statistics settings (did you <strong>know</strong> that you could change the statistics
for particular tables?), so perhaps some sites are using higher statistics
levels for particular tables and a lower default threshold.</li>
<li>It was probably hokey, but I noted that, as libraries are often called the
heart of their community, PostgreSQL was effectively the heart of
Evergreen — and I invited the PostgreSQL community to help our heart beat
faster. With the Evergreen Oversight Board contemplating a strategic
investment fund for initiatives that will have a long-term benefit to
Evergreen, this might be an avenue for getting PostgreSQL experts to assist
us on areas that represent particular bottlenecks (beyond helping us out of
the goodness of their own hearts). As well, I invited the PostgreSQL
community to join in advocacy efforts to get their local libraries to
consider adopting Evergreen.</li>
<li>I described, at a high-level, many of the PostgreSQL features that Evergreen
relies on (full-text search, stored procedures, Hstore, inheritance) and
tried to convey why our schema takes up 355 tables (and counting) to deal
with what, from outside a library perspective, must seem like a relatively
simple problem to deal with. And of course I gave most of the credit for
Evergreen’s PostgreSQL-savviness on multiple levels to Mike Rylander.</li>
</ul>
<p>The talk was well-received, based on a number of people who approached me
afterward to continue the discussion. Josh called it one of the first times he
had seen a presentation designed to solicit assistance directly from the
developers in attendance (I probably overplayed the "help us poor harried
library system administrators" hand) and thought that it hit the mark for a
case study; similarly, Simon was quite interested in Evergreen’s adoption
patterns with (I suspect) an eye towards offering possible consulting in
administration and optimization efforts.</p>
<p>On the "immediate takeaways" from that talk:</p>
<ul class="simple">
<li>For straightforward connection pooling, pgbouncer is the current
recommendation over the more flexible but more complicated pgpool-II.</li>
<li>Recent versions of Slony have lifted limitations that bit us in the
past, like the inability to replicate a TRUNCATE command.</li>
<li>Solr, as a potential alternative to PostgreSQL’s full-text search, is
seen as fast but brittle to manage, and adds in overhead to maintain
consistency with the contents of the database. (I’m not so sure about the
brittleness, given Hathitrust’s ability to run a massive Solr index, but it
is worth following up on…)</li>
<li>Streaming replication in 9.1 has improved significantly over 9.0,
although you’ll still want to have WAL archiving in case of disaster.</li>
</ul>
<p>I have a lot more to say about the intersection of the PostgreSQL and Evergreen
communities in general, but on the whole I think that a closer relationship has
been long overdue. I was delighted that Ben Shum and Robin Isard were both able
to attend the conference, and I firmly believe that building more PostgreSQL
development and administration expertise within the Evergreen community is
critical to our long-term success. While I have long been an advocate of
pointing community members to the documentation of the underlying
infrastructure components for specific administration recommendations, I
believe that effective PostgreSQL tuning and administration is so critical to
the successful implementation of a production Evergreen site that we should add
a section to the Evergreen documentation containing a small set of
considerations and/or processes for going into production—and I hope to start
that relatively soon.</p>
Running libraries on PostgreSQL: PGCon 2012 talk2012-05-20T17:57:00-04:002012-05-20T17:57:00-04:00dan@coffeecode.net (Dan Scott)tag:coffeecode.net,2012-05-20:/running-libraries-on-postgresql-pgcon-2012-talk.html<p>On Friday, May 18th I gave <a class="reference external" href="http://www.pgcon.org/2012/schedule/events/465.en.html">a talk</a> at the PGCon 2012
conference on the use of PostgreSQL by the Evergreen project. My talk fell in
the <em>case study</em> track, which meant that I had been asked to describe to
PostgreSQL developers what Evergreen was, why it was a project …</p><p>On Friday, May 18th I gave <a class="reference external" href="http://www.pgcon.org/2012/schedule/events/465.en.html">a talk</a> at the PGCon 2012
conference on the use of PostgreSQL by the Evergreen project. My talk fell in
the <em>case study</em> track, which meant that I had been asked to describe to
PostgreSQL developers what Evergreen was, why it was a project they might want
to care about, enumerate the advantages that Evergreen gets from using
PostgreSQL, and where our project has some difficulties with PostgreSQL.</p>
<p>I have given a lot of talks before, but I’m used to being on the developer side
of the discussion. In this case, the tables were turned; with noted PostgreSQL
contributors like Josh Berkus, Chris Browne, Simon Riggs, and Robert Treat in
the audience, I was a user talking to the developers of something that I was
very much dependent on and which I understood at a much more basic level than
they did. This was both liberating <em>and</em> humbling; it definitely adds some
perspective to my experiences as a developer in the Evergreen project.</p>
<p>Along with my slides, the whole talk has been professionally recorded - both
video and audio - thanks to Heroku’s sponsorship, so you will be able to relive
each and every word if you really want to. I’ll summarize the main points that
I wanted to convey to the PostgreSQL developers:</p>
<ul class="simple">
<li>I was quite candid that most libraries can’t afford dedicated database
administrators, and that therefore the more that PostgreSQL can provide
reasonable out-of-the-box configuration settings, the better. For example,
results from <a class="reference external" href="http://evergreen-ils.org/~denials/postgresql_survey.html">the survey that I sent out at the last minute</a> (THANK YOU to
the nine sites that responded!) showed many sites running with a default
statistics target of 50, whereas the default had been increased to 100 back
in PostgreSQL 8.4, and much higher settings are often recommended to help the
planner make its decisions. That said, my survey didn’t ask for table-level
statistics settings (did you <strong>know</strong> that you could change the statistics
for particular tables?), so perhaps some sites are using higher statistics
levels for particular tables and a lower default threshold.</li>
<li>It was probably hokey, but I noted that, as libraries are often called the
heart of their community, PostgreSQL was effectively the heart of
Evergreen — and I invited the PostgreSQL community to help our heart beat
faster. With the Evergreen Oversight Board contemplating a strategic
investment fund for initiatives that will have a long-term benefit to
Evergreen, this might be an avenue for getting PostgreSQL experts to assist
us on areas that represent particular bottlenecks (beyond helping us out of
the goodness of their own hearts). As well, I invited the PostgreSQL
community to join in advocacy efforts to get their local libraries to
consider adopting Evergreen.</li>
<li>I described, at a high level, many of the PostgreSQL features that Evergreen
relies on (full-text search, stored procedures, hstore, inheritance) and
tried to convey why our schema takes up 355 tables (and counting) to address
what, from outside a library perspective, must seem like a relatively
simple problem. And of course I gave most of the credit for
Evergreen’s PostgreSQL-savviness on multiple levels to Mike Rylander.</li>
</ul>
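On the per-table statistics point from the survey discussion: the knob is actually set per column, via ALTER TABLE. A minimal sketch (the table and column names here are hypothetical, not drawn from Evergreen's schema):

```sql
-- Sample this heavily-searched column in much more detail than the
-- cluster-wide default_statistics_target setting provides:
ALTER TABLE bib_record ALTER COLUMN title SET STATISTICS 500;

-- The planner only sees the new target after the next ANALYZE:
ANALYZE bib_record;

-- Setting the target back to -1 reverts the column to the cluster-wide
-- default, so a site can keep a low default and raise only hot columns.
```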
<p>The talk was well received, judging by the number of people who approached me
afterward to continue the discussion. Josh called it one of the first times he
had seen a presentation designed to solicit assistance directly from the
developers in attendance (I probably overplayed the "help us poor harried
library system administrators" hand) and thought that it hit the mark for a
case study; similarly, Simon was quite interested in Evergreen’s adoption
patterns with (I suspect) an eye towards offering possible consulting in
administration and optimization efforts.</p>
<p>On the "immediate takeaways" from that talk:</p>
<ul class="simple">
<li>For straightforward connection pooling, pgbouncer is the current
recommendation over the more flexible but more complicated pgpool-II.</li>
<li>Recent versions of Slony have lifted limitations that bit us in the
past, like the inability to replicate a TRUNCATE command.</li>
<li>Solr, as a potential alternative to PostgreSQL’s full-text search, is
seen as fast but brittle to manage, and adds overhead to maintain
consistency with the contents of the database. (I’m not so sure about the
brittleness, given HathiTrust’s ability to run a massive Solr index, but it
is worth following up on…)</li>
<li>Streaming replication in 9.1 has improved significantly over 9.0,
although you’ll still want to have WAL archiving in case of disaster.</li>
</ul>
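Part of the appeal behind the pgbouncer recommendation is how small its configuration is. A minimal pgbouncer.ini sketch (the paths, port, and database name here are illustrative assumptions, not anything Evergreen-specific):

```ini
; pgbouncer.ini -- minimal pooling setup in front of a local PostgreSQL
[databases]
evergreen = host=127.0.0.1 port=5432 dbname=evergreen

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432          ; applications connect here instead of 5432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = session         ; safest mode; transaction pooling is stricter
max_client_conn = 200
default_pool_size = 20
```

With session pooling the application sees ordinary PostgreSQL sessions, which is why it is the lower-risk starting point compared to pgpool-II's richer (but more complicated) feature set.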
<p>I have a lot more to say about the intersection of the PostgreSQL and Evergreen
communities in general, but on the whole I think that a closer relationship has
been long overdue. I was delighted that Ben Shum and Robin Isard were both able
to attend the conference, and I firmly believe that building more PostgreSQL
development and administration expertise within the Evergreen community is
critical to our long-term success. While I have long been an advocate of
pointing community members to the documentation of the underlying
infrastructure components for specific administration recommendations, I
believe that effective PostgreSQL tuning and administration is so critical to
the successful implementation of a production Evergreen site that we should add
a section to the Evergreen documentation containing a small set of
considerations and/or processes for going into production—and I hope to start
that relatively soon.</p>
Tuning PostgreSQL for Evergreen on a test server2008-04-14T18:48:00-04:002008-04-14T18:48:00-04:00dan@coffeecode.net (Dan Scott)tag:coffeecode.net,2008-04-14:/tuning-postgresql-for-evergreen-on-a-test-server.html<p><strong>Update 2008-05-01</strong>: Fixed a typo for sysctl: -a parameter simply
shows all settings; -w parameter is needed to write the setting. Duh.</p>
<p>Once you have decided on and acquired your <a class="reference external" href="http://www.coffeecode.net/archives/155-Test-server-strategies.html">test hardware for
Evergreen</a>,
you need to think about tuning your PostgreSQL database server. Once you
start loading bibliographic records …</p><p><strong>Update 2008-05-01</strong>: Fixed a typo for sysctl: -a parameter simply
shows all settings; -w parameter is needed to write the setting. Duh.</p>
<p>Once you have decided on and acquired your <a class="reference external" href="http://www.coffeecode.net/archives/155-Test-server-strategies.html">test hardware for
Evergreen</a>,
you need to think about tuning your PostgreSQL database server. Once you
start loading bibliographic records, you might notice after 100,000
records or so that your search response times aren't too snappy. Don't
snarl at Evergreen. By default, PostgreSQL ships with very conservative
settings (suitable for a machine with something like 256 MB of RAM!), so
if you don't tune those settings you're getting a false representation of
your system's capabilities.</p>
<p>The "right" settings for PostgreSQL depend significantly on your
hardware and deployment context, but in almost any circumstance you will
want to bump up the settings from the delivered defaults. To give you an
idea of what you need to consider, I thought I would share the settings
that we're currently using on our Evergreen test server at Laurentian
University. You might be able to use these as a starting point and
adjust them accordingly once you've run some representative load tests
against your configuration. And it's useful documentation for me to fall
back on in a few months, when all of this has escaped my grasp <img alt=":-)" class="emoticon" src="/images/smile.png" /></p>
<div class="section" id="the-defaults-as-shipped-in-debian-etch">
<h2>The defaults (as shipped in Debian Etch)</h2>
<p>The defaults in Debian Etch are quite conservative. Consider that our
test server has 12GB of RAM. The default only allocates 1MB of RAM to
work memory (which is critical for sorting performance) and only 8MB of
RAM to shared buffers. Following are the defaults set in
/etc/postgresql/8.1/main/postgresql.conf:</p>
<pre class="literal-block">
# - Memory -

#shared_buffers = 1000            # min 16 or max_connections*2, 8KB each
#temp_buffers = 1000              # min 100, 8KB each
#max_prepared_transactions = 5    # can be 0 or more
# note: increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
#work_mem = 1024                  # min 64, size in KB
#maintenance_work_mem = 16384     # min 1024, size in KB
#max_stack_depth = 2048           # min 100, size in KB

# - Free Space Map -

#max_fsm_pages = 20000            # min max_fsm_relations*16, 6 bytes each
#max_fsm_relations = 1000         # min 100, ~70 bytes each
</pre>
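You can confirm what a running server is actually using by asking it directly from psql. A quick sketch, assuming the stock Etch defaults are still in place:

```sql
-- Values come back in the same units postgresql.conf uses:
SHOW shared_buffers;   -- 1000 (8 KB pages, so roughly 8 MB)
SHOW work_mem;         -- 1024 (KB, i.e. 1 MB per sort operation)
```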
</div>
<div class="section" id="our-test-server-settings">
<h2>Our test server settings</h2>
<p>Our test server has 12 GB of RAM. Assuming that the PostgreSQL defaults
were set for a system with 1 GB of RAM, we should be able to multiply
the memory-based settings by at least a factor of 12. We're a little bit
more aggressive than that in our settings. Note, however, that this is a
single-server install of Evergreen, so we're also running memcached,
ejabberd, Apache, and all of the Evergreen services as well as the
database - oh, and a test instance of an institutional repository, among
other apps - so we're not nearly as aggressive as we would be in a
dedicated PostgreSQL server configuration. Please note that I'm making
no claims that this is the optimal set of configuration values for
PostgreSQL even on our own hardware!</p>
<pre class="literal-block">
# shared_buffers: much of our performance depends on sorting, so we'll set it 100X the default
# some tuning guides suggest cranking this up to as much as 30% of your available RAM
shared_buffers = 100000           # 8K * 100000 = ~0.8 GB

# work_mem: how much RAM each concurrent process is allowed to claim before swapping to disk
# your workload will probably have a large number of concurrent processes
work_mem = 524288                 # 512 MB

# max_fsm_pages: increased because PostgreSQL demanded it
max_fsm_pages = 200000
</pre>
<p>After you change these settings, you will need to restart PostgreSQL to
make the settings take effect.</p>
</div>
<div class="section" id="kernel-tuning">
<h2>Kernel tuning</h2>
<p>In addition to PostgreSQL complaining about max_fsm_pages not being
high enough, your operating system kernel defaults for SysV shared
memory might not be high enough to support the amount of RAM PostgreSQL
demands as a result of your modifications. In one of our test
configurations, we had cranked up work_mem to 8GB; Debian complained
about an insufficient SHMMAX setting, so we were able to adjust that by
running the following command as root to set the kernel SHMMAX to 8GB
(8 * 1024^3 = 8589934592 bytes):
<pre class="literal-block">
sysctl -w kernel.shmmax=8589934592
</pre>
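As a sanity check on that number: 8 GB expressed in bytes is 8 * 1024^3, which shell arithmetic confirms:

```shell
# 8 GB in bytes: 8 * 1024 * 1024 * 1024
echo $((8 * 1024 * 1024 * 1024))   # prints 8589934592
```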
<p>To make this setting sticky through reboots, you can simply modify
/etc/sysctl.conf to include the following line:</p>
<pre class="literal-block">
# Set SHMMAX to 8GB for PostgreSQL
kernel.shmmax=8589934592
</pre>
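With that line in place you can apply the contents of /etc/sysctl.conf immediately rather than waiting for a reboot; this is standard procps behaviour, not anything PostgreSQL-specific (run as root):

```shell
# re-read /etc/sysctl.conf and apply every setting in it
sysctl -p
```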
</div>
<div class="section" id="other-measures">
<h2>Other measures</h2>
<p>Debian Etch comes with PostgreSQL 8.1. The first version of PostgreSQL
8.1 was released in November 2005. That's a long time in computer years.
Version 8.2, which was released less than a year later, "adds many
functionality and performance improvements" (according to the <a class="reference external" href="http://www.postgresql.org/docs/8.2/static/release-8-2.html">release
notes</a>).
If you're not getting the performance you expect from your hardware with
Debian Etch, perhaps a <a class="reference external" href="http://packages.debian.org/etch-backports/postgresql-8.2">backport of PostgreSQL
8.2</a>
would help out.</p>
</div>
<div class="section" id="further-resources">
<h2>Further resources</h2>
<p>This is just a shallow dip into PostgreSQL tuning for Evergreen -
hopefully enough to alert you to some of the factors you need to
consider if you're putting Evergreen into a serious testing environment
or production environment. Here are a few places to dig deeper into the
art of PostgreSQL tuning:</p>
<ul class="simple">
<li>PostgreSQL manual, resource consumption section of server
configuration: <a class="reference external" href="http://www.postgresql.org/docs/8.1/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY">version
8.1</a>
and <a class="reference external" href="http://www.postgresql.org/docs/8.2/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY">version
8.2</a></li>
<li>An annotated version of the 8.0 parameters with more explicit advice
is available at
<a class="reference external" href="http://www.powerpostgresql.com/Downloads/annotated_conf_80.html">http://www.powerpostgresql.com/Downloads/annotated_conf_80.html</a></li>
<li>Some good advice is buried about halfway down <a class="reference external" href="http://cbbrowne.com/info/postgresql.html">Christopher Browne's
page</a> under the heading
"Tuning PostgreSQL", along with links to further resources</li>
<li>The "Performance Whack-A-Mole" presentation at
<a class="reference external" href="http://www.powerpostgresql.com/Docs">PowerPostgreSQL</a> is a great
tutorial for holistic system tuning</li>
</ul>
</div>