Postgres is one of the best F/OSS databases partly because of its immense feature-set.
One of my favorite Postgres features is Full Text Search (FTS). Search is a common requirement of applications. Well-known search engines like Solr and ElasticSearch are often a first choice, but with Postgres in your stack you've got a great chance for Pareto improvement at low complexity cost.
Many projects function just fine with Postgres Full Text Search and other built-in extensions like trigram search (pg_trgm
). GitLab's blog has a great article on their use of Trigram indices to speed up search.
Search Engines: The Next Generation
These days, the debate isn't just Postgres versus Solr or ElasticSearch, a new generation of F/OSS search engines has arrived. To name a few:
Well OK, some of these engines aren't quite new; SQLite is stable and has supported the search use-case for a while and OpenSearch is a fork of ElasticSearch. While not new, both are worth comparing as choices in front of engineers today (spoiler alert: we're going to compare them).
These search engines are all impressive, but it's hard to evaluate these options in a VACUUM
.
If Postgres is already your main database, then it's probably easier to integrate than anything else. But would it be better to take on more complexity by choosing a dedicated search engine?
While search is an endlessly complicated problem and all of these tools are configurable, a comparison of minimal setup and tuning should yield some interesting insights.
Before we get into it though, a brief primer on what "Full Text Search" means, in Postgres land.
A whirlwind tour through FTS in Postgres
If we want to compare other solutions to Postgres FTS, we'd better review how to set up and use Postgres FTS!
Wrangling your existing data
Imagine you have some information to search through, stored in your main database.
Maybe a movies
table with a structure like this:
_10 create table movies (_10 id bigint primary key generated by default as identity,_10 title text not null,_10 original_title text not null,_10 overview text not null,_10 created_at timestamptz not null default now()_10);
While you're inserting data you might perform INSERTs
like this one:
_11insert into movies (_11 title,_11 original_title,_11 overview_11)_11values_11(_11 "Avengers: Age of Ultron",_11 "Avengers: Age of Ultron",_11 "When Tony Stark tries to jumpstart a dormant peacekeeping program, things go awry and Earth’s Mightiest Heroes are put to the ultimate test as the fate of the planet hangs in the balance. As the villainous Ultron emerges, it is up to The Avengers to stop him from enacting his terrible plans, and soon uneasy alliances and unexpected action pave the way for an epic and unique global adventure.",_11);
Basic searching with select and like
With nothing but the raw data in your database, we can actually do some basic searching using select
and comparison tools like like
.
_10select_10 *_10from movies_10where title like '%Avengers%' or overview like '%Avengers%';
The LIKE
operator as used in the query above will match only the term "Avengers", but it will match it in any part of a given row's title
or overview
.
We can hack around this basic functionality and make it a bit more robust:
- Storing the text as case insensitive text (
citext
) - Combining the text that will be searched into one column
- Using a GIN index
Our "v0" would be better, but using like
isn't quite what people mean when they ask for robust full text search. Not a bad first version, but Postgres can do much better.
Robust Full Text Search with Postgres
Postgres has built-in Full Text Search which we can use instead of cobbling together our own with like
.
Luckily for us, the Postgres FTS documentation is quite extensive, containing lots of information on the all-important Postgres search data types: tsvector
and tsquery
.
State of the art search engines like Solr do things to make searches work well:
- Stemming ("jumped", "jumper" -> "jump")
- Lemmatization ("better" -> "good")
- Result Ranking configuration
- Search by edit distance
- Vector based similarity (which you might use for a basic recommendation system)
Postgres FTS does not do all of the things a purpose-built search engine can do, but Postgres can produce very good results with decent performance.
A working Postgres FTS setup
Integrating Full Text Search into our imaginary movies table is as simple as executing the following:
_12-- Add a generated column that contains the search document_12alter table movies_12 add column fts_doc_en_12 generated always as to_tsvector (_12 'english', title || ' ' || original_title || ' ' || overview_12 )_12 stored;_12_12-- Create a GIN index to make our searches faster_12create index movies_fts_doc_en_idx_12 on movies_12 using gin (fts_doc_en);
Postgres supports even more features like facilities for ranking search results but for now we'll simply use title
, original_title
, and overview
as they are.
_10select_10 * from movies_10where doc_en @@ websearch_to_tsquery('english', 'Avengers');
Comparable FTS engines
Now that we've got a grasp on what FTS is and how it's used in Postgres, let's discuss a few of the alternative solutions Postgres can be compared to.
As you might expect, this isn't an exhaustive list of FTS engines, there are many more that we're not covering here (If you've got another search engine you'd like to see featured, file an issue).
SQLite FTS
SQLite is arguably the most widely deployed database in the world. SQLite is a library that produces and manages single file (or in-memory) databases that run on phones, planes, laptops, kiosks, Raspberry Pis, and everything in between.
Much like Postgres, SQLite is an excellent F/OSS project that has adopted the ability to perform Full Text Search.
The SQLite FTS documentation is extensive, and the FTS subsystems have undergone a few version upgrades -- FTS 3/4 and SQLite FTS 5 are both options.
For our experiments we'll use SQLite FTS5.
TypeSense
TypeSense is a relatively new project that focuses on lightning fast, typo-tolerant open source search. Typesense boasts a high performance C++ codebase, promising nearly instant search results.
TypeSense provides many demos with example datasets available for query, which you can try out:
Typesense quite ambitiously brands itself as the "Algolia" and "Elasticsearch" alternative, offering to solve your search problems without requiring a PhD.
MeiliSearch
MeiliSearch is a new search engine written from the ground up in Rust, which also promises lightning fast searches, flexibility, and typo tolerance.
MeiliSearch has extensive documentation, boasts an easy to use yet extendable architecture, and is well supported by a large community and team.
OpenSearch
AWS OpenSearch is AWS's answer to ElasticSearch's license changes. While opinions are varied, ElasticSearch built a great piece (formerly Apache 2.0 licensed) piece of software, and the AWS fork represents at least a stale measure of ElasticSearch's capabilities.
OpenSearch is the same as ElasticSearch (since the fork) but as it's license is much more permissive, it's the easy choice as a comparison to Postgres.
In production make sure to evaluate ElasticSearch versus OpenSearch (or other solutions) sufficiently for your use case.
Testing our FTS solutions
Well, how do we test performance of these disparate complex systems on an infinitely redefinable, hard-to-solve problem? By sticking to the use cases.
Any search engine has two jobs:
- Ingest information (usually referred to as "documents")
- Return documents that match user queries
Ingesting information can be important, but "search engine performance" usually refers to the second step - taking queries and outputting results.
That is to say, the primary concern is assuming a cluster already has proper data loaded on to it, how fast can it deliver the relevant results for a user-submitted query?
We'll focus on the end-user-centric meaning of performance here (query speed).
Getting a dataset
Hugging Face has a 32MB movie data set. We're interested in the follow columns in this data set:
title
original_title
overview
Postgres is capable of loading CSVs, but since we'll be using other search engines as well, let's convert to format that's much easier to use and process - Newline Delimited JSON.
After a few lines of code and one csv2ndjson.mjs
script later, we have a movies.ndjson.json
full of JSON documents that we can easily ingest into any search engine (or other database for that matter!).
Inserting the data
Each system ingests data slightly differently, but with a little scripting we can paper over the differences.
The code contains a scheme of "drivers" which are minimal ES6 modules.
Here's an example of the ingest
function used for Postgres (with slonik
as our DB driver):
_31// Ingesting search documents_31async function ingest({ document, pool }) {_31 // Ignore rows without proper release dates_31 if (document.release_date.trim().length === 0) {_31 return_31 }_31_31 await pool.query(sql`_31insert into movies (_31 id, _31 title, _31 original_title, _31 overview, _31 released_at_31)_31values_31(_31 ${document.id},_31 ${document.title},_31 ${document.original_title},_31 ${document.overview},_31 ${new Date(document.release_date).toISOString()}::timestamptz_31)_31on conflict (id) do update set_31 title=${document.title}, _31 original_title=${document.original_title}, _31 overview=${document.overview}, _31 released_at=${new Date(document.release_date).toISOString()}::timestamptz_31;_31`)_31}
The repo contains the complete code for Postgres as well as the other search engines, written in much this same fashion.
Picking a set of queries
This part is somewhat arbitrary - I've gotten a list of queries here that one might search for in terms of movies:
_10"super hero"_10"superhero"_10"superman"_10"suprman"_10"love"_10"world war"_10"spy"_10"romance"_10"comedy"_10"awakening"
Thanks to boinboing.net's article from 2018, there are some words that show up quite often in movie titles that we can search for.
The data set isn't very big, but doing all these searches in quick succession should be enough to get a realistic performance baseline.
Running the queries
Similar to ingesting data, all the search engines take slightly different input for queries, so we'll change our simple string queries into whatever the relevant search engine expects to receive.
Here's what that script looks like for Postgres:
_10// Querying search phrases_10async function query({ phrase, pool }) {_10 const results = await pool.query(sql`_10select id, title_10from movies_10where fts_doc_en @@ websearch_to_tsquery('english', ${phrase}::text)_10`)_10 const ids = results.rows.map((r) => r.id)_10 return ids_10}
No surprises there, just like the Postgres FTS example from earlier.
Automating our toil
Once we've combined our scripting magic with some Makefile
incantations, running the queries produces output like this:
_15TIMING=true FTS_ENGINE=pg make query_15[info] importing driver from [./pg.mjs]_15[info] finished initializing driver [pg]_15[info] processing lines in [/path/to/pg-fts-benchmark/search-phrases.ndjson.json]..._15[timing] phrase [super hero]: returned [34] results in 3.132471.ms_15[timing] phrase [superhero]: returned [86] results in 1.180798.ms_15[timing] phrase [superman]: returned [47] results in 0.912615.ms_15[timing] phrase [suprman]: returned [0] results in 0.781712.ms_15[timing] phrase [love]: returned [5417] results in 19.088668.ms_15[timing] phrase [world war]: returned [834] results in 2.902097.ms_15[timing] phrase [spy]: returned [349] results in 1.497892.ms_15[timing] phrase [romance]: returned [630] results in 1.91661.ms_15[timing] phrase [comedy]: returned [1213] results in 2.534538.ms_15[timing] phrase [awakening]: returned [210] results in 2.443748.ms_15[info] successfully processed [10] lines
Right off the bat we can see that Postgres is plenty quick! Some results are quite obvious - as you might expect, "love" is quite a common search result.
Results
With the ingest
and query
machinery defined for the search engines, it's easy to get some results.
What does the data tell us?
There's a lot of tuning left undone here, but a few points stand out right away:
- Even when consuming similar content, engines can produce different results, but generally ratios between queries on the same engine should be consistent.
- Postgres FTS is quite close performance-wise to many other solutions, at least in their default configuration.
- Only Typesense and MeiliSearch properly handled mis-spellings (the "suprman" query).
- Typesense was relatively strict with matches compared to other engines.
- OpenSearch was very fast with ingest, but the default configuration doesn't index misspellings.
- In-memory SQLite is by far the fastest, and PG isn't too far behind for this small data set.
Feedback from the other products
Since this is a benchmark, we felt it was fair to give the other products an opportunity to give their feedback.
Meilisearch
I agree that using PG in simple use cases is an excellent way to be pragmatic. Unfortunately, PG will never be able to offer you a search experience like the one you could have with Meilisearch. Meilisearch has perfect management of typos and searches by prefix, allowing a search at each keystroke. Enhanced relevancy with rules such as the number of words present, the importance of the attribute, the proximity of the terms searched in the document, and custom parameters. The ability to mix textual, geographic, and facet searches. Meilisearch automatically understands almost all languages with the possibility of having synonyms and stop words. Meilisearch has an incredible performance even on large volumes of data, and no backend proxy is needed, thanks to complete API Key management.
Quentin de Quelen, Meilisearch CEO
Typesense
Some more context around Typesense's behavior: how loose or exact matches are done are configurable in Typesense. For eg, you can configure the number of typos that are accounted for (num_typos), and when typo tolerance should kick-in, only if a configurable amount of results are not found (typo_tokens_threshold). For multi keyword searches, you can configure Typesense to expand the search by dropping some of the keywords, until at least a configurable number of search results are found (drop_tokens_threshold). We've chosen specifically to keep the search tight because we received feedback from users that some results were confusing if we let the defaults be too loose.
Jason Bosco, Typesense CEO
Wrap-up
While this certainly isn't "big data", it looks like Postgres has very much held it's own, even without too much wrangling/tuning. Postgres is clearly more than capable of delivering good enough search for this very basic case.
Special purpose systems are expected to out-perform a more general system like Postgres, but clearly result quality and search speed provided Postgres is likely to be good enough for many use cases, and is “complexity neutral” -- no new systems needed!