PostgreSQL has a rich and extensible type system. Beyond enums and composite types, we can:
- apply data validation rules
- override comparison operators like
=
/+
/-
- create custom aggregations
- define casting rules between types
With a little effort, a user-defined type can feel indistinguishable from a built-in. In this article we focus on validation and ergonomics while quickly touching on a few other concepts.
To illustrate, we’ll create an semver
data type to represent Semantic Versioning values. We’ll then add validation rules to make invalid states unrepresentable.
SemVer
A (very) loose primer on SemVer:
SemVer is a specification for representing software versions that communicate information about backwards compatibility. The type is typically represented as a string with 5 components.
Where pre-release
and metadata
are optional.
The intent of each component is outside the scope of this article but, as an example, incrementing the major version number notifies users that the release includes at least one backwards incompatible change.
For a concise representation of the full spec, check out the grammar.
SQL
For our purposes, we’ll assume that the SemVer type is a critical component of the application that needs to be queried flexibly and efficiently.
Storing Components
To that end, we’ll store each component of the version as a separate field on a composite type.
_10create type semver_components as (_10 major int,_10 minor int,_10 patch int,_10 pre_release text[],_10 build_metadata text[]_10);
We can create an instance of this type in SQL by casting a tuple as the semver_components
type.
_10select_10 (1, 2, 3, array['beta', '1'], array['meta'])::semver_components_10-- returns: (1,2,3,{'beta','1'},{'meta'})
Unfortunately, our definition is far too permissive.
_10select_10 (null, -500, null, array['?'], array[''])::semver_components_10-- returns: (,-500,,{'?'},{''
Our data type has no problem accepting invalid components. To list a few of the SemVer rules we violated:
- Major version must not be null
- Minor version must be ≥ 0
- Patch version must not be null
- Pre-release elements must only include characters [A-z0-9]
- Build metadata elements may not be empty strings
We need to add some validation rules to meet our “make invalid states unrepresentable” goal.
Validation
Domains are Postgres’ solution for optionally layering constraints over a data type. Domains are to types what check constraints are to tables. If you’re not familiar with check constraints, you can think of them as equivalent to zod/pydantic in javascript/python.
Let's codify some SemVer rules, layer them on the semver_components
type, and give the new domain a friendly name.
_18create domain semver_18 as semver_components_18 check (_18 -- major: non-null positive integer_18 (value).major is not null and (value).major >= 0_18 -- minor: non-null positive integer_18 and (value).minor is not null and (value).minor >= 0_18 -- patch: non-null positive integer_18 and (value).patch is not null and (value).patch >= 0_18 and semver_elements_match_regex(_18 (value).pre_release,_18 '^[A-z0-9]{1,255}$'_18 )_18 and semver_elements_match_regex(_18 (value).build_metadata,_18 '^[A-z0-9\.]{1,255}$'_18 )_18 );
which references a helper function:
_23create or replace function semver_elements_match_regex(_23 parts text[],_23 regex text_23)_23returns bool_23language sql_23as $$_23 -- validates that *parts* nullable array of non-empty strings_23 -- where each element of *parts* matches *regex*_23 select_23 $1 is null_23 or (_23 (_23 select (_23 bool_and(pr_arr.elem is not null)_23 and bool_and(pr_arr.elem ~ $2)_23 )_23 from_23 unnest($1) pr_arr(elem)_23 )_23 and array_length($1, 1) > 0_23 )_23$$;
Now, if we repeat our positive and negative test cases using the semver
type (vs semver_components
) we still accept valid states:
_10-- Success Case_10select_10 (1, 2, 3, array['beta', '1'], array['meta'])::semver,_10-- returns: (1,2,3,{'beta','1'},{'meta'})
while invalid states are rejected with an error:
_10-- Failure Case_10select_10 (null, -500, null, array['?'], array[''])::semver_10-- ERROR: value for domain semver violates check constraint "semver_check"_10-- SQL state: 23514
Testing
Our validation doesn’t have to be called manually. The semver
domain can be used anywhere you’d use the semver_components
type and the validations are automatically applied.
_19-- A table with a semver column_19create table package_version(_19 id bigserial primary key,_19 package_name text not null,_19 package_semver semver not null -- semver column_19);_19_19-- Insert some valid records_19insert into package_version( package_name, package_semver )_19values_19 ('supabase-js', (2, 2, 3, null, null)),_19 ('supabase-js', (2, 0, 0, array['rc', '1'], null)_19);_19_19-- Attempt to insert an invalid record (major is null)_19insert into package_version( package_name, package_semver )_19values_19 ('invalid-js', (null, 1, 0, array['asdf'], null));_19-- ERROR: value for domain semver violates check constraint "semver_check"
Good stuff!
We’re 48 lines of SQL in and have solved for making invalid states unrepresentable. Now lets think about ergonomics.
Displaying
Now that our data type is well constrained, you might notice that selecting values from a semver
typed column returns a tuple, rather than the SemVer string we’re used to seeing.
_10select_10 *_10from_10 package_version_10/*_10id | package_name | package_semver_10-------------------------------------_10 1 | supabase-js | (2,2,3,,)_10 2 | supabase-js | (2,0,0,"{rc,1}",)_10*/
For example: (2,0,0,"{rc,1}",)
vs 2.0.0-rc.1
We could work around that problem with some custom casts, but I’d recommend keeping everything explicit with a function call.
_16create or replace function semver_to_text(semver)_16 returns text_16 immutable_16 language sql_16as $$_16 select_16 format('%s.%s.%s', $1.major, $1.minor, $1.patch)_16 || case_16 when $1.pre_release is null then ''_16 else format('-%s', array_to_string($1.pre_release, '.'))_16 end_16 || case_16 when $1.build_metadata is null then ''_16 else format('+%s', array_to_string($1.build_metadata, '.'))_16 end_16$$;
Which allows us to query the package_version
table and retrieve a string representation of the data.
_12select_12 id,_12 package_name,_12 semver_to_text(package_semver) as ver -- cast as text_12from_12 package_version_12/*_12id | package_name | ver_12------------------------------_12 1 | supabase-js | 2.2.3_12 2 | supabase-js | 2.0.0-rc.1_12*/
Or, better yet, use a generated column
_10create table package_version(_10 id bigserial primary key,_10 package_name text not null,_10 package_semver semver not null,_10 semver_text text generated always as (semver_to_text(package_semver)) stored_10);
so the text representation is persisted along with the semver
type and incurs no query/filter penalty.
Other Tricks
Postgres provides all the tools you could want to make your data types/domains work with SQL as seamlessly as builtins.
For example, you could:
- add convenience functions to parse a semver type from text
- override the equality operator (
=
) to correctly reflect that versions differing only in build metadata are considered equal - add a
max
function to efficiently query for the newest version of each package from within the database
to name a few.
Aligning the right parts of your business’ logic with the database can dramatically improve throughput, decrease IO, and simplify application code.
Conclusion
Admittedly, building performant and ergonomic custom data types in Postgres involves a lot of ceremony.
That said, in cases where:
- the type’s data integrity is critical
- the type is well specified
- the type’s spec does not change (or changes infrequently)
Teaching Postgres to have first class support for your custom type can be transformative for data integrity and performance.