How to Get Metadata from PostgreSQL System Catalogs
PostgreSQL is a well-known database engine that allows you to write performant SQL queries.
In this tutorial, you’ll learn about the internal building blocks of PostgreSQL – also known as system catalogs – to see how the PostgreSQL backend is structured.
This tutorial aims to give you insights into how PostgreSQL system catalogs can help you manage your schemas, databases, tables, and even table columns.
What is a PostgreSQL system catalog?
When you set up PostgreSQL on your machine, you spin up a PostgreSQL cluster, a server that hosts databases.
This cluster has a set of catalogs. Each catalog is a collection of schemas, and each schema has a set of tables.
System catalogs are where a database management system stores schema metadata, such as information about databases, tables, and columns.
These catalogs are regular tables where you can manipulate data as you do with other tables. You can insert new rows, update values, add columns – or potentially mess up your system. So you shouldn’t change the system catalogs unless you really know what you’re doing.
A few system catalogs are shared across all databases in the PostgreSQL cluster. However, most system catalogs are database-specific.
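To see which catalogs are shared, you can ask pg_class itself. This is a minimal sketch that relies on the relisshared flag, which marks relations shared across every database in the cluster:

```sql
-- List the catalogs that are shared by all databases in the cluster.
SELECT relname
FROM pg_class
WHERE relisshared      -- true only for cluster-wide relations
  AND relkind = 'r'    -- ordinary tables, skipping their indexes and TOAST tables
ORDER BY relname;
```

You would expect pg_database to appear in the result, since it has to be visible from every database.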
Each system catalog name starts with “pg_”. PostgreSQL has some funny naming conventions for the column names of system catalogs. A high percentage of columns – if not all – have a prefix made of three letters taken from the name after “pg_”. For example, pg_database has columns starting with “dat”, like datname, which is the database name column.
A couple more examples are:
- pg_proc, which has columns that start with “pro”.
- pg_namespace, which has columns that begin with “nsp”.
Another interesting example is pg_class, which has columns that start with “rel”. This is because the pg_class catalog stores all information about tables and other objects that have columns. These kinds of objects are referred to as “relations”.
Another typical example is the pg_constraint catalog, which stores primary, unique, and foreign keys and table constraints. Each column starts with “con” except for the oid column, which is the row identifier.
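To check the prefix convention yourself, you can list a catalog’s columns from pg_attribute. A small sketch, using pg_database as the example catalog:

```sql
-- Show the user-visible columns of pg_database in their declared order.
SELECT attname
FROM pg_attribute
WHERE attrelid = 'pg_database'::regclass  -- resolve the catalog name to its OID
  AND attnum > 0                          -- skip hidden system columns
  AND NOT attisdropped
ORDER BY attnum;
```

Swapping in 'pg_proc' or 'pg_namespace' should show the “pro” and “nsp” prefixes in the same way.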
Getting started: Prerequisites
To follow this tutorial, you’ll need to write SQL commands in your preferred SQL client. You can also use CoderPad’s sandbox, which uses PostgreSQL 12.
One pro tip on CoderPad sandbox is to head over to the right terminal and run this to prettify the output of the SQL command:
coderpad=# \x
This is useful, especially if the records are long. Before running \x, each record prints as one wide row that wraps across the terminal. After running \x on the terminal, expanded formatting prints each column of a record on its own line.
Why getting metadata is useful
Getting PostgreSQL metadata gives you information about databases, tables, schemas, indexes, columns, functions, and even sizes.
To name a few use cases, you can do the following:
- Know who the database owner is
- Extract the number of transactions committed to the database
- List all tables in the PostgreSQL cluster
- Investigate the number of full scans, index scans, etc.
- Retrieve all schemas in your cluster
- Be informed of the tables that have no indexes so you can fix performance issues
- Investigate the data types of columns that have problems in your tables
- List functions based on their inputs
- Know the size of tables, databases, indexes, etc.
How to get database metadata
In this section, I’ll talk about how to get database metadata, get the name of the database owner, and extract statistical information about the database, like the number of transactions.
Using pg_database
The pg_database catalog stores information about the Postgres databases available in your cluster. Whenever you run the SQL command CREATE DATABASE, a new row is inserted into the pg_database catalog. This row has metadata related to that particular database.
This pg_database catalog is system-wide, which means it’s shared across all databases in the cluster.
A query on pg_database would look something like this:
SELECT datdba
FROM pg_database
WHERE datname = 'coderpad';
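This query gets you the owner of the coderpad database. In this case, it returns the value 10, which is the OID of the owning role in the pg_authid catalog.
To get the name of the role that owns the coderpad database (usually the role that created it), you’d use the query below. Note that pg_authid is readable only by superusers; if you get a permission error, the pg_roles view exposes the same oid and rolname columns.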
SELECT a.rolname AS "Owner"
FROM pg_database d
JOIN pg_authid a ON a.oid = d.datdba
WHERE datname = 'coderpad';
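If you only need the owner’s name, a join may not be necessary. The sketch below assumes the built-in pg_get_userbyid() function is acceptable for your use case; it resolves a role OID to its name and does not require superuser access:

```sql
-- List every database in the cluster together with the name of its owner.
SELECT datname,
       pg_get_userbyid(datdba) AS owner  -- role OID -> role name
FROM pg_database
ORDER BY datname;
```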
Using pg_stat_database
The pg_stat_database view stores statistical information about databases. Strictly speaking, it is a statistics view rather than a catalog, and it holds one row for each database in the cluster.
Here’s an example query using pg_stat_database:
SELECT xact_commit, xact_rollback
FROM pg_stat_database
WHERE datname = 'coderpad';
This query helps us analyze database transaction metadata. The xact_commit attribute shows how many transactions have been committed in this database, while the xact_rollback attribute indicates how many transactions have been rolled back (errored) in this database.
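The two counters can also be combined into a rollback percentage per database. This is only a sketch: the rollback_pct alias is a name made up for illustration, and NULLIF is used so that a database with no transactions yields NULL instead of a division-by-zero error:

```sql
-- Share of rolled-back transactions for each database in the cluster.
SELECT datname,
       xact_commit,
       xact_rollback,
       round(100.0 * xact_rollback
             / NULLIF(xact_commit + xact_rollback, 0), 2) AS rollback_pct
FROM pg_stat_database
ORDER BY rollback_pct DESC NULLS LAST;
```

A high percentage is a hint that something in the application is failing repeatedly, though the counters alone do not say why.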
How to retrieve table metadata
In this section, I’ll talk about how to get table metadata, list all tables in a database, and retrieve statistical information and activity about tables, like the number of different scans.
Using pg_class
As mentioned earlier, the pg_class catalog stores relations, such as tables, indexes, and views.
To list all ordinary tables in the current database (pg_class is database-specific, and the result includes the system catalogs themselves), run the following query:
SELECT *
FROM pg_class
WHERE relkind = 'r';
The value r in the relkind attribute denotes an ordinary table.
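Because that query also returns PostgreSQL’s own catalogs, you may want to restrict it to your own tables. A minimal sketch, assuming that excluding the pg_catalog and information_schema schemas is enough for your setup:

```sql
-- Ordinary tables outside the built-in system schemas, with their schema names.
SELECT n.nspname AS schema_name,
       c.relname AS table_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY schema_name, table_name;
```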
Using pg_stat_user_tables
The pg_stat_user_tables view stores stats and activity about user tables. For example, take this query:
SELECT seq_scan, idx_scan
FROM pg_stat_user_tables
WHERE relname = 'projects';
It gets the number of sequential scans and the number of index scans initiated on the projects table. Sequential scans are full scans where each row in the table is read. They are usually slow when you retrieve only a small number of rows from a big table (that is, when the fraction of rows selected is low). In contrast, index scans do not read the whole table; they traverse the index’s data structure (a B-tree by default) to find the matching rows.
This view is database-specific, so the projects relation here resides in the coderpad database.
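The same view can point you at tables that are read mostly by sequential scans. The query below is a rough sketch rather than a tuning rule; the comparison and the LIMIT of 10 are arbitrary choices made for illustration:

```sql
-- User tables scanned sequentially more often than through an index,
-- ordered by how many rows those sequential scans have read.
SELECT relname,
       seq_scan,
       idx_scan,
       seq_tup_read
FROM pg_stat_user_tables
WHERE seq_scan > coalesce(idx_scan, 0)  -- idx_scan is NULL when a table has no indexes
ORDER BY seq_tup_read DESC
LIMIT 10;
```

Small tables often show up here legitimately, because scanning them in full can be cheaper than using an index.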
How to get schema metadata
In this section, I’ll talk about how to get schema metadata and get all schema names in a database.
Using pg_namespace
The pg_namespace catalog stores namespaces. Per the PostgreSQL documentation, a namespace is the structure underlying SQL schemas. That means whenever you run the CREATE SCHEMA command, a new record is created under pg_namespace. Each namespace can have a separate collection of relations, types, etc., without name conflicts.
To list all schema names, run the following query:
SELECT nspname
FROM pg_namespace;
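That list includes internal schemas such as pg_catalog and pg_toast. If you only want schemas that you or your team created, something like the following sketch may help; it assumes that filtering on the “pg_” prefix and on information_schema is sufficient:

```sql
-- User-created schemas and the roles that own them.
SELECT nspname,
       pg_get_userbyid(nspowner) AS owner
FROM pg_namespace
WHERE nspname !~ '^pg_'                 -- drop pg_catalog, pg_toast, pg_temp_N, ...
  AND nspname <> 'information_schema'
ORDER BY nspname;
```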
How to get index metadata
In this section, I’ll talk about how to get index metadata and find tables that have no indexes.
Using pg_index
Part of the information about indexes is stored in the pg_index catalog. The rest is mostly in pg_class.
To get the tables that have no indexes, I have used the following query (which appeared in a Postgres conference workshop):
SELECT c.oid::regclass AS table_name
FROM pg_class c
WHERE c.relkind = 'r'
AND NOT EXISTS (SELECT 1 FROM pg_index i WHERE i.indrelid = c.oid);
Note: the reg* data types are OID alias types that display an OID as a name. In the previous query, I cast the OID to regclass to show the relation’s name instead of its numeric OID.
If you want to list all tables that don’t have indexes, except the ones in the information schema, exclude that schema by name (this query is based on one introduced in a Robert Haas workshop):
SELECT c.oid::regclass AS table_name
FROM pg_class c
WHERE c.relkind = 'r'
AND NOT EXISTS (SELECT 1 FROM pg_index i WHERE i.indrelid = c.oid)
-- Resolve the schema by name; its OID (13127 in the original setup) differs between installations.
AND c.relnamespace <> 'information_schema'::regnamespace;
Using pg_stat_user_tables.idx_scan
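An equivalent that looks only at user tables, and so skips the system catalogs and the information schema, relies on idx_scan being NULL for a table that has no indexes at all: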
SELECT relname
FROM pg_stat_user_tables
WHERE idx_scan IS NULL;
How to get column metadata
In this section, I’ll talk about how to get column metadata and find out the data types of columns.
Using pg_attribute
The pg_attribute catalog stores information about table columns. Every column in any table has a pg_attribute row in the database.
To list the first 50 attribute names with their corresponding data type, run the following query:
SELECT attname, atttypid::regtype
FROM pg_attribute
LIMIT 50;
Note: The regtype data type here is used to format the attribute type id into a data type name like name, boolean, smallint, and more.
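In practice you will usually want the columns of one table rather than the first 50 rows of the catalog. The sketch below assumes a table named projects exists in the current database (the table name used earlier in this tutorial) and uses format_type() so that type modifiers such as a varchar length are shown too:

```sql
-- Columns of the projects table, in declared order, with their full data types.
SELECT a.attname,
       format_type(a.atttypid, a.atttypmod) AS data_type,
       a.attnotnull                          AS not_null
FROM pg_attribute a
WHERE a.attrelid = 'projects'::regclass
  AND a.attnum > 0          -- skip system columns such as ctid and xmin
  AND NOT a.attisdropped    -- skip columns removed with ALTER TABLE ... DROP COLUMN
ORDER BY a.attnum;
```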
How to get function metadata
In this section, I’ll talk about how to get function metadata and find functions by the types of their inputs.
Using pg_proc
To get all PostgreSQL functions that take text, run the following query (introduced in a PGCon workshop):
SELECT oid::regprocedure
FROM pg_proc
WHERE 'text'::regtype = ANY(proargtypes);
Note: The regprocedure data type gets you each function’s name and argument types, such as starts_with(text, text).
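You can also go the other way and look up one function by name to see its signature and return type. A small sketch, using starts_with as the example (this assumes PostgreSQL 11 or later, where that function is available):

```sql
-- Every overload of starts_with, with its argument list and return type.
SELECT oid::regprocedure   AS signature,
       prorettype::regtype AS return_type
FROM pg_proc
WHERE proname = 'starts_with';
```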
How to get size information
In this section, I’ll discuss how to get size information about tables, databases, indexes, etc.
Table size using pg_table_size()
pg_table_size is a database object management function that determines the disk space used by the specified table.
Take a look at the following example (mentioned in the PGCon workshop):
SELECT name, pg_size_pretty(size)
FROM (SELECT oid::regclass AS name, pg_table_size(oid) AS size
FROM pg_class WHERE relkind = 'r') x
ORDER BY size DESC;
This query lists the tables in the current database by size in descending order. The pg_size_pretty() function converts the size into a human-readable format such as kB or MB, since the default output is in bytes.
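pg_table_size() does not count a table’s indexes, so it can help to put the related size functions side by side. This is a sketch only, and the LIMIT of 10 is an arbitrary choice:

```sql
-- Table data, index, and combined sizes for the ten largest tables.
SELECT oid::regclass                                AS name,
       pg_size_pretty(pg_table_size(oid))           AS table_size,
       pg_size_pretty(pg_indexes_size(oid))         AS indexes_size,
       pg_size_pretty(pg_total_relation_size(oid))  AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;
```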
Database size using pg_database_size()
Similarly, pg_database_size is a database function that computes the disk space used by the specified database. So the following query (mentioned in the Robert Haas workshop) orders the databases in descending order by size:
SELECT datname, pg_size_pretty(size)
FROM (SELECT datname, pg_database_size(oid) AS size
FROM pg_database) x
ORDER BY size DESC;
Indexes size using pg_indexes_size()
pg_indexes_size() is another function; it computes the total disk space used by all indexes attached to the specified table.
The following query lists user index names ordered by that total, so indexes that belong to the same table report the same size:
SELECT i.indexrelname, x.indexrelid, pg_size_pretty(size)
FROM (SELECT indexrelid, pg_indexes_size(indrelid) AS size
FROM pg_index) x
JOIN pg_stat_user_indexes i ON i.indexrelid = x.indexrelid
ORDER BY size DESC;
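Because pg_indexes_size() takes a table, a per-index breakdown needs a different function. A minimal sketch, assuming pg_relation_size() applied to the index OID is what you want:

```sql
-- Size of each individual index, largest first.
SELECT indexrelid::regclass                          AS index_name,
       indrelid::regclass                            AS table_name,
       pg_size_pretty(pg_relation_size(indexrelid))  AS index_size
FROM pg_index
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 10;
```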
Why knowing about system catalogs is useful
Getting information about your PostgreSQL database is not that hard if you know how to use system catalogs.
In addition to what we talked about in this article, here are some other things you can do with your new knowledge of system catalogs:
- You can see how PostgreSQL is configured by checking the pg_ table that contains almost all you need.
- Have a slow query? Tracking how many sequential scans or indexes are used might solve this performance problem.
- If you’re worried about the disk space of your database, you can now more easily check how big your databases are getting to see which ones may cause issues.
The system catalogs record every object you create with SQL. Practicing more with them will help you and your team investigate performance, disk space, and memory usage in your cluster.
I’m Ezz. I’m an AWS Certified Machine Learning Specialist and a Data Platform Engineer. I help SaaS companies rank on Google. Check out my website for more.