The 5 Postgres Extensions Every Shopify Backend Needs
- pg_cron schedules order and inventory mirror jobs inside Postgres without a separate worker
- pgvector turns product titles and descriptions into semantic search and recommendations
- pg_partman partitions orders and webhook events by month before they grind queries to a halt
- pg_stat_statements surfaces the slowest Shopify mirror queries before customers feel them
- postgres_fdw federates the Shopify mirror with the analytics warehouse without ETL
- Bottom line: install in this order, pg_stat_statements first, pg_cron second, pg_partman third, then pgvector and postgres_fdw as needs arrive
A custom Shopify backend is not really a Shopify project. It is a Postgres project that happens to talk to Shopify. The webhooks, the order mirror, the inventory cache, the product embeddings, the analytics rollups, all of it lives or dies on what the database can do. Generic Postgres is fine for a SaaS app with a few thousand rows per table. It is not fine for a backend that ingests every order webhook, every inventory delta, every customer event, and tries to keep them queryable for years. By month six the orders table has 4 million rows, the webhook log has 80 million, and a query that used to return in 30ms now takes 2.4 seconds. The fix is not more indexes (I covered those in the 7 Postgres indexes that took an API from 400ms to 40ms). The fix is five extensions that turn generic Postgres into a Shopify aware database.
pg_cron, schedule mirror jobs without an external worker
Every Shopify backend needs scheduled work. Reconcile the order mirror against the Shopify Orders API every 15 minutes in case a webhook was dropped. Refresh the inventory snapshot every hour. Rebuild the product embedding index every night. Vacuum the webhook log every Sunday. The bad version of this is a separate Node worker, a separate Heroku scheduler, a separate cron container, each with its own deploy story and its own way of failing silently.
pg_cron runs cron jobs inside Postgres. The schedule is a row in a table, the job is plain SQL, and if Postgres is up the schedule runs. If Postgres is down nothing else matters anyway.
CREATE EXTENSION IF NOT EXISTS pg_cron;
-- Reconcile orders every 15 minutes
SELECT cron.schedule(
'reconcile-orders',
'*/15 * * * *',
$$ CALL shopify.reconcile_orders_since(NOW() - INTERVAL '30 minutes'); $$
);
-- Refresh inventory snapshot every hour at :07
SELECT cron.schedule(
'refresh-inventory',
'7 * * * *',
$$ REFRESH MATERIALIZED VIEW CONCURRENTLY shopify.inventory_snapshot; $$
);
-- Vacuum the webhook log every Sunday at 03:00
SELECT cron.schedule(
'vacuum-webhooks',
'0 3 * * 0',
$$ VACUUM (ANALYZE) shopify.webhook_events; $$
);
On managed Postgres (Supabase, Neon, RDS) pg_cron is a one click extension. On self hosted you add `shared_preload_libraries = 'pg_cron'` and `cron.database_name = 'app'` to `postgresql.conf` and restart. Every job logs to `cron.job_run_details`, which means a single SQL query tells me which mirror jobs are failing and how long they take. That is the observability story I want, not a Datadog dashboard for a worker that runs four times a day.
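Here is the shape of that query, joining `cron.job_run_details` to `cron.job` on `jobid`:
-- Which jobs failed in the last 24 hours, and how long they ran
SELECT j.jobname, d.status, d.return_message,
d.end_time - d.start_time AS duration
FROM cron.job_run_details d
JOIN cron.job j USING (jobid)
WHERE d.start_time > NOW() - INTERVAL '24 hours'
AND d.status = 'failed'
ORDER BY d.start_time DESC;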
The only sharp edge: pg_cron jobs run as the role that scheduled them, which in practice is usually the superuser. For a multi tenant backend I create a `cron_runner` role with exactly the grants the jobs need and use `cron.schedule_in_database` to scope jobs to it.
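A sketch of that scoping, with illustrative grants (scheduling a job as another role requires superuser):
CREATE ROLE cron_runner LOGIN;
GRANT USAGE ON SCHEMA shopify TO cron_runner;
GRANT SELECT, INSERT, UPDATE ON shopify.orders TO cron_runner;
GRANT EXECUTE ON ALL PROCEDURES IN SCHEMA shopify TO cron_runner;
SELECT cron.schedule_in_database(
'reconcile-orders',
'*/15 * * * *',
$$ CALL shopify.reconcile_orders_since(NOW() - INTERVAL '30 minutes'); $$,
'app',
'cron_runner'
);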
pgvector, semantic product search and recommendations
Shopify's native search is keyword based. A customer searching "warm jacket for cold rainy commutes" gets nothing because no product literally contains that phrase. Semantic search fixes this by embedding every product title and description into a vector and finding the nearest neighbours to the query embedding. pgvector lets Postgres store those vectors and run nearest neighbour search with an index, which means I do not need Pinecone, Qdrant, or a separate vector service.
CREATE EXTENSION IF NOT EXISTS vector;
ALTER TABLE shopify.products
ADD COLUMN embedding vector(1536);
-- HNSW index for fast approximate nearest neighbour
CREATE INDEX products_embedding_hnsw
ON shopify.products
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Find the 10 products closest to a query embedding
SELECT id, title, 1 - (embedding <=> $1) AS similarity
FROM shopify.products
WHERE status = 'active'
ORDER BY embedding <=> $1
LIMIT 10;
The `<=>` operator is cosine distance. The HNSW index makes a query over 200,000 products return in 8 to 12ms on a 4 vCPU instance, which is faster than most keyword searches against a B tree on the title column. For embeddings I use OpenAI's `text-embedding-3-small` (1536 dimensions, 0.02 EUR per million tokens) and re embed on product update via a trigger that pushes to a queue table that pg_cron drains every minute. That trigger plus queue pattern is one of the 8 Drizzle ORM patterns I use in every Shopify backend, because it survives backfills and rate limits.
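A minimal sketch of the enqueue half, assuming the products table carries `title` and `description` columns; the drain job that actually calls the embedding API is application specific and left out:
-- Queue table: one pending row per product, deduplicated on conflict
CREATE TABLE shopify.embedding_queue (
product_id bigint PRIMARY KEY,
enqueued_at timestamptz NOT NULL DEFAULT NOW()
);
CREATE OR REPLACE FUNCTION shopify.enqueue_embedding() RETURNS trigger AS $$
BEGIN
INSERT INTO shopify.embedding_queue (product_id)
VALUES (NEW.id)
ON CONFLICT (product_id) DO UPDATE SET enqueued_at = NOW();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER products_reembed
AFTER INSERT OR UPDATE OF title, description ON shopify.products
FOR EACH ROW EXECUTE FUNCTION shopify.enqueue_embedding();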
Recommendations are the same query with a different input vector. "Customers who bought this also bought" becomes "find the 8 products closest to the centroid of this customer's purchase history." Same index, same operator, no new infrastructure. That is the leverage.
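A sketch of that centroid query, assuming a hypothetical `shopify.order_items` table linking customers to products; pgvector's `AVG` aggregate works on vector columns, and the scalar subquery resolves to a single vector before the nearest neighbour scan:
SELECT p.id, p.title
FROM shopify.products p
WHERE p.status = 'active'
ORDER BY p.embedding <=> (
SELECT AVG(p2.embedding)
FROM shopify.order_items oi
JOIN shopify.products p2 ON p2.id = oi.product_id
WHERE oi.customer_id = $1
)
LIMIT 8;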
pg_partman, partition tables that grow forever
The `orders` table grows by every order. The `webhook_events` table grows by every webhook (an active store can generate 50,000 events a day). The `customer_events` table grows by every page view if you mirror that. By year two these tables are tens of millions of rows. Vacuum takes hours. Index bloat slows every write. A query like "orders from the last 30 days" scans data from years one and two even with a perfect index.
pg_partman automates declarative partitioning by time range. I partition by month, keep the last 24 months hot, and detach older partitions to cheap storage or drop them entirely.
CREATE SCHEMA IF NOT EXISTS partman;
CREATE EXTENSION IF NOT EXISTS pg_partman SCHEMA partman;
-- Partition orders by created_at, monthly
CREATE TABLE shopify.orders (
id bigint NOT NULL,
shop_id bigint NOT NULL,
created_at timestamptz NOT NULL,
total_price numeric(12,2) NOT NULL,
payload jsonb NOT NULL
) PARTITION BY RANGE (created_at);
SELECT partman.create_parent(
p_parent_table => 'shopify.orders',
p_control => 'created_at',
p_type => 'range',
p_interval => '1 month',
p_premake => 3
);
-- Retention: keep 24 months, drop older partitions
UPDATE partman.part_config
SET retention = '24 months',
retention_keep_table = false
WHERE parent_table = 'shopify.orders';
-- Run maintenance hourly via pg_cron
SELECT cron.schedule(
'partman-maintenance',
'0 * * * *',
$$ CALL partman.run_maintenance_proc(); $$
);
The wins are concrete. A query for last 30 days only scans one or two partitions instead of the whole table, so it goes from 1.8 seconds to 40ms. Vacuum on a 600,000 row monthly partition takes 12 seconds instead of 4 hours on a 40 million row monolith. Dropping a 24 month old partition is instant, no DELETE, no bloat. Pair this with the time series indexes from the indexes article and the orders table stays fast forever, which it has to, because Shopify keeps every order forever.
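A quick way to confirm the pruning is real: because `NOW()` is resolved at execution time, the pruning shows up under `EXPLAIN ANALYZE` rather than plain `EXPLAIN`.
EXPLAIN ANALYZE
SELECT count(*)
FROM shopify.orders
WHERE created_at > NOW() - INTERVAL '30 days';
-- Pruned partitions show up in the plan as "(never executed)"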
pg_stat_statements, find the 5 slowest queries before customers do
Most slow Shopify backends are slow because of three or four specific queries that nobody noticed. The product list endpoint that does an N+1 over variants. The customer detail page that joins six tables without a covering index. The webhook handler that does a sequential scan because the WHERE clause uses a function on the indexed column. pg_stat_statements aggregates every executed query, normalises the parameters, and tracks calls, total time, rows returned, and cache hits.
-- In postgresql.conf
-- shared_preload_libraries = 'pg_stat_statements'
-- pg_stat_statements.track = all
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- The 10 slowest queries by total time
SELECT
substring(query, 1, 80) AS query_snippet,
calls,
round(total_exec_time::numeric, 0) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms,
round((100 * total_exec_time
/ sum(total_exec_time) OVER ())::numeric, 1) AS pct_total,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
That `pct_total` column is the one that matters. If a single query is 38% of all database time, that is the query I fix today. Usually it is one of three things, a missing index, a JSONB lookup that should be a column, or a join order the planner gets wrong because the statistics are stale.
I run this query at the start of every Friday review, paste the top five into the napkin, and either fix them or ticket them. Six months in, mean response time across the API dropped from 180ms to 42ms with no infrastructure changes, just deleting four bad queries the extension surfaced. This pairs naturally with the Lab overview playbook of measuring before optimising.
The reset cadence matters. I reset stats every Sunday at 03:00 via pg_cron so each week's top offenders are this week's, not a stale tail from a backfill three months ago.
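The reset itself is one more pg_cron job:
SELECT cron.schedule(
'reset-statements',
'0 3 * * 0',
$$ SELECT pg_stat_statements_reset(); $$
);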
postgres_fdw, federate the mirror with the analytics warehouse
Shopify backends usually have two databases. The hot operational one (orders, products, webhooks, customers) and the analytics warehouse (Postgres, ClickHouse, BigQuery). The bad pattern is an ETL job that copies operational tables to the warehouse every hour. It is brittle, it is stale, it doubles the storage bill, and it is the thing the on call engineer gets paged about at 03:00.
postgres_fdw lets Postgres query another Postgres as if its tables were local. I keep the operational DB lean, expose it to the warehouse via a foreign server, and the warehouse joins live operational data against historical aggregates without ETL.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
-- On the warehouse, point at the operational DB
CREATE SERVER ops_db
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'ops.internal', port '5432', dbname 'shopify_ops');
CREATE USER MAPPING FOR analytics_role
SERVER ops_db
OPTIONS (user 'analytics_reader', password '...');
-- Import only what we need into a local ops schema
CREATE SCHEMA IF NOT EXISTS ops;
IMPORT FOREIGN SCHEMA shopify
LIMIT TO (orders, products, customers)
FROM SERVER ops_db
INTO ops;
-- Now join live orders with warehouse aggregates
SELECT
o.id,
o.total_price,
c.lifetime_value,
c.churn_risk_score
FROM ops.orders o
JOIN warehouse.customer_metrics c ON c.shopify_id = o.customer_id
WHERE o.created_at > NOW() - INTERVAL '24 hours'
ORDER BY c.churn_risk_score DESC
LIMIT 50;
The query planner pushes the WHERE clause down to the operational DB so only 24 hours of orders cross the wire, not the whole table. With `use_remote_estimate 'true'` on the foreign server, joins get planned correctly even when the foreign tables are huge. I use this for the same dashboard pattern I described in Shopify metaobjects killed my headless CMS, where one Postgres is the source of truth and other systems read through it instead of duplicating it.
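Turning that on is a one line change; bumping `fetch_size` (default 100 rows per round trip) is a common pairing, though the right value depends on row width:
ALTER SERVER ops_db
OPTIONS (ADD use_remote_estimate 'true', ADD fetch_size '10000');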
The trap to avoid is using FDW for write paths or for high QPS user facing reads. It is a batch and analytics tool. For the customer facing API the operational DB still answers directly. For the BI dashboard, the nightly cohort calculation, the churn model feature pull, FDW removes a whole ETL pipeline.
Bottom line
Five extensions, but I do not install them at the same time. The order matters because each one earns its place by solving a specific pain.
I install pg_stat_statements first, on day one of every project, before any production traffic. It costs nothing, it teaches me what the database is actually doing, and it tells me which of the next four extensions I need next.
pg_cron goes in second, usually within the first month, the moment I write the second scheduled job. One cron is fine in the app layer, two means I want them in the database where the data is.
pg_partman comes third, around the time the orders table hits 2 million rows or the webhook log hits 20 million. Earlier is premature; later means migrating a live, unpartitioned table into partitions, which is the painful way to do it.
pgvector arrives when product search starts mattering as a feature, not as a checkbox. If the store has under 500 products and customers find what they want via the nav, semantic search is over engineering. Above 5000 products, it is the difference between a store that converts and one that does not.
postgres_fdw is the last one, because by the time I need it I already have a real analytics workload that justifies a second database. Adding FDW before there is a warehouse to federate with is solving a problem I do not have yet.
The meta point: a Shopify backend is a database problem, not a JavaScript problem. The framework on top of it (Next, Hono, Elysia, whatever) is interchangeable. The Postgres underneath is not. Spend the time on the extensions, the indexes, the partitioning strategy, and the rest of the stack gets simpler.