
Typically, our blog lives at the intersection of privacy, advertising, and building our business in this niche. This time, however, we're going to delve into some database scaling issues we've been hitting for a while and how we're solving them with DuckDB on top of our daily driver, PostgreSQL.
At both EthicalAds and our parent company Read the Docs, we use PostgreSQL heavily. Postgres stores basically all our production data and is our "source of truth" for advertising stats, billing, and payouts to publishers. It handles everything and is among the most dependable pieces of our infrastructure. It even handles ML embeddings with pgvector for better contextual ad targeting, and combined with a read replica, it scaled our setup to hundreds of writes per second. At EthicalAds, we generally field ~4-5M ad requests per weekday, and Postgres handles that kind of transaction processing effortlessly.
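As a rough illustration of the pgvector piece (a minimal sketch with hypothetical table and column names, not our actual schema), a contextual similarity lookup is just another SQL query:

import os

import psycopg2

# Hypothetical "pages" table with a pgvector "embedding" column already populated.
# The query embedding would come from whatever model produced the stored embeddings.
query_embedding = [0.1] * 384  # stand-in for a real embedding
embedding_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT url
        FROM pages
        ORDER BY embedding <=> %s::vector  -- cosine distance operator from pgvector
        LIMIT 10;
        """,
        (embedding_literal,),
    )
    similar_pages = cur.fetchall()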
However, one area where we've struggled a bit with Postgres is analytical queries. When a new advertiser prospect comes to us and asks how much ad inventory we have on content related to a tool like Terraform, we want to be able to answer that easily. By aggregating data ahead of time, we can pull up those numbers much faster when an advertiser requests a report or needs an estimate of how long their campaign will take.
Expensive analytical queries can do a number on your database.
We run these expensive aggregations across our database nightly, during lower-traffic times, and store the results in a series of tables of daily aggregated data by publisher, by country, by advertiser, and so on. Altogether, the queries take around 45 minutes to an hour and add significant load to our read replica. Occasionally (or not so occasionally, when they happen to coincide with an autovacuum), they time out and have to be re-run. Despite how much we love Postgres at EthicalAds, this has felt like one of the most brittle pieces of our setup.
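For a sense of what these rollups look like, here's a minimal sketch of one nightly aggregation; the table and column names are hypothetical stand-ins rather than our actual schema:

import os
from datetime import date, timedelta

import psycopg2

today = date.today()
yesterday = today - timedelta(days=1)

# Roll up yesterday's raw impressions into a daily aggregate table
# (hypothetical table and column names).
conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    cur.execute(
        """
        INSERT INTO adserver_dailyaggregate (day, publisher_id, impressions, clicks)
        SELECT %s AS day,
               publisher_id,
               COUNT(*) AS impressions,
               COUNT(*) FILTER (WHERE clicked) AS clicks
        FROM adserver_impression
        WHERE created_at >= %s AND created_at < %s
        GROUP BY publisher_id;
        """,
        (yesterday, yesterday, today),
    )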
Typically, these kinds of expensive aggregation queries are a better fit for column stores, data warehouses, and OLAP databases generally. We considered building out a data warehouse or another column-oriented database, but we never found something we really liked, and we were always hesitant to add a second production system that could get out of sync with Postgres. Postgres also has extensions that add some of these capabilities (e.g. Citus), but those solutions either didn't work for our use case or aren't supported on Azure's Managed Postgres, where we're hosted. This is where DuckDB came to our rescue.
DuckDB is an in-process analytical database and toolkit. It's sort of like SQLite, but built for analytical workloads and for querying data in a variety of formats wherever it lives. Like SQLite, you either run it inside your app's process (Python for us) or use its standalone CLI. It can read CSV or Parquet files stored on disk or in blob storage, or query a SQL database like Postgres directly.
Because most of our aggregations are over hourly or daily data, and that data virtually never changes once it's written, it's a great match for a write-once system. As a result, we began writing parquet files to cloud storage on a daily basis. Apache Parquet files are column-oriented data files that are widely supported by a variety of tools, including Python and DuckDB.
“Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.”
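Here's a rough sketch of how a daily export like that can work with DuckDB itself; the table name, columns, and paths are illustrative, not our production job:

import os

import duckdb

# Read a day of data straight out of Postgres and write it as a parquet file.
# "adserver_offer" and the "date" column are illustrative stand-ins.
duckdb.sql("INSTALL postgres")
duckdb.sql("LOAD postgres")

pg_conn_string = os.getenv("REPLICA_DATABASE_URL")
duckdb.sql(f"ATTACH IF NOT EXISTS '{pg_conn_string}' AS ads_pg (TYPE POSTGRES)")

duckdb.sql("""
    COPY (
        SELECT *
        FROM ads_pg.adserver_offer
        WHERE "date" >= DATE '2025-02-01' AND "date" < DATE '2025-02-02'
    ) TO '2025-02-01.parquet' (FORMAT PARQUET);
""")
# From here, upload the file to blob storage with your cloud SDK of choice
# (or write to a registered fsspec filesystem directly).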
Using Python and DuckDB, it's easy to query a day's or a month's worth of data, whether from a developer laptop or directly from a server, without adding any load to your production database. Queries that took 5-10 minutes against Postgres frequently take just a few seconds against parquet files optimized for the purpose. Here's an example of how to query parquet files in cloud storage directly from your app server in Python:
import os

import duckdb
from fsspec import filesystem

# Support for S3 is even more straight-forward
duckdb.sql("INSTALL azure")
duckdb.sql("LOAD azure")

azure_account_name = os.getenv("AZURE_ACCOUNT_NAME")
azure_account_key = os.getenv("AZURE_ACCOUNT_KEY")
azure_conn_string = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key}"
duckdb.register_filesystem(filesystem("abfs", connection_string=azure_conn_string))

# Query the daily parquet files
# Want an ORM? There's a SqlAlchemy driver for DuckDB
duckdb.sql("""SELECT COUNT(*) FROM read_parquet("abfs://datalake/2025-02-01.parquet");""")

# Query a month of data
duckdb.sql("""SELECT COUNT(*) FROM read_parquet("abfs://datalake/2025-01-*.parquet");""")
Creating our data lake of parquet files is cool, but the real power comes from "joining" this columnar optimized data back up against Postgres:
import os

import duckdb

# Enable the Postgres extension for DuckDB
duckdb.sql("INSTALL postgres")
duckdb.sql("LOAD postgres")

pg_conn_string = os.getenv("REPLICA_DATABASE_URL")
duckdb.sql(
    f"ATTACH IF NOT EXISTS '{pg_conn_string}' AS ads_pg (TYPE POSTGRES)"
)

# "Join" a daily aggregation back up with our Postgres source of truth
# This query is nearly instant but would take tens of seconds normally
duckdb.sql("""
    SELECT adv.name as 'advertiser_name', COUNT(*) as 'impression_cnt'
    FROM read_parquet("abfs://datalake/2025-02-01.parquet") pq
    INNER JOIN ads_pg.adserver_advertiser adv ON pq.advertiser_id = adv.id
    GROUP BY adv.name
    ORDER BY impression_cnt DESC
    LIMIT 10;
""")
This provides a number of advantages. For example, you can use DuckDB's COPY command to run aggregations against parquet files and copy the aggregated results back to Postgres directly from DuckDB. This sounds counter-intuitive, but if you're running many different kinds of aggregations against the same data, it can be faster than querying Postgres directly.
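Here's a minimal sketch of one way to do that, pushing an aggregation over a parquet file back into a hypothetical Postgres aggregate table through the attached database (via an INSERT into the attached table, and assuming the Azure filesystem is set up as in the first example):

import os

import duckdb

# Assumes the Azure/abfs filesystem is registered as in the earlier example.
duckdb.sql("INSTALL postgres")
duckdb.sql("LOAD postgres")

pg_conn_string = os.getenv("DATABASE_URL")
duckdb.sql(f"ATTACH IF NOT EXISTS '{pg_conn_string}' AS ads_pg (TYPE POSTGRES)")

# Aggregate a day of parquet data in DuckDB and write the result straight
# into a (hypothetical) Postgres aggregate table.
duckdb.sql("""
    INSERT INTO ads_pg.adserver_advertiserdailyaggregate (day, advertiser_id, impressions)
    SELECT DATE '2025-02-01' AS day,
           advertiser_id,
           COUNT(*) AS impressions
    FROM read_parquet('abfs://datalake/2025-02-01.parquet')
    GROUP BY advertiser_id;
""")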
"But David, won't it be slow to run a SQL query against a remote file?" Firstly, these queries are strictly analytical, nothing transactional. Remember that with any of the major clouds, these blob storage files are going to live in or near the data center where the rest of your servers are running. Querying them is a lot faster than I expected it to be. For reports, estimates, and other analytical workloads where folks are used to waiting a few seconds, it works fairly well.
While DuckDB is pretty smart about cross-database queries, I put "joins" in scare quotes in the previous section for a reason. These are not traditional database joins: at some level, DuckDB is reading records from one database into memory and using them to query the other. In some situations, including some we saw in our own workloads, performance can degrade quickly and spike memory and CPU usage on both Postgres and the DuckDB process. Expensive cross-database queries require a bit of extra testing and scrutiny.
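One pattern worth testing in cases like this (a sketch under the same assumptions as the earlier examples, not a prescription) is to materialize the smaller Postgres table into DuckDB first, so the join happens locally instead of pulling rows across databases repeatedly:

import os

import duckdb

# Assumes the Postgres attachment and abfs filesystem from the earlier examples.
duckdb.sql("INSTALL postgres")
duckdb.sql("LOAD postgres")

pg_conn_string = os.getenv("REPLICA_DATABASE_URL")
duckdb.sql(f"ATTACH IF NOT EXISTS '{pg_conn_string}' AS ads_pg (TYPE POSTGRES)")

# Copy the small dimension table into DuckDB once...
duckdb.sql("""
    CREATE OR REPLACE TABLE advertisers AS
    SELECT id, name FROM ads_pg.adserver_advertiser;
""")

# ...then join the parquet data against the local copy instead of the remote table.
duckdb.sql("""
    SELECT adv.name AS advertiser_name, COUNT(*) AS impression_cnt
    FROM read_parquet('abfs://datalake/2025-02-01.parquet') pq
    INNER JOIN advertisers adv ON pq.advertiser_id = adv.id
    GROUP BY adv.name
    ORDER BY impression_cnt DESC
    LIMIT 10;
""")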
Lastly, if anybody from the Azure team happens to be reading this, we'd love it if you'd add pg_parquet to Azure Managed Postgres now that it supports Azure storage. Dumping parquet files directly from Postgres would be even more efficient than doing it through DuckDB. DuckDB is still amazing for reading these files once they're written, but creating them directly from Postgres would be better still.
Hopefully this gave you some concrete examples of using DuckDB alongside PostgreSQL for analytical workloads. We believe there's a lot of potential to use DuckDB with parquet files for these kinds of queries in more places across EthicalAds and Read the Docs as well.
Thanks for tuning in to one of our more technical posts about some of the challenges of building an ad network without invasive tracking. Please let us know if you have any ideas or feedback on our product or service. We always love to hear from you.
Tagged with: postgresql duckdb performance engineering