Using embeddings in production with Postgres & Django for niche ad targeting

Eric Holscher

This is an update to our original post on content-based ad targeting. In this post, I'll talk a bit more about our next step, using machine learning (embeddings specifically) to build better contextual ad targeting.

At the end of our last post, we were crawling all of our publishers' pages and categorizing each page into topics based on its text. We did this by training a model with ~100 examples of each topic, and then storing the topics in our database for fast ad serving.

This gave us a good starting point for targeting ads by topic, but we wanted to get more granular.

Targeting each page individually with embeddings

Our new approach is to use text embeddings to represent both the advertiser's landing page and our publishers' pages. This allows us to generate a representation of each page that can be compared against the others.

We're currently using Python's SentenceTransformers library to generate these embeddings. We will likely upgrade to a more advanced model in the future, but this was perfect for our initial tests.

Generating, storing, and querying embeddings

A simple example of what this looks like might be:

import requests
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer

# Generate embeddings for a page

# MODEL_NAME, CACHE_FOLDER, and url are placeholders for your own values;
# e.g. MODEL_NAME = "all-MiniLM-L6-v2" produces the 384-dimension vectors stored below
model = SentenceTransformer(MODEL_NAME, cache_folder=CACHE_FOLDER)
text = BeautifulSoup(requests.get(url).text, "html.parser").get_text()
embedding = model.encode(text)
print(embedding.tolist())
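
Once we have embeddings for two pages, comparing them is a single cosine-similarity call. Here's a minimal sketch (advertiser_text and publisher_text are placeholders for page text extracted the same way as above):

from sentence_transformers import SentenceTransformer, util

# Compare an advertiser's landing page against a publisher page

model = SentenceTransformer(MODEL_NAME, cache_folder=CACHE_FOLDER)
advertiser_embedding = model.encode(advertiser_text)
publisher_embedding = model.encode(publisher_text)

# Cosine similarity: closer to 1.0 means the pages cover more similar content
similarity = util.cos_sim(advertiser_embedding, publisher_embedding)
print(float(similarity))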

We're then using pgvector and pgvector-python to manage these embeddings in Django & Postgres, which is what we're already using in production.

from django.db import models
from pgvector.django import VectorField

# Store the content in Postgres/Django

class Embedding(models.Model):
    # FK where we keep metadata about the URL
    analyzed_url = models.ForeignKey(
        AnalyzedUrl,
        on_delete=models.CASCADE,
        related_name="embeddings",
    )

    # Model name so we can use different models in the future
    model = models.CharField(max_length=255, default=None, null=True, blank=True)

    # The actual embedding
    vector = VectorField(dimensions=384, default=None, null=True, blank=True)
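
One setup detail: the pgvector extension has to be enabled in Postgres before the migration that adds the VectorField runs. pgvector-python ships a migration operation for this; a minimal hand-written migration might look like:

from django.db import migrations
from pgvector.django import VectorExtension

# Enable the pgvector extension before adding any VectorFields

class Migration(migrations.Migration):
    dependencies = []

    operations = [
        VectorExtension(),
    ]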

Then we're able to query the database for the most similar publisher pages to an advertiser's landing page:

from pgvector.django import CosineDistance
from .models import Embedding

# Find the publisher pages most similar to an advertiser's landing page embedding

Embedding.objects.annotate(
    distance=CosineDistance("vector", embedding)
).order_by("distance")
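
In practice, a query like this would likely also be filtered to embeddings generated by a single model and limited to a handful of results. A rough sketch, reusing the MODEL_NAME placeholder from earlier (the limit of 25 is arbitrary):

# Only compare embeddings produced by the same model, and only keep the closest matches
similar_pages = (
    Embedding.objects.filter(model=MODEL_NAME)
    .annotate(distance=CosineDistance("vector", embedding))
    .order_by("distance")[:25]
)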

Try out a demo

You can see a screenshot of our niche targeting in action at the top of this page. This is a simple proof of concept, but it shows how we're able to target pages focused on MongoDB and databases when serving a MongoDB ad.

You can try out our Niche Targeting Demo, and let us know how it goes!

Advantages of Niche Targeting

There is a huge win in terms of both privacy and user experience with this approach:

  • We're able to target ads to pages without needing to know anything about the user. The better we get at targeting, the more powerful our ethical advertising approach becomes, and the larger we can scale our network.
  • The user experience of minimalist, well-targeted ads is better. We're able to show fewer ads and charge more for them because they perform better. This is a win-win for everyone.
  • We were able to implement this approach with minimal changes to our existing infrastructure, mostly because we're already heavily invested in the Python ecosystem and Postgres.

Challenges and Considerations

We have a few challenges to overcome with this approach:

  • We currently aim for a 50ms ad response time, and this approach is currently slower than that. We're working on optimizing this with indexing (see the sketch after this list), and might look at using an in-memory vector store in the future.
  • Embeddings currently work pretty well, but they can associate things that aren't actually relevant. For example, we've run into issues where the same word is used to mean different things (e.g. "model"), and the embeddings can get confused.
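
On the indexing point above: pgvector supports approximate-nearest-neighbor indexes, and pgvector-python exposes them through Django's Meta.indexes. Here's a sketch of what an HNSW index on the Embedding model could look like (the parameters shown are pgvector's defaults, not tuned production values):

from django.db import models
from pgvector.django import HnswIndex, VectorField

# Approximate nearest-neighbor index to speed up cosine-distance queries

class Embedding(models.Model):
    # ... same fields as above ...
    vector = VectorField(dimensions=384, default=None, null=True, blank=True)

    class Meta:
        indexes = [
            HnswIndex(
                name="embedding_vector_hnsw_idx",
                fields=["vector"],
                m=16,
                ef_construction=64,
                opclasses=["vector_cosine_ops"],
            ),
        ]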

Conclusion

This approach is still in its early stages, but we're excited about the potential. The better we get at ethical ad targeting, the more everyone in our network benefits:

  • Advertisers get better ad targeting, ensuring they show up in front of the right users.
  • Publishers get more money while showing a single ad, rather than resorting to multiple, larger ads that take over their site.
  • Users get a better experience, with ads that are relevant to the content they are reading.

This is our vision for advertising, and we're excited about the potential of this approach.

Thanks so much to Simon Willison for his blog post on embeddings, which is what inspired me to try this approach.
