Constraints Over Flexibility: Eight Years of Infrastructure Generation

[Strong Opinion]

Most internal platforms fail because they try to be too flexible. They expose arbitrary environment variables, inline overrides, escape hatches, per-service customizations. That flexibility accumulates entropy until the platform becomes unmaintainable.

[/Strong Opinion]

There’s a different approach: constrain the interface aggressively, generate the boring parts, and compile to standard tools. Engineers declare what they’re building in a restricted format, and a compiler produces vanilla Terraform, CI configs, and cloud topology.

This approach has been successful in production for the last ~8 years.

Consider how much the infrastructure world has changed over those eight years, and you start to appreciate how much leverage this gave us.

I believe that this durability comes from what the approach refused to do rather than what it actually does. No clever abstractions. No escape hatches. Just constraints that prevent the platform from becoming a maintenance nightmare.

The Model: A Constrained DSL

At the center of the platform is a JSON-based service directory.

Engineers don’t write Terraform directly. They don’t have to wire SNS to SQS manually. They don’t configure autoscaling policies or build CI pipelines from scratch.

Instead, they just declare what they want.

For example:

{
  "service_type": "safari-worker",
  "tier": "worker",
  "name": "partner-program-email-worker",
  "responsible_teams": ["papr"],
  "auto_scaling": {
    "min_capacity": 1,
    "max_capacity": 2
  },
  "resources": {
    "memory": 256
  },
  "pubsub": [
    {
      "topic_type": "global",
      "filter_policy": ["import.error-files-ready"]
    }
  ]
}

That’s all you write.

From that single JSON definition, the platform generates everything: CI/CD pipelines, IAM roles, pub/sub wiring, autoscaling configuration, logging, metrics, alerts, dead letter queues, ownership routing, deployment scaffolding, and all the Terraform configuration.

You pick a service type and fill in a few parameters. Everything else is generated for you.

The Service Catalog

Over time, the system converged into a small set of service archetypes:

safari-service: HTTP backend with PostgreSQL

safari-worker: Event-driven async consumer

safari-runner: Cron-based job

s3-vertical: S3 ingestion topology (bucket + notifications + wiring)

file-processing-pipeline: Multi-step file transformation and validation (more on this later)

Instead of engineers asking “How should we deploy this?”, they ask “What type of system is this?” That one shift removes dozens of infrastructure decisions per service.

And when a new pattern emerges - like file processing pipelines - it becomes a new archetype that everyone can use.
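One way to picture the catalog is as a dispatch table: each service type maps to one generator, and a new pattern becomes one new registry entry rather than a local exception. The following is a hedged sketch of that idea; the function names, decorator, and output strings are all illustrative, not the platform’s actual code.

```python
# Illustrative only: an archetype registry mapping service_type to a generator.
# Names here are assumptions, not the platform's real code.
ARCHETYPE_GENERATORS = {}

def archetype(name):
    """Register a generator function for a service archetype."""
    def register(fn):
        ARCHETYPE_GENERATORS[name] = fn
        return fn
    return register

@archetype("safari-worker")
def generate_worker(defn):
    return f'# terraform for worker "{defn["name"]}"'

@archetype("file-processing-pipeline")
def generate_pipeline(defn):
    return f'# terraform for pipeline "{defn["vertical_name"]}"'

def generate(defn):
    service_type = defn["service_type"]
    if service_type not in ARCHETYPE_GENERATORS:
        # Unsupported shapes fail fast instead of becoming per-team exceptions.
        raise ValueError(f"unsupported archetype: {service_type}")
    return ARCHETYPE_GENERATORS[service_type](defn)
```

The design choice this sketches: supporting a new shape means adding one registry entry that everyone gets, never an override that one team gets.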

Pausing here for a second: I wrote about some of these decisions in The Hippo PAAS. The number of archetypes here is small, and you’re probably thinking “this will never work for us, we have dozens of types.” But the point of this platform isn’t to accommodate everything every team wants to do. It’s to dictate the platform topology and how you compose it through service archetypes. You could make an http-service archetype that is essentially a catch-all for “this service exposes an HTTP interface through an nginx proxy and has lint and test commands in the pipeline.” It can be as narrow or as wide as you want.

“We Just Need To Send an Email”

This platform actually changed how we talk about work.

Look, systems have capabilities. Individual teams shouldn’t, at least not in the operational sense.

A domain shouldn’t have to “figure out email.” It should just depend on a capability that already exists: send email. That means one system owns the provider keys, the retries, the monitoring. Not 200 different services all reinventing email sending.

The same pattern applies to storage, pub/sub, cron execution, and external integrations. Once you implement capabilities as standardized systems, you get a bunch of benefits for free: secrets aren’t duplicated across 200 services, you can swap providers without touching business logic, monitoring and retries are consistent everywhere, security posture is uniform, and upgrades roll out once instead of 200 times.

That’s what the platform actually enforces - not just infrastructure patterns, but how teams think about ownership.

Known Shapes

Across different domains, the architecture keeps repeating: a gateway for synchronous entry points, a worker for asynchronous orchestration, and a runner for scheduled execution.

That’s not just cosmetic. It matters because engineers can move between different domains without having to relearn how deployment works, how scaling is configured, how logging is structured, how alerts are routed, or how runtime upgrades happen. Incidents look the same across domains. Dashboards look the same. Changes roll out the same way.

This was true when we had around 60 engineers and it’s still true at around 200. The platform didn’t really make anyone faster at solving their actual business problems. It just eliminated a whole bunch of problems that shouldn’t have existed in the first place.

Infrastructure as Compilation

The most important design decision we made - and we made this back in 2018 - was that the DSL compiles to completely vanilla Terraform.

There’s no runtime control plane. No custom CDK wrapper. No dynamic infrastructure layer. No platform service orchestrating things behind the scenes.

The pipeline looks like this:

Service DSL → Schema validation → Service directory typed client → Python generation code → Jinja2 templates → Terraform → AWS.

The service directory isn’t just a folder of JSON files. It’s a Python library that exposes a typed client. Engineers edit JSON, but when they commit, the JSON gets validated against Marshmallow schemas with an OpenAPI spec. If the JSON is malformed or non-compliant, the commit fails. No arbitrary fields, no wrong types, no escape hatches.

The generation code doesn’t parse JSON directly - it uses the typed client from the service directory. This means we’re always working with validated, structured data. The pipeline is basically a bunch of Python and Jinja2 templates that would make a functional programming purist cry. But it’s predictable.

Terraform stays Terraform. If the generator disappeared tomorrow, the infrastructure would still be standard Terraform code that anyone could work with.

Terraform is generated and committed directly to the terraform repositories. It does not depend on any generation service or API at runtime. We also chose to divide Terraform aggressively, which gives us more control over the blast radius. I wrote a bit about that here: Breaking the terraform monolith

Here’s what the generation layer actually looks like. This is a simplified Jinja2 template for S3 verticals:

{% for s3_vertical in s3_vertical_list %}

module "s3-{{s3_vertical.vertical_name}}" {
  source = "./modules/s3"

  versioning_enabled        = true
  environment               = var.environment
  name                      = "{{s3_vertical.vertical_name}}"
  notifications_enabled     = true
  object_ownership          = "BucketOwnerEnforced"
  cors_enabled              = {{ s3_vertical.cors_enabled | lower }}
  malware_detection_enabled = {{ s3_vertical.malware_detection_enabled | lower }}
  lifecycle_days            = {{ s3_vertical.lifecycle_days | int }}
  {%- if s3_vertical.bucket_notification_prefix_filter != "" %}
  filter_prefix             = "{{ s3_vertical.bucket_notification_prefix_filter }}"
  {%- endif %}

  allowed_file_extensions = {{ s3_vertical.allowed_file_extensions | tojson }}

  allow_roles = local.allow_roles

  notifications = {
    region     = "us-west-2"
    topic_name = "{{s3_vertical.bucket_notification_topic_name}}-${var.environment}"
    account_id = var.parallel_account_id
  }

  tags = {
    Environment = var.environment
    ServiceName = "{{s3_vertical.bucket_notification_topic_name}}"
  }

  replications = [
    {%- for replication in s3_vertical.replications -%}
      {
        environment             = "documents-{{replication.environment}}",
        source_account_id       = "{{replication.source_account_id}}",
        destination_bucket_name = "{{replication.destination_bucket_name}}"
      },
    {%- endfor -%}
  ]

  whitelists = [
    {%- for whitelist in s3_vertical.whitelists -%}
      {
        environment = "documents-{{whitelist.environment}}",
        source_ips  = {{whitelist.source_ips | tojson}}
      },
    {%- endfor -%}
  ]
}

{% endfor %}

Nothing fancy. The Python generation code uses the service directory’s typed client to get the validated service definitions, iterates through them, runs them through Jinja2, outputs Terraform. The Terraform module it references is just a normal Terraform module that anyone on the team could read and understand.
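A minimal sketch of that driver, with the template and field names simplified to fit here. `StrictUndefined` is my assumption, added so a missing field fails the build instead of rendering as a blank.

```python
# Hedged sketch of the generation driver: validated definitions in, vanilla
# Terraform out. The real templates and typed-client objects are richer.
from jinja2 import Environment, StrictUndefined

TEMPLATE = """\
{% for s3_vertical in s3_vertical_list %}
module "s3-{{ s3_vertical.vertical_name }}" {
  source       = "./modules/s3"
  name         = "{{ s3_vertical.vertical_name }}"
  cors_enabled = {{ s3_vertical.cors_enabled | lower }}
}
{% endfor %}
"""

def render_s3_verticals(s3_vertical_list):
    env = Environment(
        undefined=StrictUndefined,  # missing fields fail loudly at build time
        trim_blocks=True,
        lstrip_blocks=True,
    )
    return env.from_string(TEMPLATE).render(s3_vertical_list=s3_vertical_list)

# In the real pipeline the rendered output is written into the terraform repo
# and committed, so `terraform plan` only ever sees plain HCL.
print(render_s3_verticals([
    {"vertical_name": "partner-program-input", "cors_enabled": True},
]))
```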

The schema validation is actually a key part of how constraints are enforced. You can’t add arbitrary fields to the JSON because Marshmallow won’t validate them. You can’t use wrong types. You can’t sneak in escape hatches. The system just fails at build time and tells you what you did wrong.

Here’s something to consider: when we built this, Terraform didn’t even have for_each yet. It’s not the same terraform we have today. We were generating what Terraform itself couldn’t express declaratively at the time. But because we compiled to vanilla Terraform primitives, when those features eventually shipped, our output stayed compatible. We didn’t have to rewrite anything.

Constraints Are a Feature

Most internal platforms fail because they try to be too flexible. We deliberately constrained the DSL. You can’t arbitrarily inject environment variables. You can’t manually wire AWS primitives together. You can’t bypass the topology rules.

If something isn’t supported, it has to become a new archetype - not a local exception to the rules.

I’ll be honest: this takes getting used to. It took engineers time to build the habit of thinking in archetypes and system capabilities. We had to be the “No” team for a while, and that sucked. But the payoff is that we can rotate PG passwords for every single service on our platform in one go. The credentials all look exactly the same, they’re stored in a predictable place in our vault, the database names follow the convention, and so on.
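A hedged sketch of what that uniformity buys you. The vault path layout, the `<name>_app` user convention, and the client objects are all assumptions made up for illustration, not our actual tooling.

```python
# Hypothetical sketch: because credentials live at a predictable vault path and
# database users follow a naming convention, one loop rotates the whole fleet.
import secrets

def vault_path(service_name):
    # Assumed layout; the point is only that it is predictable.
    return f"secret/services/{service_name}/postgres"

def rotate_all(service_names, vault, db):
    """Rotate the PG password for every service in one pass."""
    for name in service_names:
        new_password = secrets.token_urlsafe(32)
        # Assumed convention: each service connects as <name>_app.
        db.alter_user_password(user=f"{name}_app", password=new_password)
        vault.write(vault_path(name), {"password": new_password})
```

With 15 different data stores and 15 different secret layouts, no loop like this can exist; that is the operational cost the constraints avoid.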

Let’s take a real-life example: the safari-service archetype comes with PostgreSQL. That’s it. Over the years, teams would come asking for MongoDB, or DynamoDB, or whatever. We’d have to evaluate it at the company level - is this a pattern we’re going to see repeatedly, or is this team just optimizing locally? Sometimes the answer was “no, use Postgres.” Sometimes it became a new archetype. But we never let individual teams just bolt on their preferred database because it felt right to them. Again, this is a strong stance. It takes real effort to elevate the discussion to that level and stand behind it.

Could we have built more archetypes to accommodate different database choices? Sure. The system could technically handle safari-service-mongo or safari-service-dynamo. But organizationally, we didn’t want to. That’s not a limitation of the DSL or the generation layer - it’s a philosophical choice. We valued operational simplicity over local optimization.

That meant some teams were stuck using a database that wasn’t their first choice. That’s a real cost. But the alternative was 200 services running 15 different data stores with 15 different backup strategies, 15 different failure modes, and 15 different people who knew how to fix them when they broke at 3am.

Decision Elimination Is the ROI

By 2026 we have around 200 services across these archetypes. Every service would normally come with a huge pile of decisions: CI configuration, Docker setup, runtime upgrades, deployment strategy, scaling, logging, metrics, alerts, IAM, networking, secret management.

Before this platform, provisioning a new service meant copying an existing repository, manually editing CI files, creating IAM roles, wiring up SNS/SQS, adding alerts, and coordinating with infrastructure engineers. It could easily take days.

Now it takes a small JSON definition, a pull request, and a pipeline run. Provisioning takes minutes.

But honestly, the real value isn’t even the first deployment. It’s the compounding effect over time.

The ROI is something like: 200 services × dozens of decisions avoided per service × years of that compounding.

No more “how do I deploy this?” threads showing up in Slack at 4pm on a Friday. No more bespoke Dockerfiles that only one person understands. No more incidents caused by configuration drift. No more snowflake services.

Runtime upgrades happen everywhere the same way. CI changes propagate everywhere the same way. Security improvements roll out once, not 200 times.

At small scale, this just feels like a nice convenience. At 200 services, it’s the difference between a platform team that spends all day reviewing infrastructure PRs and one that actually has time to build new things.

A Recent Example: File Processing Pipelines

Just a few weeks ago, a team came to us with a problem. They needed to process files from partners before ingestion. Files came in as XLS and needed virus scanning, column validation, conversion to CSV, and then delivery to their existing ingestion service.

In most organizations, this team would go architect a custom solution. They’d write Terraform for S3 buckets, wire up Step Functions manually, build error handling from scratch, figure out monitoring, and then maintain it forever. And the next team with a similar need would do it all over again, differently.

We recognized this wasn’t a one-off. It’s a pattern. Teams keep needing to transform and validate files before consumption.

So we’re adding it as a new archetype: file-processing-pipeline.

Here’s what it looks like:

{
  "service_type": "file-processing-pipeline",
  "vertical_name": "partner-program-input",
  "input_bucket": {
    "cors_enabled": true
  },
  "destination_bucket": "partner-program-ingestion-bucket",
  "steps": [
    {
      "type": "lambda",
      "name": "virus-scan"
    },
    {
      "type": "lambda",
      "name": "validate-columns"
    },
    {
      "type": "lambda",
      "name": "xls-to-csv"
    },
    {
      "type": "lambda",
      "name": "add-metadata"
    }
  ]
}

That’s all you write. The platform generates the Step Function state machine, all the S3 buckets, IAM roles, SNS topics for errors, DLQs, CloudWatch logging, monitoring, and all the Terraform to wire it together.

Files go through the pipeline steps in order. If they make it through, they land in the destination bucket ready for consumption. If any step fails, an SNS event gets published with details about what failed and why.
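To make the orchestration concrete, here is a hedged sketch of how a steps list like the one above could compile into an Amazon States Language definition: Task states chained in declaration order, with every step catching into a shared SNS failure publisher. The ARNs, state names, and builder function are placeholders, not the platform’s real output.

```python
# Hedged sketch: compile the "steps" list into an Amazon States Language
# definition. All ARNs here are placeholders.
import json

def build_state_machine(pipeline):
    steps = pipeline["steps"]
    states = {}
    for i, step in enumerate(steps):
        state = {
            "Type": "Task",
            "Resource": f"arn:aws:lambda:REGION:ACCOUNT:function:"
                        f"{pipeline['vertical_name']}-{step['name']}",
            # Any failure in any step routes to the shared publisher.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "PublishFailure"}],
        }
        if i + 1 < len(steps):
            state["Next"] = steps[i + 1]["name"]
        else:
            state["End"] = True  # last step lands the file in the destination bucket
        states[step["name"]] = state
    # One shared failure state publishes the error payload to SNS.
    states["PublishFailure"] = {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish",
        "Parameters": {
            "TopicArn": "arn:aws:sns:REGION:ACCOUNT:pipeline-errors",
            "Message.$": "$",
        },
        "End": True,
    }
    return json.dumps({"StartAt": steps[0]["name"], "States": states}, indent=2)
```

Reordering the pipeline is a one-line change to the JSON; the chaining, error routing, and wiring are regenerated.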

The ingestion service stays completely dumb to file formats. It only knows how to process CSV. The pipeline handles everything else.

This is happening right now, in 2026. Eight years after the first version of this system. We’re adding new capabilities using the same philosophy: constrain the interface, generate the boring parts, compile to vanilla infrastructure.

What I’d Tell Someone Starting a Platform Team Today

If I had to boil this down to advice:

Generate the boring parts. Constrain your configuration aggressively. Prefer adding new service types over adding escape hatches. Compile to standard tools your organization already trusts. Make ownership explicit. Treat your platform patterns as products.

We use the term “off-the-shelf product”. Products are the ingredients; the way you combine them is your recipe.

Final Thought

The first iteration of this was built in 2018, when I was with a different company. I built it again at Hippo in 2021 on the same foundations. It’s 2026 now and the core principles are still the same.

Engineers describe what they’re building. The platform materializes the infrastructure topology.

Eight years later, the boring parts are still boring.

And that’s exactly the point.


Credits: Thanks to Jesse Myers, Gabe Lam, and Pankaj Ghosh for being amazing design and engineering partners throughout this journey.

Discuss on Hacker News