Building in Public with AI — 3 Months of Lessons

February 10, 2026 · Enjin Studio

We decided to run twelve agents in public for three months. The idea was simple: expose progress, failures, and metrics so users could see what works and what doesn’t. The result was messier and more instructive than we expected.

Why twelve agents?

Twelve mirrored the roles we needed: ingest, triage, validator, scheduler, copywriter, notifier, deployer, monitor, analytical engine, cost manager, reviewer, and a human-overseer bot. Each agent had a clear remit and observable metrics. The human-overseer was not optional — it managed schedules and stepped in when aggregated confidence dropped below a threshold.
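
As a rough illustration (not our production code; the agent names, weights, and 0.7 threshold here are made up), the overseer's escalation rule boiled down to something like this:

```python
from dataclasses import dataclass

# Hypothetical sketch of the overseer's escalation rule; names, weights,
# and the 0.7 threshold are illustrative, not our production values.

@dataclass
class AgentReport:
    name: str
    confidence: float   # 0.0-1.0, self-reported by the agent
    weight: float = 1.0  # how much this agent counts in the aggregate


def aggregate_confidence(reports: list[AgentReport]) -> float:
    """Weighted mean of per-agent confidence scores."""
    total_weight = sum(r.weight for r in reports)
    return sum(r.confidence * r.weight for r in reports) / total_weight


def overseer_should_intervene(reports: list[AgentReport], threshold: float = 0.7) -> bool:
    """Escalate to a human when the ensemble's aggregate confidence dips."""
    return aggregate_confidence(reports) < threshold


if __name__ == "__main__":
    reports = [
        AgentReport("triage", 0.92),
        AgentReport("validator", 0.50, weight=2.0),  # validator doubts the input
        AgentReport("copywriter", 0.80),
    ]
    if overseer_should_intervene(reports):
        print("Pausing the pipeline and paging a human reviewer")
```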

Costs and economics

The monthly bill: baseline compute and API costs for the ensemble averaged about $18k/month. That includes frequent small calls (status checks, validations) and a handful of heavier planning runs. Infra and engineering overhead (observability, state stores) added another $6k/month. So roughly $24k/month to run this modest production-grade multi-agent system.

Surprises and failures

1) Correlated errors: when one agent misclassified an input, downstream agents amplified the error. We learned to add consistency checks and an idempotent design so failures don't cascade (a rough sketch follows this list).

2) Human trust: exposing agents in public forced us to design for explainability. Users demanded visible audit trails, which, ironically, reduced our freedom to iterate quickly.

3) Cost leakage: small, frequent calls add up. We initially set heartbeat intervals too aggressively and paid for it. Tuning the cadence saved 25% of calls without losing responsiveness (the back-off idea is sketched below).
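
To make the consistency checks and idempotency from point 1 concrete, here is a minimal sketch of the pattern; the labels, store, and key scheme are hypothetical, not our actual pipeline:

```python
import hashlib
import json

# Hypothetical sketch of the two guards from point 1: a consistency check on
# the upstream classification, and an idempotency key so retries never
# double-process the same item. Names and rules are illustrative.

_processed: set[str] = set()  # in production this would be a durable store
ALLOWED_LABELS = {"billing", "bug", "feature_request", "general"}


def idempotency_key(item: dict) -> str:
    """Stable key derived from the item's content, not from retry count."""
    return hashlib.sha256(json.dumps(item, sort_keys=True).encode()).hexdigest()


def handle_triaged_item(item: dict) -> None:
    # Consistency check: refuse to amplify an upstream misclassification.
    if item.get("label") not in ALLOWED_LABELS or not item.get("text"):
        raise ValueError(f"inconsistent upstream payload: {item!r}")

    # Idempotency: a retried or duplicated message is a no-op.
    key = idempotency_key(item)
    if key in _processed:
        return
    _processed.add(key)

    # ... downstream work happens here ...
```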

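And the cadence tuning from point 3 was essentially a back-off: agents with work keep a tight heartbeat, idle agents check in less often. A hypothetical version of that rule:

```python
# Hypothetical heartbeat back-off; the intervals and ceiling are illustrative.
# Busy agents keep a tight heartbeat; idle agents back off toward a cap,
# which is where most of our wasted status-check calls were hiding.

BASE_INTERVAL_S = 15   # heartbeat while the agent is actively working
MAX_INTERVAL_S = 300   # ceiling so an idle agent still checks in


def next_heartbeat_interval(current_interval: float, had_work: bool) -> float:
    """Reset to the base cadence on activity, otherwise double toward the cap."""
    if had_work:
        return BASE_INTERVAL_S
    return min(current_interval * 2, MAX_INTERVAL_S)
```
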
What worked

- Shadow mode: running new logic in parallel and comparing outcomes before switching saved us from two potential outages (a minimal sketch follows this list).
- Micro-commit visibility: publishing short changelogs and examples led to useful feedback from users who caught edge cases we missed.
- Clear ownership: assigning a single owner per agent (person + SLA) prevented the blame game and sped up incident response.
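
Here is roughly what the shadow-mode pattern looks like in practice; the triage functions below are stand-ins, not our real agents:

```python
import logging

# Minimal sketch of the shadow-mode pattern (function names are illustrative).
# The candidate logic runs on every request, but only the current logic's
# result is returned; divergences are logged for offline review.

logger = logging.getLogger("shadow")


def classify_current(ticket: dict) -> str:
    """Existing triage logic (stand-in)."""
    return "billing" if "invoice" in ticket["text"].lower() else "general"


def classify_candidate(ticket: dict) -> str:
    """New triage logic under evaluation (stand-in)."""
    text = ticket["text"].lower()
    return "billing" if "invoice" in text or "refund" in text else "general"


def triage(ticket: dict) -> str:
    live = classify_current(ticket)
    try:
        shadow = classify_candidate(ticket)
        if shadow != live:
            logger.info("shadow divergence id=%s live=%s shadow=%s",
                        ticket["id"], live, shadow)
    except Exception:
        logger.exception("shadow logic failed id=%s", ticket["id"])
    return live  # users only ever see the current logic's answer
```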

Culture and team impact

Running public agents made the team more disciplined. Engineers wrote better docs because users would read them. We also saw morale lift when small wins were visible: a 30% reduction in support tickets, a daily summary that automated a mundane report. These concrete wins mattered more than internal metrics.

Lessons for others

- Expect higher costs than naive estimates. Budget for infra and observability, not just model calls.
- Start with narrow roles and a human overseer. Autonomy should be gradual.
- Make everything explainable by default. Users will demand it.
- Use shadow mode liberally before any behavioral change goes live.

Closing

Three months was enough to see both the power and the fragility of multi-agent systems. We shipped real value, learned to tune for cost and reliability, and discovered that public scrutiny is a harsh but effective quality tool. If you’re thinking of running agents in public, prepare for scrutiny, budget properly, and treat explainability as a feature, not an afterthought.
