ai-agents · mastra · openai · product-design

Building a 712-Species Wildlife Sticker Machine: Dataset-Driven AI Generation

Arun Batchu & Cascade (AI)·February 23, 2026·6 min read

✍️ This post was written collaboratively by Arun Batchu and Cascade, the AI pair programmer that built this pipeline alongside him.

Most AI content pipelines start with a blank canvas: "generate something interesting." The problem is that "interesting" is vague, and AI models left to their own devices tend to converge on the same handful of popular topics. We wanted something different — a pipeline with built-in variety, factual grounding, and a finite scope that would force creative diversity.

The answer was hiding in a government spreadsheet. The Minnesota Department of Natural Resources publishes a complete dataset of all 712 wildlife species found in the state. We took that list and wired it directly into a new Mastra agent: the Minnesota Wildlife Agent. Every day at 8 AM, it picks the next ungenerated species, researches it with Gemini, generates a dreamy watercolor sticker, and publishes it to the Shilpiworks store — fully automatically.

The Pipeline

The workflow has four steps, each with a single responsibility:

  1. Pick Species — Query the AgentRun table for all past `mn-wildlife` runs, extract the species names, and pass them as an exclusion list to the picker. The picker selects the next species not yet generated.
  2. Research Copy — Use Gemini to generate factual marketing copy about the species: habitat, behavior, conservation status, and what makes it distinctive. This is grounded research, not hallucinated fluff.
  3. Build Prompt — Combine the species name, factual context, and a carefully designed visual style into a single image generation prompt.
  4. Generate & Publish — Call the OpenAI Responses API to generate the watercolor illustration, run OCR and transparency validation, then publish to Vercel Blob + Postgres.
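The steps above can be sketched as plain async functions. To be clear, every name and function body below is illustrative, not the actual Mastra implementation — the real steps call Gemini, the OpenAI Responses API, and the publishing layer:

```javascript
// Illustrative sketch of the four-step pipeline. Each function owns
// exactly one responsibility, mirroring the workflow structure.

// Step 1: pick the first dataset species not yet generated.
function pickSpecies(allSpecies, generatedSpecies) {
  return allSpecies.find(s => !generatedSpecies.includes(s)) ?? null;
}

// Step 2: research factual marketing copy (stub for the Gemini call).
async function researchSpecies(species) {
  return { species, facts: `habitat, behavior, and status of the ${species}` };
}

// Step 3: combine species name, facts, and house style into one prompt.
function buildPrompt({ species, facts }) {
  return `Dreamy watercolor sticker of a ${species}. Grounding: ${facts}.`;
}

// Step 4 (generate & publish) is omitted here; it renders the image,
// validates it, and writes to Vercel Blob + Postgres.
async function runPipeline(allSpecies, generatedSpecies) {
  const species = pickSpecies(allSpecies, generatedSpecies);
  if (species === null) return null; // all species are done
  return buildPrompt(await researchSpecies(species));
}
```

The payoff of the single-responsibility split is that each step can be tested and swapped independently: the exhaustion check, for instance, short-circuits before any model call is made.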

The "No Repeats" Problem

The most interesting engineering challenge was ensuring the agent never generates the same species twice — even across hundreds of runs over months. The solution is simple but effective: before picking a species, we query the `AgentRun` database table for every successful `mn-wildlife` run and extract the species name from the theme field.

```javascript
// Get all previously generated species
const recentRuns = await db.agentRun.findMany({
  where: { status: 'success', type: 'mn-wildlife' },
  select: { theme: true },
});

const recentSpecies = recentRuns
  .map(r => r.theme?.replace('Minnesota Wildlife: ', ''))
  .filter(Boolean);

// Pass exclusion list to the picker step
const result = await run.start({
  inputData: { recentSpecies },
});
```

The picker step receives this list and filters it out of the 712-species dataset before selecting. No complex state management, no separate tracking table — the `AgentRun` table itself is the memory.
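One caveat of using the theme string as memory: the prefix written at publish time and the prefix stripped at read time must stay in sync, or past runs silently vanish from the exclusion list. A minimal sketch of that round-trip (the constant and helper names here are ours, not the actual code):

```javascript
// The AgentRun table is the memory, so the theme format is effectively
// a tiny storage schema. Centralizing the prefix keeps the write path
// and the read path from drifting apart.
const THEME_PREFIX = 'Minnesota Wildlife: ';

// Write path: build the theme stored on each run.
function toTheme(species) {
  return `${THEME_PREFIX}${species}`;
}

// Read path: recover the species name, or null for unrelated themes
// (callers filter the nulls out, as the .filter(Boolean) above does).
function fromTheme(theme) {
  return theme?.startsWith(THEME_PREFIX)
    ? theme.slice(THEME_PREFIX.length)
    : null;
}
```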

Why Factual Copy Matters

Generic sticker copy ("Beautiful wildlife sticker! Perfect for nature lovers!") is SEO noise. For a dataset-driven product line, the copy should be as specific as the subject. We prompt Gemini to research each species and return structured facts: its habitat range, notable behaviors, conservation status, and one distinctive characteristic that most people don't know.
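Structured facts also make the copy easy to assemble deterministically. Here is a hypothetical shape for what Gemini might return, with rendering into a description; the field names and example values are illustrative, not the production schema:

```javascript
// Hypothetical structured-facts object for one species.
// Field names and values are examples, not the real Gemini schema.
const rossGooseFacts = {
  commonName: "Ross's Goose",
  habitat: "nests in the Queen Maud Gulf region of arctic Canada",
  behavior: "migrates through Minnesota in large flocks alongside Snow Geese",
  conservationStatus: "recovered dramatically from near-extinction in the 1960s",
  distinctiveFact: "one of the smallest white geese in North America",
};

// Rendering from structured fields keeps every description specific:
// there is simply no slot for "Beautiful wildlife sticker!" filler.
function renderDescription(f) {
  return `The ${f.commonName} ${f.habitat} and ${f.behavior}. ` +
         `It is ${f.distinctiveFact}, and it ${f.conservationStatus}.`;
}
```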

The result is product descriptions that actually teach you something. A sticker of a Ross's Goose isn't just "a cute bird sticker" — it's a product with a description that mentions the species' nesting grounds in the Queen Maud Gulf and its dramatic population recovery from near-extinction in the 1960s. That specificity builds trust and differentiates the product.

The Math of Autonomous Content

712 species ÷ 5 per day ≈ 143 days of fully autonomous content from a single public dataset. No human intervention, no creative block, no repetition. Each sticker is factually grounded, visually distinct (the species determines the composition), and SEO-differentiated by the species name and its unique characteristics.

The insight: Public datasets are an underrated creative resource. A government spreadsheet, a museum catalog, a scientific taxonomy — any authoritative list of distinct subjects can become a content engine when paired with AI generation. The dataset provides variety and factual grounding; the AI provides the creative execution.

What We Learned

  • Use your database as memory. The `AgentRun` table already tracks every run. Querying it for exclusion lists is simpler than building a separate state management system.
  • Factual grounding beats generic prompts. Giving the AI real research about the subject produces dramatically better copy than asking it to "write something interesting about a Ross's Goose."
  • Dataset-driven pipelines are self-limiting in a good way. The finite scope (712 species) forces the agent to explore the full diversity of the dataset rather than defaulting to the most popular subjects.
  • Public data is underused. Government agencies, museums, and scientific institutions publish rich, authoritative datasets that are free to use. Most developers never think to reach for them.

The Minnesota Wildlife Agent is now live and running. By the time it exhausts all 712 species — sometime in late 2026 — Shilpiworks will have a complete, factually grounded collection of every wildlife species in the state of Minnesota. Not bad for a government spreadsheet and a few hundred lines of JavaScript. Browse the collection and grab your favorite species at shilpiworks.com →

Building with AI?

netrii helps ambitious SMBs navigate AI and emerging technology — strategy, experiments, and hands-on practice.

Schedule a Conversation