How We Built a Repeatable Paper-to-Podcast Workflow That Actually Ships
At first glance, "paper to podcast" sounds like the kind of thing AI should make trivial. Feed a PDF into a model, ask for a script, generate some voices, and publish the MP3. But when we actually tried to turn research papers into Netrii podcast episodes, the real problem was not generation. The real problem was everything around generation: extracting the source cleanly, writing dialogue that sounded like two humans instead of alternating narrators, managing hardcoded workflow scripts, mixing audio in a way that felt polished, and making sure the final asset actually made it to production.
What emerged from that work was not just a better prompt. It was a repeatable operating procedure. That ended up being the real product: a workflow we can run again without rediscovering the same mistakes every time.
The Naive Version Is Easy
The naive workflow is straightforward. Extract the paper, ask the model for a two-host script, generate the voice lines, stitch them together, and upload the final MP3. If all you care about is "did audio come out," that is enough. We had that working quickly.
But the first passes had all the classic AI-content failure modes. The hosts sounded like they were taking turns reading mini-essays. The openings felt abrupt. Pronunciation was inconsistent. Extra sound effects made the episodes feel more synthetic, not more polished. And even after we had good local audio, production still lagged behind because the website repo had not actually been updated and pushed.
The key lesson: the hard part of AI podcasting was not converting text into speech. It was defining the workflow tightly enough that quality and deployment became repeatable.
What Had To Become Explicit
The turning point was when we stopped treating the process as a loose creative exercise and started documenting it like an operational system. Once we did that, several previously implicit steps had to become explicit.
1. Source extraction comes first. The PDF is often richer and more structured than the product-page summary, so the paper itself has to be treated as the source of truth.
2. Dialogue quality needs rules. It is not enough to ask for a conversation. We had to specify greetings, acknowledgement between hosts, conversational pacing, and a ban on alternating monologues.
3. Hardcoded scripts create real operational friction. Our generation and assembly scripts still point at one episode at a time, which means retargeting paths carefully for every run.
4. Audio polish is mostly subtractive. Subtle background music helped. Additional sound effects usually hurt.
5. Publishing is part of the workflow. Local success means nothing if the website assets are not copied, committed, and pushed.
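The friction in point 3 can be reduced with a small helper. This is a hypothetical sketch, not our actual tooling: it assumes the generation script stores its target in a top-level `EPISODE_DIR` variable, and both that name and the script format are illustrative assumptions.

```shell
# Hypothetical retarget helper: the generation and assembly scripts
# hardcode one episode directory, so each run starts by rewriting that
# constant. EPISODE_DIR and the in-script format are assumptions.
retarget() {
  script_path="$1"
  new_episode="$2"
  # Keep a .bak copy so a bad retarget is recoverable.
  sed -i.bak "s|^EPISODE_DIR=.*|EPISODE_DIR=\"output/$new_episode\"|" "$script_path"
}
```

Something like `retarget generate_voices.sh next-paper-episode` would then point the next run at `output/next-paper-episode` without hand-editing the script.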
The Dialogue Problem Was Bigger Than We Expected
One of the biggest surprises was how quickly a technically accurate script can still sound wrong when spoken aloud. AI is very good at producing coherent exposition. It is much weaker at producing believable co-host interaction. Without strong constraints, the result is usually two people taking turns delivering polished paragraphs. Informative, yes. Human, no.
We found that small social details mattered disproportionately. The hosts needed to greet each other. Both hosts needed to acknowledge each other by name across the episode. Dense lines had to be split into shorter turns. And whenever only one host sounded relational while the other sounded like a lecturer, the illusion broke immediately.
- Open like a real show. Greet listeners and the co-host before starting the argument.
- Balance acknowledgements. If only one host says the other person's name, the dialogue feels lopsided.
- Split essay sentences. A line that looks elegant on the page may sound stiff in speech.
- Review with your ears, not just your eyes. Spoken realism is a separate editing pass.
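The "split essay sentences" rule can be partly mechanized before the listen-through. This sketch assumes a `HOST: text` transcript format and a 40-word threshold, both of which are illustrative choices rather than our real script format:

```shell
# Flag dialogue turns so long they will likely sound like read-aloud
# essays rather than speech. "HOST: text" format and the 40-word cutoff
# are illustrative assumptions.
flag_long_turns() {
  awk '{
    i = index($0, ": ")
    if (i == 0) next
    host = substr($0, 1, i - 1)
    n = split(substr($0, i + 2), words, " ")
    if (n > 40) printf "line %d (%s): %d words, consider splitting\n", NR, host, n
  }' "$1"
}
```

A check like this does not replace listening to the episode, but it cheaply surfaces the turns most likely to break the conversational illusion.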
Audio Quality Turned Out To Be Mostly About Restraint
We also learned that more audio production is not the same thing as better audio production. Our assembly pipeline supported intro, transition, reflective, and outro sound effects. In theory that sounded sophisticated. In practice it made the podcast feel busier and more artificial than it needed to be.
The better standard was simpler: clean dialogue, a subtle background music bed, and no extra audible effects unless they were intentionally requested. That one change made the episodes feel more like thoughtful expert conversations and less like an overproduced demo.
This also forced us to care about details that are easy to dismiss until they are wrong. A brand pronunciation issue like Netrii coming out as "netri-eye" instead of "nethree" is not a minor glitch. In audio, that kind of mistake is part of the product experience.
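One way to handle this is a pre-TTS substitution pass over the script text. The function below is a hypothetical sketch: the respelling `neh-tree` is an illustrative stand-in for whatever phonetic spelling your particular TTS engine actually honors.

```shell
# Hypothetical pre-TTS pronunciation pass: respell brand names so the
# voice model reads them correctly. The "neh-tree" respelling is an
# illustrative assumption, not a verified TTS phoneme spec.
to_tts_text() {
  printf '%s' "$1" | sed 's/Netrii/neh-tree/g'
}
```

The point is less the one-liner than the habit: brand pronunciations get fixed explicitly in the text sent to the voice model, not left to chance.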
The mix itself is a single ffmpeg pass: loop the music bed under the dialogue at low volume, fade it out near the end, and mix down to one track.

```shell
ffmpeg -i "output/<episode>/<Podcast_Final>.mp3" \
  -stream_loop -1 -i "output/<episode>/background_music_custom.mp3" \
  -filter_complex "[1:a]volume=0.05,afade=t=out:st=<fade_start>:d=6[music];[0:a][music]amix=inputs=2:duration=first:dropout_transition=2[out]" \
  -map "[out]" -c:a libmp3lame -q:a 2 \
  "output/<episode>/<Podcast_WithMusic>.mp3" -y
```

Another important lesson: in AI audio, polish often comes from removing distracting layers, not adding them.
The Real Failure Mode Was Operational
The most instructive bug in the whole workflow had nothing to do with models, voices, or prompts. The new podcasts were not appearing in production because the updated audio files were sitting locally in the website repo, uncommitted. The site metadata already pointed to the correct paths. The MP3 files had already been copied to the right public directory. But production was still serving the old assets because nothing had actually been pushed.
That is a useful corrective to a lot of AI hype. Once a workflow spans multiple repos, generated artifacts, and a deployment system, the dominant failure mode is often operational discipline, not model capability. The model can be working perfectly while the system still fails to ship.
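That kind of failure is also easy to guard against mechanically. The following is a sketch distilled from the bug above, assuming the site's audio lives under an asset directory inside the website repo; the repo layout and function name are illustrative:

```shell
# A pre-publish guard: refuse to call an episode "shipped" while the
# website repo has uncommitted or unpushed changes under its asset
# directory. Repo path and directory layout are illustrative assumptions.
assert_shipped() {
  repo="$1"; dir="$2"
  if [ -n "$(git -C "$repo" status --porcelain -- "$dir")" ]; then
    echo "NOT SHIPPED: uncommitted assets under $dir" >&2
    return 1
  fi
  # @{u} only resolves when an upstream exists; unpushed commits show here.
  if [ -n "$(git -C "$repo" log --oneline '@{u}..HEAD' -- "$dir" 2>/dev/null)" ]; then
    echo "NOT SHIPPED: committed but unpushed assets under $dir" >&2
    return 1
  fi
  echo "shipped: $dir is committed and pushed"
}
```

Running a check like this as the last step of the workflow turns "did it actually deploy?" from a thing someone remembers into a thing the pipeline enforces.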
What Became Repeatable
By the end, we had something much more durable than a single successful run. We had updated skills, revised script heuristics, clearer audio standards, and a known-good publication path. That means the next paper-to-podcast episode no longer starts from blank-slate improvisation.
- Use the PDF as source of truth.
- Write for conversational realism, not just accuracy.
- Retarget hardcoded generation scripts deliberately.
- Prefer subtle music over extra SFX.
- Fix pronunciation explicitly when brand or product names matter.
- Treat publication as part of the workflow, not an afterthought.
- Document the process in skills so the learnings compound.
The Broader Point
This project changed how we think about AI content systems. The value was not in proving that a model can generate a podcast. That is table stakes now. The value was in building a workflow where the output is source-grounded, sounds human, respects brand details, and reliably makes it all the way to production. In other words: not just generated, but shipped.
That distinction matters beyond podcasts. A lot of AI systems look impressive in isolated demos and then fall apart in the handoff between generation, review, packaging, and deployment. The work that feels "boring" — process design, quality heuristics, operational clarity — is often the part that makes the system real.
The meta-lesson: if you want AI workflows to produce business value instead of interesting artifacts, you have to design the operating procedure as carefully as the generation step.
Building with AI?
netrii helps ambitious SMBs navigate AI and emerging technology — strategy, experiments, and hands-on practice.
Schedule a Conversation