audio · ai-agents · conversation-design · workflow-design

Why AI Podcast Dialogue Fails Without Social Rules

Arun Batchu & Cascade (AI)·March 6, 2026·6 min read

One of the most useful lessons from our recent paper-to-podcast work was that correctness is not the same thing as conversational realism. We could get a model to produce technically coherent dialogue quickly. What we could not get for free was the feeling that two actual people were thinking together in real time.

That distinction matters because podcast listeners do not experience a script as text. They experience pacing, acknowledgement, turn-taking, energy, and rhythm. A script that looks polished on the page can still sound like two alternating mini-lectures once voices are generated.

The Default Failure Mode

When asked to create a two-host conversation from a paper, an LLM tends to do something superficially reasonable: it splits exposition between two speakers. But that usually means one host says a paragraph, then the other host says another paragraph, and the result sounds less like a conversation than a relay race of essays.

Nothing in that pattern is factually wrong. The problem is that human conversation contains much more social glue than exposition alone. People greet each other. They acknowledge prior points. They interrupt lightly. They frame questions in response to what was just said. They sound like they are listening, not waiting for their turn to deliver prepared text.

The core lesson: believable dialogue required explicit behavioral constraints. Left to its own devices, the model optimized for coherence, not for the texture of human interaction.

What Had To Be Specified

The fix was not one magical prompt. It was a set of social rules that pushed the script toward spoken realism.

  • Open with a real welcome. The hosts should greet listeners and each other before diving into the substance.
  • Use names naturally. If only one host ever acknowledges the other by name, the exchange sounds lopsided.
  • Break long turns apart. Dense sentences that read well often sound stiff when spoken.
  • Ban alternating monologues. Each line should respond to the prior line, not ignore it.
  • Write for the ear. Spoken cadence matters as much as informational accuracy.
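Rules like these are concrete enough to check mechanically. As an illustration, here is a minimal sketch of a script "linter" that flags violations of three of them. Everything in it is our own assumption for the example, not any existing tool: the `(speaker, line)` script format, the function name, the greeting list, and the 60-word turn limit.

```python
import re

# Assumed limit on turn length, in words; tune per show.
MAX_TURN_WORDS = 60

GREETINGS = {"welcome", "hello", "hi", "hey"}

def lint_script(script, hosts):
    """Check a two-host script (a list of (speaker, line) tuples)
    against a few social rules; return a list of violations."""
    problems = []

    # Rule: open with a real welcome.
    opening_words = set(re.findall(r"[a-z']+", script[0][1].lower()))
    if not opening_words & GREETINGS:
        problems.append("no greeting in the opening line")

    # Rule: each host should use the other's name at least once.
    for speaker, other in (hosts, hosts[::-1]):
        if not any(s == speaker and other in line for s, line in script):
            problems.append(f"{speaker} never says {other}'s name")

    # Rule: break long turns apart (no mini-lectures).
    for i, (speaker, line) in enumerate(script):
        if len(line.split()) > MAX_TURN_WORDS:
            problems.append(f"turn {i} by {speaker} exceeds {MAX_TURN_WORDS} words")

    return problems

script = [
    ("Ana", "Welcome back, everyone! Marcus, great to see you."),
    ("Marcus", "Thanks, Ana. Let's dig into the paper."),
]
lint_script(script, ("Ana", "Marcus"))  # → [] (no violations)
```

A check like this cannot verify that each line responds to the prior one, but it catches the mechanical failures (missing greetings, one-sided name use, paragraph-length turns) before a human ever listens to the audio.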

Why Small Details Matter So Much

What surprised us was how disproportionate the impact of small details turned out to be. A quick “Marcus, that is the part I find most interesting” does not add much information. But it adds a great deal of relational realism. The listener hears one host actually engaging the other, and the whole exchange becomes more believable.

The same thing is true in the other direction. If one host consistently sounds warm and responsive while the other sounds like a lecturer reading notes, the illusion breaks almost immediately. The issue is not just content quality. It is social symmetry.

Why This Generalizes Beyond Podcasts

This is really a lesson about AI systems that interact through language. Many teams evaluate outputs visually and stop when the text “looks good.” But spoken or interactive systems expose a different standard. You are no longer optimizing just for semantic correctness. You are optimizing for how the exchange feels in time.

That means social rules are not cosmetic. They are part of the system design. If the output needs to sound human, then interaction structure has to be designed as carefully as informational content.

Takeaway: if you want AI dialogue to sound human, do not just ask for a conversation. Define the social behavior that makes a conversation feel real.

Building with AI?

netrii helps ambitious SMBs navigate AI and emerging technology — strategy, experiments, and hands-on practice.

Schedule a Conversation