import <LLM Story Time>

ShowCAIS S25

Enhancing LLM Story Generation with Knowledge Graph and Multi-agent Framework

LLMs do not handle long contexts well. When generating long-form content such as novels, they tend to perform poorly simply because they lose track of prior connections between characters and events. I set out to address this issue by determining how these foundation LLMs can be augmented with knowledge graphs for better story generation. My team and I proposed a framework that uses agentic AI tools to iteratively write chapters, generate knowledge graphs, and reference those knowledge graphs when writing subsequent chapters. Alongside the framework, we developed a benchmark to compare existing LLM story generation with our augmented workflow. Once we determined how to accurately evaluate the framework, we found that it showed promising results. I'd share them right now, but I wouldn't want to spoil the paper that we wrote to summarize our findings!
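At a high level, the write-extract-reference loop looks something like the sketch below. The `llm_complete` and `extract_triples` callables are hypothetical stand-ins for the underlying LLM calls, not our actual implementation:

```python
from typing import Callable

def generate_story(
    premise: str,
    num_chapters: int,
    llm_complete: Callable[[str], str],        # any chat-completion call
    extract_triples: Callable[[str], list],    # KG extraction (another LLM call in our setup)
) -> list[str]:
    """Iteratively write chapters, growing a knowledge graph between them."""
    chapters: list[str] = []
    knowledge_graph: list[tuple[str, str, str]] = []  # (entity, relation, entity) triples

    for i in range(num_chapters):
        # 1. Condition the next chapter on everything the graph already "knows".
        facts = "\n".join(f"{s} --{r}--> {o}" for s, r, o in knowledge_graph)
        chapter = llm_complete(
            f"Premise: {premise}\n"
            f"Established facts:\n{facts}\n"
            f"Write chapter {i + 1}, staying consistent with the facts above."
        )
        chapters.append(chapter)

        # 2. Extract new entities/relationships from the chapter and merge them in.
        knowledge_graph.extend(extract_triples(chapter))

    return chapters
```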

Key skills:

• Agentic AI
• LLM Augmentation
• RAG

Development Process

The first step of this project was establishing a benchmark to capture the current state of LLM narrative generation. After exploring various LLMs such as Gemini, Claude, and ChatGPT, we generated a handful of narratives to serve as a baseline. We then collected a set of short stories, including "Little Red Riding Hood" and "The Lorax", to serve as benchmarks of "human" narratives.

From there, we began to build out the multi-agent framework, and that is when we hit our first hurdle: AutoGen. While AutoGen is a powerful tool, it can be unpredictable. Our framework involved a coder agent to generate the knowledge graph of entities and relationships, a writer agent to pull from these knowledge graphs and write the narrative, and a grapher agent to visualize the generated knowledge graphs. The workflow would succeed for the first few chapters, but then the coder agent would stop producing thorough knowledge graphs or the writer agent would get stuck. After combing through forums and whatever documentation we could find, we debugged these errors and rewrote our prompts until the workflow ran smoothly. Essentially, the agents need to be made aware of the overall order of execution and must run explicit checks before proceeding with their tasks.

Once the agentic framework was finally built out, we could collect samples. We also found it helped to equip the LLM with common story archetypes such as "Rags to Riches". With our samples collected, the last bottleneck was evaluation: judging these stories is extremely challenging because narrative quality is subjective. Through another literature review, we found UNION as an evaluation metric and ran our tests with it.
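For reference, here is a minimal sketch of how three such agents might be wired together with AutoGen's group-chat API. The agent names mirror the roles described above, but the model, system messages, and config values are illustrative placeholders rather than our actual prompts:

```python
import autogen

# Placeholder model/config; swap in your own credentials.
llm_config = {"config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}]}

coder = autogen.AssistantAgent(
    name="coder",
    system_message="Extract entities and relationships from the latest chapter and "
                   "output Python code that updates the knowledge graph.",
    llm_config=llm_config,
)
writer = autogen.AssistantAgent(
    name="writer",
    system_message="Write the next chapter. Before writing, check that the coder has "
                   "posted an updated knowledge graph and stay consistent with it.",
    llm_config=llm_config,
)
grapher = autogen.AssistantAgent(
    name="grapher",
    system_message="Render the current knowledge graph as a figure.",
    llm_config=llm_config,
)

# Executes any code the agents produce (graph updates, plotting).
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "story", "use_docker": False},
)

group_chat = autogen.GroupChat(
    agents=[user_proxy, writer, coder, grapher],
    messages=[],
    max_round=12,
)
manager = autogen.GroupChatManager(groupchat=group_chat, llm_config=llm_config)

user_proxy.initiate_chat(manager, message="Write chapter 1 of a 'Rags to Riches' story.")
```

The system messages are where the "explicit checks" live: each agent is told what must already exist in the conversation before it acts, which is what kept the order of execution from drifting.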

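For the grapher role, visualizing a knowledge graph mostly amounts to drawing labeled entity/relationship triples. A small sketch using networkx and matplotlib (the triples below are made-up examples, not output from our pipeline):

```python
import networkx as nx
import matplotlib.pyplot as plt

# Example (entity, relation, entity) triples; illustrative only.
triples = [
    ("Red Riding Hood", "visits", "Grandmother"),
    ("Wolf", "impersonates", "Grandmother"),
    ("Huntsman", "rescues", "Red Riding Hood"),
]

# Build a directed graph, storing the relation as an edge attribute.
G = nx.DiGraph()
for subject, relation, obj in triples:
    G.add_edge(subject, obj, relation=relation)

# Draw nodes and edges, then overlay the relation labels on the edges.
pos = nx.spring_layout(G, seed=42)
nx.draw_networkx(G, pos, node_color="lightblue", node_size=2000, font_size=8)
nx.draw_networkx_edge_labels(G, pos, edge_labels=nx.get_edge_attributes(G, "relation"))
plt.axis("off")
plt.savefig("knowledge_graph.png", bbox_inches="tight")
```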
iTextbooks Paper Submission