Creative Planning with Language Models

Practice, Evaluation and Applications

NAACL 2025 Tutorial

Saturday, May 3rd, 9:00am - 12:30pm MDT
Pecos Room

What do 📰 journalism, 🎵 music lyric composition, ⚖️ legal writing, 💭 psychological counseling and 🍽️ menu design all have in common? That's right! They are all human-centered tasks with unclear and subjective rewards. Humans interpret and manage this subjectivity well, yet language models still struggle to perform these tasks. How can language models learn to do the same? Come see how we might make progress in our tutorial.

Meet Your Instructors

Philippe Laban

Microsoft Research

Abstract

The use of large language models (LLMs) in human-centered creative domains — such as journalism, scientific writing, and storytelling — has showcased their potential for content generation but highlighted a critical gap: planning. Planning, a fundamental process in many creative domains, refers to the higher-level decisions writers (or agents) make that influence the textual output they produce. Planning is especially hard in creative domains, where human rewards are often unclear or sparsely observed. This tutorial explores how planning has been learned and deployed in creative workflows. We will cover three aspects of creativity: Problem-Finding (how to define rewards and goals for creative tasks), Path-Finding (how to generate novel creative outputs that meet those goals), and Evaluation (how to judge creative output). We will also consider three learning settings: Full Data Regimens (when observational data for both decisions and resulting text exist), Partial Data Regimens (when text exists but decisions must be inferred), and Low Data Regimens (when neither exists). The tutorial will end with practical demonstrations in computational journalism, web agents, and other creative domains. By bridging theoretical concepts and practical demonstrations, this tutorial aims to inspire new research directions in leveraging LLMs for creative planning tasks.

Schedule

Introduction: 9:00 - 9:15
Problem-Finding (Alexander Spangher): 9:15 - 9:45
Path-Finding (Tenghao Huang): 9:45 - 10:15
Path-Finding (Nanyun Peng): 10:15 - 10:45
Break: 10:45 - 11:15
Evaluation (Philippe Laban): 11:15 - 11:45
Demos: 11:45 - 12:00
LawFlow demo, presented by Debarati Das: 12:00 - 12:15
STORM demo, presented by Yucheng Jiang: 12:15 - 12:30

Dive Deeper

Problem-Finding

"The formulation of a problem is often more essential than its solution, which may be merely a matter of mathematical or experimental skill." — Albert Einstein, The Evolution of Physics

In this section, we explore how creative actors define tasks, goal-states, and rewards. Getzels and Csikszentmihalyi's longitudinal study of art students found that those who spent the most effort defining the problem went on to produce the most creative work. We'll also look at how humans (and chimpanzees) learn rewards by observing others' motivations and end-state outputs, and how such emulation learning helps us understand human creative processes.

In NLP, problem-finding maps onto learning complex rewards. We'll examine approaches that mix multiple rewards, frame language modeling as inverse reinforcement learning (IRL), and explore emulation-learning settings; a minimal sketch of decoding-time reward mixing follows below. We'll also discuss how scientists can "read between the lines" to recover the actions taken in research papers, even when those actions are not explicitly mentioned, and why this ability is crucial for reproducibility in our field.
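
To make reward mixing concrete, here is a minimal sketch in Python. Everything in it (the mix_rewards helper, the newsworthiness and clarity scorers, the candidates, and the weights) is a hypothetical illustration; real decoding-time methods such as Shi et al. (2024) combine per-token model scores rather than whole-text heuristics.

```python
# A minimal sketch of decoding-time reward mixing. All reward
# functions, candidates, and weights here are hypothetical
# placeholders for illustration only.
from typing import Callable

def mix_rewards(candidates: list[str],
                rewards: list[Callable[[str], float]],
                weights: list[float]) -> str:
    """Return the candidate maximizing a weighted sum of reward scores."""
    def score(text: str) -> float:
        return sum(w * r(text) for w, r in zip(weights, rewards))
    return max(candidates, key=score)

# Hypothetical scorers for a journalism-style task: each maps text
# to a scalar preference signal (higher is better).
newsworthiness = lambda t: float("breaking" in t.lower())
clarity = lambda t: 1.0 / (1.0 + len(t.split()) / 50.0)  # shorter ~ clearer

best = mix_rewards(
    candidates=[
        "Breaking: city council passes budget.",
        "The council, after much deliberation, passed a budget.",
    ],
    rewards=[newsworthiness, clarity],
    weights=[0.7, 0.3],
)
print(best)  # "Breaking: city council passes budget."
```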

Bibliography

  • Getzels, J. W., & Csikszentmihalyi, M. (1976). The creative vision: A longitudinal study of problem finding in art. Wiley.
  • Hopper, L. M., et al. (2010). Chimpanzees' socially maintained food preferences indicate both conservatism and conformity. Animal Behaviour.
  • Shi, R., Chen, Y., Hu, Y., Liu, A., Hajishirzi, H., Smith, N. A., & Du, S. S. (2024). Decoding-time language model alignment with multiple objectives. NeurIPS.
  • Wulfmeier, M., et al. (2024). Imitating language via scalable inverse reinforcement learning. NeurIPS.
  • Spangher, A., Tumgoren, S., Welsh, B., Peng, N., Ferrara, E., & May, J. (2024). Tracking the Newsworthiness of Public Documents. ACL.
  • Starace, G., et al. (2025). PaperBench: Evaluating AI's Ability to Replicate AI Research. arXiv preprint.
  • Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. ICML.
  • Cohen, S., Hamilton, J. T., & Turner, F. (2011). Computational Journalism. Communications of the ACM.
  • Zhou, S., et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv preprint.
  • Spangher, A., Peng, N., Gehrmann, S., & Dredze, M. (2024). Do LLMs Plan Like Human Writers? Comparing Journalist Coverage of Press Releases with LLMs. EMNLP.

Path-Finding

"Creativity involves breaking out of established patterns to look at things in a different way." — Edward de Bono, Serious Creativity

This section covers how humans develop alternative methods for solving problems through forward and backward approaches, and how both are being incorporated into modern NLP systems.

Forward approaches directly train or prompt models to generate sequences of actions: with little data, researchers rely on prompt engineering and in-context learning, while with more data they can explicitly train planning agents. Backward approaches instead infer sequences of actions from available state information, drawing on classical methods such as means-ends analysis, backtracking, and regression planning (a toy backtracking sketch follows below). We'll show how these ideas enter modern NLP through latent-variable modeling and variational inference, and how forward and backward approaches can be combined to infer and utilize latent plans in creative tasks.
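
As a toy illustration of backtracking search over action sequences, here is a minimal sketch in Python. The editing actions, states, and goal are hypothetical stand-ins for the richer plans discussed in the tutorial, not an implementation of any cited system.

```python
# A minimal backtracking planner: iterative-deepening depth-first
# search for an action sequence that turns a start state into a goal
# state. The states and actions below are toy placeholders; in an NLP
# setting, states could be drafts and actions editing operations.

def backtrack(state, goal, actions, depth, plan=()):
    """Return the first action sequence of at most `depth` steps reaching `goal`."""
    if state == goal:
        return list(plan)
    if depth == 0:
        return None
    for name, apply_fn in actions.items():
        result = backtrack(apply_fn(state), goal, actions, depth - 1,
                           plan + (name,))
        if result is not None:
            return result
    return None

def plan_search(start, goal, actions, max_depth):
    """Iterative deepening: try shorter plans before longer ones."""
    for depth in range(max_depth + 1):
        plan = backtrack(start, goal, actions, depth)
        if plan is not None:
            return plan
    return None

# Hypothetical editing actions for a toy headline-revision task.
actions = {
    "add_dateline": lambda s: s if s.startswith("ALBUQUERQUE: ") else "ALBUQUERQUE: " + s,
    "add_source": lambda s: s if s.endswith("(officials said)") else s + " (officials said)",
}
plan = plan_search(
    start="Budget passes.",
    goal="ALBUQUERQUE: Budget passes. (officials said)",
    actions=actions,
    max_depth=3,
)
print(plan)  # ['add_dateline', 'add_source']
```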

Bibliography

  • Runco, M. A. (2001). Introduction to the special issue: Commemorating Guilford's 1950 presidential address. Creativity Research Journal.
  • Guilford, J. P. (1950). Creativity. American Psychologist.
  • Guilford, J. P. (1978). Alternate Uses: Manual of Instructions and Interpretations. Orange, CA: Sheridan Psychological Services.
  • Finke, R. A., Ward, T. B., & Smith, S. M. (1996). Creative cognition: Theory, research, and applications. MIT Press.
  • Gentner, D., & Bowdle, B. (2014). Metaphor as structure-mapping. In The Cambridge Handbook of Metaphor and Thought.
  • Tian, Y., et al. (2024). MacGyver: Are Large Language Models Creative Problem Solvers? NAACL.
  • Chen, Z., Deng, Y., Yuan, H., Ji, K., & Gu, Q. (2024). Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. ICML.
  • Côté, M. A., et al. (2018). TextWorld: A Learning Environment for Text-based Games. arXiv preprint.
  • Newell, A., & Simon, H. A. (1961). GPS, a program that simulates human thought. In Computers and thought.
  • Golomb, S. W., & Baumert, L. D. (1965). Backtrack programming. Journal of the ACM.
  • McDermott, D. (1991). Regression planning. International Journal of Intelligent Systems.
  • Xu, J., & Zhang, Y. (2019). Regression planning for natural language processing. EMNLP.
  • Gandhi, K., Lee, D., Grand, G., Liu, M., Cheng, W., Sharma, A., & Goodman, N. D. (2024). Stream of Search (SoS): Learning to Search in Language. arXiv preprint arXiv:2404.03683.
  • Chen, X., et al. (2024). Reverse Engineering Language Models. arXiv preprint.
  • Min, S., Lewis, M., Hajishirzi, H., & Zettlemoyer, L. (2022). Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? EMNLP.
  • Pham, T., et al. (2024). TopicGPT: A Prompt-based Topic Modeling Framework. arXiv preprint.
  • Zelikman, E., et al. (2022). STaR: Bootstrapping Reasoning With Reasoning. arXiv preprint.
  • Deng, X., et al. (2023). Mind2Web: Towards a Generalist Agent for the Web. NeurIPS.
  • Chakrabarty, T., Padmakumar, V., He, H., & Peng, N. (2023). Creative Natural Language Generation. EMNLP Tutorial.

Evaluation

"The unexamined life is not worth living." — Plato, Apology of Socrates

Since creative tasks often lack objective metrics for success, this section focuses on evaluation methods grounded in human preference, moving beyond surface-level metrics like BLEU and ROUGE.

We'll discuss two main modes of evaluation: offline and online. Offline evaluation compares model-generated plans to the plans humans actually made, using novel metrics such as latent criticism and conditional perplexity that examine structural rather than surface-level aspects of the output (a minimal conditional-perplexity sketch follows below). Online evaluation follows HCI methodologies, studying human preferences over recommendations, suggestions, edits, and other forms of AI assistance. Together, these methods help us understand and improve creative planning systems without relying solely on traditional metrics.
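
As one way to make conditional perplexity concrete, here is a minimal sketch: score how surprised a language model is by a human-written text when conditioned on a candidate plan, with lower perplexity suggesting the plan better explains the text. GPT-2 is used purely for illustration, and the plan and article strings are hypothetical; this sketches the general idea, not the specific metric from any cited paper.

```python
# A minimal sketch of conditional perplexity: how surprised is a
# language model by human-written text when conditioned on a candidate
# plan? Lower perplexity suggests the plan better explains the text.
# GPT-2 and the strings below are illustrative placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def conditional_perplexity(plan: str, text: str) -> float:
    """Perplexity of `text` given `plan` as a prefix; only text tokens are scored."""
    plan_ids = tokenizer(plan, return_tensors="pt").input_ids
    text_ids = tokenizer(" " + text, return_tensors="pt").input_ids
    input_ids = torch.cat([plan_ids, text_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : plan_ids.size(1)] = -100  # mask plan tokens out of the loss
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss
    return torch.exp(loss).item()

plan_a = "Plan: lead with the budget vote, then quote city officials."
plan_b = "Plan: open with the weather, then preview the weekend games."
article = "The city council approved the budget after officials urged passage."
# The plan that better explains the article should score lower.
print(conditional_perplexity(plan_a, article),
      conditional_perplexity(plan_b, article))
```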

Bibliography

  • Zhu, Z., et al. (2023). Large Language Models Can Learn Rules. arXiv preprint.
  • Kryściński, W., McCann, B., Xiong, C., & Socher, R. (2019). Evaluating the Factual Consistency of Abstractive Text Summarization. arXiv preprint arXiv:1909.12840.
  • Zhao, Z., et al. (2023). Recommender Systems in the Era of Large Language Models (LLMs). arXiv preprint arXiv:2307.02046.
  • Clark, E., & Smith, N. A. (2021). Choose your own adventure: Paired suggestions in collaborative writing for evaluating story generation models. NAACL.
  • Laban, P., Vig, J., Hearst, M., Xiong, C., & Wu, C. S. (2024). Beyond the Chat: Executable and Verifiable Text-Editing with LLMs. UIST.
  • Yang, D., et al. (2024). Human-Centered NLP: A Tutorial. NAACL.
  • Hendrycks, D., et al. (2021). Measuring Mathematical Problem Solving with the MATH Dataset. arXiv preprint.
  • Wu, T., Zhang, C., Hu, Q., Spangher, A., & Peng, N. (2023). Learning Action Conditions from Instructional Manuals for Instruction Understanding. ACL.
  • Wu, T., Spangher, A., Alipoormolabashi, P., Freedman, M., Weischedel, R., & Peng, N. (2022). Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals. ACL.
  • Ammanabrolu, P., et al. (2020). Story Realization: Expanding Plot Events into Sentences. AAAI.
  • Hamborg, F., Donnay, K., & Gipp, B. (2019). Automated identification of media bias in news articles: an interdisciplinary literature review. International Journal on Digital Libraries, 20(4), 391-415.
  • Hosseini-Asl, E., et al. (2020). A Simple Language Model for Task-Oriented Dialogue. NeurIPS.

Tutorial Materials

Tutorial Abstract (PDF)

Tutorial Slides (Google Slides) | Tutorial Slides (PDF)

Video (coming soon)