What I Learned Building AI Agents on Top of the Obsidian CLI

I use Obsidian extensively as my personal knowledge management system. It tracks everything: daily notes, tasks, projects, meetings, people, and more. Over the past year I’ve been building AI agent workflows that operate on my vault – using tools like Claude Code and pi (an open-source coding agent) to automatically gather context, update notes, and maintain my system.

When Obsidian released their CLI in version 1.12, I was excited. Here was an official way to interact with the vault programmatically: read files, get backlinks, rename notes with automatic link updates. I immediately started building my agent tooling on top of it.

It worked great – until it didn’t. This post is about what I learned along the way.

The Setup

My vault has over 700 person files, hundreds of daily notes, tasks, projects, and more. I built a set of tools that my AI agents use to gather context about any wikilink in the vault:

  1. Get the note itself – read the full markdown content
  2. Find content references – find every place a wikilink is mentioned in other notes, with intelligent context extraction (preserving bullet hierarchies and paragraph boundaries)
  3. Find frontmatter references – find every note that references the wikilink in its YAML frontmatter (useful for structured relationships like meeting attendees or task assignments)
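The third tool is the least obvious, so here is a minimal sketch of how a frontmatter-reference scan can work on the plain filesystem. This is an illustrative reimplementation, not the author's actual tool: the function name `find_frontmatter_refs` and the assumption that frontmatter is delimited by `---` fences at the top of each note are mine.

```python
from pathlib import Path

def find_frontmatter_refs(vault: Path, name: str) -> list[Path]:
    """Return notes whose YAML frontmatter mentions [[name]].

    Sketch only: scans just the frontmatter block between the leading
    '---' fences, so wikilinks in the note body are ignored. Aliased
    links like [[name|display]] are not handled here.
    """
    target = f"[[{name}]]"
    hits = []
    for md in sorted(vault.rglob("*.md")):
        text = md.read_text(encoding="utf-8")
        if not text.startswith("---"):
            continue  # no frontmatter block at all
        end = text.find("\n---", 3)  # closing fence of the frontmatter
        if end != -1 and target in text[:end]:
            hits.append(md)
    return hits
```

Scoping the search to the frontmatter block is what separates structured relationships (meeting attendees, task assignments) from incidental body mentions.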

These tools power skills like “update a person note,” which gathers context from across the vault and populates a person file with relationship details, professional info, and personal notes. The whole thing is driven by the AI agent, which reads the skill instructions, calls the tools, analyzes the results, and writes the updated file.

The Obsidian CLI: What It Does Well

The Obsidian CLI is genuinely valuable for certain things.

Renaming notes is the killer feature. When you rename a note through the CLI, Obsidian automatically updates every wikilink across the entire vault. You simply cannot replicate this safely with grep and sed. The link index knows about aliases, case normalization, and ambiguous matches. This is a write operation that fundamentally requires Obsidian’s understanding of the vault’s link graph.

Backlink discovery is also useful. The CLI can tell you which files link to a given note, using Obsidian’s resolved link index. This handles edge cases that a simple text search might miss.

Single file reads are convenient when you don’t know the exact file path. Obsidian resolves the shortest unambiguous name, so you can ask for “John Smith” without needing to know it lives at People/John Smith.md.
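For comparison, a rough filesystem approximation of that name resolution looks like the following. This is a hedged sketch, not how Obsidian actually implements it: the real resolver also understands aliases and Obsidian's own disambiguation rules, which is exactly why the CLI lookup is convenient.

```python
from pathlib import Path

def resolve_note(vault: Path, name: str) -> Path:
    """Rough filesystem stand-in for Obsidian's note-name resolution.

    Matches on the file stem, case-insensitively. Unlike Obsidian,
    this knows nothing about aliases, and it simply refuses to guess
    when the name is ambiguous.
    """
    matches = [
        p for p in sorted(vault.rglob("*.md"))
        if p.stem.lower() == name.lower()
    ]
    if not matches:
        raise FileNotFoundError(name)
    if len(matches) > 1:
        raise ValueError(f"ambiguous name: {name!r} -> {matches}")
    return matches[0]
```

Even this crude version covers the common case of "give me John Smith without the full path," which is most of what an agent needs from single-file lookups.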

Where It Breaks Down

The problem showed up when I tried to scale things up. I wanted to batch-update all 700+ empty person files in my vault — running the AI agent on each one sequentially. The agent would gather context, analyze it, and write the updated file. Simple enough.

Except each person update requires gathering context, which means:

  1. One CLI call to get backlinks (which files mention this person?)
  2. One CLI call per backlinked file to read its content

Each CLI call takes roughly one second. For someone mentioned in 561 files across my vault, that means gathering content references alone would take over nine minutes. Even for a lightly-referenced person with 10 backlinks, that’s 10 seconds of I/O before the AI agent can even start thinking.

When I tried to run this at batch scale, the Obsidian CLI calls would sometimes hang entirely. I’d kick off an update of 5 people and the script would stall on the second or third person, waiting for CLI responses that never came.

The root cause is architectural: the Obsidian CLI is a one-call-at-a-time interface. Every invocation is a separate subprocess that communicates with the running Obsidian instance. There’s no batching, no streaming, no way to say “read me these 50 files in one shot.” And since the CLI requires Obsidian to be running, you’re limited by however fast Obsidian can service these requests.

The Fix: Filesystem for Reads, CLI for Writes

The solution was straightforward once I stopped trying to force everything through the CLI. I added a filesystem backend to my context-gathering tools that bypasses the CLI entirely for read operations:

  • Backlink discovery: Instead of obsidian-cli backlinks, I use ripgrep to scan the entire vault for [[Person Name]]. Ripgrep can search every markdown file in the vault in about 50 milliseconds.
  • File reads: Instead of obsidian-cli read, I just read the file directly from disk. This is essentially instant.
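The two bullets above reduce to very little code. Here is a minimal sketch of the filesystem backend, assuming the same `[[Person Name]]` wikilink convention described earlier; my real setup shells out to ripgrep for the scan, but a plain Python substring search shows the shape of it. Aliases, case normalization, and `[[name|display]]` links are deliberately not handled, which is the trade-off against the CLI's link index.

```python
from pathlib import Path

def find_backlinks(vault: Path, name: str) -> list[Path]:
    """Filesystem stand-in for `obsidian-cli backlinks`.

    Plain substring scan for [[name]] across every markdown file.
    ripgrep does the same job in tens of milliseconds on a large
    vault; this version trades a little speed for zero dependencies.
    """
    target = f"[[{name}]]"
    return [
        md for md in sorted(vault.rglob("*.md"))
        if target in md.read_text(encoding="utf-8")
    ]

def read_notes(paths: list[Path]) -> dict[str, str]:
    """Bulk read straight from disk -- no subprocess per file."""
    return {p.stem: p.read_text(encoding="utf-8") for p in paths}
```

The key property is that `read_notes` touches N files with zero subprocess launches, where the CLI path paid roughly a second per file.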

The results were dramatic. Gathering context for a person went from 10-60+ seconds (depending on reference count) down to about 0.2-0.4 seconds. The batch update script that previously stalled after a person or two now processes five people in under a minute.

I kept the CLI available as a backend for interactive use where the convenience of name resolution is nice, and for write operations like rename-note where the CLI’s link-updating capability is difficult to replace.

Lessons for AI Agent Tooling

Building this system taught me a few things that I think generalize beyond Obsidian:

The control plane / data plane distinction matters. The Obsidian CLI is a great control plane – it knows things about your vault’s structure that the filesystem doesn’t, and it can make coordinated changes. But it’s a poor data plane. When an AI agent needs to read 50 files to build context, you want the fastest possible path to those bytes. That’s the filesystem.

AI agents have different performance requirements than humans. A one-second response time is usually not a huge deal for a human typing commands. But AI agent workflows involve tight loops of tool calls — gather context, analyze, gather more context, write output. Each second of latency compounds. A tool that feels fast to a human can be the bottleneck that makes an agent workflow impractical.

Design tools for fan-out. The most common pattern in my agent workflows is: “find all the things related to X, then read all of them.” This is inherently a fan-out operation. Tools that can only handle one item at a time will always be the bottleneck. If I were designing the Obsidian CLI for agent use cases, I’d add batch operations: read multiple files in one call, return backlinks with content included.
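To make the fan-out point concrete, here is a sketch of what a batch-friendly tool signature might look like. The function name `gather` and its return shape are hypothetical, not part of any real CLI; the point is that one call answers "who references each of these names, and what do those notes say."

```python
from pathlib import Path

def gather(vault: Path, names: list[str]) -> dict[str, dict[str, str]]:
    """Hypothetical batch tool: one call, many wikilinks.

    Returns {name: {referencing filename: file content}}, so the agent
    gets backlinks *and* their contents in a single round-trip instead
    of one backlinks call plus N reads per name.
    """
    # Read the vault once and reuse it for every name being looked up.
    files = {
        p: p.read_text(encoding="utf-8")
        for p in sorted(vault.rglob("*.md"))
    }
    return {
        name: {
            p.name: text
            for p, text in files.items()
            if f"[[{name}]]" in text
        }
        for name in names
    }
```

Amortizing the vault scan across all requested names is the design choice that a one-item-at-a-time interface structurally cannot make.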

Start with the simplest thing that works, then optimize the bottleneck. My original tools used the CLI for everything, and they worked fine for single-note updates. It was only when I tried to batch-process hundreds of notes that the performance issue became obvious. The right time to add the filesystem backend was exactly when I did – when I had a concrete bottleneck, not before.

What Would Make the Obsidian CLI Better for Agents?

The Obsidian CLI is still in active development, and there are a few additions that would make it much more agent-friendly:

  • Batch read: obsidian-cli read file=A file=B file=C returning all contents in one response
  • Backlinks with content: obsidian-cli backlinks file=X --include-content so you get everything in a single round-trip instead of N+1 calls
  • A persistent connection mode: A socket or pipe interface that keeps a connection open, avoiding subprocess overhead per call

These would eliminate the need for the filesystem workaround in most cases. Until then, the hybrid approach works well – use CLI for writes and single lookups, filesystem for bulk reads.

Conclusion

If you’re building AI agent workflows on top of Obsidian, my advice is: use the CLI for what it’s uniquely good at (write operations, link-aware mutations, single convenient lookups), and use direct filesystem access for bulk context gathering. The CLI is a powerful tool, but AI agents are demanding users with very different performance needs than humans. Designing your tooling around that reality will save you a lot of frustration.

Roam Research Notes on “SELF-REFINE: Iterative Refinement with Self-Feedback” by Madaan et al.

  • Author:: Madaan et al.
  • Source:: link
  • Review Status:: [[complete]]
  • Recommended By:: [[Andrew Ng]]
  • Anki Tag:: self_refine_iterative_refinement_w_self_feedback_madaan_et_al
  • Anki Deck Link:: link
  • Tags:: #[[Research Paper]] #[[prompting [[Large Language Models (LLM)]]]] #[[reflection ([[Large Language Models (LLM)]])]]
  • Summary

    • Overview
      • SELF-REFINE is a method for improving outputs from large language models (LLMs) through iterative self-feedback and refinement. This approach uses the same LLM to generate an initial output, provide feedback, and refine it iteratively without the need for supervised training or additional data.
    • Key Findings
      • Performance Improvement: Evaluations using GPT-3.5 and GPT-4 across seven tasks show that SELF-REFINE improves performance by about 20%. Outputs are preferred by human evaluators and score higher on task metrics.
      • Complex Task Handling: LLMs often struggle with complex tasks requiring intricate solutions. Traditional refinement methods need domain-specific data and supervision. SELF-REFINE mimics human iterative refinement, where an initial draft is revised based on self-feedback.
      • Iterative Process: The process uses two steps: FEEDBACK and REFINE, iterating until no further improvements are needed.
    • Specific Task Performance
      • Strong Performance:
        • Constrained Generation: Generating a sentence containing up to 30 given concepts. Iterative refinement allows correction of initial mistakes and better exploration of possible outputs.
        • Preference-based Tasks: Dialogue Response Generation, Sentiment Reversal, Acronym Generation. Significant gains due to improved alignment with human preferences.
      • Weaker Performance:
        • Math Reasoning: Difficulty in accurately identifying nuanced errors in reasoning chains.
    • Additional Insights
      • Avoiding Repetition: SELF-REFINE avoids repeating past mistakes by appending the entire history of previous feedback in the REFINE step.
      • Role-based Feedback: Suggestion to improve results by having specific roles for feedback, like performance, reliability, readability, etc.
        • Related Method: Providing a scoring rubric to the LLM with dimensions over which they should evaluate the output.
      • Specific Feedback Importance: Results are significantly better with specific feedback compared to generic feedback.
      • Iteration Impact: Results improve significantly with the number of iterations (i.e., feedback-refine loops) but with decreasing marginal improvements for each loop. In some cases, like Acronym Generation, quality could improve in one aspect but decline in another. Their solution was to generate numeric scores for different quality aspects, leading to balanced evaluation.
      • Model Size Impact: SELF-REFINE performs well for different model sizes, but for a small enough model (Vicuna-13B), it fails to generate feedback consistently in the required format, often failing even with hard-coded feedback.
    • Relevant [[ChatGPT]] conversations: here, here, here
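The FEEDBACK/REFINE loop summarized above can be sketched generically. This is my own illustrative skeleton, not the paper's code: `generate`, `feedback`, and `refine` stand in for differently prompted calls to the same LLM, and the stopping rule (empty feedback means done) is one simple way to implement "no further improvements needed."

```python
def self_refine(task, generate, feedback, refine, max_iters=4):
    """Generic SELF-REFINE loop (sketch of the method, not the paper's code).

    generate(task) -> initial draft
    feedback(task, draft) -> critique, or a falsy value when no
        further improvement is needed
    refine(task, draft, history) -> new draft

    The full (draft, feedback) history is passed to refine, which is
    how the method avoids repeating past mistakes.
    """
    draft = generate(task)
    history = []
    for _ in range(max_iters):
        fb = feedback(task, draft)
        if not fb:  # FEEDBACK found nothing to fix: stop iterating
            break
        history.append((draft, fb))
        draft = refine(task, draft, history)
    return draft
```

In practice the three callables are the same model behind different prompts, which is the paper's central trick: no supervised training or extra data, just the model critiquing itself.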

Roam Notes on The Batch Newsletter (Andrew Ng) – We Need Better Evals for LLM Applications

  • Author:: [[Andrew Ng]]
  • Source:: link
  • Review Status:: [[complete]]
  • Anki Tag:: andrew_ng_the_batch_we_need_better_evals_for_llm_apps
  • Anki Deck Link:: link
  • Tags:: #[[Article]] #[[Large Language Models (LLM)]] #[[evals]] #[[[[AI]] Agents]] #[[Retrieval Augmented Generation (RAG)]]
  • Summary

    • Evaluating Generative AI Applications: Challenges and Solutions
      • Challenges in Evaluation:
        • Evaluating custom AI applications generating free-form text is a barrier to progress.
        • Evaluations of general-purpose models like LLMs use standardized tests (MMLU, HumanEval) and platforms (LMSYS Chatbot Arena, HELM).
        • Current evaluation tools face limitations such as data leakage and subjective human preferences.
      • Types of Applications:
        • Unambiguous Right-or-Wrong Responses:
          • Examples: Extracting job titles from resumes, routing customer emails.
          • Evaluation involves creating labeled test sets, which is costly but manageable.
        • Free-Text Output:
          • Examples: Summarizing customer emails, writing research articles.
          • Evaluation is challenging due to the variability of good responses.
          • Often relies on using advanced LLMs for evaluation, but results can be noisy and expensive.
      • Cost and Time Considerations:
        • [[evals]] can significantly increase development costs.
        • Running [[evals]] is time-consuming, slowing down experimentation and iteration.
      • Future Outlook:
        • Optimistic about developing better evaluation techniques, possibly using agentic workflows such as [[reflection ([[Large Language Models (LLM)]])]].
    • Richer Context for RAG (Retrieval-Augmented Generation)
      • New Development:
        • Researchers at Stanford developed RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval). Link to paper here.
        • RAPTOR provides graduated levels of detail in text summaries, optimizing context within LLM input limits.
      • How RAPTOR Works:
        • Processes documents through cycles of summarizing, embedding, and clustering.
        • Uses SBERT encoder for embedding, Gaussian mixture model (GMM) for clustering, and GPT-3.5-turbo for summarizing.
        • Retrieves and ranks excerpts based on cosine similarity to user prompts, optimizing input length.
      • Results:
        • RAPTOR outperformed other retrievers on the QASPER test set.
      • Importance:
        • Recent LLMs can process very long inputs, but it is costly and time-consuming.
        • RAPTOR enables models with tighter input limits to access more context efficiently.
      • Conclusion:
        • RAPTOR offers a promising solution for developers facing challenges with input context length.
        • This may be a relevant technique to reference if you get around to implementing [[Project: Hierarchical File System Summarization using [[Large Language Models (LLM)]]]]
    • Relevant [[ChatGPT]] conversations: here, here
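The summarize/embed/cluster cycle described above can be sketched as a small recursive build loop. This is a toy outline of the idea, not RAPTOR's actual implementation: the real system uses SBERT embeddings, a Gaussian mixture model for clustering, and gpt-3.5-turbo for summarizing, whereas here those are injected as callables so the structure is visible.

```python
def raptor_tree(chunks, embed, cluster, summarize, max_levels=3):
    """Toy sketch of RAPTOR's tree-building phase.

    Each level: embed the current texts, cluster the embeddings,
    summarize each cluster, then recurse on the summaries. Returns
    every level (leaves first) so retrieval can later search all
    granularities at once.
    """
    levels = [list(chunks)]
    for _ in range(max_levels):
        texts = levels[-1]
        if len(texts) <= 1:
            break  # the tree has collapsed to a single root summary
        labels = cluster(embed(texts))
        groups = {}
        for text, label in zip(texts, labels):
            groups.setdefault(label, []).append(text)
        levels.append([summarize(group) for group in groups.values()])
    return levels
```

Retrieval then ranks excerpts from every level by similarity to the query, which is how models with tight input limits still get both fine detail and high-level context.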