DLLM - D Language 🤖 on 🦙.cpp

By Danny Arends in Projects
Posted at: 24 Mar 2026, 18:11, last edited: 24 Mar 2026, 18:11

Everyone builds LLM agents in Python. LangChain, LlamaIndex, Hugging Face - the ecosystem is enormous and the path of least resistance is obvious. So when I decided to build a local agentic LLM runtime, the sensible choice was clear.

I chose the D language instead.

Here's why, and what I learned along the way.

The Problem With "Just use Python"

Python's LLM ecosystem is genuinely impressive, but it comes with a cost that rarely gets talked about: layers. By the time you have a working agent in Python, you're sitting on top of a framework, which sits on top of a library, which wraps a C++ runtime via ctypes or cffi, which calls CUDA. Four or five abstraction layers, each one a potential source of version conflicts, silent failures, and debugging hell.

I wanted to understand what was actually happening. Not the framework's idea of what was happening - the real thing. Tokens going into a context window, a KV cache filling up, a sampler drawing from a probability distribution. The mechanics.

D gave me that.

What D Brings to the Table

D is a systems language that doesn't feel like one to write. It has Python-style ranges and UFCS (Uniform Function Call Syntax), a garbage collector you can ignore or control, compile-time metaprogramming that's actually usable, and a standard library that covers most of what you need.

But the killer feature for this project was importC - D's ability to directly import C header files and use C APIs as if they were native D code. No bindings. No wrapper libraries. No separate FFI layer.

This one line in my source:

#include "llama.h"

...and suddenly the entire llama.cpp API was available in D. llama_decode, llama_model_load_from_file, llama_sampler_sample - all callable directly, with full type safety. No Python ctypes, no Cython, no manually maintained bindings that go stale every time llama.cpp updates.
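
In practice, importC works by compiling a C translation unit as if it were a D module, so that #include sits in a small C file that the D code imports. The file names, model path and build line below are an illustrative sketch, not DLLM's actual layout:

// llama_c.c - nothing but the header
#include "llama.h"

// main.d - compiled together with the C file, e.g.
//   dmd main.d llama_c.c -L-lllama
import llama_c;                      // the llama.cpp C API, visible as D declarations

void main() {
  llama_backend_init();
  auto mparams = llama_model_default_params();
  mparams.n_gpu_layers = 99;         // offload layers to the GPU (0 keeps a model on the CPU)
  auto model = llama_model_load_from_file("qwen-4b.gguf", mparams);
  scope(exit) { llama_model_free(model); llama_backend_free(); }
  // ... create a context, decode tokens, sample ...
}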

The Architecture

DLLM runs three models simultaneously:

  • Agent model - a Qwen3.5-4B on the GPU, doing the actual reasoning and tool calling
  • Summary model - a tiny Qwen2.5-0.5B on the CPU, condensing the history when the KV cache fills up
  • Embedding model - Nomic embed text v1.5 on the CPU, powering the RAG index

The KV cache management is explicit and intentional. When context pressure exceeds 60%, the summary model condenses the conversation history into a paragraph, replacing the middle turns. The agent never notices - it just keeps working.
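
To make the trigger concrete, here's a minimal sketch using a hypothetical Turn/History representation and a summarize() stand-in for the call into the 0.5B summary model - DLLM's actual bookkeeping differs:

struct Turn { string role; string text; }

struct History {
  Turn[] turns;       // full conversation
  size_t usedCells;   // KV cells currently occupied
  size_t totalCells;  // context size (n_ctx)
}

// Stand-in for a call into the small CPU summary model
string summarize(Turn[] middle) { return "[condensed summary]"; }

bool maybeCondense(ref History h, size_t keepHead = 1, size_t keepTail = 4) {
  const pressure = cast(double) h.usedCells / h.totalCells;
  if (pressure <= 0.6) return false;                        // below 60%: nothing to do
  if (h.turns.length <= keepHead + keepTail) return false;  // nothing in the middle to drop

  auto middle = h.turns[keepHead .. $ - keepTail];          // the turns to condense
  auto summary = Turn("system", summarize(middle));
  h.turns = h.turns[0 .. keepHead] ~ summary ~ h.turns[$ - keepTail .. $];
  return true;                                              // caller re-decodes the shortened history
}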

Tool calling uses a constrained JSON grammar sampler. When the agent decides to use a tool, the JSON sampler takes over and guarantees the output is valid JSON matching the tool's signature. No parsing heuristics, no regex fallbacks - the model literally cannot produce malformed tool calls.
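
For a sense of what that constraint looks like: llama.cpp grammars are written in GBNF, and a per-tool grammar for a hypothetical single-argument tool (getWeather here, not one of DLLM's) might look roughly like this:

root   ::= "{\"tool\":\"getWeather\",\"arguments\":{\"city\":" string "}}"
string ::= "\"" ( [^"\\] | "\\" ["\\bfnrt] )* "\""

While a grammar like this is active, the sampler masks out every token that would leave it, so the only thing the model can emit is a call with exactly that shape.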

The Tool System

This is the part I'm most happy with. Adding a new tool in DLLM looks like this:

@Tool("Reverse a string.")
string reverseString(string text) {
  return text.retro.array.to!string;
}

That's it. Drop it in any file that mixes in RegisterTools, and it's auto-discovered at compile time via D's User-Defined Attributes and compile-time reflection. No registration boilerplate, no JSON schema to maintain by hand, no decorators that only work at runtime.

The tool description, parameter names, and types are extracted at compile time and used to generate both the JSON schema for the system prompt and the grammar constraint for the sampler. The whole thing is about 80 lines of D metaprogramming.
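
As a rough picture of how that works, here's a stripped-down sketch with a hypothetical ToolInfo struct - not DLLM's actual 80 lines:

struct Tool { string description; }   // the UDA from the example above

struct ToolInfo {
  string name;
  string description;
  string[] paramNames;
  string[] paramTypes;
}

ToolInfo[] discoverTools(alias Module)() {
  import std.traits : getSymbolsByUDA, getUDAs, Parameters, ParameterIdentifierTuple;
  ToolInfo[] tools;
  static foreach (sym; getSymbolsByUDA!(Module, Tool)) {{
    ToolInfo info;
    info.name = __traits(identifier, sym);
    info.description = getUDAs!(sym, Tool)[0].description;
    static foreach (i, P; Parameters!sym) {
      info.paramNames ~= ParameterIdentifierTuple!sym[i];
      info.paramTypes ~= P.stringof;
    }
    tools ~= info;
  }}
  return tools;
}

// Evaluated entirely at compile time against the module holding the tools, e.g.:
//   enum tools = discoverTools!(my_tools_module)();

From a ToolInfo array like that, it's a short step to both the JSON schema in the system prompt and the grammar string for the sampler.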

What Surprised Me

The performance. Running three models simultaneously with explicit KV cache management, on a machine with a mid-range GPU, is fast. Not "fast for Python" fast - genuinely fast. The overhead between the D code and llama.cpp is essentially zero because there is no boundary to cross - it's the same process, the same memory.

The debuggability. When something goes wrong in a Python LLM stack, you're often staring at a traceback that disappears into framework internals. In DLLM, when something goes wrong, it's my code. The KV position is wrong, or the token count is off, or the batch size doesn't match - all things I can reason about directly.

D's metaprogramming. I knew it was powerful, but using static foreach and __traits to build a full tool registration system at compile time, with zero runtime overhead, was genuinely satisfying. It's the kind of thing that would require a lot of Python magic (__init_subclass__, metaclasses, decorators with side effects) and still only run at import time.

What's Included

  • RAG with binary-persisted embeddings and cosine similarity ranking (sketched after this list)
  • Vision support via mtmd (multimodal - load an image, ask about it)
  • Docker-sandboxed code execution (Python, JavaScript, Bash, R, D)
  • Web search via SearxNG
  • File I/O, date/time, encoding, audio playback tools
  • KV cache condensation via a dedicated summary model
  • Thinking budget enforcement via token limits
  • Memento system - the agent writes notes to its future self between sessions
  • Full interactive and oneshot modes
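
The cosine-similarity ranking in the RAG item is the simplest piece to show - a minimal sketch, assuming the embeddings live in memory as float[] vectors (DLLM's binary-persisted index differs):

import std.algorithm.sorting : sort;
import std.array : array;
import std.math : sqrt;
import std.range : iota;

double cosine(const float[] a, const float[] b) {
  double dot = 0, na = 0, nb = 0;
  foreach (i; 0 .. a.length) {
    dot += a[i] * b[i];
    na  += a[i] * a[i];
    nb  += b[i] * b[i];
  }
  return dot / (sqrt(na) * sqrt(nb) + 1e-9);
}

// Indices of the stored chunks, best match first
size_t[] rank(const float[][] chunks, const float[] query) {
  auto idx = iota(chunks.length).array;
  idx.sort!((i, j) => cosine(chunks[i], query) > cosine(chunks[j], query));
  return idx;
}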

Should You Build Your Next Agent in D?

Probably not, if you need to ship fast and your team knows Python. The ecosystem gap is real - there's no D equivalent of LangChain, and you will be writing things from scratch.

But if you want to understand how LLM agents actually work at the metal level, if you're interested in a language that gives you Python's expressiveness with C's proximity to the hardware, or if you're just tired of debugging dependency conflicts in a six-layer abstraction stack - D is worth a serious look.

DLLM is open source under GPLv3. The code is small enough to read in an afternoon.

Find it here: github.com/DannyArends/DLLM

