The missing layer in compliance RAG: why your search results need a judge
If you’re building search over a knowledge base with an LLM — the pattern everyone calls RAG — you’ve seen the standard pipeline: embed the user’s question, find the closest chunks in a vector store, hand them to the LLM, get an answer. For documentation search or internal wikis, this works. The LLM is good at ignoring irrelevant context when the relevant stuff is also in the window.
I’m building a CMMC compliance platform, and I wanted a way to dogfood our own product against our own development process. Every commit we make to the platform touches some aspect of NIST 800-171 — access control, audit logging, encryption, configuration management. I wanted our pull requests to show which compliance controls each change addresses. Not as a compliance artifact (though it could become one), but as a consciousness-raising tool: every engineer on the team sees the compliance implications of their own code, every reviewer sees which controls are being strengthened. It’s ambient education that turns into culture.
So I built a commit-to-control mapping tool — given a git diff, find the NIST 800-171 controls it addresses. The naive pipeline (embed the diff, vector search, return top K controls) gave me plausible but wrong matches 70-80% of the time. It felt like AltaVista in 1998 — the results were related to my query in the way that a thesaurus is related to understanding. Relevant vocabulary, zero comprehension. A diff for “added KMS encryption to S3 buckets” matched SC-28 (Protection of Information at Rest) at 0.89, which is correct. But it also matched SC-8 (Transmission Confidentiality) at 0.84 and AU-9 (Protection of Audit Information) at 0.82. All three controls are about protecting data. Only one is about encryption at rest. The embedding space can’t tell the difference because the vocabulary overlaps — and NIST 800-171 has 110 controls, many of which share phrases like “access control,” “system access,” and “authorized users.”
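The failure mode is easy to reproduce without any model in the loop. A minimal sketch, using the control IDs and similarity scores from the S3/KMS example above (everything else is illustrative):

```python
# Illustrative sketch: top-K retrieval by similarity score alone cannot
# separate the correct control from vocabulary-adjacent near-misses.
# Scores are the ones observed for the "KMS encryption on S3" diff.
candidates = [
    ("SC-28", "Protection of Information at Rest", 0.89),  # correct
    ("SC-8",  "Transmission Confidentiality",      0.84),  # near-miss
    ("AU-9",  "Protection of Audit Information",   0.82),  # near-miss
]

TOP_K = 3
top = sorted(candidates, key=lambda c: c[2], reverse=True)[:TOP_K]

# All three clear any reasonable similarity cutoff, so a score
# threshold can't reject the near-misses without also rejecting
# correct matches on other queries.
for control_id, title, score in top:
    print(f"{control_id}: {score:.2f}  {title}")
```

The point of the sketch: the gap between the right answer (0.89) and the wrong ones (0.84, 0.82) is smaller than the query-to-query variance of the scores themselves, so no fixed threshold separates them.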
So I added a judging step between retrieval and synthesis. After the vector search returns candidates, a second LLM call evaluates each one: “Does this control specifically address what the user asked about, or is it about an adjacent control that shares vocabulary?” The judge is a classifier, not a generator — it returns RELEVANT or NOT_RELEVANT with a one-sentence explanation (the explanation is for debugging, not for the user). In my case, only 20-30% of the top K candidates survived — the rest were near-misses that looked right to a vector search but weren’t.
User question
→ Embedding → Vector search → Top K candidates
→ Judge LLM: "Is each candidate relevant to THIS specific question?"
→ Filtered candidates
→ Synthesis LLM: generate answer from filtered candidates only
→ Response with citations
The cost is one additional cheap LLM call per candidate, each on a small payload. I use a Haiku-class model for the judge — it’s a classification task, not a creative one. The latency is negligible compared to the synthesis call. The quality gain was dramatic: false matches dropped from 70-80% to under 5%. The judge understood that SC-8 is about data in transit, not at rest, even though the embedding space put them next to each other.
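The whole pipeline fits in a few lines. This is a minimal sketch, not the production implementation: `embed`, `vector_search`, and `llm` are hypothetical stand-ins for your embedding model, vector store, and LLM client, and the model names are placeholders.

```python
# Sketch of the retrieve -> judge -> synthesize pipeline.
# All three callables are assumed interfaces, injected by the caller.
JUDGE_PROMPT = (
    "Given the user's question, evaluate whether this retrieved section "
    "is specifically relevant, not merely about an adjacent topic that "
    "shares vocabulary.\n"
    "Question: {question}\n"
    "Section: {chunk}\n"
    "Return: RELEVANT or NOT_RELEVANT with one sentence explanation."
)

def answer(question, embed, vector_search, llm, top_k=10):
    # 1. Standard retrieval: nearest neighbors in embedding space.
    candidates = vector_search(embed(question), k=top_k)

    # 2. Judge: one cheap classification call per candidate.
    #    A small model is enough; this is not a generative task.
    kept = []
    for chunk in candidates:
        verdict = llm(JUDGE_PROMPT.format(question=question, chunk=chunk),
                      model="judge-small")
        # startswith, not substring: "RELEVANT" appears inside
        # "NOT_RELEVANT" too, so a membership test would pass everything.
        if verdict.strip().upper().startswith("RELEVANT"):
            kept.append(chunk)

    # 3. Synthesis sees only the survivors.
    context = "\n\n".join(kept)
    return llm(f"Answer using only this context:\n{context}\n\nQ: {question}",
               model="synthesis-large")
```

The design choice worth noting is the asymmetry: a small, fast model for the judge loop, a larger model only for the single synthesis call at the end.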
Here’s what the judge’s output looks like on a real commit — a Dockerfile rewrite that added multi-stage builds, pinned base images, and configured a model artifact path:
✓ CM.L2-3.4.1 — Establishes baseline configuration by pinning the base image
digest and defining the build stages declaratively in the Dockerfile.
✓ CM.L2-3.4.6 — Configures the container to provide only essential capabilities
by using a minimal base image and multi-stage build that excludes build tools
from the runtime image.
✗ SC.L2-3.13.8 — The commit changes container build configuration, not
transmission encryption. Vocabulary overlap with "protection" but the control
is about data in transit.
✗ SI.L2-3.14.2 — The commit doesn't add malicious code protection. Using a
pinned base image is a supply chain practice, not a malware scanning control.
✗ AC.L2-3.1.5 — Least privilege applies to user/role permissions, not to
Dockerfile layer optimization.
Two controls approved, three rejected. The rejections are where the value is — each one is a near-miss that vector similarity would have confidently returned as a match. “Supply chain practice, not a malware scanning control” is the kind of distinction that matters in an audit and that an embedding model will never make.
The judge prompt is domain-specific. For compliance I ask three things: does this chunk address the specific control the user asked about (not an adjacent one)? Does it contain actionable information? Would citing it in an assessment be accurate? A generic relevance check wouldn’t catch the SC-28 vs SC-8 distinction — you need the prompt to encode the domain’s notion of what “relevant” means.
Given the user's question about {control_id}, evaluate whether
this retrieved section is specifically relevant:
Section: {chunk_text}
Criteria:
- Does it address the specific control, not an adjacent one?
- Does it contain actionable information for this control?
- Would citing this section in an assessment be accurate?
Return: RELEVANT or NOT_RELEVANT with one sentence explanation.
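Wired into code, that prompt becomes a small classifier call. A hedged sketch, assuming a generic `llm` callable; the parsing is deliberately simple because the prompt constrains the output format:

```python
# The domain-specific judge prompt from above, wrapped in a callable.
# `llm` is a hypothetical client for whatever small model you use.
JUDGE_TEMPLATE = """Given the user's question about {control_id}, evaluate whether
this retrieved section is specifically relevant:
Section: {chunk_text}
Criteria:
- Does it address the specific control, not an adjacent one?
- Does it contain actionable information for this control?
- Would citing this section in an assessment be accurate?
Return: RELEVANT or NOT_RELEVANT with one sentence explanation."""

def judge(llm, control_id, chunk_text):
    """Return (is_relevant, raw_reply).

    The reply's explanation is kept for debugging only; it never
    reaches the user.
    """
    reply = llm(JUDGE_TEMPLATE.format(control_id=control_id,
                                      chunk_text=chunk_text)).strip()
    # Prefix check, not substring: "RELEVANT" in reply would also
    # match a NOT_RELEVANT verdict.
    is_relevant = reply.upper().startswith("RELEVANT")
    return is_relevant, reply
```

A usage example against a stubbed model, mirroring the Dockerfile commit above: `judge(llm, "SC.L2-3.13.8", diff_text)` should come back `(False, "NOT_RELEVANT: ...")` because the change touches build configuration, not transmission encryption.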
This pattern isn’t unique to compliance. It matters anywhere the embedding space is dense with near-neighbors that a human expert would distinguish instantly but a distance metric can’t. Legal contracts where clauses share terminology but have different implications. Medical literature where symptoms overlap across conditions. Financial regulation where rules reference each other with subtle distinctions. Security advisories where CVEs affect similar but distinct components. In every case the domain has a taxonomy that the embedding model doesn’t understand, and the judge bridges that gap.
The standard RAG pipeline assumes retrieval quality is good enough for the synthesis model to work with. In sparse domains — general knowledge, product documentation, FAQs — that assumption holds because the nearest neighbors in the embedding space are genuinely the right answers. In dense, regulated domains, the nearest neighbors are often the most dangerous kind of wrong: plausible, authoritative-sounding, and subtly off-target. The judge isn’t a filter in the traditional sense — it’s not removing results below a similarity threshold. It’s applying domain reasoning that the embedding model was never trained to do. The embedding knows these chunks are semantically close. The judge knows they’re functionally different.
That’s the layer most RAG architectures are missing. Not because it’s hard to build — it’s one LLM call with a good prompt — but because the standard pipeline works well enough in the domains where most people first encounter RAG. You don’t discover you need a judge until your users start making real decisions based on the answers.
I open-sourced the commit-to-control tool as a GitHub Action. It runs on every PR, maps the diff to NIST 800-171 controls with the judging layer described here, and posts a comment showing which controls the change addresses — with cost tracking in dollars and satoshis. It supports OpenAI, AWS Bedrock, and self-hosted models via Ollama. The entire compliance mapping for a typical PR costs about 3 sats.