Weekend 2: An SHG Loan Rules Chatbot
I built the eval before I built the UI. The numbers told me something I didn't expect: the chatbot is more honest than it is helpful.
The second prototype of the Mridula Initiative is live at shg-loan-chatbot.adityaksingh.dev: a RAG chatbot that answers plain-language questions about RBI and NABARD Self-Help Group lending guidelines, with inline citations to the source documents.
The problem it's trying to solve is simple to state. Dense regulatory text governs how SHGs can borrow: under what terms, at what interest rates, with what documentation, up to what limits. The people most affected by those rules are SHG members and the loan officers and NGO field staff who work with them. Most of them don't have easy access to a clear reading of those rules. This chatbot is an attempt to close that gap.
What follows isn't a tutorial. It's a note on what the evaluation actually showed, what the numbers mean, and the question they leave me with.
Why evaluation first
I made a decision at the start of this project: I wasn't going to touch the UI until I had a working evaluation harness.
The failure mode I was trying to avoid was building something that felt convincing but was unreliable in ways I couldn't measure. RAG systems are easy to demo. You pick a question where the answer is in the corpus, the system retrieves it, the model paraphrases it fluently, the demo looks good. The failure modes live in the 17th question, not the first.
So the process was: ingest PDFs, write a 25-question fixture set with expected keywords, measure retrieval recall against those keywords, run LLM-as-judge scoring, look at what fails, tune, repeat. Only after the eval numbers were stable did I open a code editor to work on the frontend.
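A minimal sketch of the keyword-recall step, assuming recall is the fraction of a fixture's expected keywords that show up in the retrieved text. The function name, the fixture shape, and the sample question and keywords are all hypothetical; the real harness may match more loosely (stemming, synonyms) than this plain substring check.

```python
def keyword_recall(text: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords found in `text` (case-insensitive substring match)."""
    if not expected_keywords:
        return 0.0
    haystack = text.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in haystack)
    return hits / len(expected_keywords)


# Hypothetical fixture entry; the question and keywords are illustrative only.
fixture = {
    "question": "Is collateral required for an SHG loan?",
    "expected_keywords": ["collateral", "margin", "limit"],
}

# Stand-in for the text of the retrieved chunks.
retrieved_text = "The retrieved passage mentions collateral and margin requirements."

score = keyword_recall(retrieved_text, fixture["expected_keywords"])
print(round(score, 2))  # 2 of 3 keywords found -> 0.67
```

Substring matching is crude, but it is cheap and deterministic, which is what makes it useful as a first filter before the LLM judge runs.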
On the judge: I used OpenAI's o3 to grade the answers from gpt-4o. The reason is self-evaluation bias: a model grading its own outputs tends to rate them higher than a separate model would. Whether o3 judging gpt-4o fully solves this is a fair critique I don't have a clean answer to, since both come from the same provider and likely share related training data. It's better than self-evaluation; it isn't a randomized trial.
What the numbers showed
The corpus I ingested: the RBI Master Circular on SHG-Bank Linkage Programme (April 2025), NABARD's Status of Microfinance in India 2023-24, and NABARD's SHGBLP Impact Report — around 2,000 chunks after splitting at 800 tokens with 150-token overlap.
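For readers who haven't seen overlapping chunking spelled out, here is a rough sketch of what "800 tokens with 150-token overlap" means. It is not the LangChain splitter used in the pipeline: it works on an already-tokenized list, and the function name is my own. Each chunk starts 650 tokens (800 minus 150) after the previous one, so neighbours share 150 tokens.

```python
def split_with_overlap(tokens: list, chunk_size: int = 800, overlap: int = 150) -> list[list]:
    """Split a token list into fixed-size windows where neighbours share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Integers stand in for tokens so the window boundaries are easy to read.
tokens = list(range(2000))
chunks = split_with_overlap(tokens)
print(len(chunks))                          # 3 windows: 0-799, 650-1449, 1300-1999
print(chunks[0][-150:] == chunks[1][:150])  # True: neighbours share 150 tokens
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk.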
The eval ran 25 fixture questions. Results:
| Metric | Score |
|---|---|
| Average keyword recall | 52% |
| Average relevance | 4.0 / 5 |
| Average groundedness | 4.48 / 5 |
| Confidently answered (rel ≥ 4 and gnd ≥ 4) | 11 / 25 |
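The summary rows can be derived from per-question scores along these lines. This is a sketch with made-up scores, not the actual eval data, and the `Result` class and `summarize` function are hypothetical names. The one rule taken from the table is the "confidently answered" threshold: relevance and groundedness both at least 4.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Result:
    recall: float       # keyword recall in [0, 1]
    relevance: int      # judge score, 1-5
    groundedness: int   # judge score, 1-5


def summarize(results: list[Result], threshold: int = 4) -> dict:
    """Aggregate per-question scores into the summary metrics."""
    confident = sum(
        1 for r in results
        if r.relevance >= threshold and r.groundedness >= threshold
    )
    return {
        "avg_recall": mean([r.recall for r in results]),
        "avg_relevance": mean([r.relevance for r in results]),
        "avg_groundedness": mean([r.groundedness for r in results]),
        "confident": confident,
        "total": len(results),
    }


# Made-up scores for four questions, for illustration only.
sample = [
    Result(recall=0.5, relevance=5, groundedness=5),
    Result(recall=0.25, relevance=3, groundedness=5),  # grounded but not relevant
    Result(recall=0.75, relevance=4, groundedness=4),
    Result(recall=0.5, relevance=2, groundedness=4),   # grounded but not relevant
]
summary = summarize(sample)
print(summary)
```

Requiring both scores to clear the bar is why the confident count can sit well below what either average suggests on its own.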
The 52% recall didn't surprise me. It's a corpus coverage problem: questions about SHG formation procedures, savings account documentation, and default-liability mechanics simply aren't well covered by the three documents I ingested. Adding the DAY-NRLM operational manual and the NABARD MFI Master Circular should close most of those gaps without any change to the retrieval architecture.
The number that stayed with me was the gap between groundedness (4.48) and relevance (4.0).
The honest chatbot
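Here's what that gap means concretely: when the corpus doesn't contain a good answer, the chatbot tends to say so rather than confabulate one. An answer like that scores high on groundedness, because nothing in it is unsupported, and low on relevance, because it doesn't resolve the question. The chatbot is more reliable at not misleading than it is at being helpful.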
That sounds like faint praise. In this domain, I think it's the right property to have.
The population this is built for — SHG members, loan officers, NGO staff working on credit access in rural UP — is not well served by a chatbot that confidently cites a figure that turns out to be a survey average rather than a policy ceiling. The stakes are real: a borrower who acts on a wrong number about loan limits or interest subvention eligibility can make a genuinely bad financial decision.
Two answers in the eval were flagged for manual review. The figure cited in response to a first-linkage loan amount question (₹71,300) appears in a NABARD source table, but may be a survey average rather than the policy ceiling — the distinction matters if someone is deciding whether to challenge a bank's offer. Two other responses received low groundedness scores for minor unsupported assertions about NABARD grade definitions and required bank records.
Neither failure would have shown up in a demo. They showed up in questions 17 and 25 of an eval.
What was actually hard
Not the retrieval stack. Supabase pgvector with an HNSW cosine index is straightforward to stand up, and the LangChain ingest pipeline is well-documented. The genuinely tricky parts:
Getting clean documents. RBI and NABARD publish their guidelines publicly, but the path from "I want to ingest this" to "I have a clean PDF that produces good chunks" runs through multi-column layouts, footnotes, and tables that don't survive naive PDF parsing. I ended up with chunks that were essentially noise — page headers, table separators, running footnotes — that bloated the corpus and hurt retrieval precision. Better PDF preprocessing is the next infrastructure investment.
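One cheap preprocessing step would be to drop chunks that are obviously noise before embedding them. The filter below is a sketch of that idea, not what the pipeline currently does; the function name and both thresholds are guesses that would need tuning against the real PDFs.

```python
def looks_like_noise(chunk: str, min_words: int = 8, min_alpha_ratio: float = 0.6) -> bool:
    """Heuristic: flag chunks that are too short or mostly non-alphabetic characters."""
    stripped = chunk.strip()
    if not stripped:
        return True
    # Very short chunks are usually page headers or footers.
    if len(stripped.split()) < min_words:
        return True
    # Table separators and rule lines contain few letters.
    non_space = [c for c in stripped if not c.isspace()]
    alpha = sum(1 for c in non_space if c.isalpha())
    return alpha / len(non_space) < min_alpha_ratio


# Hypothetical chunks of the kinds described above.
candidates = [
    "Page 12 of 48",
    "| --- | --- | --- | --- | --- | --- | --- | --- | --- |",
    "This paragraph is ordinary prose with enough words to count as real content.",
]
kept = [c for c in candidates if not looks_like_noise(c)]
print(len(kept))  # 1: only the prose chunk survives
```

A filter like this will also discard some legitimate content, such as numeric tables, so it would need checking against the eval before it replaces proper layout-aware parsing.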
Writing honest fixture questions. The 25 questions need to span the realistic query distribution, not just the easy cases. The temptation is to write questions you already know the corpus answers well. I caught myself doing this and had to deliberately write questions about document gaps — formation rules, interest rate ceilings, what happens at default — even knowing they'd score poorly. An eval is only as honest as the question set.
What I cut
Hindi. The original plan was Hindi-first. I cut it from v0. The blocker isn't the language — I speak Hindi fluently. It's that typing Hindi on a keyboard is slow enough to make iterating on prompts and eval questions painful, and the romanized Hindi that appears in some regulatory documents creates its own normalization problems. More importantly, I need field input on the actual vocabulary SHG members use to ask about loan rules before I build it — otherwise I'll optimize for Hindi that nobody actually uses in this context.
Streaming. The chat interface returns the full response at once rather than streaming tokens. Slightly worse UX, but simpler to build, debug, and evaluate. Streaming comes after the eval pipeline is stable.
Session history. Each query is stateless — no multi-turn memory. For regulatory Q&A, single-turn is often sufficient; most questions are self-contained. If users start asking follow-ups, this is the first thing to add.
More documents. I ingested three. The corpus gaps are fixable with two or three more. I held off because adding documents and re-running evals takes time and I wanted to ship the baseline first, see what's missing, then add deliberately rather than speculatively.
What I learned
Eval-first changes the sequence of decisions, not just the code. When you don't have an eval, you iterate on feel — something looks wrong, you tweak a prompt, it looks better. When you have one, you're iterating on numbers. You know what you're tuning and you know when you've made it worse. I found this more satisfying, even when the numbers were disappointing.
Corpus coverage is the ceiling. Keyword recall topped out at 52% not because of the embedding model or retrieval strategy, but because the source documents don't cover SHG formation or default procedures. The retriever can't return what isn't there. The cleanest path to 70%+ recall is probably two more documents, not a fancier retrieval approach. This is often true in RAG. I keep having to remind myself of it.
Groundedness is a real signal. I'd been skeptical of it as an eval metric — it felt like something you add to a rubric to look thorough. The flagged answers changed my mind. One identified a potential misattribution of a survey figure as a policy figure; another caught an assertion about NABARD grade definitions with no source in the retrieved passages. Both were genuine, actionable findings. Neither was noise.
The question I can't answer from Bengaluru
The metric I'm missing is field accuracy.
The eval tells me what the chatbot gets right relative to the documents. It doesn't tell me whether the documents are current, whether the rules they describe have been superseded by state-level circulars I haven't ingested, or whether a loan officer in Lucknow would recognize the answers as matching what she actually sees in practice.
SHG lending rules are the kind of domain where policy and practice diverge. The RBI circular says one thing. The state-level NRLM guidelines add conditions. The bank's internal credit policy adds more. What actually happens in a branch in Rae Bareli may differ from all three. A chatbot grounded in the circulars knows the circulars. It doesn't know the gap.
Closing this requires conversations with people who live in that gap: NABARD field officers, NGO credit program managers, people who run SHG trainings in rural UP. That's the conversation phase I've been holding off on until I had something concrete to show. I now have two live prototypes. That's probably enough to start.
Also this weekend
The Organic Farm Listing App — built the same weekend, posted separately: a bilingual marketplace for FPOs in the Kanpur belt to list organic produce, with a buyer inquiry form as the contract. No AI, no data visualization — just the infrastructure layer that sits underneath both. Hindi by default, mobile-first.