Most business questions need data answers, but most people who ask them can’t write SQL, and the analysts who can are a bottleneck. The Text-2-SQL Agent decomposes a plain-English question, writes the query, scores its own answer across seven quality dimensions, and retries when quality is low. A Gies research project competing on the AgentBeats benchmark.
Project Lead: Ash Castelino
Each node does one job and hands state to the next: a LangGraph workflow with a quality-gated retry loop, sketched below.
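A minimal sketch of that wiring using LangGraph's StateGraph API. The node names, state fields, threshold, and retry cap here are illustrative stand-ins, not the project's actual identifiers or values:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

# Hypothetical state shape; field names are illustrative.
class AgentState(TypedDict):
    question: str
    sql: str
    score: float
    feedback: str
    attempts: int

def schema_inspector(state: AgentState) -> dict:
    return {}                                   # introspect DB, cache schema

def planner(state: AgentState) -> dict:
    return {}                                   # decompose question into a plan

def sql_generator(state: AgentState) -> dict:
    return {"attempts": state["attempts"] + 1}  # write SQL, folding in feedback

def evaluator(state: AgentState) -> dict:
    return {"score": 0.9}                       # execute and score the result

def route(state: AgentState) -> str:
    # Quality gate: loop back to the generator, or finish.
    if state["score"] < 0.8 and state["attempts"] < 3:
        return "retry"
    return "done"

g = StateGraph(AgentState)
g.add_node("schema_inspector", schema_inspector)
g.add_node("planner", planner)
g.add_node("sql_generator", sql_generator)
g.add_node("evaluator", evaluator)
g.set_entry_point("schema_inspector")
g.add_edge("schema_inspector", "planner")
g.add_edge("planner", "sql_generator")
g.add_edge("sql_generator", "evaluator")
g.add_conditional_edges("evaluator", route, {"retry": "sql_generator", "done": END})
app = g.compile()

# result = app.invoke({"question": "…", "sql": "", "score": 0.0,
#                      "feedback": "", "attempts": 0})
```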
Introspects the database via PRAGMA (no LLM call), hashes the schema with SHA-256, and caches it with a TTL, so repeat questions against the same database skip the roundtrip entirely.
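A sketch of what that introspection path could look like against SQLite; the in-process cache layout and the 300-second TTL are assumptions, not the project's values:

```python
import hashlib
import sqlite3
import time

_CACHE: dict[str, tuple[float, str]] = {}  # db path -> (expiry time, schema text)
TTL_SECONDS = 300                          # illustrative TTL

def introspect(db_path: str) -> str:
    """Describe every table via PRAGMA; pure SQLite, no LLM call."""
    now = time.time()
    hit = _CACHE.get(db_path)
    if hit and hit[0] > now:
        return hit[1]                      # fresh cache hit: skip the roundtrip
    con = sqlite3.connect(db_path)
    try:
        tables = [r[0] for r in con.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        lines = []
        for t in tables:
            cols = con.execute(f"PRAGMA table_info({t})").fetchall()
            lines.append(f"{t}(" + ", ".join(f"{c[1]} {c[2]}" for c in cols) + ")")
        schema = "\n".join(lines)
    finally:
        con.close()
    _CACHE[db_path] = (now + TTL_SECONDS, schema)
    return schema

def schema_hash(schema: str) -> str:
    """SHA-256 fingerprint of the schema, usable as a stable cache key."""
    return hashlib.sha256(schema.encode()).hexdigest()
```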
GPT-5 produces a structured QueryPlan using JSON Schema mode, so the output is guaranteed to parse. It decides whether one query or a multi-step chain is needed; predecessor results are injected into later steps.
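A sketch using the OpenAI Python SDK's structured-outputs helper, which constrains the reply to a JSON Schema derived from a Pydantic model. The QueryPlan fields shown here are hypothetical, and the exact model snapshot name may differ:

```python
from openai import OpenAI
from pydantic import BaseModel

# Hypothetical plan shape; the project's actual QueryPlan fields may differ.
class SubTask(BaseModel):
    purpose: str
    depends_on_previous: bool

class QueryPlan(BaseModel):
    multi_step: bool
    steps: list[SubTask]

client = OpenAI()

def plan(question: str, schema: str) -> QueryPlan:
    # Structured outputs constrain the response to QueryPlan's JSON Schema,
    # so parsing cannot fail on malformed output.
    resp = client.beta.chat.completions.parse(
        model="gpt-5",
        messages=[
            {"role": "system", "content": f"Plan SQL for this schema:\n{schema}"},
            {"role": "user", "content": question},
        ],
        response_format=QueryPlan,
    )
    return resp.choices[0].message.parsed
```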
Writes SQL for each sub-task. On retry, targeted feedback from the previous attempt (what was wrong, which dimension failed) is injected so the model can fix the specific failure rather than guess.
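A hypothetical prompt builder showing that injection; the wording, parameter names, and SQLite dialect are illustrative:

```python
def build_generator_prompt(task: str, schema: str,
                           prev_sql: str | None = None,
                           feedback: str | None = None) -> str:
    """On retry, the failed attempt and the dimension-specific feedback
    are appended so the correction is targeted, not a blind re-roll."""
    prompt = f"Schema:\n{schema}\n\nTask: {task}\nWrite one SQLite query."
    if prev_sql and feedback:
        prompt += (
            f"\n\nYour previous attempt:\n{prev_sql}"
            f"\nIt failed this check: {feedback}"
            "\nFix that specific problem; keep everything that was correct."
        )
    return prompt
```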
Runs the SQL, scores the result across seven dimensions, then runs an independent LLM relevance check. The two scores are blended (85% evaluation + 15% relevance) into a final quality number.
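In code, the blend could look like the sketch below. The 85/15 weights are the project's; averaging the per-dimension scores is an assumption about how the evaluation side is aggregated:

```python
def blend(dimension_scores: dict[str, float], relevance: float) -> float:
    """Combine the multi-dimension evaluation with the independent
    relevance check: 85% evaluation, 15% relevance."""
    eval_score = sum(dimension_scores.values()) / len(dimension_scores)
    return 0.85 * eval_score + 0.15 * relevance
```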
If the score falls below the threshold, the pipeline loops back to the generator with category-specific feedback. If it passes, the sub-task completes and the next one begins. Final results are synthesized into a human-readable answer.
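A plain-Python view of the gate's three outcomes; the threshold and retry cap are placeholder values, not the project's:

```python
THRESHOLD = 0.8    # illustrative
MAX_RETRIES = 3    # illustrative

def decide(score: float, attempts: int, step: int, total_steps: int) -> str:
    """Route after evaluation: retry, advance, or synthesize."""
    if score < THRESHOLD and attempts < MAX_RETRIES:
        return "retry_generator"    # loop back with category-specific feedback
    if step + 1 < total_steps:
        return "next_subtask"       # this sub-task passed; start the next
    return "synthesize_answer"      # all sub-tasks done: produce final answer
```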
Text-to-SQL is easy to prototype and hard to productionize
A single LLM call that writes SQL is fragile: bad joins, wrong aggregations, missing filters. Scoring each result across seven dimensions gives the agent a clear signal about whether its own output is trustworthy.
Generic “try again” loops waste tokens. This agent tells the model which dimension failed and why, so the retry is corrective, not a random re-roll.
Real analytical questions rarely map to one SQL statement. The planner decomposes them, runs queries in sequence, and feeds predecessor results into later steps, just as a human analyst would.
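A hypothetical driver loop for that chaining, where generate_sql and execute stand in for the generator and executor nodes described above:

```python
from typing import Callable

def run_plan(steps: list[str], schema: str,
             generate_sql: Callable[[str, str], str],
             execute: Callable[[str], list]) -> list:
    """Run sub-tasks in order; each step's prompt carries the predecessors'
    results, the way an analyst builds on intermediate numbers."""
    results: list = []
    for step in steps:
        task = step
        if results:
            task += f"\n\nResults from earlier steps: {results}"
        rows = execute(generate_sql(task, schema))
        results.append(rows)
    return results
```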