Building a Production-Ready RAG Pipeline in 10 Hours: How We Won Our Global CTO Hackathon
Earlier this month, I teamed up with a few colleagues for our company’s internal Global CTO Hackathon. Competing in the “AI for Productivity” track, we tackled a massive bottleneck in the Financial Crime Risk Management (FCRM) space: the hours analysts spend manually writing performance narratives for transaction-monitoring models.
Our project, the Generative AI Narrative Tool, was designed to automate these evidence-based reports without sacrificing accuracy or compliance. I’m incredibly proud to say that our project won the Local First Prize!
While the win was a high point, the real story was the engineering sprint. Building an LLM prototype is easy; building a production-grade, grounded pipeline in a single 10-hour window is a different game entirely.
The Problem: The “Narrative Bottleneck”
In the high-stakes world of FCRM, transaction monitoring isn’t just about flagging suspicious activity; it’s about the narrative that justifies the risk. Analysts spend countless hours manually synthesizing data into evidence-based reports. It’s manual, slow, repetitive, and a massive drain on productivity. We set out to build a tool that doesn’t just summarize but thinks like an analyst, by grounding every sentence in historical evidence and operational data.
The Architecture: A Real-World GCP Sandbox
We knew from the start that a Jupyter notebook wouldn’t win. We needed a deployable artifact. As the lead on setup and deployment, I moved our team toward a containerized, serverless architecture on Google Cloud Platform.
📊 Technical Architecture & Data Flow
We landed on the following Google Cloud Platform stack:
- UI: Streamlit deployed on Cloud Run.
- LLM: Vertex AI Gemini.
- Context: Vertex AI RAG Engine for managed retrieval.
- Evaluation: A separate Judge LLM for cross-evaluating outputs.
```mermaid
---
config:
layout: dagre
theme: mc
---
flowchart TB
subgraph s1["Data Ingestion"]
C["📜 Historical Narratives"]
B["📊 MI Dashboard - Excel"]
D["⚙️ Pipeline Orchestrator"]
A["📄 KRI/KPI summary CSV"]
end
subgraph s2["GCP Backend"]
G["🧠 Gemini 2.5 Pro - Author"]
F[("🗄️ Managed Vector Corpus")]
E["☁️ Vertex AI RAG Engine"]
end
subgraph s3["Quality Assurance & Repair"]
I{"Pass?"}
H["🔍 Gemini 2.5 Flash - Cross-Judge"]
end
subgraph s4["Delivery (Cloud Run)"]
K["📥 Final Exported CSV"]
J["🚀 Streamlit on Cloud Run"]
end
A --> D
B --> D
C --> D
D -- Ingest --> E
E -- Embeddings --> F
D -- Query --> G
F -- Contextual Evidence --> G
G -- Draft JSON --> H
H -- Validate & Score --> I
I -- ❌ Failed --> G
I -- ✅ Passed --> J
J -- HITL Review --> K
style D fill:#f1f3f4,stroke:#3c4043,stroke-width:2px
style E fill:#fef7e0,stroke:#f9ab00,stroke-width:2px
style G fill:#e8f0fe,stroke:#1a73e8,stroke-width:2px,color:#1a73e8
style H fill:#e8f0fe,stroke:#1a73e8,stroke-width:2px,color:#1a73e8
style J fill:#e6f4ea,stroke:#137333,stroke-width:2px,color:#137333
```
Technical Deep Dive: Engineering the “Win”
The difference between a “cool demo” and a “winning tool” came down to how we handled three specific engineering challenges:
1. Taming the Output (The JSON Repair Loop)
To make the AI’s output useful for downstream systems, we needed it to return structured data: specifically, boolean flags like needs_action and strict trigger categories. When the LLM returned conversational text instead of valid JSON, the pipeline would crash. The Solution: We implemented a Repair Loop. If JSON parsing failed, the error was caught and fed back to the model with the prompt: “You broke the schema, here is the error, try again.” This let the system self-correct in real time and made the pipeline far more robust.
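The repair loop can be sketched in a few lines. This is an illustrative version, not our production code: `call_llm` stands in for the Vertex AI Gemini call, and the required keys are assumed from the schema described above.

```python
import json

def call_llm(prompt: str) -> str:
    # Placeholder for the real Gemini call; returns a canned response here.
    return '{"needs_action": true, "trigger_category": "velocity_spike"}'

REQUIRED_KEYS = {"needs_action", "trigger_category"}  # assumed schema

def generate_structured(prompt: str, max_repairs: int = 2) -> dict:
    """Ask the model for JSON; on failure, feed the error back and retry."""
    attempt_prompt = prompt
    for _ in range(max_repairs + 1):
        raw = call_llm(attempt_prompt)
        try:
            data = json.loads(raw)
            missing = REQUIRED_KEYS - data.keys()
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return data
        except (json.JSONDecodeError, ValueError) as err:
            # The "you broke the schema" feedback step.
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply was invalid JSON "
                f"({err}). Return only valid JSON matching the schema."
            )
    raise RuntimeError("model failed to produce valid JSON")
```

The key design choice is that the parser's exception message itself becomes part of the retry prompt, so the model sees exactly which constraint it violated.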
2. Architecting the Grounded RAG Engine (Retrieval Augmented Generation)
Generic LLM knowledge is a liability in compliance. We utilized the Vertex AI RAG Engine to ground narratives in actual Management Information (MI) dashboards. By ingesting customer trends and transactional summaries into a managed corpus, we forced the model to cite specific historical snapshots. This eliminated generic “chatbot” phrasing and replaced it with evidence-based insights.
Moving from a local prototype to a production-ready system meant shifting to managed infrastructure: the RAG Engine handled the heavy lifting of context retrieval, replacing our initial local ChromaDB approach.
- Ingestion & Chunking: We preprocessed complex Management Information (MI) Excel dashboards, which contain a range of customer-trend data and transactional summaries, into a structured format for ingestion.
- Embedding & Storage: Using Vertex AI Embeddings, we converted these documents into high-dimensional vectors stored in a managed corpus (narrative_gen_corpus).
- Contextual Retrieval: At runtime, the tool doesn’t just “guess”. It uses the specific record’s KPIs as a query to pull the top-k most relevant historical snapshots from the corpus.
- Dynamic Prompt Injection: The retrieved evidence is injected into the prompt, forcing the LLM to ground its narrative in hard evidence rather than generic probability.
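The query-building and prompt-injection steps can be sketched as pure Python. In production the retrieval hits the managed Vertex AI corpus; here `retrieve_top_k` is a stand-in token-overlap scorer over an in-memory list, and all function names are illustrative assumptions.

```python
def build_query(kpis: dict) -> str:
    """Turn a record's KPIs into a retrieval query string."""
    return "; ".join(f"{name}={value}" for name, value in kpis.items())

def retrieve_top_k(query: str, corpus: list, k: int = 3) -> list:
    """Stand-in retriever: rank snapshots by token overlap with the query."""
    tokens = set(query.lower().replace("=", " ").replace(";", " ").split())
    ranked = sorted(corpus, key=lambda doc: -len(tokens & set(doc.lower().split())))
    return ranked[:k]

def build_prompt(kpis: dict, evidence: list) -> str:
    """Inject retrieved evidence so the narrative must cite it."""
    evidence_block = "\n".join(f"- {snippet}" for snippet in evidence)
    return (
        "Write an evidence-based monitoring narrative.\n"
        f"Current KPIs: {build_query(kpis)}\n"
        "Cite only the historical snapshots below:\n"
        f"{evidence_block}"
    )
```

The point of the final step is that the model never sees the whole corpus, only the top-k snapshots most relevant to the record's own KPIs.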
3. The “Cross-Judge” Evaluation: Automated Quality Assurance
In a regulated domain, we couldn’t rely on the first output alone. We implemented a multi-agent evaluation pattern where a second “Critic” LLM validates the “Author” LLM.
The logic follows a dual-axis assessment:
- Binary Issue Flags: The evaluator checks for “hallucinations” or logical gaps, such as factual inconsistencies with the input data, unsupported recommendations, or missed insights.
- Dimension Scoring: Each narrative is rated as Strong, Adequate, or Weak across four key areas: Data Fidelity, AML Reasoning, Recommendation Support, and Clarity/Structure.
- The “Confidence Signal”: These scores are synthesized into an overall category (e.g., Satisfactory or Requires Enhancement), providing a clear signal to the human reviewer in the Streamlit app.
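A minimal sketch of that synthesis logic, assuming an illustrative rubric (the thresholds and numeric weights below are my assumptions, not the production scoring rules):

```python
DIMENSIONS = (
    "data_fidelity", "aml_reasoning",
    "recommendation_support", "clarity_structure",
)
SCORE_VALUE = {"Strong": 2, "Adequate": 1, "Weak": 0}

def synthesize_signal(issue_flags: dict, scores: dict) -> str:
    """Collapse binary issue flags and dimension scores into one signal."""
    if any(issue_flags.values()):
        # Any hallucination / logical-gap flag forces a rework.
        return "Requires Enhancement"
    total = sum(SCORE_VALUE[scores[d]] for d in DIMENSIONS)
    # Assumed cutoff: no Weak dimension and a majority at least Adequate.
    if all(scores[d] != "Weak" for d in DIMENSIONS) and total >= 5:
        return "Satisfactory"
    return "Requires Enhancement"
```

The asymmetry is deliberate: a single binary flag overrides even perfect dimension scores, because a hallucination is disqualifying in a compliance context regardless of how well-written the narrative is.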
Putting it all Together
My focus during the hackathon was ensuring the “plumbing” was as sophisticated as the AI. I took ownership of the CLI deployment pipeline, building the .sh scripts that automated the GCP environment setup and Cloud Run deployment. It was a great opportunity to apply the theory I’d been studying in my GCP Machine Learning Engineer (MLE) coursework to a high-pressure, hands-on project. Handling service account permissions, API connections, and containerization from scratch was exactly the kind of battle-tested experience I was looking for.
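The core of such a deploy script can be sketched as below. The project ID, region, and service name are placeholders, not the values we actually used:

```shell
#!/usr/bin/env bash
# Illustrative sketch of an environment-setup and deploy script.
set -euo pipefail

PROJECT_ID="my-hackathon-project"   # placeholder
REGION="us-central1"                # placeholder
SERVICE="narrative-tool"            # placeholder

gcloud config set project "$PROJECT_ID"

# Enable the APIs the pipeline depends on.
gcloud services enable run.googleapis.com aiplatform.googleapis.com

# Build the container from the local source and deploy to Cloud Run.
gcloud run deploy "$SERVICE" \
  --source . \
  --region "$REGION" \
  --allow-unauthenticated
```

Using `gcloud run deploy --source .` lets Cloud Build handle the container image, which kept the script short enough to maintain under hackathon time pressure.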
Conclusion
Winning the hackathon was an amazing feeling, but the biggest takeaway for me was realizing that GenAI isn’t magic that resolves everything on its own. It requires robust engineering. Enforcing strict data contracts, building evaluation pipelines and robust CI/CD deployment, and integrating a Human-in-the-Loop review are what actually make LLMs useful in a corporate environment. I’m excited to take these lessons into my day-to-day work.