Building a Production-Ready RAG Pipeline in 10 Hours: How We Won our Global CTO Hackathon

February 21, 2026

Earlier this month, I teamed up with a few colleagues for our company’s internal Global CTO Hackathon. Competing in the “AI for Productivity” track, we tackled a massive bottleneck in the Financial Crime Risk Management (FCRM) space: the hours analysts spend manually writing performance narratives for transaction-monitoring models.

Our project, the Generative AI Narrative Tool, was designed to automate these evidence-based reports without sacrificing accuracy or compliance. I’m incredibly proud to say that our project won the Local First Prize!

While the win was a high point, the real story was the engineering sprint. Building an LLM prototype is easy; building a production-grade, grounded pipeline in a single 10-hour window is a different game entirely.

The Problem: The “Narrative Bottleneck”

In the high-stakes world of FCRM, transaction monitoring isn’t just about flagging suspicious activity; it’s about the narrative that justifies the risk. Analysts spend countless hours manually synthesizing data into evidence-based reports. It’s slow, repetitive, and a massive drain on productivity. We set out to build a tool that doesn’t just summarize but thinks like an analyst, grounding every sentence in historical evidence and operational data.

The Architecture: A Real-World GCP Sandbox

We knew from the start that a Jupyter notebook wouldn’t win. We needed a deployable artifact. As the lead on setup and deployment, I moved our team toward a containerized, serverless architecture on Google Cloud Platform.

📊 Technical Architecture & Data Flow

Here’s the stack we landed on:

  • UI: Streamlit deployed on Cloud Run.
  • LLM: Vertex AI Gemini.
  • Context: Vertex AI RAG Engine for managed retrieval.
  • Evaluation: A separate Judge LLM for cross-evaluating outputs.
```mermaid
---
config:
  layout: dagre
  theme: mc
---
flowchart TB
 subgraph s1["Data Ingestion"]
        C["📜 Historical Narratives"]
        B["📊 MI Dashboard - Excel"]
        D["⚙️ Pipeline Orchestrator"]
        A["📄 KRI/KPI summary CSV"]
  end
 subgraph s2["GCP Backend"]
        G["🧠 Gemini 2.5 Pro - Author"]
        F[("🗄️ Managed Vector Corpus")]
        E["☁️ Vertex AI RAG Engine"]
  end
 subgraph s3["Quality Assurance & Repair"]
        I{"Pass?"}
        H["🔍 Gemini 2.5 Flash - Cross-Judge"]
  end
 subgraph s4["Delivery (Cloud Run)"]
        K["📥 Final Exported CSV"]
        J["🚀 Streamlit on Cloud Run"]
  end
    A --> D
    B --> D
    C --> D
    D -- Ingest --> E
    E -- Embeddings --> F
    D -- Query --> G
    F -- Contextual Evidence --> G
    G -- Draft JSON --> H
    H -- Validate & Score --> I
    I -- ❌ Failed --> G
    I -- ✅ Passed --> J
    J -- HITL Review --> K

    style D fill:#f1f3f4,stroke:#3c4043,stroke-width:2px
    style E fill:#fef7e0,stroke:#f9ab00,stroke-width:2px
    style G fill:#e8f0fe,stroke:#1a73e8,stroke-width:2px,color:#1a73e8
    style H fill:#e8f0fe,stroke:#1a73e8,stroke-width:2px,color:#1a73e8
    style J fill:#e6f4ea,stroke:#137333,stroke-width:2px,color:#137333
```

Technical Deep Dive: Engineering the “Win”

The difference between a “cool demo” and a “winning tool” came down to how we handled three specific engineering challenges:

1. Taming the Output (The JSON Repair Loop)

To make the AI’s output useful for downstream systems, we needed it to return structured data, specifically boolean flags like needs_action and strict trigger categories. When the LLM returned conversational text instead of valid JSON, the pipeline would crash. The Solution: We implemented a Repair Loop. If the JSON parsing failed, the error was caught and fed back to the model with the prompt: “You broke the schema, here is the error, try again.” This allowed the system to self-correct in real time and made the pipeline far more robust.
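The repair loop can be sketched in a few lines. This is a minimal illustration, not our exact implementation: `call_llm` is a stand-in for the actual Gemini call, and the schema fields are the ones mentioned above (`needs_action`, a trigger category) plus a hypothetical `narrative` key.

```python
import json

# Illustrative schema: the post mentions needs_action and trigger categories;
# "narrative" is an assumed field name.
REQUIRED_KEYS = {"needs_action", "trigger_category", "narrative"}

def generate_with_repair(call_llm, base_prompt, max_attempts=3):
    """Ask the model for JSON; on a parse or schema failure, feed the error back."""
    prompt = base_prompt
    last_err = None
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
            missing = REQUIRED_KEYS - data.keys()
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data
        except (json.JSONDecodeError, ValueError) as err:
            last_err = err
            # The "repair" step: tell the model exactly what broke and retry.
            prompt = (
                f"{base_prompt}\n\nYour previous reply was not valid JSON "
                f"matching the schema. Error: {err}. Return ONLY valid JSON."
            )
    raise RuntimeError(f"model failed to produce valid JSON: {last_err}")
```

The key design choice is that the parser’s own error message becomes part of the retry prompt, so the model sees precisely which constraint it violated.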

2. Architecting the Grounded RAG Engine (Retrieval Augmented Generation)

Generic LLM knowledge is a liability in compliance. We utilized the Vertex AI RAG Engine to ground narratives in actual Management Information (MI) dashboards. By ingesting customer trends and transactional summaries into a managed corpus, we forced the model to cite specific historical snapshots. This eliminated generic “chatbot” phrasing and replaced it with evidence-based insights.

Moving from a local prototype to a production-ready system meant shifting to managed infrastructure: the RAG Engine handled the heavy lifting of context retrieval, replacing our initial local ChromaDB approach.

  • Ingestion & Chunking: We preprocessed complex Management Information (MI) Excel dashboards, which contain customer trend data and transactional summaries, into a structured format for ingestion.

  • Embedding & Storage: Using Vertex AI Embeddings, we converted these documents into high-dimensional vectors stored in a managed corpus (narrative_gen_corpus).

  • Contextual Retrieval: At runtime, the tool doesn’t just “guess”. It uses the specific record’s KPIs as a query to pull the top-k most relevant historical snapshots from the corpus.

  • Dynamic Prompt Injection: This retrieved evidence is injected into the prompt, forcing the LLM to ground its narrative in hard evidence rather than generic probability.
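The retrieve-then-inject pattern above can be sketched as follows. This is a simplified illustration only: the real pipeline queries the managed Vertex AI corpus by embedding similarity, whereas here the search is stubbed with naive term overlap so the shape of the pattern is visible, and all function names are hypothetical.

```python
def retrieve_snapshots(corpus, kpi_query, top_k=3):
    """Stand-in for the vector search: rank stored snapshots by term overlap.
    In production this would be a similarity query against the managed corpus."""
    terms = set(kpi_query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def build_grounded_prompt(record_kpis, corpus):
    """Inject retrieved evidence into the prompt so the model must cite it."""
    evidence = retrieve_snapshots(corpus, record_kpis)
    evidence_block = "\n".join(f"- {doc}" for doc in evidence) or "- (no evidence found)"
    return (
        "Write an evidence-based narrative for this record.\n"
        f"KPIs: {record_kpis}\n"
        "Cite ONLY the historical snapshots below; do not invent facts:\n"
        f"{evidence_block}"
    )
```

The point of the pattern is the last step: the model never sees the full corpus, only the top-k snapshots most relevant to the current record’s KPIs, with an explicit instruction to stay within that evidence.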

3. The “Cross-Judge” Evaluation: Automated Quality Assurance

In a regulated domain, we couldn’t rely on the first output alone. We implemented a multi-agent evaluation pattern where a second “Critic” LLM validates the “Author” LLM.

The logic follows a dual-axis assessment:

  • Binary Issue Flags: The evaluator checks for “hallucinations” or logical gaps, such as factual inconsistencies with the input data, unsupported recommendations, or missed insights.

  • Dimension Scoring: Each narrative is rated as Strong, Adequate, or Weak across four key areas: Data Fidelity, AML Reasoning, Recommendation Support, and Clarity/Structure.

  • The “Confidence Signal”: These scores are synthesized into an overall category (e.g., Satisfactory or Requires Enhancement), providing a clear signal to the human reviewer in the Streamlit app.
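Synthesizing the dual-axis assessment into one signal can be sketched like this. The dimension names and the Strong/Adequate/Weak scale come from our design; the point values, threshold, and fail-fast rules below are illustrative assumptions, not the exact logic we shipped.

```python
# Dimension names from the post; scoring scheme is an assumption for illustration.
DIMENSIONS = ("data_fidelity", "aml_reasoning",
              "recommendation_support", "clarity_structure")
RATING_POINTS = {"Strong": 2, "Adequate": 1, "Weak": 0}

def confidence_signal(issue_flags, ratings):
    """Collapse binary issue flags and per-dimension ratings into one
    reviewer-facing label for the Streamlit app."""
    # Any hallucination or logic-gap flag fails the narrative outright.
    if any(issue_flags.values()):
        return "Requires Enhancement"
    # A single Weak dimension also fails, regardless of the total score.
    if any(ratings[d] == "Weak" for d in DIMENSIONS):
        return "Requires Enhancement"
    score = sum(RATING_POINTS[ratings[d]] for d in DIMENSIONS)
    return "Satisfactory" if score >= 6 else "Requires Enhancement"
```

Keeping the synthesis in plain code (rather than asking the judge LLM for a final verdict) makes the pass/fail policy auditable, which matters in a regulated domain.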

Putting it all Together

My focus during the hackathon was ensuring the “plumbing” was as sophisticated as the AI. I took ownership of the CLI deployment pipeline, building the .sh scripts that automated the GCP environment setup and Cloud Run deployment. It was a great opportunity to apply the theory I’d been studying in my GCP Machine Learning Engineer (MLE) coursework to a high-pressure, hands-on project. Handling service account permissions, API connections, and containerization from scratch was exactly the kind of battle-tested experience I was looking for.

Conclusion

Winning the hackathon was an amazing feeling, but the biggest takeaway for me was realizing that GenAI isn’t magic that solves everything. It requires robust engineering. Enforcing strict data contracts, building evaluation pipelines, setting up robust CI/CD deployment, and integrating Human-in-the-Loop review are what actually make LLMs useful in a corporate environment. I’m excited to take these lessons into my day-to-day work.