RAG API Using FastAPI
A FastAPI-based Retrieval-Augmented Generation (RAG) API that uses Ollama's TinyLlama model and ChromaDB to generate contextually grounded responses to user queries.
Problem Statement
Large Language Models can hallucinate facts and lack access to domain-specific or up-to-date information. Organizations need systems that can ground LLM responses in their proprietary knowledge bases while maintaining low latency and high accuracy.
Methodology
Built a modular RAG pipeline with document ingestion and overlapping-chunk splitting. Used ChromaDB as the vector store for semantic retrieval and Ollama's TinyLlama for response generation. Created a FastAPI backend with async request processing to serve concurrent queries. Dockerized the application for easy deployment.
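The two core pipeline steps can be sketched without external dependencies. This is an illustrative outline only: the actual project uses ChromaDB embeddings and TinyLlama generation, while here vector search is approximated by simple word-overlap scoring so the example runs standalone. The function names (chunk_text, retrieve) are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks with overlapping windows,
    so context spanning a chunk boundary is not lost."""
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the query.
    Stand-in for ChromaDB's embedding-based similarity search."""
    q = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]

if __name__ == "__main__":
    doc = ("FastAPI is a modern web framework. ChromaDB stores embeddings. "
           "TinyLlama generates answers from retrieved context.")
    chunks = chunk_text(doc, chunk_size=6, overlap=2)
    print(retrieve("how are answers generated", chunks, k=1))
```

In the real pipeline, the top-k chunks returned by retrieval are concatenated into the prompt sent to TinyLlama, grounding the generated answer in the stored documents.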
Results
Reduced response latency to under 2 seconds for document retrieval and generation. Achieved 92% relevance score in user evaluations. The API handles 100+ concurrent requests with horizontal scaling support.