Nitin Guleria

6. Building VivekanandaGPT

Building VivekanandaGPT using open-source models with RAG, data cleaning, and a system prompt

Introduction

In an age where information is abundant, the ability to quickly and accurately access specific knowledge from vast textual sources is paramount. Large Language Models (LLMs) have revolutionized how we interact with information, but they often suffer from issues like hallucination and a lack of up-to-date knowledge. Retrieval-Augmented Generation (RAG) offers a powerful solution by combining the generative capabilities of LLMs with the precision of information retrieval. This blog post details the creation of “VivekanandaGPT,” a specialized chatbot designed to provide answers based solely on the teachings and writings of Swami Vivekananda, drawing from his complete works available on Wikisource.

Our goal is to demonstrate how open-source models and publicly available data can be leveraged to build a domain-specific AI assistant. VivekanandaGPT will serve as a reliable source of information on Swami Vivekananda’s philosophy, ensuring that all responses are grounded in his original texts and eliminating external biases or fabricated content.

This post will walk you through the entire process, from data acquisition and cleaning to selecting appropriate open-source models and implementing the RAG architecture. We will also address crucial steps to mitigate hallucination and keep the chatbot's responses grounded in the source texts.

Data Acquisition and Cleaning

The foundation of any successful RAG model is the quality of its knowledge base. For VivekanandaGPT, our primary source of information is “The Complete Works of Swami Vivekananda” from Wikisource. This digital collection contains a comprehensive repository of his speeches, writings, letters, and conversations.

Scraping the Data

To build our knowledge base, we first needed to extract the text from the Wikisource website. We developed a Python script using the requests and BeautifulSoup libraries to scrape the content. The script navigates the main page, identifies all links to the individual volumes and sections of the book, and then extracts the text from each page.

One of the initial challenges was handling the relative URLs found on the page. The script was designed to prepend the base URL (https://en.wikisource.org) to any relative links to ensure they could be accessed correctly. Additionally, to avoid being blocked by the server for making too many requests in a short period, we incorporated a one-second delay between each page request.

Cleaning the Text

The raw HTML content scraped from the web is filled with extraneous information, such as navigation menus, headers, footers, and other elements that are not part of the actual text. To create a clean dataset for our RAG model, we performed a series of cleaning steps:

  1. Removing Unwanted HTML Elements: We used BeautifulSoup to parse the HTML and remove all script, style, header, footer, nav, and other non-content tags.

  2. Filtering by IDs and Classes: We identified specific CSS IDs and classes used by Wikisource for non-content elements (e.g., mw-navigation, printfooter) and removed them from the parsed HTML.

  3. Extracting the Main Content: We found that the primary content of each page was typically contained within a div element with the ID mw-content-text. We extracted the text from this div to isolate the relevant information.

  4. Text Normalization: We performed several text normalization steps, including:

    • Replacing multiple spaces and newlines with a single space or newline.
    • Removing common Wikisource/Wikipedia artifacts like “[edit]”, “[citation needed]”, and navigation links.

Here is the Python script we used for this process:

import requests
from bs4 import BeautifulSoup
import re
import os
import time

def get_page_content(url):
    try:
        # Identify the scraper with a descriptive User-Agent, which Wikimedia sites expect
        headers = {"User-Agent": "VivekanandaGPT-scraper/1.0 (educational project)"}
        response = requests.get(url, headers=headers)
        response.raise_for_status()  # Raise an exception for HTTP errors
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

def clean_text(html_content):
    if not html_content:
        return ""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove unwanted elements (scripts, styles, navigation, etc.)
    for s in soup(['script', 'style', 'header', 'footer', 'nav', 'aside', 'form', 'input', 'button', 'img', 'link']):
        s.decompose()

    # Remove elements with specific IDs or classes that are not content
    unwanted_ids = ['mw-navigation', 'mw-panel', 'footer', 'p-logo', 'p-navigation', 'p-search', 'p-interaction', 'p-tb', 'p-coll-print_export', 'p-lang', 'siteSub', 'contentSub', 'jump-to-nav', 'firstHeading', 'catlinks']
    for id_name in unwanted_ids:
        element = soup.find(id=id_name)
        if element:
            element.decompose()

    unwanted_classes = ['mw-editsection', 'printfooter', 'portal', 'mw-indicator', 'noprint', 'sister-project', 'infobox', 'metadata', 'thumbinner', 'mw-jump-link']
    for class_name in unwanted_classes:
        for element in soup.find_all(class_=class_name):
            element.decompose()

    # Extract main content area - this might need adjustment based on page structure
    content_div = soup.find(id='mw-content-text')
    if content_div:
        text = content_div.get_text(separator=' ', strip=True)
    else:
        text = soup.get_text(separator=' ', strip=True)

    # Normalize whitespace: collapse runs of spaces/tabs, then runs of newlines
    text = re.sub(r'[ \t]+', ' ', text)
    text = re.sub(r'\n+', '\n', text)

    # Remove common wikisource/wikipedia artifacts that remain
    text = re.sub(r'\(function\(\) {[^}]*}\)\(\);', '', text) # Remove javascript snippets
    text = re.sub(r'^\[[^\]]*\]', '', text) # Remove leading [Jump to content] etc.
    text = re.sub(r'\[\s*(?:edit|citation needed)\s*\]', '', text, flags=re.IGNORECASE)  # Bracketed editing artifacts
    text = re.sub(r'\b(?:view history|view source|printable version|permanent link|related changes|what links here)\b', '', text, flags=re.IGNORECASE)  # Residual navigation labels

    return text.strip()

def main():
    base_url = "https://en.wikisource.org"
    main_page_url = base_url + "/wiki/The_Complete_Works_of_Swami_Vivekananda"
    
    # Get all links from the main page that point to volumes/sections
    main_page_content = get_page_content(main_page_url)
    if not main_page_content:
        print("Could not fetch main page content. Exiting.")
        return

    soup = BeautifulSoup(main_page_content, 'html.parser')
    all_links = [a.get('href') for a in soup.find_all('a', href=True)]
    
    # Filter for relevant volume/section links and prepend base_url if relative
    volume_links = []
    for link in all_links:
        if link and "The_Complete_Works_of_Swami_Vivekananda/Volume_" in link and not "#" in link and not "action=edit" in link:
            if link.startswith("/"):
                volume_links.append(base_url + link)
            else:
                volume_links.append(link)
    
    # Remove duplicates by converting to set and back to list
    volume_links = list(set(volume_links))
    
    # Create a directory to store the cleaned text files
    output_dir = "vivekananda_text"
    os.makedirs(output_dir, exist_ok=True)

    print(f"Found {len(volume_links)} unique volume/section links. Starting extraction...")

    for i, link in enumerate(volume_links):
        print(f"Processing link {i+1}/{len(volume_links)}: {link}")
        page_content = get_page_content(link)
        cleaned_text = clean_text(page_content)
        
        # Create a filename from the URL
        filename = link.split('/')[-1].replace(':', '').replace(' ', '_') + ".txt"
        file_path = os.path.join(output_dir, filename)
        
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(cleaned_text)
        print(f"Saved cleaned text to {file_path}")
        time.sleep(1) # Add a 1-second delay to avoid rate-limiting

if __name__ == "__main__":
    main()

The result of this process is a directory of clean text files, each corresponding to a section of Swami Vivekananda’s complete works. This cleaned dataset forms the backbone of our VivekanandaGPT, providing the knowledge base from which the RAG model will retrieve information.

Open-Source Models and RAG Implementation

Building a RAG system involves several key components: an embedding model to convert text into numerical representations (embeddings), a vector database to store and efficiently search these embeddings, and a Large Language Model (LLM) to generate responses based on retrieved information. The open-source ecosystem offers a wealth of options for each of these components, allowing for flexible and cost-effective deployment.

Choosing an Open-Source LLM

For VivekanandaGPT, the choice of LLM is crucial. We need a model that can be fine-tuned or effectively used with RAG to provide accurate and contextually relevant answers based on Swami Vivekananda’s teachings. While many powerful LLMs exist, we prioritize open-source models that can be run either locally or on platforms like Hugging Face, ensuring accessibility and control over the deployment environment. Some strong candidates for RAG applications include:

  • Llama 2 (Meta): A family of pre-trained and fine-tuned LLMs ranging in size from 7B to 70B parameters. Llama 2 has shown strong performance across various tasks and is a popular choice for RAG due to its open availability and robust community support.
  • Mistral 7B (Mistral AI): A smaller yet highly capable model that offers excellent performance for its size, making it suitable for local deployment or environments with limited resources. Its efficiency and strong performance make it a compelling option for RAG.
  • Gemma (Google): A lightweight, state-of-the-art open model from Google, built from the same research and technology used to create the Gemini models. Gemma models are designed for responsible AI development and offer good performance for their size.

The selection will ultimately depend on the available computational resources and the desired balance between model size, performance, and inference speed.
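
As a minimal sketch, and assuming sufficient GPU memory, any of these models can be loaded locally through the Hugging Face transformers library. The model ID and generation settings below are illustrative, not a recommendation:

from transformers import pipeline

# Load an instruction-tuned open model; swap the model ID for Llama 2 or Gemma as needed
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed model ID
    device_map="auto",
)

# Quick smoke test of the raw model before wiring it into the RAG pipeline
output = generator("Summarize the idea of Karma Yoga in one sentence.", max_new_tokens=100, do_sample=False)
print(output[0]["generated_text"])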

RAG Frameworks and Libraries

Implementing a RAG pipeline from scratch can be complex. Fortunately, several open-source frameworks and libraries simplify the process, providing pre-built components and abstractions for common RAG patterns. Key frameworks include:

  • LangChain: A widely adopted framework for developing applications powered by language models. LangChain provides modules for document loading, text splitting, embeddings, vector stores, and chaining LLM calls with retrieval. Its extensive integrations and active community make it an excellent choice for building RAG applications.
  • LlamaIndex: Another popular data framework for LLM applications, LlamaIndex focuses on making it easy to ingest, structure, and access private or domain-specific data with LLMs. It offers various data connectors and indexing strategies optimized for RAG.
  • Haystack (Deepset): An end-to-end framework for building NLP applications, including RAG. Haystack provides a modular architecture that allows developers to easily swap out components like retrievers, readers, and generators. It’s known for its flexibility and production-readiness.
  • RAGFlow: An open-source RAG engine that aims to streamline the RAG workflow. It combines LLMs with external knowledge bases to provide truthful question-answering capabilities.

These frameworks abstract away much of the complexity, allowing developers to focus on integrating their data and chosen LLMs. For VivekanandaGPT, we will likely leverage LangChain or LlamaIndex due to their comprehensive features and strong community support.
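
To make the framework discussion concrete, here is a minimal LangChain sketch for loading the cleaned text files and splitting them into chunks. The import paths may differ slightly between LangChain versions, and the chunk sizes are illustrative:

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load every cleaned .txt file produced by the scraping script
loader = DirectoryLoader("vivekananda_text", glob="*.txt", loader_cls=TextLoader)
documents = loader.load()

# Split into overlapping chunks so each embedding covers one coherent passage
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(documents)
print(f"Loaded {len(documents)} files and produced {len(chunks)} chunks")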

Vector Stores

To efficiently retrieve relevant text snippets from our cleaned Vivekananda dataset, we need a vector store. A vector store (or vector database) stores the numerical embeddings of our text data and allows for fast similarity searches. Popular open-source options include:

  • Chroma: A lightweight, in-memory vector database that is easy to set up and use for smaller-scale RAG applications. It’s a good choice for prototyping and local development.
  • FAISS (Facebook AI Similarity Search): A library for efficient similarity search and clustering of dense vectors. While not a full-fledged database, it’s highly optimized for speed and can be used for the retrieval component of a RAG system.
  • Pinecone, Weaviate, Qdrant: These are more robust, production-ready vector databases that offer scalability, persistence, and advanced features. Weaviate and Qdrant are open source with optional managed cloud services, while Pinecone is a fully managed commercial service.

For the initial prototype of VivekanandaGPT, Chroma or FAISS would be suitable for local development, providing a solid foundation for the retrieval mechanism.
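
Continuing the sketch above, and assuming the chunks produced by the text splitter, the following shows how those chunks could be embedded with all-MiniLM-L6-v2 and stored in a local Chroma database (directory names and parameters are illustrative):

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Open-source sentence-transformer embedding model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Persist the vector store locally so it can be reused between runs
vectorstore = Chroma.from_documents(
    documents=chunks,  # chunks produced by the text splitter in the previous sketch
    embedding=embeddings,
    persist_directory="vivekananda_chroma",
)

# Retriever used later in the RAG pipeline; k controls how many passages are returned
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})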

Mitigating Hallucination and Ensuring Personality

One of the core requirements for VivekanandaGPT is to prevent the model from adopting a false persona or hallucinating content drawn from outside the dataset, and to ensure all answers are strictly based on the provided texts. This can be achieved through a combination of RAG architecture design and careful prompt engineering:

  1. Strict RAG Implementation: By ensuring that the LLM only generates responses based on the retrieved context from the Vivekananda dataset, we inherently limit its ability to generate information outside of the provided texts. The RAG architecture itself acts as a strong guardrail against hallucination.

  2. System Prompt Engineering: A critical step in controlling the LLM’s behavior is through a well-crafted system prompt. This prompt sets the persona and constraints for the LLM, guiding its responses. For VivekanandaGPT, the system prompt will explicitly instruct the model to:
    • Act as Swami Vivekananda: The model should adopt the tone, style, and philosophical perspective of Swami Vivekananda based on the provided texts.
    • Adhere strictly to the provided context: Emphasize that responses must be derived only from the retrieved documents. The model should not use its pre-trained knowledge beyond understanding the query and the provided context.
    • State ignorance for out-of-context questions: If a question cannot be answered using the provided texts, the model should explicitly state, “I am not aware of that,” or a similar phrase, rather than attempting to generate a speculative answer.
    • Avoid personal opinions or external information: Reinforce that the model should not introduce its own opinions, external facts, or information not present in Swami Vivekananda’s works.

An example of such a system prompt might look like this:

"You are Swami Vivekananda. Your purpose is to answer questions based solely on the provided texts from 'The Complete Works of Swami Vivekananda'. Do not use any external knowledge or personal opinions. If a question cannot be answered from the provided context, respond with 'I am not aware of that.' Maintain the philosophical and spiritual tone of Swami Vivekananda in your responses."

This explicit instruction helps prevent the model from presenting a false persona or drawing on outside data, ensuring that VivekanandaGPT remains true to its source material.
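
As an illustration of how such a system prompt can be wired into the pipeline, the sketch below builds a LangChain chat prompt that combines the system instructions, the retrieved context, and the user's question. The template wording is an assumption, not a fixed recipe:

from langchain_core.prompts import ChatPromptTemplate

SYSTEM_PROMPT = (
    "You are Swami Vivekananda. Answer questions based solely on the provided texts "
    "from 'The Complete Works of Swami Vivekananda'. Do not use any external knowledge "
    "or personal opinions. If a question cannot be answered from the provided context, "
    "respond with 'I am not aware of that.' Maintain the philosophical and spiritual "
    "tone of Swami Vivekananda in your responses."
)

# The {context} slot is filled with retrieved passages, {question} with the user's query
prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_PROMPT),
    ("human", "Context from the Complete Works:\n{context}\n\nQuestion: {question}"),
])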

Building VivekanandaGPT Prototype

With the data cleaned and our understanding of open-source RAG components solidified, the next step is to build a functional prototype of VivekanandaGPT. This involves:

  1. Text Chunking and Embedding: The cleaned text data will be divided into smaller, manageable chunks. These chunks will then be converted into numerical vector embeddings using an open-source embedding model (e.g., all-MiniLM-L6-v2 from Hugging Face). These embeddings capture the semantic meaning of the text.

  2. Vector Database Population: The generated embeddings, along with their corresponding text chunks, will be stored in a vector database (e.g., ChromaDB). This database will enable efficient similarity searches, allowing us to quickly retrieve the most relevant text chunks when a user asks a question.

  3. RAG Pipeline Construction: We will use a RAG framework like LangChain to orchestrate the retrieval and generation process; a minimal sketch of the assembled chain follows this list. The pipeline will typically involve:

    • Retriever: Given a user query, the retriever will search the vector database for the most semantically similar text chunks from Swami Vivekananda’s works.
    • Generator: The retrieved text chunks will be passed as context to the chosen open-source LLM (e.g., Llama 2, Mistral 7B). The LLM, guided by the system prompt, will then generate a coherent and relevant answer based only on this provided context.
  4. Local Deployment or Hugging Face Integration: The prototype will be set up to run either locally using tools like Ollama for local LLM inference or deployed on Hugging Face Spaces for broader accessibility. This choice will depend on the computational resources available and the desired ease of sharing.
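
Putting the earlier sketches together, a minimal end-to-end chain might look like the following. It assumes the retriever and prompt defined above and uses Ollama for local inference; the model name and chain wiring are illustrative and depend on the LangChain version:

from langchain_community.llms import Ollama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Assumes the Ollama server is running locally and the model has been pulled (e.g. `ollama pull mistral`)
llm = Ollama(model="mistral")

def format_docs(docs):
    # Concatenate retrieved chunks into a single context string for the prompt
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What did Swami Vivekananda say about fearlessness?")
print(answer)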

Testing and Refinement

Once the prototype is built, rigorous testing is essential to ensure its accuracy, consistency, and adherence to the defined constraints. This phase will involve:

  1. Question Answering Evaluation: We will prepare a set of questions related to Swami Vivekananda’s works and evaluate the chatbot’s responses. This includes checking for:

    • Accuracy: Is the answer factually correct according to the source texts?
    • Relevance: Does the answer directly address the user’s question?
    • Grounding: Is the answer solely based on the provided context, or does it introduce external information?
    • Hallucination: Does the model generate any fabricated or misleading information?
  2. Edge Case Testing: We will specifically test questions that are outside the scope of the dataset to verify that the model correctly responds with “I am not aware of that” or a similar phrase, without attempting to generate an answer; a small test harness for these checks is sketched after this list.

  3. Prompt Optimization: Based on the testing results, we will refine the system prompt and potentially the RAG pipeline parameters to improve performance and minimize undesirable behaviors. This iterative process is crucial for achieving a high-quality, reliable chatbot.

  4. User Feedback (Optional): For a more robust evaluation, gathering feedback from users familiar with Swami Vivekananda’s works can provide valuable insights into the chatbot’s effectiveness and areas for improvement.
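
A simple way to automate the first two checks is a small harness that runs both in-scope and out-of-scope questions through the chain and looks for the refusal phrase. The questions and the exact phrase below are illustrative:

# Each entry is (question, expected_in_scope)
test_cases = [
    ("What did Swami Vivekananda say about education?", True),
    ("Who won the 2022 FIFA World Cup?", False),
]

for question, in_scope in test_cases:
    answer = rag_chain.invoke(question)
    refused = "i am not aware of that" in answer.lower()
    if in_scope and refused:
        print(f"[MISS] Refused an in-scope question: {question}")
    elif not in_scope and not refused:
        print(f"[LEAK] Possible hallucination on out-of-scope question: {question}")
    else:
        print(f"[OK] {question}")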

Conclusion

Building VivekanandaGPT demonstrates the power of open-source tools and RAG architecture in creating specialized, knowledge-grounded AI assistants. By meticulously cleaning the dataset, selecting appropriate open-source models, and employing careful prompt engineering, we can develop a chatbot that provides accurate, context-aware, and hallucination-free responses based on a specific body of work. This approach not only democratizes access to advanced AI capabilities but also ensures the integrity and fidelity of the information disseminated. VivekanandaGPT stands as a testament to how AI can be used to preserve and disseminate valuable knowledge in a controlled and reliable manner.
