Outline:
- Introduction
- Scoping
- Setup on Azure & Databricks (Optional)
- Data
- Modeling
- Deployment
- Monitoring
- Cost Analysis
1. Dataset Information
- Data Ingestion: Data will be sourced primarily from structured and unstructured documents, including PDFs and HTML pages, pertinent to user queries.
- Data Processing: Textual data will be segmented, encoded into embeddings, and stored in Delta Tables for optimized retrieval and processing by the RAG model.
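The segmentation step in Data Processing can be sketched as below. This is a minimal illustration, not the project's actual pipeline: the function name chunk_text, the word-based splitting, and the max_words and overlap defaults are all assumptions, and the embedding and Delta Table write steps are left out.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-based chunks that share `overlap` words with the previous chunk.

    Hypothetical helper: the chunk size and overlap are illustrative defaults,
    not values taken from the project.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

Overlapping chunks keep a sentence that straddles a boundary retrievable from either side, at the cost of storing some text twice.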
2. Technical Objectives
- RAG Integration: Use Retrieval-Augmented Generation (RAG) so the chatbot grounds its responses in document passages retrieved for each query.
- Data Management: Use Delta Tables for robust and scalable data handling.
- Model Management: Implement MLflow for tracking, versioning, and managing the RAG model throughout its lifecycle.
3. Architecture and Technologies
- Azure Databricks: Central platform for model training, data processing, and analytics.
- Delta Tables: To ensure ACID compliance and unify batch and stream processing.
- MLflow: For model development and performance tracking.
- Serverless Deployment: Utilize Azure Functions for deploying the RAG model, ensuring cost efficiency and scalability.
4. Potential Challenges and Mitigation Strategies
- Data Security and Compliance: Ensure all data handling complies with legal standards, including GDPR.
- System Integration: Test integrations extensively to prevent disruptions in existing workflows.
- Cost Management: Monitor and optimize resource usage to control operational costs effectively.
5. Offline Evaluation of our RAG-chatbot
Step 1: Creating an External Model Endpoint
- External Foundation Model Configuration: Configure an external model endpoint that connects to Azure OpenAI and gives the chatbot access to hosted language models such as GPT-3.5. Open models such as Mistral-7b and Llama-2 are not part of the Azure OpenAI catalog and would need a separate provider or endpoint. This setup improves the chatbot’s ability to handle complex queries.
- Fallback Configuration: If Azure OpenAI services are unavailable or API key issues arise, the system switches to Databricks’ managed Llama-2-70b-chat model as a fallback, ensuring continuous chatbot functionality.
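The fallback rule above can be sketched as a small selection function. This is an illustration under assumptions: pick_endpoint and the is_available health check are hypothetical names, and the real availability test (for example, a probe request against the endpoint) is left to the caller.

```python
from typing import Callable


def pick_endpoint(
    primary: str,
    fallback: str,
    is_available: Callable[[str], bool],
) -> str:
    """Return `primary` when its health check passes, otherwise `fallback`.

    `is_available` is a caller-supplied (hypothetical) check; any exception it
    raises, such as an invalid API key error, is treated as "unavailable".
    """
    try:
        if is_available(primary):
            return primary
    except Exception:
        # A failing health check must not take the chatbot down,
        # so fall through to the fallback endpoint.
        pass
    return fallback
```

In this setup, primary would name the Azure OpenAI endpoint and fallback the Databricks-managed Llama-2-70b-chat endpoint.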
Step 2: Preparing Evaluation Dataset
- The evaluation dataset, stored in Azure Data Lake Storage Gen2 (ADLS Gen2), is crafted using GPT-4 to generate realistic user inquiries and high-quality answers. This dataset is essential for validating the chatbot’s responses and its capability to manage real-world interactions.
Step 3: Offline LLM Evaluation
- Predicting and Storing: Before production, the chatbot answers every question in the evaluation dataset, and its predictions are stored alongside the expert answers so the two can be compared. This offline evaluation checks accuracy and reliability and leaves room for adjustments before live deployment.
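A minimal sketch of the predict-and-store loop follows. It assumes each evaluation row is a dict with "inputs" and "ground_truth" keys and that predict is any callable wrapping the chatbot; both are assumptions for illustration. The token-overlap F1 score is only a cheap stand-in for a sanity check, not one of the MLflow metrics described in Step 4.

```python
from collections import Counter
from typing import Callable


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (a rough stand-in metric)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def run_offline_eval(rows: list[dict], predict: Callable[[str], str]) -> list[dict]:
    """Run the chatbot on each row and keep the prediction next to the expert answer."""
    results = []
    for row in rows:
        answer = predict(row["inputs"])
        results.append(
            {**row, "prediction": answer, "token_f1": token_f1(answer, row["ground_truth"])}
        )
    return results
```

The returned rows can then be written to a table and passed to the LLM-as-judge metrics.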
Step 4: LLM as Judge
The evaluation uses various MLflow metrics to assess different aspects of the chatbot’s responses:
- Answer Correctness: This metric measures how accurate the chatbot’s response is against the expected answer, which is crucial for user satisfaction. High scores indicate alignment with ground_truth and factual correctness, while low scores reflect discrepancies or inaccuracies. Note that this metric builds on answer_similarity.
- Toxicity: This metric checks that the chatbot’s language remains appropriate and free from harmful content. Scores range from 0 (not offensive) to 1 (offensive).
- Readability Scores: These metrics (such as Flesch-Kincaid and ARI) assess the complexity of the chatbot’s responses, ensuring they are accessible and understandable to the target audience.
- Professionalism: We are creating a custom metric “professionalism” which ranges from {1: ‘very casual’, 2: ‘casual’, 3: ‘neutral’, 4: ‘formal’, 5: ‘very formal’}. This new metric evaluates the formality and appropriateness of the chatbot’s communication style, ensuring it is suitable for its intended professional or formal context.
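To make the readability scores concrete, the Automated Readability Index (ARI) can be computed by hand as 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43. The sketch below uses simple regular-expression tokenization, which is an assumption; the packaged MLflow metric may count words and sentences differently, so its values can differ from these.

```python
import re


def automated_readability_index(text: str) -> float:
    """Approximate ARI grade level using naive word and sentence splitting."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    # Count only letters and digits as characters, ignoring apostrophes.
    chars = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43
```

Longer words and longer sentences both raise the score, so a response written for a general audience should score lower than a dense technical one.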
Step 5: Model Becomes Production Ready
- After rigorous testing and validation, and once the model meets all criteria, it is finalized for deployment in a production environment. This step confirms the chatbot’s readiness to handle real-world tasks and ensures its reliability, accuracy, and appropriateness before interacting with users.
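One way to make "meets all criteria" explicit is a release gate over the aggregated evaluation metrics. The sketch below is illustrative only: the function name, the metric names, and every threshold value are hypothetical and would need to be agreed on for the actual project.

```python
def check_release_gate(metrics: dict, thresholds: dict) -> tuple[bool, list[str]]:
    """Return (passed, failures) for aggregated evaluation metrics.

    `thresholds` maps a metric name to ("min" | "max", limit).
    """
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: no value recorded")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} is below the minimum {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value} is above the maximum {limit}")
    return (not failures, failures)


# Hypothetical thresholds for illustration; none of these values come from the project.
EXAMPLE_THRESHOLDS = {
    "answer_correctness_mean": ("min", 4.0),
    "toxicity_max": ("max", 0.1),
    "professionalism_mean": ("min", 4.0),
}
```

A missing metric counts as a failure, so a model cannot pass the gate simply because one evaluation was skipped.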