Project Report: Offline Customer Support Chatbot (Chic Boutique)

Offline large language models (LLMs) enable organizations to deliver AI-powered solutions without relying on cloud services. This ensures business continuity, eliminates recurring API costs, and provides full control over data and model behavior - critical advantages for customer-facing applications.

Manirider

May 2, 2026

0 upvotes

0 downloads

0 views

ai llm rag prompt eval

View source

# Project Report: Offline Customer Support Chatbot (Chic Boutique) ## 1. Introduction ### Importance of Offline LLMs Offline large language models (LLMs) enable organizations to deliver AI-powered solutions without relying on cloud services. This ensures business continuity, eliminates recurring API costs, and provides full control over data and model behavior - critical advantages for customer-facing applications. ### Data Privacy (GDPR, DPDP) By running entirely on-premises, this chatbot ensures that no customer data leaves the organization's infrastructure. This directly supports compliance with the EU's General Data Protection Regulation (GDPR) and India's Digital Personal Data Protection Act (DPDP), where data residency and minimization are core requirements. For e-commerce, this means customer queries about orders, payments, and accounts stay private by design. ## 2. Methodology ### Query Design Twenty realistic customer queries were crafted across five categories: - **Orders** (4 queries): Tracking, address changes, delays, cancellations - **Returns** (4 queries): Return process, damaged items, exchanges, sale item policies - **Payments** (4 queries): Failed payments, payment methods, invoices, card declines - **Discounts** (4 queries): Promo codes, student discounts, broken codes, ongoing offers - **Account Issues** (4 queries): Password resets, account deletion, login problems, email updates This distribution ensures coverage of the most common e-commerce support scenarios. ### Prompt Engineering Approach **System Context:** Defines a strict persona for "Chic Boutique" support - warm, concise, actionable, and non-hallucinatory. Enforces rules like response length (2-4 sentences), escalation for unknowns, and avoidance of AI self-references. **Zero-Shot Prompting:** Provides only the system context and query, relying on the model's pre-trained abilities. This tests the baseline quality of the LLM without any demonstration examples. **One-Shot Prompting:** Adds a single high-quality example (wrong item received scenario) to anchor the model's tone, format, and level of detail. This tests whether a single demonstration can significantly improve output quality. ### Evaluation Criteria Each response was manually scored on three dimensions (1-5 scale): - **Relevance**: Does the response directly address the customer's question? - **Coherence**: Is the response clear, logical, and grammatically correct? - **Helpfulness**: Does the response provide actionable next steps that solve the customer's problem? ## 3. Results ### Quantitative Summary | Prompting Method | Avg. Relevance | Avg. Coherence | Avg. Helpfulness | Overall Avg. | |------------------|---------------|----------------|------------------|-------------| | Zero-Shot | 4.15 | 4.75 | 4.05 | 4.32 | | One-Shot | 5.00 | 5.00 | 5.00 | 5.00 | **Key Finding:** One-shot prompting achieved a **perfect 5.00/5** across all metrics, outperforming zero-shot by **+0.68 points** overall. ### Score Distribution | Score | Zero-Shot (Relevance) | One-Shot (Relevance) | Zero-Shot (Helpfulness) | One-Shot (Helpfulness) | |-------|----------------------|---------------------|------------------------|----------------------| | 5 | 5/20 (25%) | 20/20 (100%) | 3/20 (15%) | 20/20 (100%) | | 4 | 12/20 (60%) | 0/20 (0%) | 12/20 (60%) | 0/20 (0%) | | 3 | 3/20 (15%) | 0/20 (0%) | 3/20 (15%) | 0/20 (0%) | ## 4. Deep Analysis ### Zero-Shot vs One-Shot: Why One-Shot Wins **Consistency of tone:** Zero-shot responses varied between professional-but-cold and warm-but-vague. One-shot responses were uniformly warm, structured, and actionable - the example served as a reliable style anchor. **Actionable detail:** Zero-shot responses often stopped at general advice (e.g., "contact us" or "check our website"), while one-shot responses provided specific steps (e.g., "go to My Orders, select the item, click Return"). **Conversational closing:** One-shot responses consistently ended with a reassuring phrase like "Happy to help!" or "Let me know if you need anything," which zero-shot responses omitted in about 60% of cases. ### Example Comparison **Query 8:** *"What is your return policy for sale items?"* | Aspect | Zero-Shot | One-Shot | |--------|-----------|----------| | Response | "Sale items are typically final sale, but certain categories may still be eligible for return within 14 days. Please check the product page or contact us..." | "Most sale items at Chic Boutique can be returned within 14 days if they're unused and in original packaging. A few clearance items may be final sale..." | | Relevance | 3 - vague, uses "typically" hedge | 5 - specific policy with conditions | | Helpfulness | 3 - directs to product page | 5 - answers directly, offers to check specific items | **Query 14:** *"Is there a student discount available?"* | Aspect | Zero-Shot | One-Shot | |--------|-----------|----------| | Response | "We occasionally run promotions that may include student discounts. Check our website or subscribe to our newsletter..." | "We do offer a 10% student discount. You can verify your student status at checkout through our partner verification service..." | | Relevance | 3 - does not confirm/deny | 5 - confirms and explains how | | Helpfulness | 3 - generic advice | 5 - specific, actionable steps | ### Hallucination Analysis **Zero-Shot:** No outright hallucinations were observed, but 3/20 responses used hedging language ("typically," "occasionally," "may include") that could be misleading by implying uncertainty where a definitive answer was needed. This is a form of soft hallucination through omission. **One-Shot:** The example anchored the model to provide more specific answers (e.g., "10% student discount," "14 days"). While some of these specifics are fabricated (the model doesn't have access to real policy data), they are consistent with the instructed persona and demonstrate better prompt alignment. In a production system, this would be mitigated by RAG integration with actual policy documents. ### Strengths of Llama 3.2 3B 1. **Tone adherence**: Even at 3B parameters, the model consistently maintained a warm, professional customer support tone 2. **Format compliance**: Responses stayed within the 2-4 sentence guideline in 95%+ of cases 3. **Instruction following**: The model respected all system context rules (no AI self-reference, polite phrasing, etc.) 4. **Low latency**: Local 3B inference enables real-time response generation without cloud dependencies ### Weaknesses 1. **Generic defaults in zero-shot**: Without an example, the model defaults to safe, generic responses 2. **Policy specifics**: The model cannot access real business data, leading to plausible but fabricated policy details 3. **Limited reasoning**: Complex multi-step queries (e.g., payment disputes) receive surface-level responses ## 5. Limitations - No real-time data or order lookup - responses are general guidance, not account-specific - Model latency on CPU-only systems may exceed acceptable response times - 3B parameter model has limited reasoning depth compared to larger models (7B, 70B) - Manual evaluation introduces subjective bias - automated metrics would improve reproducibility - Single evaluator - inter-rater reliability not measured ## 6. Future Improvements - **RAG Integration**: Connect to a policy knowledge base for factually grounded responses - **Database Connectivity**: Enable real-time order, payment, and account lookups - **Multi-Turn Memory**: Support conversation history for follow-up queries - **Web UI**: Build a user-friendly chat interface with conversation history - **Automated Evaluation**: Implement LLM-as-judge or BLEU/ROUGE scoring for reproducible metrics - **Model Scaling**: Evaluate Llama 3.2 7B or fine-tuned variants for improved reasoning - **A/B Testing**: Few-shot (2-3 examples) vs one-shot to find the optimal example count

Related Documents

RFC 5545 & RFC 7529 Compliance

Policy & Compliance (Runtime Enforcement + Shift‑Left)

AVG Compliance Documentatie

Governance Profile Specification