Self-Hosted AI Chatbot: Complete 2026 Guide
A practical guide to self-hosting an AI chatbot in 2026 — open-weight model selection, GPU infrastructure, vector database, observability, and the real cost vs SaaS. With architecture diagrams and decision criteria.
Who this is for
Teams with strict data residency, in-house ML capacity, or ultra-high volume where SaaS economics break. For everyone else, a managed platform like EzyConn is dramatically cheaper and faster.
The Core Stack
- Inference server. vLLM or TensorRT-LLM on GPU. Llama 4 70B, Mistral Large, Qwen 3, or DeepSeek V3 are top open-weight choices.
- Vector database. Qdrant, Weaviate, or self-hosted PostgreSQL + pgvector. For under 10M vectors, pgvector is fine.
- Embedding model. bge-large-en or stella-1.5b for English; multilingual-e5 for global.
- Orchestration layer. A thin API server handling retrieval, system prompts, tool calls, and conversation memory.
- Observability. Langfuse or self-hosted Prometheus + Grafana for token usage, latency, and quality metrics.
- Front-end. Web widget, Slack/Teams app, WhatsApp adapter — each is its own engineering scope.
Hardware Requirements (2026)
- • 8B model: single A100 40GB, ~$2.40/hr cloud, ~$1,700/mo dedicated
- • 70B model (Q4 quantized): single A100 80GB or 2× A100 40GB, ~$3,200/mo
- • 70B full precision: 4× A100 80GB or 2× H100 80GB, $6,500–$12,000/mo
- • 405B model: 8× H100 cluster, $25K+/mo (rarely worth it for chatbots)
Hidden Costs to Plan For
- • Engineering time for prompt tuning, regression testing, and model upgrades
- • On-call rotation when GPUs fail or models OOM
- • Compliance audits ($20K–$80K/yr for SOC 2)
- • Channel maintenance — Slack/WhatsApp APIs change frequently
- • Prompt injection / abuse defense built from scratch
Frequently Asked Questions
How much does it cost?
$1,200–$3,500/mo GPU + $400–$1,200/mo infra at 10K conversations. Engineering time is the bigger hidden cost.
Best open-weight models in 2026?
Llama 4, Mistral Large, Qwen 3, DeepSeek V3. The 70B class is the sweet spot for production chatbots.
Or skip the stack entirely
EzyConn delivers production-grade RAG, multi-channel deployment, and compliance for less than the GPU bill.
Start FreeLast updated . View more guides.