Emergency Hotline: Call 1-844-363-1423 (United We Dream Hotline)

Why Local LLMs?

Local deployment of Large Language Models ensures that sensitive immigration-related queries never leave your organization's control. This is essential because:

  • Immigration status information is highly sensitive
  • Cloud APIs create subpoena-able records
  • Third-party services can be discontinued or restricted
  • User trust requires demonstrable privacy protection

Open-Source Model Selection

Recommended Models for Legal Q&A

Model Family         Best Size    License          Legal Reasoning               Multilingual
Meta Llama 3.3       70B          Llama Community  Excellent (MMLU 86%)          Spanish: Excellent
Mistral/Mixtral      8x22B MoE    Apache 2.0       Very Good                     Spanish: Excellent
Microsoft Phi-4      14B          MIT              Good (punches above weight)   Spanish: Good
Alibaba Qwen 2.5/3   32B-72B      Apache 2.0       Excellent                     29+ languages
Google Gemma 3       27B          Gemma License    Good                          Good

Model Selection Criteria

For General KYR Inquiries (7B-14B sufficient):

  • Simple rights questions
  • Checkpoint procedures
  • Document checklists
  • Basic procedural information

For Complex Legal Interpretation (30B-70B recommended):

  • Conditional immigration statutes
  • Status-specific rights analysis
  • Multi-factor legal scenarios
  • Reduced hallucination risk
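A deployment can combine both tiers behind a simple router that escalates complex queries to the larger model. A minimal sketch in Python; the cue keywords and model tags below are illustrative assumptions, not part of any framework:

```python
# Route a query to a small or large model based on rough complexity cues.
# The keyword list and model tags are illustrative assumptions only.

COMPLEX_CUES = {"statute", "asylum", "conditional", "waiver", "appeal", "precedent"}

def pick_model(query: str) -> str:
    """Return a model tag: small for simple KYR questions, large for legal analysis."""
    words = set(query.lower().split())
    if words & COMPLEX_CUES or len(query.split()) > 40:
        return "llama3.3:70b"        # complex legal interpretation
    return "mistral:7b-instruct"     # general KYR inquiries

print(pick_model("What documents should I carry?"))
print(pick_model("Does a conditional waiver affect my asylum appeal?"))
```

In practice the routing signal would come from a classifier or the model itself, but even a keyword heuristic keeps cheap questions off the expensive model.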

GPU Requirements

The central constraint for local LLM deployment is Video RAM (VRAM). The entire model plus conversation context must fit in GPU memory.

Quantization Reduces VRAM Requirements

Quantization compresses model weights from 16-bit to 4-bit precision:

  • 72% VRAM reduction
  • Only 3-5% quality loss
  • Q4_K_M format recommended for balanced quality/performance
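The arithmetic behind these figures is straightforward: weight memory is parameters × bits ÷ 8, so dropping from 16-bit to 4-bit cuts pure weight storage by 75%; the ~72% figure observed in practice reflects layers kept at higher precision plus runtime overhead. A quick sketch:

```python
def weight_vram_gb(params_b: float, bits: int) -> float:
    """Approximate VRAM for model weights alone (no KV cache or runtime overhead)."""
    bytes_total = params_b * 1e9 * bits / 8
    return bytes_total / 1e9  # decimal GB

fp16 = weight_vram_gb(7, 16)  # 7B model at 16-bit
q4   = weight_vram_gb(7, 4)   # same model at 4-bit
print(f"FP16: {fp16:.1f} GB, Q4: {q4:.1f} GB, reduction: {1 - q4/fp16:.0%}")
# → FP16: 14.0 GB, Q4: 3.5 GB, reduction: 75%
```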

Hardware Specifications by Tier

Tier          Model Size    VRAM Needed (4-bit)   Hardware          Cost       Speed
Entry-Level   7B-8B         6-8 GB                RTX 4060          ~$350      40+ tok/s
Mid-Range     13B-32B       16-24 GB              RTX 4090          ~$1,600    20-30 tok/s
High-End      70B-72B       40-48 GB              2x RTX 4090       ~$3,200    15-25 tok/s
Enterprise    70B+ (FP16)   140-160 GB            2x A100 (80GB)    ~$25,000   50+ tok/s

Apple Silicon Alternative

Apple M2/M3 Max with 96GB+ unified memory can run 70B quantized models without multi-GPU complexity:

  • Single workstation deployment
  • Lower raw throughput than dedicated GPUs
  • Excellent for development and small-scale production
  • Cost: ~$4,000-5,000

Deployment Frameworks

Ollama (Development/Testing)

Best for: Prototyping, local testing, single-user deployments

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull mistral:7b-instruct-q4_K_M

# Run with API
ollama serve

Limitations:

  • Queues requests (no concurrent batching)
  • Struggles under multi-user load
  • Not production-ready for high traffic
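Once `ollama serve` is running, it exposes an HTTP API on port 11434 (`/api/generate` for single-turn completions). A minimal client sketch using only the standard library; the network call itself is commented out so the snippet stands alone without a running server:

```python
import json
from urllib import request

def build_ollama_request(model: str, prompt: str) -> request.Request:
    """Build a POST to Ollama's /api/generate endpoint (default port 11434)."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_ollama_request("mistral:7b-instruct-q4_K_M",
                           "What are my rights at a checkpoint?")
# with request.urlopen(req) as resp:          # requires ollama serve running
#     print(json.loads(resp.read())["response"])
print(req.full_url)
```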

vLLM (Production)

Best for: Production environments with concurrent users

# Install vLLM
pip install vllm

# Launch server
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --max-model-len 4096
# Add --quantization awq only when serving an AWQ-quantized checkpoint;
# the FP16 model above loads without a quantization flag

Advantages:

  • PagedAttention for efficient KV cache management
  • Massive continuous batching
  • OpenAI-compatible API

Requirements:

  • Models must fully load into VRAM
  • More complex configuration

llama.cpp (Resource-Constrained)

Best for: Legacy hardware, Apple Silicon, CPU-only environments

# Build (current versions use CMake; the old Makefile build is deprecated)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Run server (the binary is named llama-server in current builds)
./build/bin/llama-server -m models/mistral-7b-q4_K_M.gguf -c 4096

Advantages:

  • Runs on CPU with acceptable speed for small models
  • Native Apple Metal support
  • GGUF format widely supported

Text Generation Inference (TGI)

Best for: Enterprise deployments with strict observability requirements

# Docker deployment
docker run --gpus all \
  -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3

Advantages:

  • Enterprise-grade telemetry
  • Containerized, scalable
  • HuggingFace ecosystem integration

Network Isolation for Privacy

Air-Gapped Deployment

To ensure zero data leakage, configure network isolation:

# docker-compose.yml
services:
  llm-server:
    image: vllm/vllm-openai:latest
    networks:
      - internal_only
    # No ports exposed to external network

  chatbot-ui:
    image: chatbot-frontend:latest
    networks:
      - internal_only
      - web
    ports:
      - "443:443"  # Only UI exposed

networks:
  internal_only:
    internal: true  # No external internet access
  web:
    driver: bridge

Firewall Rules

# Allow the internal network first (iptables matches rules in order)
iptables -A OUTPUT -m owner --uid-owner llm-user -d 10.0.0.0/8 -j ACCEPT
# Then drop all other outbound traffic from the LLM service user
iptables -A OUTPUT -m owner --uid-owner llm-user -j DROP

Zero-Retention Logging

Memory-Only Processing

Configure the inference server to:

  1. Never write prompts to disk
  2. Never log conversation content
  3. Clear context on session close

# vLLM configuration (AsyncEngineArgs carries the request-logging switch)
from vllm import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    disable_log_requests=True,  # no request logging
    disable_log_stats=True,     # no usage statistics
)

Secure Session Management

# Session cleanup on shutdown (secure_delete and session_cache are application-defined)
@app.on_event("shutdown")
async def cleanup():
    # Overwrite and release in-memory conversation state
    secure_delete(session_cache)
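The `secure_delete` helper above is application-defined. A minimal memory-only session store might look like the following sketch; note that Python cannot guarantee that overwritten strings leave no copies elsewhere in memory, so this is best-effort hygiene, not a hardened implementation:

```python
class MemorySessionStore:
    """Holds conversation turns in RAM only; nothing is ever written to disk."""

    def __init__(self) -> None:
        self._sessions: dict[str, list[str]] = {}

    def append(self, session_id: str, turn: str) -> None:
        self._sessions.setdefault(session_id, []).append(turn)

    def close(self, session_id: str) -> None:
        """Best-effort clear on disconnect: overwrite our references, then drop them."""
        turns = self._sessions.pop(session_id, [])
        for i in range(len(turns)):
            turns[i] = "\x00" * len(turns[i])  # replace the references we hold
        turns.clear()

store = MemorySessionStore()
store.append("abc", "What are my rights?")
store.close("abc")
print(len(store._sessions))  # 0 — the store retains nothing after close
```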

Model Download and Verification

Downloading Models Safely

# Use HuggingFace CLI
huggingface-cli download \
  mistralai/Mistral-7B-Instruct-v0.3 \
  --local-dir ./models/mistral-7b \
  --local-dir-use-symlinks False

# Verify checksums against a trusted list (run where the paths in checksums.txt resolve)
sha256sum -c checksums.txt

Converting to GGUF (for llama.cpp)

# Convert to GGUF (conversion outputs 16-bit), then quantize to Q4_K_M in a second step
python convert_hf_to_gguf.py ./models/mistral-7b --outfile mistral-7b-f16.gguf
./llama-quantize mistral-7b-f16.gguf mistral-7b-q4_K_M.gguf Q4_K_M

Performance Optimization

Batch Size Tuning

For concurrent users, optimize batch size based on VRAM:

Users   Batch Size   VRAM Overhead
1-5     4            Minimal
5-20    16           ~2GB additional
20+     32+          Consider multi-GPU

KV Cache Management

vLLM's PagedAttention automatically manages KV cache, but for llama.cpp:

# Allocate KV cache for 4096 context
./server -m model.gguf -c 4096 --n-gpu-layers 35
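KV cache growth can be estimated from the model shape: 2 tensors (K and V) × layers × KV heads × head dimension × bytes per value × tokens. The shape figures below (32 layers, 8 KV heads via grouped-query attention, head dimension 128, 16-bit cache) are assumed here as a Mistral-7B-like illustration:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per: int = 2) -> float:
    """Approximate KV cache size: K and V tensors per layer, per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per / 1e9

# Mistral-7B-like shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 cache
print(f"{kv_cache_gb(32, 8, 128, 4096):.2f} GB per 4096-token sequence")
# → 0.54 GB per 4096-token sequence
```

This is why concurrent users add roughly linear VRAM overhead: each active sequence carries its own cache.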

Cost Comparison: CapEx vs OpEx

On-Premises (Capital Expenditure)

Item                              One-Time Cost
2x RTX 4090                       $3,200
Workstation (CPU, RAM, storage)   $2,000
Setup and configuration           Staff time
Total                             ~$5,200

Cloud GPU Rental (Operating Expenditure)

Instance      Hourly Cost   Monthly (24/7)
A100 (80GB)   ~$3.50/hr     ~$2,520/mo
A10G (24GB)   ~$1.00/hr     ~$720/mo

Break-even: On-premises hardware pays for itself in 2-3 months vs A100 cloud rental.
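The break-even claim follows from simple division over the figures in the tables above:

```python
capex = 3200 + 2000              # 2x RTX 4090 + workstation, from the CapEx table
a100_monthly = 3.50 * 24 * 30    # ~$2,520/mo for 24/7 A100 rental
months = capex / a100_monthly
print(f"Break-even vs A100 rental: {months:.1f} months")
# → Break-even vs A100 rental: 2.1 months
```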


Recommended Starting Configuration

For most legal aid organizations:

Hardware

  • GPU: Single RTX 4090 (24GB VRAM)
  • CPU: 16+ cores
  • RAM: 64GB
  • Storage: 1TB NVMe SSD

Software

  • OS: Ubuntu 22.04 LTS
  • Framework: vLLM (production) or Ollama (development)
  • Model: Mistral 7B Instruct Q4_K_M (start), upgrade to Llama 3.3 70B as needed

Model Selection

  • Primary: Mistral 7B Instruct (fast, efficient, Apache 2.0)
  • Upgrade path: Qwen 2.5 32B (better multilingual) → Llama 3.3 70B (best reasoning)

Next Steps

  1. Set up RAG pipeline - Connect to Know Your Rights content
  2. Implement safety guardrails - Required before any deployment
  3. Configure privacy architecture - Zero-retention, air-gapped operation

Legal Disclaimer

This website does not provide legal advice. The information provided on this site is for general informational and educational purposes only. It does not create an attorney-client relationship.

Information on this website may not be current or accurate. Immigration law is complex and varies by jurisdiction and individual circumstances. Always consult with a qualified immigration attorney for advice specific to your situation.

Neither ICE Encounter, its developers, partners, nor any contributors shall be liable for any actions taken or not taken based on information from this site. Use of this site is subject to our Terms of Use and Privacy Policy.