Every time you type a prompt into an AI assistant, you are making a decision about your data. That message travels somewhere, gets processed by something, and leaves traces along the way. Understanding exactly where your data goes — and who can access it — is no longer optional for anyone serious about operational security.

What Data Can AI Assistants Actually Access?

Modern AI assistants are not limited to reading your chat messages. Depending on how they are integrated, they can potentially access:

  • Messages and conversation history — every prompt, every response, every follow-up.
  • Files and documents — PDFs, spreadsheets, code repositories uploaded for analysis.
  • Browsing activity — browser extensions and integrated assistants can observe page content, URLs, and search queries.
  • Calendar and email — assistants with OAuth scopes can read meeting invites, contact lists, and full email threads.
  • System context — clipboard contents, active window titles, local file paths exposed in error messages.

The attack surface grows with every integration you enable. An AI assistant connected to your email, calendar, and file system has a remarkably complete picture of your professional and personal life. The question is not whether it can access this data — it is whether that data leaves your machine.

Cloud vs Local: Where Your Data Actually Lives

Cloud-Hosted Models

When you use ChatGPT, Claude via the web, or Google Gemini, your prompts are transmitted over HTTPS to remote servers. The provider processes your input, generates a response, and — critically — may retain your data for varying periods.

Each major provider handles retention differently:

  • OpenAI — API inputs are retained for up to 30 days for abuse monitoring, then deleted, with zero-retention options available for eligible API use. Consumer ChatGPT conversations may be used for training unless you opt out.
  • Anthropic — API requests are not used for training by default. Data is retained temporarily for safety monitoring (typically 30 days), with options for zero-retention enterprise agreements.
  • Google (Gemini) — Consumer interactions may be reviewed by humans. API usage through Vertex AI offers stronger enterprise-grade data handling, but the consumer product has broader retention.

The pattern is consistent: consumer tiers offer weaker guarantees than API and enterprise tiers. If you care about privacy, the consumer chat interface is almost always the wrong choice.

Local and Self-Hosted Models

Running a model on your own hardware changes the equation entirely. Your prompts never leave your network. There is no third-party retention policy because there is no third party.

Tools like Ollama and llama.cpp make local inference accessible even on consumer hardware. A quantized 7B or 13B parameter model can run on a modern laptop with 16GB of RAM. Larger models (70B+) require more serious hardware, but the privacy benefit is absolute: zero data leaves your machine.
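
To make this concrete, here is a minimal sketch of querying a locally running Ollama server over its default REST endpoint on port 11434. It assumes Ollama is installed and a model (llama3 here, purely as an example) has already been pulled; nothing in this exchange leaves localhost.

```python
import requests  # pip install requests

# Ollama's default local endpoint; the request never leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"

def ask_local_model(prompt: str, model: str = "llama3") -> str:
    """Send a prompt to a locally running Ollama instance and return the reply."""
    payload = {
        "model": model,    # assumes the model was already pulled (`ollama pull llama3`)
        "prompt": prompt,
        "stream": False,   # request a single JSON response instead of a token stream
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(ask_local_model("Summarize why local inference keeps prompts private."))
```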

Platforms like OpenClaw take this further by running the entire AI agent on your local machine. The agent interacts with your files, tools, and services, but the data stays within your infrastructure. Compare this to cloud-hosted alternatives where every file you upload, every command you run, and every piece of context is processed on someone else’s server. The architectural difference is fundamental: local-first means your data is yours by default, not by policy.

GDPR and Regulatory Considerations

Under GDPR, any personal data processed by an AI system falls under the regulation. This includes names, email addresses, and any identifiable information present in prompts. If you paste a customer email into ChatGPT, you have potentially transferred personal data to a third-party processor without a proper Data Processing Agreement (DPA).

Key considerations:

  • Lawful basis — Do you have a legal basis to send personal data to an AI provider?
  • Data Processing Agreements — Enterprise API tiers typically include DPAs. Consumer tiers often do not.
  • Right to erasure — If a user requests deletion under Article 17, can you guarantee their data is purged from model training sets?
  • Cross-border transfers — Most major AI providers process data in the US. Schrems II implications apply for EU-originated data.

For regulated industries (healthcare, finance, legal), self-hosted models may be the only compliant option.
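
Regardless of which tier you use, it pays to scrub obvious identifiers before a prompt ever leaves your network. The sketch below is a deliberately minimal illustration using two regular expressions; the patterns are assumptions for demonstration and are no substitute for a DPA or a proper PII-detection pipeline.

```python
import re

# Illustrative patterns only; real PII detection needs far more than two regexes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace obvious personal identifiers before the prompt leaves your network."""
    text = EMAIL_RE.sub("[EMAIL REDACTED]", text)
    text = PHONE_RE.sub("[PHONE REDACTED]", text)
    return text

customer_email = "From: jane.doe@example.com\nCall me on +44 20 7946 0958 about the refund."
print(redact(customer_email))
# From: [EMAIL REDACTED]
# Call me on [PHONE REDACTED] about the refund.
```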

Encrypting Data at Rest

Even with local models, data security does not end at inference. Conversation logs, vector databases, embeddings, and cached context are stored on disk. If your machine is compromised, unencrypted AI data is trivially accessible.

Encrypt everything:

  • Use LUKS (Linux) or FileVault (macOS) for full-disk encryption.
  • Store vector databases and conversation logs on encrypted volumes.
  • If using SQLite or similar for context storage, ensure the database file sits on an encrypted filesystem.
  • Rotate encryption keys on a schedule consistent with your threat model.
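
Full-disk encryption is the baseline. If you want an additional application-level layer, the sketch below encrypts conversation log entries before they touch disk using the cryptography package's Fernet recipe; the package, key handling, and file layout are assumptions for illustration, not something any of the tools above require.

```python
from pathlib import Path

from cryptography.fernet import Fernet  # pip install cryptography

# In practice, load the key from an OS keyring or secrets manager, never from source.
key = Fernet.generate_key()
fernet = Fernet(key)

log_path = Path("conversation.log.enc")

def append_encrypted(entry: str) -> None:
    """Encrypt a log entry and append it to disk as a single line of ciphertext."""
    token = fernet.encrypt(entry.encode("utf-8"))
    with log_path.open("ab") as f:
        f.write(token + b"\n")

def read_decrypted() -> list[str]:
    """Decrypt every stored entry; fails loudly if the key is wrong or data was tampered with."""
    lines = log_path.read_bytes().splitlines()
    return [fernet.decrypt(token).decode("utf-8") for token in lines]

append_encrypted("user: summarize the Q3 financials")
append_encrypted("assistant: here is the summary ...")
print(read_decrypted())
```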

Network-Level Controls

Even well-intentioned software can phone home. Prevent unintended data exfiltration with network controls:

  • Run local AI tools in a network namespace or behind a firewall rule that blocks outbound connections.
  • Use DNS sinkholing to catch unexpected telemetry domains.
  • Monitor outbound traffic with tools like tcpdump or Wireshark during initial setup to verify no data leaves; a programmatic spot check is sketched after this list.
  • For air-gapped environments, download model weights and dependencies ahead of time, then disconnect.
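
As a complement to packet capture, a quick programmatic check can list the established outbound sockets of your local AI processes. The sketch below uses psutil (assumed installed); the watched process names are illustrative, and enumerating system-wide connections may require elevated privileges on some platforms.

```python
import psutil  # pip install psutil

# Process names are illustrative; adjust to whatever local AI tooling you actually run.
WATCHED = {"ollama", "llama-server"}

def watched_pids() -> dict[int, str]:
    """Map PIDs to names for the local AI processes we care about."""
    pids = {}
    for proc in psutil.process_iter(["pid", "name"]):
        name = (proc.info["name"] or "").lower()
        if name in WATCHED:
            pids[proc.info["pid"]] = name
    return pids

def outbound_connections():
    """Yield (process name, remote address) for established outbound sockets of watched processes."""
    pids = watched_pids()
    for conn in psutil.net_connections(kind="inet"):
        if conn.pid in pids and conn.raddr and conn.status == psutil.CONN_ESTABLISHED:
            yield pids[conn.pid], f"{conn.raddr.ip}:{conn.raddr.port}"

for name, remote in outbound_connections():
    print(f"{name} -> {remote}")  # anything beyond localhost deserves an explanation
```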

Memory and Context Management

AI assistants increasingly support persistent memory — storing facts, preferences, and conversation summaries across sessions. This is convenient but introduces a new data store that needs protection.

Understand where context lives:

  • Cloud providers store memory server-side. OpenAI’s memory feature persists data on their infrastructure.
  • Local agents using RAG (Retrieval-Augmented Generation) store embeddings in local vector databases like ChromaDB or FAISS.
  • Session context may be cached in temporary files, logs, or process memory that survives longer than expected.

Audit your context stores regularly. Delete what you do not need. Treat persistent AI memory with the same discipline you apply to password managers — it contains a compressed representation of your digital life.
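
What a periodic audit might look like depends on your stack. The sketch below assumes a local ChromaDB store in which the agent records a stored_at timestamp in each entry's metadata (the path, collection name, and field name are all assumptions about your setup), and purges anything older than thirty days.

```python
import time

import chromadb  # pip install chromadb

# Path, collection name, and the `stored_at` metadata field are assumptions about your setup.
client = chromadb.PersistentClient(path="./agent_memory")
memory = client.get_or_create_collection("conversation_memory")

MAX_AGE_SECONDS = 30 * 24 * 3600  # purge anything older than ~30 days

def purge_stale_memories() -> int:
    """Delete stored embeddings whose metadata marks them older than the retention window."""
    records = memory.get(include=["metadatas"])  # ids are always returned
    cutoff = time.time() - MAX_AGE_SECONDS
    stale = [
        record_id
        for record_id, meta in zip(records["ids"], records["metadatas"])
        if meta and meta.get("stored_at", 0) < cutoff
    ]
    if stale:
        memory.delete(ids=stale)
    return len(stale)

print(f"Entries before audit: {memory.count()}")
print(f"Purged {purge_stale_memories()} stale entries")
print(f"Entries after audit: {memory.count()}")
```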

Practical Privacy Checklist

Use this as a starting point for hardening your AI workflow:

  • Use API tiers over consumer chat interfaces for any sensitive work.
  • Read the data retention policy for every AI service you use. Verify zero-retention options.
  • Run local models (Ollama, llama.cpp) for tasks involving personal or regulated data.
  • Choose local-first platforms like OpenClaw where the agent runs on your hardware.
  • Enable full-disk encryption on any machine running AI workloads.
  • Firewall local AI tools — block outbound connections unless explicitly needed.
  • Audit persistent memory — review and purge stored context regularly.
  • Never paste PII into consumer AI interfaces without a DPA in place.
  • Strip metadata from files before uploading to any AI service (EXIF data, document properties, hidden sheets); see the sketch after this list.
  • Log and monitor — know what data your AI tools are sending and receiving.
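
For the metadata item above, images are the easiest case to automate. The sketch below uses Pillow (assumed installed) to re-save an image from its pixel data alone, which drops EXIF and other embedded metadata; office documents and spreadsheets need their own tooling.

```python
from PIL import Image  # pip install Pillow

def strip_exif(src: str, dst: str) -> None:
    """Re-save an image with pixel data only, dropping EXIF and other embedded metadata."""
    with Image.open(src) as img:
        clean = Image.new(img.mode, img.size)
        clean.putdata(list(img.getdata()))
        clean.save(dst)  # no `exif=` argument passed, so nothing is carried over

# Paths are illustrative.
strip_exif("vacation_photo.jpg", "vacation_photo_clean.jpg")
```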

The Bottom Line

AI assistants are powerful, but they are also data pipelines. Every prompt is a data transfer decision. The difference between a privacy-respecting AI setup and a liability comes down to architecture: where the model runs, where the data lives, and who controls the retention policy.

Local-first is not a luxury — it is a security posture. Choose your tools accordingly.