OpenAI’s Privacy Filter: the local model making ChatGPT GDPR-friendly

Artificial Intelligence
Nicolas
12 min read
On April 22, 2026, OpenAI released Privacy Filter under the Apache 2.0 license on GitHub and Hugging Face.

The model has 1.5 billion total parameters, 50 million active, and masks eight categories of personal data before they leave a local machine.

For French DPOs and CTOs integrating ChatGPT, Claude, or Gemini into an internal pipeline, the arrival of Privacy Filter raises four practical questions: what does it change in the post-Schrems II GDPR decision-making process? How does it fit into an existing RAG pipeline? How does it compare to Microsoft Presidio and OpenMed? And where are the limits no one talks about?

This article answers these four questions without the hype.

In brief

  • Privacy Filter masks eight categories of PII locally: 1.5 billion total parameters, 50 million active, F1 96% on PII-Masking-300k, Apache 2.0 license
  • The tool helps comply with Article 5 GDPR: it minimizes data sent to the cloud without replacing DPA Article 46 or DPIA Article 35
  • No native support for French formats: NSS, SIRET, and IBAN FR are not recognized without targeted fine-tuning
  • Mosaic effect and side-channel persist: even with masked payload, logs, telemetry, and quasi-identifiers leak
  • Test on a business sample before any production rollout remains the only honest qualification method

The GDPR issue Privacy Filter claims to solve

Since the Schrems II ruling by the CJEU in 2020, any transfer of personal data to a US subcontractor remains exposed to FISA 702, which allows federal agencies to request access to data processed by a provider subject to US law.

The EU-US Data Privacy Framework, signed in 2023, has tentatively restored a transfer framework, but case law remains vigilant about the technical gaps that persist.

In January 2026, the CNIL fined Free 27 million euros for failing to minimize collected data.

The message is clear: a data controller sending names, addresses, account numbers, or identifiers in plain text to an OpenAI, Anthropic, or Google endpoint faces dual contractual and legal risks.

Privacy Filter addresses this specific point: masking sensitive data locally before they leave the user’s machine or browser.

The analogy is simple: Privacy Filter acts like automatic black marker redaction before sending a file to a foreign consulting firm.

This redaction falls under the Article 5 GDPR minimization principle, and it amounts to pseudonymization under Article 4(5), not a promise of total anonymization.

The semantic gap is real: pseudonymization remains reversible, anonymization does not, and the EDPB 01/2025 guidelines emphasize this distinction when a quasi-identifier remains in the filtered text.

What Privacy Filter does technically

Privacy Filter is a bidirectional token classifier, derived from the gpt-oss architecture but stripped of its autoregressive generation head.

The model has 1.5 billion total parameters, with only 50 million active per token, thanks to a mixture of experts (MoE) of 128 experts with top-4 routing.

A helpful image: a switchboard with 128 specialized operators, only four of whom pick up on any given call, which keeps compute costs low.

The architecture comprises 8 transformer blocks, grouped-query attention with 14 query heads and 2 KV heads, RoPE positional encoding, and a 128,000-token context window.

The output is sequence tagging: for each token, the model predicts one label among 33 BIOES classes covering eight categories of PII.

The eight categories are private_person, private_address, private_email, private_phone, private_url, private_date, account_number, and secret.
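The arithmetic checks out: 8 categories × 4 positional prefixes (B, I, E, S) plus the O tag gives 33 classes. Decoding those per-token labels into character spans is a standard routine; here is a minimal sketch, where the exact tag spelling ("B-private_email", etc.) is an assumption, not something from the model card:

```python
# Minimal BIOES decoder: 8 PII categories x {B, I, E, S} + "O" = 33 labels.
# Tag spelling ("B-private_person", etc.) is assumed, not from the model card.

def bioes_to_spans(tags):
    """Turn per-token BIOES tags into (start, end, category) spans, end exclusive."""
    spans, start, cat = [], None, None
    for i, tag in enumerate(tags):
        if tag == "O":                                 # outside any entity
            start = cat = None
        elif tag.startswith("S-"):                     # single-token entity
            spans.append((i, i + 1, tag[2:]))
            start = cat = None
        elif tag.startswith("B-"):                     # entity begins
            start, cat = i, tag[2:]
        elif tag.startswith("E-") and cat == tag[2:]:  # entity ends
            spans.append((start, i + 1, cat))
            start = cat = None
        # "I-" tags simply continue the current entity
    return spans

tags = ["O", "B-private_person", "E-private_person", "O", "S-private_email"]
print(bioes_to_spans(tags))  # [(1, 3, 'private_person'), (4, 5, 'private_email')]
```

The BIOES scheme is what allows the classifier to mark entity boundaries without a generation head: the masking step only needs these spans and offsets.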

The model achieves an F1 of 96% on the PII-Masking-300k benchmark, and 97.43% on the corrected version of annotation artifacts, with a throughput of over 1,500 tokens per second on consumer GPUs.

The license is Apache 2.0, with no non-commercialization clause, and the public repositories are github.com/openai/privacy-filter and huggingface.co/openai/privacy-filter.

The model is compact enough to remain inspectable on a workstation, and it’s precisely this inspectability that distinguishes a minimization tool from a compliance black box.

Integrating Privacy Filter into a RAG pipeline

Integration follows the logic of a classic secure RAG pipeline, with one detail: Privacy Filter is inserted upstream of embeddings and any outgoing call.

The operational image: passing each document through a metal detector before slipping it into the cloud diplomatic pouch.

Target architecture

The recommended flow has five stages: local filtering of the raw text by Privacy Filter, embeddings, vector store, reranker, and only then the cloud LLM call.

This sequence ensures that the redacted payload is what indexes the corpus, not the original.

On the generation output, a second pass of Privacy Filter remains useful if the cloud model reformulates segments of a document with their source identifiers.

Python code to connect LangChain or LlamaIndex

The transformers API loads openai/privacy-filter with AutoModelForTokenClassification, exposes it as a token-classification pipeline, and returns the detected spans with their character offsets.

A redact_pii(text) function replaces each span with a typed placeholder such as [PERSON_1], [EMAIL_2], or [ACCOUNT_3], and keeps the mapping in memory for de-pseudonymization of the response.

This function plugs into a LangChain LLMChain or a LlamaIndex QueryEngine without modifying the rest of the chain.
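A minimal sketch of that pair of functions, assuming `detector` is the Hugging Face token-classification pipeline loaded from the model (typically with `aggregation_strategy="simple"` so sub-tokens are merged into spans); the label names in the placeholders are whatever the model emits:

```python
# Sketch: local PII masking before any cloud LLM call. `detector` is any
# callable returning HF-style span dicts with "entity_group", "start", "end",
# e.g. pipeline("token-classification", model="openai/privacy-filter",
#               aggregation_strategy="simple")

def redact_pii(text: str, detector):
    """Replace detected spans with typed placeholders; return (masked, mapping)."""
    spans = sorted(detector(text), key=lambda s: s["start"])
    mapping, counters = {}, {}
    out, cursor = [], 0
    for span in spans:
        label = span["entity_group"].upper()
        counters[label] = counters.get(label, 0) + 1
        placeholder = f"[{label}_{counters[label]}]"
        mapping[placeholder] = text[span["start"]:span["end"]]
        out.append(text[cursor:span["start"]])   # untouched text before the span
        out.append(placeholder)
        cursor = span["end"]
    out.append(text[cursor:])
    return "".join(out), mapping

def restore_pii(text: str, mapping: dict) -> str:
    """Undo the pseudonymization on the cloud model's answer."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

Wrapping these two calls around a LangChain LLMChain or a LlamaIndex QueryEngine is a plain function composition: mask, call, restore.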

Added latency

On a consumer GPU in FP16, the overhead measured by OpenAI is around 50 to 100 milliseconds for a short request and stays under a second for multi-page documents.

On CPU FP32 without a dedicated accelerator, expect 0.5 to 2 seconds depending on payload length, which remains acceptable for back-office workloads but debatable for real-time chatbots.

An INT8 quantized variant via bitsandbytes or onnxruntime brings CPU latency under 500 ms at the cost of a slight recall drop.
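Whatever the published figures, the only latency number that matters is the one measured on your own payloads and hardware. A minimal harness for that measurement (the identity stub stands in for the real filter call):

```python
import statistics
import time

def redact_stub(text: str) -> str:
    # Stand-in for the real Privacy Filter call; swap in your pipeline here.
    return text

def latency_profile(samples, redact, runs=20):
    """Return (median_ms, p95_ms) for the redaction step over the samples."""
    timings = []
    for _ in range(runs):
        for text in samples:
            t0 = time.perf_counter()
            redact(text)
            timings.append((time.perf_counter() - t0) * 1000.0)
    timings.sort()
    p95_index = max(0, round(0.95 * len(timings)) - 1)
    return statistics.median(timings), timings[p95_index]
```

Run it on a sample of real business documents, not synthetic strings: payload length drives the FP16/FP32/INT8 trade-off described above.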

Quick bench vs Microsoft Presidio and OpenMed

French blogs often compare Privacy Filter to Microsoft Presidio as if they do the same job; this is incorrect.

The useful analogy: an Apache 2.0 Swiss army knife versus a modular toolbox; the first works out of the box, the second gets configured for French screws.

What Presidio does

Presidio is an open-source framework maintained by Microsoft, centered around an analyzer and an anonymizer.

The analyzer combines a NER model (spaCy), regular expressions by recognizer, business rules, and checksums (credit cards, generic IBAN).

The anonymizer then applies substitution operators, hash, or Faker pseudonymization.

The key advantage is extensibility: adding a French NSS or SIRET recognizer is feasible, provided you write the regex and supply the validation list.
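To give a concrete sense of what such a recognizer involves, here is a simplified sketch of the pattern-plus-checksum logic for the French NSS, using only the standard library; the regex is deliberately loose and ignores the Corsican 2A/2B department encoding that real NSS handling must cover:

```python
import re

# Simplified French NSS shape: sex digit (1 or 2), 2-digit year, 2-digit month,
# 8 digits (department / commune / order), then a 2-digit control key.
# ASSUMPTION: numeric departments only; real data includes 2A/2B for Corsica.
NSS_RE = re.compile(r"^[12]\d{2}(0[1-9]|1[0-2])\d{8}\d{2}$")

def nss_key(first13: str) -> int:
    """Control key of the 13-digit NSS body: 97 minus the body modulo 97."""
    return 97 - int(first13) % 97

def is_valid_nss(nss: str) -> bool:
    """Pattern match plus checksum validation."""
    return bool(NSS_RE.match(nss)) and int(nss[13:]) == nss_key(nss[:13])
```

In Presidio terms, the regex becomes a Pattern inside a PatternRecognizer and the checksum typically lives in its validation hook; the same two artifacts are what a fine-tuning dataset for Privacy Filter would have to encode implicitly.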

Compared figures

On the same generic masking task, Privacy Filter shows 96% F1 without fine-tuning, compared with 60% to 85% for Presidio depending on the corpus and NER configuration.

On English clinical data, recent studies measure Presidio at 0.51 precision and 0.74 recall in default configuration, and up to 0.85 F1 after adding business recognizers.

OpenMed, a community medical model focused on French, claims 97.97% F1 with 55 native entities including NSS, IBAN FR, and tax code.

When to prefer each

Privacy Filter excels on generalist English corpora, long documents, secrets, and API credentials.

Presidio remains the default choice when the team wants to write its own recognizers and maintain a familiar Python dependency.

OpenMed takes the lead when the corpus is French and entities like NSS, SIRET, VAT numbers, or administrative identifiers dominate.

The simple rule: Privacy Filter to minimize a global flow, Presidio to custom-fit a case, OpenMed for native French formats; all three can coexist in the same pipeline.

Honest limits: what an LLM-based PII detector misses

OpenAI explicitly acknowledges in the model card: Privacy Filter is a minimization tool, not a guarantee of anonymization under the GDPR.

Contextual multi-page false negatives

The model sees a window of 128,000 tokens, and the masking decision remains local to a span: an indirect identifier spread over several pages can survive, especially when implicit reference replaces the proper name.

The arXiv paper 2510.07551 documents this type of false negative on hearing transcripts where the subject is named on page 1, then referred to only by a contextual pronoun for the next 50 pages.

Proprietary identifiers outside training

Without fine-tuning, Privacy Filter does not recognize French NSS, SIRET, APE codes, IBAN FR, or custom internal identifiers (HR numbers, sector-specific file numbers).

These formats escape the eight basic categories and require fine-tuning on 200 to 500 labeled examples to reach an acceptable F1.

Mosaic effect and quasi-identifiers

Masking Jean Dupont’s name but leaving his birth date, postal code, and employer allows reidentification in five seconds via LinkedIn and INSEE.

This mosaic effect is the main blind spot of PII detectors: in its 01/2025 guidelines on pseudonymization, the EDPB recalls that the data remain merely pseudonymized under Article 4(5) GDPR as long as a quasi-identifier remains.
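A toy illustration of the mechanism, on an invented three-person population: the name is masked, yet the three quasi-identifiers left in the text still single out exactly one record.

```python
# Invented population; in practice, LinkedIn and INSEE play this role.
population = [
    {"name": "Jean Dupont", "birth": "1985-03-12", "cp": "69003", "employer": "Acme"},
    {"name": "Marie Rolland", "birth": "1985-03-12", "cp": "75011", "employer": "Acme"},
    {"name": "Paul Martin", "birth": "1990-07-01", "cp": "69003", "employer": "Beta"},
]

def reidentify(quasi_ids, population):
    """Return every person matching the quasi-identifiers left in a masked record."""
    return [p for p in population
            if all(p[k] == v for k, v in quasi_ids.items())]

# The "masked" document kept birth date, postal code, and employer.
matches = reidentify({"birth": "1985-03-12", "cp": "69003", "employer": "Acme"},
                     population)
# A single match: the masked name is trivially recoverable.
```

The defense is not a better detector but fewer surviving quasi-identifiers: generalizing dates, truncating postal codes, dropping employers when the downstream task allows it.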

Side-channel risks

The masked payload may be clean, but application logs, unpurged memory, and product telemetry can leak PII in parallel.

The Cribl 2026 analysis on telemetry pipelines shows that 40% of documented GDPR leaks in B2B SaaS occur through logs and APM traces, not the main payload.

Striking image: redacting the letter but leaving the draft in the hallway printer.

TCO cost and operational trade-offs

The cloud API is simpler to account for than local infrastructure, and the price gap changes the equation at a certain volume.

On the API side, Gemini 1.5 Flash charges about $3.13 per million input tokens and GPT-4 Turbo around $20 per million.

Locally, Privacy Filter runs on an RTX 4060 at 1,500 tokens per second, or about 5.4 million tokens per hour on a machine amortized in less than a year.

The TCO tipping point is between month 6 and month 12 depending on the volume processed, cloud inflation, and local infrastructure operation costs (energy, supervision, updates).
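A back-of-the-envelope sketch of that tipping point, following the article's framing of cloud token price versus local inference: the API rate and the 50-million-token volume come from the figures above, while the hardware and operating costs are placeholder assumptions to replace with your own.

```python
import math

CLOUD_USD_PER_M_TOKENS = 3.13   # Gemini 1.5 Flash input rate cited above
MONTHLY_TOKENS_M = 50.0         # the 50M tokens/month SaaS scenario
HARDWARE_USD = 800.0            # ASSUMPTION: one-off GPU workstation cost
LOCAL_OPEX_USD_MONTH = 30.0     # ASSUMPTION: energy + supervision share

def breakeven_month() -> int:
    """First month where cumulative cloud savings cover the hardware outlay."""
    monthly_saving = CLOUD_USD_PER_M_TOKENS * MONTHLY_TOKENS_M - LOCAL_OPEX_USD_MONTH
    return math.ceil(HARDWARE_USD / monthly_saving)

print(breakeven_month())  # 7 under these placeholder assumptions
```

Under these placeholder numbers the break-even lands at month 7, inside the 6-to-12-month window; doubling the hardware cost or halving the volume pushes it out accordingly.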

An honest TCO comparison must also account for the DPO’s time spent documenting and auditing the internal pipeline, which never appears on the API invoice.

The final trade-off depends on three variables: monthly volume of tokens to filter, available hardware profile, and tolerance for side-channel risk on the cloud side.

A SaaS founder pushing 50 million tokens per month to OpenAI gains mechanically by internalizing the PII filter by the second year.

A SME pushing 5 million remains comfortable with cloud-side filtering supplemented by an OpenAI DPA audit.

TCO calculation is not just about the price per million tokens: it also weighs the cost of non-compliance, and the CNIL's 2026 decisions are a reminder that this cost is calculated as a percentage of turnover.

What it changes in DPIA or EIPD

For a DPO, the arrival of Privacy Filter opens a new line in the EIPD: documenting a technical minimization measure compliant with Article 5 GDPR.

This line does not remove the transfer impact assessment (TIA) required by EDPB guidelines post-Schrems II, nor the contractual DPA signed with OpenAI.

It reduces the residual risk surface and strengthens the defensibility of the file in case of a CNIL inspection.

Concretely, the DPO documents three elements: the model version used, the inference hardware profile, and the recall rate measured on a representative business sample.

The recall measured on an internal corpus is more telling than the F1 published on PII-Masking-300k, because that is where French formats and custom identifiers are actually represented.

A quarterly internal audit on 200 to 500 hand-annotated documents is enough to maintain the traceability required by Article 35.
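The recall figure the DPO records can be computed span-for-span against that hand-annotated sample; a minimal sketch, where spans are (start, end, category) triples matched exactly:

```python
def span_recall(gold_spans, predicted_spans):
    """Share of hand-annotated PII spans (exact match) that the filter found."""
    gold = set(gold_spans)
    if not gold:
        return 1.0  # nothing to find, nothing missed
    return len(gold & set(predicted_spans)) / len(gold)

gold = [(9, 19, "private_email"), (30, 41, "private_person")]
pred = [(9, 19, "private_email")]
print(span_recall(gold, pred))  # 0.5
```

Exact-match is strict; an internal audit may also grant partial-overlap credit, but exact spans are the easiest figure to defend in a documented file.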

For agencies and SaaS reselling an AI service to French administrations, this file becomes a direct commercial asset, on par with SecNumCloud certification or HDS controls.

A useful parallel on risk management: privacy and AI in Gmail, where the same framework applies to the internal use of a cloud assistant.

Conclusion

Privacy Filter joins the DPO’s GDPR toolkit, without becoming an automatic compliance seal.

It helps minimize data leaving the IT system, without removing the OpenAI DPA, the Schrems II TIA, or the controller’s responsibility.

The bench against Presidio and OpenMed clarifies its playing field: generalist, multilingual, designed for long flows and code secrets.

For a SaaS founder preparing an enterprise AI deployment, the decision becomes clear: test Privacy Filter on a sample of fifty sensitive business documents, measure the real recall, document in the EIPD, then move to production.

FAQ

Does Privacy Filter make ChatGPT 100% GDPR compliant?

No, it minimizes data transferred to the cloud and strengthens the defense of a GDPR file, without removing DPA Article 46, DPIA Article 35, or Schrems II TIA.

Where to place Privacy Filter in a RAG pipeline?

Before embeddings and any cloud LLM call, with an optional second pass on the generation output if the model reformulates segments with their source identifiers.

Should you prefer Privacy Filter or Microsoft Presidio in 2026?

Privacy Filter is suitable for global multilingual flows and long documents, Presidio is preferable when the team wants to write its own business recognizers; both tools can coexist in the same pipeline.

Does Privacy Filter detect French PII like NSS, SIRET, or IBAN FR?

Not without targeted fine-tuning: these formats are not in the eight basic categories and require 200 to 500 labeled examples to reach an acceptable F1.

What latency does Privacy Filter add to a cloud LLM call?

About 50 to 100 milliseconds on consumer GPU FP16, 0.5 to 2 seconds on CPU FP32, and under 500 milliseconds in INT8 quantized.

What happens if Privacy Filter misses a PII?

The data controller remains legally responsible, as automatic redaction is a minimization measure and not a transfer of responsibility to OpenAI.

Can Privacy Filter run in a browser or on mobile?

Yes, the model works via WebGPU in a Chromium tab thanks to Transformers.js, and an ONNX export makes mobile inference possible if memory mapping is aligned.

How to document Privacy Filter in an EIPD?

The DPO records the model version, the inference hardware profile, the recall rate measured on a representative business corpus, and plans a quarterly audit on 200 to 500 annotated documents.

Does Privacy Filter replace anonymization under GDPR?

No, it pseudonymizes under Article 4(5); the risk of reidentification by mosaic effect persists as long as quasi-identifiers like birth date or postal code remain in the text.

What is the real cost of a local Privacy Filter deployment?

An RTX 4060 amortized over eighteen months and a DPO supervision position suffice for a volume up to five million tokens per hour, with a TCO tipping point against the cloud API between month 6 and month 12.
