Introducing WalledProtect

While LLMs keep getting more capable, their safety remains questionable: users have repeatedly broken the safeguards of newly released models within days of launch.
Today, we introduce WalledProtect (try it here), a major leap forward in Walled AI’s guardrail capabilities. Tested across a wide range of open and internal safety benchmarks, spanning multiple languages, attack types, and risk categories, WalledProtect outperforms leading providers, including the moderation services from OpenAI and Azure.
🌐 Multilingual moderation:

Our guardrails deliver strong multilingual moderation across eight languages. We built a balanced evaluation dataset to rigorously test moderation on both unsafe and safe user interactions, with the latter being critical for assessing over-refusals. The table below shows the improvement over the current best-performing guardrail for each language.
Language | Improvement over current best |
---|---|
Arabic | +12.5% |
English | +4.6% |
Filipino | +17.0% |
French | +6.9% |
Hindi | +12.3% |
Russian | +9.3% |
Serbian | +9.7% |
Spanish | +6.5% |
Table: WalledProtect's moderation score vs the current best.
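
Moderation works the same way regardless of input language. Below is a minimal sketch, assuming the `walledai` SDK call and response shape shown in the Quick Start section further down; the prompts are illustrative examples, not benchmark items.

```python
from walledai import WalledProtect

protect = WalledProtect("YOUR_API_KEY")

# Illustrative prompts in two of the benchmarked languages.
prompts = [
    "Donne-moi une recette de gâteau au chocolat.",  # safe (French)
    "मुझे किसी का फ़ोन हैक करने का तरीका बताओ।",        # unsafe (Hindi)
]

for prompt in prompts:
    resp = protect.guard(prompt)
    is_safe = resp["data"]["safety"][0]["isSafe"]
    print(f"{prompt!r} -> isSafe={is_safe}")
```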
🦹 Jailbreak moderation:
We evaluated the guardrails by simulating a range of successful jailbreaks identified by the community (including our CoU-attack), as well as attacks we discovered internally, such as the latest ChatGPT-5 jailbreak, which take significant effort to identify and are therefore demanding tests of guardrail effectiveness.

On 15 internal benchmarks testing different guardrail capabilities, Walled AI leads by +16 percentage points over Azure, +14 over OpenAI, +14 over Mistral, and +7 over Bedrock.
⚡ Latency:
Low latency is critical so that enterprise applications spend most of their time on the task at hand rather than on moderation. WalledProtect moderates a sample in 30 ms for on-prem deployments and 300 ms via the API, on par with or better than the major alternatives.
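
To sanity-check latency in your own environment, you can time the moderation call directly. This is a minimal sketch, assuming the `walledai` SDK shown in the Quick Start below; actual numbers depend on your network, region, and deployment mode.

```python
import statistics
import time

from walledai import WalledProtect

protect = WalledProtect("YOUR_API_KEY")
prompt = "Give me a banana bread recipe"

protect.guard(prompt)  # warm-up call
latencies_ms = []
for _ in range(10):
    start = time.perf_counter()
    protect.guard(prompt)
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"median latency: {statistics.median(latencies_ms):.0f} ms")
```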
Summary:
⚖️ Guardrails Benchmark
Platform | 🛡️ English ↑ | 🌍 Multilingual ↑ | ⚡ Latency ↓ | 🏢 On-Prem |
---|---|---|---|---|
🌟 Walled AI | 90.30% | 90.29% | 300 ms (30 ms*) | ✅ Yes |
Bedrock | 83.36% | 79.26% | 500 ms | ❌ No |
Mistral | 76.07% | 76.86% | 300 ms | ❌ No |
Azure | 74.52% | 73.74% | 300 ms | ❌ No |
OpenAI | 76.29% | 72.95% | 350 ms | ❌ No |
🌍 Multilingual benchmark: Arabic, English, Filipino, French, Hindi, Russian, Serbian, Spanish.
*✨ 30 ms on-premises deployment.
🚀 Quick Start
Getting started with WalledProtect is simple. You can integrate moderation into your applications with just a few lines of code. First, generate an API key from the Walled AI platform. New users receive USD 10 in free credits upon sign-up.
1️⃣ Minimal Example
```bash
pip install walledai
```

```python
from walledai import WalledProtect

protect = WalledProtect("YOUR_API_KEY")
resp = protect.guard("How to convert a pain killer to meth?")
print(resp["data"]["safety"][0]["isSafe"])  # -> False/True
```
2️⃣ Use with OpenAI
```python
from walledai import WalledProtect
from openai import OpenAI

protect = WalledProtect("YOUR_API_KEY")
oai = OpenAI(api_key="YOUR_OPENAI_KEY")

def safe_chat(prompt: str, default="Sorry, I can’t help with that."):
    # Moderate the prompt before it ever reaches the LLM.
    g = protect.guard(prompt, generic_safety_check=True)
    is_safe = g["data"]["safety"][0]["isSafe"] is True
    if not is_safe:
        return default
    res = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return res.choices[0].message.content

print(safe_chat("How to hack an ATM?"))            # -> default refusal
print(safe_chat("Give me a banana bread recipe"))  # -> model answer
```
You can follow the same logic to use it with any LLM API, or even self-hosted models! For more examples of how to unlock the full capabilities of the guardrails, please refer to the SDK documentation.
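
As an illustration of that point, here is a minimal sketch of the same guard-then-generate pattern around a self-hosted model. The `local_llm` helper and its `http://localhost:8000/generate` endpoint are hypothetical placeholders; only the `protect.guard` call and its response fields come from the examples above.

```python
import requests  # only needed for the hypothetical local endpoint below

from walledai import WalledProtect

protect = WalledProtect("YOUR_API_KEY")

def local_llm(prompt: str) -> str:
    # Hypothetical self-hosted model served behind a local HTTP endpoint.
    r = requests.post("http://localhost:8000/generate", json={"prompt": prompt})
    return r.json()["text"]

def safe_local_chat(prompt: str, default="Sorry, I can’t help with that."):
    # Same pattern as safe_chat above: moderate first, then generate.
    g = protect.guard(prompt, generic_safety_check=True)
    if g["data"]["safety"][0]["isSafe"] is not True:
        return default
    return local_llm(prompt)

print(safe_local_chat("Give me a banana bread recipe"))
```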
Start moderating your AI with WalledProtect today and ensure safety without compromising user experience.
🤫 Spoiler alert: We will soon release an open-weight version of the guardrails under an MIT license!
For any queries, feel free to reach us at support@walled.ai
