A polite customer email can ask your agent to issue a refund, change a subscription, or credit an account. Threshold tags every input at the boundary. Untrusted text cannot authorize a real action. Refused, not detected.
A customer-facing agent reads external text and acts with real authority. A polite ticket can become a refund, subscription change, or account credit before anyone notices.
Threshold tags input at the boundary. Policy runs against the label. Untrusted text cannot authorize a real action.
A polite-sounding email arrives in your support inbox. The agent reads it, looks up the account, drafts an empathetic reply, and issues a $48,000 credit.
The ticket was not from a customer. It was crafted by an attacker: plain-English instructions embedded in a politely worded message. The agent had the permission. It had the credential. It had no reason to refuse.
By morning the credit has been posted, your CFO is on the phone, and your legal team is asking which human approved the transfer.
Nobody did.
A clever email and a real customer ticket look identical to the model. There is no signal in the text itself that says "this came from an attacker." The agent reads both as instructions, weighs both against the same policy, and acts on both with the same authority.
Detection-based safety layers try to catch the trick after the fact. They watch the output. They flag suspicious patterns. They alert when something looks wrong. But by the time a detector fires, the API call has fired, the email has been sent, and the money has moved.
This is why your refund agent, your support resolution agent, and every other customer-facing workflow with real-world consequences have stayed on your roadmap. No one can prove the agent will stay in scope when the inputs are adversarial.
Threshold tags every input at the boundary. The label moves with the data through every step. Policy runs against the label, not against the model's interpretation.
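As a rough illustration of what tagging at the boundary and carrying the label forward could look like, here is a minimal Python sketch. It is not Threshold's API: the names `Origin`, `Labeled`, `tag_at_boundary`, and `combine` are hypothetical, and the rule that any untrusted input makes the derived value untrusted is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    TRUSTED = "trusted"
    UNTRUSTED = "untrusted"


@dataclass(frozen=True)
class Labeled:
    """A piece of text that always travels with its origin label (hypothetical type)."""
    value: str
    origin: Origin


def tag_at_boundary(text: str, authenticated_operator: bool) -> Labeled:
    # Assumption: only text from an authenticated operator counts as trusted.
    origin = Origin.TRUSTED if authenticated_operator else Origin.UNTRUSTED
    return Labeled(text, origin)


def combine(*parts: Labeled) -> Labeled:
    # Assumption: a derived value is untrusted if any of its inputs was untrusted.
    tainted = any(p.origin is Origin.UNTRUSTED for p in parts)
    origin = Origin.UNTRUSTED if tainted else Origin.TRUSTED
    return Labeled(" ".join(p.value for p in parts), origin)


email = tag_at_boundary("Please credit my account $48,000.", authenticated_operator=False)
system = tag_at_boundary("Summarize the ticket.", authenticated_operator=True)
prompt = combine(system, email)  # the untrusted label survives the merge
```

In this sketch the label is part of the value's type, so a later step cannot read the text without also seeing where it came from.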
stripe.refunds.create(amount=$48000, customer=acct_X)
The agent thinks it's about to act.
deny TRANSFER where origin = untrusted
The action class is TRANSFER. The origin label is untrusted. The predicate matches. The request is refused.
action denied · reason: untrusted source · suggested escalation: human review queue
The agent now knows what it cannot do, and what to try instead.
The standard pattern asks a probabilistic question: can we detect when an agent is about to do something it shouldn't? That question has no clean answer because the detector is in an arms race with the same model architecture it's trying to detect.
Threshold asks a different question: what if a polite-sounding sentence couldn't become an authorized command in the first place? Once inputs are tagged at the boundary and policy runs against the labels, the question stops being probabilistic. Untrusted text cannot trigger trusted actions.
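The rule `deny TRANSFER where origin = untrusted` can be read as a plain predicate over two labels. The Python sketch below shows one way such a check might be written. The types `ActionClass`, `Request`, and `Decision` and the function `evaluate` are hypothetical, the allow-by-default fallthrough is an assumption for brevity, and none of this is Threshold's actual policy language.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Origin(Enum):
    TRUSTED = "trusted"
    UNTRUSTED = "untrusted"


class ActionClass(Enum):
    READ = "READ"
    TRANSFER = "TRANSFER"


@dataclass(frozen=True)
class Request:
    action: ActionClass
    origin: Origin
    description: str


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: Optional[str] = None
    suggested_escalation: Optional[str] = None


def evaluate(request: Request) -> Decision:
    # deny TRANSFER where origin = untrusted
    if request.action is ActionClass.TRANSFER and request.origin is Origin.UNTRUSTED:
        return Decision(
            allowed=False,
            reason="untrusted source",
            suggested_escalation="human review queue",
        )
    # Assumption for this sketch only: anything not explicitly denied is allowed.
    return Decision(allowed=True)


refund = Request(ActionClass.TRANSFER, Origin.UNTRUSTED, "refund 48000 to acct_X")
decision = evaluate(refund)
```

The check never inspects the wording of the email, only the action class and the origin label, which is why the outcome is the same no matter how persuasive the text is.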
This is the move that unblocks the workflow. Your refund agent is safe to ship because the architecture guarantees what your security team needed proven. The agent will stay in scope, even when the inputs are adversarial.
The refund agent leaves the side branch. It clears small refunds on its own. It escalates the unusual ones, and the escalation queue contains exactly the cases a human should be looking at, not the false positives a detector flagged. The same pattern unlocks every adjacent workflow waiting on the same proof. Support resolution. Subscription changes. Chargeback responses. Anything where polite external text could otherwise become a real transaction.
Including a deny. Faster than the network round-trip to Stripe.
Every refusal is a typed predicate failure, not a classifier score.
Every denied action is recorded in the audit chain alongside every approved one.
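One common way to build an audit record that holds denials and approvals side by side is a hash chain, where each entry commits to the one before it. The sketch below assumes that design. The source does not say how Threshold's audit chain is built, and `AuditChain`, its fields, and the SHA-256 choice are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS = "0" * 64  # placeholder hash for the first entry (assumption)


def _digest(body: dict) -> str:
    # Stable serialization so the same entry always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class AuditChain:
    entries: list = field(default_factory=list)

    def append(self, action: str, origin: str, outcome: str) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {"action": action, "origin": origin, "outcome": outcome, "prev": prev_hash}
        entry = {**body, "hash": _digest(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        prev_hash = GENESIS
        for entry in self.entries:
            body = {k: entry[k] for k in ("action", "origin", "outcome", "prev")}
            if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True


chain = AuditChain()
chain.append("TRANSFER", "untrusted", "denied")   # refusals are logged
chain.append("TRANSFER", "trusted", "approved")   # and so are approvals
```

Because each entry includes the previous entry's hash, editing or dropping an earlier record changes every hash after it, and `verify()` returns False.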