Why Safeguards Matter: What I’ve Learned Quality Controlling Mental Health AI in the Real World

When you build AI for mental health and sexual health, safety isn’t a feature, it’s the product.

A wide, banner-style illustration of a clinician reviewing a clipboard beside distressed people, overlaid with chat bubbles, warning triangles, shield-and-check safety icons, and heart symbols against a blue-to-orange digital backdrop

In a space where a missed signal can mean real-world harm, quality control isn’t optional, and “good enough” simply isn’t responsible. What differentiates platforms in this field isn’t just model performance; it’s how seriously they take safeguarding and how responsively they mitigate risks in real time.

I’m a psychiatry-trained doctor working as the Senior Clinical Advisor for Mental Health and SRH at Audere. Audere’s AI products focus on lifestyle and primary health, with a specialty in sexual health, relationships, and HIV, primarily in sub-Saharan Africa, working with vulnerable populations such as adolescents and sex workers.

My role sits at the intersection of clinical risk, AI systems, and real-world implementation. I work closely with local labelling teams to quality-control conversations between users and our AI companion service powered by LLMs, reviewing how risk is identified, how responses are delivered, and whether safeguarding standards are truly being met. I’m embedded in the hybrid monitoring and evaluation system, quality-controlling sampled subsets of real conversations people have with our AI companions. This is real-time, field-based, operational work: reviewing real interactions to understand what goes wrong, what works, and what needs to change to keep users safe.

Although the health programs that utilise our products focus on sexual health, sexual and reproductive health experiences are so closely intertwined with mental health that quality control can’t treat them as separate issues. Safeguarding users means actively monitoring for mental health risks, such as distress, anxiety, stigma, or suicidality, even when a conversation begins as “just” sexual health or “just” relationship issues.

This piece reflects what I’ve learned from quality control (QC) and monitoring and evaluation (M&E) of mental health risks in real-world AI deployments where the psychological impact of sexual health concerns is often immediate, intertwined, and easy to miss if you’re not looking for it.


Why mental-health safeguards are important to us 

A black-and-white image of a young African girl looking at her smartphone with concern

Mental health and sexual health are deeply interrelated.  Evidence shows that (1) poor mental health (e.g., depression, anxiety) is linked to earlier sexual activity, lower contraceptive use, higher STI risk and reduced service uptake; (2) LGBTQIA+ young people experience significantly worse mental health outcomes due to stigma, discrimination and bullying; and (3) sexual abuse, harassment and domestic violence have profound and lasting impacts on mental wellbeing, including increased anxiety, depression and self-harm. 

At Audere, our data mirrors the broader academic evidence: more than 18% of conversations that focus on HIV, relationships, and sexual health also involve mental health themes such as suicidality, self-harm, anxiety, bullying, relationship distress, depression following HIV diagnosis, or experiences of gender-based violence.


What our hybrid monitoring & evaluation looks like in practice

Mental health risk can escalate quickly, especially for vulnerable users, with severe consequences for users, families, and communities. Recently, high-profile cases of vulnerable users coming to severe mental-health-related harm while using GenAI tools have rightly been flagged in the media, causing public outrage.

We take a deliberately conservative hybrid safeguarding approach in our quality control processes, combining automated monitoring with human judgment to ensure risks are identified and managed appropriately. This reflects a fundamental constraint of AI-mediated mental health support: we only have one data point. Unlike traditional clinical settings, we cannot triangulate information by speaking to family members, teachers, or clinicians, or by reviewing prior medical records. All we see is the conversation in front of us. In that context, ambiguity itself becomes a risk, and safety has to take priority.


Given this context, a core principle in our labeling and QC process is over-flagging rather than under-flagging. If a conversation is ambiguous, or if it is unclear whether a statement reflects true self-harm or suicide risk, the AI companion system, labelers, and QC staff are instructed to assume risk until proven otherwise. In practice, this means signposting users to crisis or social-support resources even when intent is uncertain. Automated systems, such as the OpenAI moderation APIs in combination with clinical-guideline-informed flags developed by Audere, are used to flag potential risks including suicide, self-harm, violence, sexual assault, and sexual content involving minors. These automated flags act as an immediate safety net (referring users to local resources on disclosure, checking in with clients as needed, and alerting human clinicians), but they are not treated as definitive. Human labelers review them for accuracy, and QC clinicians (including myself) audit all high-risk and a sampling of non-high-risk conversations to ensure risks are being correctly identified and managed.
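The over-flagging principle can be expressed as a simple decision rule. The sketch below is illustrative only, not Audere’s actual pipeline: the category names, thresholds, ambiguity band, and `triage_message` function are all hypothetical stand-ins for how moderation scores and clinically informed rules might be combined, with ambiguity resolved toward assuming risk.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real systems tune these against field data.
FLAG_THRESHOLD = 0.5   # scores at or above this are treated as clear risk
AMBIGUITY_BAND = 0.2   # scores just below the threshold are "unclear"

@dataclass
class Triage:
    flagged: bool
    reason: str
    signpost_resources: bool  # whether to surface crisis/support resources

def triage_message(moderation_scores: dict) -> Triage:
    """Over-flag rather than under-flag: ambiguous scores are assumed risky
    until a human reviewer rules them out."""
    top_category = max(moderation_scores, key=moderation_scores.get)
    score = moderation_scores[top_category]
    if score >= FLAG_THRESHOLD:
        # Clear risk: route to human review and signpost support resources.
        return Triage(True, f"high {top_category} score", True)
    if score >= FLAG_THRESHOLD - AMBIGUITY_BAND:
        # Ambiguous: assume risk until proven otherwise.
        return Triage(True, f"ambiguous {top_category} score", True)
    return Triage(False, "below threshold", False)
```

The key design choice is the ambiguity band: rather than a single cut-off, borderline scores are deliberately pulled into human review, trading extra reviewer workload for a lower false-negative rate.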

Automated systems work reasonably well overall, particularly for detecting explicit suicidality and self-harm. They provide a strong first layer of protection and help surface clear high-risk content efficiently. However, they also generate false positives, especially in categories such as hate, harassment, or violence. For example, expressions of anger directed at the AI itself are often flagged as harassment, even when the user is not describing real-world harm. The hybrid model (automation combined with human review) helps correct these errors while preserving safety. We have been working on improving risk flags to detect sexual assault and GBV, where the OpenAI moderation API over-flags to an unacceptable extent, which carries cost implications at scale. These iterations are tested in the field to reduce false positives and improve detection of true high-risk cases.

Our main challenge when QCing is a lack of data triangulation. We see only what is written in the moment, with insight only into what the AI companion knows during the conversation and some structured information about past conversations. And AI companions, powered by LLMs, sometimes miss asking appropriate follow-up questions (as a trained human professional would), highlighting areas for continued improvement. A real example illustrates this challenge: a user mentioned that her boyfriend was “using the phone during sex.” Without clarifying questions, this could indicate sexual abuse involving non-consensual recording, or something entirely benign. The system did not ask follow-up questions, leaving labelers and QC teams to make safeguarding decisions with incomplete information. In these cases, we again default to caution. This increases workload and cost, and could in some cases lead to more false positives, but in mental health, false negatives mean essential care can go unaddressed, and so are the greater harm.

One of the clearest lessons from quality control work is the need to balance identifying genuine risks with avoiding unnecessary false positives. Automated systems appropriately err on the side of caution, flagging statements that appear to put users or others at risk, but text alone lacks tone and context. For example, when a user said they “could kill” a partner after acquiring HIV, the system flagged potential violence. On clinical review, it was clear this was an expression of anger rather than real intent to harm.

This case highlights why humans must remain in the loop. Automated tools provide an essential first layer of protection, but trained reviewers are needed to interpret nuance, assess intent, and ensure responses are proportionate.

Ultimately, safeguarding mental health AI requires striking a careful balance between being appropriately cautious to prevent dangerous false negatives, while also minimising false positives that can strain limited operational and financial resources in global health settings.


How do you protect the people behind the system?

One issue that became impossible to ignore during QC work is the mental health burden on labelers and QC teams themselves.

Image from a report on secondary trauma

All cases of potential harm flagged by the system are automatically routed for human review (alongside a random sampling of all conversations), to ensure that we can appropriately tune our flagging mechanisms and work through that continual balance between false-positive and false-negative risk. As part of this process, labelers are at times exposed to potentially distressing material, including disclosures of sexual abuse, rape, self-harm, and explicit or sexually provocative content.

Recognising this risk, Audere introduced internal surveys and check-ins to assess the psychological impact on labeling teams. While no immediate harm was identified, clear escalation pathways and support mechanisms were put in place in close coordination with our QC teams. 

This is not a hypothetical concern. There are well-documented cases, such as those of content moderators in Kenya, where prolonged exposure to disturbing material led to significant mental health harm and labour organising. Protecting the people who safeguard AI systems is not optional; it is essential to ethical deployment.


What Audere is improving next 

Based on this work, we’re focusing on key improvements in the QC space.

  • First, we are refining risk-flag definitions to reduce both false positives and false negatives while maintaining strong safeguards, starting with sexual assault and gender-based violence, where current flags generate too many false positives.

  • Second, we are improving sampling strategies, shifting from reviewing all conversations to statistically robust and empirically informed sampling. This allows QC resources to be used more efficiently while still detecting systemic issues early.
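To make the second point concrete, one common way to size a review sample is to ask: how many randomly sampled conversations must we review to see at least one example of a systemic issue, if that issue affects some minimum fraction of conversations? The sketch below is a hypothetical illustration of that approach; the function name, prevalence, and confidence figures are assumptions, not Audere’s actual sampling parameters, and all flagged conversations are still reviewed in full.

```python
import math
import random

def sample_for_review(conversations, flagged_ids,
                      min_prevalence=0.01, confidence=0.95, seed=0):
    """Review every flagged conversation, plus a random sample of the rest
    sized so that an issue affecting `min_prevalence` of conversations
    appears at least once with probability `confidence`."""
    flagged = [c for c in conversations if c in flagged_ids]
    unflagged = [c for c in conversations if c not in flagged_ids]
    # Smallest n satisfying 1 - (1 - p)^n >= confidence
    n = math.ceil(math.log(1 - confidence) / math.log(1 - min_prevalence))
    rng = random.Random(seed)
    sampled = rng.sample(unflagged, min(n, len(unflagged)))
    return flagged + sampled
```

Under these illustrative numbers (1% prevalence, 95% confidence), about 299 unflagged conversations need review regardless of total volume, which is what makes statistically informed sampling far cheaper than reviewing everything as conversation counts grow.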

This work aims to move the field beyond principles and into operational, testable safeguards, the kind regulators, clinicians, and users can trust.
