Task Specific Computer Vision Versus Large Multi-Modal Models for Diagnostic Test Interpretation – A Benchmarking Study [version 1]

Abstract

Recent advances in large multimodal models (LMMs) promise flexible image and text understanding, opening new possibilities for AI-driven healthcare delivery. This study presents a structured benchmark comparing LMMs to a production-grade, task-specific computer vision (CV) pipeline for the interpretation of rapid diagnostic tests (RDTs) across real-world tasks, including result reading, tampering detection, image quality assessment, and handwriting recognition. Our findings show that while LMMs demonstrate generalist capabilities, they fall short of the precision, reproducibility, and deployability required for high-stakes health applications. In contrast, the CV system – optimized for offline use in low-resource settings – delivers deterministic, high-accuracy outputs with minimal infrastructure overhead. We highlight the importance of reproducibility, auditability, and field constraints in model design and deployment, and identify key failure modes in generative models, including hallucinations and inconsistencies. This work underscores the need for grounded benchmarks, deployment considerations, and safeguards in applying frontier AI to global health contexts.

Models evaluated

Six models were assessed using the benchmarking dataset. The first is HealthPulse AI (developed by Audere), a modular computer vision pipeline for interpreting RDT results.18 It includes shared components across two branches: (1) a fine-tuned path trained specifically on certain RDT types (in this case, most WHO prequalified (PQ’d) malaria and HIV kits), and (2) a foundation path trained on a wider variety of standard RDTs (a single continuous result window with 2 or 3 lines) for general-purpose interpretation, but not as optimized for any single test type. The fine-tuned path is used when the RDT type is recognized as one of its trained targets; otherwise, the foundation pipeline handles the image. HealthPulse AI has been described in prior work1,2; here, we treat it as the “classical CV” benchmark. None of the images in the benchmarking dataset were used to train or tune the HealthPulse AI models, ensuring an unbiased test. The other five were broadly accessible multimodal generative models: OpenAI’s GPT-4o,19 OpenAI’s GPT-4o Mini,20 Google’s Gemini 1.5,21 Anthropic’s Claude 3.0 (Opus),22 and Anthropic’s Claude 3.5 (Sonnet).23 Detailed summaries of each solution are available in the supplementary methods, Table 1. A second tranche of assessment was conducted with two vision-language models: Llama 4 Maverick 17B-128E-Instruct24 and Qwen 2.5 VL-72B.25
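The two-branch routing described above can be pictured roughly as follows. This is a minimal, hedged sketch with hypothetical function names and labels, not the actual HealthPulse AI implementation.

```python
# Minimal, hypothetical sketch of the two-branch routing described above.
# The model callables and RDT type labels are placeholders, not the actual
# HealthPulse AI interfaces.
from typing import Callable, Dict

FINETUNED_TARGETS = {"malaria", "hiv"}  # illustrative labels for trained targets

def route_interpretation(
    rdt_type: str,
    finetuned_model: Callable[[bytes], Dict],
    foundation_model: Callable[[bytes], Dict],
    image: bytes,
) -> Dict:
    """Use the fine-tuned path when the detected RDT type is a trained target,
    otherwise fall back to the general foundation path."""
    if rdt_type in FINETUNED_TARGETS:
        return finetuned_model(image)
    return foundation_model(image)

# Example with stand-in models that just report which path was taken.
result = route_interpretation(
    "malaria",
    finetuned_model=lambda img: {"path": "fine-tuned", "result": "positive"},
    foundation_model=lambda img: {"path": "foundation", "result": "invalid"},
    image=b"...",
)
print(result)  # {'path': 'fine-tuned', 'result': 'positive'}
```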

Results

Specificity in detecting the presence of an RDT (or multiple) is a key differentiator of AI models, with no clear best-in-class solution across all tasks

All models exhibited excellent sensitivity, ranging from 97.5% to 99.6% (Table 4). Specificity, however, varied dramatically across models. Whilst GPT-4o had perfect (100%) specificity, suggesting it never hallucinated an RDT when none existed, this result is complicated by the fact that it returned a response in less than 50% of cases (whereas all other models returned a result for all 22,000 images). The best performing LMMs across all images were Claude 3.5 and Gemini 1.5, with 96% specificity. The HealthPulse AI ensemble (i.e., the CV pipeline) incorrectly signaled the presence of an RDT in 10% of images that actually had none (i.e., 90% specificity). The most extreme was Claude 3.0 (Opus), which had only 34.0% specificity.

Moreover, all the LMMs struggled with identifying and interpreting images with multiple RDTs. Excluding the CV pipeline, GPT-4o Mini was the best performing model on this task at only 52.7%. Qwen 2.5 VL-72B and Llama 4 Maverick 17B-128E-Instruct particularly struggled on multiple RDT detection.

Separate assessment of test type identification (e.g., Malaria, HIV, etc.) demonstrated that this task is feasible regardless of model type, with the CV model and LMMs capable of comparable performance. HealthPulse AI and GPT-4o were the best models for test type identification, with 84.5% and 83.5% accuracy, respectively (Table 4). Notably, despite no specific training for this task, GPT-4o justified its answers by correctly ‘reading’ text labels on the cassette (such as “HIV” or specific brand names) and using them to identify the test type. The worst performing model was Gemini, with 62.0% accuracy, with Claude 3.0 (Opus) close behind.
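For reference, the sensitivity and specificity figures above follow the standard binary-detection definitions. The sketch below is illustrative only, with hypothetical data, and is not the study’s evaluation code; it shows how such figures can be computed from per-image predictions against expert labels.

```python
# Illustrative sketch of the sensitivity/specificity calculation for RDT
# presence detection; variable names and example data are hypothetical.

def sensitivity_specificity(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: True means "an RDT is present in the image".
preds  = [True, True, False, True, False]
labels = [True, True, False, False, False]
sens, spec = sensitivity_specificity(preds, labels)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # sensitivity=1.00, specificity=0.67
```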

Specialist CV models significantly outperform LMMs on RDT interpretation accuracy

The HealthPulse CV foundation model outperformed all tested LMMs, with an F1 of ~91% versus at best ~72% (GPT-4o, on partial data) or ~55% (Gemini, Claude 3.5). Claude 3.0 and GPT-4o Mini had the worst performance at 33–35%. The generative models showed distinct error patterns. For example, GPT-4o tended to be cautious: its false positive rate was 5.7%, higher than the CV model’s but still modest, whereas its false negative rate was relatively higher at 10.0%. GPT-4o Mini had a strikingly high false positive rate of 41.9%. Its false negative rate was only 3.6%, indicating it rarely missed actual positives, but that came at the cost of an untenably large number of false alarms. In a real scenario, GPT-4o Mini would be impractical, as it would label nearly half of healthy patients as positive. Notably, Claude 3.0 (Opus) demonstrated similar behaviour. Gemini 1.5 had the opposite issue: a comparatively low false positive rate (17.7%) but a very high false negative rate (14.8%).

One key type of error is the risk of missed diagnosis due to a faint test line. Claude 3.0 (Opus) (89.2%), GPT-4o Mini (89.1%), and the HealthPulse CV foundation model (86.1%) all did well on faint line recall (i.e., the fraction of faint-positive cases correctly flagged as positive). The two LMM outlier results are likely a function of the error patterns described above (i.e., their high-sensitivity, false-positive-prone tendency). The performance of the other LMMs was substantially worse (range: 13.4%–64.2%).
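As an illustrative aid, the following sketch shows how metrics of this kind (weighted F1, false positive/negative rates, and faint-line recall, i.e., recall restricted to the faint-positive subset) could be computed. The column names and example rows are hypothetical; this is not the study’s evaluation code.

```python
# Hypothetical sketch of the interpretation-accuracy metrics discussed above.
# Column names and example rows are illustrative, not the study's dataset.
import pandas as pd
from sklearn.metrics import f1_score

df = pd.DataFrame({
    "truth":      ["positive", "positive", "negative", "negative", "positive"],
    "predicted":  ["positive", "negative", "negative", "positive", "positive"],
    "faint_line": [True, True, False, False, False],  # expert-labelled faint positives
})

# Weighted F1 over the result classes.
weighted_f1 = f1_score(df["truth"], df["predicted"], average="weighted")

# False positive rate: negatives read as positive; false negative rate: positives missed.
neg = df[df["truth"] == "negative"]
pos = df[df["truth"] == "positive"]
fpr = (neg["predicted"] == "positive").mean()
fnr = (pos["predicted"] != "positive").mean()

# Faint-line recall: fraction of faint-positive cases correctly flagged positive.
faint = df[(df["truth"] == "positive") & df["faint_line"]]
faint_recall = (faint["predicted"] == "positive").mean()

print(weighted_f1, fpr, fnr, faint_recall)
```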

On image quality assessment, HealthPulse’s CV foundation model caught 66.1% of the images that humans found uninterpretable, doing better on HIV tests (88.6%) than on malaria (38.9%). This gap was largely due to roughly 8,700 images being non-standard RDTs (primarily HIV), which the foundation model is not set up to interpret. The LMMs performed at a similar level, with Gemini and Qwen demonstrably better at around 70%. However, high recall in quality-flagging comes with the risk of over-flagging. The HealthPulse CV foundation model labelled nearly a quarter of images as low quality or non-standard; for context, only ~5–6% were marked truly uninterpretable by humans. All of the LMMs, except for GPT-4o Mini, over-flagged (28.4%–38.4%) even more often than the CV model. In other words, these models would reject or warn on roughly one in every 3–4 images, far above the true rate of problematic photos.

For the accuracy measurements in Table 5, the HealthPulse foundation model was used as the primary comparator; results were also included for a fine-tuned model. When trained for a specific test, as the fine-tuned HealthPulse model is, much higher performance was achieved: a weighted F1 of 96.4%, a false positive rate of 1.7%, a false negative rate of 1.4%, faint line recall of 90%, 53.8% alignment with expert-labelled issues, and only 4.1% of images flagged as having IQ issues.

The non-determinism of LMMs is a significant issue for consistency of outputs

The HealthPulse models achieved 100% reproducibility by design; the pipeline (an aggregation of CNN-based models chosen for accuracy, speed, and resource efficiency over transformer models like DETR [Carion et al., 2020] for applications in low-resource environments) produced byte-for-byte identical JSON outputs. Most of the LMMs yielded somewhat inconsistent results despite having their temperature set to zero to minimize randomness, with the answer changing in roughly 17–24% of cases, for example calling a line “faint” in one run but not in another, or intermittently listing an additional image quality issue. Gemini was the notable outlier, providing fully identical results only about one quarter of the time. Again, GPT-4o was a special case, as it returned non-empty, parseable outputs on all three attempts for only 101 of the 2,000 images. Counting the images where GPT-4o gave an empty response in any run as non-reproducible (a reasonable interpretation, since a model that sometimes fails on an image cannot be considered consistent), its effective reproducibility was 3.9%.
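Reproducibility here can be read as the fraction of images whose outputs are identical across all repeated runs, with missing or empty responses counted as non-reproducible. A minimal illustrative sketch follows; the data shapes are hypothetical and this is not the study’s evaluation harness.

```python
# Illustrative sketch: reproducibility as the share of images whose canonical
# serialized outputs are identical across all repeated runs. Shapes are hypothetical.
import json

def reproducibility(runs: list[dict[str, dict]]) -> float:
    """runs[i] maps image_id -> parsed JSON output for run i.
    An image counts as reproducible only if every run produced a non-empty
    output and all canonical serializations match."""
    image_ids = set().union(*(r.keys() for r in runs))
    reproducible = 0
    for image_id in image_ids:
        outputs = [r.get(image_id) for r in runs]
        if any(o is None or o == {} for o in outputs):   # missing/empty => not reproducible
            continue
        serialized = {json.dumps(o, sort_keys=True) for o in outputs}
        reproducible += len(serialized) == 1
    return reproducible / len(image_ids)

# Toy example with three runs over two images.
runs = [
    {"img1": {"result": "positive", "faint": True}, "img2": {"result": "negative"}},
    {"img1": {"result": "positive", "faint": True}, "img2": {"result": "negative", "iq_issue": "blur"}},
    {"img1": {"faint": True, "result": "positive"}, "img2": {}},
]
print(reproducibility(runs))  # 0.5: img1 identical in all runs, img2 is not
```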

Cost and processing speed vary greatly across model types

The HealthPulse CV solution processed an image in about 6.16 seconds on average (the fastest of all models tested). Notably, Gemini 1.5 cost the same per image as the CV solution; however, Gemini 1.5 has since been discontinued, and the newer Gemini 2.5 model has a cost of $0.0083. The other LMMs were more expensive, with the exception of GPT-4o Mini, which was more than 75% cheaper than the CV benchmark. LMM token costs also generally increase at higher volumes. Batch inference was not used, to reflect how these services would be used in the field, with individual images being sent for decision support, targeted training, or surveillance use cases.

The generative models were only modestly slower in most cases (Table 6), with the exception of Claude 3.0 (Opus), which took 15.3 s per image on average (explained by the API throttling described in the supplementary results), roughly double that of the other LMMs. In aggregate, processing 22,000 images took 1.5 days for Claude 3.0, versus ~12–18 hours for most other LMMs and under 8 hours for HealthPulse. For use cases involving a large batch of images (say, a national surveillance program uploading thousands of test photos), the slower models could introduce significant delays.

LMMs could provide specific capabilities to augment CV models

Detecting RDT reuse and tampering is an important use case in the field, especially where incentives – such as reimbursements – may be associated with test results. Although underreported in the literature, we have observed these practices empirically in real-world deployments, including cases of reused RDTs and manual alterations intended to manipulate outcomes. Given the operational significance of such behavior, automated detection is critical for safeguarding data integrity and guiding programmatic follow-up. In these scenarios, precision is more important than recall: the goal is not to catch every possible instance of misuse, but to ensure that flagged cases are highly likely to be real, in order to avoid generating noise and to maintain trust in the system. The HealthPulse AI model was specifically tuned for this use case, as reflected in the results. In contrast, the LMMs exhibited poor precision in identifying reuse – frequently flagging reuse incorrectly – which was also reflected in their high recall. All LMMs performed poorly on tampering detection (where drawn lines or whiteout may be used in the result window). On handwriting, GPT-4o showed promising recognition capabilities.

Discussion

This benchmarking study found that a purpose-trained computer vision (CV) system, when considering clinical metrics, cost, and speed, is the more appropriate choice compared to several leading large multimodal models (LMMs) for certain types of medical image interpretation, such as interpreting RDTs. The fine-tuned CV pipeline achieved an overall F1 score of 96.4%, with similarly high accuracy across HIV (95.5%) and malaria (97.0%) RDTs. It also demonstrated perfect reproducibility across runs due to its deterministic nature. In contrast, whilst Llama Maverick was cheaper than the CV solution, its substandard performance (an F1 score of 48% and a false positive rate of 32.8%) undermines this advantage. GPT-4o was undermined by its inability to generate any output consistently, Gemini by its lack of consistency across outputs, and the Claude models were the most expensive without a discernible benefit relative to the other available solutions. Despite these shortcomings and the current base requirement of connectivity, LMMs show noteworthy capabilities that could add value in potential hybrid pipelines, leveraging the best of both types of models. For example, LMMs natively support the production of richer outputs for human understanding (e.g., describing image or RDT quality issues rather than just flagging them), which the CV model does not.

Results in context of the literature

When viewed in the context of existing literature, these results align with earlier findings that general-purpose generative AI models tend to underperform in visually grounded clinical tasks when compared to traditional purpose-trained CV systems. Studies like Liu et al. (2024)27 showed GPT-4V’s accuracy dropped below 50% on radiology interpretation when images were included. Similarly, a commentary by Wang et al. (2024)28 found that GPT-4V underperformed on dermatology classification tasks relative to dermatologists and dedicated CV models and showed a tendency to hallucinate features.

By contrast, there is a growing body of evidence showing consistently strong performance of computer vision models trained on medical imaging datasets across multiple clinical specialties, including radiology, ophthalmology, dermatology, and pathology.29 Gulshan et al. (2016)30 reported >90% AUC for diabetic retinopathy detection using a CNN trained on fundus images. CheXNet31 has similarly demonstrated high diagnostic performance on medical imaging tasks such as detecting pneumonia from chest X-rays. Models deployed in real-time, critical-care contexts continue to demonstrate reliability and effectiveness, underscoring their practical value.32 Brinker et al. (2017)33 demonstrated that a CNN trained on 12,378 clinical images matched or outperformed dermatologists in melanoma detection, surpassing 136 out of 157 dermatologists with a ROC-AUC of ≥0.91.

These LMM limitations are not isolated to medical imaging. Recent work from MIT34 demonstrates that even the most advanced vision-language models (VLMs), like GPT-4o, fail at basic reasoning tasks such as handling linguistic negation (e.g., distinguishing “a cat on the bed” from “no cat on the bed”) in image-text queries. This finding reinforces a broader concern: that current LMMs may not yet possess the fine-grained reasoning capabilities needed to make clinically critical distinctions, such as a faint test line versus no test line, or invalid versus valid results, especially in noisy, real-world data.

Recent proposals advocate hybrid AI architectures, where task-optimized CV models provide core outputs while LMMs supplement auxiliary features such as quality assessment or free-text reasoning. Lin et al. (2024)35 describe “modular interpretability pipelines” that combine deterministic image classifiers with generative layers for user-facing feedback.
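As a purely illustrative sketch of such a hybrid arrangement (the interfaces and names below are hypothetical and not drawn from the cited work or from HealthPulse AI), a deterministic CV classifier could supply the authoritative result while an LMM adds an advisory, user-facing narrative.

```python
# Hypothetical hybrid pipeline: a deterministic CV model produces the core,
# auditable result; an LMM adds an optional, user-facing narrative layer.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class HybridOutput:
    result: str                 # e.g. "positive" / "negative" / "invalid" (authoritative, from CV)
    quality_flags: list[str]    # deterministic quality flags from the CV pipeline
    narrative: Optional[str]    # free-text explanation from the LMM (advisory only)

def hybrid_interpret(
    image: bytes,
    cv_model: Callable[[bytes], dict],
    lmm_describe: Optional[Callable[[bytes, dict], str]] = None,
) -> HybridOutput:
    cv_out = cv_model(image)                            # deterministic core decision
    narrative = None
    if lmm_describe is not None:
        try:
            narrative = lmm_describe(image, cv_out)     # enrich, but never override, the CV result
        except Exception:
            narrative = None                            # LMM failure degrades gracefully
    return HybridOutput(cv_out["result"], cv_out.get("quality_flags", []), narrative)

# Example with a stand-in CV model and no LMM attached.
out = hybrid_interpret(b"...", cv_model=lambda img: {"result": "negative", "quality_flags": []})
print(out)  # HybridOutput(result='negative', quality_flags=[], narrative=None)
```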

Strengths and weaknesses

Strengths of this study include the use of a large, realistic benchmarking dataset of 22,000 images drawn from field deployments across multiple geographies. The dataset incorporated real-world image quality variability (including adversarial conditions such as blur, glare, blood smears, and faint lines) and positivity rates. Evaluations were conducted using consistent prompts and schemas across all LMMs, with outputs analyzed against an independent, expert-labeled ground truth. Additionally, a diverse set of metrics was used to capture performance dimensions beyond raw accuracy, including cost, reproducibility, reliability, and response time, factors that are important in real-world deployments.

The limitations of this study include the fact that the LMMs were tested in zero-shot settings. It remains possible that performance could be improved through techniques such as fine-tuning, few-shot learning (providing input/output examples), or multi-shot prompting (breaking tasks into sequential steps). However, such approaches are rarely viable in field settings where models need to be deployed on low-cost smartphones, often in low- or no-connectivity environments, and are expected to interpret images submitted one at a time by frontline users. In these contexts, common across many low- and middle-income countries (LMICs), model size, inference latency, cost, and offline compatibility become critical to enabling the use and adoption of AI. Fine-tuned LLMs or customized VLMs may offer performance gains in controlled environments, but are often too large to run locally on-device, and too resource-intensive or inconsistent to depend on in real-world deployments. As such, deployment-ready solutions must account for the technical and infrastructural constraints that define usage at the last mile.

Future research

Several promising future research directions arise from this work. First, exploring the effect of temperature and decoding parameters on output confidence and reproducibility may yield actionable insights for real-world deployments, where models must operate reliably under diverse and often suboptimal conditions. In many of the settings where such tools are deployed, particularly in LMICs, this means functioning on low-end, older Android devices, often offline, and processing one image at a time as images are captured and uploaded in the field. Second, introducing few-shot examples in prompts (to the extent practical while balancing costs and token limits) could help determine whether structured performance gains are possible without full model fine-tuning; a schematic example follows below. Third, targeted analysis of hallucinated outputs, including hallucinations of RDTs detected in images without RDTs, could inform the design of fail-safe mechanisms for quality control where LMMs are used. Fourth, error consistency analysis across models could help identify whether specific image characteristics contribute to systematic misinterpretations.
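As a schematic illustration of the second direction above (few-shot prompting), a prompt might interleave a small number of example images with their expected structured outputs before the query image. The message structure below is generic and hypothetical; real providers each have their own multimodal message formats.

```python
# Schematic few-shot prompt structure for RDT interpretation (hypothetical;
# not tied to any specific provider's API).
import base64, json

def image_part(path: str) -> dict:
    with open(path, "rb") as f:
        return {"type": "image_base64", "data": base64.b64encode(f.read()).decode()}

SYSTEM = ("You are an RDT interpretation assistant. Return only JSON with keys "
          "'rdt_present', 'result', and 'quality_issues'.")

def build_few_shot_messages(examples: list[tuple[str, dict]], query_image: str) -> list[dict]:
    """examples: (image_path, expected_output) pairs shown before the query image."""
    messages = [{"role": "system", "content": SYSTEM}]
    for path, expected in examples:
        messages.append({"role": "user", "content": [image_part(path)]})
        messages.append({"role": "assistant", "content": json.dumps(expected)})
    messages.append({"role": "user", "content": [image_part(query_image)]})
    return messages
```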

Further use cases could be explored to determine whether LMMs may contribute meaningful value as an augment to CV models, such as explainability, data collation, handwriting recognition for broader use cases (e.g., handwriting in the background), or multi-RDT interpretation (useful for fever panels and audits in the real world). Additionally, a fine-tuned LMM could be explored in future research to see whether it boosts performance (understanding that size and offline capability may still be limitations even if accuracy levels are achieved) and how the cost compares to training a task-specific computer vision model. Different output formats should also be explored to see whether they could result in lower costs and fewer parsing errors.

Implications for product developers & end users

The implementation of this study yielded several valuable experiential insights. Each LMM API demonstrates idiosyncratic behaviour, necessitating non-trivial engineering work to accommodate. Examples include image size limits (e.g., Claude’s 5 MB limit was not large enough for full-sized images, necessitating custom resizing, which functionally excludes any tasks that require the detail in full-resolution images) and rate limits (as we experienced with Claude and Gemini, which would limit use at scale). Separately, no model type was robust to managing multiple RDTs in a single image. Thus, for this particular use case (i.e., RDT readers), we would recommend that future workflows explicitly advise users on framing (i.e., one test per image) for best results. The availability and pricing of models is also constantly changing, which impacts reliability: Gemini 1.5 was discontinued before the supplementary feature run, requiring an update to Gemini 2.5, which in turn required additional prompt adaptations and a separate integration.
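One generic way of handling the resizing issue noted above is sketched below using Pillow. The size threshold, quality setting, and downscaling strategy are assumptions for illustration, not the preprocessing used in this study; as noted above, any downscaling risks discarding exactly the detail (e.g., faint lines) that some tasks require.

```python
# Illustrative sketch: iteratively downscale/re-encode a photo until it fits
# under an API payload limit. Thresholds and quality values are assumptions.
import io
from PIL import Image

def shrink_to_limit(path: str, max_bytes: int = 5 * 1024 * 1024, quality: int = 85) -> bytes:
    """Return JPEG bytes no larger than max_bytes, halving resolution as needed.
    Stops shrinking below ~256 px on the short side even if still over the limit."""
    img = Image.open(path).convert("RGB")
    while True:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)
        data = buf.getvalue()
        if len(data) <= max_bytes or min(img.size) < 256:
            return data
        img = img.resize((img.width // 2, img.height // 2))  # halve each dimension
```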

Implications for policy makers

These results reinforce that determinism and robust service behavior are as important as raw accuracy for AI in medicine. The variability observed amongst the generative models is problematic from a deployment perspective. Many stakeholders, from decision makers to healthcare providers to patients, expect an automated interpretation tool to yield the same result every time for a given test image (much like a lab instrument would). If an AI model were to give a “positive” result in one run and a “negative” in another on the same image, that undermines trust and utility. In other words, a model with superhuman knowledge is of limited use if it frequently fails to respond or gives inconsistent answers. Our results show that the CV approach meets this bar by design, whereas the generative models do not – an issue that would need addressing before these tools are ready for field deployment, likely drawing on cutting-edge methods that have been developed (e.g.,36), although even these do not guarantee success.

The benefits of determinism extend beyond accuracy and trust: they reduce operational burden. In some care delivery settings, due to the CV model’s predictable nature and high accuracy, implementers can opt not to store every image, thus reducing data hosting costs in LMIC environments. In contrast, for non-deterministic systems like LMMs, where the same input may yield different outputs across runs, every image must be stored to allow human verification in case of audits or incident investigations. This increases logging, traceability, and quality management requirements. The reproducibility of the CV model allows shallower audit trails and lighter infrastructure footprints.

The rapid evolution of LMM APIs and the discontinuation of previous models (e.g., Gemini 1.5 was retired soon after the study) complicate validation, as each model update may require additional engineering, re-certification, or auditing. Offline compatibility is another key consideration. The HealthPulse CV models have a small package size and run completely offline on low-end Android devices, enabling community-based use without connectivity – a capability that no current LMM provides.17 With online deployments, various LMMs impose throttling limits after certain volumes, which has implications for scale. Even if the costs of LMMs come down, the aggregate cost of the engineering effort required to keep up with changes once deployed, retries, and inconsistent responses likely outweighs any cost advantage in real-world deployments. Policymakers should treat reproducibility, offline capability where necessary, and version stability as minimum safety standards for AI diagnostics. Without these safeguards, national programs risk deploying tools that are unreliable and costly to maintain, and that erode trust among providers and patients.

Conclusion

CV solutions remain the most viable option for scalable, high-accuracy RDT interpretation. Given similar, related results from other studies, this is likely a reasonable reflection of the current state of CV versus generative AI for medical image interpretation more broadly. However, as the state of the art in generative models rapidly improves, this issue is worth revisiting – the critical question is how addressable the problems arising from their non-deterministic behaviour are. In the interim, LMMs may serve as complementary tools, particularly for ancillary tasks (such as generating richer descriptions of quality challenges or handwriting recognition), but substantial engineering improvements are required before they can be relied upon for primary diagnostic interpretation in clinical or public health deployments. With further development, especially around hallucination prevention, reproducibility, and reliability, LMMs may mature into more central roles in diagnostic pipelines. For now, however, the safest and most effective approach is to pair the strengths of both: leveraging the deterministic accuracy of CV for core decision-making, while exploring LMMs as intelligent auxiliary tools that enrich user experience and provide transparency through free-text reasoning.

 