Diagnostic Accuracy of Web-Based COVID-19 Symptom Checkers: Comparison Study

Nicolas Munsch; Alistair Martin; Stefanie Gruarin; Jama Nateqi; Isselmou Abdarahmane; Rafael Weingartner-Ortner; Bernhard Knapp

doi:10.2196/21299

J Med Internet Res. 2020 Oct 6;22(10):e21299. doi: 10.2196/21299.

Authors

Nicolas Munsch¹, Alistair Martin¹, Stefanie Gruarin², Jama Nateqi²,³, Isselmou Abdarahmane¹, Rafael Weingartner-Ortner¹,², Bernhard Knapp¹

Affiliations

¹ Data Science Department, Symptoma, Vienna, Austria.
² Medical Department, Symptoma, Attersee, Austria.
³ Department of Internal Medicine, Paracelsus Medical University, Salzburg, Austria.

PMID: 33001828
PMCID: PMC7541039
DOI: 10.2196/21299

Abstract

Background: A large number of web-based COVID-19 symptom checkers and chatbots have been developed; however, anecdotal evidence suggests that their conclusions are highly variable. To our knowledge, no study has evaluated the accuracy of COVID-19 symptom checkers in a statistically rigorous manner.

Objective: The aim of this study is to evaluate and compare the diagnostic accuracies of web-based COVID-19 symptom checkers.

Methods: We identified 10 web-based COVID-19 symptom checkers, all of which were included in the study. We evaluated the COVID-19 symptom checkers by assessing 50 COVID-19 case reports alongside 410 non-COVID-19 control cases. A bootstrapping method was used to counter the unbalanced sample sizes and obtain confidence intervals (CIs). Results are reported as sensitivity, specificity, F1 score, and Matthews correlation coefficient (MCC).
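
The abstract does not spell out the resampling procedure, so the following Python sketch is only one plausible reading of it: each bootstrap round draws the same number of COVID-19 and control cases with replacement, computes the four reported metrics, and takes percentile CIs over the rounds. The function names, the 50-per-class draw size, the number of rounds, and the example predictions are all assumptions made for illustration, not details or data from the study.

```python
import math
import random


def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, F1 score, and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "f1": f1, "mcc": mcc}


def balanced_bootstrap(pos_preds, neg_preds, n_per_class=50,
                       n_rounds=1000, seed=0):
    """Class-balanced bootstrap (assumed scheme, not confirmed by the source).

    pos_preds: checker outputs (True = flagged as COVID-19) for COVID-19 cases.
    neg_preds: checker outputs for control cases.
    Returns {metric: (mean, lower 2.5th percentile, upper 97.5th percentile)}.
    """
    rng = random.Random(seed)
    samples = {"sensitivity": [], "specificity": [], "f1": [], "mcc": []}
    for _ in range(n_rounds):
        # Draw the same number of cases from each class, with replacement.
        pos = rng.choices(pos_preds, k=n_per_class)
        neg = rng.choices(neg_preds, k=n_per_class)
        tp = sum(pos)
        fn = n_per_class - tp
        fp = sum(neg)
        tn = n_per_class - fp
        for name, value in metrics(tp, tn, fp, fn).items():
            samples[name].append(value)
    summary = {}
    for name, values in samples.items():
        values.sort()
        lower = values[int(0.025 * n_rounds)]
        upper = values[int(0.975 * n_rounds) - 1]
        summary[name] = (sum(values) / n_rounds, lower, upper)
    return summary


if __name__ == "__main__":
    # Hypothetical checker output: 40 of 50 COVID-19 cases flagged,
    # 41 of 410 controls flagged. These numbers are invented.
    pos_preds = [True] * 40 + [False] * 10
    neg_preds = [True] * 41 + [False] * 369
    for name, (mean, lower, upper) in balanced_bootstrap(pos_preds, neg_preds).items():
        print(f"{name}: {mean:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Drawing equal numbers per class matters mainly for F1 and MCC, which depend on class prevalence; sensitivity and specificity are each computed within a single class and are unaffected by the imbalance itself.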

Results: The classification task between COVID-19-positive and COVID-19-negative for "high risk" cases among the 460 test cases yielded the following results (sorted by F1 score): Symptoma (F1=0.92, MCC=0.85), Infermedica (F1=0.80, MCC=0.61), US Centers for Disease Control and Prevention (CDC) (F1=0.71, MCC=0.30), Babylon (F1=0.70, MCC=0.29), Cleveland Clinic (F1=0.40, MCC=0.07), Providence (F1=0.40, MCC=0.05), Apple (F1=0.29, MCC=-0.10), Docyet (F1=0.27, MCC=0.29), Ada (F1=0.24, MCC=0.27), and Your.MD (F1=0.24, MCC=0.27). For "high risk" and "medium risk" combined, the performance was: Symptoma (F1=0.91, MCC=0.83), Infermedica (F1=0.80, MCC=0.61), Cleveland Clinic (F1=0.76, MCC=0.47), Providence (F1=0.75, MCC=0.45), Your.MD (F1=0.72, MCC=0.33), CDC (F1=0.71, MCC=0.30), Babylon (F1=0.70, MCC=0.29), Apple (F1=0.70, MCC=0.25), Ada (F1=0.42, MCC=0.03), and Docyet (F1=0.27, MCC=0.29).

Conclusions: We found that the number of correctly assessed COVID-19 and control cases varies considerably between symptom checkers, with different symptom checkers showing different strengths with respect to sensitivity and specificity. Only two symptom checkers achieved a good balance between sensitivity and specificity.

Keywords: COVID-19; accuracy; benchmark; chatbot; digital health; symptom; symptom checkers.

©Nicolas Munsch, Alistair Martin, Stefanie Gruarin, Jama Nateqi, Isselmou Abdarahmane, Rafael Weingartner-Ortner, Bernhard Knapp. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 06.10.2020.

Publication types

Comparative Study
Research Support, Non-U.S. Gov't

MeSH terms

Adolescent
Adult
Algorithms
Betacoronavirus
COVID-19
COVID-19 Testing
Centers for Disease Control and Prevention, U.S.
Clinical Laboratory Techniques
Coronavirus Infections / diagnosis*
Coronavirus Infections / epidemiology*
Data Collection
Diagnostic Self Evaluation*
Humans
Internet*
Middle Aged
Pandemics
Pneumonia, Viral / diagnosis*
Pneumonia, Viral / epidemiology*
Predictive Value of Tests
Public Health Informatics
Reproducibility of Results
SARS-CoV-2
Self Report
Sensitivity and Specificity
Symptom Assessment / instrumentation*
United States
Young Adult