Mobile-device-optimized AI gestational age and fetal malpresentation estimation
We calculated the mean difference in absolute error between the GA model estimate and the gestational age determined by standard fetal biometry measurements using imaging from traditional ultrasound devices operated by sonographers.20 The reference ground truth GA was established based on an initial patient visit as described above in Methods. When conducting pairwise statistical comparisons between blind-sweep and standard fetal biometry absolute errors, we established an a priori criterion for non-inferiority, which was confirmed if the blind-sweep mean absolute error (MAE) exceeded the standard fetal biometry MAE by less than 1.0 day. Statistical estimates and comparisons were computed after randomly selecting one study visit per patient for each analysis group, to avoid combining correlated measurements from the same patient.
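As an illustration of this comparison, the following is a minimal sketch of the paired non-inferiority check on absolute errors, assuming one randomly selected visit per patient and a bootstrap confidence interval; variable names are illustrative and the study's exact statistical procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def noninferiority_check(abs_err_sweep, abs_err_biometry, margin_days=1.0, n_boot=10_000):
    """Paired comparison of absolute GA errors (days), one visit per patient.

    Non-inferiority of the blind-sweep estimate is met when the upper bound of the
    95% CI of the mean difference (blind sweep minus biometry) is below margin_days.
    """
    diff = np.asarray(abs_err_sweep) - np.asarray(abs_err_biometry)
    boot_means = np.array(
        [rng.choice(diff, size=diff.size, replace=True).mean() for _ in range(n_boot)]
    )
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
    return diff.mean(), (ci_low, ci_high), ci_high < margin_days
```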
We conducted a supplemental analysis of GA model prediction error with mixed effects regression on all test data, combining sonographer-acquired and novice-acquired test sets. Fixed effect terms accounted for the ground truth GA, the type of ultrasound machine used (standard vs. low cost), and the training level of the ultrasound operator (sonographer vs. novice). All patient studies were included in the analysis, and random effects terms accounted for intra-patient and intra-study effects.
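A minimal sketch of such a regression using statsmodels is shown below, assuming a long-format table with one row per study; the column names, file name, and random-effects specification are illustrative assumptions rather than the study's actual analysis code.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format table: one row per blind-sweep study, with columns
# abs_error_days, ga_weeks, device ("standard"/"low_cost"),
# operator ("sonographer"/"novice"), patient_id, study_id.
df = pd.read_csv("ga_model_errors.csv")

model = smf.mixedlm(
    "abs_error_days ~ ga_weeks + C(device) + C(operator)",   # fixed effects
    data=df,
    groups="patient_id",                                      # random intercept per patient
    vc_formula={"study": "0 + C(study_id)"},                  # study visits nested in patients
)
result = model.fit()
print(result.summary())
```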
GA analysis results are summarized in Table 1. The MAE for the GA model estimate with blind sweeps collected by sonographers using standard ultrasound devices was significantly lower than the MAE for the standard fetal biometry estimates (mean difference −1.4 ± 4.5 days, 95% CI −1.8, −0.9 days). There was a trend toward increasing error for both the blind-sweep and standard fetal biometry procedures with advancing gestational age (Fig. 2a).
n = 407 study participants, blind sweeps performed by expert sonographers. a Blind-sweep procedure and standard fetal biometry procedure absolute error versus ground truth gestational age (4-week windows). Box indicates 25th, 50th, and 75th percentile absolute error, and whiskers indicate 5th and 95th percentile absolute error. b Error distributions for the blind-sweep procedure and standard fetal biometry procedure. c Paired errors for blind-sweep and standard fetal biometry estimates in the same study visit. The errors of the two methods exhibit correlation, but the worst-case errors for the blind-sweep procedure have a lower magnitude than those of the standard fetal biometry method. d Video sequence feedback-score calibration on the test sets. The realized model estimation error on held-out video sequences decreases as the model’s feedback score increases. A thresholded feedback score may be used as a user feedback signal to redo low-quality blind sweeps. Box indicates 25th, 50th, and 75th percentile of absolute errors, and whiskers indicate the 5th and 95th percentile absolute error.
The accuracy of the fetal malpresentation model for predicting noncephalic fetal presentation from third-trimester blind sweeps was assessed using a reference standard determined by sonographers using traditional ultrasound imaging (described above). We selected the latest study visit in the third trimester for each patient. Data from sweeps performed by sonographers and novices were analyzed separately. We evaluated the fetal malpresentation model’s area under the receiver operating characteristic curve (AUC-ROC) on the test set, in addition to noncephalic sensitivity and specificity.
The fetal malpresentation model attained an AUC-ROC of 0.977 (95% CI 0.949, 1.00), sensitivity of 0.938 (95% CI 0.848, 0.983), and specificity of 0.973 (95% CI 0.955, 0.985) (Table 2 and Fig. 3).
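For reference, the metrics reported above can be computed as in the following sketch, assuming per-participant noncephalic probabilities and a fixed operating point chosen on the tuning set; this is an illustration rather than the study's evaluation code, and confidence intervals are omitted.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_malpresentation(y_true, y_score, threshold):
    """AUC-ROC plus sensitivity/specificity at a predefined operating point.

    y_true: 1 = noncephalic presentation (sonographer reference), 0 = cephalic.
    y_score: model probability of noncephalic presentation, one value per participant.
    """
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    auc = roc_auc_score(y_true, y_score)
    sensitivity = np.mean(y_pred[y_true == 1] == 1)  # true positive rate on noncephalic cases
    specificity = np.mean(y_pred[y_true == 0] == 0)  # true negative rate on cephalic cases
    return auc, sensitivity, specificity
```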

n = 623 study participants. Receiver operating characteristic (ROC) curves for fetal malpresentation estimation. Crosses indicate the predefined operating point selected from the tuning dataset. a ROC comparison based on the type of device: low-cost and standard. b ROC comparison based on the type of ultrasound operator: novices and sonographers.
Generalization of GA and malpresentation estimation to novices
Our models were trained on up to 15 blind sweeps per study performed by sonographers; no novice-acquired blind sweeps were used to train our models. We assessed GA model generalization to blind sweeps acquired by novice operators, who performed six sweeps per study, and compared the MAE of novice-performed blind-sweep AI estimates with that of standard fetal biometry. For the malpresentation model, we reported the AUC-ROC for blind sweeps performed by novices, along with the sensitivity and specificity at the same operating point used for evaluating blind sweeps performed by sonographers.
In this novice-acquired dataset, the difference in MAE between blind-sweep AI estimates and standard fetal biometry was −0.6 days (95% CI −1.7, 0.5), indicating that sweeps performed by novice operators provide a GA estimate that is non-inferior to standard fetal biometry. Table 1 provides novice blind-sweep performance analyzed by ultrasound device type. The mixed effects regression error analysis did not indicate a significant association between GA error magnitude and the type of operator conducting the blind sweep (P = 0.119).
Fetal malpresentation estimation using novice-acquired blind sweeps was compared to the sonographer’s determination on 189 participants (21 malpresentations), and the AUC-ROC was 0.992 (95% CI 0.983, 1.0). At the preselected operating point, sensitivity was 1.0 (95% CI 0.839, 1.0) and specificity was 0.952 (95% CI 0.908, 0.979).
Performance of low-cost ultrasound device in GA and fetal malpresentation estimation
GA model estimation using blind sweeps acquired with the low-cost ultrasound device was compared against the clinical standard on the combined novice-acquired and sonographer-acquired test sets, using the same a priori non-inferiority criterion of 1.0 day described above. For the malpresentation model, we reported AUC-ROC by type of ultrasound device, along with sensitivity and specificity at the same operating point discussed above.
On the combined test sets, the blind-sweep AI system using the low-cost device had an MAE of 3.98 ± 3.54 days versus 4.17 ± 3.74 days for the standard fetal biometry estimates (mean difference −0.21 ± 4.21 days, 95% CI −0.87, 0.44), which meets the criterion for non-inferiority.
Paired GA estimates for blind sweeps acquired with both a standard ultrasound device and the low-cost device were available for some study participants in the combined test set (n = 155 participants). The MAE difference between blind sweeps performed with the low-cost and standard devices was 0.45 days (95% CI 0.0, 0.9). The mixed effects regression showed that use of the low-cost device was associated with increased error magnitude (P = 0.001), although the estimated effect was only 0.67 days.
Fetal malpresentation estimation using blind sweeps acquired with the low-cost ultrasound device was compared against the sonographer’s determination on the combined novice-acquired and sonographer-acquired test sets (213 participants, 29 malpresentations). The blind-sweep AI system had an AUC-ROC of 0.97 (95% CI 0.944, 0.997). At the preselected operating point, sensitivity was 0.931 (95% CI 0.772, 0.992) and specificity was 0.94 (95% CI 0.896, 0.970).
Simplified sweep evaluation
Protocols consisting of fewer sweeps than the standard six sweeps (Fig. 1b) may simplify clinical deployment. We selected the M and R sweep types as the best-performing set of two sweeps on the tuning set and evaluated this reduced protocol on the test sets.
On test set sweeps performed by sonographers, the reduced protocol of just the M and R sweep types (Fig. 1b) was sufficient to maintain non-inferiority of the blind-sweep protocol relative to the standard fetal biometry estimates (MAE difference 95% CI: −1.5, −0.69 days). The reduced protocol was also sufficient to maintain non-inferiority of blind sweeps relative to standard fetal biometry on test set examinations performed by novices (MAE difference 95% CI: −1.19, 0.67 days). On average, the reduced protocol can be completed in 20.1 s, as extrapolated from videos collected from novices (see Supplementary Table 2). MAE across subgroups using the reduced protocol is provided in Table 1 (last row).
Feedback-score evaluation
Our GA model provided a feedback score to evaluate the suitability of a video sequence for GA estimation. The GA model computed the feedback score for 24-frame video sequences (about one second in length) and therefore provided a semi-continuous feedback signal across the duration of a typical 10-s long blind sweep. The feedback score takes the form of an inverse-variance estimate and can be used to weight and aggregate GA predictions across blind-sweep video sequences during a study visit. All GA results were computed using this inverse-variance-weighted aggregation method. More details are provided in “Methods”.
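The aggregation step can be sketched as follows, assuming each roughly one-second clip yields a GA prediction and a feedback score interpreted as an inverse-variance weight; clip selection and any feedback-score thresholding are omitted, and the function name is illustrative.

```python
import numpy as np

def aggregate_ga_predictions(clip_ga_days, clip_feedback_scores):
    """Combine per-clip GA predictions into one study-level estimate.

    clip_ga_days: predicted GA (days) for each 24-frame video sequence.
    clip_feedback_scores: per-clip feedback scores treated as inverse-variance weights,
    so higher-scoring (more reliable) clips contribute more to the final estimate.
    """
    weights = np.asarray(clip_feedback_scores, dtype=float)
    preds = np.asarray(clip_ga_days, dtype=float)
    return float(np.sum(weights * preds) / np.sum(weights))
```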
As expected, video sequences with high feedback score had low MAE when compared against ground truth GA, and low feedback-score video sequences had high MAE compared against ground truth GA. Figure 2d indicates the calibration of the feedback score on the held-out test datasets. Supplementary Fig. 2c shows example blind-sweep video frames from high and low feedback-score video sequences. The feedback score qualitatively aligns with the degree to which the fetus is visible in the video clip, with the high feedback score left and center-left examples showing the fetal abdomen and head (respectively). In contrast, the fetus is not visible in the low feedback-score examples (center-right and right).
Run-time evaluation on mobile phones
Our blind-sweep AI models were designed for near real-time inference on modern mobile phones or standard computers, eliminating waiting time during the clinical examination procedure. We measured both GA and fetal malpresentation model run-time performance using 10-s long blind-sweep videos, chosen to match the average length of novice blind sweeps (Supplementary Table 2). Videos were streamed to an Android test application running on Google Pixel 3 and 4, Samsung Galaxy S10, and Xiaomi Mi 9 phones (examples of Android phones that can be purchased refurbished for less than $250 USD). Videos were processed by the GA and fetal malpresentation models simultaneously, with both models executed in the same TensorFlow Lite run-time environment. All necessary image preprocessing operations were also included in the benchmark.
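The measurement can be approximated off-device with the sketch below, which streams fixed-size clips through a TensorFlow Lite interpreter in Python; the model file name, input handling, and clip count are illustrative assumptions, and the study's benchmark ran inside an Android application with GPU or CPU acceleration.

```python
import time
import numpy as np
import tensorflow as tf

# Load a (hypothetical) exported GA model and prepare the TensorFlow Lite interpreter.
interpreter = tf.lite.Interpreter(model_path="ga_model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Stand-in for a preprocessed 24-frame clip matching the model's expected input shape.
clip = np.random.rand(*inp["shape"]).astype(np.float32)

n_clips = 10  # roughly the number of ~1 s clips in a 10-s blind sweep
start = time.perf_counter()
for _ in range(n_clips):
    interpreter.set_tensor(inp["index"], clip)
    interpreter.invoke()
    _ = interpreter.get_tensor(out["index"])
elapsed = time.perf_counter() - start
print(f"Mean per-clip latency: {1000 * elapsed / n_clips:.1f} ms")
```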
Our results indicated that combined estimates from both models are available between 0.2 and 1.0 s on average after the completion of a blind sweep on devices with a graphics processing unit (GPU), and between 1.5 and 2.5 s on average after completion on devices with neural network acceleration libraries for standard CPU processors. See Table 3 for complete benchmark results.