Data Quality Assessment

Data Quality Assessment (DQA) checks are implemented at two levels to ensure data completeness, consistency, and accuracy for the HTS 001 Client Intake Form.

Two-layer architecture: client-side validation catches errors at the point of collection, before the image is submitted. Server-side validation catches errors that are visible only at the population level, or that the field worker skipped on the device.


Part 1 — Client-Side Data Validation

Validation rules run on the device (phone or tablet) in real time as the ScanForm app processes the photographed form page. These checks fire before the record is submitted, so errors are caught at the point of collection rather than after upload.

Important constraints of client-side validation:

  • Validates OCR-detected bubble selections and filled digit boxes on the scanned image.
  • The app prompts the field worker to correct the paper form and retake the photo if a check fails; the worker may skip after the minimum number of retries.
  • Does not have access to the full dataset — cross-record and longitudinal checks must be done server-side.
  • b.set_min_tries_before_skipping(3) is configured for this form: the app will ask the worker to retry up to 3 times before allowing the check to be dismissed.

Because workers can eventually bypass checks, server-side validation is always required as a second layer.


Row Eligibility

A form is considered active (i.e. validation rules apply) when:

At least one OCR-active field on the form is non-empty AND the discardPage2 bubble (“Discard both pages”) is not marked.

This logic prevents the system from firing validation errors on intentionally blank or spoiled forms that have been marked for discard. If the discard bubble is marked, all checks are suppressed and both pages are excluded from the dataset.
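
The eligibility rule above can be sketched in Python as follows. This is an illustration only: the function name is_form_active and the dictionary representation of the OCR output are assumptions made for this example, not part of the ScanForm app or data_validation.py.

```python
def is_form_active(ocr_fields: dict[str, str], discard_marked: bool) -> bool:
    """Return True when validation rules should apply to the form.

    ocr_fields maps (hypothetical) field names to their OCR-detected values;
    an empty or whitespace-only string means nothing was detected.
    discard_marked is True when the "Discard both pages" bubble is filled.
    """
    # At least one OCR-active field must contain something.
    any_filled = any(value.strip() for value in ocr_fields.values())
    # A marked discard bubble suppresses every check.
    return any_filled and not discard_marked
```

Under these assumptions, a completely blank page and a page with the discard bubble marked both return False, so no validation errors fire for them.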


Active Checks


Check Types Explained

Validation Check Types

| Check Type | Description | Fields Using It |
| --- | --- | --- |
| enough_filled | At least one box or bubble in the field is filled. Catches blank mandatory fields. | dateVisit, modality, providerId |
| exactly_one | Exactly one bubble is selected (both enough_answers AND not_too_many checked together). Enforces hard single-select fields. | dateVisit, referredFrom, testingSetting, previousTestedHIVNegative, all Section B items, hivTestResult |
| not_too_many | No more than one bubble is selected. Allows the field to be blank but prevents accidental multi-selection. | All optional select-one fields (Section C, D, Post-Test, demographics) |
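
As a rough illustration of how the three check types relate, the sketch below models each field as a list of booleans, one per box or bubble. The functions reuse the check-type names for readability, but they are plain Python written for this example and are not the ScanForm implementation.

```python
def enough_filled(marks: list[bool]) -> bool:
    """At least one box or bubble in the field is filled."""
    return any(marks)


def not_too_many(marks: list[bool]) -> bool:
    """No more than one bubble is selected; a blank field passes."""
    return sum(marks) <= 1


def exactly_one(marks: list[bool]) -> bool:
    """Both conditions together: the field is filled and single-select."""
    return enough_filled(marks) and not_too_many(marks)
```

With this model, a blank field passes not_too_many but fails exactly_one, which is the difference between an optional and a mandatory single-select field.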

Check Count Summary

Client-Side Check Count

| Category | Checks |
| --- | --- |
| enough_filled (required fields) | 3 |
| exactly_one (paired enough + not_too_many) | 22 |
| not_too_many only (optional single-select) | 47 |
| Discard / eligibility control | 1 |
| **Total active checks** | **73** |

Fields Not Validated Client-Side

The following fields exist on the form but have no client-side validation rule. They are commented out in data_validation.py. Errors in these fields are caught server-side, or are considered acceptable to leave unvalidated due to their optional or administrative nature.


Part 2 — Server-Side Data Validation

Server-side DQA checks run automatically each time the data pipeline executes against the full submitted dataset. Unlike client-side validation, these checks can compare across records and enforce population-level rules.

Capabilities beyond client-side:

  • Applies to all submitted records, including those where client-side warnings were dismissed after 3 retries.
  • Can validate OCR-extracted text values (e.g. checking that a digit box parses as a valid integer or date).
  • Can enforce clinical plausibility rules requiring full cohort context.
  • Can detect duplicate submissions and cross-form linkage issues.

Warning

No SQL pipeline configuration was provided for this form. The server-side check catalogue below documents the checks that are expected based on the form’s OCR models, field types, and clinical logic — but the exact SQL implementation has not yet been supplied. This section will be updated automatically when a checks.sql file is available.


Record Drop Logic

Warning

Records are excluded from the clean dataset — not deleted. Excluded records remain visible in the DQA dashboard and can be corrected and re-entered.

A record is dropped from the clean layer when:

  • It has one or more checks with severity Error.
  • The discardPage2 bubble (“Discard both pages”) is marked, in which case both pages are excluded.
  • Both pages are present but cannot be linked (mismatched Client Code or Data Matrix ID).

Missing or incomplete pages: If only one of the two pages is successfully scanned, the entire form record is flagged as incomplete. Fields from the missing page are treated as missing and the record is excluded from analyses requiring those fields.
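
The drop rules above can be summarised in a short Python sketch. Since no SQL implementation has been supplied, everything here is an assumption for illustration: the function name should_drop, the use of a single linkage key per page (standing in for the Client Code or Data Matrix ID), and the convention that a missing page is represented by None.

```python
from typing import Optional


def should_drop(
    severities: list[str],
    discard_marked: bool,
    page1_key: Optional[str],
    page2_key: Optional[str],
) -> bool:
    """Return True when a record is excluded from the clean layer.

    severities: severity of each failed check, e.g. ["Warning", "Error"].
    page1_key / page2_key: hypothetical linkage key read from each page,
    or None when that page was not scanned.
    """
    if discard_marked:
        return True  # discard bubble marked: both pages excluded
    if "Error" in severities:
        return True  # one or more hard-rule violations
    both_present = page1_key is not None and page2_key is not None
    if both_present and page1_key != page2_key:
        return True  # pages present but cannot be linked
    # A single missing page is flagged as incomplete elsewhere, not dropped here.
    return False
```

Records with only Warning-level findings return False and therefore stay in the clean dataset, flagged for review.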


Error Classification

Severity Levels

| Severity | Meaning | Action |
| --- | --- | --- |
| Error | The record violates a hard rule. Records with ≥1 Error are excluded from the clean dataset and visible in the DQA dashboard for correction. | Exclude from analysis · Flag for re-entry |
| Warning | The record is unusual but not necessarily wrong. It is kept in the clean dataset but flagged for review. | Keep in analysis · Flag for review |

Checks by Category

  • Missing Mandatory Values — Flags records where required fields (Date of Visit, Referred From, Setting, Modality, Section B items, HIV Test Result, Provider ID) are empty after OCR processing.
  • Invalid Numbers — Checks that digit-box fields (age, scores, counts, modality code, Provider ID) can be parsed as integers and fall within expected ranges.
  • Invalid Dates — Checks that date fields (Date of Visit, Completion Date) parse as valid calendar dates and are not implausibly far in the past or future.
  • Multiple Selections on Single-Select Fields — Flags records where more than one bubble is marked for any select-one question (catching forms where the client-side not_too_many check was skipped).
  • Score Consistency — Checks that manually entered summary scores (Knowledge Assessment, Personal Risk, TB Screening, STI Screening, Sex Partner Risk) are consistent with the sum of their component items.
  • Clinical Plausibility — Checks that clinically linked fields are internally consistent (e.g. breastfeeding fields not marked for male clients; recency test result present when HIV Test Result = Positive).
  • Duplicate Detection — Identifies records sharing the same Client Code submitted more than once within a short window.
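
To make several of these categories concrete, the sketch below expresses them as small Python functions. The production checks are expected to live in SQL, which has not been supplied, so this is only an illustration of the logic. All function names, the YYYY-MM-DD date format, the 365-day look-back, and the 7-day duplicate window are assumptions chosen for the example, not values taken from the form specification.

```python
from datetime import date, datetime, timedelta
from typing import Optional


def parse_int_in_range(text: str, low: int, high: int) -> Optional[int]:
    """Invalid Numbers: return the integer, or None if unparseable or out of range."""
    cleaned = text.strip()
    if not (cleaned.isascii() and cleaned.isdigit()):
        return None  # OCR produced something other than plain digits
    value = int(cleaned)
    return value if low <= value <= high else None


def is_plausible_date(text: str, today: date, max_age_days: int = 365) -> bool:
    """Invalid Dates: a valid calendar date, not in the future, not too old."""
    try:
        parsed = datetime.strptime(text.strip(), "%Y-%m-%d").date()
    except ValueError:
        return False  # not a real calendar date (e.g. 30 February)
    return today - timedelta(days=max_age_days) <= parsed <= today


def score_is_consistent(reported_total: int, component_items: list[int]) -> bool:
    """Score Consistency: the summary score equals the sum of its items."""
    return reported_total == sum(component_items)


def find_duplicate_codes(
    submissions: list[tuple[str, date]], window_days: int = 7
) -> set[str]:
    """Duplicate Detection: Client Codes submitted twice within window_days."""
    by_code: dict[str, list[date]] = {}
    for code, submitted in submissions:
        by_code.setdefault(code, []).append(submitted)
    duplicates: set[str] = set()
    for code, dates in by_code.items():
        dates.sort()
        # Compare each submission with the next one in chronological order.
        if any((b - a).days <= window_days for a, b in zip(dates, dates[1:])):
            duplicates.add(code)
    return duplicates
```

Each function returns a plain value rather than raising, so a caller could map failures to the Error or Warning severity defined for the corresponding check.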

Expected Check Catalogue

Note

This catalogue documents expected server-side checks derived from the form specification. Severity assignments follow standard pipeline conventions. This table will be replaced with the live implementation once a SQL checks model is available.


Summary

Expected Server-Side Checks by Category

| Category | Errors | Warnings |
| --- | --- | --- |
| Missing Mandatory Values | 5 | 0 |
| Invalid Numbers | 1 | 7 |
| Invalid Dates | 1 | 2 |
| Multiple Selections | 11 | 0 |
| Score Consistency | 0 | 5 |
| Clinical Plausibility | 0 | 4 |
| Duplicate Detection | 0 | 1 |
| **Total** | **18** | **19** |

Generated automatically from HTS 001 Client Intake Form v1.6 source files. Last updated: 2026-06-30.