19  Data Management

Behind every clinical trial stands a vast apparatus of data: patient medical histories, laboratory values, imaging results, patient-reported outcomes, adverse event records, protocol deviations, and countless other observations. Converting this raw information into reliable, analyzable data is the province of clinical data management.

The stakes are high. Regulatory decisions affecting millions of patients rest on whether the data can be trusted. The entire edifice of evidence-based medicine assumes that the numbers in clinical databases reflect what actually happened to patients. Data management provides that assurance.

Data in a clinical trial flows through a carefully orchestrated pipeline, illustrated in Figure 19.1.

```mermaid
flowchart TB
    subgraph Row1 [ ]
        direction LR
        A["Clinical Observation"] --> B["Source Document"] --> C["CRF/EDC Entry"] --> D["Edit Checks"] --> E["Query Generation"]
    end
    subgraph Row2 [ ]
        direction LR
        F["Site Response"] --> G["Query Resolution"] --> H["Medical Review"] --> I["Database Lock"] --> J["Analysis Dataset"]
    end
    Row1 --> Row2
```
Figure 19.1: Clinical trial data flow from point of collection through database lock

It begins at the point of collection—when a nurse draws blood, when a patient completes a questionnaire, when a physician conducts an examination. At this stage, data exists as raw observations: a tube of blood, marks on paper, clinical impressions.

Observations are then captured—entered into the trial’s data collection system. Historically, this meant transcribing information onto paper case report forms (CRFs); today, electronic data capture (EDC) systems are standard, with site staff entering data directly into validated computer systems.

Once captured, data is cleaned—reviewed for errors, inconsistencies, missing values, and implausibilities. Queries are issued to sites asking them to verify or correct questionable data. This iterative cleaning process continues until the data is judged acceptably clean.

Finally, data is locked and extracted for statistical analysis. Database lock is a formal milestone: after lock, no further changes are made, and the data that will support the regulatory submission is fixed.

19.1 Electronic Data Capture

EDC systems have transformed data management. These validated software platforms provide electronic forms that mirror protocol requirements, capturing each data point as site staff enter it.

Edit checks programmed into EDC systems catch many errors at the point of entry. If a patient’s birth date suggests they are 200 years old, or if a systolic blood pressure is entered as 12 rather than 120, the system flags the discrepancy immediately. This front-end validation dramatically improves data quality compared to paper-based processes where errors might not be discovered for months.
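Front-end validation of this kind amounts to small, deterministic checks run at the moment of entry. A minimal sketch, with illustrative field names and plausibility limits not drawn from any particular EDC product:

```python
from datetime import date

# Hypothetical front-end edit checks of the kind an EDC system runs at entry
# time. Field names and limits here are illustrative assumptions.

def check_age(birth_date: date, visit_date: date) -> list[str]:
    """Flag implausible ages computed from the entered birth date."""
    age = (visit_date - birth_date).days / 365.25
    if age < 0:
        return ["Birth date is after the visit date"]
    if age > 120:
        return [f"Computed age {age:.0f} exceeds plausibility limit (120)"]
    return []

def check_sbp(sbp: float) -> list[str]:
    """Range check for systolic blood pressure (mmHg)."""
    if not 60 <= sbp <= 260:
        return [f"Systolic BP {sbp} outside plausible range 60-260 mmHg"]
    return []

# A value of 12 (likely a dropped digit from 120) is flagged immediately:
print(check_sbp(12))   # flags the discrepancy
print(check_sbp(120))  # []
```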

EDC also provides real-time visibility into trial data. Sponsors can monitor enrollment, track data completeness, and identify sites with quality issues much more rapidly than was possible with paper systems.

Audit trails automatically log every entry and modification along with timestamps and user identities. This creates an unalterable record of how the data evolved—required for demonstrating compliance with regulatory requirements and for investigating any questions that arise.

19.2 Data flow beyond EDC: external sources, vendor pipelines, and transfer specifications

The traditional operational framing of data flow—especially the contrast between bolus and continuous flow, and the emphasis on timeliness for monitoring and trial control—remains conceptually useful even as the technology stack has changed (Meinert 2013). In contemporary trials, however, “data flow” is no longer limited to site-entered CRFs/EDC. It typically includes multiple externally generated streams (central laboratories, imaging core labs, ePRO/eCOA platforms, wearable/sensor-derived endpoints, and real-world data sources), each with distinct provenance, latency, and error modes (U.S. Food and Drug Administration 2023c, 2023b, 2024b).

From a quality and regulatory standpoint, the central problem is not merely ingestion; it is traceable control across systems: documented interfaces, controlled transformations, and auditability sufficient to support reconstruction of what was collected, when, by whom/what system, and how it was transformed into analysis-ready datasets (U.S. Food and Drug Administration 2023a, 2024a).

Checklist: Data Transfer Specification (DTS)

For each external source (device vendor, central lab, imaging core, ePRO platform, EHR/claims provider), a defensible DTS should specify:

  • File format, structure, and naming conventions for each transfer.
  • Transfer schedule and mechanism, including how retransmissions and cumulative updates are handled.
  • Variable-level definitions: names, data types, units, and agreed code lists, with version control.
  • Conventions for missing, partial, and corrected records.
  • Reconciliation procedures and acceptance criteria against the trial database.
  • Security controls and named points of contact for transfer failures.
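One way to make such a specification enforceable is to express the agreed file structure as a machine-checkable schema and validate every incoming transfer against it. A sketch with hypothetical field names, types, and code lists:

```python
# Illustrative sketch: part of a Data Transfer Specification expressed as a
# machine-checkable schema. Field names, types, and code lists are hypothetical.

DTS = {
    "SUBJID":   {"type": str,   "required": True},
    "LBTEST":   {"type": str,   "required": True,
                 "codelist": {"ALT", "AST", "CREAT"}},
    "LBORRES":  {"type": float, "required": True},
    "LBORRESU": {"type": str,   "required": True},
}

def validate_record(record: dict) -> list[str]:
    """Check one incoming vendor record against the DTS schema."""
    issues = []
    for name, spec in DTS.items():
        if name not in record or record[name] in (None, ""):
            if spec["required"]:
                issues.append(f"{name}: missing required value")
            continue
        value = record[name]
        if not isinstance(value, spec["type"]):
            issues.append(f"{name}: expected {spec['type'].__name__}, "
                          f"got {type(value).__name__}")
        elif "codelist" in spec and value not in spec["codelist"]:
            issues.append(f"{name}: '{value}' not in agreed code list")
    return issues

ok  = {"SUBJID": "001", "LBTEST": "ALT", "LBORRES": 34.0, "LBORRESU": "U/L"}
bad = {"SUBJID": "002", "LBTEST": "ALP", "LBORRES": "34"}
assert validate_record(ok) == []
assert len(validate_record(bad)) == 3  # bad code, wrong type, missing unit
```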

Digital Health Technologies (DHTs): operational controls

When endpoints depend on DHT-derived measurements, the data management plan should address device selection and performance, participant usability and training, data completeness monitoring, and the integrity of transmission and storage processes. These controls are part of endpoint validity, not merely IT hygiene (U.S. Food and Drug Administration 2023c).

Real-world data (EHR and claims): fitness-for-purpose documentation

When EHR or claims-derived datasets are used to support regulatory decision-making, sponsors should document data provenance, capture processes, missingness and measurement error, and the relationship between clinical care documentation and research variables. The governing question is whether the dataset is fit for the stated regulatory purpose, with traceability and transparency sufficient for review (U.S. Food and Drug Administration 2024b, 2023b).

Common Data Elements (CDEs): standardization for interoperability and AI-readiness

Classic operational texts emphasized bespoke form design tailored to each trial’s specific requirements (Meinert 2013). While protocol-specific customization remains necessary, the modern trend is toward Common Data Elements—standardized variable definitions, value sets, and collection instruments that enable data pooling, cross-study comparison, and machine learning applications.

The NIH CDE Repository provides curated element definitions across therapeutic areas, allowing researchers to adopt pre-validated instruments rather than reinventing measurements. CDISC standards (CDASH for collection, SDTM for tabulation, ADaM for analysis) provide the structural framework that regulatory submissions now routinely require. The strategic benefit extends beyond compliance: trials that use CDEs can more readily contribute to meta-analyses, external control arms, and AI-driven signal detection—capabilities that custom-designed data structures often preclude.

Checklist: CDE Selection and Implementation
  • Identify applicable CDE repositories: NIH CDE Repository, CDISC Therapeutic Area User Guides, disease-specific consortia standards.
  • Map protocol endpoints to existing CDEs: prioritize validated instruments for primary and key secondary endpoints; document rationale for any custom elements.
  • Specify controlled terminology: use standard dictionaries (MedDRA for adverse events, WHODrug for medications, SNOMED CT or ICD for diagnoses) with explicit version control.
  • Plan CDISC mapping early: define CDASH annotation during CRF design rather than retrofitting SDTM mapping at database lock.
  • Document deviations: where protocol requirements preclude standard CDEs, document the deviation and maintain a crosswalk to related standards for future interoperability.
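The crosswalk described in the last item can be as simple as a lookup table that also pins dictionary versions. A sketch with hypothetical CRF field names and example dictionary versions:

```python
# Illustrative crosswalk from CRF collection fields (CDASH-style names) to
# SDTM variables, with coding-dictionary versions pinned as the checklist
# suggests. The field names and versions shown are hypothetical examples.

CROSSWALK = {
    # CRF field     -> (SDTM domain, SDTM variable)
    "AE_TERM":       ("AE", "AETERM"),
    "AE_START_DATE": ("AE", "AESTDTC"),
    "CM_DRUG_NAME":  ("CM", "CMTRT"),
}

DICTIONARIES = {
    "AE": {"dictionary": "MedDRA",  "version": "26.1"},
    "CM": {"dictionary": "WHODrug", "version": "2023-09"},
}

def map_record(crf_record: dict) -> dict:
    """Return an SDTM-shaped record annotated with its coding dictionary."""
    out = {}
    for crf_field, value in crf_record.items():
        domain, var = CROSSWALK[crf_field]
        out.setdefault(domain, {"_dictionary": DICTIONARIES[domain]})[var] = value
    return out

mapped = map_record({"AE_TERM": "Headache", "AE_START_DATE": "2024-02-10"})
assert mapped["AE"]["AETERM"] == "Headache"
assert mapped["AE"]["_dictionary"]["version"] == "26.1"
```

Pinning the dictionary version alongside the mapping is what makes later pooling defensible: two studies coded under different MedDRA versions cannot be combined without an explicit version reconciliation step.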

19.3 Data Quality Standards

What makes data “good”? Several dimensions of quality matter.

Accuracy means that recorded data reflects what actually happened. A blood pressure of 140/90 was actually 140/90, not misread or mistranscribed.

Completeness means that all required data has been collected. Missing values may be unavoidable (a patient may miss a visit), but missing data should be documented and explained.

Consistency means that data makes sense internally. If a subject is recorded as male on one form and female on another, something is wrong.

Timeliness means that data is entered promptly. Data entered long after collection is more likely to contain errors due to memory lapses or lost source documents.

Legibility and traceability mean that the data can be read and that its origin can be determined. For electronic data, this means clear audit trails; for paper documents, it means legible handwriting and proper corrections (single-line strikethrough with initials and date, not obliteration).

The principle of ALCOA captures these standards, summarized in Table 19.1.

Table 19.1: ALCOA Data Integrity Principles
| Principle | Definition | Examples | Common Failures |
|---|---|---|---|
| Attributable | Who recorded it? | User ID, signature, initials | Shared login; unsigned entries |
| Legible | Can it be read? | Clear handwriting; readable font | Illegible corrections; truncated fields |
| Contemporaneous | Recorded when observed | Timestamp matches activity | Backdated entries; batched data entry |
| Original | First record of data | Source document; certified copy | Transcribed from informal notes |
| Accurate | Truthful and correct | Matches source document | Transcription errors; rounding |

Modern standards extend ALCOA to ALCOA+, adding: Complete (nothing missing), Consistent (same format over time), Enduring (readable throughout retention), and Available (retrievable for inspection).

19.4 Data Cleaning and Query Management

Despite best efforts at the point of collection, data inevitably contains errors and inconsistencies. Data cleaning systematically identifies and resolves these issues.

Data management personnel run programmed checks against the database: cross-field validations (is the hospitalization date after the enrollment date?), range checks (is this laboratory value physiologically plausible?), and protocol logic checks (if the patient was randomized to arm A, should they have received arm B at this visit?).
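These programmed checks are straightforward to express as functions over subject records. A sketch covering the three check types just mentioned, with illustrative field names and limits:

```python
from datetime import date

# Sketch of programmed batch checks: cross-field, range, and protocol-logic
# validations over subject records. Field names and limits are illustrative.

def run_checks(rec: dict) -> list[str]:
    queries = []
    # Cross-field: hospitalization cannot precede enrollment
    if rec["hosp_date"] and rec["hosp_date"] < rec["enroll_date"]:
        queries.append("Hospitalization date precedes enrollment date")
    # Range: physiologic plausibility
    if not 1.0 <= rec["hemoglobin"] <= 25.0:
        queries.append(f"Hemoglobin {rec['hemoglobin']} g/dL outside plausible range")
    # Protocol logic: treatment dispensed must match randomized arm
    if rec["dispensed_arm"] != rec["randomized_arm"]:
        queries.append("Dispensed treatment does not match randomized arm")
    return queries

rec = {"enroll_date": date(2024, 3, 1), "hosp_date": date(2024, 2, 15),
       "hemoglobin": 0.4, "randomized_arm": "A", "dispensed_arm": "B"}
assert len(run_checks(rec)) == 3  # all three checks fire for this record
```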

When a potential error is identified, a query is generated—a formal request to the site to verify or correct the data. The query specifies the discrepancy and asks the site to respond. Site staff review their source documents (the original medical records) and either correct the data or explain why it is correct as recorded.

Query resolution is an iterative dialogue. Some queries are resolved immediately; others require investigation. The number of open queries and the time to resolution are key data management metrics.
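Both metrics fall out directly from query open and resolve dates. A minimal sketch over made-up query records:

```python
from datetime import date
from statistics import median

# Sketch of the two metrics mentioned above: open-query count and time to
# resolution. The query records here are made-up illustrations.

queries = [
    {"opened": date(2024, 1, 5),  "resolved": date(2024, 1, 8)},
    {"opened": date(2024, 1, 10), "resolved": date(2024, 1, 25)},
    {"opened": date(2024, 2, 1),  "resolved": None},  # still open
]

open_count = sum(1 for q in queries if q["resolved"] is None)
resolution_days = [(q["resolved"] - q["opened"]).days
                   for q in queries if q["resolved"] is not None]

print(f"Open queries: {open_count}")                            # 1
print(f"Median days to resolution: {median(resolution_days)}")  # 9.0
```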

19.5 Source Data and Verification

The trial database is not the ultimate source of truth—the source documents are. Source documents are the original records where clinical observations first appear: medical records, laboratory printouts, imaging reports, patient diaries.

Source data verification (SDV) is the process by which monitors compare the trial database against source documents to ensure that transcription has been accurate. Historically, SDV was exhaustive—100% of key variables for 100% of patients. More recently, risk-based monitoring approaches target SDV toward high-risk data and sites, recognizing that exhaustive verification is expensive and that some data points are more important than others.

The relationship between source documents and the trial database must be clear and auditable. For any value in the database, it should be possible to trace back to the source where that value was originally recorded.

Beyond standard electronic data capture, several specialized data streams introduce unique operational challenges. Central laboratory data, for instance—generated at specialized facilities rather than at the individual clinical sites—ensures high consistency but requires rigorous coordination of specimen logistics and data integration. Similarly, electronic patient-reported outcomes (ePRO) allow for direct capture of data from participants’ own devices, which minimizes transcription errors while shifting the focus to patient compliance and technical device management. Integrating external sources like medical imaging archives, wearable device data, or third-party databases further complicates data management, as these must be harmonized with trial data under strict quality standards. Ultimately, every data transfer between systems—whether from site EMRs to trial databases or between specialized laboratory platforms—must be validated to ensure that information remains intact as it moves through the trial’s ecosystem.

As a trial nears completion, the data management effort culminates in the database lock. This is a critical transition where all queries must be resolved, programmed cleaning must be finalized, and medical reviews must confirm the accuracy of safety coding and adverse event records. Once the database is locked, it is frozen—it can no longer be modified, and any errors discovered afterward must be documented through administrative procedures rather than direct data edits. The locked data is then transferred to biostatisticians for analysis, a transfer that must be rigorously validated to ensure the final clinical study report is based on a complete, accurate, and immutable dataset.

19.6 Data Standards and Traceability

Regulatory agencies require that submission data be provided in standardized formats that enable efficient review and cross-study comparisons. Understanding these standards—and the traceability chain from source to submission—is essential for data management professionals.

The data management professional must ensure that all trial data adheres to international standards developed by the Clinical Data Interchange Standards Consortium (CDISC). These standards, which are now regulatory requirements for submissions to the FDA and PMDA, ensure that clinical data is consistent, interoperable, and reviewable. The Study Data Tabulation Model (SDTM) provides the foundation by organizing raw clinical data into standardized domains—such as demographics, adverse events, and laboratory results—using a uniform structure and variable naming convention. This consistency allows any reviewer to interpret captured data, such as a blood pressure measurement, without needing to consult study-specific custom documentation.

While SDTM tabulates the data as collected, the Analysis Data Model (ADaM) prepares it for statistical analysis. ADaM datasets include derived variables, analysis flags, and complex calculations like time-to-event estimates, providing the direct basis for the primary efficacy and safety analyses. Accompanying these datasets is the define.xml file, a machine-readable metadata document that provides a roadmap for reviewers. It specifies the origin and derivation of every dataset and variable, allowing for efficient navigation of the submission. Supporting all these models is the use of controlled terminology, which mandates standardized codes for categorical values like adverse event severity or dosing routes, thereby enabling automated data pooling and comparison across the entire research portfolio.

The Traceability Chain

Regulators expect to trace any number in a submission back to its source. This traceability chain links:

  1. Source documents (medical records, lab reports)
  2. CRF/EDC data (as collected)
  3. SDTM datasets (tabulated for submission)
  4. ADaM datasets (derived for analysis)
  5. Tables, listings, and figures (TLFs) (in the CSR)
  6. eCTD submission (regulatory dossier)

At each step, the transformation should be documented and reproducible. If a reviewer questions an efficacy p-value, they should be able to follow the chain back through the ADaM dataset, the SDTM source, the EDC record, and ultimately the source document.

Common traceability failures include:

  • Undocumented derivations (how was “baseline” defined?)
  • Missing mapping documentation (how did EDC variable X become SDTM variable Y?)
  • Ad hoc data corrections applied outside validated systems
  • Inconsistent versioning of datasets across analysis steps
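The last failure mode is mechanically preventable: record a manifest of dataset checksums at each step and re-verify it before the next step consumes the files. A sketch using SHA-256, with illustrative file names:

```python
import hashlib
import json
import os
import tempfile

# Sketch of a dataset-version manifest: checksums recorded at one analysis
# step let a later step detect any silent change. File names are illustrative.

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(paths: list[str]) -> dict:
    return {os.path.basename(p): sha256_of(p) for p in paths}

with tempfile.TemporaryDirectory() as d:
    adsl = os.path.join(d, "adsl.xpt")
    with open(adsl, "wb") as f:
        f.write(b"dataset contents v1")
    manifest = build_manifest([adsl])           # recorded at step N
    assert build_manifest([adsl]) == manifest   # step N+1 re-verifies: intact
    with open(adsl, "wb") as f:
        f.write(b"dataset contents v2")         # silent change between steps
    assert build_manifest([adsl]) != manifest   # mismatch caught
    print(json.dumps(manifest, indent=2))
```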

The eCTD Submission Structure

Submissions to major regulatory agencies use the electronic Common Technical Document (eCTD) format, which organizes the dossier into a navigable electronic structure.

Table 19.2: eCTD Module Structure
| Module | Content | Data Management Relevance |
|---|---|---|
| Module 1 | Regional administrative information | Country-specific forms and labels |
| Module 2 | Summaries and overviews | Clinical summaries synthesizing data |
| Module 3 | Quality (CMC) | Manufacturing; not directly DM |
| Module 4 | Nonclinical | Preclinical studies; not directly DM |
| Module 5 | Clinical | CSRs, SDTM/ADaM datasets, define.xml |

Module 5 is where clinical data management work appears: the study datasets (SDTM and ADaM), the define.xml files, and the clinical study reports that interpret the data. The quality of data management directly affects how efficiently reviewers can navigate and verify the submission.

19.7 Regulatory Compliance

Clinical data management is heavily regulated. Multiple requirements apply.

21 CFR Part 11 establishes requirements for electronic records and electronic signatures, addressing system validation, audit trails, access controls, and electronic signatures.

Good Clinical Practice (GCP) requires that trial data be accurately recorded, attributable, contemporaneous, and available for inspection.

Computer System Validation (CSV) requires that EDC and other clinical trial systems be validated for their intended purpose before use, with documentation that the systems function as specified.

Records retention requirements mandate that trial records be maintained for specified periods (often 15+ years) and be available for regulatory inspection.