19  Data Management

Behind every clinical trial stands a vast apparatus of data: patient medical histories, laboratory values, imaging results, patient-reported outcomes, adverse event records, protocol deviations, and countless other observations. Converting this raw information into reliable, analyzable data is the province of clinical data management.

The stakes are high. Regulatory decisions affecting millions of patients rest on whether the data can be trusted. The entire edifice of evidence-based medicine assumes that the numbers in clinical databases reflect what actually happened to patients. Data management provides that assurance.

Data in a clinical trial flows through a carefully orchestrated pipeline, illustrated in Figure 19.1.

```mermaid
flowchart TB
    subgraph Row1 [ ]
        direction LR
        A["Clinical Observation"] --> B["Source Document"] --> C["CRF/EDC Entry"] --> D["Edit Checks"] --> E["Query Generation"]
    end
    subgraph Row2 [ ]
        direction LR
        F["Site Response"] --> G["Query Resolution"] --> H["Medical Review"] --> I["Database Lock"] --> J["Analysis Dataset"]
    end
    Row1 --> Row2
```
Figure 19.1: Clinical trial data flow from point of collection through database lock

It begins at the point of collection—when a nurse draws blood, when a patient completes a questionnaire, when a physician conducts an examination. At this stage, data exists as raw observations: a tube of blood, marks on paper, clinical impressions.

Observations are then captured—entered into the trial’s data collection system. Historically, this meant transcribing information onto paper case report forms (CRFs); today, electronic data capture (EDC) systems are standard, with site staff entering data directly into validated computer systems.

Once captured, data is cleaned—reviewed for errors, inconsistencies, missing values, and implausibilities. Queries are issued to sites asking them to verify or correct questionable data. This iterative cleaning process continues until the data is judged acceptably clean.

Finally, data is locked and extracted for statistical analysis. Database lock is a formal milestone: after lock, no further changes are made, and the data that will support the regulatory submission is fixed.

19.1 Electronic Data Capture

EDC systems have transformed data management. These validated software platforms provide electronic forms that mirror protocol requirements, capturing each data point as site staff enter it.

Edit checks programmed into EDC systems catch many errors at the point of entry. If a patient’s birth date suggests they are 200 years old, or if a systolic blood pressure is entered as 12 rather than 120, the system flags the discrepancy immediately. This front-end validation dramatically improves data quality compared to paper-based processes where errors might not be discovered for months.
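Front-end validation of this kind amounts to small, deterministic checks run at the moment of entry. A minimal sketch, with illustrative field names and plausibility limits not drawn from any particular EDC product:

```python
from datetime import date

# Hypothetical front-end edit checks of the kind an EDC system runs at entry
# time. Field names and limits here are illustrative assumptions.

def check_age(birth_date: date, visit_date: date) -> list[str]:
    """Flag implausible ages computed from the entered birth date."""
    age = (visit_date - birth_date).days / 365.25
    if age < 0:
        return ["Birth date is after the visit date"]
    if age > 120:
        return [f"Computed age {age:.0f} exceeds plausibility limit (120)"]
    return []

def check_sbp(sbp: float) -> list[str]:
    """Range check for systolic blood pressure (mmHg)."""
    if not 60 <= sbp <= 260:
        return [f"Systolic BP {sbp} outside plausible range 60-260 mmHg"]
    return []

# A value of 12 (likely a dropped digit from 120) is flagged immediately:
print(check_sbp(12))   # flags the discrepancy
print(check_sbp(120))  # []
```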

EDC also provides real-time visibility into trial data. Sponsors can monitor enrollment, track data completeness, and identify sites with quality issues much more rapidly than was possible with paper systems.

Audit trails automatically log every entry and modification along with timestamps and user identities. This creates an unalterable record of how the data evolved—required for demonstrating compliance with regulatory requirements and for investigating any questions that arise.

19.2 Data flow beyond EDC: external sources, vendor pipelines, and transfer specifications

The traditional operational framing of data flow—especially the contrast between bolus and continuous flow, and the emphasis on timeliness for monitoring and trial control—remains conceptually useful even as the technology stack has changed (Meinert 2013). In contemporary trials, however, “data flow” is no longer limited to site-entered CRFs/EDC. It typically includes multiple externally generated streams (central laboratories, imaging core labs, ePRO/eCOA platforms, wearable/sensor-derived endpoints, and real-world data sources), each with distinct provenance, latency, and error modes (U.S. Food and Drug Administration 2023c, 2023b, 2024b).

From a quality and regulatory standpoint, the central problem is not merely ingestion; it is traceable control across systems: documented interfaces, controlled transformations, and auditability sufficient to support reconstruction of what was collected, when, by whom/what system, and how it was transformed into analysis-ready datasets (U.S. Food and Drug Administration 2023a, 2024a).

Checklist: Data Transfer Specification (DTS)

For each external source (device vendor, central lab, imaging core, ePRO platform, EHR/claims provider), a defensible DTS should specify:

  • File format, structure, and naming conventions for each transfer.
  • Transfer schedule and mechanism, including how retransmissions and cumulative updates are handled.
  • Variable-level definitions: names, data types, units, and agreed code lists, with version control.
  • Conventions for missing, partial, and corrected records.
  • Reconciliation procedures and acceptance criteria against the trial database.
  • Security controls and named points of contact for transfer failures.
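One way to make such a specification enforceable is to express the agreed file structure as a machine-checkable schema and validate every incoming transfer against it. A sketch with hypothetical field names, types, and code lists:

```python
# Illustrative sketch: part of a Data Transfer Specification expressed as a
# machine-checkable schema. Field names, types, and code lists are hypothetical.

DTS = {
    "SUBJID":   {"type": str,   "required": True},
    "LBTEST":   {"type": str,   "required": True,
                 "codelist": {"ALT", "AST", "CREAT"}},
    "LBORRES":  {"type": float, "required": True},
    "LBORRESU": {"type": str,   "required": True},
}

def validate_record(record: dict) -> list[str]:
    """Check one incoming vendor record against the DTS schema."""
    issues = []
    for name, spec in DTS.items():
        if name not in record or record[name] in (None, ""):
            if spec["required"]:
                issues.append(f"{name}: missing required value")
            continue
        value = record[name]
        if not isinstance(value, spec["type"]):
            issues.append(f"{name}: expected {spec['type'].__name__}, "
                          f"got {type(value).__name__}")
        elif "codelist" in spec and value not in spec["codelist"]:
            issues.append(f"{name}: '{value}' not in agreed code list")
    return issues

ok  = {"SUBJID": "001", "LBTEST": "ALT", "LBORRES": 34.0, "LBORRESU": "U/L"}
bad = {"SUBJID": "002", "LBTEST": "ALP", "LBORRES": "34"}
assert validate_record(ok) == []
assert len(validate_record(bad)) == 3  # bad code, wrong type, missing unit
```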

Digital Health Technologies (DHTs): operational controls

When endpoints depend on DHT-derived measurements, the data management plan should address device selection and performance, participant usability and training, data completeness monitoring, and the integrity of transmission and storage processes. These controls are part of endpoint validity, not merely IT hygiene (U.S. Food and Drug Administration 2023c).

Real-world data (EHR and claims): fitness-for-purpose documentation

When EHR or claims-derived datasets are used to support regulatory decision-making, sponsors should document data provenance, capture processes, missingness and measurement error, and the relationship between clinical care documentation and research variables. The governing question is whether the dataset is fit for the stated regulatory purpose, with traceability and transparency sufficient for review (U.S. Food and Drug Administration 2024b, 2023b).

Common Data Elements (CDEs): standardization for interoperability and AI-readiness

Classic operational texts emphasized bespoke form design tailored to each trial’s specific requirements (Meinert 2013). While protocol-specific customization remains necessary, the modern trend is toward Common Data Elements—standardized variable definitions, value sets, and collection instruments that enable data pooling, cross-study comparison, and machine learning applications.

The NIH CDE Repository provides curated element definitions across therapeutic areas, allowing researchers to adopt pre-validated instruments rather than reinventing measurements. CDISC standards (CDASH for collection, SDTM for tabulation, ADaM for analysis) provide the structural framework that regulatory submissions now routinely require. The strategic benefit extends beyond compliance: trials that use CDEs can more readily contribute to meta-analyses, external control arms, and AI-driven signal detection—capabilities that custom-designed data structures often preclude.

Checklist: CDE Selection and Implementation
  • Identify applicable CDE repositories: NIH CDE Repository, CDISC Therapeutic Area User Guides, disease-specific consortia standards.
  • Map protocol endpoints to existing CDEs: prioritize validated instruments for primary and key secondary endpoints; document rationale for any custom elements.
  • Specify controlled terminology: use standard dictionaries (MedDRA for adverse events, WHODrug for medications, SNOMED CT or ICD for diagnoses) with explicit version control.
  • Plan CDISC mapping early: define CDASH annotation during CRF design rather than retrofitting SDTM mapping at database lock.
  • Document deviations: where protocol requirements preclude standard CDEs, document the deviation and maintain a crosswalk to related standards for future interoperability.
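The crosswalk described in the last item can be as simple as a lookup table that also pins dictionary versions. A sketch with hypothetical CRF field names and example dictionary versions:

```python
# Illustrative crosswalk from CRF collection fields (CDASH-style names) to
# SDTM variables, with coding-dictionary versions pinned as the checklist
# suggests. The field names and versions shown are hypothetical examples.

CROSSWALK = {
    # CRF field     -> (SDTM domain, SDTM variable)
    "AE_TERM":       ("AE", "AETERM"),
    "AE_START_DATE": ("AE", "AESTDTC"),
    "CM_DRUG_NAME":  ("CM", "CMTRT"),
}

DICTIONARIES = {
    "AE": {"dictionary": "MedDRA",  "version": "26.1"},
    "CM": {"dictionary": "WHODrug", "version": "2023-09"},
}

def map_record(crf_record: dict) -> dict:
    """Return an SDTM-shaped record annotated with its coding dictionary."""
    out = {}
    for crf_field, value in crf_record.items():
        domain, var = CROSSWALK[crf_field]
        out.setdefault(domain, {"_dictionary": DICTIONARIES[domain]})[var] = value
    return out

mapped = map_record({"AE_TERM": "Headache", "AE_START_DATE": "2024-02-10"})
assert mapped["AE"]["AETERM"] == "Headache"
assert mapped["AE"]["_dictionary"]["version"] == "26.1"
```

Pinning the dictionary version alongside the mapping is what makes later pooling defensible: two studies coded under different MedDRA versions cannot be combined without an explicit version reconciliation step.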

19.3 Data Quality Standards

What makes data “good”? Several dimensions of quality matter.

Accuracy means that recorded data reflects what actually happened. A blood pressure of 140/90 was actually 140/90, not misread or mistranscribed.

Completeness means that all required data has been collected. Missing values may be unavoidable (a patient may miss a visit), but missing data should be documented and explained.

Consistency means that data makes sense internally. If a subject is recorded as male on one form and female on another, something is wrong.

Timeliness means that data is entered promptly. Data entered long after collection is more likely to contain errors due to memory lapses or lost source documents.

Legibility and traceability mean that the data can be read and that its origin can be determined. For electronic data, this means clear audit trails; for paper documents, it means legible handwriting and proper corrections (single-line strikethrough with initials and date, not obliteration).

The principle of ALCOA captures these standards, summarized in Table 19.1.

Table 19.1: ALCOA Data Integrity Principles
| Principle | Definition | Examples | Common Failures |
|---|---|---|---|
| Attributable | Who recorded it? | User ID, signature, initials | Shared login; unsigned entries |
| Legible | Can it be read? | Clear handwriting; readable font | Illegible corrections; truncated fields |
| Contemporaneous | Recorded when observed | Timestamp matches activity | Backdated entries; batched data entry |
| Original | First record of data | Source document; certified copy | Transcribed from informal notes |
| Accurate | Truthful and correct | Matches source document | Transcription errors; rounding |

Modern standards extend ALCOA to ALCOA+, adding: Complete (nothing missing), Consistent (same format over time), Enduring (readable throughout retention), and Available (retrievable for inspection).

19.4 Data Cleaning and Query Management

Despite best efforts at the point of collection, data inevitably contains errors and inconsistencies. Data cleaning systematically identifies and resolves these issues.

Data management personnel run programmed checks against the database: cross-field validations (is the hospitalization date after the enrollment date?), range checks (is this laboratory value physiologically plausible?), and protocol logic checks (if the patient was randomized to arm A, should they have received arm B at this visit?).
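These programmed checks are straightforward to express as functions over subject records. A sketch covering the three check types just mentioned, with illustrative field names and limits:

```python
from datetime import date

# Sketch of programmed batch checks: cross-field, range, and protocol-logic
# validations over subject records. Field names and limits are illustrative.

def run_checks(rec: dict) -> list[str]:
    queries = []
    # Cross-field: hospitalization cannot precede enrollment
    if rec["hosp_date"] and rec["hosp_date"] < rec["enroll_date"]:
        queries.append("Hospitalization date precedes enrollment date")
    # Range: physiologic plausibility
    if not 1.0 <= rec["hemoglobin"] <= 25.0:
        queries.append(f"Hemoglobin {rec['hemoglobin']} g/dL outside plausible range")
    # Protocol logic: treatment dispensed must match randomized arm
    if rec["dispensed_arm"] != rec["randomized_arm"]:
        queries.append("Dispensed treatment does not match randomized arm")
    return queries

rec = {"enroll_date": date(2024, 3, 1), "hosp_date": date(2024, 2, 15),
       "hemoglobin": 0.4, "randomized_arm": "A", "dispensed_arm": "B"}
assert len(run_checks(rec)) == 3  # all three checks fire for this record
```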

When a potential error is identified, a query is generated—a formal request to the site to verify or correct the data. The query specifies the discrepancy and asks the site to respond. Site staff review their source documents (the original medical records) and either correct the data or explain why it is correct as recorded.

Query resolution is an iterative dialogue. Some queries are resolved immediately; others require investigation. The number of open queries and the time to resolution are key data management metrics.
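Both metrics fall out directly from query open and resolve dates. A minimal sketch over made-up query records:

```python
from datetime import date
from statistics import median

# Sketch of the two metrics mentioned above: open-query count and time to
# resolution. The query records here are made-up illustrations.

queries = [
    {"opened": date(2024, 1, 5),  "resolved": date(2024, 1, 8)},
    {"opened": date(2024, 1, 10), "resolved": date(2024, 1, 25)},
    {"opened": date(2024, 2, 1),  "resolved": None},  # still open
]

open_count = sum(1 for q in queries if q["resolved"] is None)
resolution_days = [(q["resolved"] - q["opened"]).days
                   for q in queries if q["resolved"] is not None]

print(f"Open queries: {open_count}")                            # 1
print(f"Median days to resolution: {median(resolution_days)}")  # 9.0
```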

19.5 Source Data and Verification

The trial database is not the ultimate source of truth—the source documents are. Source documents are the original records where clinical observations first appear: medical records, laboratory printouts, imaging reports, patient diaries.

Source data verification (SDV) is the process by which monitors compare the trial database against source documents to ensure that transcription has been accurate. Historically, SDV was exhaustive—100% of key variables for 100% of patients. More recently, risk-based monitoring approaches target SDV toward high-risk data and sites, recognizing that exhaustive verification is expensive and that some data points are more important than others.

The relationship between source documents and the trial database must be clear and auditable. For any value in the database, it should be possible to trace back to the source where that value was originally recorded.

Beyond standard electronic data capture, several specialized data streams introduce unique operational challenges. Central laboratory data, for instance—generated at specialized facilities rather than at the individual clinical sites—ensures high consistency but requires rigorous coordination of specimen logistics and data integration. Similarly, electronic patient-reported outcomes (ePRO) allow for direct capture of data from participants’ own devices, which minimizes transcription errors while shifting the focus to patient compliance and technical device management. Integrating external sources like medical imaging archives, wearable device data, or third-party databases further complicates data management, as these must be harmonized with trial data under strict quality standards. Ultimately, every data transfer between systems—whether from site EMRs to trial databases or between specialized laboratory platforms—must be validated to ensure that information remains intact as it moves through the trial’s ecosystem.

As a trial nears completion, the data management effort culminates in the database lock. This is a critical transition where all queries must be resolved, programmed cleaning must be finalized, and medical reviews must confirm the accuracy of safety coding and adverse event records. Once the database is locked, it is frozen—it can no longer be modified, and any errors discovered afterward must be documented through administrative procedures rather than direct data edits. The locked data is then transferred to biostatisticians for analysis, a transfer that must be rigorously validated to ensure the final clinical study report is based on a complete, accurate, and immutable dataset.

19.6 Data Standards and Traceability

Regulatory agencies require that submission data be provided in standardized formats that enable efficient review and cross-study comparisons. Understanding these standards—and the traceability chain from source to submission—is essential for data management professionals.

The data management professional must ensure that all trial data adheres to international standards developed by the Clinical Data Interchange Standards Consortium (CDISC). These standards, which are now regulatory requirements for submissions to the FDA and PMDA, ensure that clinical data is consistent, interoperable, and reviewable. The Study Data Tabulation Model (SDTM) provides the foundation by organizing raw clinical data into standardized domains—such as demographics, adverse events, and laboratory results—using a uniform structure and variable naming convention. This consistency allows any reviewer to interpret captured data, such as a blood pressure measurement, without needing to consult study-specific custom documentation.

While SDTM tabulates the data as collected, the Analysis Data Model (ADaM) prepares it for statistical analysis. ADaM datasets include derived variables, analysis flags, and complex calculations like time-to-event estimates, providing the direct basis for the primary efficacy and safety analyses. Accompanying these datasets is the define.xml file, a machine-readable metadata document that provides a roadmap for reviewers. It specifies the origin and derivation of every dataset and variable, allowing for efficient navigation of the submission. Supporting all these models is the use of controlled terminology, which mandates standardized codes for categorical values like adverse event severity or dosing routes, thereby enabling automated data pooling and comparison across the entire research portfolio.

The Traceability Chain

Regulators expect to trace any number in a submission back to its source. This traceability chain links:

  1. Source documents (medical records, lab reports)
  2. CRF/EDC data (as collected)
  3. SDTM datasets (tabulated for submission)
  4. ADaM datasets (derived for analysis)
  5. Tables, listings, and figures (TLFs) (in the CSR)
  6. eCTD submission (regulatory dossier)

At each step, the transformation should be documented and reproducible. If a reviewer questions an efficacy p-value, they should be able to follow the chain back through the ADaM dataset, the SDTM source, the EDC record, and ultimately the source document.

Common traceability failures include:

  • Undocumented derivations (how was “baseline” defined?)
  • Missing mapping documentation (how did EDC variable X become SDTM variable Y?)
  • Ad hoc data corrections applied outside validated systems
  • Inconsistent versioning of datasets across analysis steps
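The last failure mode is mechanically preventable: record a manifest of dataset checksums at each step and re-verify it before the next step consumes the files. A sketch using SHA-256, with illustrative file names:

```python
import hashlib
import json
import os
import tempfile

# Sketch of a dataset-version manifest: checksums recorded at one analysis
# step let a later step detect any silent change. File names are illustrative.

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(paths: list[str]) -> dict:
    return {os.path.basename(p): sha256_of(p) for p in paths}

with tempfile.TemporaryDirectory() as d:
    adsl = os.path.join(d, "adsl.xpt")
    with open(adsl, "wb") as f:
        f.write(b"dataset contents v1")
    manifest = build_manifest([adsl])           # recorded at step N
    assert build_manifest([adsl]) == manifest   # step N+1 re-verifies: intact
    with open(adsl, "wb") as f:
        f.write(b"dataset contents v2")         # silent change between steps
    assert build_manifest([adsl]) != manifest   # mismatch caught
    print(json.dumps(manifest, indent=2))
```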

The eCTD Submission Structure

Submissions to major regulatory agencies use the electronic Common Technical Document (eCTD) format, which organizes the dossier into a navigable electronic structure.

Table 19.2: eCTD Module Structure
| Module | Content | Data Management Relevance |
|---|---|---|
| Module 1 | Regional administrative information | Country-specific forms and labels |
| Module 2 | Summaries and overviews | Clinical summaries synthesizing data |
| Module 3 | Quality (CMC) | Manufacturing; not directly DM |
| Module 4 | Nonclinical | Preclinical studies; not directly DM |
| Module 5 | Clinical | CSRs, SDTM/ADaM datasets, define.xml |

Module 5 is where clinical data management work appears: the study datasets (SDTM and ADaM), the define.xml files, and the clinical study reports that interpret the data. The quality of data management directly affects how efficiently reviewers can navigate and verify the submission.

19.7 Regulatory Compliance

Clinical data management is heavily regulated. Multiple requirements apply.

21 CFR Part 11 establishes requirements for electronic records and electronic signatures, addressing system validation, audit trails, access controls, and electronic signatures.

Good Clinical Practice (GCP) requires that trial data be accurately recorded, attributable, contemporaneous, and available for inspection.

Computer System Validation (CSV) requires that EDC and other clinical trial systems be validated for their intended purpose before use, with documentation that the systems function as specified.

Records retention requirements mandate that trial records be maintained for specified periods (often 15+ years) and be available for regulatory inspection.