In 1987, the FDA approved zidovudine (AZT) for HIV based on a trial showing improved survival at 24 weeks. In 2011, the FDA revoked accelerated approval for bevacizumab in breast cancer after confirmatory trials failed to show the survival benefit the surrogate endpoint had predicted. In 2021, the approval of aducanumab for Alzheimer’s disease sparked controversy because the drug reduced amyloid plaques (the surrogate) but did not clearly improve cognition (the clinical outcome).
These cases illustrate a fundamental challenge: what should we measure to determine whether a drug works? The answer shapes trial design, determines sample size, affects regulatory strategy, and ultimately decides whether patients gain access to new therapies.
An endpoint is a defined outcome that a clinical trial is designed to measure. It is the variable that answers the trial’s central question: did the treatment work? The term “endpoint” reflects the fact that these measurements typically occur at the end of treatment or follow-up, though some endpoints are measured continuously throughout the trial.
Endpoints must be specified before the trial begins. The primary endpoint is the main outcome on which the trial will be judged—the result that determines whether the trial “succeeded” or “failed.” A Phase III trial might have survival as its primary endpoint, meaning the trial succeeds if the treatment group lives significantly longer than the control group. Secondary endpoints provide additional information but do not determine the trial’s overall success.
Choosing the right endpoint requires balancing what matters clinically (does the patient feel better? live longer?) with what is practical (can we measure it reliably? in a reasonable timeframe? with an affordable sample size?). This tension between clinical meaningfulness and operational feasibility runs through every endpoint decision.
The Endpoint Hierarchy
Table 15.1 summarizes the endpoint hierarchy with examples across therapeutic areas.
At the apex of the endpoint hierarchy sits what we ultimately care about: whether patients live longer, feel better, or function more effectively. These clinical endpoints—sometimes called hard endpoints—directly measure patient outcomes. Survival is the paradigm: either the patient is alive or they are not. There is no ambiguity, no measurement error, no room for interpretation.
Below clinical endpoints sit surrogate endpoints: measurements that are not themselves clinical outcomes but that are reasonably expected to predict clinical outcomes. Blood pressure is a surrogate for stroke and heart attack. Viral load is a surrogate for AIDS progression. Tumor shrinkage is a surrogate for cancer survival.
Further down the hierarchy are biomarkers: measurable characteristics that indicate biological processes, disease states, or responses to intervention. Some biomarkers may also be surrogates, but not all. A biomarker that reliably reflects drug activity might not predict clinical benefit.
Clinical Endpoints
The most compelling endpoint is one that directly measures what matters to patients. In oncology, overall survival (OS) is the gold standard—the most reliable way to determine whether a cancer treatment extends life. In heart failure, the composite of cardiovascular death and heart failure hospitalization captures both mortality and major morbidity.
But clinical endpoints come with practical challenges. They may require large trials because events are relatively rare. They may require long follow-up because outcomes develop slowly. A trial with survival as its primary endpoint in a slowly progressing cancer might need thousands of patients followed for years—a massive undertaking with implications for cost, time to approval, and patient access to therapy.
Patient-reported outcomes (PROs) represent another category of clinical endpoints. How does the patient feel? What is their quality of life? Can they perform daily activities? These outcomes are measured through standardized questionnaires and reflect the patient’s perspective, which may differ from what objective measurements suggest.
PROs are increasingly important to regulators and payers. A drug that improves a biomarker but does not make patients feel better is less valuable than one that produces symptomatic improvement. However, PROs introduce measurement challenges: responses can be influenced by expectations, mood, and cultural factors, and the validity of PRO instruments must be established.
Unlike a blood test, a questionnaire cannot be calibrated against a physical standard. Instead, PROs must undergo psychometric validation to prove they measure what they claim to measure. Content validity establishes that the instrument covers the relevant symptoms, typically through qualitative interviews with patients. Reliability ensures the measure is stable and reproducible. Construct validity checks whether the score correlates with other known measures of the disease, while sensitivity to change confirms the instrument can detect meaningful improvements. Regulators require this evidence to approve labeling claims based on PROs.
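To make the reliability step concrete, internal consistency of a multi-item questionnaire is often summarized with Cronbach's alpha, which compares the summed variance of individual items to the variance of the total score. A minimal sketch in Python; the item scores below are hypothetical, and real PRO validation relies on established psychometric software:

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha: internal-consistency reliability of a multi-item scale.
    item_scores: one inner list per questionnaire item, each holding one
    score per respondent."""
    k = len(item_scores)
    totals = [sum(scores) for scores in zip(*item_scores)]  # total score per respondent
    item_variance = sum(pvariance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_variance / pvariance(totals))

# Hypothetical responses: 3 items answered by 4 respondents.
items = [[1, 2, 3, 4],
         [2, 2, 3, 4],
         [1, 3, 3, 4]]
print(f"alpha = {cronbach_alpha(items):.2f}")
```

Values near 1 indicate that the items move together and plausibly measure a single underlying construct; values below roughly 0.7 are usually taken as inadequate reliability.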
The proliferation of smartphones and wearables has given rise to a new category of endpoints: those derived from Digital Health Technologies (DHTs). DHTs offer the promise of moving from snapshot medicine to continuous, real-world monitoring. This includes digital biomarkers, such as subtle changes in gait or voice that may predict disease progression, and continuous monitoring that replaces brief clinical assessments with long-term data collection.
While promising, DHTs face significant validation hurdles. Regulators must be convinced that the device measures the concept accurately, that the measurement correlates with clinical status, and that the data is secure and attributable. The FDA has established a specific Digital Health Center of Excellence to guide the integration of these novel tools into regulatory decision-making (U.S. Food and Drug Administration 2023).
Surrogate Endpoints
Given the challenges of clinical endpoints, surrogates offer an attractive alternative. They can often be measured sooner and in fewer patients, potentially accelerating development and reducing costs.
The concept behind surrogates is straightforward: if intervention A lowers blood pressure more than intervention B, and lower blood pressure reduces cardiovascular events, then intervention A should reduce cardiovascular events more than intervention B.
But this logic can fail. The relationship between surrogate and outcome may not hold under all circumstances. An intervention might affect the surrogate through a pathway that does not influence clinical outcomes. Or it might have effects on clinical outcomes that are not mediated by the surrogate.
The history of medicine includes instructive failures. The CAST trial of the late 1980s tested antiarrhythmic drugs in post-heart attack patients. The drugs effectively suppressed the abnormal heart rhythms they were designed to treat—the surrogate looked excellent. But the trial was stopped early when it became clear that treated patients were more likely to die than those on placebo. The rhythm abnormalities were a marker of underlying heart damage, not a cause of death; suppressing them did nothing to address the underlying disease and may have introduced new cardiac risks.
Similarly, hormone replacement therapy improved lipid profiles and was widely prescribed for cardiovascular prevention based on surrogate reasoning and observational data. The Women’s Health Initiative randomized trial showed that, despite the favorable lipid effects, hormone therapy actually increased cardiovascular events. The surrogate had predicted the wrong outcome.
The FDA distinguishes between validated surrogates—where the relationship to clinical outcome is well established—and reasonably likely surrogates—where the relationship is plausible but not fully established. The latter can support accelerated approval for serious conditions, but sponsors must subsequently confirm the clinical benefit through post-marketing studies.
Primary and Secondary Endpoints
Table 15.2 defines the roles of different endpoint categories in clinical trials.
The primary endpoint is the outcome that the trial is designed and powered to assess. It is the basis for the primary analysis and typically the endpoint that regulatory decisions rest upon. A trial succeeds or fails based primarily on its primary endpoint.
Secondary endpoints provide additional information. They may address different aspects of efficacy, safety, or tolerability. They may assess outcomes in subpopulations. They may evaluate the durability of effects over time.
The relationship between primary and secondary endpoints requires careful thought. If the primary endpoint is negative (no statistically significant difference between treatment groups), positive secondary endpoints are difficult to interpret—they may represent chance findings from multiple comparisons. If the primary endpoint is positive, secondary endpoints provide valuable confirmation and characterization.
Multiplicity is the statistical challenge that arises from testing multiple endpoints. With a conventional significance threshold of 0.05, each test of a truly ineffective treatment has a one-in-twenty chance of coming up "positive" by chance alone. When multiple endpoints are tested, the overall probability of at least one false positive finding increases. Statistical methods for multiplicity adjustment, such as hierarchical testing, gatekeeping procedures, or alpha spending, are used to control this inflation of false positives.
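The inflation is easy to quantify. Assuming independent tests each run at level alpha, the chance of at least one false positive is 1 − (1 − alpha)^k. The sketch below also shows the Bonferroni correction, the simplest adjustment, though confirmatory trials more often use the hierarchical or gatekeeping schemes mentioned above:

```python
def fwer(k, alpha=0.05):
    """Familywise error rate: probability of at least one false positive
    across k independent tests, each run unadjusted at level alpha."""
    return 1 - (1 - alpha) ** k

def bonferroni_threshold(k, alpha=0.05):
    """Bonferroni adjustment: test each endpoint at alpha / k
    to keep the familywise error rate at or below alpha."""
    return alpha / k

for k in (1, 5, 10):
    print(f"{k} endpoints: unadjusted FWER = {fwer(k):.3f}, "
          f"Bonferroni per-test threshold = {bonferroni_threshold(k):.4f}")
```

With ten unadjusted endpoints, the chance of at least one spurious "win" exceeds 40 percent, which is why regulators insist on prespecified adjustment strategies.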
Composite Endpoints
A composite endpoint combines multiple outcomes into a single measure. A participant might be counted as having an event if they experience any one of: death, heart attack, stroke, or hospitalization for unstable angina. Composites are typically analyzed as time to first event: each participant contributes at most one event, whichever component occurs first, even if a more severe component (such as death) follows later.
Composites offer statistical efficiency: because events are more common (many participants will have at least one component), smaller sample sizes may suffice. They also provide a more complete picture of disease burden, capturing multiple manifestations of benefit or harm.
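The efficiency argument can be made concrete with the standard normal-approximation sample-size formula for comparing two proportions. The event rates below are hypothetical, chosen to show the same 25% relative risk reduction applied to a rare single endpoint versus a more common composite:

```python
import math
from statistics import NormalDist

def n_per_group(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided comparison of two
    proportions (normal approximation, unpooled variance)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
    z_beta = NormalDist().inv_cdf(power)            # about 0.84
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return math.ceil((z_alpha + z_beta) ** 2 * variance
                     / (p_control - p_treated) ** 2)

# Hypothetical rates: a 25% relative reduction in each case.
rare = n_per_group(0.05, 0.0375)      # single endpoint, 5% control event rate
composite = n_per_group(0.20, 0.15)   # composite, 20% control event rate
print(f"rare endpoint: {rare} per group; composite: {composite} per group")
```

Under these illustrative assumptions the composite trial needs roughly a quarter of the participants, which is the practical appeal of combining events.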
But composites introduce interpretive challenges. If the composite is positive overall, which components are driving the effect? Is a treatment that reduces hospitalizations but has no effect on death meaningfully beneficial? Should a drug that reduces minor events but not major ones be approved?
The components of a composite should be clinically related and of roughly similar importance. A composite that includes both death and mild headache is problematic because the components have vastly different clinical significance.
Time-to-Event Endpoints
Many outcomes are measured not as binary events but as time-to-event: how long until a participant experiences the outcome. Time-to-event analyses can distinguish treatments that reduce the eventual number of events from treatments that merely delay them.
Survival analysis methods, such as Kaplan-Meier curves, log-rank tests, and Cox proportional hazards models, are designed for time-to-event data. They accommodate participants who have not yet experienced an event at the time of analysis, as well as those who drop out early; both are censored at their last event-free observation.
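A minimal Kaplan-Meier estimator shows exactly how censoring enters the calculation: censored participants leave the risk set without counting as events. This is an illustrative sketch with made-up follow-up times; real analyses use dedicated survival packages (for example, lifelines in Python or survival in R):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.
    times:  follow-up time for each participant
    events: 1 if the event occurred at that time, 0 if censored
    Returns (time, survival probability) pairs at each event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [e for (tt, e) in data if tt == t]   # everyone leaving at time t
        deaths = sum(tied)
        if deaths:
            survival *= (at_risk - deaths) / at_risk
            curve.append((t, survival))
        at_risk -= len(tied)   # events and censored both leave the risk set
        i += len(tied)
    return curve

# Hypothetical data: 6 participants; event flag 0 marks a censored observation.
curve = kaplan_meier([1, 2, 2, 3, 4, 5], [1, 0, 1, 1, 0, 1])
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Note that the curve only steps down at event times; a censored participant (here at times 2 and 4) simply shrinks the denominator for all later steps.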
A key assumption in many time-to-event analyses is proportional hazards: the hazard ratio between treatment groups is constant over time. If a treatment reduces the hazard of an event by 30% at month 6, it should also reduce it by 30% at month 12. When this assumption is violated, as when a treatment takes time to work or its effects wane over time, alternative methods may be needed.
Endpoint Selection in Practice
Selecting the optimal endpoint requires a sophisticated balance of regulatory, clinical, and operational constraints. While regulatory acceptability is paramount—often favoring well-validated clinical outcomes—sponsors must also prioritize clinical meaningfulness to ensure the drug provides value that patients and payers actually recognize. Operational factors are equally critical: an endpoint must be measured with enough precision to maintain statistical power, and its timing must be carefully managed to avoid unnecessary delays while still providing a reliable prediction of long-term benefit. In many cases, using established and validated instruments—particularly for patient-reported outcomes—is the most effective way to ensure that trial findings are both reproducible and credible to external stakeholders.
U.S. Food and Drug Administration. 2023. "Digital Health Technologies for Remote Data Acquisition in Clinical Investigations: Guidance for Industry, Investigators, and Other Stakeholders." https://www.fda.gov/regulatory-information/search-fda-guidance-documents/digital-health-technologies-remote-data-acquisition-clinical-investigations.