C-Index from scratch in Python
--
David Deutsch has this tweet:
If you can’t program it, you haven’t understood it.
If you work in medicine, statistics, or even biophysics, you are surely familiar with the C-Index, also called the C-statistic or Harrell’s index. Frank Harrell introduced it to measure the ability of a model to discriminate between patients with different prognoses. This means:
Consider two patients. Patient A lived longer than patient B.
If the predicted survival time for patient A is longer than the predicted survival time for patient B, the prediction for the pair A-B is concordant with the outcome.
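As a minimal sketch of this idea, the check for a single pair could look like the following. The function name `is_concordant` and the numbers are made up for illustration, and ties are assumed not to occur.

```python
def is_concordant(time_a, time_b, pred_a, pred_b):
    """Return True if the predicted ordering of two patients matches
    the observed ordering of their survival times (no ties assumed)."""
    return (time_a > time_b) == (pred_a > pred_b)

# Patient A lived 10 months, patient B lived 4 months (hypothetical values).
print(is_concordant(10, 4, 8.5, 3.0))  # prediction ranks A above B
print(is_concordant(10, 4, 2.0, 3.0))  # prediction ranks B above A
```

The first call reports a concordant pair and the second a discordant one. The C-Index then summarises this check over all comparable pairs.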
The definition of the C-Index is:
Here you can find a wonderful explanation of the C-Index and its interpretation; I highly recommend it.
If you go to the PySurvival website, you will find the following, more refined definition:
where
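Written out in LaTeX, the definition can be sketched as follows. This is inferred from the three pieces implemented in the rest of this post (δT, δη, and δj), so the notation is an assumption and not a quotation from the PySurvival page.

```latex
\[
C\text{-index} =
\frac{\sum_{i,j} \delta_T(T_i, T_j)\,\delta_\eta(\eta_i, \eta_j)\,\delta_j}
     {\sum_{i,j} \delta_T(T_i, T_j)\,\delta_j}
\]
\[
\delta_T(T_i, T_j) =
\begin{cases} 1 & \text{if } T_j < T_i \\ 0 & \text{otherwise} \end{cases}
\qquad
\delta_\eta(\eta_i, \eta_j) =
\begin{cases} 1 & \text{if } \eta_j > \eta_i \\ 0 & \text{otherwise} \end{cases}
\]
```

Here T_i is the survival time of patient i, η_i is the predicted risk score of patient i, and δ_j takes the value 1 or 0 for patient j.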
So, let's program it to understand it!
(Here is how you can type Greek letters on a Linux system.)
First, this expression:
δT = lambda Ti, Tj: 1. if Tj < Ti else 0.
The function δT takes two arguments: Ti and Tj. These are the survival times.
Second, this expression:
δη = lambda ηi, ηj: 1. if ηj > ηi else 0.
The function δη takes two arguments: ηi and ηj. These are the risk scores. The patient with a higher risk score should have a shorter predicted survival time.
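To see how the two indicator functions behave together, here is a small sketch for a single pair of patients. The values are hypothetical and chosen only for illustration.

```python
# The two indicator functions from the text
δT = lambda Ti, Tj: 1. if Tj < Ti else 0.
δη = lambda ηi, ηj: 1. if ηj > ηi else 0.

# Hypothetical pair: patient j died earlier and was given the higher risk score
Ti, Tj = 5., 3.    # survival times
ηi, ηj = 0.2, 0.9  # risk scores

comparable = δT(Ti, Tj)               # 1.0 because Tj < Ti
concordant = δT(Ti, Tj) * δη(ηi, ηj)  # 1.0 because, in addition, ηj > ηi

print(comparable, concordant)
print(δT(Tj, Ti))  # 0.0: with the roles swapped, the pair is not counted again
```

Note that both functions return 0 for ties, so a pair with equal survival times contributes nothing to either sum.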
Third, what is δj?
Mathematically, it can be either 1 or 0. So, let's assume for now it is always 1:
δj = [1, 1, 1, 1, …]
Assuming the number of patients is n:
import numpy as np
δj = np.array([1. for i in range(n)])  # equivalent to np.ones(n)
Lastly, we implement the sum over i and j:
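n = 5  # number of patients
numerator = 0.
denominator = 0.
for i in range(n):
    for j in range(n):
        # T and η (assumed names) are the arrays of survival times and risk scores
        numerator += δT(T[i], T[j]) * δη(η[i], η[j]) * δj[j]
        denominator += δT(T[i], T[j]) * δj[j]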
The expression x += 1 is shorthand for x = x + 1.
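Wrapped into a function, the double loop can be sketched as below. The function name `c_index`, its argument names, and the toy data are assumptions made for illustration. They are not taken from PySurvival.

```python
import numpy as np

# Indicator functions as defined in the text
δT = lambda Ti, Tj: 1. if Tj < Ti else 0.
δη = lambda ηi, ηj: 1. if ηj > ηi else 0.

def c_index(T, η, δ):
    """Double-loop C-Index.
    T: survival times, η: risk scores, δ: vector of 1/0 values (one per patient).
    Raises ZeroDivisionError if no pair is comparable (e.g. all times equal)."""
    n = len(T)
    numerator = 0.
    denominator = 0.
    for i in range(n):
        for j in range(n):
            numerator += δT(T[i], T[j]) * δη(η[i], η[j]) * δ[j]
            denominator += δT(T[i], T[j]) * δ[j]
    return numerator / denominator

# Hypothetical toy data for 5 patients
T = np.array([2., 4., 6., 8., 10.])
δ = np.ones(5)
η_good = np.array([0.9, 0.7, 0.5, 0.3, 0.1])  # higher risk, shorter survival
η_bad = η_good[::-1]                          # exactly the wrong ordering

print(c_index(T, η_good, δ))  # 1.0
print(c_index(T, η_bad, δ))   # 0.0
```

A perfectly ordered set of risk scores gives 1, and a perfectly reversed one gives 0. Real models land somewhere in between.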
Putting it all together
This is now easy. But first, we have to define the actual survival times, predicted risk scores, and the vector…