You made some good points, but let's stop and think about the composite-outcome-components issue. An overriding principle in my mind is to construct endpoints that reflect how patients feel or fare. Hospitalization is bad and should be counted, but not as much as, say, MI or stroke. If you feel that non-HF hospitalization is bad but not as bad as HF hospitalization, then in an ordinal outcome rank non-HF hospitalization as less severe than HF hospitalization (and that as less severe than MI, and MI as less severe than death). The end result is a single expression of which treatment provides better patient outcomes, and a great reduction in sample size requirements. More at https://hbiostat.org/endpoint
My question is out of ignorance more than anything.
Hierarchical analyses are becoming more common (total-events analyses as well, rather than time to first event, but that seems unrelated to this). However, the hierarchy starts with the harder endpoints, so usually those analyses do not survive too many levels.
In your model of “the totality of things patients care about”, would it make sense to flip those hierarchies upside-down, and go from soft to hard?
The methods described in the link I provided, such as longitudinal ordinal outcome modeling, are extremely flexible. For example you might have a 200-level functional status scale as the first 200 levels of Y, then have a level for hospitalization, then a level for MI, then death.
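As a rough illustration of that kind of coding, here is a minimal sketch with hypothetical data and column names (not taken from the linked material), assuming higher Y means a worse state:

```python
import pandas as pd

# Hypothetical per-visit records: a functional status score coded 0 (best)
# to 199 (worst), plus indicators for clinical events in that interval.
visits = pd.DataFrame({
    "patient": [1, 1, 2, 2],
    "month": [1, 2, 1, 2],
    "functional_status": [40, 75, 10, None],
    "hospitalized": [False, True, False, False],
    "mi": [False, False, False, False],
    "died": [False, False, False, True],
})

def ordinal_level(row):
    """Collapse one visit into a single ordinal level Y (higher = worse).
    Levels 0-199: functional status; 200: hospitalization; 201: MI; 202: death."""
    if row["died"]:
        return 202
    if row["mi"]:
        return 201
    if row["hospitalized"]:
        return 200
    return int(row["functional_status"])

visits["Y"] = visits.apply(ordinal_level, axis=1)
print(visits[["patient", "month", "Y"]])
```

A longitudinal ordinal model is then fit to Y over the repeated visits, so every assessment contributes information rather than only the first event.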
Proxy endpoints are the bane of studies. Cancer studies define success as "recurrence-free survival" and seldom "overall survival," because overall survival seldom shows any benefit. It's easier to get a treatment approved with a proxy endpoint.
Don't mix the ideas of "proxy endpoints" with patient-relevant endpoint components (even though you're right that there are issues with recurrence-free survival).
Yes, I was commenting purposely and consciously about proxy endpoints.
I can see how requiring the preregistration of the endpoints would dramatically reduce positive results.
Suppose we do a large RCT testing whether eating jellybeans causes cancer. The results are negative. But the lead scientist is convinced jellybeans cause cancer because he has seen it so many times in his practice. It occurs to him, maybe it’s a particular flavor that causes cancer. There are 20 flavors of jellybeans, so he looks carefully at the data. Sure enough, the green ones appear to cause cancer as the p value is under 5%.
Of course, with 20 flavors at least one of them is likely to show a false positive. Preregistering endpoints puts an end to this kind of nonsense. To clear this up maybe someone should try to replicate a few of those old studies.
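For the arithmetic behind "at least one of them is likely to show a false positive," here is a back-of-the-envelope sketch, assuming the 20 flavor-specific tests are independent (a simplification):

```python
# Family-wise error: chance that at least one of 20 flavor-specific tests
# comes out "significant" at alpha = 0.05 when no flavor has any real effect.
alpha, n_tests = 0.05, 20
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(round(p_at_least_one, 2))  # ~0.64
```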
I am not sure about pre-registration. Possibly there was a different culture of research in those days, too. But an interesting thought. The "law of diminishing returns" was well described in a JAMA commentary many years ago: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3063072/
Thank you for your essay. You are likely on to something. Here are some other thoughts.
1. I have a problem with the binning of cholesterol because the binning is actually ad hoc. (Will address below.)
2. Medicine doesn't truly understand the complex causal mechanisms at play. It only sort of understands them.
Let's call making it to your life expectancy at birth a win. Of course, I'd prefer nobody die of any cause. But let's take a lottery win if we get it. Consider the following:
Life Expectancy at Birth for the United States (1920-1960)
| Birth Year | Men (years) | Women (years) |
| --- | --- | --- |
| 1920 | 54 | 55 |
| 1921 | 55 | 56 |
| 1922 | 56 | 57 |
| 1923 | 57 | 58 |
| 1924 | 58 | 59 |
| 1925 | 58 | 62 |
| 1926 | 59 | 60 |
| 1927 | 60 | 61 |
| 1928 | 61 | 62 |
| 1929 | 62 | 63 |
| 1930 | 58 | 62 |
| 1931 | 59 | 63 |
| 1932 | 60 | 64 |
| 1933 | 61 | 65 |
| 1934 | 62 | 66 |
| 1935 | 62 | 67 |
| 1936 | 63 | 68 |
| 1937 | 64 | 69 |
| 1938 | 65 | 70 |
| 1939 | 66 | 71 |
| 1940 | 61 | 65 |
| 1941 | 62 | 66 |
| 1942 | 63 | 67 |
| 1943 | 64 | 68 |
| 1944 | 65 | 69 |
| 1945 | 66 | 71 |
| 1946 | 67 | 72 |
| 1947 | 68 | 73 |
| 1948 | 69 | 74 |
| 1949 | 70 | 75 |
| 1950 | 66 | 71 |
| 1951 | 67 | 72 |
| 1952 | 68 | 73 |
| 1953 | 69 | 74 |
| 1954 | 70 | 75 |
| 1955 | 67 | 73 |
| 1956 | 68 | 74 |
| 1957 | 69 | 75 |
| 1958 | 70 | 76 |
| 1959 | 71 | 77 |
| 1960 | 72 | 78 |
Instead of ad hoc binning, let's do this. Look at the cholesterol levels of ONLY those who made it to or past their at-birth LE. Show me the cholesterol distribution of this data set. Then make only 3 bins:
• More than 3 SD above the mean
• Within ±3 SD of the mean
• More than 3 SD below the mean
Now show me the survival rates and tell me whether "low" or "high" makes a meaningful difference with these bins.
Presuming it does, trim the tails. Recalculate the 3 SD cut points and make three groups the same way. Does the new low or high make a difference? Are there even any "outliers" left after trimming?
As more winners (those who make it to their LE) are added, update.
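A rough sketch of that procedure under one possible reading (hypothetical data and column names; the cut points are computed from the winners' cholesterol distribution and then applied to everyone):

```python
import numpy as np
import pandas as pd

# Hypothetical data: total cholesterol plus a flag for whether the person
# reached the life expectancy projected at their birth year (a "winner").
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cholesterol": rng.normal(200, 40, 5000),
    "reached_le": rng.binomial(1, 0.6, 5000),
})

def sd_bins(values, mu, sd, k=3):
    """Assign low / middle / high using mean +/- k*SD cut points."""
    edges = [-np.inf, mu - k * sd, mu + k * sd, np.inf]
    return pd.cut(values, bins=edges, labels=["low", "middle", "high"])

# Step 1: cut points come from the winners' cholesterol distribution only.
winners = df[df["reached_le"] == 1]
mu, sd = winners["cholesterol"].mean(), winners["cholesterol"].std()
df["bin"] = sd_bins(df["cholesterol"], mu, sd)
print(df.groupby("bin", observed=True)["reached_le"].mean())  # "win" rate per bin

# Step 2: trim the tails, recalculate mean/SD from the remaining winners, re-bin.
core = df[df["bin"] == "middle"].copy()
core_winners = core[core["reached_le"] == 1]
mu2, sd2 = core_winners["cholesterol"].mean(), core_winners["cholesterol"].std()
core["bin2"] = sd_bins(core["cholesterol"], mu2, sd2)
print(core.groupby("bin2", observed=True)["reached_le"].mean())
```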
I don't think we've really come to terms with the breadth of natural variation that is compatible and incompatible with health.
While I don't really like the term "conservative" medicine, I think this approach might fall within your rubric until the Rube Goldberg machine that is the healthy human is better understood.
The effect of pre-registration on trial outcomes has been examined: see Kaplan and Irvin, "Likelihood of Null Effects of Large NHLBI Clinical Trials Has Increased over Time." It is as you suppose. Probably too much analytical flexibility before registration.
It's likely a combination of decreased questionable research practices and all or most of the low-hanging fruit having been plucked. You see it in other sciences as well. In younger sciences like psychology, it is known that effect sizes decrease or cease to exist when measures against questionable research practices are taken, including pre-registration, when done correctly. There is also your idea that it is harder to improve on the low-hanging fruit of the older trials, especially when those are used as controls in the newer trials, as treatment as usual.
Interesting idea about the temporal correlation between the advent of pre-registration and the era of 'missing significance on hard outcomes'.
However, most trials now will have individual components of the primary composite as secondary outcomes, and will not infrequently have all-cause mortality, or at least CV mortality, as a stand-alone secondary endpoint. And those nearly always fail to reach statistical significance. So if the old-era trialists were in fact p-hacking, something has still changed: back then they would happen upon significant hard outcomes post hoc, whereas that simply doesn't happen now. It seems such a change over the years would NOT hinge on prospective vs. retrospective.
My take is that additional treatments yield marginal improvements. I call it blindness to marginal utility.
https://thethoughtfulintensivist.substack.com/p/how-do-we-mistake-biological-plausibility?r=20qrtz
One of the advantages of being old is that sometimes you get perspective. I did my cardiology fellowship in 1983 in San Antonio in the Air Force. My wife and I worked together in the CCU; she was my fiancée at the time.
I remember one night when it looked like a war zone. It was a six-bed unit. Three people died. One was still in the room. EKG paper was all over the floor, mainly from the defibrillators.
The nurse colonels came by, saw the carnage, and the only comment was that my wife needed to get her hair off her collar. I probably didn't help matters because I was sitting with my feet up on the desk. In retrospect, if I had been as mature then as I am now, I would've stood for their rank, if not the individual. That's why they say youth is wasted on the young.
Anyway, the average age in the CCU was between 50 and 60. An 85-year-old was unheard of, and people would go see that person because we never saw the elderly.
It's sort of like asking how many people are running three-minute-and-30-second miles.
When is the last time anyone was in the CCU and saw 50% of the patients between the ages of 50 and 60 die in one night from acute myocardial infarction?
It's not that we don't have good treatments. Incremental gains, I think, are going to come with GLP-1s and combos like statin plus PCSK9 meds, as we stop the inflammation that caused the disease in the first place.
That's where the money is, I think.
It's not a failure today of doing good studies. It's the complete politicization of health care, so huge studies must be done showing imaginary results that are derived with statistical and methodological chicanery that few of us are qualified to understand. I always go back to Ivan Illich, I think it was, to paraphrase: today we have great care if you get into an accident, there is insulin for diabetics, antibiotics for infection, aspirin for headaches… and everything else is pretty neutral or negative.
I see many treatments that work, but they will never appear here in your column. Never ever ever. You recently discussed A-fib, and the solution to A-fib has never been more clear, and most of it is iatrogenic, but my comment does not stir things up in the least. It's all a matter of which ablation and which medication. The solutions are often out there, but because they are not profitable, nobody will study them.