Determination of the optimal cutoff value for clinical research is a very convenient method that can handle linear, logistic, and Cox…

Welcome to the 2023 AMOS Structural Equation Model course (permanent replay)!

Structural Equation Model, live broadcast for two days from 10.28 to 29, welcome to register

Regarding the selection of the optimal cutoff value for continuity variables, we have previously introduced the surv_cutpoint and X-tile software in survminer:

  • Implementation of survival analysis in R language

  • Determination of optimal cutoff value for survival analysis

Today I will introduce a very useful R package: cutoff

Prepare data

Demonstration using myeloma data.

library(survival)
library(survminer)
## Loading required package: ggplot2
## Loading required package: ggpubr
##
## Attaching package: 'survminer'
## The following object is masked from 'package:survival':
##
## myeloma

data(myeloma)
head(myeloma)
## molecular_group chr1q21_status treatment event time CCND1 CRIM1
## GSM50986 Cyclin D-1 3 copies TT2 0 69.24 9908.4 420.9
## GSM50988 Cyclin D-2 2 copies TT2 0 66.43 16698.8 52.0
## GSM50989 MMSET 2 copies TT2 0 66.50 294.5 617.9
## GSM50990 MMSET 3 copies TT2 1 42.67 241.9 11.9
## GSM50991 MAF <NA> TT2 0 65.00 472.6 38.8
## GSM50992 Hyperdiploid 2 copies TT2 0 65.20 664.1 16.9
## DEPDC1 IRF4 TP53 WHSC1
## GSM50986 523.5 16156.5 10.0 261.9
## GSM50988 21.1 16946.2 1056.9 363.8
## GSM50989 192.9 8903.9 1762.8 10042.9
## GSM50990 184.7 11894.7 946.8 4931.0
## GSM50991 212.0 7563.1 361.4 165.0
## GSM50992 341.6 16023.4 2096.3 569.2

First use the surv_cutpoint() function to find the best cutoff value:

res.cut <- surv_cutpoint(myeloma, time = "time", event = "event",
                         variables = c("CCND1", "CRIM1", "DEPDC1") # Find the best cut-off point of these three variables
                         )

summary(res.cut)
## cutpoint statistic
##CCND1 450.7 1.976398
## CRIM1 82.3 1.968317
## DEPDC1 279.8 4.275452

View the data distribution after grouping according to the best cut point:

# 2. Plot cutpoint for DEPDC1
plot(res.cut, "DEPDC1", palette = "npg")
## $DEPDC1

ec9d143daf62cfd77544cfd29f2a8aea.png

Then use surv_categorize() to group the data according to the optimal cutoff value, so that the data becomes high expression/low expression groups according to the optimal cutoff value.

For DEPDC1, its optimal cutoff value is 279.8:

# 3. Categorize variables
res.cat <- surv_categorize(res.cut)
head(res.cat)
## time event CCND1 CRIM1 DEPDC1
## GSM50986 69.24 0 high high high
## GSM50988 66.43 0 high low low
## GSM50989 66.50 0 low high low
## GSM50990 42.67 1 low low low
## GSM50991 65.00 0 high low low
## GSM50992 65.20 0 high low high

Plot a survival curve based on the best cut point:

# 4. Fit survival curves and visualize
library("survival")
fit <- survfit(Surv(time, event) ~DEPDC1, data = res.cat)
ggsurvplot(fit, data = res.cat, pval = T)

3e5685874a450295d31b912aad131b61.png

cutoff

cutoff is very simple to use, mainly 6 functions, and the usage logic is consistent.

  • cox: Find the cutoff with the smallest P value in COX regression

  • linear: Find the cutoff with the smallest P value in linear regression

  • logit: Find the cutoff with the smallest P value in logistic regression

  • logrank: Find the cutoff with the smallest P value in the logrank test

  • cutit: Group continuous variables according to cutoff

  • roc: Find the cutoff with the largest area under the ROC curve, which can only be used for classification but not survival.

The first four functions are used to determine the optimal intercept point and support multiple intercept points.

library(cutoff)

Using the best intercept point found by this function and then doing cox regression, the P value obtained is the smallest.

# The parameters are very simple
cut_cox <- cutoff::cox(data = myeloma,
            time = "time",
            y = "event",
            x = "DEPDC1",
            cut.numb = 1, #Number of cut points
            n.per = 0.2, #The proportion of the sample size of each group to the total sample size after grouping
            y.per = 0.1, #Proportion of dependent variable after grouping
            round = 5 #retain how many decimal places
            )
##
## 70 rows were deleted because of missing value.
## 186 rows were analyzed
##
##
## 1: all combinations: 185
## 2: filter by n.per
## -
## =
## Combination: 111
##
## 3: filter by y.per
## Combination: 111
##
## 4: last combination: 58

library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
cut_cox %>% arrange(pvalue)
## cut1 n n.per y y.per dump hr pvalue
## 1 279.8 108/78 0.58065/0.41935 21/30 0.19444/0.38462 b2 2.19404 0.00577
## 2 481.8 146/40 0.78495/0.21505 34/17 0.23288/0.42500 b2 2.25001 0.00665
## 3 273.9 107/79 0.57527/0.42473 21/30 0.19626/0.37975 b2 2.15075 0.00713
## 4 466.1 143/43 0.76882/0.23118 33/18 0.23077/0.41860 b2 2.14269 0.00939
## 5 273.3 106/80 0.56989/0.43011 21/30 0.19811/0.37500 b2 2.09152 0.00953
## 6 284.0 109/77 0.58602/0.41398 22/29 0.20183/0.37662 b2 2.06914 0.01015
##
## p.adjust
## 1 0.64062
## 2 0.73839
## 3 0.79196
## 4 1.04224
## 5 1.05836
## 6 1.12655

You can see that the best cutoff value found here is also 279.8. This value is used for grouping below: (Note that the pvalue obtained above is not the P value of COX regression)

group <- cutoff::cutit(myeloma$DEPDC1,279.8)

Do cox regression again:

summary(coxph(Surv(myeloma$time,myeloma$event)~factor(group)))
## Call:
## coxph(formula = Surv(myeloma$time, myeloma$event) ~ factor(group))
##
## n= 256, number of events= 70
##
## coef exp(coef) se(coef) z Pr(>|z|)
## factor(group)2 1.020 2.773 0.242 4.215 2.5e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## factor(group)2 2.773 0.3606 1.726 4.456
##
## Concordance= 0.638 (se = 0.03)
## Likelihood ratio test= 17.87 on 1 df, p=2e-05
## Wald test = 17.76 on 1 df, p=2e-05
## Score (logrank) test = 19.34 on 1 df, p=1e-05

The P value is 2.5e-05. You can try other cutoff values. The P values obtained are larger than this~

The P value obtained by cox regression based on this grouping is the smallest.

The usage of several other functions is exactly the same, so I won’t explain it in detail. This R package is powerful and easy to use. It is worth learning.

This public provides a variety of scientific research services!

1. Course training

Since 2022, we have convened a group of experienced professional teams from universities to hold short-term statistical course training classes, includingRlanguage, meta analysis, clinical prediction models, real-world clinical research, 10 courses including questionnaire and scale analysis, medical statistics and SPSS, clinical trial data analysis, repeated measurement data analysis, nhanes, Mendelian randomization, etc. If you have needs, you may wish to click to view:

Refund after publishing the article! In 2023, Mr. Zheng’s team will conduct multiple live courses on scientific research statistics. You are welcome to register.

2. Statistical Services

For team development, we will cooperate with all friends for win-win results. Our team will provide statistical analysis services and help with clinical research. Welcome to learn more:

Medical Statistics Service | One-to-one guidance on medical public database papers