R language nonlinear regression and generalized linear models: Poisson, gamma, logistic regression, Beta regression analysis of motor vehicle accidents, mouse infections, clam data, supplement sodium intake data | Data sharing…

Full text link: https://tecdat.cn/?p=33781

We use Generalized Linear Models (GLM) to study non-normal data of customers and explore non-linear relationships(Click “Read the original text” at the end of the article to get the complete< strong>Code data).

Related videos

GLM is a flexible statistical model that works with a variety of data types and distributions, including non-normal distributions such as binomial, Poisson, and negative binomial distributions. Through GLM, we can model and predict non-normal data, and can handle count data, such as the number of customer purchases, website clicks, etc. GLM also allows the introduction of nonlinear effects of independent variables, thereby better fitting the complex relationship with the response variable. This makes GLM a powerful tool for dealing with non-normal data and non-linear relationships.

Poisson and Gamma Regression – Exploring the Connection

If we look at the train-vehicle collision data(See the end of the article to learn how to get the data for free), we find An interesting pattern.

library(readr)
...

train <- read_csv("s_agresti.csv")

train_plot <- ggplot(train,
   ......

7d98d42fe94d1e08641ea6ae19b97fc2.png

Just by looking at it, we can see that the variance changes with the predictor variables. Furthermore, we are dealing with count data, which has its own distribution, the Poisson distribution. However, what happens if we insist on using lm for analysis?

train_lm <-......odel(train_lm)

40d663a3f76a12fb9678ebcb61a92c61.png

There is a mismatch between predicted and observed values. Part of the reason is that the response variable here is not normally distributed in the residuals, but Poisson distributed since it is count data.

Poisson regression

Generalized linear models with Poisson errors usually have a logarithmic link, although they can also have an identity link. For example,

pois_tib <- tibble(x = rep(0:40,2),
            ...
  geom_col(position = position_dodge())

41ade250aae92c2cb06963fbe2c5aaac.png

The above shows two Poisson distributions, one with mean 5 and the other with mean 20. Notice how their variance changes.

The logarithmic link (e.g. y?=ea + bx?=eβ + αx) is a natural fitting method because it cannot get values less than 0. So in this case we can do this:

train_glm <- glm(collisions ~ km_trtravel,
           ......

We can then re-evaluate the model’s assumptions, including overseparation. Please note that the QQ plot below has no practical meaning because this is not a normal distribution.

check_model(train_glm)

a0cfdd3bec27a482e2df323f4aa7aa2f.png

So… what to do with the residuals? Given that the residuals are not normally distributed, using a qqnorm plot makes little sense. The fitted residual relationship may still look strange.

Click on the title to view previous issues

d8ffb3508ee0cce7128ed1dd95932cb1.png

R language uses linear models for ozone forecasting: weighted Poisson regression, ordinary least squares, weighted negative binomial model, multiple imputation of missing values

outside_default.png

Swipe left or right to see more

outside_default.png

01

2ef2924f00fee56b6dccd6427ad64db9.png

02

2c9153a327d8d17d5708cf632f6430f0.png

03

1605e1e1b78c1ad1630b0f5cfd509f32.png

04

1f1a76cbfc18150891cb29a423e719e7.png

Quantile residuals using generalized linear models

One way to evaluate a generalized linear model (and many other model forms) is to look at its quantile residuals. So first let’s generate some simulation residuals using DHARMa.

res <- simulateResiduals(train_glm)

We can plot these graphs and perform nonparametric fit tests.

plotQQunif(res)

87155300ed8993b57ba7e2c8cefdbdda.png

Very good, the fitting effect is good. Ignore the outlier test since on more detailed observation we find no outliers.

We can also view plots of predicted versus quantized residuals.

plotResiduals(res)

63fbefdd9ee4c45928c832915070ab33.png

check_overdispersion(train_glm) |> plot()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

e0e6af5fc999b27ebb12ab46f70a1102.png

View glm results

Let’s take a look at the model results.

summary(train_glm)

e2887b23479584caccfeba6dbe9ac3e0.png

Note that here we see standard glm output and we can interpret the coefficients just like any logarithmic transformation. We also have a discrete parameter that describes the relationship between mean and variance. For the Poisson distribution, its value is 1.

Finally, we can draw.

train_plot +
......"log")))

e9679ff511c17c6423fba75733274ef2.png

145455c877413dc9af234a55427b8a2d.png

Gamma Return

However, our data are usually not discrete. Consider clam data.

clams <- read_delim("ams.txt", delim = "\t") %>%
  mutate(MONTH = factor(MONTH))

What is the relationship between AFD (ashless dry mass) and month and length? Clearly there is a non-linear relationship here.

clam_plot <- ggplot(clams,
      ...

clam_plot

406c028db4351f9c3f63959d8ba57545.png

Now, it looks like we should fit a log-transformed model, but…

clam_lm <- lm(log(......

10c6945cde1f3cc3fc2da21a18d96cd5.png

There are clearly obvious problems. Even the qq plot after taking the logarithm of the AFD is not good, and the residual fitting plot is not good either. Gamma glm takes its inverse function as its canonical join, but they can often also use logarithmic joins.

clam_gamma <- glm(AFD ~ ......
                  data = clams)
check_model(clam_gamma)

d23fbf49a4aee867908d9ec4e864a0c9.png

besides

clam_res <- simulateR......res)

b9910d934d52e42732a8242ca2b44381.png

ploals(clam_res)

687ab43ea1f904340893e16682242e50.png

Okay, maybe not great. But this is mostly due to the sparsity of high values, so it doesn’t matter.

We can use predict for plotting, and plot each month here.

clam_plot + ......
  facet_wrap(~MONTH)

790375afb730b031a6262f85f3733e83.png

We can also view other properties.

ce96187c0bb2914b89dc7aff4b4388e8.png

summary(clam_gamma)

465e206ed407077a7d3ecfe4f897196a.png

We can reparameterize the gamma distribution so that mean = shape/rate. In this case, we parameterize the gamma distribution using this mean and shape. The discrete parameter is 1/shape.

However, to make it easier to understand, the variance of gamma expands proportionally to the square of the mean. The larger the discrete parameter, the faster the variance expands.

Finally, we can calculate R2 using the pseudo-R2 of the Nagierke meter.

# fit
r2(clam_gamma)

142c14e5530b09316ad4f29e960b456c.png

Is this normal?

You may ask why the gamma distribution is used here instead of the normal distribution? We can perform glm fitting with normal error and log link.

clam_glm_norm <- glm(AFD ......
                     data = clams)

One way to tell is to look for overdispersion.

norm_res <- simulateRe......orm_res)

outside_default.png

plotuals(norm_res)

outside_default.png

We can see that the QQ chart is very good. And predobs aren’t bad either (especially compared to the above). This is some good evidence that normal errors and a log link may be all that’s needed here.

Logistic regression

Let’s look at our example of mice infected with Cryptosporidium. Note that the data is restricted between 0 and 1.

mouse <- read_csv...... Portion)) +
  geom_point()

mouse_plot

outside_default.png

This is because although N is the total number of mice per sample, we cannot have more than N infections! In effect, each mouse is like a coin toss. whether it is infected.

Binomial distribution

The binomial distribution has two parameters, the probability of success and the number of coin tosses. The resulting distribution is always between 0 and 1. Consider the case of 15 coin tosses using different probabilities.

R
bin_tibble <- tibble(outcome = rep(0:15, 2),......
  geom_col(position = position_dodge())

outside_default.png

We can also adjust the range of the x-axis to 0 to 1 to represent proportions.

Or, consider the same probability, but a different number of coin tosses.

R
bin_tibble <- tibble(outcome = rep(0:15, 2),......
  geom_col(position = position_dodge())

outside_default.png

You can see that both parameters affect the shape of the distribution.

Binomial logistic regression

In binomial logistic regression, we primarily estimate the probability of getting a head. We then provide (rather than estimate) the number of trials in the form of weights. The typical link function used here is the logit function, since it describes a logistic function that saturates between 0 and 1.

In R, we can parameterize binomial logistic regression using two forms – these two forms are equivalent in that they extend the results into the number of successes and the total number of trials.

R
mouse_glm_cbind <- glm(cbind(Y,...
                 data = mouse)

The second way uses weights to represent the number of trials.

R
mouse_glm <- glm(Porport......
                 data = mouse)

Both models are identical.

From this point on, the workflow is the same as before – hypothesis testing, analysis, and visualization.

R
checl(mouse_glm)

outside_default.png

R
binduals(mouse_glm, ...

outside_default.png

R
res_bin <- sim......

outside_default.png

R
plotRes_bin)

outside_default.png

R
summary(moglm)

outside_default.png

R
r2(mouse_glm)

outside_default.png

Note that the discrete parameter is 1, just like the Poisson distribution.

R
ggplot(mouse,
     ...
              method.args = list(family = binomial))

outside_default.png

Beta Return

Finally, we often encounter restricted data, but these data are not drawn from a binomial distribution – that is, there are no independent “coin flips”.

Consider the following analysis of post-workout sodium intake ratios when taking different supplements. 2300 is the recommended intake, so we normalized it to this value.

R
sodium <- read_csv("laake.csv")
R
ggplot(sodium,
 ...
  geom_boxplot()

outside_default.png

Now, let us use beta regression to observe this result.

R

sodium_beta <- beta......
                       data = sodium)
R
soditmb <- glmmTMB(Port...
                       data=sodium)
chec......a_tmb)

outside_default.png

R
plotQQunif(sodium_beta_tmb)

outside_default.png

We can then proceed with all our usual testing and visualization. For example –

R
emmeans(sodium_b......
  confint(adjust = "none")

outside_default.png

If we have a continuous covariate, we can obtain the fitted values and errors and put them into the model.

outside_default.png

Data acquisition

Reply to “NonnormalDataData” in the background of the official account to obtain the complete data for free.

The data analyzed in this article is shared to the membership group. Scan the QR code below to join the group!

outside_default.png

outside_default.png

Click “Read original text” at the end of the article

Get the full text and complete code data.

This article is selected from “R Language Nonlinear Regression and Generalized Linear Models: Poisson Regression, Gamma Regression, Logistic Regression, and Beta Regression Analysis of Motor Vehicle Accidents, Mouse Infections, Clam Data, and Supplement Exercise Sodium Intake Data.”

outside_default.png

outside_default.png

Click on the title to view previous issues

Data sharing | R language logistic regression, linear discriminant analysis LDA, GAM, MARS, KNN, QDA, decision tree, random forest, SVM classification wine cross-validation ROC

R language Bayesian generalized linear mixed (multi-level/level/nested) model GLMM, logistic regression analysis of data on influencing factors of education grade repetition

Logistic regression Logistic model principle R language classification prediction of coronary heart disease risk example

Data sharing | Predicting abalone age and visualization using additive multiple linear regression, random forest, and elastic network models

Penalized regression methods for high-dimensional data in R language: principal component regression PCR, ridge regression, lasso, elastic network elastic net analysis of genetic data (including practice questions)

The minimum angle algorithm of LARS and Lasso regression in Python Lars analyzes Boston housing data example

Ridge regression and adaptive LASSO regression visualization in R language Bootstrap

R language Lasso regression model variable selection and diabetes development prediction model

R language implements Bayesian quantile regression, lasso and adaptive lasso Bayesian quantile regression analysis

Implementing LASSO regression analysis based on R language

R language uses LASSO, adaptive LASSO to predict inflation time series

R language adaptive LASSO polynomial regression, binary logistic regression and ridge regression application analysis

Classification model case of high-dimensional variable selection using R language penalized logistic regression (LASSO, ridge regression)

Lasso Regression Minimum Angle Algorithm LARS in Python

Implementation of LASSO regression, Ridge regression and Elastic Net model in r language

Implementation of LASSO regression, Ridge regression and Elastic Net models in r language

R language to implement LASSO regression – write your own LASSO regression algorithm

R uses LASSO regression to predict stock returns

Python uses LASSO regression to predict stock returns

outside_default.png

outside_default.png

outside_default.png

The knowledge points of the article match the official knowledge files, and you can further learn related knowledge. Algorithm skill tree Home page Overview 54,166 people are learning the system