R language nonlinear regression and generalized linear models: Poisson, gamma, logistic regression, Beta regression analysis of motor vehicle accidents, mouse infections, clam data, supplement sodium intake data

Full text link: https://tecdat.cn/?p=33781

We use Generalized Linear Models (GLM) to study non-normal data of customers and explore non-linear relationships(Click “Read the original text” at the end of the article to get the complete< strong>Code data).

Related videos

GLM is a flexible statistical model that works with a variety of data types and distributions, including non-normal distributions such as binomial, Poisson, and negative binomial distributions. Through GLM, we can model and predict non-normal data, and can handle count data, such as the number of customer purchases, website clicks, etc. GLM also allows the introduction of nonlinear effects of independent variables, thereby better fitting the complex relationship with the response variable. This makes GLM a powerful tool for dealing with non-normal data and non-linear relationships.

Poisson and Gamma Regression – Exploring the Connection

If we look at the train-vehicle collision data(See the end of the article to learn how to get the data for free), we find An interesting pattern.

library(readr)
...

train <- read_csv("s_agresti.csv")

train_plot <- ggplot(train,
   ......

Just by looking at it, we can see that the variance changes with the predictor variables. Furthermore, we are dealing with count data, which has its own distribution, the Poisson distribution. However, what happens if we insist on using lm for analysis?

train_lm <-......odel(train_lm)

There is a mismatch between predicted and observed values. Part of the reason is that the response variable here is not normally distributed in the residuals, but Poisson distributed since it is count data.

Poisson regression

Generalized linear models with Poisson errors usually have a logarithmic link, although they can also have an identity link. For example,

pois_tib <- tibble(x = rep(0:40,2),
            ...
  geom_col(position = position_dodge())

The above shows two Poisson distributions, one with mean 5 and the other with mean 20. Notice how their variance changes.

The logarithmic link (e.g. y?=ea + bx?=eβ + αx) is a natural fitting method because it cannot get values less than 0. So in this case we can do this:

train_glm <- glm(collisions ~ km_trtravel,
           ......

We can then re-evaluate the model’s assumptions, including overseparation. Please note that the QQ plot below has no practical meaning because this is not a normal distribution.

check_model(train_glm)

So… what to do with the residuals? Given that the residuals are not normally distributed, using a qqnorm plot makes little sense. The fitted residual relationship may still look strange.

Click on the title to view previous issues

R language uses linear models for ozone forecasting: weighted Poisson regression, ordinary least squares, weighted negative binomial model, multiple imputation of missing values

Swipe left or right to see more

Quantile residuals using generalized linear models

One way to evaluate a generalized linear model (and many other model forms) is to look at its quantile residuals. So first let’s generate some simulation residuals using DHARMa.

res <- simulateResiduals(train_glm)

We can plot these graphs and perform nonparametric fit tests.

plotQQunif(res)

Very good, the fitting effect is good. Ignore the outlier test since on more detailed observation we find no outliers.

We can also view plots of predicted versus quantized residuals.

plotResiduals(res)

check_overdispersion(train_glm) |> plot()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

View glm results

Let’s take a look at the model results.

summary(train_glm)

Note that here we see standard glm output and we can interpret the coefficients just like any logarithmic transformation. We also have a discrete parameter that describes the relationship between mean and variance. For the Poisson distribution, its value is 1.

Finally, we can draw.

train_plot +
......"log")))

Gamma Return

However, our data are usually not discrete. Consider clam data.

clams <- read_delim("ams.txt", delim = "\t") %>%
  mutate(MONTH = factor(MONTH))

What is the relationship between AFD (ashless dry mass) and month and length? Clearly there is a non-linear relationship here.

clam_plot <- ggplot(clams,
      ...

clam_plot

Now, it looks like we should fit a log-transformed model, but…

clam_lm <- lm(log(......

There are clearly obvious problems. Even the qq plot after taking the logarithm of the AFD is not good, and the residual fitting plot is not good either. Gamma glm takes its inverse function as its canonical join, but they can often also use logarithmic joins.

clam_gamma <- glm(AFD ~ ......
                  data = clams)

check_model(clam_gamma)

besides

clam_res <- simulateR......res)

ploals(clam_res)

Okay, maybe not great. But this is mostly due to the sparsity of high values, so it doesn’t matter.

We can use predict for plotting, and plot each month here.

clam_plot + ......
  facet_wrap(~MONTH)

We can also view other properties.

summary(clam_gamma)

We can reparameterize the gamma distribution so that mean = shape/rate. In this case, we parameterize the gamma distribution using this mean and shape. The discrete parameter is 1/shape.

However, to make it easier to understand, the variance of gamma expands proportionally to the square of the mean. The larger the discrete parameter, the faster the variance expands.

Finally, we can calculate R2 using the pseudo-R2 of the Nagierke meter.

# fit
r2(clam_gamma)

Is this normal?

You may ask why the gamma distribution is used here instead of the normal distribution? We can perform glm fitting with normal error and log link.

clam_glm_norm <- glm(AFD ......
                     data = clams)

One way to tell is to look for overdispersion.

norm_res <- simulateRe......orm_res)

plotuals(norm_res)

We can see that the QQ chart is very good. And predobs aren’t bad either (especially compared to the above). This is some good evidence that normal errors and a log link may be all that’s needed here.

Logistic regression

Let’s look at our example of mice infected with Cryptosporidium. Note that the data is restricted between 0 and 1.

mouse <- read_csv...... Portion)) +
  geom_point()

mouse_plot

This is because although N is the total number of mice per sample, we cannot have more than N infections! In effect, each mouse is like a coin toss. whether it is infected.

Binomial distribution

The binomial distribution has two parameters, the probability of success and the number of coin tosses. The resulting distribution is always between 0 and 1. Consider the case of 15 coin tosses using different probabilities.

R
bin_tibble <- tibble(outcome = rep(0:15, 2),......
  geom_col(position = position_dodge())

We can also adjust the range of the x-axis to 0 to 1 to represent proportions.

Or, consider the same probability, but a different number of coin tosses.

R
bin_tibble <- tibble(outcome = rep(0:15, 2),......
  geom_col(position = position_dodge())

You can see that both parameters affect the shape of the distribution.

Binomial logistic regression

In binomial logistic regression, we primarily estimate the probability of getting a head. We then provide (rather than estimate) the number of trials in the form of weights. The typical link function used here is the logit function, since it describes a logistic function that saturates between 0 and 1.

In R, we can parameterize binomial logistic regression using two forms – these two forms are equivalent in that they extend the results into the number of successes and the total number of trials.

R
mouse_glm_cbind <- glm(cbind(Y,...
                 data = mouse)

The second way uses weights to represent the number of trials.

R
mouse_glm <- glm(Porport......
                 data = mouse)

Both models are identical.

From this point on, the workflow is the same as before – hypothesis testing, analysis, and visualization.

R
checl(mouse_glm)

R
binduals(mouse_glm, ...

R
res_bin <- sim......

R
plotRes_bin)

R
summary(moglm)

R
r2(mouse_glm)

Note that the discrete parameter is 1, just like the Poisson distribution.

R
ggplot(mouse,
     ...
              method.args = list(family = binomial))

Beta Return

Finally, we often encounter restricted data, but these data are not drawn from a binomial distribution – that is, there are no independent “coin flips”.

Consider the following analysis of post-workout sodium intake ratios when taking different supplements. 2300 is the recommended intake, so we normalized it to this value.

R
sodium <- read_csv("laake.csv")

R
ggplot(sodium,
 ...
  geom_boxplot()

Now, let us use beta regression to observe this result.

R

sodium_beta <- beta......
                       data = sodium)

R
soditmb <- glmmTMB(Port...
                       data=sodium)
chec......a_tmb)

R
plotQQunif(sodium_beta_tmb)

We can then proceed with all our usual testing and visualization. For example –

R
emmeans(sodium_b......
  confint(adjust = "none")

If we have a continuous covariate, we can obtain the fitted values and errors and put them into the model.

Data acquisition

Reply to “NonnormalDataData” in the background of the official account to obtain the complete data for free.

The data analyzed in this article is shared to the membership group. Scan the QR code below to join the group!

Click “Read original text” at the end of the article

Get the full text and complete code data.

This article is selected from “R Language Nonlinear Regression and Generalized Linear Models: Poisson Regression, Gamma Regression, Logistic Regression, and Beta Regression Analysis of Motor Vehicle Accidents, Mouse Infections, Clam Data, and Supplement Exercise Sodium Intake Data.”

Click on the title to view previous issues

Data sharing | R language logistic regression, linear discriminant analysis LDA, GAM, MARS, KNN, QDA, decision tree, random forest, SVM classification wine cross-validation ROC

R language Bayesian generalized linear mixed (multi-level/level/nested) model GLMM, logistic regression analysis of data on influencing factors of education grade repetition

Logistic regression Logistic model principle R language classification prediction of coronary heart disease risk example

Data sharing | Predicting abalone age and visualization using additive multiple linear regression, random forest, and elastic network models

Penalized regression methods for high-dimensional data in R language: principal component regression PCR, ridge regression, lasso, elastic network elastic net analysis of genetic data (including practice questions)

The minimum angle algorithm of LARS and Lasso regression in Python Lars analyzes Boston housing data example

Ridge regression and adaptive LASSO regression visualization in R language Bootstrap

R language Lasso regression model variable selection and diabetes development prediction model

R language implements Bayesian quantile regression, lasso and adaptive lasso Bayesian quantile regression analysis

Implementing LASSO regression analysis based on R language

R language uses LASSO, adaptive LASSO to predict inflation time series

R language adaptive LASSO polynomial regression, binary logistic regression and ridge regression application analysis

Classification model case of high-dimensional variable selection using R language penalized logistic regression (LASSO, ridge regression)

Lasso Regression Minimum Angle Algorithm LARS in Python

Implementation of LASSO regression, Ridge regression and Elastic Net model in r language

Implementation of LASSO regression, Ridge regression and Elastic Net models in r language

R language to implement LASSO regression – write your own LASSO regression algorithm

R uses LASSO regression to predict stock returns

Python uses LASSO regression to predict stock returns

The knowledge points of the article match the official knowledge files, and you can further learn related knowledge. Algorithm skill tree Home page Overview 54,166 people are learning the system