GLMM for count data using square root link in lme4
Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
3
down vote
favorite
I have data from a field survey. The objective of the study is to relate number of seedling (respond variable, count data), landform (exploratory variable, categorical variable with 3 levels) and percent canopy coverage (exploratory variable, quantitative).
In each habitat, I have data from five 25x25 meter plots. Within each plot I used three 2x2 meter subplots nested within the bigger plot, and number of seedlings were count from these subplots. Total number of observations is 60; 20 plots x 3 subplots.
Only one kind of landform contained seedlings. Other two landforms contained no seedlings, so they are empty plots. the data looks like this:
data.frame': 60 obs. of 5 variables:
$ plot : Factor w/ 20 levels "HD2","LC1","LC2",..: 16 16 16 17 17 17 12 12 12 9 ...
$ subplot : Factor w/ 60 levels "HD2.1","HD2.2",..: 46 47 48 49 50 51 34 35 36 25 ...
$ av.canopy : num 92.2 92.2 92.2 92.3 92.3 ...
$ landform : Factor w/ 3 levels "abandoned","ridge",..: 2 2 2 2 2 2 2 2 2 2 ...
$ seedling.2016: int 6 7 5 2 5 4 4 6 4 0 ...
The problem is when I compared number of seedlings between landforms using this code:
seedling <- glmer(seedling.2016 ~ landform +(1|plot), family = poisson)
The result does not make sense for me-there were no any significant different between landforms event there is only one landform (ridge) that has seedlings, while other had no seedlings at al. It is also note that SEs are enormous.
The result looks like this:
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: poisson ( log )
Formula: seedling.2016 ~ landform + (1 | plot)
Data: streblus.subplots
AIC BIC logLik deviance df.resid
294.9 303.3 -143.5 286.9 56
Scaled residuals:
Min 1Q Median 3Q Max
-6.3619 -0.0001 -0.0001 0.0000 8.7305
Random effects:
Groups Name Variance Std.Dev.
plot (Intercept) 2.637 1.624
Number of obs: 60, groups: plot, 20
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -20.412 1461.267 -0.014 0.989
landformridge 22.250 1461.265 0.015 0.988
landformtemp 1.066 390.540 0.003 0.998
When I changed link function to square root as this code:
Seedling2 <- glmer(seedling.2016 ~ landform + (1|plot), family = poisson(link = sqrt))
Fixed effects:
#Estimate Std. Error z value Pr(>|z|)
#(Intercept) -1.220e-05 5.296e-01 0.000 1
#landformridge 3.250e+00 7.429e-01 4.376 1.21e-05 ***
# landformtemp 1.018e-05 7.795e-01 0.000 1
Now number of seedlings in ridge is significantly higher than the other, and it makes more sense to me.
My question is: Is it valid in term of statistics to use square root link with Poisson distribution, there are any better methods available with better ground of statistics?
lme4-nlme count-data glmm separation link-function
add a comment |Â
up vote
3
down vote
favorite
I have data from a field survey. The objective of the study is to relate number of seedling (respond variable, count data), landform (exploratory variable, categorical variable with 3 levels) and percent canopy coverage (exploratory variable, quantitative).
In each habitat, I have data from five 25x25 meter plots. Within each plot I used three 2x2 meter subplots nested within the bigger plot, and number of seedlings were count from these subplots. Total number of observations is 60; 20 plots x 3 subplots.
Only one kind of landform contained seedlings. Other two landforms contained no seedlings, so they are empty plots. the data looks like this:
data.frame': 60 obs. of 5 variables:
$ plot : Factor w/ 20 levels "HD2","LC1","LC2",..: 16 16 16 17 17 17 12 12 12 9 ...
$ subplot : Factor w/ 60 levels "HD2.1","HD2.2",..: 46 47 48 49 50 51 34 35 36 25 ...
$ av.canopy : num 92.2 92.2 92.2 92.3 92.3 ...
$ landform : Factor w/ 3 levels "abandoned","ridge",..: 2 2 2 2 2 2 2 2 2 2 ...
$ seedling.2016: int 6 7 5 2 5 4 4 6 4 0 ...
The problem is when I compared number of seedlings between landforms using this code:
seedling <- glmer(seedling.2016 ~ landform +(1|plot), family = poisson)
The result does not make sense for me-there were no any significant different between landforms event there is only one landform (ridge) that has seedlings, while other had no seedlings at al. It is also note that SEs are enormous.
The result looks like this:
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: poisson ( log )
Formula: seedling.2016 ~ landform + (1 | plot)
Data: streblus.subplots
AIC BIC logLik deviance df.resid
294.9 303.3 -143.5 286.9 56
Scaled residuals:
Min 1Q Median 3Q Max
-6.3619 -0.0001 -0.0001 0.0000 8.7305
Random effects:
Groups Name Variance Std.Dev.
plot (Intercept) 2.637 1.624
Number of obs: 60, groups: plot, 20
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -20.412 1461.267 -0.014 0.989
landformridge 22.250 1461.265 0.015 0.988
landformtemp 1.066 390.540 0.003 0.998
When I changed link function to square root as this code:
Seedling2 <- glmer(seedling.2016 ~ landform + (1|plot), family = poisson(link = sqrt))
Fixed effects:
#Estimate Std. Error z value Pr(>|z|)
#(Intercept) -1.220e-05 5.296e-01 0.000 1
#landformridge 3.250e+00 7.429e-01 4.376 1.21e-05 ***
# landformtemp 1.018e-05 7.795e-01 0.000 1
Now number of seedlings in ridge is significantly higher than the other, and it makes more sense to me.
My question is: Is it valid in term of statistics to use square root link with Poisson distribution, there are any better methods available with better ground of statistics?
lme4-nlme count-data glmm separation link-function
add a comment |Â
up vote
3
down vote
favorite
up vote
3
down vote
favorite
I have data from a field survey. The objective of the study is to relate number of seedling (respond variable, count data), landform (exploratory variable, categorical variable with 3 levels) and percent canopy coverage (exploratory variable, quantitative).
In each habitat, I have data from five 25x25 meter plots. Within each plot I used three 2x2 meter subplots nested within the bigger plot, and number of seedlings were count from these subplots. Total number of observations is 60; 20 plots x 3 subplots.
Only one kind of landform contained seedlings. Other two landforms contained no seedlings, so they are empty plots. the data looks like this:
data.frame': 60 obs. of 5 variables:
$ plot : Factor w/ 20 levels "HD2","LC1","LC2",..: 16 16 16 17 17 17 12 12 12 9 ...
$ subplot : Factor w/ 60 levels "HD2.1","HD2.2",..: 46 47 48 49 50 51 34 35 36 25 ...
$ av.canopy : num 92.2 92.2 92.2 92.3 92.3 ...
$ landform : Factor w/ 3 levels "abandoned","ridge",..: 2 2 2 2 2 2 2 2 2 2 ...
$ seedling.2016: int 6 7 5 2 5 4 4 6 4 0 ...
The problem is when I compared number of seedlings between landforms using this code:
seedling <- glmer(seedling.2016 ~ landform +(1|plot), family = poisson)
The result does not make sense for me-there were no any significant different between landforms event there is only one landform (ridge) that has seedlings, while other had no seedlings at al. It is also note that SEs are enormous.
The result looks like this:
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: poisson ( log )
Formula: seedling.2016 ~ landform + (1 | plot)
Data: streblus.subplots
AIC BIC logLik deviance df.resid
294.9 303.3 -143.5 286.9 56
Scaled residuals:
Min 1Q Median 3Q Max
-6.3619 -0.0001 -0.0001 0.0000 8.7305
Random effects:
Groups Name Variance Std.Dev.
plot (Intercept) 2.637 1.624
Number of obs: 60, groups: plot, 20
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -20.412 1461.267 -0.014 0.989
landformridge 22.250 1461.265 0.015 0.988
landformtemp 1.066 390.540 0.003 0.998
When I changed link function to square root as this code:
Seedling2 <- glmer(seedling.2016 ~ landform + (1|plot), family = poisson(link = sqrt))
Fixed effects:
#Estimate Std. Error z value Pr(>|z|)
#(Intercept) -1.220e-05 5.296e-01 0.000 1
#landformridge 3.250e+00 7.429e-01 4.376 1.21e-05 ***
# landformtemp 1.018e-05 7.795e-01 0.000 1
Now number of seedlings in ridge is significantly higher than the other, and it makes more sense to me.
My question is: Is it valid in term of statistics to use square root link with Poisson distribution, there are any better methods available with better ground of statistics?
lme4-nlme count-data glmm separation link-function
I have data from a field survey. The objective of the study is to relate number of seedling (respond variable, count data), landform (exploratory variable, categorical variable with 3 levels) and percent canopy coverage (exploratory variable, quantitative).
In each habitat, I have data from five 25x25 meter plots. Within each plot I used three 2x2 meter subplots nested within the bigger plot, and number of seedlings were count from these subplots. Total number of observations is 60; 20 plots x 3 subplots.
Only one kind of landform contained seedlings. Other two landforms contained no seedlings, so they are empty plots. the data looks like this:
data.frame': 60 obs. of 5 variables:
$ plot : Factor w/ 20 levels "HD2","LC1","LC2",..: 16 16 16 17 17 17 12 12 12 9 ...
$ subplot : Factor w/ 60 levels "HD2.1","HD2.2",..: 46 47 48 49 50 51 34 35 36 25 ...
$ av.canopy : num 92.2 92.2 92.2 92.3 92.3 ...
$ landform : Factor w/ 3 levels "abandoned","ridge",..: 2 2 2 2 2 2 2 2 2 2 ...
$ seedling.2016: int 6 7 5 2 5 4 4 6 4 0 ...
The problem is when I compared number of seedlings between landforms using this code:
seedling <- glmer(seedling.2016 ~ landform +(1|plot), family = poisson)
The result does not make sense for me-there were no any significant different between landforms event there is only one landform (ridge) that has seedlings, while other had no seedlings at al. It is also note that SEs are enormous.
The result looks like this:
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: poisson ( log )
Formula: seedling.2016 ~ landform + (1 | plot)
Data: streblus.subplots
AIC BIC logLik deviance df.resid
294.9 303.3 -143.5 286.9 56
Scaled residuals:
Min 1Q Median 3Q Max
-6.3619 -0.0001 -0.0001 0.0000 8.7305
Random effects:
Groups Name Variance Std.Dev.
plot (Intercept) 2.637 1.624
Number of obs: 60, groups: plot, 20
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -20.412 1461.267 -0.014 0.989
landformridge 22.250 1461.265 0.015 0.988
landformtemp 1.066 390.540 0.003 0.998
When I changed link function to square root as this code:
Seedling2 <- glmer(seedling.2016 ~ landform + (1|plot), family = poisson(link = sqrt))
Fixed effects:
#Estimate Std. Error z value Pr(>|z|)
#(Intercept) -1.220e-05 5.296e-01 0.000 1
#landformridge 3.250e+00 7.429e-01 4.376 1.21e-05 ***
# landformtemp 1.018e-05 7.795e-01 0.000 1
Now number of seedlings in ridge is significantly higher than the other, and it makes more sense to me.
My question is: Is it valid in term of statistics to use square root link with Poisson distribution, there are any better methods available with better ground of statistics?
lme4-nlme count-data glmm separation link-function
lme4-nlme count-data glmm separation link-function
edited 37 mins ago
kjetil b halvorsen
26.7k977195
26.7k977195
asked 2 hours ago
Ponlawat
163
163
add a comment |Â
add a comment |Â
2 Answers
2
active
oldest
votes
up vote
2
down vote
It looks very much like you have a case of complete separation:
there is only one landform (ridge) that has seedlings, while other had no seedlings at al
- large estimates ($|hat beta|>10$), and ridiculously large standard error estimates.
Basically what's happening is that the baseline level ("abandoned") has an expected number of counts equal to zero for all plots, so the intercept $beta_0$ - which is the expected log(counts) for the baseline level - should be estimated as $-infty$ ... which messes up the Wald estimation of the uncertainty (the approximate, fast method that summary()
uses).
You can read more about complete separation elsewhere; it is more typically discussed in the context of logistic regression (in part because logistic regression is more widely used than count regression ...)
Solutions:
- your square-root-link solution is reasonable (in this case the intercept is expected $sqrttextrmcounts$ in the baseline level, which is zero rather than $-infty$); it will change the assumed distribution of the random effects slightly (i.e., Normal on the square-root rather than on the log scale), but that wouldn't worry me too much. If you had continuous covariates or interactions in the model, it would also change the interpretation of the fixed effects.
- you could use some kind of penalization (most conveniently in a Bayesian framework), as described in my answer to the linked question (and here, search for "complete separation") to keep the parameters reasonable.
Ben, this does not affect your excellent answer, but should the final model (whatever it is) include a random effect for subplot as well?
â Isabella Ghement
39 mins ago
add a comment |Â
up vote
0
down vote
Indeed this seems to be a separation issue. To account for these cases, in my GLMMadaptive package you can include a penalty for the fixed-effects coefficients in the form of a Students-t density (i.e., for large enough df equivalent to ridge regression). For a worked example, have a look at the last section of this vignette.
add a comment |Â
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
2
down vote
It looks very much like you have a case of complete separation:
there is only one landform (ridge) that has seedlings, while other had no seedlings at al
- large estimates ($|hat beta|>10$), and ridiculously large standard error estimates.
Basically what's happening is that the baseline level ("abandoned") has an expected number of counts equal to zero for all plots, so the intercept $beta_0$ - which is the expected log(counts) for the baseline level - should be estimated as $-infty$ ... which messes up the Wald estimation of the uncertainty (the approximate, fast method that summary()
uses).
You can read more about complete separation elsewhere; it is more typically discussed in the context of logistic regression (in part because logistic regression is more widely used than count regression ...)
Solutions:
- your square-root-link solution is reasonable (in this case the intercept is expected $sqrttextrmcounts$ in the baseline level, which is zero rather than $-infty$); it will change the assumed distribution of the random effects slightly (i.e., Normal on the square-root rather than on the log scale), but that wouldn't worry me too much. If you had continuous covariates or interactions in the model, it would also change the interpretation of the fixed effects.
- you could use some kind of penalization (most conveniently in a Bayesian framework), as described in my answer to the linked question (and here, search for "complete separation") to keep the parameters reasonable.
Ben, this does not affect your excellent answer, but should the final model (whatever it is) include a random effect for subplot as well?
â Isabella Ghement
39 mins ago
add a comment |Â
up vote
2
down vote
It looks very much like you have a case of complete separation:
there is only one landform (ridge) that has seedlings, while other had no seedlings at al
- large estimates ($|hat beta|>10$), and ridiculously large standard error estimates.
Basically what's happening is that the baseline level ("abandoned") has an expected number of counts equal to zero for all plots, so the intercept $beta_0$ - which is the expected log(counts) for the baseline level - should be estimated as $-infty$ ... which messes up the Wald estimation of the uncertainty (the approximate, fast method that summary()
uses).
You can read more about complete separation elsewhere; it is more typically discussed in the context of logistic regression (in part because logistic regression is more widely used than count regression ...)
Solutions:
- your square-root-link solution is reasonable (in this case the intercept is expected $sqrttextrmcounts$ in the baseline level, which is zero rather than $-infty$); it will change the assumed distribution of the random effects slightly (i.e., Normal on the square-root rather than on the log scale), but that wouldn't worry me too much. If you had continuous covariates or interactions in the model, it would also change the interpretation of the fixed effects.
- you could use some kind of penalization (most conveniently in a Bayesian framework), as described in my answer to the linked question (and here, search for "complete separation") to keep the parameters reasonable.
Ben, this does not affect your excellent answer, but should the final model (whatever it is) include a random effect for subplot as well?
â Isabella Ghement
39 mins ago
add a comment |Â
up vote
2
down vote
up vote
2
down vote
It looks very much like you have a case of complete separation:
there is only one landform (ridge) that has seedlings, while other had no seedlings at al
- large estimates ($|hat beta|>10$), and ridiculously large standard error estimates.
Basically what's happening is that the baseline level ("abandoned") has an expected number of counts equal to zero for all plots, so the intercept $beta_0$ - which is the expected log(counts) for the baseline level - should be estimated as $-infty$ ... which messes up the Wald estimation of the uncertainty (the approximate, fast method that summary()
uses).
You can read more about complete separation elsewhere; it is more typically discussed in the context of logistic regression (in part because logistic regression is more widely used than count regression ...)
Solutions:
- your square-root-link solution is reasonable (in this case the intercept is expected $sqrttextrmcounts$ in the baseline level, which is zero rather than $-infty$); it will change the assumed distribution of the random effects slightly (i.e., Normal on the square-root rather than on the log scale), but that wouldn't worry me too much. If you had continuous covariates or interactions in the model, it would also change the interpretation of the fixed effects.
- you could use some kind of penalization (most conveniently in a Bayesian framework), as described in my answer to the linked question (and here, search for "complete separation") to keep the parameters reasonable.
It looks very much like you have a case of complete separation:
there is only one landform (ridge) that has seedlings, while other had no seedlings at al
- large estimates ($|hat beta|>10$), and ridiculously large standard error estimates.
Basically what's happening is that the baseline level ("abandoned") has an expected number of counts equal to zero for all plots, so the intercept $beta_0$ - which is the expected log(counts) for the baseline level - should be estimated as $-infty$ ... which messes up the Wald estimation of the uncertainty (the approximate, fast method that summary()
uses).
You can read more about complete separation elsewhere; it is more typically discussed in the context of logistic regression (in part because logistic regression is more widely used than count regression ...)
Solutions:
- your square-root-link solution is reasonable (in this case the intercept is expected $sqrttextrmcounts$ in the baseline level, which is zero rather than $-infty$); it will change the assumed distribution of the random effects slightly (i.e., Normal on the square-root rather than on the log scale), but that wouldn't worry me too much. If you had continuous covariates or interactions in the model, it would also change the interpretation of the fixed effects.
- you could use some kind of penalization (most conveniently in a Bayesian framework), as described in my answer to the linked question (and here, search for "complete separation") to keep the parameters reasonable.
answered 1 hour ago
Ben Bolker
21.1k15684
21.1k15684
Ben, this does not affect your excellent answer, but should the final model (whatever it is) include a random effect for subplot as well?
â Isabella Ghement
39 mins ago
add a comment |Â
Ben, this does not affect your excellent answer, but should the final model (whatever it is) include a random effect for subplot as well?
â Isabella Ghement
39 mins ago
Ben, this does not affect your excellent answer, but should the final model (whatever it is) include a random effect for subplot as well?
â Isabella Ghement
39 mins ago
Ben, this does not affect your excellent answer, but should the final model (whatever it is) include a random effect for subplot as well?
â Isabella Ghement
39 mins ago
add a comment |Â
up vote
0
down vote
Indeed this seems to be a separation issue. To account for these cases, in my GLMMadaptive package you can include a penalty for the fixed-effects coefficients in the form of a Students-t density (i.e., for large enough df equivalent to ridge regression). For a worked example, have a look at the last section of this vignette.
add a comment |Â
up vote
0
down vote
Indeed this seems to be a separation issue. To account for these cases, in my GLMMadaptive package you can include a penalty for the fixed-effects coefficients in the form of a Students-t density (i.e., for large enough df equivalent to ridge regression). For a worked example, have a look at the last section of this vignette.
add a comment |Â
up vote
0
down vote
up vote
0
down vote
Indeed this seems to be a separation issue. To account for these cases, in my GLMMadaptive package you can include a penalty for the fixed-effects coefficients in the form of a Students-t density (i.e., for large enough df equivalent to ridge regression). For a worked example, have a look at the last section of this vignette.
Indeed this seems to be a separation issue. To account for these cases, in my GLMMadaptive package you can include a penalty for the fixed-effects coefficients in the form of a Students-t density (i.e., for large enough df equivalent to ridge regression). For a worked example, have a look at the last section of this vignette.
answered 23 mins ago
Dimitris Rizopoulos
2,630211
2,630211
add a comment |Â
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f373887%2fglmm-for-count-data-using-square-root-link-in-lme4%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password