R: Error in contrasts when fitting linear models with `lm`


I found the question "Error in contrasts when defining a linear model in R" and followed the suggestions there, but none of my factor variables takes on only one value, and I am still getting the same error.
This is the dataset I'm using: https://www.dropbox.com/s/em7xphbeaxykgla/train.csv?dl=0.
This is the code I'm trying to run:
simplelm <- lm(log_SalePrice ~ ., data = train)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels
What is the issue?
Welcome to Stack Overflow! In the future, please do not use cloud links to files, as they are considered a security risk. The website policy is to instead use built-in data, public data sets, simulated data, etc.
– Hack-R
May 19 at 0:30
Glancing at your data, both the Utilities and the PoolQC columns look pretty 1-level (didn't scroll very much though...)
– Gregor
May 19 at 0:31
@Hack-R Got it.
– Display_name_placeholder
May 20 at 2:36
2 Answers
The error pretty much describes the problem. The bad data in question is in your 9th column (`Utilities`).
The column in question has too little variation.
table(train$Utilities)
# AllPub NoSeWa
#   1459      1
log_SalePrice <- train$log_SalePrice
## drop the 9th column (Utilities), then refit
train[,9] <- NULL
simplelm <- lm(log_SalePrice ~ ., data = train)
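A possible reason the check in the question did not flag this column: `lm` fits only on rows without missing values, so a factor with two values in the raw data can end up with one value in the rows actually used. The sketch below illustrates this on simulated data; the data frame `toy` and its columns are made up for illustration and are not taken from the asker's file.

```r
# Simulated data (hypothetical): `util` has two values overall,
# but the only "NoSeWa" row also has a missing numeric value.
set.seed(1)
toy <- data.frame(
  y    = rnorm(6),
  x    = c(NA, 1, 2, 3, 4, 5),
  util = factor(c("NoSeWa", rep("AllPub", 5)))
)

# Two distinct values in the raw data
n_raw <- length(unique(toy$util))

# lm() drops incomplete rows first (the default na.action is na.omit),
# so only one value is left among the rows it actually uses
used   <- toy[complete.cases(toy), ]
n_used <- length(unique(used$util))

# Fitting on the full data therefore fails with the contrasts error
fit <- try(lm(y ~ ., data = toy), silent = TRUE)
inherits(fit, "try-error")
```

Counting unique values on the complete rows, rather than on the raw data, is what reveals the problem here.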
Thanks for providing your dataset (I hope that link will forever be valid so that everyone can access it). I read it into a data frame `train`.
Using the `debug_contr_error`, `debug_contr_error2` and `NA_preproc` helper functions provided by How to debug "contrasts can be applied only to factors with 2 or more levels" error?, we can easily analyze the problem.
info <- debug_contr_error2(log_SalePrice ~ ., train)
## the data frame that is actually used by `lm`
dat <- info$mf
## number of cases in your dataset
nrow(train)
#[1] 1460
## number of complete cases used by `lm`
nrow(dat)
#[1] 1112
## number of levels for all factor variables in `dat`
info$nlevels
# MSZoning Street Alley LotShape LandContour
# 4 2 3 4 4
# Utilities LotConfig LandSlope Neighborhood Condition1
# 1 5 3 25 9
# Condition2 BldgType HouseStyle RoofStyle RoofMatl
# 6 5 8 5 7
# Exterior1st Exterior2nd MasVnrType ExterQual ExterCond
# 14 16 4 4 4
# Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
# 6 5 5 5 7
# BsmtFinType2 Heating HeatingQC CentralAir Electrical
# 7 5 5 2 5
# KitchenQual Functional FireplaceQu GarageType GarageFinish
# 4 6 6 6 3
# GarageQual GarageCond PavedDrive PoolQC Fence
# 5 5 3 4 5
# MiscFeature SaleType SaleCondition MiscVal_bool MoYrSold
# 4 9 6 2 55
As you can see, `Utilities` is the offending variable here as it has only 1 level.
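If the helper functions are not at hand, a similar per-variable count can be approximated in base R. This is only a rough sketch and not the implementation of `debug_contr_error2`; the function name `levels_used` and the data frame `toy` are hypothetical.

```r
# Count distinct values of each non-numeric variable among the rows
# that a model fit would keep (hypothetical helper, base R only).
levels_used <- function(formula, data) {
  # model.frame() with na.omit removes rows that have NA in any model variable
  mf <- model.frame(formula, data = data, na.action = na.omit)
  non_num <- mf[!vapply(mf, is.numeric, logical(1))]
  vapply(non_num, function(x) length(unique(x)), integer(1))
}

# Simulated example data (not the asker's dataset)
toy <- data.frame(
  y    = c(1.2, 0.7, 3.1, 2.4, 1.9, 2.8),
  x    = c(NA, 1, 2, 3, 4, 5),
  util = c("NoSeWa", rep("AllPub", 5)),
  zone = c("A", "B", "A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

counts <- levels_used(y ~ ., toy)
counts

# Variables that would trigger the contrasts error
names(counts)[counts < 2]
```

Any variable reported with fewer than 2 values is a candidate for removal or recoding before calling `lm`.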
Since you have many character / factor variables in `train`, I wonder whether you have missing values (`NA`) for them. If we add `NA` as a valid level, we could possibly get more complete cases.
new_train <- NA_preproc(train)
new_info <- debug_contr_error2(log_SalePrice ~ ., new_train)
new_dat <- new_info$mf
nrow(new_dat)
#[1] 1121
new_info$nlevels
# MSZoning Street Alley LotShape LandContour
# 5 2 3 4 4
# Utilities LotConfig LandSlope Neighborhood Condition1
# 1 5 3 25 9
# Condition2 BldgType HouseStyle RoofStyle RoofMatl
# 6 5 8 5 7
# Exterior1st Exterior2nd MasVnrType ExterQual ExterCond
# 14 16 4 4 4
# Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
# 6 5 5 5 7
# BsmtFinType2 Heating HeatingQC CentralAir Electrical
# 7 5 5 2 6
# KitchenQual Functional FireplaceQu GarageType GarageFinish
# 4 6 6 6 3
# GarageQual GarageCond PavedDrive PoolQC Fence
# 5 5 3 4 5
# MiscFeature SaleType SaleCondition MiscVal_bool MoYrSold
# 4 9 6 2 55
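Treating `NA` as a level of its own can be sketched in base R with `addNA`. This is an illustration of the general idea on made-up data and may differ from what `NA_preproc` does internally; the column name `grp` is hypothetical.

```r
# Simulated data (hypothetical): half of the `grp` values are missing
toy <- data.frame(
  y   = c(1.0, 2.0, 1.5, 2.5, 3.0, 0.5),
  grp = c("Grvl", NA, "Pave", NA, "Grvl", NA),
  stringsAsFactors = FALSE
)

# Without preprocessing, rows with a missing `grp` are dropped
n_before <- nrow(model.frame(y ~ ., data = toy, na.action = na.omit))

# addNA() turns NA into an explicit factor level, so those rows
# no longer count as incomplete
toy2 <- toy
toy2$grp <- addNA(factor(toy2$grp))
n_after <- nrow(model.frame(y ~ ., data = toy2, na.action = na.omit))

c(before = n_before, after = n_after)
levels(toy2$grp)
```

Whether a separate "missing" category is statistically sensible depends on why the values are missing, so this is a modelling decision rather than a purely technical fix.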
We do get more complete cases, but `Utilities` still has one level. This means that most incomplete cases are actually caused by `NA` in your numerical variables, about which we can do nothing (unless you have a statistically valid way to impute those missing values).
As you only have one single-level factor variable, the same method as given in How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"? will work.
new_dat$Utilities <- 1
simplelm <- lm(log_SalePrice ~ 0 + ., data = new_dat)
The model now runs successfully. However, it is rank-deficient. You probably want to address this, but it is fine to leave it as it is.
b <- coef(simplelm)
length(b)
#[1] 301
sum(is.na(b))
#[1] 9
simplelm$rank
#[1] 292
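In the output above, 301 coefficients minus 9 `NA` values gives the rank of 292. To see which terms were dropped, one can list the names of the `NA` coefficients. The following is a minimal sketch on simulated data, where `x3` is deliberately constructed as an exact linear combination of the other predictors; all names are made up for illustration.

```r
# Simulated data: x3 is an exact linear combination of x1 and x2
set.seed(42)
toy <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
toy$x3 <- toy$x1 + toy$x2
toy$y  <- 1 + 2 * toy$x1 - toy$x2 + rnorm(20)

fit <- lm(y ~ x1 + x2 + x3, data = toy)
b   <- coef(fit)

# Coefficients that could not be estimated are reported as NA
dropped <- names(b)[is.na(b)]
dropped

# The rank equals the number of estimable coefficients
fit$rank == sum(!is.na(b))
```

Applied to the model above, `names(b)[is.na(b)]` would list the 9 aliased terms, which can help decide whether any predictors are redundant.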
What makes you think none of your factors only have one value? I don't want to download, import, and inspect your data set, but could you post the output of `sapply(train[!sapply(train, is.numeric)], function(x) length(unique(x)))`?
– Gregor
May 19 at 0:28