R: Error in contrasts when fitting linear models with `lm`




I found the question "Error in contrasts when defining a linear model in R" and followed the suggestions there, but none of my factor variables takes on only one value, and I still get the same error.



This is the dataset I'm using: https://www.dropbox.com/s/em7xphbeaxykgla/train.csv?dl=0.



This is the code I'm trying to run:


simplelm <- lm(log_SalePrice ~ ., data = train)

#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels



What is the issue?





What makes you think none of your factors only have one value? I don't want to download, import, and inspect your data set, but could you post the output of sapply(train[!sapply(train, is.numeric)], function(x) length(unique(x)))?
– Gregor
May 19 at 0:28







Welcome to Stack Overflow! In the future, please do not use cloud links to files, as they are considered a security risk. The website policy is to instead use built-in data, public data sets, simulated data, etc.
– Hack-R
May 19 at 0:30





Glancing at your data, both the Utilities and the PoolQC columns look pretty 1-level (didn't scroll very much though...)
– Gregor
May 19 at 0:31





@Hack-R Got it.
– Display_name_placeholder
May 20 at 2:36




2 Answers



The error pretty much describes the problem. The bad data in question is in your 9th column (Utilities).





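The column in question has too little variation: every row but one is AllPub, so if lm drops that one row because of missing values elsewhere, only a single level remains.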


table(train$Utilities)
#AllPub NoSeWa
#  1459      1

## drop the offending column by name (it is the 9th column in this data)
train$Utilities <- NULL
simplelm <- lm(log_SalePrice ~ ., data = train)
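
To see why a factor that shows two values in the raw data can still trigger the error, here is a minimal sketch on simulated data. The data frame `toy` and its columns are made up for illustration and are not taken from train.csv. The only row carrying the rare level also has a missing value in another column, so lm drops it and the factor is left with a single level.

```r
set.seed(1)
n <- 20
toy <- data.frame(
  y    = rnorm(n),
  x    = rnorm(n),
  util = factor(c(rep("AllPub", n - 1), "NoSeWa"))
)
toy$x[n] <- NA   # the only "NoSeWa" row is incomplete

table(toy$util)  # both levels are present in the raw data

## lm() drops incomplete rows and unused levels before building contrasts;
## capture the error message instead of stopping
res <- tryCatch(lm(y ~ ., data = toy),
                error = function(e) conditionMessage(e))
res

## number of levels actually observed among the complete cases
cc <- toy[complete.cases(toy), ]
n_used <- nlevels(droplevels(cc$util))
n_used

## dropping the column by name lets the fit go through
toy$util <- NULL
fit <- lm(y ~ ., data = toy)
```

Counting levels on the complete cases, rather than on the raw column, is what reveals the problem.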



Thanks for providing your dataset (I hope that link will remain valid so that everyone can access it). I read it into a data frame train.





Using the debug_contr_error, debug_contr_error2 and NA_preproc helper functions provided by How to debug "contrasts can be applied only to factors with 2 or more levels" error?, we can easily analyze the problem.




info <- debug_contr_error2(log_SalePrice ~ ., train)

## the data frame that is actually used by `lm`
dat <- info$mf

## number of cases in your dataset
nrow(train)
#[1] 1460

## number of complete cases used by `lm`
nrow(dat)
#[1] 1112

## number of levels for all factor variables in `dat`
info$nlevels
#      MSZoning        Street         Alley      LotShape   LandContour
#             4             2             3             4             4
#     Utilities     LotConfig     LandSlope  Neighborhood    Condition1
#             1             5             3            25             9
#    Condition2      BldgType    HouseStyle     RoofStyle      RoofMatl
#             6             5             8             5             7
#   Exterior1st   Exterior2nd    MasVnrType     ExterQual     ExterCond
#            14            16             4             4             4
#    Foundation      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1
#             6             5             5             5             7
#  BsmtFinType2       Heating     HeatingQC    CentralAir    Electrical
#             7             5             5             2             5
#   KitchenQual    Functional   FireplaceQu    GarageType  GarageFinish
#             4             6             6             6             3
#    GarageQual    GarageCond    PavedDrive        PoolQC         Fence
#             5             5             3             4             5
#   MiscFeature      SaleType SaleCondition  MiscVal_bool      MoYrSold
#             4             9             6             2            55



As you can see, Utilities is the offending variable here as it has only 1 level.
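
If the helper functions are not at hand, a rough base-R approximation of the same check is to count the distinct values of each factor or character column among the complete cases. The function name `count_levels_used` and the toy data below are hypothetical. The sketch also assumes a formula of the form `y ~ .` with no transformations, so that the complete cases of the whole data frame match the rows lm keeps.

```r
## count distinct values per categorical column, using complete cases only
count_levels_used <- function(df) {
  cc <- df[complete.cases(df), , drop = FALSE]
  is_cat <- vapply(cc, function(x) is.factor(x) || is.character(x), logical(1))
  vapply(cc[is_cat], function(x) length(unique(x)), integer(1))
}

## made-up data: the only "b" in f1 sits on the row with a missing `num`
toy <- data.frame(
  y   = c(1.2, 0.7, 3.1, 2.2, 1.9, 2.8),
  num = c(1, 2, NA, 4, 5, 6),
  f1  = factor(c("a", "a", "b", "a", "a", "a")),
  f2  = c("u", "v", "u", "v", "u", "v"),
  stringsAsFactors = FALSE
)

lev <- count_levels_used(toy)
lev

## columns that would trigger the contrasts error
bad <- names(lev)[lev < 2]
bad
```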





Since you have many character / factor variables in train, I wonder whether some of them contain missing values (NA). If we treat NA as a valid level, we could possibly get more complete cases.




new_train <- NA_preproc(train)

new_info <- debug_contr_error2(log_SalePrice ~ ., new_train)

new_dat <- new_info$mf

nrow(new_dat)
#[1] 1121

new_info$nlevels
#      MSZoning        Street         Alley      LotShape   LandContour
#             5             2             3             4             4
#     Utilities     LotConfig     LandSlope  Neighborhood    Condition1
#             1             5             3            25             9
#    Condition2      BldgType    HouseStyle     RoofStyle      RoofMatl
#             6             5             8             5             7
#   Exterior1st   Exterior2nd    MasVnrType     ExterQual     ExterCond
#            14            16             4             4             4
#    Foundation      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1
#             6             5             5             5             7
#  BsmtFinType2       Heating     HeatingQC    CentralAir    Electrical
#             7             5             5             2             6
#   KitchenQual    Functional   FireplaceQu    GarageType  GarageFinish
#             4             6             6             6             3
#    GarageQual    GarageCond    PavedDrive        PoolQC         Fence
#             5             5             3             4             5
#   MiscFeature      SaleType SaleCondition  MiscVal_bool      MoYrSold
#             4             9             6             2            55



We do get more complete cases, but Utilities still has one level. This means that most incomplete cases are actually caused by NA in your numerical variables, about which we can do nothing (unless you have a statistically valid way to impute those missing values).
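
The idea of treating NA as a level can be sketched in base R with `addNA()`. This is not the `NA_preproc` helper itself, only an illustration of the same idea on made-up data; `na_as_level` is a hypothetical name.

```r
## turn NA in character / factor columns into an explicit level;
## numeric columns are left untouched
na_as_level <- function(df) {
  for (nm in names(df)) {
    x <- df[[nm]]
    if (is.character(x)) x <- factor(x)
    if (is.factor(x) && anyNA(x)) {
      x <- addNA(x)
      ## relabel the NA level so it behaves like any other level
      levels(x)[is.na(levels(x))] <- "<NA>"
    }
    df[[nm]] <- x
  }
  df
}

toy <- data.frame(
  y     = c(1, 2, 3, 4, 5, 6),
  alley = c("Grvl", NA, "Pave", NA, "Grvl", NA),
  area  = c(10, 20, 30, NA, 50, 60),
  stringsAsFactors = FALSE
)

new_toy  <- na_as_level(toy)
n_before <- sum(complete.cases(toy))      # NA in `alley` or `area` drops a row
n_after  <- sum(complete.cases(new_toy))  # only NA in `area` drops a row
c(before = n_before, after = n_after)
```

Rows that are incomplete only because of a categorical NA are recovered; rows with a missing numeric value are still dropped, which matches the pattern seen above.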





As you only have one single-level factor variable, the same method as given in How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"? will work.


new_dat$Utilities <- 1

simplelm <- lm(log_SalePrice ~ 0 + ., data = new_dat)
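
As I read it, the reason this works is that once the single-level factor is replaced by the numeric constant 1, that column plays the role of the intercept, and `0 +` removes the explicit intercept so it is not duplicated. A small check on simulated data (all names here are illustrative, not from the dataset):

```r
set.seed(42)
d <- data.frame(y = rnorm(30), x = rnorm(30))

## ordinary fit with an intercept
fit_int <- lm(y ~ x, data = d)

## same fit, with a constant column standing in for the intercept
d$const <- 1
fit_const <- lm(y ~ 0 + ., data = d)

coef(fit_int)
coef(fit_const)

## the coefficient of `const` should match the intercept
same <- all.equal(unname(coef(fit_int)[c("x", "(Intercept)")]),
                  unname(coef(fit_const)[c("x", "const")]))
same
```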



The model now runs successfully. However, it is rank-deficient: some coefficients cannot be estimated and are reported as NA. You may want to address this, but leaving the model as it is is also fine.


b <- coef(simplelm)

length(b)
#[1] 301

sum(is.na(b))
#[1] 9

simplelm$rank
#[1] 292
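
For readers unfamiliar with rank deficiency, here is a minimal sketch on simulated data showing where the NA coefficients come from. The variables are made up; `x3` is deliberately an exact linear combination of `x1` and `x2`.

```r
set.seed(7)
d <- data.frame(x1 = rnorm(25), x2 = rnorm(25))
d$x3 <- d$x1 + d$x2                      # exact linear combination
d$y  <- 1 + 2 * d$x1 - d$x2 + rnorm(25)

fit <- lm(y ~ x1 + x2 + x3, data = d)

b <- coef(fit)
na_terms <- names(b)[is.na(b)]           # coefficients lm could not estimate
na_terms

fit$rank                                 # fewer than length(b)

## alias() reports which terms are linear combinations of the others
alias(fit)
```

In the real model, inspecting `names(b)[is.na(b)]` in the same way would list the 9 aliased terms, which is a starting point for deciding whether to drop or merge them.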





