pyspark OneHotEncoded vectors appear to be missing categories?




I'm seeing a weird problem when trying to generate one-hot encoded vectors for categorical features using pyspark's OneHotEncoder (https://spark.apache.org/docs/2.1.0/ml-features.html#onehotencoder), where it seems like the one-hot vectors are missing some categories (or maybe are just formatted oddly when displayed?).





Having now answered this question (or at least provided an answer), it appears that the details below are not strictly necessary for understanding the problem.



I have a dataset of the form


1. Wife's age (numerical)
2. Wife's education (categorical) 1=low, 2, 3, 4=high
3. Husband's education (categorical) 1=low, 2, 3, 4=high
4. Number of children ever born (numerical)
5. Wife's religion (binary) 0=Non-Islam, 1=Islam
6. Wife's now working? (binary) 0=Yes, 1=No
7. Husband's occupation (categorical) 1, 2, 3, 4
8. Standard-of-living index (categorical) 1=low, 2, 3, 4=high
9. Media exposure (binary) 0=Good, 1=Not good
10. Contraceptive method used (class attribute) 1=No-use, 2=Long-term, 3=Short-term



with the actual data looking like


wife_age,wife_edu,husband_edu,num_children,wife_religion,wife_working,husband_occupation,SoL_index,media_exposure,contraceptive
24,2,3,3,1,1,2,3,0,1
45,1,3,10,1,1,3,4,0,1



sourced from here: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice.
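For reference, a minimal way to get this CSV into a Spark DataFrame (a sketch; the local file name cmc.csv and the presence of the header row shown above are both assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# header=True picks up the column names; inferSchema=True reads the
# numeric columns as numbers rather than strings
dataset = spark.read.csv('cmc.csv', header=True, inferSchema=True)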



After doing some other preprocessing on the data, I try to encode the categorical and binary features (the binary ones just for the sake of practice) into one-hot vectors via...


from pyspark.ml.feature import OneHotEncoder

for inds in ['wife_edu', 'husband_edu', 'husband_occupation', 'SoL_index',
             'wife_religion', 'wife_working', 'media_exposure', 'contraceptive']:
    encoder = OneHotEncoder(inputCol=inds, outputCol='%s_1hot' % inds)
    dataset = encoder.transform(dataset)



produces a row that looks like


Row(
    ....,
    numeric_features=DenseVector([24.0, 3.0]),
    numeric_features_normalized=DenseVector([-1.0378, -0.1108]),
    wife_edu_1hot=SparseVector(4, {2: 1.0}),
    husband_edu_1hot=SparseVector(4, {3: 1.0}),
    husband_occupation_1hot=SparseVector(4, {2: 1.0}),
    SoL_index_1hot=SparseVector(4, {3: 1.0}),
    wife_religion_1hot=SparseVector(1, {0: 1.0}),
    wife_working_1hot=SparseVector(1, {0: 1.0}),
    media_exposure_1hot=SparseVector(1, {0: 1.0}),
    contraceptive_1hot=SparseVector(2, {0: 1.0})
)



My understanding of the sparse vector format is that SparseVector(S, {i1: v1, i2: v2, ..., in: vn}) implies a vector of length S where all values are 0 except for the indices i1,...,in, which have the corresponding values v1,...,vn (https://www.cs.umd.edu/Outreach/hsContest99/questions/node3.html).
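In pyspark terms, such a vector can be built by hand and expanded to a dense array to check the format (a minimal sketch):

from pyspark.ml.linalg import SparseVector

sv = SparseVector(4, {2: 1.0})
print(sv.toArray())  # [0. 0. 1. 0.]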





Based on this, it seems like the first argument of SparseVector in this case actually denotes the highest index in the vector (not its size). Furthermore, after combining all the features (via pyspark's VectorAssembler), checking the array version of the resulting dataset.head(n=1) vector shows






input_features=SparseVector(23, {0: -1.0378, 1: -0.1108, 4: 1.0, 9: 1.0, 12: 1.0, 17: 1.0, 18: 1.0, 19: 1.0, 20: 1.0, 21: 1.0})

indicates a vector looking like

indices: 0 1 2 3 4 ... 9 ... 12 ... 17 18 19 20 21 22
[-1.0378, -0.1108, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
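For reference, this dense rendering can be reproduced directly from the assembled column (a sketch, reusing the input_features name from above):

row = dataset.head()  # first Row of the assembled dataset
print(row.input_features.toArray())  # dense numpy array of length 23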



I would think it should be impossible to have a run of three or more consecutive 1s (as seen near the tail of the vector above), since that would imply that one of the one-hot vectors (e.g. the middle 1) has size 1, which would not make sense for any of the data features.



I'm very new to machine learning, so I may be confused about some basic concepts here, but does anyone know what could be going on?




1 Answer



Found this in the pyspark docs (https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder):



...with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
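The quoted behavior is easy to reproduce on a toy DataFrame (a minimal sketch; spark is assumed to be an active SparkSession):

from pyspark.ml.feature import OneHotEncoder

df = spark.createDataFrame([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)], ['category'])

# Default dropLast=True: 5 categories yield size-4 vectors, and the last
# category (4.0) maps to the all-zeros vector
OneHotEncoder(inputCol='category', outputCol='onehot').transform(df).show()

# dropLast=False keeps all 5 positions
OneHotEncoder(inputCol='category', outputCol='onehot', dropLast=False).transform(df).show()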



More discussion about why this kind of last-category-dropping would be done can be found here (http://www.algosome.com/articles/dummy-variable-trap-regression.html) and here (https://stats.stackexchange.com/q/290526/167299).



I am pretty new to machine learning of any kind, but as I understand it, the last category is dropped to avoid something called the dummy variable trap, where "the independent variables are multicollinear - a scenario in which two or more variables are highly correlated; in simple terms one variable can be predicted from the others". In other words, you would have a redundant feature, which I assume is not good when fitting the weights of an ML model.





E.g. you don't need a one-hot encoding of [isBoy, isGirl] when an encoding of just [isBoy] communicates the same information about someone's gender.





This link (http://www.algosome.com/articles/dummy-variable-trap-regression.html) provides a good example, with the conclusion being



The solution to the dummy variable trap is to drop one of the categorical variables (or alternatively, drop the intercept constant) - if there are m number of categories, use m-1 in the model, the value left out can be thought of as the reference value and the fit values of the remaining categories represent the change from this reference.
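That linear dependence is easy to see numerically (a sketch using numpy, which the post itself doesn't use - purely illustrative):

import numpy as np

# 6 samples of a 4-category feature, fully one-hot encoded
onehot = np.eye(4)[[0, 1, 2, 3, 0, 2]]

# With an intercept column, the 4 dummy columns sum to the intercept column,
# so the design matrix has linearly dependent columns (the dummy variable trap)
X_full = np.column_stack([np.ones(6), onehot])           # intercept + 4 dummies
X_drop = np.column_stack([np.ones(6), onehot[:, :-1]])   # intercept + m-1 dummies

print(np.linalg.matrix_rank(X_full))  # 4 < 5 columns -> multicollinear
print(np.linalg.matrix_rank(X_drop))  # 4 == 4 columns -> full rank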



Note: While looking for an answer to the original question, I found this similar SO post (Why does Spark's OneHotEncoder drop the last category by default?). Still, I think the present post is worth keeping, since that one asks why this behavior happens, whereas this one is about being confused by what was going on in the first place; also, this question's title does not surface that post when pasted into Google.





