PySpark count values by condition

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP

PySpark count values by condition



I have a DataFrame, a snippet here:


[['u1', 1], ['u2', 0]]



basically one string ('f') and either a 1 or a 0 for second element ('is_fav').



What I need to do is grouping on the first field and counting the occurrences of 1s and 0s. I was hoping to do something like


num_fav = count((col("is_fav") == 1)).alias("num_fav")

num_nonfav = count((col("is_fav") == 0)).alias("num_nonfav")

df.groupBy("f").agg(num_fav, num_nonfav)



It does not work properly, I get in both cases the same result which amounts to the count for the items in the group, so the filter (whether it is a 1 or a 0) seems to be ignored. Does this depend on how count works?


count




1 Answer
1



There is no filter here. Both col("is_fav") == 1 and col("is_fav") == 0) are just boolean expressions and count doesn't really care about their value as long as it is defined.


col("is_fav") == 1


col("is_fav") == 0)


count



There are many ways you can solve this for example by using simple sum:


sum


from pyspark.sql.functions import sum, abs

gpd = df.groupBy("f")
gpd.agg(
sum("is_fav").alias("fv"),
(count("is_fav") - sum("is_fav")).alias("nfv")
)



or making ignored values undefined (a.k.a NULL):


NULL


exprs = [
count(when(col("is_fav") == x, True)).alias(c)
for (x, c) in [(1, "fv"), (0, "nfv")]
]
gpd.agg(*exprs)





I get a TypeError: unsupported operand type(s) for +: 'int' and 'str' on the first solution, in sum. I'm sure the column 'is_fav' contains an IntegerType so I don't understand?
– mar tin
Mar 18 '16 at 8:20





Missing imports or some conflicts with sum function? I mean you use built-in sum not functions.sum.
– zero323
Mar 18 '16 at 8:27



sum


sum


functions.sum





Yes that was it, missing import. I guess it was confusing with the built-in Python "sum".
– mar tin
Mar 18 '16 at 11:22






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Comments

Popular posts from this blog

Executable numpy error

Trying to Print Gridster Items to PDF without overlapping contents