rdd.countApprox taking as long as count()




My code looks like this:


foo.rdd.countApprox(1000, 0.9) => takes 7.1 minutes
foo.count() => takes 7.1 minutes



Is there anything I am missing? foo is a DataFrame, and I am trying to reduce the time it takes to count() the number of records in foo.



As the two timing screenshots show, in some cases it takes more time and in other cases it takes less. I am confused :(





Seems like you should be comparing the first to foo.rdd.count()
– pault
8 mins ago
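One way to act on this comment is to time all three calls on the same DataFrame, so the RDD-based approximate count is compared with the RDD-based exact count as well as with the DataFrame count. The sketch below is only an illustration: the helper names `timed` and `compare_counts` are hypothetical, and it assumes `df` is a PySpark DataFrame exposing `count()`, `rdd.count()`, and `rdd.countApprox(timeout, confidence)` as used in the question. A single run per call gives a rough picture only, since caching and cluster load can shift the numbers.

```python
import time


def timed(label, fn):
    """Run fn() once and return (label, result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return label, result, time.perf_counter() - start


def compare_counts(df, timeout_ms=1000, confidence=0.9):
    """Time the three count variants on the same DataFrame.

    timeout_ms and confidence mirror the arguments from the question;
    they are passed straight through to rdd.countApprox.
    """
    rdd = df.rdd  # converting to an RDD is itself part of what gets measured
    return [
        timed("df.count()", df.count),
        timed("df.rdd.count()", rdd.count),
        timed("df.rdd.countApprox()",
              lambda: rdd.countApprox(timeout_ms, confidence)),
    ]


# Hypothetical usage, assuming `foo` is the DataFrame from the question:
# for label, result, seconds in compare_counts(foo):
#     print(f"{label}: result={result}, took {seconds:.1f}s")
```

If `df.rdd.count()` and `df.rdd.countApprox()` come out close to each other while `df.count()` differs, that would suggest the cost sits in the RDD path rather than in the approximation itself. This is something to check against the measurements, not a conclusion.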










