Searching for hashtags using Apache Pig




I am trying to determine the top 10 hashtags in a text file containing tweets in the following format:


USER_79321756 2010-03-05T04:48:05 ÜT: 47.528139,-122.197916 47.528139 -122.197916 Just talkin too for real. Ha.
USER_79321756 2010-03-05T20:25:56 ÜT: 47.528139,-122.197916 47.528139 -122.197916 RT @USER_620cd4b9: @USER_79321756 hey now! Leave me, and my big eyes alone LOL>>lol NO! :*
USER_4659ef22 2010-03-06T05:50:54 ÜT: 40.816206,-73.894429 40.816206 -73.894429 But where's @USER_55e0f4ff?? Hmmm shawty where u at?
USER_064b120e 2010-03-03T18:56:49 ÜT: 34.223957,-118.600448 34.223957 -118.600448 @USER_4a4d09c2 the ludacris one . have you heard it , he got off on that one .



I came up with the following snippet to do so.



CODE:


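-- Load the tweets; the default loader (PigStorage) splits fields on tabs.
a = load '/user/lab/pig/full_text_small.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
-- Lower-case each tweet and emit one row per token.
b = foreach a generate tweet, FLATTEN(TOKENIZE(LOWER(tweet))) as tokens;
-- Keep only the tokens that begin with '#'.
c = filter b by STARTSWITH(tokens,'#');
-- Count the occurrences of each distinct token.
d = group c by tokens;
e = foreach d generate group as tokens, COUNT(c) as cnt;
-- Sort by count, highest first, and keep the top 10.
f = order e by cnt desc;
g = limit f 10;
dump g;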



This gives the result shown below.



RESULT:


(#ff, 55)
(#inhighschool, 25)
...
...
...
...
...
...
(#random, 9)
(#mewithoutyouislike, 7)



I have included an image of the output as well.

[Image: Output showing top 10 hashtags]



However, if I open the text file (full_text_small.txt) containing the tweets in a text editor and search for the hashtag "#ff" (case-insensitive), I get a total count of 61, not 55. Similarly, the editor counts for all the other hashtags in the output differ from those reported by Pig.
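
One possible contributor to the gap, stated as an assumption rather than something verified against this file: TOKENIZE splits only on spaces, double quotes, commas, parentheses and asterisks, so a token such as "#ff!" or "#ff." keeps its punctuation and is grouped separately from "#ff", whereas an editor search for "#ff" counts every occurrence of that substring, including longer tags that merely start with it. The sketch below keeps only the leading '#word' part of each token before grouping. The aliases token and tag and the character class [a-z0-9_] are illustrative choices, not part of the original script, and the resulting counts may still not equal the editor's.

CODE:

-- Sketch only: same input path and schema as the snippet above.
a = load '/user/lab/pig/full_text_small.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
b = foreach a generate FLATTEN(TOKENIZE(LOWER(tweet))) as token;
-- Extract the leading '#word' part, so '#ff!' and '#ff' both become '#ff'.
-- REGEX_EXTRACT returns null when the token does not start with '#word'.
c = foreach b generate REGEX_EXTRACT(token, '^(#[a-z0-9_]+)', 1) as tag;
d = filter c by tag is not null;
e = group d by tag;
f = foreach e generate group as tag, COUNT(d) as cnt;
g = order f by cnt desc;
h = limit g 10;
dump h;

Comparing these counts with those from the STARTSWITH version would show how many hashtag occurrences carried trailing punctuation.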



Further, when I use a different matching technique, namely the one shown below, I get a slightly different result.



CODE:


a = load '/user/lab/pig/full_text_small.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
b = foreach a generate tweet, FLATTEN(TOKENIZE(LOWER(tweet))) as tokens;
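-- MATCHES must match the whole token, so tokens with trailing punctuation (e.g. '#ff!') are dropped.
c = filter b by tokens MATCHES '#\s*(\w+)';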
d = group c by tokens;
e = foreach d generate group as tokens, COUNT(c) as cnt;
f = order e by cnt desc;
g = limit f 10;
dump g;



RESULT:


(#ff, 55)
(#inhighschool, 25)
...
...
...
...
...
...
(#random, 9)
(#realgrandmas, 7)



Image of the output of the second code snippet:

[Image: Second output]



The two outputs are identical except for the last entry: (#mewithoutyouislike, 7) from the first snippet and (#realgrandmas, 7) from the second.
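
Both differing entries have a count of 7, so the difference may come from how rows with equal counts are ordered rather than from the filter itself. That is an assumption worth ruling out, not a confirmed cause. A minimal sketch that adds the token as a secondary sort key, so that tied rows come out in a consistent order:

CODE:

-- Sketch only: same pipeline as the first snippet, with a tie-breaker in the ORDER step.
a = load '/user/lab/pig/full_text_small.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
b = foreach a generate FLATTEN(TOKENIZE(LOWER(tweet))) as tokens;
c = filter b by STARTSWITH(tokens,'#');
d = group c by tokens;
e = foreach d generate group as tokens, COUNT(c) as cnt;
-- Sort by count (highest first), then alphabetically by token to break ties.
f = order e by cnt desc, tokens asc;
g = limit f 10;
dump g;

If both filters then return the same tenth entry, the earlier difference was only a matter of tie ordering.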



My questions are the following:








