Aggregate / summarize multiple variables per group (e.g. sum, mean)

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP

Aggregate / summarize multiple variables per group (e.g. sum, mean)



From a data frame, is there a easy way to aggregate (sum, mean, max et c) multiple variables simultaneously?


sum


mean


max



Below are some sample data:


library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05))
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)



I would like to simultaneously aggregate the x1 and x2 variables from the df2 data frame by year and month. The following code aggregates the x1 variable, but is it also possible to simultaneously aggregate the x2 variable?


x1


x2


df2


x1


x2


### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)



Any suggestions would be greatly appreciated.





@Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
– Keiku
Nov 18 '17 at 10:10




4 Answers
4



Where is this year() function from?


year()



You could also use the reshape2 package for this task:


reshape2


require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
# year month x1 x2
1 2000 1 -80.83405 -224.9540159
2 2000 2 -223.76331 -288.2418017
3 2000 3 -188.83930 -481.5601913
4 2000 4 -197.47797 -473.7137420
5 2000 5 -259.07928 -372.4563522





The recast function (also from reshape2) integrates the melt and dcast function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
– Jaap
May 13 '16 at 6:17



recast


reshape2


melt


dcast


recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))



Yes, in your formula, you can cbind the numeric variables to be aggregated:


formula


cbind


aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
year month x1 x2
1 2000 1 7.862002 -7.469298
2 2001 1 276.758209 474.384252
3 2000 2 13.122369 -128.122613
...
23 2000 12 63.436507 449.794454
24 2001 12 999.472226 922.726589



See ?aggregate, the formula argument and the examples.


?aggregate


formula





Is it possible for the cbind to use dynamic variables?
– pdb
Nov 13 '15 at 5:29





It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
– pdb
Nov 13 '15 at 6:19





what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
– Clock Slave
Mar 16 '16 at 11:22





@ClockSlave, then you need to just use . on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, sum for "date" doesn't make sense though....
– A5C1D2H2I1M1N2O1R2T1
Mar 21 '16 at 3:53


.


aggregate(. ~ year + month, df1, sum, na.rm = TRUE)


sum





What if I don't want two variables but two functions?. For example mean and sd.
– skan
Apr 14 '16 at 19:15



Using the data.table package, which is fast (useful for larger datasets)


data.table



https://github.com/Rdatatable/data.table/wiki


library(data.table)
df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
setDF(df2) # convert back to dataframe



Using the plyr package


require(plyr)
df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))



Using summarize() from the Hmisc package
(column headings are messy in my example though)


# need to detach plyr because plyr and Hmisc both have a summarize()
detach(package:plyr)
require(Hmisc)
df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))



With the dplyr package, you can use summarise_all, summarise_at or summarise_if functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:


dplyr


summarise_all


summarise_at


summarise_if


library(dplyr)
# summarising all non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)

# summarising a specific set of non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)

# summarising a specific set of non-grouping variables based on condition (class)
df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)



The result of the latter two options:


year month x1 x2
<dbl> <dbl> <dbl> <dbl>
1 2000 1 -73.58134 -92.78595
2 2000 2 -57.81334 -152.36983
3 2000 3 122.68758 153.55243
4 2000 4 450.24980 285.56374
5 2000 5 678.37867 384.42888
6 2000 6 792.68696 530.28694
7 2000 7 908.58795 452.31222
8 2000 8 710.69928 719.35225
9 2000 9 725.06079 914.93687
10 2000 10 770.60304 863.39337
# ... with 14 more rows



Note: summarise_each is deprecated in favor of summarise_all, summarise_at and summarise_if.


summarise_each


summarise_all


summarise_at


summarise_if



As mentioned in my comment above, you can also use the recast function from the reshape2-package:


recast


reshape2


library(reshape2)
recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))



which will give you the same result.




Thank you for your interest in this question.
Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



Would you like to answer one of these unanswered questions instead?

Comments

Popular posts from this blog

Executable numpy error

PySpark count values by condition

Mass disable jenkins jobs