Aggregate / summarize multiple variables per group (e.g. sum, mean)

154

Given a data frame, is there an easy way to aggregate (sum, mean, max, etc.) multiple variables simultaneously?

Below is some sample data:

library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05)) 
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)

I would like to simultaneously aggregate the x1 and x2 variables from the df1 data frame by year and month. The following code aggregates the x1 variable, but is it also possible to simultaneously aggregate the x2 variable?

### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)

Any suggestions would be greatly appreciated.

— MikeTP

45

Where is this year() function from?

You could also use the reshape2 package for this task:

require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
#   year month         x1           x2
# 1 2000     1  -80.83405 -224.9540159
# 2 2000     2 -223.76331 -288.2418017
# 3 2000     3 -188.83930 -481.5601913
# 4 2000     4 -197.47797 -473.7137420
# 5 2000     5 -259.07928 -372.4563522

— EDi

8

The recast function (also from reshape2) combines melt and dcast in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))

— Jaap

184

Yes, in your formula, you can cbind the numeric variables to be aggregated:

aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
   year month         x1          x2
1  2000     1   7.862002   -7.469298
2  2001     1 276.758209  474.384252
3  2000     2  13.122369 -128.122613
...
23 2000    12  63.436507  449.794454
24 2001    12 999.472226  922.726589

See ?aggregate, the formula argument and the examples.

— Andrie

3

Is it possible for cbind to use dynamic variables?
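
One possible way to do this, offered as a sketch rather than a definitive answer, is to build the formula as a string and convert it with as.formula(). The sample data is rebuilt here with base R only (format() instead of lubridate) so the block runs on its own; the set.seed() call and the vars vector are assumptions added for illustration.

```r
# Sample data shaped like the question's df1, built without lubridate
set.seed(1)  # added for reproducibility; not in the original question
days <- 365 * 2
date <- seq(as.Date("2000-01-01"), length.out = days, by = "day")
df1 <- data.frame(
  date,
  year  = as.integer(format(date, "%Y")),
  month = as.integer(format(date, "%m")),
  x1    = cumsum(rnorm(days, 0.05)),
  x2    = cumsum(rnorm(days, 0.05))
)

# Hypothetical: column names that are only known at run time
vars <- c("x1", "x2")

# Assemble "cbind(x1, x2) ~ year + month" from the character vector
f <- as.formula(
  paste0("cbind(", paste(vars, collapse = ", "), ") ~ year + month")
)

res_dynamic <- aggregate(f, data = df1, FUN = sum)

# The hard-coded formula, for comparison
res_static <- aggregate(cbind(x1, x2) ~ year + month, data = df1, FUN = sum)
```

Both calls should produce the same table, since the generated formula is the same as the hand-written one.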

— pdb

14

It's worth noting that when any of the variables in the cbind has an NA, the whole row is dropped for every variable in the cbind. This is not the behavior I was expecting.
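
A small sketch of the behaviour described above, using made-up toy data (df_na and its values are hypothetical). The formula method of aggregate() defaults to na.action = na.omit, which removes incomplete rows before FUN ever sees them. Passing na.action = na.pass keeps the rows, and na.rm = TRUE is then forwarded to sum() for each column separately.

```r
# Toy data: each group has one NA, in a different column
df_na <- data.frame(
  g  = c("a", "a", "b", "b"),
  x1 = c(1, NA, 3, 4),
  x2 = c(10, 20, 30, NA)
)

# Default na.action = na.omit: rows 2 and 4 are dropped for BOTH columns
dropped <- aggregate(cbind(x1, x2) ~ g, data = df_na, FUN = sum, na.rm = TRUE)

# na.action = na.pass keeps every row; sum(..., na.rm = TRUE) skips NAs per column
kept <- aggregate(cbind(x1, x2) ~ g, data = df_na, FUN = sum,
                  na.rm = TRUE, na.action = na.pass)
```

In the first result, the x2 value of 20 in group "a" is lost because its row has an NA in x1. In the second it is included.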

— pdb

1

What if, instead of x1 and x2, I want to use all the remaining variables (other than year and month)?

— Clock Slave

77

@ClockSlave, then you just need to use . on the LHS: aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, though, sum for "date" doesn't make much sense ...
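
As a sketch of how the date problem could be sidestepped (sum is not defined for Date objects in base R), the date column can be dropped before using the dot. The data is rebuilt with base R so the block is self-contained; set.seed() and the keep vector are additions for illustration.

```r
# Sample data shaped like the question's df1, built without lubridate
set.seed(1)  # added for reproducibility; not in the original question
days <- 365 * 2
date <- seq(as.Date("2000-01-01"), length.out = days, by = "day")
df1 <- data.frame(
  date,
  year  = as.integer(format(date, "%Y")),
  month = as.integer(format(date, "%m")),
  x1    = cumsum(rnorm(days, 0.05)),
  x2    = cumsum(rnorm(days, 0.05))
)

# Keep everything except the Date column, which cannot be summed
keep <- setdiff(names(df1), "date")

# "." on the left-hand side now expands to x1 and x2 only
res_dot <- aggregate(. ~ year + month, data = df1[keep], FUN = sum)
```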

— A5C1D2H2I1M1N2O1R2T1

55

What if I want not two variables but two functions? For example mean and sd.
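
One way this might be handled in base R is to let FUN return a named vector. The sketch below uses a small hypothetical data frame (toy) grouped by year only, to keep the numbers easy to check. aggregate() then stores each variable as a matrix column, and do.call(data.frame, ...) is a common idiom for flattening those into ordinary columns.

```r
# Hypothetical toy data, grouped by year only for brevity
toy <- data.frame(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  x1   = c(1, 2, 3, 4, 5, 6),
  x2   = c(2, 4, 6, 8, 10, 12)
)

# FUN returns two named statistics per variable and group
res <- aggregate(cbind(x1, x2) ~ year, data = toy,
                 FUN = function(v) c(mean = mean(v), sd = sd(v)))

# res$x1 and res$x2 are matrix columns; flatten them into x1.mean, x1.sd, ...
flat <- do.call(data.frame, res)
```

The flattened result has one column per variable and statistic, which is usually easier to work with downstream than the matrix columns.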

— skan

51

Using the data.table package, which is fast (useful for larger datasets):

https://github.com/Rdatatable/data.table/wiki

library(data.table)
df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
setDF(df2) # convert back to dataframe

Using the plyr package:

require(plyr)
df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))

Using summarize() from the Hmisc package (the column headings are messy in my example, though):

# need to detach plyr because plyr and Hmisc both have a summarize()
detach(package:plyr)
require(Hmisc)
df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))

— number cruncher

Why not do this for the data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2)), by = .(year, month)]?

— Bulat

48

With the dplyr package, you can use the summarise_all, summarise_at or summarise_if functions to aggregate multiple variables simultaneously. For the example dataset, you can do this as follows:

library(dplyr)
# summarising all non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)

# summarising a specific set of non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)

# summarising a specific set of non-grouping variables using select_helpers
# see ?select_helpers for more options
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(starts_with('x')), sum)
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(matches('.*[0-9]')), sum)

# summarising a specific set of non-grouping variables based on condition (class)
df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)

The result of the last two options:

    year month        x1         x2
   <dbl> <dbl>     <dbl>      <dbl>
1   2000     1 -73.58134  -92.78595
2   2000     2 -57.81334 -152.36983
3   2000     3 122.68758  153.55243
4   2000     4 450.24980  285.56374
5   2000     5 678.37867  384.42888
6   2000     6 792.68696  530.28694
7   2000     7 908.58795  452.31222
8   2000     8 710.69928  719.35225
9   2000     9 725.06079  914.93687
10  2000    10 770.60304  863.39337
# ... with 14 more rows

Note: summarise_each is deprecated in favor of summarise_all, summarise_at and summarise_if.

As mentioned in my comment above, you can also use the recast function from the reshape2 package:

library(reshape2)
recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))

which will give you the same result.

— Jaap

8

Interestingly, base R's aggregate data.frame method is not showcased here; the answers above use the formula interface, so for completeness:

aggregate(
  x = df1[c("x1", "x2")],
  by = df1[c("year", "month")],
  FUN = sum, na.rm = TRUE
)

More generic use of aggregate's data.frame method:

Since we are providing a data.frame as x and a list (a data.frame is also a list) as by, this is very useful if we need to use it in a dynamic manner: swapping in other columns to aggregate, or to aggregate by, is very simple, and it also works with custom-made aggregation functions.

For example, like so:

colsToAggregate <- c("x1")
aggregateBy <- c("year", "month")
dummyaggfun <- function(v, na.rm = TRUE) {
  c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
}

aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)

— Jozef

1

With the devel version of dplyr (version ‘0.8.99.9000’), we can also use summarise to apply a function to multiple columns with across:

library(dplyr)
df1 %>% 
    group_by(year, month) %>%
    summarise(across(starts_with('x'), sum))
# A tibble: 24 x 4
# Groups:   year [2]
#    year month     x1     x2
#   <dbl> <dbl>  <dbl>  <dbl>
# 1  2000     1   11.7  52.9 
# 2  2000     2  -74.1 126.  
# 3  2000     3 -132.  149.  
# 4  2000     4 -130.    4.12
# 5  2000     5  -91.6 -55.9 
# 6  2000     6  179.   73.7 
# 7  2000     7   95.0 409.  
# 8  2000     8  255.  283.  
# 9  2000     9  489.  331.  
#10  2000    10  719.  305.  
# … with 14 more rows
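
across() has since shipped in released dplyr (1.0.0 onward). As a further sketch, it can also take a named list of functions, which would give several statistics per column in one call. The toy data frame below is hypothetical, and .groups = "drop" is used only to return an ungrouped result.

```r
library(dplyr)

# Hypothetical toy data with two year-month groups
toy <- data.frame(
  year  = c(2000, 2000, 2000, 2000, 2000, 2000),
  month = c(1, 1, 1, 2, 2, 2),
  x1    = c(1, 2, 3, 4, 5, 6),
  x2    = c(2, 4, 6, 8, 10, 12)
)

# A named list of functions yields columns named <column>_<function>
res_across <- toy %>%
  group_by(year, month) %>%
  summarise(across(starts_with("x"), list(mean = mean, sd = sd)),
            .groups = "drop")
```

The output columns follow the default naming pattern, so here they would be x1_mean, x1_sd, x2_mean and x2_sd.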

— akrun

1

For a more flexible and faster approach to data aggregation, check out the collap function in the collapse R package, available on CRAN:

library(collapse)
# Simple aggregation with one function
head(collap(df1, x1 + x2 ~ year + month, fmean))

  year month        x1        x2
1 2000     1 -1.217984  4.008534
2 2000     2 -1.117777 11.460301
3 2000     3  5.552706  8.621904
4 2000     4  4.238889 22.382953
5 2000     5  3.124566 39.982799
6 2000     6 -1.415203 48.252283

# Customized: Aggregate columns with different functions
head(collap(df1, x1 + x2 ~ year + month, 
      custom = list(fmean = c("x1", "x2"), fmedian = "x2")))

  year month  fmean.x1  fmean.x2 fmedian.x2
1 2000     1 -1.217984  4.008534   3.266968
2 2000     2 -1.117777 11.460301  11.563387
3 2000     3  5.552706  8.621904   8.506329
4 2000     4  4.238889 22.382953  20.796205
5 2000     5  3.124566 39.982799  39.919145
6 2000     6 -1.415203 48.252283  48.653926

# You can also apply multiple functions to all columns
head(collap(df1, x1 + x2 ~ year + month, list(fmean, fmin, fmax)))

  year month  fmean.x1    fmin.x1  fmax.x1  fmean.x2   fmin.x2  fmax.x2
1 2000     1 -1.217984 -4.2460775 1.245649  4.008534 -1.720181 10.47825
2 2000     2 -1.117777 -5.0081858 3.330872 11.460301  9.111287 13.86184
3 2000     3  5.552706  0.1193369 9.464760  8.621904  6.807443 11.54485
4 2000     4  4.238889  0.8723805 8.627637 22.382953 11.515753 31.66365
5 2000     5  3.124566 -1.5985090 7.341478 39.982799 31.957653 46.13732
6 2000     6 -1.415203 -4.6072295 2.655084 48.252283 42.809211 52.31309

# When you do that, you can also return the data in a long format
head(collap(df1, x1 + x2 ~ year + month, list(fmean, fmin, fmax), return = "long"))

  Function year month        x1        x2
1    fmean 2000     1 -1.217984  4.008534
2    fmean 2000     2 -1.117777 11.460301
3    fmean 2000     3  5.552706  8.621904
4    fmean 2000     4  4.238889 22.382953
5    fmean 2000     5  3.124566 39.982799
6    fmean 2000     6 -1.415203 48.252283

Note: You can use base functions like mean, max, etc. with collap, but fmean, fmax, etc. are C++ based grouped functions offered in the collapse package and are significantly faster (i.e. performance on large data aggregations is the same as data.table while providing greater flexibility, and these fast grouped functions can also be used without collap).

Note 2: collap also supports flexible multitype data aggregation, which you can of course do using the custom argument, but you can also apply functions to numeric and non-numeric columns in a semi-automated way:

# wlddev is a data set of World Bank Indicators provided in the collapse package
head(wlddev)

      country iso3c       date year decade     region     income  OECD PCGDP LIFEEX GINI       ODA
1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA 32.292   NA 114440000
2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA 32.742   NA 233350000
3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA 33.185   NA 114880000
4 Afghanistan   AFG 1964-01-01 1963   1960 South Asia Low income FALSE    NA 33.624   NA 236450000
5 Afghanistan   AFG 1965-01-01 1964   1960 South Asia Low income FALSE    NA 34.060   NA 302480000
6 Afghanistan   AFG 1966-01-01 1965   1960 South Asia Low income FALSE    NA 34.495   NA 370250000

# This aggregates the data, applying the mean to numeric and the statistical mode to categorical columns
head(collap(wlddev, ~ iso3c + decade, FUN = fmean, catFUN = fmode))

  country iso3c       date   year decade                     region      income  OECD    PCGDP   LIFEEX GINI      ODA
1   Aruba   ABW 1961-01-01 1962.5   1960 Latin America & Caribbean  High income FALSE       NA 66.58583   NA       NA
2   Aruba   ABW 1967-01-01 1970.0   1970 Latin America & Caribbean  High income FALSE       NA 69.14178   NA       NA
3   Aruba   ABW 1976-01-01 1980.0   1980 Latin America & Caribbean  High income FALSE       NA 72.17600   NA 33630000
4   Aruba   ABW 1987-01-01 1990.0   1990 Latin America & Caribbean  High income FALSE 23677.09 73.45356   NA 41563333
5   Aruba   ABW 1996-01-01 2000.0   2000 Latin America & Caribbean  High income FALSE 26766.93 73.85773   NA 19857000
6   Aruba   ABW 2007-01-01 2010.0   2010 Latin America & Caribbean  High income FALSE 25238.80 75.01078   NA       NA

# Note that by default (argument keep.col.order = TRUE) the column order is also preserved

— Sebastian

0

Late to the party, but I recently found another way to get the summary statistics.

library(psych)
describe(data)

Output: mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.

— Britt

The question is about doing aggregations by group, but describe doesn't do anything by group...

— Gregor Thomas

describe.by(column, group = grouped_column)

— agrupará

44

Well, then put that in the answer! Don't hide it in a comment!

— Gregor Thomas