R SQL-ish aggregation

I came to R from SQL. I’ve heard that packages such as reshape and plyr are quite beautiful, but to me they are somewhat non-intuitive. I’m always looking for

SELECT mean(sodium) as mean, variance(sodium) as var 
    FROM texas GROUP BY weight;

I struggled with this a bit. First, it’s tempting to use tapply(), except that’s more of a one-column-in, one-column-out sort of thing. There’s by(), which is nice, and with a short wrapper can be OK. Well, here’s what I did.
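
For instance, with a hypothetical texas data.frame matching the query above, tapply() gives one aggregate at a time:

tapply(texas$sodium, texas$weight, mean)   # one column in, one column out
tapply(texas$sodium, texas$weight, var)    # a second, separate call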

First, the pretty stuff, because R, while generally considered a practical language, is to me more of a pretty language. A great way to make things pretty in R is to abuse dynamic binding.

Say I have a table like this:

> Table
  Alice Bob       Cathy Doug
1     1   1  0.08309254    a
2     2   2 -0.76015321    a
3     3   3  0.30887905    a
4     4   4  0.74900962    a
5     1   5  0.69447607    a
6     2   6  0.32485197    a
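
For reference, a table like this can be built along these lines (Cathy comes from rnorm(), so the exact values will differ):

Table = data.frame(
        Alice = rep(1:4, length.out=6),
        Bob   = 1:6,
        Cathy = rnorm(6),
        Doug  = rep('a', 6)
)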

I’d like to call select() like this:

select(Table, Alice,
        a = mean(Cathy * Doug),
        b = sum(Cathy - Doug) / sum(Doug)
)

where ‘Alice’ gives the column to group by, and a= and b= give the columns to create.

So this could look like

select = function(Table, Key, Add.Key=T, ...) {
        Columns = as.list(substitute(list(...)))[-1L]

        S.Key    = substitute(Key)
        Key.Str  = deparse(S.Key)
        Group.By = eval(S.Key, envir=Table)

        ...

}

First, the as.list(substitute(list(...)))[-1L] trick gets the unevaluated expressions for the columns. Then substitute(Key) gets the unevaluated key. deparse(S.Key) gives it a name (better would be to let the caller specify the name, but I haven’t wanted to do that yet, so I didn’t implement it).
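
To see the capture trick in isolation, a toy sketch:

f = function(...) as.list(substitute(list(...)))[-1L]
f(a = mean(Cathy * Doug), b = sum(Cathy - Doug) / sum(Doug))
# $a
# mean(Cathy * Doug)
#
# $b
# sum(Cathy - Doug)/sum(Doug)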

Group.By = eval(S.Key, envir=Table)

evaluates the grouping column, so now we have it as a vector, with (hopefully) the same number of rows as the table.
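
With the small Table from above, for example:

eval(quote(Alice), envir=Table)
# [1] 1 2 3 4 1 2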

Now, the next step turned out to suffer some unusual efficiency problems. The obvious approach is to use by() to split Table into many subtables and evaluate the output columns on each of those. This works, but when the number of output rows is very large, it is extraordinarily slow.

An implementation using by() would look like this

select1  = function(Table, Key, Add.Key=T, ...) {
        # Capture the output-column expressions, unevaluated
        Columns = as.list(substitute(list(...)))[-1L]

        S.Key = substitute(Key)
        Key.Str = deparse(S.Key)
        Group.By = eval(S.Key, envir=Table)
        Table[[Key.Str]] = Group.By

        # Build one small data.frame from one subtable
        Group.Gen = function(Group) {
                # lapply (not sapply) keeps the outputs as a named
                # list, so as.data.frame() works with or without the key
                List.Out = lapply(Columns, function(Col) {
                        eval(Col, envir=Group)
                })

                if (Add.Key) {
                        List.Key = list()
                        List.Key[[Key.Str]] = unique(Group[[Key.Str]])

                        List.Out = c(List.Key, List.Out)
                }

                as.data.frame(List.Out)
        }

        # Split into subtables, one per key value
        Group.List = by(Table, Group.By, Group.Gen, simplify=F)
        names(Group.List) = NULL

        # Stack the per-group frames into the result
        Result = do.call(rbind, Group.List)

        Result
}

In outline:

  1. Split the table into subtables with by().
  2. Evaluate the output columns in each subtable.
  3. rbind() the outputs together.
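
As a quick sanity check, a hypothetical call on the small Table from above:

select1(Table, Alice,
        m = mean(Cathy),
        n = length(Bob)
)
# one row per distinct Alice value (1, 2, 3, 4),
# with columns Alice (the key), m and n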

The following implementation, which deals with vectors directly instead of using by() (it does not make a data.frame until the very end), is about 10 times faster.

select2  = function(Table, Key, Add.Key=T, ...) {
        # Capture the output-column expressions, unevaluated
        Columns = as.list(substitute(list(...)))[-1L]

        S.Key    = substitute(Key)
        Key.Str  = deparse(S.Key)
        Group.By = eval(S.Key, envir=Table)

        # Map each key value to the row indices of its group
        Indices  = tapply(1:nrow(Table), Group.By, c)

        # Work on plain vectors; no data.frame until the very end
        Col.List = as.list(Table)

        Col.Out.List = lapply(Columns, function(Column) {
                sapply(Indices, function(Group) {
                        # Subset each column to this group's rows, then
                        # evaluate the expression with those in scope
                        Subset = lapply(Col.List, function(Col) Col[Group])
                        eval(Column, envir=Subset)
                })
        })

        if (Add.Key) {
                Key.List = list()
                Key.List[[Key.Str]] = sapply(Indices, function(Group) {
                        unique(Group.By[Group])
                })

                Col.Out.List = c(Key.List, Col.Out.List)
        }

        Table.Out = do.call(data.frame, Col.Out.List)

        Table.Out
}
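
The tapply() call is the heart of it: it maps each key value to the row numbers of its group. On the small Table from above, for example:

tapply(1:6, Table$Alice, c)
# $`1`
# [1] 1 5
#
# $`2`
# [1] 2 6
#
# $`3`
# [1] 3
#
# $`4`
# [1] 4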

With a large table

Table = data.frame(
        Alice = rep(c(1, 2, 3, 4), 1000),
        Bob   = rep(1:2000, 2),
        Cathy = rnorm(4000),
        Doug  = rep('a', 4000)
)

The difference becomes apparent

> system.time(select1(Table, Bob, m=mean(Cathy)))
   user  system elapsed 
  3.000   0.010   3.026 
> system.time(select2(Table, Bob, m=mean(Cathy)))
   user  system elapsed 
  0.290   0.000   0.286 

Unfortunately, when the number of output rows is very, very large, even select2() becomes too slow (and much slower than tapply(); plyr is slower still). The best solutions I have found are:

  • If there’s only one input column and one output column, use tapply().
  • Use sqldf to run the query in SQLite (both approaches are sketched below).
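
A minimal sketch of both, using the large Table from the timing test above:

# One input column, one output column: tapply() is fastest,
# though it returns a named vector rather than a data.frame
m = tapply(Table$Cathy, Table$Bob, mean)

# sqldf hands the grouping off to SQLite
library(sqldf)
Tab = Table  # copy under a new name; TABLE is a reserved word in SQL
sqldf("SELECT Bob, avg(Cathy) AS m FROM Tab GROUP BY Bob")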

Responses to R SQL-ish aggregation

  1. Hadley says:

    Your first example could be trivially written in plyr as:

    ddply(texas, "weight", summarise, mean = mean(sodium), var = var(sodium))

    Sure, it’s not quite the same as the SQL, but it’s very, very close! (The only major difference is the order of the elements.)

    Plyr is still rather slow for these types of cases, mainly because it’s (effectively) row-centric rather than column-centric. I think the next version of plyr will special-case the combinations ddply + summarise and ddply + transform to provide a solution that’s much, much faster. Initial benchmarking suggests that I should be able to do at least as well as the base functions, if not better.

  2. jd long says:

    Hadley and Marek helped me out quite a bit with speeding up these types of operations. As recently as today, even:

    http://stackoverflow.com/questions/3685492/r-speeding-up-group-by-operations

    I, like you, came to R knowing SQL. I used sqldf quite a bit, and these days I use plyr much more than sqldf.

    -JD

  3. ellbur says:

    Hi Everyone!

    Hadley:

    Ah, I should have spent more time with plyr, but the particular case I was working on had a TON of output rows, so I abandoned it early. But thank you for the tip ;) (it’s not in the help(ddply) for my version of plyr... ;( )

    jd:

    Sorry if I’m not interpreting the code right (I haven’t seen id() or plyr:::split_indices() before), but the top answer is with one output column? Which is nicer, because you can use tapply() on the indices, and there’s no need to concatenate output data.frames. I guess it could be repeated for multiple columns. I wonder how that would compare.
