Subset views in R

I don’t know how to do this in R. So let me just say why I can’t.

I wanted something akin to Boost‘s sub-matrix views, where you can have indexes map back to the original matrix, so you don’t create a new object.

Sounds straightforward, just overload ‘[[‘ to subtract the offset and check the length. Alas, no dice. R zealously copies objects to the point this is not (as far as I know, which isn’t much) possible.

To demonstrate, the following function executes and times expressions operating on a vector called “M”.

time.op = function(N, Exp) {
        Exp = parse(text=Exp)
        M = numeric(N)

        N.Trials = 10
        Times = numeric(N.Trials)

        for (II in 1:N.Trials) {
                Times[[II]] = system.time(eval(Exp))[['elapsed']]
        }

        mean(Times)
}

Like

> time.op(1e5, 'sqrt(M)')
[1] 0.0024

Then see, does the size of M affect the time of the operation?

time.test = function(Exp) {
        Ns = 10^(1:6)
        Times = sapply(Ns, time.op, Exp=Exp)

        data.frame(Ns, Times)
}
> time.test('sqrt(M)')
     Ns  Times
1 1e+01 0.0000
2 1e+02 0.0000
3 1e+03 0.0001
4 1e+04 0.0004
5 1e+05 0.0027
6 1e+06 0.0274

(obviously)

And here’s why we know it’s copying:

> time.test('list(M)')
     Ns  Times
1 1e+01 0.0001
2 1e+02 0.0001
3 1e+03 0.0002
4 1e+04 0.0000
5 1e+05 0.0004
6 1e+06 0.0086

Or with attributes

> time.test('attr(M, "name") = "mike"')
     Ns  Times
1 1e+01 0.0000
2 1e+02 0.0000
3 1e+03 0.0000
4 1e+04 0.0000
5 1e+05 0.0006
6 1e+06 0.0081

Good luck making a subset without copying!

And here’s the relevant parts of the R code.

Making a list (main/builtin.c)

    for (i = 0; i < n; i++) {
                if (TAG(args) != R_NilValue) {
                    SET_STRING_ELT(names, i, PRINTNAME(TAG(args)));
                    havenames = 1;
                }
                else {
                    SET_STRING_ELT(names, i, R_BlankString);
                }
                if (NAMED(CAR(args)))
                    SET_VECTOR_ELT(list, i, duplicate(CAR(args)));
                else
                    SET_VECTOR_ELT(list, i, CAR(args));
                args = CDR(args);
            }
            if (havenames) {
                setAttrib(list, R_NamesSymbol, names);
    }

Note the repeated calls to “duplicate”.

And yes, duplicate does copy, and it is deep (main/duplicate.c):

case VECSXP:
        n = LENGTH(s);
        PROTECT(s);
        PROTECT(t = allocVector(TYPEOF(s), n));
        for(i = 0 ; i < n ; i++)
            SET_VECTOR_ELT(t, i, duplicate1(VECTOR_ELT(s, i)));
        DUPLICATE_ATTRIB(t, s);
        SET_TRUELENGTH(t, TRUELENGTH(s));
        UNPROTECT(2);
        break;
Advertisements
Posted in R | 4 Comments

The best thing I have ever added to my ~/.vimrc

map q :q<Enter>
Posted in vim | Leave a comment

Matlab-style multiple assignment in R

R again!

You know how in Matlab you can do?

S, I = sort(M)

I like that.

R generic functions makes this possible. First, let’s genericize assignment. I feel like regular “=” and “<-” oughta stay nongeneric, so let’s make a new one.

'%=%' = function(l, r, ...) UseMethod('%=%')

Now the next step is a bit tricky. We need to group several variables on the left of %=%

... a, b, ... %=% foo()

The trick is they have to stay unevaluated. Luckily R uses (what IIRC is called) “pass by name”, so we can do this. Let’s start with a function to take a, b, …

l = function(...) {

}

The hard part is grabbing the ‘…’ without evaluating it.

> substitute(...)
...

won’t work, nor will alist(…). I have NO IDEA why, but the following expression (stolen from data.frame()) works:

as.list(substitute(list(...)))[-1L]

So now we have our function

l = function(...) {
	List = as.list(substitute(list(...)))[-1L]
	class(List) = 'lbunch'
	
	List
}

And we can add a specific implementation for our generic %=%:

'%=%.lbunch' = function(l, r, ...) {
	Names = lapply(l, as.character)
	Envir = as.environment(-1)
	
	for (II in 1:length(Names)) {
		Name = Names[[II]]
		assign(Name, r[[II]], pos=Envir)
	}
}

Which just treats the objects “a”, “b” … as strings, and assigns into the caller’s environment.

That’s what I had first, but it can be made better. With the above implementation:

> l(a[1], b) %=% list(3, 4)
Warning message:
In assign(Name, r[[II]], pos = Envir) :
  only the first element is used as variable name

It doesn’t play nice with assignment functions.

It can be modified to call ‘<-‘ directly:

'%=%.lbunch' = function(l, r, ...) {
	Envir = as.environment(-1)
	
	for (II in 1:length(l)) {
		do.call('<-', list(l[[II]], r[[II]]), envir=Envir)
	}
}

And now all is good.

> l(attr(a, 'foo'), b) %=% list(3, 4)
> a
[1] 3
attr(,"foo")
[1] 3
Posted in R | 3 Comments

Packing everything into a data.frame

OK, I know I talk about R too much, but I like R, so I’m going to talk about it some more.

Common situation: repeat a procedure many times; each time generates some large wadge of awful-structured data, and in the end you’d like to go back and look at it all.

OK, sounds reasonably simple, just

lapply(1:Num.Trials, function(N) {
    ...
    list(
        A = ...,
        B = ...,
    )
})

and you’ve got a list of structs containing that data. It works, but I find it undesirable for a few reasons:

  1. A list of lists is cumbersome to navigate. You have to subscript the first list before the second.
  2. You can’t do nice data.frame things with it like plot(…, data=…). Basically, it should be a data.frame, because data.frames are pretty.
  3. Having to explicitly put everything into the struct there at the end forces you to choose what gets remembered and what gets dropped. Rarely do I have such foresight.

So to get a data.frame, we can use the magic of sapply. Like this:

as.data.frame(t(sapply(1:N, function(I) {
    ...
    list(
        A = ...,
        B = ...
    )
}))))

I have to admit I don’t actually know why sapply is smart enough to do this, but it turns the whole shebang into a matrix of mode “list”. t() transposes that matrix so the fields A, B… become the columns. as.data.frame() makes the whole thing a data frame. Excellent.

Well, there’s a little problem here. I didn’t realize this at first, but a data.frame is just a list() of columns plus some attributes() attached. And those columns are welcome to be of mode “list”, as they will be here. In one way that’s actually really convenient, because you can stick complex stuff inside a data.frame, as in, like anything, even whole other data.frames. But you can’t call mean() or sd() or acf() on a vector of mode “list”. Inconvenient.

(By the way, is there any other language in which every object has a type, a mode and a class, all of which mean different things? What is up with that?)

So the solution is this “clean” function, to convert, where possible, vectors of mode “list” to numeric or character vectors.

clean = function(Data.Frame) {
	is.one = function(X) {
		is.atomic(X) && (length(X) == 1)
	}
	
	is.good = function(Col) {
		all(sapply(Col, is.one))
	}
	
	for (Col.Name in colnames(Data.Frame)) {
		Col = Data.Frame[[Col.Name]]
		if (is.good(Col)) {
			Data.Frame[[Col.Name]] = unlist(Col)
		}
	}
	
	Data.Frame
}

Basically, check to see that all the elements are atomic vectors (ie not lists) of length 1; if so, flatten (“unlist”).

And lastly, how about automatically grabbing everything you created along the way? Just end each loop with

as.list(environment())

Putting this all together, we have:

do.trials = function(N, Func) {
	clean(as.data.frame(t(sapply(1:N, Func))))
}
Posted in R | Leave a comment

Reminding me what I’m doing

zsh makes it easy (there are sentences that begin like that, I promise) (and this would be simple in bash, too, but it gives me an opportunity to discuss the more confusing points of zsh) to include a little status in your prompt to call attention to certain directories.

E.g. if I’m in my ROS directory ~/stuff/ros (I have a folder under ~ called stuff. Please forgive me) I like my prompt to read:

[ROS] 17:36:33 ~ros

Where “[ROS]” reminds me, when I have too many shells open, which one is being used for ROS.

(“~ros” is zsh’s named directory feature (http://www.jpeek.com/talks/ukuug_20000720/))

The easiest way to do clever prompts in zsh is to define a function with the magic name “precmd”, which is called before every prompt redraw.

My prompt code looks like

setopt prompt_subst
PROMPT=$'%{\e[31m%}$PROMPT_SPECIAL%{\e[32m%}%*%{\e[00m%} %{\e[33m%}%~%{\e[0m%} '

So ‘precmd’ has the job of setting $PROMPT_SPECIAL.

In doing this, I discovered zsh actually has associative arrays. This lets me set up a table of directory->[NAME], like

typeset -A special_dirs
special_dirs=()

special_dirs[$HOME/stuff/ros]="ROS"
special_dirs[$HOME/stuff/vex]="VEX"
special_dirs[$HOME/src/pulsetor-software]="PT"

Things I always get wrong here:

  • Don’t put a space before “=”.
  • Apparently ~ isn’t expanded inside [].

BUT, what I really want is for the [TAG] to appear not just in the specified directory, but in all subdirectories. Luckily, zsh provides pattern matching on associative array keys.

But here’s the trick: what you’d expect, and what I tried first, doesn’t work. Try this:

typeset -A foo
foo=()

foo[a]=1
foo[ab]=2
foo[b]=3

Now use the (k) flag to [] to do a pattern search.

% print ${foo[(k)a]}
1

works as expected, but

% print ${foo[(k)a*]}

produces no matches.

Well, it turns out, that the (k) flag treats the actual keys as patterns, and the argument to [] as the text to match against. WTF?

EXCEPT, that is exactly what I want. I can phrase the elements of $special_dirs like so:

special_dirs[$HOME/stuff/ros(|/)*]="ROS"
special_dirs[$HOME/stuff/vex(|/)*]="VEX"
special_dirs[$HOME/src/pulsetor-software(|/)*]="PT"

and match pwd against them like

${special_dirs[(k)$PWD]}

(First I tried using zsh’s nice /**/* recursive glob, but apparently that doesn’t work with (k). Can anyone explain to me why?)

The whole code from my .zshrc is

setopt prompt_subst
PROMPT=$'%{\e[31m%}$PROMPT_SPECIAL%{\e[32m%}%*%{\e[00m%} %{\e[33m%}%~%{\e[0m%} '

typeset -A special_dirs
special_dirs=()

function precmd {
	PROMPT_SPECIAL=${special_dirs[(k)$PWD]}
}

function mkspecdir {
	special_dirs[$1(|/)*]="[$2] "
}

mkspecdir ~/stuff/ros ROS
mkspecdir ~/stuff/vex VEX
mkspecdir ~/src/pulsetor-software PT
Posted in zsh | Leave a comment

Insert Date

I wanted a hotkey to insert the current date at the cursor in ALL programs. Turns out this is rather tricky, because there is (of course) no common API to “insert text” in all X programs (open standards means you can get it to work with slightly less Perl than proprietary standards. Or more Perl and less wireshark).

The best solution (not so great) I have so far is to use xmacro. The input to xmacro would look something like this:

Delay 1
String 2006-03-27

Where “Delay 1” gives you 1 second to take your fingers off the keys.

So a handy script can generate this macro for any desired string:

#!/bin/bash

echo Delay 1
while read line ; do
        echo "String $line"
done

Then a handy script can feed it into xmacroplay:

#!/bin/sh

type-gen-macro | xmacroplay $DISPLAY

And a shortcut can be assigned to

date +'%F' | xtype

Hacky and slow, but it works.

Posted in X | Leave a comment

Countable Dense Sets

So here’s a problem. Sometimes someone’ll ask me why the real numbers are uncountable. Should I give the somewhat unconvincing diagonalization argument, or the non-empty perfect set argument that I barely understand?

Well I just thought of an argument that I think is fairly simple to understand and, as far as I can tell, mathematically correct. Basically I want to prove the following statement:
If A and B are countable, linearly ordered sets, neither of which has a maximum or minimum element, and they are both dense in the sense that for all a1, a2 in A, there is an a3 in A with a1 < a3 < a2, then A and B have the same order type.

To do this we want to construct an order-preserving bijection between A and B. We can do this recursively. Since A and B are both countable, number the elements a1, a2, … and b1, b2 …. . Call the bijection to be constructed f:A–>B, and assign it values like follows.

For picking f(a[n]), the previous a[1] … a[n-1] box off an interval of A containing a[n] (or really they box off a lot of intervals but we only care about the one containing a[n]) (you can see this is true because they are finite). And because we’ve been careful to keep f({a[1], … a[n-1]}) order-preserving, the corresponding f(a[j]) box off an interval in B that we’re allowed to put f(a[n]) in. And the rule we’ll use is to make f(a[n]) be b[j] for the smallest j corresponding to a point in that box.

Now,
– We’ll always have a non-empty box because B is dense;
– f is clearly injective because we only pick from an interval nothing’s been mapped to yet;
– That f is order preserving seems obvious, but I suppose this could be elaborated (if a[i] < a[j], and say i < j, then when f(a[j]) was picked after f(a[i]), so it must be that f(a[i]) < f(a[j]) by the rule we picked with. Similarly if j < i).
– We still need to prove that f is surjective.

So say we have b[j] in B. We want to show that there is an a in A with f(a) = b[j]. We can do that by induction. Clearly f(a[1]) = b[1]. Now say b[1], … b[j-1] have all been mapped to by elements of A. These section off a box containing b[j]. Since f is 1-1, the preimage of this box gives an interval in A. This interval is non-empty (A is dense), so it has a least-numbered element in it, call it a[k]. When f(a[k]) is chosen it must map to b[j] (by the rule we used).

So f is bijective and order-preserving.

Now the real numbers are dense (since a < (a+b)/2 < b) and have no maximum or minimum element. So if they were countable, they would have the same order-type as the rationals (or say the rationals excluding 0, if we don't want to prove the existence of irrationals just now). But they can't, because the defining property of real numbers is the supremum property, which the rationals don't have. So the reals must be uncountable.

So what do you think — is it intuitive enough for dinner-table conversation?

Posted in math | Leave a comment