Werden wir Helden für einen Tag


Extending R with C #3: still cnchar(), but with vectorization

Posted on Oct 25, 2023 by Chung-hong Chan

(I decided to change the title of the series; the older posts keep their “crappyverse” titles.)

Previously: ccat(), cnchar()

Some people have told me that I should not teach .C. I totally agree, and the plan was always to abandon .C in the third post of the series. To summarize what .C offers:

The return value (e.g. answer in this case) must be passed as a pointer argument, and the C function must return void.

#include <stdio.h>
#include <string.h>

void cnchar(char **x, int *answer) {
    *answer = strlen(*x);
}

Usually, the function (e.g. cnchar in this case) can also be used outside R with little or no modification. This is because .C maps R data types to plain C types (e.g. an R character vector to an array of C-style strings, char **).

#include <stdio.h>
#include <string.h>

void cnchar(char **x, int *answer) {
    *answer = strlen(*x);
}

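int main(void) {
    char *some_string = "hello";
    int answer = 0;
    /* Pass the address of the string pointer to match the char ** parameter. */
    cnchar(&some_string, &answer);
    printf("%d\n", answer);
    return 0;
}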

This concludes our discussion of .C. Let’s move on to .Call, although I’ll also need to talk about the memory model of R along the way.

.Call

.Call is usually considered to be THE FFI one should use (there is also .External, which is extremely similar to .Call on the R side of business). Unlike .C, .Call does not try to match data types, and the C function can actually return a value instead of writing it through a pointer argument. The return type is SEXP; a function that is called only for its side effects typically returns R_NilValue.

SEXP

Anything you bring from R to C is an SEXP (S-expression). I think it is easiest to think of it as a pointer to an R object. The abstraction is so nice that you can even think of it as the R object itself (except for GC, see below). You can also create an SEXP in C code and bring it back to R. Just as there are different types of vector in R (e.g. character vector, logical vector), there are different types of SEXP, depending on which kind of R object the pointer points to; STRSXP, for example, is a character vector. Nothing checks these types for you at the C level: it is your responsibility to make sure your C code checks (or coerces) the type of what it receives.
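To make the "pointer to a typed object" idea concrete, here is a toy model in plain C. It is only a sketch: toy_object, TOY_SEXP, TOY_STRSXP and toy_first_nchar are hypothetical names invented for this illustration and have nothing to do with how R actually lays out its objects. The point is just that the pointer carries a type tag, and the C code has to look at the tag before touching the data.

```c
#include <stddef.h>
#include <string.h>

/* Toy model only: hypothetical names, NOT R's real internals. */
typedef enum { TOY_INTSXP, TOY_STRSXP } toy_type;

typedef struct {
    toy_type type;     /* which kind of vector this is */
    size_t length;     /* number of elements */
    const void *data;  /* int[] for TOY_INTSXP, const char *[] for TOY_STRSXP */
} toy_object;

/* Like SEXP in spirit: a pointer to an object that knows its own type. */
typedef const toy_object *TOY_SEXP;

/* Length of the first string, or -1 if x is not a non-empty string vector. */
int toy_first_nchar(TOY_SEXP x) {
    if (x == NULL || x->type != TOY_STRSXP || x->length == 0) {
        return -1;  /* wrong type: refuse instead of reading garbage */
    }
    const char *const *strings = (const char *const *)x->data;
    return (int)strlen(strings[0]);
}
```

Handing this function an integer vector gives -1 rather than a crash, which is the kind of check the real C code has to do itself.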

Garbage Collection

Conveniently, I have already written about this topic in the context of C++. In that post, I also hinted a bit that R is a garbage-collected language.

R objects are stored in (heap) memory. When objects are no longer needed, they still occupy that memory. The occupied memory is only released when a procedure called Garbage Collection (GC) is run. GC checks the objects in memory to see whether each one is still being referenced. If an object is no longer referenced, it is removed to free up the memory it occupied. One can trigger GC by explicitly running the gc() function. Even if one doesn’t do it explicitly, GC will still be triggered automatically.

In R, you have probably never run into a situation where GC accidentally removes an object you still need (otherwise R would be quite an unsafe language). However, when you create an R object in C, the GC process might decide that the memory occupied by your newly created object is not in use (because nothing in R is referencing it) and is therefore garbage to be collected. For safety, the C API has the notions of PROTECT and UNPROTECT. Section 5.9.1 of R-exts is entirely about this. I will show you how to protect an R object during its life cycle in C later.
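The problem can be sketched with a toy collector in plain C. Everything here (toy_obj, toy_gc and the three flags) is made up for illustration, and R’s actual collector is far more elaborate; the sketch only shows why an object that nothing on the R side references is released unless the C code has explicitly asked for it to be kept.

```c
#include <stddef.h>

/* Toy model only: hypothetical names, NOT R's real collector. */
typedef struct {
    int alive;           /* 1 while the object still occupies memory */
    int referenced;      /* 1 if something on the R side still refers to it */
    int protected_by_c;  /* 1 if C code asked the collector to keep it */
} toy_obj;

/* Release every live object that is neither referenced nor protected.
   Returns the number of objects released in this pass. */
size_t toy_gc(toy_obj *heap, size_t n) {
    size_t released = 0;
    for (size_t i = 0; i < n; i++) {
        if (heap[i].alive && !heap[i].referenced && !heap[i].protected_by_c) {
            heap[i].alive = 0;
            released++;
        }
    }
    return released;
}
```

An object freshly created in C looks like the unreferenced case: it survives a collection only while protected_by_c is set.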

Rewriting cnchar, but for .Call

Once again, this is cnchar written for .C.

#include <stdio.h>
#include <string.h>

void cnchar(char **x, int *answer) {
    *answer = strlen(*x);
}

This is the same thing but for .Call.

#include <R.h>
#include <Rdefines.h>
#include <string.h>

SEXP cnchar(SEXP x) {
    SEXP result;
    PROTECT(result = NEW_INTEGER(1));
    PROTECT(x = AS_CHARACTER(x));
    INTEGER(result)[0] = strlen(CHAR(STRING_ELT(x, 0)));
    UNPROTECT(2);
    return(result);
}

Several things:

  1. We need to include some more header files, specifically the header files for R’s C API.
  2. SEXP is both the input and the output type of the C function.
  3. There are several all-caps functions. Those all-caps functions are provided by R’s C API.
    • PROTECT() puts an SEXP (actually, the memory the SEXP occupies) onto the protection stack so that R’s GC process does not collect it as garbage.
    • UNPROTECT() releases objects (ditto the memory statement above) from the stack. It really is a stack, and the argument is only an integer. In this case, 2 means release the last two protected things (result and x) on the stack. It is important that every PROTECT() is matched by an UNPROTECT(): objects left on the stack cannot be collected, and R warns about a stack imbalance when the function returns.
    • NEW_INTEGER() or, more generally, the NEW_*() functions create a new vector of a specific length.
    • AS_CHARACTER() or, more generally, the AS_*() functions coerce an SEXP to a specific type, much like as.character() does in R.
    • STRING_ELT() or, more generally, the *_ELT() functions get the element at a specific position of an SEXP.
    • CHAR() converts an R data structure (one element of a character vector) to C data (const char *, i.e. a read-only C-style string).
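The stack behaviour of PROTECT() and UNPROTECT() described in the list above can be modelled in a few lines of plain C. This is a sketch under loud assumptions: toy_protect, toy_unprotect, toy_is_protected and toy_depth are hypothetical names, and the fixed-size array is not how R implements its protection stack. It only shows why UNPROTECT() takes a count rather than a pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, NOT R's real protection stack. */
#define TOY_STACK_MAX 16

static const void *toy_stack[TOY_STACK_MAX];
static size_t toy_top = 0;

/* Push a pointer and hand it back, so the call can wrap an expression. */
const void *toy_protect(const void *obj) {
    assert(toy_top < TOY_STACK_MAX);
    toy_stack[toy_top++] = obj;
    return obj;
}

/* Pop the n most recently protected pointers, whatever they are. */
void toy_unprotect(size_t n) {
    assert(n <= toy_top);
    toy_top -= n;
}

/* 1 if obj is currently somewhere on the stack, 0 otherwise. */
int toy_is_protected(const void *obj) {
    for (size_t i = 0; i < toy_top; i++) {
        if (toy_stack[i] == obj) {
            return 1;
        }
    }
    return 0;
}

/* Current number of protected pointers. */
size_t toy_depth(void) {
    return toy_top;
}
```

Because the pop is by count, the order matters: toy_unprotect(1) always releases whatever was protected last, and a function that pushes two pointers but pops one leaves the stack deeper than it found it.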

Compiling is the same: R CMD SHLIB cnchar.c. But after dyn.load("cnchar.so"), you have to call the C function like this:

.Call("cnchar", x = "Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch")

First, it is less crappy than the previous version because you don’t need to run it like this: .C("cnchar", x = "hello", answer = as.integer(0)). Second, it really returns a single number, not a list. It is also less likely to segfault: because the input is coerced to a character vector first, each of the following calls at least gives you something rather than a segfault. Whether those are good behaviours is up for debate.

.Call("cnchar", x = 123)
.Call("cnchar", x = TRUE)
.Call("cnchar", x = NA)
.Call("cnchar", x = c("a", "b")) # only one number

Vectorization

The C function only looks at the first element of its input, i.e. it treats the input as a length-1 vector. A crappy way to do vectorization is to use vapply().

vapply(c("a", "abc"), function(z) .Call("cnchar", z), 1L)

It is, however, super trivial to make the C function vectorized. Unlike C++, there is probably only one straightforward way to vectorize: using a for loop to iterate by index.

#include <R.h>
#include <Rdefines.h>
#include <string.h>

SEXP cnchar2(SEXP x) {
    R_xlen_t xlength = Rf_xlength(x);
    SEXP result = NEW_INTEGER(xlength);
    PROTECT(result);
    PROTECT(x = AS_CHARACTER(x));
    /* Index with R_xlen_t, the same type as xlength, so long vectors cannot overflow an int. */
    for (R_xlen_t i = 0; i < xlength; i++) {
        INTEGER(result)[i] = strlen(CHAR(STRING_ELT(x, i)));
    }
    UNPROTECT(2);
    return(result);
}

Sure enough, it vectorizes.

.Call("cnchar2", x = c("a", "bbc"))
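Stripped of the R API, the loop is nothing more than the following plain C, which can be compiled and tried without R. The function name nchar_all is hypothetical and only for illustration; the output buffer is assumed to have room for n integers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of R's API:
   store the length of x[i] in out[i] for each of the n strings. */
void nchar_all(const char *const *x, size_t n, int *out) {
    /* The index has the same type as the length, as in cnchar2. */
    for (size_t i = 0; i < n; i++) {
        out[i] = (int)strlen(x[i]);
    }
}
```

For the two strings used above, "a" and "bbc", this fills the output with 1 and 3.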

R’s C API

The definitive guide for this is Chapter 6 of R-exts. But I also like unofficial documentation sources such as r-internals.

Let’s talk about it next time.

