
R performance tuning #3

Posted on Aug 15, 2023 by Chung-hong Chan

Previously on this blog: #1 #2

Update 2023-08-19: see the Update section below.

Code profiling

Last time I talked about performance profiling: how the characteristics of the input data influence performance. That is useful: we now know what kind of data impacts performance. But up to this point, we know nothing about which part of the R code we should tune to get a performance improvement.

In the last two posts, we made an assumption: improving XML escaping can speed up write_ods. So we tried very hard to improve the XML escaper. But was that the best decision? Let’s quote Donald Knuth in full here:

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

What we should really do is backtrack a bit and think instead about the larger task of improving the efficiency of write_ods. If we focus too much on tuning the noncritical parts but ignore the opportunities to improve the critical 3%, that is really the root of all evil. So, the million-dollar question is: is the XML escaper that we spent two blog posts fixing a “noncritical part”, as per Knuth?

One can only find this out by profiling the code (not just profiling the performance): finding out how much time each line of code takes in a call. I believe the easiest way to profile code is to use the R package profvis.

require(profvis)
## devtools::load_all()
profvis({
    write_ods(Lahman::Batting, tempfile(fileext = ".ods"))
})

Running this with the CRAN version of readODS (v2.0.0), it gives a visualization like this:

And it shows that the entire run took 56550ms. More than half of the time (29870ms) was spent in the line calling .cell_out() (where the XML escaper is used). Surprisingly, the line that subsets the data frame and calls as.character took 30% of the running time!

Even within all the .cell_out() calls, the XML escaping itself took just 7440ms. The rest of the time was spent on file writing.

Another way to look at this is to use proffer and look at the flame graph.

require(proffer)
pprof(write_ods(Lahman::Batting, tempfile(fileext = ".ods")))

It shows that the total time taken by XML escaping (escape_xml) is less than file writing (write_as_utf8) and data frame subsetting ([).
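If you would rather not install extra packages, base R ships its own sampling profiler, Rprof(), which both profvis and proffer build on. A minimal sketch, using a stand-in workload rather than write_ods itself:

```r
# Base R's sampling profiler: Rprof() writes stack samples to a file;
# summaryRprof() then aggregates the time spent per call.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)
invisible(lapply(1:2000, function(i) as.character(runif(1000))))  # stand-in workload
Rprof(NULL)  # stop profiling
head(summaryRprof(prof_file)$by.total)
```

The by.total column attributes time to a call and everything it calls; by.self attributes only the time spent in the call itself, which is usually what you want when hunting for the critical 3%.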

Think about it this way: even if we could reduce the XML escaping to 0ms and keep the other parts as they were, the running time would be around 56550 - 7440 = 49110ms (or -13%). However, if we can find a way to cut the running time of the data frame subsetting part by 50%, it would give more improvement than tuning the XML escaping part. And it is usually easier to reduce a large part by 50% than a small part by 100%.
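The back-of-the-envelope calculation above can be checked directly in R, using the profvis timings from this run (all in ms):

```r
total     <- 56550  # whole write_ods() run
escaping  <- 7440   # XML escaping inside .cell_out()
subset_dt <- 0.30 * total  # data frame subsetting + as.character, ~30%

# Best case for the escaper: reduce it to 0ms
total - escaping            # 49110ms left
round(escaping / total, 2)  # 0.13, i.e. a 13% ceiling on any escaper tuning

# Halving the subsetting part already beats that ceiling
subset_dt / 2 > escaping    # TRUE: ~8483ms saved vs at most 7440ms
```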

And with this information, we are not doing “premature optimization.” For the upcoming v2.1.0 of readODS, we have already solved the data frame subsetting part. The running time is now at 28190ms (halved). We will then attack the slow file writing part by rewriting it in C++. XML escaping is actually our lowest priority.
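I can only sketch why that subsetting line was so expensive; the actual v2.1.0 change may differ. The toy benchmark below (synthetic data, illustrative function names) contrasts converting a data frame cell by cell, with one `[` call and one as.character() per cell, against converting whole columns at once:

```r
df <- data.frame(a = runif(1e4), b = sample(letters, 1e4, TRUE))

# Cell by cell: nrow * ncol `[` calls, each paying data frame dispatch overhead
cellwise <- function(df) {
  out <- character(nrow(df) * ncol(df))
  k <- 1L
  for (i in seq_len(nrow(df))) {
    for (j in seq_len(ncol(df))) {
      out[k] <- as.character(df[i, j])
      k <- k + 1L
    }
  }
  out
}

# Column-wise: one vectorized as.character() per column, then interleave by row
columnwise <- function(df) {
  cols <- lapply(df, as.character)
  as.vector(t(matrix(unlist(cols), ncol = length(cols))))
}

system.time(a <- cellwise(df))    # slow: ~20000 data frame subsets
system.time(b <- columnwise(df))  # fast: 2 vectorized conversions
identical(a, b)                   # TRUE: same cells, same row-major order
```

The output is identical either way; only the number of R-level calls changes, which is exactly the kind of difference that shows up in a code profile but not in the result.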

Caveats

The two approaches outlined in this post can only profile R code. They can’t profile C/C++/Fortran or any other external compiled code. They can’t even detect base R steps implemented in C, such as <-. If you want to profile C/C++ code, you can use jointprof.

Summary

  1. Performance tuning must be data driven. #1
  2. Performance tuning should not be based on just one benchmark test. Create a performance profile to see what kind of data impacts performance. #2
  3. Profile the code to see what to tune. (This post)

Update

2023-08-19: write_ods has been partially rewritten in C++, and the running time is now at 3530ms (yes, 3530ms, which is 6% of 56550ms). And the code profile is getting ridiculous.

Of the 3530ms (3.5 seconds), 2450ms are spent zipping up the 295MB XML file at the end. For this particular case, I don’t know whether that can be tuned any further.

