December 19, 2008

Minimal Advice to Undergrads on Programming

I seem to be mostly teaching classes with a big computational component. After being hit over the head a few times by the very, very wide range of programming skill among the students, I decided to write out some advice on how to program, with a bit of special reference to R. This is not advice on how to become a brilliant programmer, because I can't give such advice; I am at best adequate for a scientist. But that's all I ask.

Corrections and suggestions are appreciated.

Update, 28 December: What follows is an updated version, incorporating the useful suggestions of Geet Duggal, Derek M. Jones, Thomas Lumley and Chris Wiggins. The original, if anyone cares, is archived here.


In roughly decreasing order of importance:

Take a real programming class
Learning enough syntax for some language to make things run without crashing is not the same as actually learning how to think computationally. One of the most valuable classes I ever took was CS 60A at Berkeley, which was an introduction to programming, and so to a whole way of thinking. (The textbook was The Structure and Interpretation of Computer Programs.) If at all possible, take a real programming class; if not possible, try to read a real programming book.
Of course by the time you are taking my class it is generally too late to follow this advice; hence the rest of the list.
(Actual software engineering is another discipline, over and above basic computational thinking; that's why we have a software engineering institute. There is a big difference between the kind of programming I am expecting you to do, and the kind of programming that software engineers can do.)
Comment your code
Comments lengthen your file, but they make it immensely easier for other people to understand. ("Other people" includes your future self; there are few experiences more frustrating than coming back to a program after a break only to wonder what you were thinking.) Comments should say what each part of the code does, and how it does it. The "what" is more important; you can change the "how" more often and more easily.
Every function (or subroutine, etc.) should have comments at the beginning saying:
  1. what it does;
  2. what all its inputs are (in order);
  3. what it requires of the inputs and the state of the system ("presumes");
  4. what side-effects it may have (e.g., "plots histogram of residuals");
  5. what all its outputs are (in order).
Listing what other functions or routines the function calls ("dependencies") is optional; this can be useful, but it's easy to let it get out of date.
You should treat "Thou shalt comment thy code" as a commandment which Moses brought down from Mt. Sinai, written on stone by a fiery Hand. I will treat it so when I grade you.
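As an illustration of the five-part header described above, here is a minimal sketch in R. The function name plot_residual_histogram and its arguments are made up for this example; they are not taken from any course code.

```r
# Plot a histogram of the residuals from a fitted linear model
# Inputs: fitted model object (fitted_model), number of histogram bins (n_bins)
# Presumes: fitted_model was produced by lm(), so residuals() works on it;
#   n_bins is a single positive number
# Side-effects: plots histogram of residuals on the current graphics device
# Outputs: the vector of residuals (returned invisibly)
plot_residual_histogram <- function(fitted_model, n_bins = 20) {
  stopifnot(inherits(fitted_model, "lm"), n_bins > 0)
  model_residuals <- residuals(fitted_model)
  hist(model_residuals, breaks = n_bins,
       main = "Residuals", xlab = "residual")
  invisible(model_residuals)
}
```

The header takes longer to write than the body, which is about the right proportion for a function this small.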
RTFM
If a function isn't doing what you think it should be doing, read the manual. R in particular is pretty thoroughly documented. (I say this as someone whose job used to involve programming a piece of special-purpose hardware in a largely undocumented non-standard dialect of Forth.) Look at (and try) the examples. Follow the cross-references. There are lots of utility functions built into R; familiarize yourself with them.
The utility functions I keep using: apply and its variants; sort, order; aggregate; table; rbind and cbind; paste.
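For readers who have not met these functions yet, the following is a small sketch of each one on toy data. The data and variable names are invented for illustration; the help pages (e.g., ?order, ?aggregate) remain the authority on what each function does.

```r
# Toy data, made up for illustration
scores <- c(72, 95, 88, 60)
students <- c("ann", "bob", "cat", "dan")
section <- c("A", "B", "A", "B")

sort(scores)                          # the scores in increasing order
students[order(scores)]               # names, re-arranged by increasing score
table(section)                        # how many students in each section
paste(students, section, sep = "-")   # element-wise gluing of strings

# rbind/cbind stack vectors (or matrices) as rows or as columns
score_matrix <- cbind(midterm = scores, final = scores + 3)

# apply: run a function over each row (1) or each column (2) of a matrix
row_means <- apply(score_matrix, 1, mean)
sapply(students, nchar)               # a variant: one call per vector element

# aggregate: summarize one variable within groups defined by another
gradebook <- data.frame(student = students, section = section, score = scores)
section_means <- aggregate(score ~ section, data = gradebook, FUN = mean)
```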
Start from the beginning and break it down
Start by thinking about what you want your program to do. Then figure out a set of slightly smaller steps which, put together, would accomplish that. Then take each of those steps and break it down into yet smaller ones. Keep going until the pieces you're left with are so small that you can see how to do each of them with only a few lines of code. Then write the code for the smallest bits and check it; once it works, write the code for the next larger bits, and so on.
In slogan form:
  1. Think before you write.
  2. What first, then how.
  3. Design from the top down, code from the bottom up.
(Not everyone likes to design code this way, and it's not in the written-in-stone-atop-Sinai category, but there are many much worse ways to start.)
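To make "design from the top down, code from the bottom up" concrete, here is one possible sketch. The problem (checking by simulation how often a confidence interval covers the true mean) and all the function names are chosen purely for illustration.

```r
# Design, top down (written first, as comments):
#   estimate the coverage of a confidence interval for the mean
#     = repeat many times: draw a sample, build the interval, see if it covers
#   so we need: (a) draw a sample, (b) build the interval, (c) check coverage

# Code, bottom up: smallest pieces first, each checked before moving on

# (a) Draw a sample of size n from a normal with the given mean and sd 1
draw_sample <- function(n, true_mean) {
  rnorm(n, mean = true_mean, sd = 1)
}

# (b) Two-sided t-based confidence interval for the mean of x
mean_interval <- function(x, level = 0.95) {
  half_width <- qt(1 - (1 - level) / 2, df = length(x) - 1) *
    sd(x) / sqrt(length(x))
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}

# (c) Does the interval contain the target value?
interval_covers <- function(interval, target) {
  unname(interval["lower"] <= target && target <= interval["upper"])
}

# Top level: put the pieces together; returns the fraction of intervals
# which covered the true mean
estimate_coverage <- function(n_reps, n, true_mean, level = 0.95) {
  covered <- replicate(n_reps,
    interval_covers(mean_interval(draw_sample(n, true_mean), level), true_mean))
  mean(covered)
}
```

Each of the three small pieces can be tried out by hand at the console before the top-level function exists at all.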
Break your code into many short, meaningful functions
Since you have broken your programming problem into many small pieces, try to make each piece a short function. (In other languages you might make them subroutines or methods, but in R they should be functions.)
Each function should achieve a single coherent task — its function, if you will. The division of code into functions should respect this division of the problem into sub-problems. More exactly, the way you break your code into functions is how you have divided your problem.
Each function should be short, generally less than a page of print-out. The function should do one single meaningful thing. (Do not just break the calculation into arbitrary thirty-line chunks and call each one a function.) These functions should generally be separate, not nested one inside the other.
Using functions has many advantages:
  • you can re-use the same code many times, either at different places in this program or in other programs.
  • the rest of your code only has to care about the inputs and outputs to the function (its interfaces), not about the internal machinery that turns inputs into outputs. This makes it easier to design the rest of the program, and it means you can change that machinery without having to re-design the rest of the program.
  • it makes your code easier to test (see below), to debug, and to understand.
Of course, every function should be commented, as described above.
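The point about interfaces can be shown with a small sketch: two functions with the same inputs and outputs but different internal machinery, and a caller which cannot tell them apart. All names here (rms_error, rms_error_slow, compare_models) are hypothetical.

```r
# Root-mean-square error between predictions and observations
# Inputs: two numeric vectors of equal length (predicted, observed)
# Output: a single non-negative number
rms_error <- function(predicted, observed) {
  stopifnot(length(predicted) == length(observed))
  sqrt(mean((predicted - observed)^2))
}

# Same interface, different internal machinery (explicit accumulation)
rms_error_slow <- function(predicted, observed) {
  stopifnot(length(predicted) == length(observed))
  total <- 0
  for (i in seq_along(predicted)) {
    total <- total + (predicted[i] - observed[i])^2
  }
  sqrt(total / length(predicted))
}

# A caller which only knows the interface: it works with either version
# Inputs: list of prediction vectors, vector of observations, error function
# Output: vector of errors, one per element of the list
compare_models <- function(predictions_list, observed, error_fn = rms_error) {
  sapply(predictions_list, error_fn, observed = observed)
}
```

Swapping one error function for the other changes nothing in compare_models, which is the sense in which the rest of the program only has to care about inputs and outputs.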
Never do the same thing twice
Many programs involve doing the same thing multiple times, either as iteration, or to slightly different pieces of data, or with some parameters adjusted, etc. Never write two pieces of code to do the same job. Never copy the same piece of code into two places in your program. Instead, write one piece of code (generally a function; see above) and call it twice.
Doing this means that there is only one place to make a mistake, rather than many. It also means that when you fix your mistake, you only have one piece of code to correct, rather than many. (Even if you don't make a mistake, you can always make improvements, and then there's only one piece of code you have to work on.) It also leads to shorter, more comprehensible and more adaptable code.
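A small sketch of the difference, using standardization of a variable as the repeated job. The function name standardize and the toy data are assumptions made for this example.

```r
# Copy-and-paste version (what to avoid): the same recipe typed out twice,
# so any mistake has to be found and fixed in two places
# height_std <- (height - mean(height)) / sd(height)
# weight_std <- (weight - mean(weight)) / sd(weight)

# One function, called as often as needed
# Input: numeric vector x (presumes at least two distinct values)
# Output: x shifted to mean 0 and scaled to standard deviation 1
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Toy data, made up for illustration
height <- c(150, 160, 170, 180)
weight <- c(50, 65, 70, 90)

height_std <- standardize(height)
weight_std <- standardize(weight)
```

If it later turns out that missing values need handling, only the body of standardize has to change.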
Use meaningful names
Unlike some older languages, R lets you give variables and functions names of essentially arbitrary length and form. So give them meaningful names. Writing loglikelihood, or even loglike, instead of L makes your code a little longer, but generally a lot clearer, and it runs just the same.
This rule is lower down in the list because there are exceptions and qualifications. If your code is tightly associated to a mathematical paper, or to a field where certain symbols are conventionally bound to certain variables, you may as well use those names (e.g., call the probability of success in a binomial p). You should, however, explain what those symbols are in your comments. In fact, since what you regard as a meaningful name may be obscure to others (e.g., me, when I am grading your work), you should use comments to explain variables in any case. Finally, it's OK to use single-letter variable names for counters in loops (but see the advice on iteration below).
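As a sketch of how the rule and its exception can sit together: a descriptive function name, a conventional single-letter symbol for the binomial success probability, and comments explaining both. The function and argument names are invented for this example.

```r
# Log-likelihood of a binomial success probability
# Inputs: p, the probability of success (the conventional symbol);
#   n_successes, the number of successes observed;
#   n_trials, the number of trials
# Presumes: 0 < p < 1 and 0 <= n_successes <= n_trials
# Output: the log-likelihood, a single number
binomial_loglikelihood <- function(p, n_successes, n_trials) {
  dbinom(n_successes, size = n_trials, prob = p, log = TRUE)
}
```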
Check whether your program works
It's not enough --- in fact it's very little --- to have a program which runs and gives you some output. It needs to be the right output. You should therefore construct tests, which are things that the correct program should be able to do, but an incorrect program should not. This means that:
  • you need to be able to check whether the output is right;
  • you should program the test, so it checks whether the output is right (and you can easily repeat the test as many times as you need);
  • your tests should be reasonably severe, so that it's hard for an incorrect program to pass them;
  • your tests should help you figure out what isn't working.
Try to write tests for the component functions, as well as the program as a whole. That way you can see where failures are. Also, it's easier to figure out what the right answers should be for small parts of the problem than the whole.
Try to write tests as very small functions which call the component you're testing with controlled input values. For instance, a test for a function which supposedly calculates derivatives might check whether it gets the derivative of x^2 or 7e^(-5x) right at ten randomly-chosen points. The testing function should warn you if the computed derivatives differ from the actual derivatives by more than a tolerance you specify. (That's why you're using such simple functions: you know what their actual derivatives are.)
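Here is a minimal sketch of such a test. The derivative routine (a central-difference approximation) is a stand-in for whatever component you are actually testing, the second test function is taken to be 7*exp(-5*x), and the names, step size, tolerance and range of test points are all assumptions made for this example.

```r
# Numerical derivative by central differences (the component under test)
# Inputs: function f, vector of points x, step size h
# Output: vector of approximate derivatives of f at the points x
numerical_derivative <- function(f, x, h = 1e-5) {
  (f(x + h) - f(x - h)) / (2 * h)
}

# Test: compare against a known derivative at randomly-chosen points
# Inputs: derivative routine, function f, its true derivative,
#   number of test points, tolerance
# Side-effects: issues a warning if any point is off by more than tolerance
# Output: TRUE if all points pass, FALSE otherwise
test_derivative <- function(deriv_fn, f, true_derivative,
                            n_points = 10, tolerance = 1e-6) {
  x <- runif(n_points, min = 0, max = 1)
  errors <- abs(deriv_fn(f, x) - true_derivative(x))
  passed <- all(errors <= tolerance)
  if (!passed) {
    warning("derivative off by up to ", max(errors),
            " at x = ", x[which.max(errors)])
  }
  passed
}

# Two simple cases where the right answer is known
test_derivative(numerical_derivative,
                function(x) x^2, function(x) 2 * x)
test_derivative(numerical_derivative,
                function(x) 7 * exp(-5 * x), function(x) -35 * exp(-5 * x))
```

The warning reports the worst point, which goes some way towards the requirement that a test should help you figure out what isn't working.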
With statistical procedures, tests can look at average or distributional results. For example, I once wrote a program to estimate some parameters by maximum likelihood; I could then use the fact that a likelihood ratio test should have a chi-squared distribution to check that the estimation part was working properly.
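A sketch of what a distributional test of this kind could look like, on a deliberately easy problem. This is not the program mentioned above; the choice of an exponential model, the function names and the use of a Kolmogorov-Smirnov test are all assumptions made for illustration.

```r
# Log-likelihood of an exponential rate, given data x
exp_loglike <- function(rate, x) {
  length(x) * log(rate) - rate * sum(x)
}

# Maximum-likelihood estimate of the rate (the estimator under test)
exp_mle <- function(x) {
  1 / mean(x)
}

# Likelihood-ratio statistic for the true rate against the MLE,
# computed on one simulated data set of size n
simulate_lr_statistic <- function(n, true_rate) {
  x <- rexp(n, rate = true_rate)
  2 * (exp_loglike(exp_mle(x), x) - exp_loglike(true_rate, x))
}

# Distributional test: over many simulations, the statistics should look
# like draws from a chi-squared distribution with 1 degree of freedom
# Output: p-value of a Kolmogorov-Smirnov test against that distribution
#   (very small p-values suggest the estimation code is broken)
check_lr_distribution <- function(n_reps = 1000, n = 200, true_rate = 2) {
  lr_statistics <- replicate(n_reps, simulate_lr_statistic(n, true_rate))
  ks.test(lr_statistics, "pchisq", df = 1)$p.value
}
```

Because the chi-squared result is only asymptotic, and the test is random, a single small p-value is a reason to look harder rather than proof of a bug.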
Of course, unless you are very clever, or the problem is very simple, a program could pass all your tests and still be wrong, but a program which fails your tests is definitely not right.
(Some people would actually advise writing your tests before writing any actual functions. They have their reasons but I think that's overkill for my courses.)
Don't give up; complain!
Sometimes you may be convinced that I have given you an impossible programming assignment, or may not be able to get some of the class code to work properly, etc. In these cases, do not just turn in nothing saying "I couldn't get the data file to load". Let me know. Most likely, either there is a trick which I forgot to mention, or I made a mistake in writing out the assignment. Either way, you are much better off telling me and getting help than you are turning in nothing.
When complaining, tell me what you tried, what you expected it to do, and what actually happened. The more specific you can make this, the better. If possible, attach the relevant R session log and workspace to your e-mail.
Of course, this presumes that you start the homework earlier than the night before it's due.
Avoid iteration
This one is very much specific to R. Explicit iteration in R is slow. (We could talk about the reasons for that sometime if you're interested.) In many languages, this would be a reasonable way of summing two vectors:
for (i in 1:length(a)) {
  c[i] = a[i] + b[i]
}
In R, this is stupid. R is designed to do all this in a single "vectorized" operation:
c = a + b
Since we need to add vectors all the time, this is an instance of using a single function repeatedly, rather than writing the same loop many times. (R just happens to call the function "+".) It is also orders of magnitude faster than the explicit loop, if the vectors are at all long.
Try to think about vectors as vectors, and, when you need to do something to them, manipulate all their elements at once, in parallel. R is designed to let you do this (especially through the apply function and its relatives), and the advantage of getting to write a+b, instead of the loop, is that it is shorter, harder to get wrong, and emphasizes the logic (adding vectors) over the implementation. (Sometimes this won't speed things up much, but even then it has advantages in clarity.)
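The following sketch puts the loop and the vectorized form side by side and checks that they agree, then shows the same habit with apply and with other built-in functions. The data are random toy values; wrapping either version in system.time() will show how they compare on your own machine.

```r
# Toy data, made up for illustration
a <- runif(1e5)
b <- runif(1e5)

# Explicit loop, with the result vector allocated in advance
loop_sum <- numeric(length(a))
for (i in seq_along(a)) {
  loop_sum[i] <- a[i] + b[i]
}

# Vectorized version: same result, one line
vector_sum <- a + b
stopifnot(isTRUE(all.equal(loop_sum, vector_sum)))

# The same habit applies beyond "+": built-in functions usually accept
# whole vectors, and apply() handles the rows or columns of a matrix
measurements <- matrix(rnorm(50), nrow = 10, ncol = 5)
column_ranges <- apply(measurements, 2,
                       function(column) max(column) - min(column))
log_ratio <- log(a / b)   # element-wise division, then element-wise log
```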
I emphasize again, however, that the speed issue is highly specific to R, and the way it handles iteration. A good programming class (see above) will explain the virtues of iteration, and how to translate iteration into recursion and vice-versa.


Corrupting the Young

Posted by crshalizi at December 19, 2008 20:45 | permanent link

Three-Toed Sloth:   Hosted, but not endorsed, by the Center for the Study of Complex Systems