April 01, 2014
"Proper scoring rules and linear estimating equations in exponential families" (Next Week at the Statistics Seminar)
Attention conservation notice: Only of interest if you (1) care about estimating complicated statistical models, and (2) will be in Pittsburgh on Monday.
Much of what I know about graphical models I learned from Prof. Lauritzen's book. His work on sufficient statistics and extremal models, and their connections to symmetry and prediction, has shaped how I think about big chunks of statistics, including stochastic processes and networks. I am really looking forward to this.

(To add some commentary purely of my own: I sometimes encounter the idea that frequentist statistics is somehow completely committed to maximum likelihood, and has nothing to offer when that fails, as it sometimes does [1]. While I can't of course speak for every frequentist statistician, this seems silly. Frequentism is a family of ideas about when probability makes sense, and it leads to some ideas about how to evaluate statistical models and methods, namely, by their error properties. What justifies maximum likelihood estimation, from this perspective, is not the intrinsic, inalienable rightness of taking that function and making it big. Rather, it's that in many situations maximum likelihood converges to the right answer (consistency), and in a somewhat narrower range of situations converges as fast as anything else (efficiency). When those properties fail, so much the worse for maximum likelihood; use something else that is consistent. In situations where maximizing the likelihood has nice mathematical properties but is computationally intractable, so much the worse for maximum likelihood; use something else that's consistent and tractable. Estimation by minimizing a well-behaved objective function has many nice features, so when we give up on likelihood it's reasonable to try minimizing some other proper scoring function, but again, there's nothing which says we must.)

[1]: It's not worth my time today to link to particular examples; I'll just say that, from my own reading and conversation, this opinion is not totally confined to the kind of website which proves that Rule 34 applies even to Bayes's theorem.
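(To make that last point concrete, here is a toy sketch of my own, not from the talk, with made-up names: estimating a Bernoulli parameter by minimizing the Brier score, a proper scoring rule, rather than by maximizing the likelihood. For this particular model the two estimates coincide, since the Brier score is minimized at the sample mean.)

```r
# Estimation by minimizing a proper scoring rule: fit a Bernoulli
# parameter p by minimizing the Brier score instead of maximizing
# the likelihood.  (Toy example; all names are mine.)
brier <- function(p, x) mean((x - p)^2)  # a proper scoring rule

x <- rbinom(1000, size = 1, prob = 0.3)  # simulated data, true p = 0.3
p.brier <- optimize(brier, interval = c(0, 1), x = x)$minimum
p.mle <- mean(x)  # the maximum likelihood estimate, for comparison
# p.brier and p.mle agree (up to the optimizer's tolerance), and both
# are consistent for the true p
```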
Posted by crshalizi at April 01, 2014 10:45 | permanent link
March 15, 2014
"The Neglect of Fluctuations in the Thermodynamics of Computation" (Next Week at the Energy and Information Seminar)
Lo these many years ago, I blogged about how a paper of John Norton's had led me to have doubts about Landauer's Principle. Prof. Norton has continued to work on this topic, and I am very happy to share the news about his upcoming talk at CMU's "Energy and Information" seminar:
(For the record, I remain of at least two minds about Landauer's principle. The positive arguments for it seem to be either special cases or circular, but the conclusion makes so much sense...)

Manual trackback / update, 1 April 2014: Eric Drexler at Metamodern, who objects that "the stages of computation themselves need not be in equilibrium with one another, and hence subject to back-and-forth fluctuations" (his italics). In particular, Drexler suggests introducing an external time-varying potential that "can carry a system deterministically through a series of stages while the system remains at nearly perfect thermodynamic equilibrium at each stage". But I think this means that the whole set-up is not in equilibrium, and in fact this proposal seems quite compatible with sec. 2.2 of Norton's "No-Go" paper. Norton agrees that "there is no obstacle to introducing a slight disequilibrium in a macroscopic system in order to nudge a thermodynamically reversible process to completion"; his claim is that the magnitudes of the required disequilibria, measured in terms of free energy, are large compared to Landauer's bound. The point is not that it's impossible to build molecular-scale computers (which would be absurd), but that they will have to dissipate much more heat than Landauer's principle suggests. I won't pretend this settles the matter, but I do have a lecture to prepare...

Posted by crshalizi at March 15, 2014 11:25 | permanent link
March 04, 2014
"Unifying the Counterfactual and Graphical Approaches to Causality" (Tomorrow at the Statistics Seminar)
Attention conservation notice: Late notice of an academic talk in Pittsburgh. Only of interest if you care about the places where the kind of statistical theory that leans on concepts like "the graphical Markov property" merges with the kind of analytical metaphysics which tries to count the number of possibly fat men not currently standing in my doorway.

A great division in the field of causal inference in statistics is between those who like to think of everything in terms of "potential outcomes", and those who like to think of everything in terms of graphical models. More exactly, while partisans of potential outcomes tend to denigrate graphical models (*), those of us who like the latter tend to presume that potential outcomes can be read off from graphs, and to hope that someone will get around to showing some sort of formal equivalence. That somebody appears to have arrived.
As always, the talk is free and open to the public, whether the public follows their arrows or not.

*: I myself have heard Donald Rubin assert that graphical models cannot handle counterfactuals, or non-additive interactions between variables (particularly that they cannot handle non-additive treatments), and that their study leads to neglecting analysis-of-design questions. (This was during his talk at the CMU workshop "Statistical and Machine Learning Approaches to Network Experimentation", 22 April 2013.) This does not diminish Rubin's massive contributions to statistics in general, and to causal inference in particular, but it does not exactly indicate a thorough knowledge of a literature which goes rather beyond "playing with arrows".

Posted by crshalizi at March 04, 2014 17:50 | permanent link
February 28, 2014
On Reaching the Clearing at the End of the Tenure Track
Attention conservation notice: Navel-gazing by a middle-aged academic.

I got tenure a few weeks ago. (Technically, it takes effect in July.) The feedback from the department and university which accompanied the decision was gratifyingly positive, and I'm pleased that blogging didn't hurt me at all, and perhaps even helped. I got here with the help of a lot of mentors, colleagues, and friends (not a few of whom I met through this blog), and I feel some vindication on their behalf. For myself, I feel relieved, even pleased. Relieved and pleased, but not triumphant.

I benefited from a huge number of lucky breaks. I know too many people who would be at least as good in this sort of job as I am, and would like such a job, but instead have ones which are far less good for them. If a few job applications, grant decisions, or choices about what to work on when had turned out a bit different, I could have been just as knowledgeable, had ideas just as good, worked on them as obsessively, etc., and still been in their positions, at best. Since my tenure decision came through, I've had two papers and a grant proposal rejected, and another paper idea I've been working on for more than a year scooped. A month like that at the wrong point earlier on might well have sunk my academic career. You don't get tenure at a major university without being a productive scholar (not for the most part, anyway), but you also don't get it without being crazily lucky, and I can't feel triumphant about luck.

It's also hard for me to feel triumph because, by the time I get tenure, I will have been at CMU for nine years and change. Doing anything for that long marks you, or at least it marks me, and I'm not sure I like the marks. The point of tenure is security, and I hope to broaden my work, to follow some interests which are more speculative and risky and seem like they will take longer to pay off, if they ever do. But I have acquired habits and made commitments which will be very hard to shift.
One of those habits is to think of my future in terms of what sort of scholarly work I'm going to be doing, and to presume that I will be working all the time, with only weak separation between work and the rest of life. I even have some fear that this has deformed my character, making some ordinary kinds of happiness insanely difficult. But maybe "deformed" is the wrong word; maybe I stuck with this job because I was already that kind of person. I can't bring myself to wish I wasn't so academic in my interests, or that I hadn't pursued the career I have, or that I had been less lucky in it. But I worry about what I have given up for it, and how those choices will look in another nine years, or twenty-nine.

Sometime in the future, I may write about what I think about tenure as an institution. But today is a beautiful winter's day here in Pittsburgh, cold but clear, and the sky is a brilliant pale blue right now. It's my 40th birthday. I'm going outside to take a walk, and then probably going back to work.

Posted by crshalizi at February 28, 2014 15:52 | permanent link
January 02, 2014
36-350, Fall 2013: Self-Evaluation and Lessons Learned (Introduction to Statistical Computing)
This was not one of my better performances as a teacher. I felt disorganized and unmotivated, which is a bit perverse, since it's the third time I've taught the class, and I know the material very well by now. The labs were too long, and my attempts to shove the excess parts of the labs into revised homework assignments did not go over very well. The final projects were decent, but on average not as good as in the previous two years.

I have two ideas about what went wrong. One is, of course, about kids these days (i.e., blaming the victims), and the other is about my own defects of character.

First, in retrospect, previous iterations of the course benefited from the fact that there hadn't been an undergraduate course here in statistical computing. This meant there was a large pool of advanced statistics majors who wanted to take it, but already knew a lot of the background materials and skills; the modal student was also more academically mature generally. That supply of over-trained students is now exhausted, and it's not coming back either, since the class is going to become a requirement for the statistics major. (As it should.) So I need to adjust my expectations of what students know and can do on their own downward, in a major way. More exactly, if I want them to know how to do something, I have to make sure I teach it to them, and cut other things from the curriculum to make room. This, I signally failed to do.

Second, I think the fact that this was the third time I have taught basically the same content was in fact part of the problem. It made me feel too familiar with everything, and gave me an excuse for not working on devising new material up until the last moment, which meant I didn't have everything at my fingertips, and frankly I wasn't as excited about it either.

Putting these together suggests that a better idea for next time would be something like the following.
All of this will be a lot of work for me, but that's part of the point. Hopefully, I will make the time to do this, and it will help. Posted by crshalizi at January 02, 2014 18:01  permanent link
January 01, 2014
End of Year Inventory, 2013
Attention conservation notice: Navel-gazing.

Paper manuscripts completed: 4
Grant proposals submitted: 5
Talks given and conferences attended: 17, in 10 cities
Classes taught: 2 [i, ii]
Manuscripts refereed: 21
Students who completed their dissertations: 0
Promotions received: 0
Book reviews published on dead trees: 0
Weblog posts: 93
Books acquired: 260
Major life transitions: 1

Posted by crshalizi at January 01, 2014 00:01 | permanent link
December 04, 2013
Computing for Statistics (Introduction to Statistical Computing)
(My notes from this lecture are too fragmentary to post; here's the sketch.)

What should you remember from this class?

Not: my mistakes (though remember that I made them).
Not: specific packages and ways of doing things (those will change).
Not: the optimal algorithm, the best performance (human time vs. machine time).
Not even: R (that will change). R is establishing itself as a standard for statistical computing, but you can expect to have to learn at least one new language in the course of a reasonable career of scientific programming, and probably more than one. I was taught rudimentary coding in Basic and Logo, but really only learned to program in Scheme. In the course of twenty years of scientific programming, I have had to use Fortran, C, Lisp, Forth, Expect, C++, Java, Perl and of course R, with glances at Python and OCaml over collaborators' shoulders, to say nothing of near-languages like Unix shell scripting. This was not actually hard, just tedious. Once you have learned to think like a programmer in one language, getting competent in the syntax of another is just a matter of finding adequate documentation and putting in the time to practice it, or finding minimal documentation and putting in even more time (I'm thinking of you, CAMForth). It's the thinking-like-a-programmer bit that matters.

Instead, remember rules and habits of thinking
Posted by crshalizi at December 04, 2013 10:30  permanent link
December 02, 2013
Speed, Complexity and Interfacing with Other Systems (Introduction to Statistical Computing)
(My notes from this lecture are too fragmentary to type up; here's the sketch.)

Programmer time is (usually) much more valuable than computer time; therefore, "premature optimization is the root of all evil". That said, computer time isn't free, and if you have to wait around for the machine to finish, computer time is programmer time.

Vectorize

In R, vectorization can lead to great speed-ups. The reason has to do with how R changes existing objects: it makes a new instance of the object, copies over the unchanged values and fills in the changed one, and then deletes the old instance of the object. This makes repeated small changes to large objects very inefficient. (On the other hand, incrementally updating a handful of parameters in optimization isn't that bad.) Changing the size of the object is even more inefficient. Vectorization avoids all this allocation, copying, and de-allocation.

very.stupid.adder <- function(x,y) {
  z <- c()
  for (i in 1:length(x)) {
    z <- c(z, x[i]+y[i])
  }
  return(z)
}
stupid.adder <- function(x,y) {
  z <- vector(length=length(x))
  for (i in 1:length(x)) {
    z[i] <- x[i] + y[i]
  }
  return(z)
}
> a <- rnorm(1e5)
> b <- rnorm(1e5)
> 1e6*system.time(very.stupid.adder(a,b)) # time in microseconds
    user   system  elapsed
20266000  3224000 23409000
> 1e6*system.time(stupid.adder(a,b))
  user  system elapsed
204000    1000  204000
> 1e6*system.time(a+b)
  user  system elapsed
     0       0       0

(R's copy-on-change policy isn't a pointless perversity; knowing that objects don't "change in place" makes some memory management much more efficient.) One way to speed up your code, then, is to think carefully about how to vectorize it. Again, the major advantage of vectorization is clarity, but orders-of-magnitude speed-ups aren't bad either.

Cultivate Slop and Arbitrary Ignorance

We can often speed things up by accepting a more approximate answer in place of a more exact one.
For statistical problems, in particular, there is no point in making our answers more precise than the data will justify. We've emphasized this for optimization, but it applies to lots of numerical computations as well. (Here, it is useful to understand the trade-off between computing effort and approximation. If getting a precision of \( \pm \epsilon \) takes \( O(\log{1/\epsilon}) \) operations, then doubling the \( \epsilon \) we tolerate shaves only a constant amount of time off the calculation; we'd have to accept exponentially more error to get real speed-ups. [Put optimistically: if we can do the calculation at all, we should be able to do it precisely.] If, however, the scaling is \( O(1/\epsilon^2) \), doubling \( \epsilon \) would reduce the time by a factor of four.)

Computational problems also generally slow down as the data size increases. For many statistical problems, we can actually make use of this, by sampling the data and only running our procedure on a sample. This introduces some approximation error, compared to running on the full data, but the trade-off can be well worth it. Moreover, we can cross-check different randomized subsets of the data, if we're worried about being misled by small parts of it. If you can't sample, try to reduce how often you have to go over the data. There is also a whole, very important, field of computer science concerned with developing randomized approximation algorithms (see, e.g., Mahoney on randomized matrix algorithms).

Parallelize

If you can break your problem up into independent chunks, you can farm each chunk out to a separate processor, and do them in parallel. (This presumes you have a lot of processors.) The split/apply/combine strategy is designed to work with this. It may be possible to do this for one stage in your problem, followed by others which don't divide neatly; you then alternate between embarrassingly-parallel and centralized processing.
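A minimal sketch of the embarrassingly-parallel case, using R's base parallel package (the task here is a made-up stand-in for an expensive, independent chunk of work):

```r
# Split/apply/combine in parallel with the base "parallel" package.
# mclapply() forks the R process, which is not supported on Windows,
# so we fall back to one core there (parLapply() with an explicit
# cluster is the Windows alternative).
library(parallel)

one.chunk <- function(i) {
  # stand-in for an expensive, independent piece of work: one
  # bootstrap replicate of a trimmed mean
  x <- rnorm(1e4)
  mean(sample(x, replace = TRUE), trim = 0.1)
}

n.cores <- if (.Platform$OS.type == "unix") detectCores() else 1
serial.results <- lapply(1:100, one.chunk)              # apply, serially
parallel.results <- mclapply(1:100, one.chunk,          # apply, in parallel
                             mc.cores = n.cores)
boot.se <- sd(unlist(parallel.results))                 # combine
```

The point of the sketch is that the parallel version has the same interface as `lapply`; the "combine" step at the end is ordinary centralized processing.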
There are also problems which lend themselves to parallel processing with communication between the processors; these tend to be ones which are somehow "local" in space. I am vague here, because the technology for parallelizing data processing is advancing rapidly.

Exploit Others' Work

Don't reinvent the wheel: look for well-executed software which does a part of your job, and use it. (Don't write your own linear-algebra package, unless you really know what you are doing, and have a good reason not to just improve existing packages.) Modularity and abstraction make it easier to use others' work, and to replace one external library with another, better one.

Give Up on R

There are highly optimized tools in the Unix shell for particular data-manipulation jobs. (Bradnam and Koch just scratches the surface.) Cultivating their acquaintance is a good use of your time. Remember that system() lets you call arbitrary system functions, and capture the output. At some point, speed (and maybe memory) efficiency point to a more streamlined, compiled language for specific parts of your work. The natural choices here are C and (still) Fortran; see Writing R Extensions. You could write all your data-analysis software in C, but then you will spend all your time tracking down memory-allocation bugs.

Actual Complexity Analysis

Computational complexity: the resources (time [in operations, not seconds], space [in bits, not liters], communication, samples) needed as a function of problem size. Mergesort, for example. There are intrinsic limits (so far as we can tell) to how easily certain sorts of problems can be solved. Clever algorithms vs. exploiting special structure in problems. This really deserves attention on its own: read Moore and Mertens. (Though Kleinberg and Tardos may be a bit more directly practical, Moore and Mertens will do more for your soul.)

Posted by crshalizi at December 02, 2013 10:30 | permanent link
November 25, 2013
Simulation V: Matching Simulation Models to Data (Introduction to Statistical Computing)

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \DeclareMathOperator*{\argmin}{argmin} \]

(My notes for this lecture are too incomplete to be worth typing up, so here's the sketch.)

Methods, Models, Simulations

Statistical methods try to draw inferences about unknown realities from data. They always involve some more or less tightly specified stochastic model, some story (myth) about how the data were generated, and so about how the data are linked to the world. We want to know how well our methods work, which is usually hard to assess by just running them on real data. Sometimes, we can push through analytic probability calculations under our model assumptions, but often not. Then we can get some baseline idea of how well our methods should work by simulating the models, running the simulation output through the methods, and comparing the results to what we know the truth was in the simulation world. (That is, we see if we can recover the myth.) We can also see how robust our methods are by simulating processes which break the modeling assumptions more or less grossly.

If your method, or your implementation of a common method, doesn't work under simulations, then you had better have a very good argument for why it should be trusted with real data. Always go back and forth between simulations and method development: \[ \mathrm{simulation}:\mathrm{methods}::\mathrm{unit\ tests}:\mathrm{functions} \]

Adjusting Simulation Models to Match Data

Many (most?) useful probability models are easily specified as step-by-step recipes for simulation, but have ugly, complicated distributions for the data. (Nonlinear dependencies and latent variables both tend to lead to complicated distributions.) Since the models have adjustable parameters, we would like to adjust them so that they match the data we have about the world.
If the distributions were straightforward, we could just calculate the likelihood and maximize it (or use Bayesian updating, etc.), but that's not really possible when the distribution is ugly. What to do?

Method of Moments and Friends

Moments ( \( \Expect{X}, \Expect{X^2}, \ldots \) ) are functionals of the probability distribution. If our model has a \( k \)-dimensional parameter vector \( \theta \), then we should generally be able to find \( k \) moments which characterize or identify the parameter, meaning that there are functions \( f_1, f_2, \ldots f_k \) such that (i) \( \Expect{X} = f_1(\theta), \Expect{X^2} = f_2(\theta), \ldots \Expect{X^k} = f_k(\theta) \), and (ii) if \( \theta_1 \neq \theta_2 \), then at least one of the \( k \) moments differs, \( f_j(\theta_1) \neq f_j(\theta_2) \) for some \( j \in 1:k \). If we knew the functions \( f_1, \ldots f_k \), and we knew the population moments, we could then solve for the matching \( \theta \). (We could also test the model, as opposed to the parameter value, by seeing whether that \( \theta \) let us match other population moments; and of course the model would be in trouble if there were no solution for \( \theta \).) Since we only see sample moments, in the method of moments we try to solve for the parameters which match them: \( \hat{\theta}_{MM} \) is the \( \theta \) where \[ f_i(\theta) = \overline{x^i} ~,~ i \in 1:k \] (We can then test this by looking at whether we can match other, mathematically-independent functionals.) Since it's annoying to have to keep the subscripts and superscripts \( i \) around, let's say that \( f(\theta) \) is the vector-valued function of theoretical moments, and \( m \) is the vector of sample moments. Our earlier assumption (ii) means that \( f^{-1} \) exists, so \[ \hat{\theta}_{MM} = f^{-1}(m) \] By the law of large numbers, we expect \( m \) to approach the population moments as we see more data.
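As a toy illustration (my own, with made-up function names), here is the method of moments for a case where \( f \) inverts in closed form: the gamma distribution, where \( \Expect{X} = \mathrm{shape} \times \mathrm{scale} \) and \( \mathrm{Var}[X] = \mathrm{shape} \times \mathrm{scale}^2 \).

```r
# Method of moments for the gamma distribution, where the moment
# equations can be solved by hand: scale = Var/E[X], shape = E[X]^2/Var.
gamma.mm <- function(x) {
  m1 <- mean(x)
  v <- mean(x^2) - m1^2   # second central moment from the first two raw ones
  c(shape = m1^2 / v, scale = v / m1)
}

x <- rgamma(1e4, shape = 2, scale = 3)
gamma.mm(x)   # should be close to (2, 3), with O(1/sqrt(n)) error
```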
For \( n \) not-too-dependent observations, the error should be \( O(1/\sqrt{n}) \), and then if \( f^{-1} \) is a reasonably smooth function, propagation of error tells us that \( \left\| \hat{\theta}_{MM} - \theta \right\| = O(1/\sqrt{n}) \) itself. It might happen, due to sampling fluctuations, that the observed moments \( m \) are not ones which could ever be theoretical moments for the model, i.e., that \( m \) is not in the range of \( f(\theta) \), no matter how we adjust \( \theta \). The usual approach then is to minimize the discrepancy: \[ \hat{\theta}_{MM} = \argmin_{\theta}{\left\| f(\theta) - m \right\|} \] (While I've written this for the ordinary Euclidean norm, it is often better to minimize a weighted norm, giving more emphasis to the moments which have smaller standard errors. This refinement can come later.)

So far I've said nothing about simulation. I've presumed that we can find the moments as explicit functions of the parameters, i.e., that we know \( f \). However, if we can simulate our model at our favorite \( \theta \), we can get arbitrarily accurate approximations to \( f \). Specifically, if we do \( b \) separate runs of the model at \( \theta \), and average the output, we get an approximation of \( f(\theta) \), say \( \tilde{f}(\theta) \), whose error is \( O(1/\sqrt{b}) \). We can thus look at \[ \hat{\theta}_{MSM} = \argmin_{\theta}{\left\| \tilde{f}(\theta) - m \right\|} \] and try to minimize it, to get our method of simulated moments estimate. There will now be two sources of error: simulation error (shrinking with \( b \), and under our control), and statistical error (shrinking with \( n \), and out of our control). Again, it can be better to use a weighted norm, emphasizing the moments which can be estimated more precisely. There is nothing magic about moments; all we need is a set of functionals of the distribution which characterize the parameters. The foregoing carries over without modification to the generalized method of moments, and the generalized method of simulated moments.
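A minimal sketch of the method of simulated moments (toy model and names mine; the gamma again, pretending we couldn't compute its moments analytically):

```r
# Method of simulated moments: approximate f(theta) by averaging b
# simulation runs, then minimize the squared discrepancy with optim().
# Optimizing on the log scale keeps both parameters positive.
simulated.moments <- function(theta, b = 100, n = 1000) {
  sims <- replicate(b, {
    x <- rgamma(n, shape = theta[1], scale = theta[2])
    c(mean(x), mean(x^2))
  })
  rowMeans(sims)   # approximates f(theta) = (E[X], E[X^2])
}

msm.estimate <- function(x) {
  m <- c(mean(x), mean(x^2))   # the sample moments
  discrepancy <- function(log.theta) {
    sum((simulated.moments(exp(log.theta)) - m)^2)
  }
  exp(optim(c(0, 0), discrepancy)$par)
}

x <- rgamma(2000, shape = 2, scale = 3)
msm.estimate(x)   # noisy, but should be in the general vicinity of (2, 3)
```

Note that the objective here re-randomizes on every evaluation, so it is noisy in \( \theta \); in practice one fixes the random seed across evaluations ("common random numbers") so that the surface being minimized is smooth.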
It is often particularly useful to make the "generalized moments" the parameters in some other statistical model, one which is mis-specified but very easy to estimate; then we have indirect inference.

A Brute-Force Approach to Simulated Maximum Likelihood

Observe data \( x \). Pick an initial \( \theta \). Do \( b \) independent runs at that parameter value. Use your favorite nonparametric density estimator to get \( \tilde{p}_{\theta} \). (As \( b \rightarrow \infty \), the density estimate converges on the true density, \( \tilde{p}_{\theta} \rightarrow p_{\theta} \).) Now evaluate that estimated density at \( x \), and treat \( \tilde{p}_{\theta}(x) \) as the likelihood. Adjust \( \theta \) to maximize \( \tilde{p}_{\theta}(x) \). This is not very efficient: as the dimension of \( x \) grows, \( b \) has to grow exponentially to maintain a constant accuracy. Moreover, our representation of \( \tilde{p}_{\theta} \) is going to be horridly redundant, using a huge number of degrees of freedom when we only need \( k \). There are ways one could begin to get around these issues, but that's beyond the scope of these notes.

Checking the Simulation Model Against Data

Because we can simulate, we can look at what our model predicts for arbitrary aspects of the distribution of the data. Basically any statistic of the data can thus be a test statistic. What makes a good test statistic? Obviously, if we adjusted the parameters to match a certain statistic, that is not a good test. (Or at least, I hope it's obvious that if we adjusted the parameters to match the sample variance, we learn nothing by seeing that simulations have about the same sample variance.) A little more subtly, nothing which depends mathematically on the statistics we matched is a good test. If we matched the first \( k \) moments, no function of those moments is suitable. (If we matched \( \overline{x} \) and \( \overline{x^2} \), we learn nothing by seeing if we can also match the sample variance.)
We might turn to seeing if we can match higher moments, but those tend to be rather unstable, so that doesn't have much power. The best recommendation I can make is to actually understand the process you're studying, and pick out interesting, relevant features which are logically independent of the moments you matched. This, however, requires subject-matter knowledge, rather than statistical or computational skill. In the absence of understanding, we can try to check the model using summary statistics other than the moments we used to estimate parameters (e.g., if we matched the mean, how well do we do on the median?). Another useful fallback is to try re-running exploratory data analyses on the simulation output, and seeing how well that matches the EDA on the data.

Over-identifying: If \( \theta \) is \( k \)-dimensional but we try to match more than \( k \) moments, it's not guaranteed that we can match all of them at once. The overall discrepancy \( \left\| \tilde{f}(\theta) - m \right\| \) should still shrink as we get more data, if the model is right. Under some circumstances, one can work out analytically what the distribution of that discrepancy should be under the model (e.g., \( \chi^2 \) or related), but you can also just simulate.

Breaking models: If you want to know how robust some model prediction is, you can deliberately search for parameters which will not make that prediction. Again, for reasons of space I won't go into details, but see Miller's paper. This isn't, strictly speaking, about whether the model fits the data, though it could be adapted to that.

Readings

(Woefully incomplete, in merely alphabetical order)
Posted by crshalizi at November 25, 2013 10:30  permanent link
November 22, 2013
Homework: Several Hundred Degrees of Separation (Introduction to Statistical Computing)
Homework 10: In which we refine our web-crawler from the previous assignment, by way of further working with regular expressions, and improve our estimates of PageRank. (This assignment ripped off from Vince Vu, with permission.) Posted by crshalizi at November 22, 2013 11:30 | permanent link
Lab: Baseball Salaries (Introduction to Statistical Computing)
In which America's true pastime proves to be wrestling with relational databases. Assignment, database (large!) Posted by crshalizi at November 22, 2013 10:30 | permanent link
November 20, 2013
Databases (Introduction to Statistical Computing)
Lecture 25: The idea of a relational database. Tables, fields, keys, normalization. The server-client model. An example of working with a database server. Intro to SQL, especially SELECT. Aggregation in databases is like split/apply/combine. Joining tables: what it is and how to do it. Examples of joinery. Accessing databases from R with the DBI package. Example database file (30MB): baseball.db Readings: Spector, chapter 3 Posted by crshalizi at November 20, 2013 10:30 | permanent link
November 18, 2013
Change of Representation (Introduction to Statistical Computing)
(My notes for this lecture are too fragmentary to post. What follows is the sketch.) The "raw data" is often not in the format most useful for the model one wants to work with. Lots of statistical computing work is about moving information from one format to another, i.e., about changing representations. Lossless transformations vs. lossy ones; why we often want lossy transformations. Reorganizing data to group it properly. (Example: going from multidimensional arrays to 2D dataframes and vice versa.) Aggregation as a change of representation. (Example: going from dates of adoption for each doctor to the cumulative proportion of adopters.) Text processing via change of representation: the bag-of-words ("vector space") representation. Cosine and Jaccard similarities. Term frequency-inverse document frequency. Document clustering and classification. Readings: Spector, chapters 8 and 9. On text: Lectures 1, 2 and 4 (+ slides) from data mining (vintage 2009). Posted by crshalizi at November 18, 2013 10:30 | permanent link
November 15, 2013
Homework: A Maze of Twisty Little Passages (Introduction to Statistical Computing)
Homework 10: In which we build a little web-crawler to calculate PageRank (the hard way), so as to practice working with text, regular expressions, and Markov chains. Supplied code, which may or may not contain deliberate bugs. (This assignment ripped off from Vince Vu, with permission.) Posted by crshalizi at November 15, 2013 11:30 | permanent link
Lab: Scrape the Rich (Introduction to Statistical Computing)
In which we practice extracting data from text, to learn about our betters. Assignment; files: rich1.html, rich2.html, rich3.html, rich4.html (This assignment ripped off from Vince Vu, with permission.) Posted by crshalizi at November 15, 2013 10:30  permanent link
