CSCS SoC Ideas --- Google Summer of Code 2010.
Under construction until 2010-03-12
For Google Summer of Code 2010 projects, CSCS would like to find students interested in working on one of the following projects. We anticipate we need at most 1 student for the first project, 2-3 students for the second one and 2-3 for the third.
However, if you have a project you think are related to complex systems research, please feel free to contact us, or better, submit a proposal via SoC to CSCS. If it is a project we feel we can mentor and one that is related to complex systems research, we will consider it.
If you have questions about these projects, please contact the head mentor for these projects, Rick.Riolo@gmail.com, Director of the CSCS Computing Lab.
Note: Some information you should include in your application:
FAQ -- please read regarding required/desired experience.
Project #1: GridSweeper Improvements and Enhancements
GridSweeper is a tool for running batch simulations with different parameter settings on grid computing systems. The core architecture was implemented for Google Summer of Code 2006, including code for generating many kinds of parameter sweeps and submitting jobs to the grid via DRMAA. In GSoC 2007-2009, several more pieces were implemented and tested:
We have an official 1.0 release version available here.
Since that time a number of researchers have used the Drone adapter for GridSweeper to carry out large scale experiments. For 2010 we would like to have someone continue to work on GridSweeper, with the primary goal being a complete Release 1.0, including:
Candidates for working on the GridSweeper project should have OOP experience, preferably with Java and/or Objective-C. Experience with SGE, PBS/Torque, Xgrid or other Grid technologies would be useful as well.
Project #2: Tools for Analysis of Computational ExperimentS (TFACES)
A large part of any computational modeling project involves analyzing and displaying the data generated, to help the experimenter understand what the model is doing and why, to guide the selection of additional model runs and experiments to be carried out, and ultimately to prepare data for use in publications. Besides the intellectual challenges of figuring out what is useful to measure and display, which tend to be somewhat dependent on the specific "content" of each model, there are also a number of technical challenges which often are common across models. For example, for typical agent/individual-based models (ABM or IBM) of Complex Adaptive Systems (CAS), an "experiment" typically consists of running the model many times with the same set of parameter values, just starting with different random number generator (RNG) seeds, to get a distribution of outcomes for that one "case". Then one typically will run many cases, each with different combinations of parameter values, to do sensitivity analysis or just to explore the space of possible parameter values (or initial conditions). Given all that data, it would be nice to have tools to calculate statistics across runs, within runs, for parts of runs, etc., and repeat for a set of cases (e.g., results for different values of a parameter), and create plots of various kinds based on the data and statistics.
Scientific computing environments, which provide extensive statistical and display capabilities, are typically interactive and provide useful and standardized tools for data exploration and analysis. They also typically rely on interpreted languages, making the implementation of computationally expensive models daunting at best. Further, these languages have limited or nonexistent support for object-oriented programming (OOP), which is at the heart of most agent-based models. OOP also tends to stress modularity, which tends to make code re-usable across models and also more amenable to the kind of sharing that is implied by the philosophy behind both open source software and scientific collaboration.
Compiled languages, such as Java, provide many advantages for computational models. In the first instance, their use typically results in a drastic increase in speed as compared to interpreted languages. Other benefits include the existence of many modeling packages, such as Repast, Netlogo, Mason and Swarm, which provide extensive tools for many functions including graphics and scheduling. Such models also tend to produce large volumes of data, which are typically stored either in text files or database formats. Meaningful analysis of these data typically requires the usage of a scientific computing package such as R, Matlab or Mathematica, and the often awkward and time-consuming writing of scripts to collect and interpret model output that is typically in a non-standard format.
Broadly speaking, TFACES is a suite of tools for managing the process of computational experimentation, from running models to analysis and display of results, as well as using those results to guide choice of parameters for additional experiments. TFACES has three broad goals---the first two are fairly straightforward and short-term, and the third much more ambitious and long-term.
The first goal of TFACES is to ease the transition from implementation to execution. Executing a large model is a complicated task, involving coordination between the model, storage objects, and queuing software. To ease this process, TFACES provides facilities to invisibly bridge the gap between the model and these components secondary to model design. Thus GridSweeper and GSDrone (see Project #1) can be considered part of the overall TFACES suite of tools.
The second goal of TFACES is to simplify the transition from model execution to model interpretation. This short/medium-term goal of TFACES is aimed at reducing the work involved in managing and analyzing data from computational models, by designing and creating generic tools for organizing, analyzing and displaying data generated by computational models of CAS. This project will involve writing a core set of tools to:
The tools will generic in that they will provide the user with features to select data to be analyzed, what data to aggregate (e.g., to calculate statistics), etc. The tools will also serve as "templates" which others could use as starting points for more specialized scripts. Furthermore, this set of core TFACES tools should form the basis of additional tools to automate various kinds of regression or sensitivity analysis.
The third, long term goal of TFACES is to work toward closing the loop between compiled computational models and standard statistical and scientific computing platforms, by creating an interface between analysis packages and compiled computational models. This would serve as part of the larger goal of confronting computational models with data whenever possible. By providing a standardized front-end for Java-based models that makes this task easier, we hope to enable faster prototyping of models and to make it easier for researches to understand both the general behavior of their models and their relationship to data.
For Google Summer of Code 2010, CSCS is looking for up to 3 students to work on the TFACES project, focusing on the second goal above, i.e., to continue work to design and create a generic, broadly useful set of tools to extract, organize and manipulate data generated by a wide range of computational experiments. However, part of the project time will be devoted to getting a start on goal #3, e.g., exploring possible approaches, describing specifications for a prototype, and implementing some prototype systems, to test particular approaches and to gain experience that can help in the design of subsequent versions. Good candidate interns for this project will have experience with python and addons like scipy, numpy, etc; experience with R and and various other scientific analysis packages (e.g., matlab, mathematica) would be useful as well.
A first cut at many of the basic tools needed for the TFACES core have already been implemented, work started in SoC 2008 and continued since then. Last year we got a start at a core of tools entirely to Python. Numpy and scipy can efficiently satisfy all representational and mathematical requirements. In addition to completing the the core set of tools, the TFACES protocol should be solidified, and a suite of end-user tools written that make use of the core tools to perform analysis and display tasks that are commonly carried out as part of computational experiments, e.g.:
and many other similar tasks. But to be clear, the TFACES plan is to (a) first complete the design and creation of a generic core set of tools to extract, organize and manipulate data from computational experiments, and then (b) to create end-user tools built using the core tools, which will serve as templates to show how the core tools are intended to be used, and of course will show that the core tools are adequate and well designed to meet our goals.
Building bridges to span the gap between model implementation and interpretation/communication requires a number of steps. Firstly, model implementations generate data in any variety of forms. TFACES must understand not only filesystem or database storage formats or protocols, but also the underlying scalar, vector, or matrix representations, whether time series or endpoint. Secondly, these representations should be easily translated into interpretations via the standard methods of regression, principal component analysis, and model selection. Furthermore, visualization methods must be implemented to communicate both the representation and the interpretation.
As time permits after the core of TFACES is completed and a set of commonly used end-user tools implemented using those core tools, the next iteration of TFACES will include support for model-based inference using monte carlo and markov-chain monte carlo (MCMC) statistical methods. These methods are useful in the context of agent-based models (ABMs) for several reasons. First, ABMs, particularly those developed for social and biological problems, typically have many degrees of freedom and may be opaque to analysts who do not employ quantitative techniques to understand their behavior. For example, if an ABM is constructed for the purpose of scenario-based planning, it is important to be able to verify that a) the model is capable of reproducing the scenario in question and b) to find the set of model parameters that make the scenario feasible. MCMC methods can be particularly useful in this case, as the state space of the model is likely to be very large and the posterior distribution of the parameters is typically computationally expensive to sample from. These tools are also important for testing the mapping between an ABM and a potentially simpler analytic representation of a problem: if the state space of an ABM is well-represented by a 'lighter' and analytically tractable method, it is often sensible to work with the simpler implementation. It may also be the case that in some areas of the parameter space, the simpler model may be substituted for the ABM, but in others areas, the behavior of the two models diverge. This information can be useful both from an analytical and computational perspective, but requires reliable and well-constructed tools to become part of the typical agent-based modelling workflow.
This relationship can also be turned around and ABMs may also be used for the verification of statistical models that seek to verify the existence of mechanisms and perform parametric inference for real-world data. An ABM can be used to 'stress-test' a statistical technique and to see how far the behavioral and mechanistic assumptions of the estimation technique can be bent before they eventually break.
The GSoC-2009 TFACES sources are available here: http://code.google.com/p/tfaces/source/checkout
Project #3: Develop Libraries for using GPGPU for Agent-Based Models
Motivation -- Why large scale ABM on GPUs?
The greatest challenges facing the world today involve complex adaptive systems (CAS): ecology and biodiversity, social/political systems, economics, and public and individual health. Agent based modeling has emerged as a complementary approach to traditional methods in a wide range of fields, by employing a bottom-up approach to directly represent the dynamics of diverse, adaptive agents, embedded in and interacting through non-homogeneous spatial environments and social networks. Because agent- based models (ABMs) are "solved" by running extensive computational experiments their use carries considerable computational overhead. Given the processing constraints of single CPU computers most researchers are limited in the size of the ABMs they can use. Parallel CPU clusters have not adequately solved scalability issues due to limited memory bandwidth of CPU interconnect.
The overall goal of this project is to transform the modeling of CAS by innovative use of graphics processing units (GPUs) to address the computational requirements of large-scale ABMs. We want to create innovative algorithms and tools in order to implement computational experiments with GPU-based ABMs of CAS in medicine, ecological biodiversity, sociology of segregation and epidemiology of pandemics. All of these projects involve modeling agents interacting in spatially explicit, non-homogeneous environments, and several involve integration of GIS and census data with ABMs. By addressing the common challenges of agent-based modeling of CAS across many fields, we will design and build general-purpose tools and techniques to make large-scale agent-based modeling easily accessible to researchers in a variety of disciplines.
Graphics Processing Units (GPUs) were originally developed to handle computations related to the shaded display of three dimensional objects in computer graphics. The demand for specialized shading routines, especially for video games, led vendors to enable customized shading through user-defined functions that could directly programmed into hardware. Computational scientists used the same functionality to speed up certain scientific computations using general purpose-GPUs, and these have been previously used to simulate fluids, protein folding, solve large equation systems etc. The increasing use of GPUs in scientific computations has led to the development of direct APIs such as CUDA for general purpose data-parallel computation on GPUs. These APIs circumvent the graphics pipe line and allow user-defined data-structures and greater control on allocating memory and computing resources.
Background -- Preliminary Work
This project will build on preliminary work to develop a generalized data-parallel ABM simulation framework done by Roshan D'Souza and colleagues at Michigan Tech. Univ. For an description of the overall approach, some example algorithms and an agent-based model implemented using GPUs, see "A Framework for Megascale Agent Based Model Simulations on Graphics Processing Units," Mikola, L, and D'Souza, R., Journal of Artificial Societies and Social Simulation 11(4) 10 (2008), http://jasss.soc.surrey.ac.uk/11/4/10.html .
Our overall plan, building on ABM-GPUs prototypes built as part of GSoC projects for 2009, is to develop of a suite of GPU-based ABM implementations used to address research problems in range of different disciplines. As we develop those ABMs using GPU platforms, we will begin to develop libraries of routines (an API) that provides GPU support for typical capabilities that ABM implementations typically require. Then we will use that experience to guide the development of a general-purpose GPU-based ABM platform, a necessary step for the application of ABMs to a wider range of CAS modeling problems. Thus the goals for SoC 2010 are to:
The expectation is these tools will be built at the level of CUDA, or OpenCL (should it be readily available this year).
Some examples of typical ABM operations which are used in a wide variety of models across many disciplines. include: (A) Agent movement and agent birth, which must deal with possible collisions; (B) Predator-prey interactions, which can involve collisions or conservation-of-mass issues; (C) various agent decision-making methods, from simple heuristics to hedonic choice functions; (D) Agent interaction topologies (e.g., to spread information, infections, etc.) which can be as simple as nearest neighbors but can involve complex social or other network structures (which will require dealing with the inability to dynamically allocate memory, one object at a time, and limited recursion stack on the GPU); (E) agent learning, which might use Bayesian classifiers, neural networks, temporal difference methods as well as other approaches; (F) agent population evolution, which can include a number of operations including fitness evaluation, selection of parents, mating (recombination), and so on. In addition, ABM projects involve calculations of various statistical and pattern-based measures of system behavior.
Based on the experience gained and specific constructs developed to support particular agent-based models, a longer term goal is to develop a meta-modeling layer that will enable easy access to the underlying data-parallel simulation engine. We also want to develop tools for experimentation and statistical analysis. Novel visualization techniques will be investigated for rendering results as well as model validation and debugging.
Some of the specific modeling projects we have in mind as candidates for ABM's implemented using GPUs include: (1) In ecology, models to study the effect of human activities on biodiversity, looking at absract models of species dispersal and competition, with and without simple representations of various human interventions; (2) In social sciences, models of how ethnic preferences and economic differences among ethnic groups contribute to observed patterns of ethnic segregation; (3) In epidemiology, ABMs to explore how the responses of socio-economic-behaviorally diverse populations to non-pharmaceutical interventions (e.g., voluntary quarantines) affect the spread of pandemic influenza across large cities or regions in the USA. However, we are open to other proposals of candidate ABM projects to implement using this GPU-based approach.
GPGPU amd TFACES overlaps
A promising and growing area for agent-based models is in parametric statistical inference for behavioral and biological systems where an analytical likelihood function is either difficult or impossible to write. For example, many epidemiological question are scenario-based and involve both an infection process and the purposeful behavior of agents in a way that is not likely to be well-specified by a parametric distribution. In these cases, particle filtering methods, such as the one derived by Ionides, Breto & King (2006), and markov-chain Monte Carlo methods can be used to approximate or sample from the posterior distribution of the likelihood via simulation. These approaches usually involve many independent runs of a single model and are highly processor intensive. The significant speedup from a GPGPU implementation of the relevant ABM promises to take the application of these estimation processes to real-world data from the realm of the theoretically interesting into that of the practical and feasible.
Monte Carlo statistical methods, which rely on large numbers of random samples, are a large and growing part of the literature in model-based inference in biology and epidemiology. Many of these problems can be estimated in a way that exploits the property of optimal substructure in their likelihood function. For example, Zelner, King, Moe & Eisenberg (2009) perform inference for a dataset comprised of 149 independent households that were exposed to a single infectious source. The likelihood function for each household can be computed in parallel using the GPU and then aggregated by the CPU for evaluation of the fit of the parameters to all samples. Tools such as hierarchical random graphs [See Clauset (2008)] can be used to develop large scale epidemiological mixing models that involve a complicated contact network but also have the property of optimal substructure. This opens up the possibility of performing inference on the unobserved structure of social contacts, a problem that is typically high-dimensional and difficult to solve on traditional hardware, as the computation of a large number of pairwise relationships, which is typically unavoidable, approaches O(N^2).
Ionides EL, Breto C, King A.
Zelner JL, King A, Moe CL, Eisenberg JNS.
Clauset A, Moore C, Newman MEJ.
Additional information and downloads may be found on the CSCS software pages.
We gratefully acknowledge Google for supporting the Summer of Code program in general, and for supporting the CSCS projects in particular.
We also thank NVidia for contributions of GPU Cards and a Tesla 1070 through its Professor Partnership Program.