**CSCS SoC Ideas** --- Google Summer of Code 2009.

For Google Summer of Code 2009 projects, CSCS would like to find students
interested in working on one of the following projects.
We anticipate needing one student for the first project, three students
for the second, and two or three for the third.

However, if you have a project you think is related to complex systems
research, please feel free to contact us or, better, submit a proposal
to CSCS via SoC. If it is a project we feel we can mentor and one that
is related to complex systems research, we will consider it.

If you have questions about these projects, please contact
the head mentor for these projects, Rick.Riolo@gmail.com,
Director of the CSCS Computing Lab.

Note: Some information you should include in your application:

- Information about the skills you bring to the project.
(classes taken, programming languages you know, past projects, etc.)
- What skills you would like to gain or improve while working on this project.
- What interests you about the project(s), including
how the project(s) fit in with your other interests and goals,
including goals related to Complex Systems research.

FAQ -- *please read regarding required/desired experience.*

**Project #1: GridSweeper Improvements, Enhancements and Release**

GridSweeper is a tool for running batch simulations with different
parameter settings on grid computing systems. The core architecture
was implemented for Google Summer of Code 2006, including code for
generating many kinds of parameter sweeps and submitting jobs to the grid via DRMAA.
In GSoC 2007/2008, several more pieces were implemented and tested:

- A command-line interface for submitting and monitoring jobs.
- A plug-in to support models compatible with the Drone batch script.
- A draft of a User Manual and reference pages.
- Some stress testing, bug fixes, etc.

Since that time a number of researchers have used the Drone adapter
for GridSweeper to carry out large scale experiments.
For 2009 we would like to have someone continue to work on GridSweeper,
with the primary goal being a complete Release 1.0, including:

- Complete stress testing, fix bugs, etc.
- Add some additional experiment control features.
- Add some features to aid users setting up experiments
(e.g., to be sure model parameters are properly specified).
- Complete work on the User Manual.
- Complete work on a Version 1.0 official distribution.

If time is available, a secondary set of goals includes:

- Write adapters for Matlab, NetLogo, R, etc.
- Test GridSweeper on non-SGE grid systems, e.g. Torque/PBS, Xgrid, etc.
- Continue work that was just (barely) started in 2008
to enable more advanced control of parameter sweeps for doing
sensitivity analysis, including standard techniques like
Latin hypercube sampling (LHS) as well as "intelligent" techniques
like adaptive nonlinear testing (ANT).
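As an illustration of the LHS technique mentioned above, here is a minimal Latin hypercube sampler in pure Python (GridSweeper itself is written in Java; the names and interface here are illustrative only, not GridSweeper's actual API):

```python
import random

def latin_hypercube(n_samples, bounds, rng=None):
    """Draw a Latin hypercube sample: each parameter's range is cut into
    n_samples equal strata, and every stratum is used exactly once."""
    rng = rng or random.Random(0)
    columns = []
    for lo, hi in bounds:
        width = (hi - lo) / n_samples
        # one uniform draw inside each stratum, then shuffle the strata
        col = [lo + (i + rng.random()) * width for i in range(n_samples)]
        rng.shuffle(col)
        columns.append(col)
    # transpose: one (p1, p2, ...) parameter setting per model run
    return list(zip(*columns))

# e.g. five parameter settings over two model parameters
samples = latin_hypercube(5, [(0.0, 1.0), (10.0, 20.0)])
```

Compared to a full-factorial sweep, this covers each parameter's whole range with far fewer runs, which is exactly what makes it attractive for expensive batch experiments.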

Candidates for working on the GridSweeper project should have OOP
experience, preferably with Java and/or Objective-C.
Experience with SGE, PBS/Torque, Xgrid or other Grid technologies
would be useful as well.

Documentation and a jar for the current GridSweeper are here in gridsweeper-20090310.tar.gz.

The source for GridSweeper is here in gridsweeper-src-20090310.zip.

**Project #2: Tools for Analysis of Computational ExperimentS (TFACES)**

A large part of any computational modeling project involves analyzing and
displaying the data generated, to help the experimenter understand what
the model is doing and why, to guide the selection of additional model
runs and experiments to be carried out, and ultimately to prepare data for
use in publications. Besides the intellectual challenges of figuring out
what is useful to measure and display, which tend to be somewhat dependent
on the specific "content" of each model, there are also a number of
technical challenges which often are common across models. For example,
for typical agent/individual-based models (ABM or IBM) of Complex Adaptive
Systems (CAS), an "experiment" typically consists of running the model many
times with the same set of parameter values, just starting with different
random number generator (RNG) seeds, to get a distribution of outcomes for
that one "case". Then one typically will run many cases, each with different
combinations of parameter values, to do sensitivity analysis or just to
explore the space of possible parameter values (or initial conditions).
Given all that data, it would be nice to have tools to calculate statistics
across runs, within runs, for parts of runs, etc., and repeat for a set of
cases (e.g., results for different values of a parameter), and create plots
of various kinds based on the data and statistics.

Scientific computing environments, which provide extensive statistical and
display capabilities, are typically interactive and provide useful
and standardized tools for data exploration and analysis. They also typically
rely on interpreted languages, making the implementation of computationally
expensive models daunting at best. Further, these languages have limited or
nonexistent support for object-oriented programming (OOP), which is at the
heart of most agent-based models. OOP also tends to stress modularity, which
tends to make code re-usable across models and also more amenable to the kind
of sharing that is implied by the philosophy behind both open source software
and scientific collaboration.

Compiled languages, such as Java, provide many advantages for computational
models. In the first instance, their use typically results in a drastic
increase in speed as compared to interpreted languages. Other benefits include
the existence of many modeling packages, such as Repast, Netlogo, Mason and
Swarm, which provide extensive tools for many functions including graphics
and scheduling. Such models also tend to produce large volumes of data, which
are typically stored either in text files or database formats. Meaningful
analysis of these data typically requires the usage of a scientific computing
package such as R, Matlab or Mathematica, and the often awkward and
time-consuming writing of scripts to collect and interpret model output
that is typically in a non-standard format.

Broadly speaking, TFACES is a suite of tools for managing the process of computational experimentation, from running models to analysis and display
of results, as well as using those results to guide choice of parameters
for additional experiments. TFACES has three broad goals---the first two are
fairly straightforward and short-term, and the third much more ambitious
and long-term.

The *first goal* of TFACES is to ease the transition from implementation to
execution. Executing a large model is a complicated task, involving
coordination between the model, storage objects, and queuing software.
To ease this process, TFACES provides facilities that invisibly bridge
the gap between the model and these components, keeping that plumbing
secondary to model design. Thus GridSweeper and GSDrone (see Project #1)
can be considered part of the overall TFACES suite of tools.

The *second goal* of TFACES is to simplify the transition from model
execution to model interpretation. This short/medium-term goal of TFACES
is aimed at reducing the work involved in managing and analyzing
data from computational models, by designing and creating generic tools for organizing, analyzing and displaying data generated by computational models
of CAS. This project will involve **writing a core set of tools to**:

- extract model parameter values and model-generated data from
large sets of report files generated by computational experiments,
- store that data in SQLite tables so that it can be easily accessed
by a range of analysis tools (R, Python, Excel, etc.), and
- perform basic data manipulation functions to support further analysis
and display of the results from computational experiments.
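As a minimal sketch of this extract-and-store pipeline, the fragment below uses Python's built-in sqlite3 module; the report format, table layout, and function names are all illustrative, not the actual TFACES protocol:

```python
import re
import sqlite3

# hypothetical report format: "param <name> = <value>" lines followed by
# "<t>, <value>" data rows; the real TFACES readers are more general
report = """param beta = 0.1
param seed = 42
0, 10
1, 23
2, 41
"""

def load_report(conn, run_id, text):
    """Split a report into parameter settings and time-series rows and
    store both in SQLite, keyed by run_id."""
    params, rows = {}, []
    for line in text.strip().splitlines():
        m = re.match(r"param (\w+) = (.+)", line)
        if m:
            params[m.group(1)] = m.group(2)
        else:
            t, v = line.split(",")
            rows.append((run_id, int(t), float(v)))
    conn.executemany("INSERT INTO params VALUES (?, ?, ?)",
                     [(run_id, k, v) for k, v in params.items()])
    conn.executemany("INSERT INTO series VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE params (run_id, name, value)")
conn.execute("CREATE TABLE series (run_id, t, measure)")
load_report(conn, 1, report)
```

Once the data is in SQLite, any of the downstream tools (R, Python, Excel) can query it with ordinary SQL rather than re-parsing ad hoc report files.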

The tools will be generic in that they will provide the user with features
to select data to be analyzed, what data to aggregate (e.g., to calculate
statistics), etc. The tools will also serve as "templates" which others
could use as starting points for more specialized scripts. Furthermore,
this set of core TFACES tools should form the basis of additional tools
to automate various kinds of regression or sensitivity analysis.

The *third, long term goal* of TFACES is to work toward closing the loop
between compiled computational models and standard statistical and
scientific computing platforms, by creating an interface between analysis
packages and compiled computational models. This would serve as part of the
larger goal of confronting computational models with data whenever possible.
By providing a standardized front-end for Java-based models that makes this
task easier, we hope to enable faster prototyping of models and to make it
easier for researchers to understand both the general behavior of their models
and their relationship to data.

For *Google Summer of Code 2009*, CSCS is looking for 3 students to work
on the TFACES project, focusing on the second goal above, i.e., to continue
work to design and create a generic, broadly useful set of tools to extract,
organize and manipulate data generated by a wide range of computational
experiments. However, part of the project time will be devoted to getting a
start on goal #3, e.g., exploring possible approaches, describing
specifications for a prototype, and implementing some prototype systems,
to test particular approaches and to gain experience that can help in the
design of subsequent versions. Good candidate interns for this project will
have experience with Python and add-ons like SciPy, NumPy, etc.; experience
with R and various other scientific analysis packages (e.g., Matlab,
Mathematica) would be useful as well.

A first cut at many of the basic tools needed for the TFACES core has
already been implemented, in work started in SoC 2008 and continued since
then. However, the core of TFACES is currently implemented in Python and R,
and one of the goals for this year is to move entirely to Python;
NumPy and SciPy can efficiently satisfy all representational and mathematical
requirements. In addition to completing the core set of tools,
the TFACES protocol should be solidified, and some end-user tools written
that make use of the core tools to perform analysis and display tasks
that are commonly carried out as part of computational experiments, e.g.:

- calculate average values of measures over time in individual model runs,
then average those across multiple runs, and then plot those averages
(with error bars) as a function of some model parameter that has been
varied across sets of runs ("cases").
- extract and plot time series for particular runs, sets of runs, or
averages across runs.
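The first of these end-user tools reduces, in essence, to a few lines of Python; this sketch (with hypothetical data and names) computes the per-case points and error bars that would be plotted against the varied parameter:

```python
from statistics import mean, stdev

# hypothetical data: parameter value -> list of runs, each run a time series
cases = {
    0.1: [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
    0.2: [[4.0, 5.0, 6.0], [5.0, 6.0, 7.0]],
}

def case_means_with_errors(cases):
    """For each case: time-average each run, then report the across-run mean
    and standard deviation -- the (x, y, error-bar) triples one would plot
    against the varied parameter."""
    points = []
    for param in sorted(cases):
        run_means = [mean(ts) for ts in cases[param]]
        points.append((param, mean(run_means), stdev(run_means)))
    return points

points = case_means_with_errors(cases)
```

The point of the core tools is that the data selection and aggregation steps here would be handled generically, so end-user scripts like this stay short.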

But to be clear, the TFACES plan is to (a) first complete the design and
creation of a generic core set of tools to extract, organize and manipulate
data from computational experiments, and then (b) to create end-user tools
built using the core tools, which will serve as templates to show how the
core tools are intended to be used, and of course will show that the core
tools are adequate and well designed to meet our goals.

Building bridges to span the gap between model implementation and
interpretation/communication requires a number of steps. Firstly, model
implementations generate data in a wide variety of forms. TFACES must understand
not only filesystem or database storage formats or protocols, but also the
underlying scalar, vector, or matrix representations, whether time series or
endpoint. Secondly, these representations should be easily translated into
interpretations via the standard methods of regression, principal component
analysis, and model selection. Furthermore, visualization methods must be
implemented to communicate both the representation and the interpretation.

As time permits after the core of TFACES is completed and a set of
commonly used end-user tools implemented using those core tools,
the next iteration of TFACES will include support for model-based inference
using Monte Carlo and Markov chain Monte Carlo (MCMC) statistical methods.
These methods are useful in the context of agent-based models (ABMs) for several reasons. First, ABMs, particularly those developed for social and
biological problems, typically have many degrees of freedom and may be opaque
to analysts who do not employ quantitative techniques to understand their
behavior. For example, if an ABM is constructed for the purpose of
scenario-based planning, it is important to be able to verify that a) the model
is capable of reproducing the scenario in question and b) to find the set of
model parameters that make the scenario feasible. MCMC methods can be
particularly useful in this case, as the state space of the model is likely to
be very large and the posterior distribution of the parameters is typically
computationally expensive to sample from. These tools are also important for
testing the mapping between an ABM and a potentially simpler analytic
representation of a problem: if the state space of an ABM is well-represented
by a 'lighter' and analytically tractable method, it is often sensible to work
with the simpler implementation. It may also be the case that in some areas
of the parameter space the simpler model may be substituted for the ABM, but
in other areas the behaviors of the two models diverge. This information can be
useful both from an analytical and computational perspective, but requires
reliable and well-constructed tools to become part of the typical
agent-based modelling workflow.
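As a concrete (if toy) illustration of the MCMC piece, here is a minimal random-walk Metropolis sampler in Python; the closed-form standard-normal log-posterior stands in for the expensive, simulation-estimated likelihood a real ABM would supply:

```python
import math
import random

def metropolis(log_post, x0, n_steps, step=0.5, rng=None):
    """Random-walk Metropolis: propose x' = x + U(-step, step) and accept
    with probability min(1, post(x') / post(x)). For an ABM, log_post would
    be estimated from repeated model runs rather than a closed form."""
    rng = rng or random.Random(1)
    x, lp = x0, log_post(x0)
    chain = []
    for _ in range(n_steps):
        xp = x + rng.uniform(-step, step)
        lpp = log_post(xp)
        if math.log(rng.random()) < lpp - lp:   # Metropolis acceptance test
            x, lp = xp, lpp
        chain.append(x)
    return chain

# toy posterior: a standard normal stands in for the expensive
# simulation-estimated likelihood of a real ABM
chain = metropolis(lambda x: -0.5 * x * x, x0=3.0, n_steps=5000)
```

The samples concentrate where the posterior mass is, which is what makes the method usable when the state space is too large to sample exhaustively.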

This relationship can also be turned around and ABMs may also be used for the
verification of statistical models that seek to verify the existence of
mechanisms and perform parametric inference for real-world data. An ABM can be
used to 'stress-test' a statistical technique and to see how far the behavioral
and mechanistic assumptions of the estimation technique can be bent before they
eventually break.

**Project #3: Develop Libraries for using GPGPU for Agent-Based Models**

*Motivation -- Why large scale ABM on GPUs?*

The greatest challenges facing the world today involve complex adaptive
systems (CAS): ecology and biodiversity, social/political systems,
economics, and public and individual health. Agent based modeling has
emerged as a complementary approach to traditional methods in a wide range
of fields, by employing a *bottom-up* approach to directly represent the
dynamics of diverse, adaptive agents, embedded in and interacting through
non-homogeneous spatial environments and social networks. Because
agent-based models (ABMs) are "solved" by running extensive computational
experiments, their use carries considerable computational overhead. Given
the processing constraints of single-CPU computers, most researchers are
limited in the size of the ABMs they can use. Parallel CPU clusters have
not adequately solved these scalability issues, due to the limited memory
bandwidth of CPU interconnects.

The overall goal of this project is to transform the modeling
of CAS by innovative use of graphics processing units (GPUs) to address
the computational requirements of large-scale ABMs. We want to create
innovative algorithms and tools in order to implement computational
experiments with GPU-based ABMs of CAS in medicine, ecological
biodiversity, sociology of segregation and epidemiology of pandemics. All
of these projects involve modeling agents interacting in spatially
explicit, non-homogeneous environments, and several involve integration of
GIS and census data with ABMs. By addressing the common challenges of
agent-based modeling of CAS across many fields, we will design and build
general-purpose tools and techniques to make large-scale agent-based
modeling easily accessible to researchers in a variety of disciplines.

Graphics Processing Units (GPUs) were originally developed to handle
computations related to the shaded display of three dimensional objects in
computer graphics. The demand for specialized shading routines,
especially for video games, led vendors to enable customized shading
through user-defined functions that could be directly programmed into the
hardware. Computational scientists used the same functionality to speed up
certain scientific computations on general-purpose GPUs, which have
previously been used to simulate fluids, fold proteins, solve large
systems of equations, etc. The increasing use of GPUs in scientific
computations has led to the development of direct APIs such as CUDA for
general purpose data-parallel computation on GPUs. These APIs circumvent
the graphics pipeline and allow user-defined data structures and greater
control over allocating memory and computing resources.

*Background -- Preliminary Work*

This project will build on preliminary work to develop a generalized
data-parallel ABM simulation framework done by Roshan D'Souza and colleagues
at Michigan Tech. Univ. For a description of the overall approach, some
example algorithms and an agent-based model implemented using GPUs,
see "A Framework for Megascale Agent Based Model Simulations on Graphics
Processing Units," Lysenko, M. and D'Souza, R., *Journal of Artificial
Societies and Social Simulation* 11(4)10 (2008),
http://jasss.soc.surrey.ac.uk/11/4/10.html.

Our overall plan, guiding the SoC projects for 2009, is to develop a suite
of GPU-based ABM implementations used to address research problems
in a range of different disciplines. As we develop those ABMs on GPU
platforms, we will begin to develop libraries of routines (an API) that
provide GPU support for capabilities that ABM implementations
typically require. Then we will use that experience to guide the development
of a general-purpose GPU-based ABM platform, a necessary step for the
application of ABMs to a wider range of CAS modeling problems.
Thus the goals for SoC 2009 are to:

- develop one or two ABMs using GPUs as the main computational engine;
- based on what is needed for those models, begin to create a library/API
of tools that provide typical ABM operations using GPU.

The expectation is these tools will be built at the level of CUDA,
or perhaps OpenCL (should it become readily available early enough this year).

Some examples of typical ABM operations, used in
a wide variety of models across many disciplines, include:
(A) Agent movement and agent birth, which must deal with possible
collisions; (B) Predator-prey interactions, which can involve
collisions or conservation-of-mass issues; (C) various agent
decision-making methods, from simple heuristics to hedonic choice
functions; (D) Agent interaction topologies (e.g., to spread information, infections, etc.) which can be as simple as nearest neighbors but can
involve complex social or other network structures (which will require
dealing with the inability to dynamically allocate memory,
one object at a time, and limited recursion stack on the GPU);
(E) agent learning, which might use Bayesian classifiers,
neural networks, temporal difference methods as well as other
approaches; (F) agent population evolution, which can include a number
of operations including fitness evaluation, selection of parents,
mating (recombination), and so on. In addition, ABM projects involve
calculations of various statistical and pattern-based measures of system
behavior.
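To illustrate the data-parallel style these operations require, here is a NumPy sketch of operation (A), agent movement with collision resolution; the array-at-once formulation mirrors what a CUDA kernel would do with one thread per agent (the grid size, agent count, and conflict rule are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
SIZE = 16                                      # SIZE x SIZE toroidal grid
pos = rng.choice(SIZE * SIZE, size=40, replace=False)  # one flat cell index per agent

def step(pos):
    """One data-parallel movement step: every agent proposes a neighboring
    cell simultaneously (as one GPU thread per agent would), and contested
    cells are resolved by letting a single claimant win."""
    x, y = pos % SIZE, pos // SIZE
    dx = rng.integers(-1, 2, size=pos.size)
    dy = rng.integers(-1, 2, size=pos.size)
    prop = (y + dy) % SIZE * SIZE + (x + dx) % SIZE   # toroidal wrap-around
    # keep only the first claimant of each proposed cell; losers stay put
    # (a full kernel would iterate until no residual conflicts remain)
    _, first = np.unique(prop, return_index=True)
    winners = np.zeros(pos.size, dtype=bool)
    winners[first] = True
    return np.where(winners, prop, pos)

pos = step(pos)
```

The key design constraint is that no agent waits on any other: all proposals and all conflict resolutions happen as whole-array operations, which is what makes the pattern map onto GPU hardware.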

Based on the experience gained and specific constructs developed to
support particular agent-based models, a longer term goal is to
develop a meta-modeling layer that will enable easy access to the
underlying data-parallel simulation engine. We also want to develop
tools for experimentation and statistical analysis. Novel visualization
techniques will be investigated for rendering results as well as model
validation and debugging.

Some of the specific modeling projects we have in mind as candidates
for ABMs implemented using GPUs include:
(1) In ecology, models to study the effect of human activities on
biodiversity, looking at abstract models of species dispersal and
competition, with and without simple representations of various human
interventions; (2) In social sciences, models of how ethnic preferences
and economic differences among ethnic groups contribute to observed
patterns of ethnic segregation; (3) In epidemiology, ABMs to
explore how the responses of socio-economic-behaviorally diverse
populations to non-pharmaceutical interventions (e.g., voluntary
quarantines) affect the spread of pandemic influenza across large cities
or regions in the USA. However, we are open to other proposals of
candidate ABM projects to implement using this GPU-based approach.

*GPGPU and TFACES overlaps*

A promising and growing area for agent-based models is in parametric
statistical inference for behavioral and biological systems where an analytical
likelihood function is either difficult or impossible to write. For example,
many epidemiological questions are scenario-based and involve both an infection
process and the purposeful behavior of agents in a way that is not likely to be
well-specified by a parametric distribution. In these cases, particle filtering
methods, such as the one derived by Ionides, Breto & King (2006), and
Markov chain Monte Carlo methods can be used to approximate or sample from the
posterior distribution of the likelihood via simulation. These approaches
usually involve many independent runs of a single model and are highly
processor intensive. The significant speedup from a GPGPU implementation of the
relevant ABM promises to take the application of these estimation processes to
real-world data from the realm of the theoretically interesting into that of
the practical and feasible.

Monte Carlo statistical methods, which rely on large numbers of random samples,
are a large and growing part of the literature in model-based inference in
biology and epidemiology. Many of these problems can be estimated in a way that
exploits the property of optimal substructure in their likelihood function. For
example, Zelner, King, Moe & Eisenberg (2009) perform inference for a dataset
comprised of 149 independent households that were exposed to a single
infectious source. The likelihood function for each household can be computed
in parallel using the GPU and then aggregated by the CPU for evaluation of the
fit of the parameters to all samples. Tools such as hierarchical random graphs
[See Clauset (2008)] can be used to develop large scale epidemiological mixing
models that involve a complicated contact network but also have the property of
optimal substructure. This opens up the possibility of performing inference on
the unobserved structure of social contacts, a problem that is typically
high-dimensional and difficult to solve on traditional hardware, as the
computation of a large number of pairwise relationships, which is typically
unavoidable, approaches O(N^2).
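As a sketch of the optimal-substructure idea, the fragment below computes per-household log-likelihoods independently and then sums them; the binomial "escape" likelihood is a hypothetical stand-in for the real transmission model, and the independent map step is exactly what would be farmed out to the GPU:

```python
import math

# hypothetical data: one (n_exposed, n_infected) pair per independent household
households = [(4, 1), (3, 2), (5, 0), (2, 2)]

def household_loglik(p, n, k):
    """Binomial 'escape' model: each exposed member is infected independently
    with probability p. This closed form is a stand-in for the real
    per-household transmission likelihood the GPU would evaluate."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)

def total_loglik(p, households):
    # the map step below is embarrassingly parallel (one household per GPU
    # thread); the sum is the cheap CPU-side reduction described above
    return sum(household_loglik(p, n, k) for n, k in households)

ll = total_loglik(0.3, households)
```

Because the households are independent, the total log-likelihood factors into per-household terms, so each term can be evaluated in parallel and only the final sum needs to happen on the CPU.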

**Bibliography**

Ionides EL, Breto C, King A. Inference for nonlinear dynamical systems.
*Proceedings of the National Academy of Sciences* 2006;103:18438-43.

Zelner JL, King A, Moe CL, Eisenberg JNS. Inference for household
transmission systems: an analysis of secondary transmission in a Norovirus
outbreak. Unpublished manuscript.

Clauset A, Moore C, Newman MEJ. Hierarchical structure and the prediction
of missing links in networks. *Nature* 2008;453:98-101.

Additional information and downloads may be found
on the CSCS software pages.

We gratefully acknowledge Google for supporting the Summer of Code program in general,
and for supporting the CSCS projects in particular.

We also thank NVidia for contributions of GPU Cards and a Tesla 1070 through its
Professor Partnership Program.