
Science on Beagle

We describe in this section some highlights of the research carried out on Beagle. While the majority of Beagle usage is on health-related projects, as required by NIH, work on projects in other areas has also been accomplished.

Health-related projects

SwiftSeq: A High-Performance Workflow for Processing DNA Sequencing Data

Jason Pitt, Kevin White Lab, Committee on Genetics, Genomics, and Systems Biology, University of Chicago

With increasing throughput and decreasing cost of next generation sequencing (NGS) technologies, investigators and consortia are now producing extraordinary amounts of genomics data. This transformation of genomics to a data-intensive science has left a significant portion of researchers unable to reap the benefits of their own data, let alone the troves available from public repositories.

Figure 1. Schematic of the SwiftSeq workflow. By employing multiple map-reduce steps, an optimized job packing scheme, and the parallel nature of Swift, SwiftSeq provides both a fast and efficient way to process next-generation DNA sequencing data.

In the Kevin White lab, much of our data comes from The Cancer Genome Atlas, a project which strives to characterize the genomic properties of over 20 different cancer types. To date, this project has generated over a petabyte of compressed sequencing data. While significant hardware resources such as Beagle are necessary to perform large-scale analyses, insufficient software infrastructure is a substantial bottleneck to data-intensive genomics. Using Beagle as a test-bed, and in collaboration with the Computation Institute, we’ve developed SwiftSeq, a highly-parallel and scalable next generation DNA sequencing workflow underpinned by the parallel scripting language Swift (Figure 1).
SwiftSeq is able to perform end-to-end analysis, beginning with sequencing reads and ending with annotated genotypes. Independent samples, tumor-normal pairs, and either bam or fastq files are accepted as input. Samples are pre-processed, aligned, post-processed, and genotyped in a manner consistent with the 1000 Genomes Project and the Genome Analysis Toolkit’s best practices. To maintain workflow flexibility, algorithms for key processes such as alignment and genotyping are interchangeable, which is critical given the dynamic nature of genome analysis. Computational adversities such as execution-site selection, parallelization, and synchronization are all managed under the hood by capitalizing on Swift’s implicitly parallel framework. Executed tasks are robust to transient software and localized hardware failures, keeping user intervention to a minimum. Due to an optimized packing scheme, SwiftSeq requires less than a third of the computational resources necessary for standard pipeline-based approaches. Importantly, the workflow is easily portable and has been successfully deployed to clusters and Bionimbus clouds in addition to Beagle. Both members of the Beagle team (Lorenzo Pesce, Ana Marija Sokovic, and Joe Urbanski) and the Swift developers have provided significant technical and development support to make this software possible. With their continued assistance, we hope to provide SwiftSeq as an analysis resource to the University of Chicago as well as the genomics community at large.
Scientifically, SwiftSeq has become a workhorse for many collaborative projects such as whole genome sequencing of Nigerian individuals with breast cancer (Funmi Olopade, University of Chicago), exome sequencing of autism families (Andrey Rzhetsky, University of Chicago), and the ENCODE-Cancer working group. To date, over 5,500 germline exomes from The Cancer Genome Atlas have been uniformly processed with SwiftSeq using approximately 4.8 million core hours. By integrating these exonic genotype calls with somatic copy number alterations, we’ve shown that inherited and acquired DNA changes can interact to reveal novel tumor suppressor genes. Using cells engineered to carry our mutations of interest, we’re performing experiments to validate whether these aberrations can indeed induce cancer phenotypes. With SwiftSeq and Beagle we intend to continue discovering and validating novel genomic features, as well as exploring biological questions that would otherwise be computationally infeasible.
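The per-sample independence that SwiftSeq exploits can be sketched in plain Python. This is an illustration of the pattern only, not SwiftSeq’s actual code (SwiftSeq is written in Swift), and the stage functions below are hypothetical stand-ins for real tools such as an aligner and a genotyper:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage functions; stand-ins for real tools such as an
# aligner and a genotyper. Names and signatures are assumptions.
def align(sample):
    return f"{sample}.bam"

def genotype(bam):
    return f"{bam}.vcf"

def process_sample(sample, aligner=align, genotyper=genotype):
    """End-to-end pipeline for one sample; stages are interchangeable."""
    return genotyper(aligner(sample))

def run_workflow(samples, max_workers=4):
    # Samples are independent, so the map is embarrassingly parallel.
    # Swift manages this (plus retries and execution-site selection)
    # implicitly; a process pool would suit genuinely CPU-bound tools.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_sample, samples))
```

Because the stages are passed as parameters, swapping in a different aligner or genotyper, as the text describes, requires no change to the workflow logic.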

Ion selectivity of the bacterial NavAb channel: Gaining insight from microsecond timescale molecular dynamics simulations

Karen Callahan, Benoit Roux Lab
Department of Biochemistry and Molecular Biology, University of Chicago
The cell membrane separates a cell from its environment, but many kinds of transport proteins allow ions, water, and small molecules to cross it. Ion channels are one such class of transport protein that allows ions to pass into or out of the cell; some of these allow only a specific anion or cation type to pass, while others are less selective. In humans, voltage-gated, ion-selective channels take part in many processes, including the action potentials of nerves and muscles and many signaling pathways.
The structure of the first voltage-gated potassium channel was solved over ten years ago, but other ion-selective channel families have been more elusive. Two summers ago, the first crystal structure of a putative sodium-selective, voltage-gated ion channel, NavAb, was published. The pore domain has a narrow region called a selectivity filter, a large water-filled cavity, and, at the bottom, a gate that is opened and closed by the voltage sensors (see figure).

Figure 1. Top left, the crystal structure of the pore domain of NavAb (PDB ID : 3RVY) with front and back segments removed for clarity. Note that the gate at the bottom on the pore domain is closed preventing ion conduction. Top right, bird’s eye view through the top of the truncated model of NavAb.

Figure 2. The side view of the truncated model of NavAb used for simulations of ion conduction with the front and back segments removed. Water (blue) and sodium (white) penetrate the pore, however chloride (yellow) does not.

The selectivity filter of NavAb is not as narrow as that of the known potassium-selective channels: in most places there is room for the permeating ion to remain solvated, and cations may pass each other in some places, potentially leading to a multitude of non-serial conduction pathways. To provide a non-static picture of the ion channel, and to study the mechanisms of ion conduction and ion selectivity through the selectivity filter of the pore in a statistically meaningful manner, microsecond-timescale molecular dynamics simulations of ion conduction were performed on the Beagle supercomputer. While these timescales are frequently achieved on dedicated machines like D. E. Shaw Research’s Anton, they are less commonly attained on local clusters and were unheard of in the recent past. The long simulation timescales, along with the versatility of parallel versions of traditional molecular dynamics code (NAMD), allowed us to use a protein that was truncated to provide a model of the selectivity filter in a stiff supporting lattice (figure), rather than requiring a complete pore in a lipid membrane, reducing the necessary system size by nearly half. The effects of the cations in solution, their concentration, and the applied voltage on the selective conduction of sodium and potassium through the NavAb selectivity filter have been studied.

Zodiac: a comprehensive depiction of genomic interactions in cancer

Yuan Ji, PhD, Lorenzo Pesce, PhD, Yitan Zhu, PhD

Complex diseases like cancer are rarely caused by an abnormality in a single gene; rather, they reflect perturbations to intracellular molecular interaction networks that drive cells toward new malignant and carcinogenic states. Learning these genomic interaction networks is critical to elucidating the molecular mechanisms of cancer, identifying cancer genes and pathways, and discovering network-based biomarkers that improve disease diagnosis and prognosis.
We introduce Zodiac, a comprehensive cancer genomic interaction database based on integrative analyses of The Cancer Genome Atlas (TCGA) data, which include whole-genome measurements of multiple genomic and epigenomic features (DNA sequence, copy number, DNA methylation, gene expression, protein expression, and miRNA expression) for thousands of matched cancer patient samples in over 20 types of cancer. We develop and extend novel Bayesian Graphical Models (BGMs) to infer genomic interactions in each pair of genes (~200 million gene pairs in total) for every cancer type using the multi-modal TCGA data.

Figure 1. An illustration of some genomic interactions that can be inferred in the analysis of a gene pair. Blue solid lines indicate intragenic interactions. Orange dashed lines indicate intergenic interactions.

Figure 2. Examples of significant molecular interactions identified by BGMs in breast invasive carcinoma. A. The identified intragenic interaction network of tumor suppressor gene PTEN. B. The most significant interaction between Androgen Receptor (AR) and KLK3 inferred by BGM. AR is a transcription factor that regulates KLK3 expression. AR has been demonstrated to be a prognostic factor for breast cancer and is considered a target of endocrine therapies for breast cancer.

Figure 1 illustrates potential intragenic and intergenic interactions that can be inferred or validated by our analysis. Intragenic interactions indicate relationships between the genomic and epigenomic features of the same gene, such as repression of transcription by promoter methylation. Intergenic interactions refer to interactions between genetic and epigenetic features of different genes. For example, we can validate whether the protein expression of a transcription factor (TF) is statistically inferred to have a significant positive interaction with the gene expression of its potential target gene in various cancer conditions. Some other potential intergenic interactions that can be identified include protein-protein interactions, miRNA regulation of gene expression, coexpressed or anti-expressed genes/proteins, similar or anti-correlated methylation patterns of genes, and similar or anti-correlated copy number changes of genes.
The genomic interaction network is taken as a random variable to be inferred by the proposed BGM. The key idea is to combine prior knowledge (if any) on the interaction networks with the information contained in TCGA data to conduct posterior inference, i.e.,

p(network | TCGA data) ∝ p(TCGA data | network) × p(network),

which improves the biological relevance of the inference results.

In our analysis, prior knowledge about the genomic interaction network is obtained from the following four sources.

(1) Databases of miRNA-mRNA regulation, such as miRtarbase and TargetScan, which provide miRNA targets curated either from biological validation experiments or from predictions based on complementary sequence matching between the miRNA seed region and the binding site in the mRNA 3’UTR.
(2) Databases of TF-DNA binding, such as ENCODE, which provides TF ChIP-seq data in various cell lines including some cancer cell lines, as well as TRANSFAC and JASPAR, which provide predictions of TF binding sites based on sequence motif analysis.
(3) The “central dogma” of intragenic interactions, such as the expectations that gene copy number amplification increases gene expression and that gene expression upregulation increases protein expression.
(4) Other comprehensive databases about molecular interactions, including pathway databases, such as KEGG Pathway and Pathway Commons, and protein-protein interaction databases, such as HPRD, IntAct, MINT, and BioGRID.
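How such heterogeneous sources might be folded into a single prior edge probability can be sketched as follows. This is an illustrative assumption only: the source weights and the independent-evidence combination rule below are hypothetical, not the published BGM prior.

```python
# Source names follow the text; the weights and the independent-evidence
# combination rule are illustrative assumptions, not the published prior.
SOURCE_WEIGHTS = {
    "miRtarbase": 0.9,  # experimentally validated miRNA targets
    "TargetScan": 0.6,  # sequence-based miRNA target predictions
    "ENCODE":     0.8,  # TF ChIP-seq binding evidence
    "TRANSFAC":   0.5,  # motif-based TF binding-site predictions
    "KEGG":       0.7,  # curated pathway membership
}

def prior_edge_probability(sources, baseline=0.01):
    """Prior probability that an interaction edge exists, treating each
    supporting database as independent evidence:
    p = 1 - (1 - baseline) * prod(1 - w_s) over supporting sources s."""
    p_absent = 1.0 - baseline
    for s in sources:
        p_absent *= 1.0 - SOURCE_WEIGHTS.get(s, 0.0)
    return 1.0 - p_absent
```

An edge with no database support keeps the small baseline probability, while multiple concordant sources push the prior toward 1.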

Markov chain Monte Carlo sampling is used to fit the BGM to the multi-platform TCGA data for interaction network analysis, which takes about 50 seconds on average for the analysis of one gene pair. The analyses of gene pairs are independent of each other and can be performed in parallel, which makes this an ideal computational task to execute on Beagle. The total computation time for all gene-pair analyses (~200 million) is about 110,000 node hours on Beagle. We have finished the first round of analysis of all gene pairs and are currently screening the results for biological findings. Figure 2 shows examples of inferred, biologically plausible genomic interaction networks.
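A back-of-the-envelope check shows these figures are internally consistent. This is a sketch only; the per-pair time and pair count come from the text, while the 24 cores per Cray XE6 node is our assumption about Beagle’s hardware:

```python
# Back-of-the-envelope check of the Zodiac compute budget quoted above.
# SECONDS_PER_PAIR and GENE_PAIRS come from the text; CORES_PER_NODE
# is an assumption about Beagle's Cray XE6 nodes.
SECONDS_PER_PAIR = 50
GENE_PAIRS = 200_000_000
CORES_PER_NODE = 24

core_hours = GENE_PAIRS * SECONDS_PER_PAIR / 3600
node_hours = core_hours / CORES_PER_NODE
print(f"{node_hours:,.0f} node hours")  # close to the ~110,000 quoted
```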


Supercomputing for the parallelization of whole genome analysis

Megan Puckelwartz, PhD, Lorenzo Pesce, PhD, Elizabeth McNally, MD, PhD (University of Chicago)

The cost of sequencing an entire human genome is now moving into the range where it is being broadly applied in both research and clinical settings. Whole genome analysis (WGA) requires the alignment and comparison of raw sequence data, with approximately 125GB of data generated per individual genome. While the cost of generating such data has declined dramatically, a computational bottleneck limits the accuracy and efficiency with which multiple genomes can be analyzed simultaneously. We adapted Beagle, the Cray XE6 supercomputer operated by Argonne National Laboratory and the Computation Institute, to achieve the parallelization necessary for concurrent analysis of multiple genomes, and named this workflow MegaSeq. Compared to other rapid approaches to WGA, MegaSeq produces more usable sequence per genome from the same raw data. MegaSeq relies on publicly available software packages, and using it, WGA of as many as 240 genomes, from fastq extraction through variant calling, can be completed in approximately 50 hours. This achievement is poised to greatly facilitate defining the range and utility of human genome variation.

Figure 1. Cost of sequencing. Prior to 2007, the cost of sequencing (green line) followed the same scaling as Moore’s law (white line). The advent of next generation sequencing technology has caused a dramatic shift in that the bottleneck for whole genome analysis is no longer sequencing cost/time, but computational efficiency.

Figure 2. Time Constraints of MegaSeq. Steps of the MegaSeq workflow are shown along the X-axis. Wall time (red squares), CPU time for 1 genome (blue triangles), and CPU time for 240 genomes (green triangles) are plotted. Beagle’s large size and memory allow for massively parallel analysis of genomic data.

Time Constraints: Computational time is a major bottleneck in the analysis of whole genome sequence data. We calculated the CPU (central processing unit) time for a single genome at 2,208 hours on a 2.1GHz processor. This time can vary with clock speed, memory speed, and disk speed; these calculations are bound more by disk and memory than by CPU. We suggest that these single-genome times approach the maximum speed attainable on the Beagle supercomputer.
A total of ~37 years of CPU time is required to complete the workflow for 240 genomes, yet the MegaSeq workflow can be completed on 240 genomes in 50.4 real-time hours using Beagle (Figure 2).
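The implied degree of parallelism can be checked with simple arithmetic; this is a rough sketch based only on the figures quoted above:

```python
# Rough consistency check of the MegaSeq figures quoted above: ~37 years
# of CPU time finished in 50.4 wall-clock hours implies several-thousand-
# way effective parallelism on Beagle.
HOURS_PER_YEAR = 365 * 24            # 8,760 hours
cpu_hours = 37 * HOURS_PER_YEAR      # ~324,000 CPU hours for 240 genomes
wall_hours = 50.4
effective_parallelism = cpu_hours / wall_hours
print(round(effective_parallelism))  # on the order of 6,000+ concurrent cores
```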

Space Constraints: Storage is another major hurdle in large-volume whole genome analysis. We estimate that approximately 1TB of space per genome was needed to complete the MegaSeq analysis. While it is possible to discard output from previous steps, a large volume of data is still active at any one time. During the cleanup stages each genome uses approximately 85GB of space, so roughly 5TB of space is needed to process each step for 61 genomes. These space requirements generally cannot be met by small clusters, or by large clusters with many users. The final merged VCF file for each patient is approximately 1GB.

We employed Beagle to improve the throughput and efficiency of WGA using freely available, public software packages. MegaSeq provides a workflow by which whole genome sequencing can be rapidly analyzed, relieving the computational bottleneck that currently exists in analyzing human genomic variation.


Taeyoon Kim’s group (University of Chicago) explores how a cell contracts against an external matrix, using a computational model incorporating actin filaments, actin cross-linking proteins, and molecular motors. The results illustrate a novel mechanism for rigidity sensing that depends on the distance a motor walks along an actin filament to generate contractile force. Cells modulate themselves in response to properties of the surrounding environment, such as substrate elasticity, exhibiting structural reorganization driven by the contractility of the cytoskeleton. The cytoskeleton, the scaffolding structure of eukaryotic cells, plays a central role in many mechanical and biological functions. It is composed of a network of actin filaments, actin cross-linking proteins (ACPs), and molecular motors.

Illustration of the network morphology with different levels of substrate stiffness (E), showing the details of motors and ACPs. (A) An actin network before contraction, consisting of actin filaments (cyan) cross-linked by ACPs (green) and molecular motors (red). (B) Cross-sections of the network at steady state for three different values of E, showing morphology and the magnitudes of extensional forces. The network contracts to different degrees and at different rates depending on E. Soft substrates (lower E) lead to a condensed network, whereas stiff substrates (higher E) contract very little, resulting in a heterogeneous network with tensed filaments.


Aaron Dinner’s group (University of Chicago) tested new software that they developed on Beagle. This software enables more than a thousand fold speedup of simulation of molecular processes in living systems. The specific process studied was the folding of an RNA that must take a well-defined shape to perform its biological function.

Example of an unfolding intermediate harvested at high flow. The RNA is tethered in a microfluidic channel at the point indicated by the gray sphere. Colored lines show major contacts that are broken. (Credit: Dinner)


Rick Stevens’ group (University of Chicago & Argonne National Laboratory) has used Beagle to build flux balance analysis (FBA) based metabolic models for all sequenced microbes (over 4,000 models). These models have been used to computationally predict essential genes in hundreds of pathogens, thus enabling deeper understanding of potential targets for antimicrobial therapies. These targets are being made available to the public via the ModelSEED web site.

Reconstructing Models in High Throughput: Model SEED (Credit: Henry and Stevens)


Maryellen Giger’s group (University of Chicago) works on Advanced Breast Cancer Image Analysis and Computer-Aided Diagnosis (CAD). They explore state-of-the-art techniques for recovering complex, non-linear structures from high-dimensional data, with the specific goal of relating disease to information extracted from radiologically imaged breast tissue. Using Beagle, they have investigated the properties of various dimensionality reduction schemes and their influence on CAD engine performance, and a manuscript is in the final phases of preparation. Overall, CAD is a tool that is expected to help improve performance and consistency for radiologists. It has the potential to improve both the speed and accuracy of cancer diagnoses, reducing unnecessary biopsies and avoiding false negatives, and it may later impact diagnoses in other areas where imaging plays an important role. Other research that is part of this project involved the development and testing of reconstruction methods for CT and MRI scans, and the analysis of MRI images based on contrast agents or dynamic data, which is promising for breast cancer diagnosis. Another line of research investigates the potential of simulation-based methods for the optimization of breast tomosynthesis.

Example of feature extraction followed by dimensionality reduction before the training of a classifier for the identification of breast cancer lesions. (Credit: Giger)


The collaborating groups of Karl Freed and Tobin Sosnick (University of Chicago) use Beagle to develop and validate new protein structure and folding pathway prediction algorithms that start from the amino acid sequence, without the use of templates. Their approach employs novel, compact molecular representations and innovative “moves” of the protein backbone to achieve accurate prediction with far less computation than previous methods. One of their methods, called “loop modeling”, which rebuilds local protein structures, was highly ranked in the international CASP9 competition (Critical Assessment of Protein Structure Prediction). The group is using Beagle to evaluate the loop modeling algorithm on new proteins and parameter spaces, to develop a new “tertiary fixing” algorithm, and to compare these algorithms to other approaches. Computational prediction is important for proteins whose structure has not yet been, or cannot be, determined by experimental methods. These largely unsolved problems are crucial for basic biological research and for applications in national priority areas, such as pathogens, biofuels, remediation, ecosystems, metagenomics, crops, and nutrition.


Tobin Sosnick’s group developed and evaluated new, more accurate and efficient predictive methods for large-scale explorations of RNA-protein interactions, yielding new insights into cellular networks. Their “modFTDock” application (M. Parisien et al.) can identify protein molecules having a high structural affinity for a specific RNA molecule. Non-coding RNAs often function in cells through specific interactions with their protein partners, and experiments alone cannot provide a complete picture of these interactions. No existing computational method was previously able to provide such genome-wide predictions of docked RNA-protein complexes. Using Beagle, the group was able to process 1,500 proteins using the Swift parallel scripting language. This work was presented in a poster at the 2011 annual RNA Society meeting: Parisien M., Sosnick T.R., Pan T., “Systematic prediction and validation of RNA-protein interactome,” RNA Society, Kyoto, June 2011. A manuscript is in progress.


Jinbo Xu’s group (University of Chicago) is developing continued enhancements to a “template-based” method of protein structure modeling that he has embodied in the “Raptor” family of structure prediction applications. These methods give excellent results for proteins where sufficient homologies can be found among proteins of known structure, and they have scored excellently among template-based approaches in the international CASP assessment process. Results are described in two submitted papers that acknowledge Beagle: 1) Jianzhu Ma, Jian Peng, Sheng Wang, and Jinbo Xu, “A Conditional Neural Fields model for protein threading,” submitted to RECOMB 2012 and to be submitted to the Bioinformatics Journal; and 2) Feng Zhao and Jinbo Xu, “Position-specific distance-dependent statistical potential for protein folding and assessment,” submitted to PNAS.


The Roux Group (Benoit Roux, University of Chicago, with Jim Phillips of the UIUC NAMD group) led work to port and tune NAMD for the Beagle XE6 system and its Gemini interconnect. NAMD is one of the flagship “workhorse” applications of molecular dynamics for biological applications. The results characterized Beagle’s NAMD performance as “excellent scaling, comparing well to (other leading) machines like TACC’s Lonestar, completely obliterating older ones like Ranger, and making our old rule-of-thumb of aiming for 1000 atoms/core outdated”. The effort achieved multiple nanoseconds of simulated time per day, excellent performance for general-purpose NAMD. This work benefits many applications, as the code is widely used. In addition, the Roux Group (with J. C. Gumbart) is using Beagle to perform molecular dynamics simulations to quantify the free energy of the insertion of membrane proteins and determine all of the contributions to that energy. These results help to resolve a longstanding experiment-simulation disagreement on the magnitude of the free energies for different amino acids. This work has been submitted to Biophysical Journal as James Gumbart and Benoit Roux, “Determination of membrane-insertion free energies by molecular dynamics simulations,” 2011.


Andrey Rzhetsky (University of Chicago) is developing tools to use Beagle for his project “Assessing and Predicting Scientific Progress through Computational Language Understanding”. The project develops general methods relevant for students, policy makers, and scientists. The goal is to generate dynamic, high-fidelity maps of knowledge claims in chemistry and related fields such as pharmaceuticals and toxicology. Their initial use of Beagle involves optimization of parameters for graph-based models of scientific collaboration. This work has the potential to help scientists gain knowledge in many health and non-health domains, particularly those where the literature has already grown to a size where human comprehension of all publications is no longer possible.


Yves Lussier’s group (University of Chicago; recently moved to UIC) has been using Beagle to develop high-throughput methods for identifying novel microRNA-related targets for cancer therapy, identifying the genetic underpinnings of complex-disease polymorphisms, and computationally identifying novel personal variant polymorphisms associated with disease. They expect to complete a human phenome map of genome-wide association studies (GWAS), used to identify common genetic factors that influence health and disease, by the end of 2011. These studies are computationally intensive. Part of this work was presented as “Complex disease networks of trait-associated SNPs unveiled by Information Theory” at the AMIA Annual Symposium 2011, and will be included in JAMIA with a distinguished paper award. Another paper is in preparation: “Mechanisms of Intergenic Polymorphisms and of Intragenic Pleiotropy of Complex Diseases Unveiled by Similarity and Overlap in eQTL Networks.”

Complex Disease Gene Modules of Cancer-Associated SNPs Unveiled by Information Theory and Protein Interaction Networks. (Credit: Lussier)


The focus of the Hatsopoulos lab’s (Nicho Hatsopoulos, University of Chicago) research is the analysis of empirical neural activity based on Granger-causal network models derived from generalized linear models of large-scale networks. These models are computationally intensive and run efficiently as compiled Matlab executables on Beagle. The group is starting to investigate how such networks of causally related neurons change over time. Eventually, with more efficient models, they will shed light on how a group of neurons forms a spatiotemporal pattern comparable with previously published local field potential spatiotemporal patterns, and on the implications of those patterns for behavior. Results were presented at the annual meeting of the Society for Neuroscience, November 2011. The team is currently developing C++ implementations of the estimation algorithms, which will allow them to scale up the models and test their performance on different Cray architectures such as the XK6.

Example of causality networks estimated at different timings in relation to visual cue onset. Before: [-101,50] ms. At cue: [51, 200] ms. After: [201, 350] ms.

Distribution of directions of LFP beta wave propagation during the Cue period. (Credit: Hatsopoulos)


In non-muscle cells, contractile force production is governed by highly dynamic self-organized actomyosin networks whose structure and mechanical properties emerge from a continuous interplay of assembly, interaction, and disassembly of actin filaments, cross-linkers, and molecular motors. To explore how, Ed Munro and collaborators (University of Chicago) have developed agent-based Brownian dynamics simulations that predict macroscopic dynamics of contractility and stress relaxation based on purely local microscopic interactions among filaments, cross-linkers, and motors. Using Beagle, they have studied how the interplay of network architecture and turnover of network elements determines “effective viscosity” and network flow on long timescales, and how force-dependent cross-bridge kinetics and myosin minifilament assembly state control the size and spacing of pulsed actomyosin contractions. Two articles are in preparation, by J. Alberts and E. M. Munro, and by T. Y. Kim, M. Gardel, and E. M. Munro.

Snapshot of agent-based Brownian dynamics simulation of macroscopic dynamics of contractility and stress relaxation in actin filaments, cross-linkers and molecular motors. (Credit: Munro)


Epilepsy is a disease that is still poorly understood. No obvious biological cause has been unequivocally identified, and it appears to be a complex disease. To understand its mechanics, the van Drongelen lab (Wim van Drongelen, University of Chicago) has been working to perform realistic computational simulations that bridge the gap between experimentally accessible individual behavior (~100 cells) and clinically recorded aggregate behavior (~1M cells). They have ported their models to Beagle and performed scaling and optimization studies, presented as Lee HC, Hereld M, Visser S, Pesce L, and van Drongelen W, “Scaling Behavior of Computational Model of Neocortex,” poster at the 4th International Meeting on Epilepsy Research, Chicago, IL, May 19-21, 2011. A paper describing those findings in more detail will be prepared soon. In the near future, they plan to investigate the network and cell properties that appear to affect the dynamics of epileptic seizures through simulation of networks of neurons. Parameter spaces will be studied to understand which factors are key. A key part of the project is also to analyze experimental data from neurons grown in vitro.

Snapshot of neuronal activity in a simulation of brain neocortex. (Credit: Van Drongelen)


Kevin White and Bob Grossman (University of Chicago) are developing reliable and robust profiling and alignment software, which requires regular and thorough testing almost every time a new version is produced. This operation requires the reanalysis of a considerable amount of data; in practice, at this time, it involves the regular analysis of the datasets currently available (2,000+ datasets). Other tests might also be added in the future. The scale of this operation requires a supercomputer such as Beagle to provide a timely and reliable answer. The team has developed workflows, successfully ported analysis tools, and is now getting ready to perform a full-scale test on Beagle.


An important next step in host-pathogen evolution studies is to incorporate models of within-host dynamics into existing between-host models, which is being studied by Stefano Allesina and Greg Dwyer at the University of Chicago. These within-host models, however, are often difficult to incorporate, because little is known about pathogen dynamics inside hosts. As early as 1961 (Saaty) it was postulated that within-host models could be compared using dose-response-time data, yet the computational complexity involved prevented the implementation of this strategy. Thus, despite many within-host models being proposed, few have been used for parameter estimation and none have been tested. Today, with the advancement of computing technology, in particular cluster computing, it is possible to perform parameter estimation on mechanistic models of within-host disease dynamics. By implementing message passing on Beagle, this team uses a simulation-based approach for model selection on nested models to learn the relative importance of various mechanisms involved in virus growth in the gypsy moth caterpillar. From this analysis, they draw biological conclusions that provide the foundation to improve current practices in applying gypsy moth virus as an environmentally-benign pesticide. They also use their results to gain an understanding of host-pathogen coevolution more generally.

To qualitatively assess the fit of the model to the data, 10⁴ parameter sets were generated from the joint posterior distribution of the parameters and used to simulate speed-of-kill. The median (1st percentile and 99th percentile) values for each time are plotted in red as a solid (dotted) line. The overlaid black squares are the data collected during the dose-response study. Also shown in the figure are the virus dose received (D) and the number of larvae infected (n). (Credit: David A. Kennedy)


The Voth group (Greg Voth, University of Chicago) has used Beagle to simulate actin, which is the primary component of the eukaryotic cellular cytoskeleton and has been the target of much recent experimental work. The specific focus of these simulations has been to characterize the structure and heterogeneity of the multi-protein actin filament. These simulations approach 1 million atoms in size, and therefore the Beagle resource has been critical for achieving these research aims. This work has contributed new knowledge about the behavior of actin filaments and advanced our understanding of possible binding sites for therapeutic agents, e.g., in cancer cells.

A snapshot from fully atomistic MD simulations of a 13-subunit piece of an actin filament used to understand the structural heterogeneity of this system, and its response to nucleotide state. Critical components to accurately simulate the filamentous state include the periodically repeating filament (boundaries of the periodic cell are shown in blue) and explicit solvation with appropriate ion concentrations. The lower half of the simulation box shows the water environment; potassium and chloride ions, added to approximate physiological concentrations, are shown as tan and cyan spheres respectively. (Credit: Voth)


The research group of Bob Eisenberg (Rush University) works on the calcium channels that control the contraction of the heart and the sodium channels that provide signaling in the nervous system. They have systematically studied how these channels choose particular ions to perform their function. In the time that Beagle has been in operation, it has allowed Eisenberg’s group to make progress, for example, showing that dielectric boundaries have important roles in determining selectivity. Improvement in channel function would have profound effects on the treatment of arrhythmias and diseases like multiple sclerosis that arise from poor function of these channels.

Ion Moving through a Bacterial Channel (Credit: Eisenberg)


Non-health-related projects

Using the actor model of computation on Beagle

Massively parallel applications are required to perform alignment-free analyses of environmental soil microbiomes. Data for such microbiomes are typically gathered by shotgun DNA sequencing of all the genomes found in an environment (collectively called a metagenome), and massively parallel tools are now required to tackle this big data in environmental research. Pipelines exist for loosely coupled tasks (such as short-read alignment), but crafting tightly coupled genomic tools (such as assemblers, or alignment-free approaches in general) that can tackle huge datasets remains challenging.

Figure 1. A 512-node metagenome assembly job on Beagle. The computation lasted 20 minutes, 21 seconds. The purple part at the beginning is input/output. This picture was generated with Spate (a metagenome assembler in development) and periodic patterns are visible.

Here, we describe an ongoing project to build the next generation of tightly coupled high-performance applications in genomics (loosely coupled tasks are not the target). The idea of a Sequence Analysis Library (SAL) is that it can assist in solving various sequence-based biology problems. We chose the actor model of computation [1] as the theoretical framework. We implemented from first principles an object-oriented, distributed actor engine named Thorium. This engine is designed with multi-node, multi-core computation in mind, although it also works on a single node and/or a single core. Thorium leverages mature, proven technologies such as C99, MPI (MPI-1.0), and Pthreads. Figure 1 shows a Thorium-driven metagenome assembly job running on Beagle.

Our work presently focuses on the Spate application, a metagenome assembler implemented strictly with actors. Spate will allow us to tackle one of the U.S. Department of Energy (DOE) Joint Genome Institute (JGI) grand challenges, namely the Great Prairie Soil Metagenome Grand Challenge. This grand challenge comprises 6 soil metagenome samples from Kansas, Iowa, and Wisconsin (1 cultivated corn soil sample and 1 native prairie soil sample for each of the 3 states). Each sample has roughly 2 billion or more DNA sequencing reads, and the largest has more than 5 billion reads.

1. How did Beagle meet our needs?

Spate is a metagenome assembler in development, and it uses the Thorium actor model engine to drive the computation. This project uses the C99 programming language.

We are currently testing the development code on Beagle using the DOE Great Prairie Soil Metagenome Grand Challenge (6 samples, see ref. [2]). One of the samples is Iowa Continuous Corn Soil, which contains 2,228,341,042 Illumina DNA sequences. Using Beagle, our goal is to obtain a metagenome assembly in less than 1 hour. Right now, it takes around 20 minutes (on 256 Beagle nodes) to load the data from storage into distributed memory and build the assembly graph, which contains 141,189,180,698 DNA subsequences (called k-mers) with 140,183,562,558 overlaps.

Beagle is a Cray XE6 system with 726 nodes connected by a Cray Gemini interconnect, which offers high throughput and low latency. Each Beagle node has 2 AMD Opteron 6100-series processors, each with 12 x86-64 cores (24 cores total), and 32 GiB of ECC DDR3 SDRAM physical memory. This node architecture is suitable for our work because Thorium (the underlying computation engine) is a multi-node, multi-core runtime environment; we typically use a single MPI rank with 22 or 23 worker threads per Beagle node. The multi-core processor (AMD Opteron “Magny-Cours”) supports the x86 mfence instruction, which lets us insert memory barriers in the code to ensure that changes to memory become visible in a timely manner.

Submitting a job on Beagle is easy too: we write a PBS script and submit it. Several C compilers are available on Beagle, such as GNU gcc 4.8.1 and Cray craycc 8.1.4. Finally, the amount of memory available per core on Beagle (32 GiB / 24 cores ≈ 1.33 GiB) is slightly higher than on Mira at Argonne (IBM Blue Gene/Q, 1 GiB per core).

Beagle meets our needs by providing us with elastic high-performance compute capability. We are also thankful to the following CI people: Joe Urbanski and Ana Marija Sokovic for technical support, and Ian Foster for ideas about actors, complexity, and computation.

2. Progress as a Beagle user

To get started on Beagle, we consulted the online documentation and obtained answers to our questions through the Beagle Help Desk service. Our requirements are MPI, a C99 compiler, and Pthreads; Beagle satisfies all of them. We use the following build script to generate an executable for Beagle:
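A minimal sketch of such a build script on a Cray XE6 (the module and source-file names here are placeholders, not the project's actual script):

```shell
#!/bin/bash
# Sketch of a Beagle (Cray XE6) build script; file names are placeholders.
# The Cray "cc" wrapper adds the MPI headers and libraries automatically.
module swap PrgEnv-cray PrgEnv-gnu   # build with GNU gcc instead of craycc

cc -std=c99 -O3 -Wall -o spate \
   thorium/*.c spate/*.c \
   -lpthread
```

On Cray systems, swapping the programming environment module changes which compiler the `cc` wrapper invokes, so the same script works with either gcc or craycc.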

3. Actor model versus MPI

The actor model of computation was first described in 1973 [1]. The actor model has messages and actors. An actor has a name and a script describing what it can do. When receiving a message, an actor can do one or more of the following things: change its behavior (designating a replacement behavior), send messages, and spawn new actors. The actor model is very simple: the only way an actor can interact with any other actor is by sending it a message. This strict interface produces robust and scalable reactive systems. To send a message to another actor, an actor needs to know the name of the other actor (acquaintance is an important idea in the actor model).

The actor model can be used to build an abstraction on top of MPI (multi-node) and Pthreads (multi-core). We are using the actor model on Beagle with our Thorium actor model engine.

As the project advances, we will have better tools for obtaining a clearer picture of soil microbiomes.


[1] Hewitt, C., Bishop, P. & Steiger, R. A universal modular ACTOR formalism for artificial intelligence. In Proceedings of the 3rd international joint conference on Artificial intelligence, IJCAI’73, 235-245 (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1973).

[2] Great Prairie Soil Metagenome Grand Challenge (Proposal ID: 949)


Glen Hocky of the Reichman Group at Columbia is evaluating a new cavity method for theoretical chemistry studies of how and why a liquid becomes a glass. This fundamental topic of statistical mechanics has been described as “the deepest and most interesting unsolved problem in solid state theory”. Beagle was used via Swift to calculate from simulation what is known as the “mosaic length”: particles are simulated by molecular dynamics or Monte Carlo methods within cavities having amorphous boundary conditions. This work has implications in many fields, including biology, biophysics, and computer science. The framework used in these Beagle runs is the same as one popular theory for understanding protein folding, discussed in highly cited works by Wolynes and collaborators (Bryngelson and Wolynes, “Spin glasses and the statistical mechanics of protein folding,” PNAS 84(21):7524-7528, 1987).


Joshua Elliott (University of Chicago), in the NSF Center for Robust Decision-Making on Climate and Energy Policy (RDCEP), is developing a large-scale integrated modeling framework for decision makers in climate and energy policy. The project has used Beagle to study land use, land cover, and the impacts of climate change on agriculture and the global food supply using a crop systems model called DSSAT (Decision Support System for Agrotechnology Transfer). The model has been benchmarked in a simulation campaign measuring yield and climate impact for a single crop (maize) across the US, using daily weather data and climate model output spanning 1981-2100 under 16 configurations of fertilizer, irrigation, and cultivar. Initial results have been presented at an advisory board meeting of the RDCEP project. Each simulation runs 125,000 DSSAT models on Beagle using Swift. This work has the potential to help decision makers understand the tradeoffs in decisions on climate and energy policy.


Paul Fischer’s group (Argonne National Laboratory) has been developing, porting, and testing the NEK5000 code on Beagle. The main purpose of their work is to develop large-scale, high-fidelity simulations of fluid flow in order to understand phenomena such as transitions in vascular flow, thermal hydraulics of nuclear reactor cores, and magneto-hydrodynamics of accretion disks. NEK5000 was ported to Beagle, and timing and scaling studies of various MPI operations were conducted successfully. Fischer’s group conducted preliminary studies of the development and saturation of the magneto-rotational instability (MRI) and the MRI-driven dynamo, believed to be responsible for the loss of orbital momentum and energy in accretion disks powering the most energetic phenomena in the universe (see image). These results were presented as Fausto Cattaneo, “Magneto-Rotational Turbulence and Dynamo Action,” at the Mini-conference on Understanding Astrophysical Dynamos, 53rd Annual Meeting of the APS Division of Plasma Physics, November 2011. Parametric studies of these phenomena are planned.

Azimuthal magnetic field in MHD simulations of large-scale MRI-driven dynamo with NEK5000.


Publications 2014

  • Molecular mechanism for differential recognition of membrane phosphatidylserine by the immune regulatory receptor Tim4. Proc Natl Acad Sci U S A
    Tietjen, G. T., Gong, Z., Chen, C. H., Vargas
  • Escherichia coli peptidoglycan structure and mechanics as predicted by atomic-scale simulations. PLoS Comput Biol
    Gumbart, J. C., Beeby, M., Jensen, G. J., Roux, B.
  • Characterization of nonlinear propagation waves of local field potentials in motor cortical areas. To be submitted to IEEE EMBC 2014.
    J. Chemili and K. Takahashi
  • Spatiotemporal dynamics of motor cortical network using ECoG signals. To be submitted to IEEE EMBC 2014.
    H. Watanabe, K. Takahashi, T. Isa
  • Multi-modal Decoding: Identifying Information Redundancy between Spike Trains and Sub-dural Electrocorticogram Signals. To be submitted to IEEE EMBC 2014.
    K. Balasubramanian, K. Takahashi, M. Slutzky, N. Hatsopoulos
  • Mixed performance of biomolecular force fields in reproducing the properties of polyglutamine in solution. To be submitted, 2014.
    Fluitt, A.M. and de Pablo, J.J.
  • TCGA-Assembler: An Open-Source Pipeline for TCGA Data Downloading, Assembling, and Processing. (Submitted, 2014).
    Y. Zhu, P. Qiu, Y. Ji.
  • Compiler Optimization for Extreme-Scale Scripting. To appear at CCGrid 2014 Doctoral Symposium, 2014.
    T.G. Armstrong, J.M. Wozniak, M. Wilde, I.T. Foster
  • Compiler Optimization for Data-Driven Task Parallelism on Distributed Memory Systems. Argonne National Laboratory Tech Report ANL/MCS-P5080-0214, 2014.
    T.G. Armstrong, J.M. Wozniak, M. Wilde, I.T. Foster
  • Atomistic simulation of polyglutamine in solution. American Institute of Chemical Engineers Annual Meeting. San Francisco, CA. (11/6/2013)
    A.M. Fluitt, J.J. de Pablo
  • Differences in dynamics and stability of the wild type β-Amyloid Aβ(1-40) and ΔE22-Aβ(1-39) (Japanese) protofibril structures, a Molecular Dynamics Study. Biophysical society 58th annual meeting. February 2014. San Francisco, CA.
    K.W. Johnson, T.R. Sosnick, K.F. Freed, E.J. Haddadian.
  • Peptidoglycan Structure and Mechanics as Predicted by Atomic-Scale Simulations. PLOS Computational Biology, 10(2), 2014.
    J.C. Gumbart, M. Beeby, G.J. Jensen, B. Roux.
  • Constraints and potentials of future irrigation water availability on agricultural production under climate change. Proceedings of the National Academy of Sciences (PNAS), 111(9):3241, 2014.
    J. Elliott, D. Deryng, C. Müller, K. Frieler, M. Konzmann, D. Gerten, et al.
  • Computational analysis of the binding specificity of Gleevec to Abl, c-Kit, Lck, and c-Src tyrosine kinases. J Am Chem Soc, 135(39), 2013.
    Y. Lin, B. Roux.
  • Locking the active conformation of c-Src kinase through the phosphorylation of the activation loop. J Mol Biol, 426(2), 2014.
    Y. Meng, B. Roux.
  • Large-Scale Modeling of Epileptic Seizures: Scaling Properties of Two Parallel Neuronal Network Simulation Algorithms. Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 182145, 10 pages, 2013.
    L.L. Pesce, H.C. Lee, M. Hereld, S. Visser, R.L. Stevens, A. Wildeman, W. van Drongelen
  • Clinical Predictors of Port Infections within the First 30 Days of Placement. J Vasc Interv Radiol, 2014;25:419–423.
    R. Bamba, J.M. Lorenz, A.J. Lale, B.S. Funaki, S.M. Zangan.
  • Supercomputing for the parallelization of whole genome analysis. Bioinformatics (in press, 2014).
    M.J. Puckelwartz, L.L. Pesce, V. Nelakuditi, L. Dellefave-Castillo, J.R. Golbus, S.M. Day, T.P. Cappola, G.W. Dorn II, I.T. Foster and E.M. McNally.
  • A molecular mechanism for differential recognition of membrane phosphatidylserine by the immune regulatory receptor Tim4. PNAS (in press, 2014).
    G.T. Tietjen, Z. Gong, C.-H. Chen, E. Vargas, J.E. Crooks, K.D. Cao, J.M. Henderson, C.T.R. Heffern, M. Meron, B. Lin, B. Roux, M.L. Schlossman, T.L. Steck, K.Y.C. Lee and E.J. Adams.
  • Subgroup-Based Adaptive (SUBA) Designs for Multi-Arm Biomarker Trials
    Yanxun Xu, Lorenzo Trippa, Peter Muller, Yuan Ji in Statistics in Biosciences (2014)
  • Molecular Dynamics Studies of Ion Conduction in a Prokaryotic Channel
    Karen M Callahan, Benoit Roux in Biophysj (2014)


Publications 2013

Publications 2012

  • A statistically defined anthropomorphic software breast phantom
    Beverly A. Lau, Ingrid Reiser, Robert M. Nishikawa, Predrag R. Bakic in Medical Physics (2012)
  • Analysis of cardiomyopathy using whole genome sequencing
    E. M. McNally, J. R. Golbus, L. L. Pesce, D. Wolfgeher, L. Dellefave-Castillo, M. J. Puckelwartz in 62nd Annual Meeting of The American Society of Human Genetics (2012)
  • Molecular autopsy for sudden death using whole genome sequencing
    M. J. Puckelwartz, L. Dellefave-Castillo, D. Wolfgeher, V. Nelakuditi, M. T. Campbell, J. R. Golbus in 62nd Annual Meeting of The American Society of Human Genetics (2012)
  • Population-based variation in cardiomyopathy genes
    JR Golbus, MJ Puckelwartz, JP Fahrenbach, LM Dellefave-Castillo, D Wolfgeher, EM McNally in Circ Cardiovasc Genet (2012)
  • Computer-aided image analysis and detection of prostate cancer using immunostaining for alpha-methylacyl-CoA racemase, p63, and high-molecular-weight cytokeratin
    Y Peng, Y Jiang, XJ Yang in Machine Learning in Computer-Aided Diagnosis: Medical Imaging Intelligence and Analysis
  • Complex-disease networks of trait-associated single-nucleotide polymorphisms (SNPs) unveiled by information theory
    Haiquan Li, Younghee Lee, James L. Chen, Ellen Rebman, Jianrong Li, Yves A. Lussier in Journal of American Medical Informatics Association (2012)
  • Context and Force Field Dependence of the Loss of Protein Backbone Entropy upon Folding Using Realistic Denatured and Native State Ensembles
    MC Baxa, EJ Haddadian, AK Jha, KF Freed, Sosnick TR in Journal of the American Chemical Society (2012)
  • Numerical Simulations of Strong Incompressible Magnetohydrodynamics Turbulence
    J Mason, Perez, Boldyrev, F Cattaneo in Physics of Plasmas (2012)
  • A network of subpopulation of neurons in primary motor cortex and beta oscillation waves exhibit similar spatiotemporal patterns
    K. Takahashi, K. Brown, S. Kim, T. Coleman, N.G. Hatsopoulos in Nanosymposium on Signal Propagation, The annual meeting, Society for Neuroscience (2012)
  • Granger causality analysis of functional connectivity of spiking neurons in orofacial motor cortex during chewing and swallowing
    K. Takahashi, L. Pesce, J. Iriarte-Díaz, S. Kim, T.P. Coleman, N.G. Hatsopoulos in IEEE Conference Engineering in Medicine and Biology Society (2012)
  • Granger causality analysis of state dependent functional connectivity of neurons in orofacial motor cortex during chewing and swallowing
    K. Takahashi, L. Pesce, M. Best, J. Iriarte-Díaz, S. Kim, T.P. Coleman, et al. in IEEE the 6th International Conference on Soft Computing and Intelligent Systems (2012)
  • Discovering RNA-protein interactome using chemical context profiling of the RNA-protein interface
    M Parisien, X Wang, G II Perdrizet, C Lamphear, CA Fierke, KC Maheshwari, et al.
  • On docking, scoring and assessing protein-DNA complexes in a rigid-body framework
    M Parisien, KF Freed, TR Sosnick in PloS one (2012)
  • Growing Point-to-Set Length Scale Correlates with Growing Relaxation Times in Model Supercooled Liquids
    Glen Hocky, Thomas Markland, David Reichman in Physical Review Letters (2012)
  • Evaluating Communication Performance in IBM BG/Q and Cray XE6 Supercomputing Systems
    H. Bui, V. Vishwanath, J. Leigh, M. E. Papka in Proceedings of the IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2012) (2012)
  • FLASH simulations of 120MJ target explosions in LIFE reactor chamber
    Ryan Sacks, Gregory Moses, Milad Fatenejad in 54th Annual Meeting of the APS Division of Plasma Physics (2012)
  • Gregory T. Tietjen, Zhiliang Gong, Chiu-Hao Chen, James Crooks, Ernesto Vargas, Kathleen Cao, et al. in Biophysical Society Annual Meeting (2012)
  • De novo prediction of protein folding pathways and structure using the principle of sequential stabilization
    AN Adhikari, KF Freed, TR Sosnick in P Natl Acad Sci USA (2012)
  • Two gene co-expression modules differentiate psychotics and controls
    C Chen, L Cheng, K Grennan, F Pibiri, C Zhang, J a Badner, et al. in Molecular psychiatry (2012)

  • Dynamic mechanisms of cell rigidity sensing: insights from a computational model of actomyosin networks
    Carlos Borau, Taeyoon Kim, Tamara Bidone, José Manuel García-Aznar, Roger D. Kamm in PloS one (2012)

  • Mechanisms of contractility and coarsening of active cytoskeletal networks
    Taeyoon Kim, Margaret L. Gardel, Ed Munro in Biophysical Society Meeting (2012)
  • Determinants of fluid-like behavior and effective viscosity in cross-linked actin networks
    Taeyoon Kim, Margaret L. Gardel, Ed Munro in Biophysical Society Meeting (2012)
  • Structural reorganization of active actin networks via competition between force generation and dissipation
    Taeyoon Kim, Ed Munro, Margaret L. Gardel in BMES annual meeting (2012)


Publications 2011

  • Scaling Behavior Of Computational Model of Neocortex
    Hyong C. Lee, Mark Hereld, Sid Visser, Lorenzo L. Pesce, Wim van Drongelen in 4th International Meeting on Epilepsy Research (2011)
  • Flow-dependent unfolding and refolding of an RNA by nonequilibrium umbrella sampling
    A. Dickson, M. Maienschein-Cline, A. Tovo-Dwyer, J. R. Hammond, A. R. Dinner in J. Chem. Theor. Comp. (2011)
  • A. N. Adhikari, J. DeBartolo, K. F. Freed, and T. R. Sosnick, “Ab initio protein structure prediction aided by sequential stabilization of tertiary structure,” poster, Protein Folding Symposium. Berkeley, California, USA, June, 2011.
  • F. Cattaneo, “Magneto-Rotational Turbulence and Dynamo Action,” Mini-conference on Understanding Astrophysical Dynamos, 53rd Annual Meeting of the APS Division of Plasma Physics, Salt Lake City, Utah, USA, November, 2011.
  • A. Dickson, M. Maienschein-Cline, A. Tovo-Dwyer, J. R. Hammond, and A. R. Dinner, “Flow-Dependent Unfolding and Refolding of an RNA by Nonequilibrium Umbrella Sampling,” Journal of Chemical Theory and Computation, v. 7(9), pp. 2710-2720, 2011.
  • R. Jamieson, “Grid Based Quantitative Breast Image Analysis,” poster, Era of Hope 2011, (DoD Fellowship Award meeting).
  • R. Jamieson, K. Drukker, M. L. Giger, “Use of Data Reduction Techniques for the Visualization of Computer-Extracted Lesion Features in the Diagnostic Interpretation of Breast Cancer,” Radiological Society of North America, 2011.
  • H. C. Lee, M. Hereld, S. Visser, L. Pesce, and W. van Drongelen, “Scaling Behavior Of Computational Model of Neocortex,” (poster) 4th International Meeting on Epilepsy Research, Chicago, IL, May 2011.
  • M. Parisien, T. R. Sosnick, T. Pan, “Systematic prediction and validation of RNA-protein interactome,” (poster), RNA Society, Kyoto, June 2011.