Physics researchers have carried out the first production-quality
simulated data generation on a data Grid comprising sites at Caltech,
Fermilab, the University of California-San Diego, the University of
Florida, and the University of Wisconsin-Madison.
"This achievement represents an extremely challenging and important
milestone in the integration of Grid middleware components within the
current 'real world' LHC computing environment," the researchers announced.
Doug Olson of Lawrence Berkeley National Laboratory and the Particle
Physics Data Grid said it has been "decided that a worldwide Grid
environment is required and will be used for the computing work of the
physics experiments at the LHC," the Large Hadron Collider at CERN in
Switzerland. Technical details of the worldwide Grid are still being
worked out, he said.
Globus Project co-leader Ian Foster called the work "a major achievement
in terms of production Grid computing."
The work was done by members of the U.S. Compact Muon Solenoid
Collaboration (CMS) in concert with the Particle Physics Data Grid, the
Grid Physics Network, and the International Virtual Data Grid
Laboratory, and was funded by the U.S. Department of Energy, the
National Science Foundation and the EU-DataGrid project, among
others.
The deployed data Grid serves as an integration framework, with Grid
middleware components brought together to form the basis for distributed
CMS Monte Carlo Production (CMS-MOP) and used to produce data for the
global CMS physics program, the researchers said. The middleware
components include Condor-G, DAGMan, GDMP, and the Globus Toolkit,
packaged together in the first release of the Virtual Data Toolkit.
The CMS-MOP distributed production system employs a tier-like hierarchy
in which a production manager at a Tier-1 center distributes production
jobs to several remote Tier-2 sites, they said. Once generated at the
Tier-2 sites, the simulated data is automatically published back to the
Tier-1 center as well as replicated to selected Tier-2 sites.
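
The flow described above can be illustrated with a minimal Python sketch.
It is not the CMS-MOP implementation: the round-robin assignment policy,
the site names, the ".out" dataset naming, and both function names are
assumptions made only for illustration.

```python
from collections import defaultdict
from itertools import cycle


def distribute_jobs(jobs, tier2_sites):
    """Assign each job to a Tier-2 site in round-robin order.

    Round-robin is an assumed policy; the article does not say how the
    production manager actually chooses a site.
    """
    assignment = defaultdict(list)
    for job, site in zip(jobs, cycle(tier2_sites)):
        assignment[site].append(job)
    return dict(assignment)


def publish_and_replicate(assignment, tier1, replica_sites):
    """Build a catalog mapping each output dataset to the sites holding it.

    Every dataset stays at the Tier-2 site that produced it, is published
    back to the Tier-1 center, and is copied to the selected replica sites.
    """
    catalog = {}
    for site, jobs in assignment.items():
        for job in jobs:
            dataset = f"{job}.out"  # hypothetical naming scheme
            holders = {site, tier1}
            holders.update(replica_sites)
            catalog[dataset] = sorted(holders)
    return catalog
```

With three jobs and two hypothetical Tier-2 sites, distribute_jobs sends
the first and third job to one site and the second to the other;
publish_and_replicate then records which sites hold each output.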
"This integration exercise showed that the Grid still presents
significant challenges in harnessing distributed resources," the
researchers said. Issues of data and security had to be overcome, such
as how to get software and data to many remote systems and confirm that
they arrived, and how to get results back.
Issues of heterogeneity and error recovery also had to be addressed,
they said. "To use other sites' resources, you need to interface with
many batch systems; the Grid means more errors, more crashes, more
mysterious failures," they wrote. Among the unanticipated errors handled
were key machines crashing in the middle of a run; Grid credentials
expiring in the middle of a run; jobs successfully completing but their
results being lost before they got sent back; various pieces of
middleware doing the unexpected; and the network going down.
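
One common way to cope with failures of this kind is to resubmit a job
until its result is confirmed to have come back. The Python sketch below
shows that idea only; it is not how CMS-MOP or its middleware recovered
from errors, and the submit callable, the use of None to signal a lost
result, and the retry limit are all assumptions.

```python
def run_with_retries(submit, job, max_attempts=3):
    """Resubmit a job until a result arrives or the attempts run out.

    `submit` is a hypothetical callable that raises RuntimeError when the
    submission fails (for example, an expired credential) and returns
    None when the job ran but its result was lost on the way back.
    """
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = submit(job)
        except RuntimeError as exc:
            errors.append(f"attempt {attempt}: {exc}")
            continue
        if result is None:
            errors.append(f"attempt {attempt}: result lost")
            continue
        return result, errors
    raise RuntimeError(f"{job} failed after {max_attempts} attempts: {errors}")


def make_flaky_submit():
    """Return a stand-in submit function that fails twice, then succeeds."""
    calls = {"n": 0}

    def submit(job):
        calls["n"] += 1
        if calls["n"] == 1:
            raise RuntimeError("credential expired")
        if calls["n"] == 2:
            return None  # job completed but the result never came back
        return f"{job}: done"

    return submit
```

Keeping the list of errors alongside the result makes it possible to see
afterward which failures occurred, even when the job eventually succeeds.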
"Despite these challenges, over 50,000 proton-proton collision events
inside the CMS detector have been simulated using CMS-MOP and validated
for use by CMS physicists," the researchers said. Production of another
150,000 simulated events is underway.