Grid Computing Planet   Earthweb  


Stottler Henke To Develop Grid Software For DoE
December 2, 2003
By Paul Shread

Artificial intelligence firm Stottler Henke Associates has been selected by the U.S. Department of Energy to develop "smart job recovery" software to improve the quality of service provided by computer clusters and Grids.

Stottler Henke has been awarded a $750,000 Small Business Innovation Research contract from DoE to develop the Agent-Based High Availability (ABHA) system. The goal of the system is to let computer clusters process long-running batch jobs more reliably by detecting and diagnosing problems so that ABHA can determine how best to restart those jobs and, if possible, continue executing them. With long-running batch jobs, restarting jobs by going back to the beginning wastes time, computer resources, and money.

Douglas Olson, a nuclear physicist in Lawrence Berkeley National Laboratory's Computational Research Division, one of Stottler Henke's partners on the ABHA project, said the project should improve performance for a critical high-performance computing infrastructure: clusters of commodity computers running Linux.

"With commodity hardware, some level of failure is to be expected," Olson said. "As we transition into the widely distributed mode of Grid computing, the normal failure rates in such a complex system make efficient use of the resources extremely challenging. The key is to be resilient when a failure occurs; we expect ABHA will afford us that resiliency. Without it, we could not use commodity computers, and our hardware costs would be 10 to 100 times greater."

Charu Chaubal, a Grid computing technologist at Sun Microsystems, says devising failure recovery technology is important if Grid computing is to make the transition from the academic world into mainstream business environments.

"Business computing has a more stringent requirement for high availability," Chaubal said. "And as the Grids get larger, failure recovery becomes more critical. That's why initiatives like ABHA are so timely."

Computer jobs can fail for many reasons, such as transient and permanent hardware failures; software configuration errors; insufficient computing, storage, or network resources; and application failures. As clusters and jobs continue to grow in size, failures during execution become more likely.
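The scaling effect can be illustrated with a small calculation. This is a minimal sketch under a simplifying assumption that is not from the article: every node fails independently with the same probability during a job. The function name and the 0.1 percent figure are illustrative only.

```python
def job_failure_probability(per_node_failure_prob: float, node_count: int) -> float:
    """Chance that at least one node fails while a job runs.

    Assumes (for illustration only) that node failures are independent
    and equally likely on every node.
    """
    if not 0.0 <= per_node_failure_prob <= 1.0:
        raise ValueError("per_node_failure_prob must be between 0 and 1")
    if node_count < 0:
        raise ValueError("node_count must not be negative")
    # The job is failure-free only if every node survives.
    return 1.0 - (1.0 - per_node_failure_prob) ** node_count


# Hypothetical per-node failure probability of 0.1 percent per job.
for nodes in (10, 100, 1000):
    print(nodes, round(job_failure_probability(0.001, nodes), 3))
```

Under these assumptions, the chance of at least one failure rises from about 1 percent on 10 nodes to about 63 percent on 1,000 nodes.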

To provide high throughput and high reliability, the cause of task failure must be determined in enough detail to select and execute the appropriate job recovery action. Automated job recovery is currently impractical, Stottler Henke says, because it is difficult, time-consuming, and error-prone for end users to implement the specific fault detection, diagnosis, and recovery algorithms needed by each job.

Stottler Henke says its ABHA system will monitor the execution of each task, detect task failures, diagnose their cause, and recover intelligently by applying knowledge of the cluster's configuration and topology, knowledge of each job's decomposition into parallel and sequential tasks, and knowledge of each task's resource requirements. Possible job recovery actions include aborting the task, rescheduling the task for immediate or future execution, or modifying job parameters to avoid problematic portions of the cluster.

Stottler Henke is working with Lawrence Berkeley to design and prototype the ABHA system to support the Lab's clusters of Linux-based computers, which are managed by the University of Wisconsin's Condor and Platform Computing's LSF workload management systems. The company expects that core technologies developed during the project will also support clusters and Grids implemented on other hardware platforms and operating systems. Stottler Henke is also seeking interest and participation from vendors and end-user organizations that provide or employ clustered computing.
