Academic topics and resources
CS 7900 handouts
There are several ongoing projects in my lab (the Space Software Laboratory). The first is the development of a fault-tolerant, self-configuring distributed programming system. The second is providing space-platform software for the USAF Office of Scientific Research (space vehicles directorate). This work has included operating system software for the PnPSat, Maple I and II, RoadRunner, and STRV-1d spacecraft, as well as control software for the Advanced Instrument Controller embedded in the DITP interceptor system. We are currently developing the software system for the Air Force Satellite Data Model, the Plug & Play software framework to support the next three generations of USAF space research missions.
My research in this area is focused on the problems associated with fault-tolerant plug & play distributed and parallel systems: how can a system of many processors efficiently self-organize and work in parallel, yet recover and continue when some of the nodes are lost? Other interests include fault-tolerant distributed LINDA systems and the MOM system (see publications list).
In general, the traditional method for achieving fault tolerance in multi-processor systems has been redundant computation or redundancy masking. The primary drawback of this approach is obvious: it is a costly use of resources. Another common approach is checkpointing. In this method, a backup or monitor process is assigned to each process with a critical task. The monitoring process receives timely checkpoint information about the progress of the task and is designed to take over (or roll back output and restart) in the case of a failure. While conceptually simple, this method can be difficult and costly to implement for general algorithms. Each new system may require a new implementation of a different checkpointing protocol. Testing and verifying a checkpointing implementation can be very difficult and may, in fact, not be practical in some applications.
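The checkpointing idea described above can be illustrated with a minimal, single-process Python sketch. It is not taken from any system mentioned on this page. The names `CheckpointStore`, `NodeFailure`, `run_task`, and `run_with_backup` are hypothetical, the "task" is just a running sum, and the node failure is simulated with an exception.

```python
from dataclasses import dataclass


@dataclass
class CheckpointStore:
    """Hypothetical stand-in for the monitor's record of task progress."""
    next_index: int = 0   # first work item not yet completed
    partial_sum: int = 0  # result accumulated so far

    def save(self, next_index, partial_sum):
        self.next_index = next_index
        self.partial_sum = partial_sum


class NodeFailure(Exception):
    """Raised to simulate the loss of the node running the task."""


def run_task(items, store, fail_at=None):
    """Sum `items`, resuming from whatever checkpoint `store` holds."""
    index, total = store.next_index, store.partial_sum
    while index < len(items):
        if fail_at is not None and index == fail_at:
            raise NodeFailure(f"node lost before item {index}")
        total += items[index]
        index += 1
        store.save(index, total)  # checkpoint after every completed item
    return total


def run_with_backup(items, fail_at=None):
    """Run the task; if the primary fails, a backup resumes from the checkpoint."""
    store = CheckpointStore()
    try:
        return run_task(items, store, fail_at)
    except NodeFailure:
        # The backup takes over from the last checkpoint, not from scratch.
        return run_task(items, store)


if __name__ == "__main__":
    print(run_with_backup([1, 2, 3, 4, 5], fail_at=3))
```

Even in this toy form, the checkpoint contents (`next_index`, `partial_sum`) are specific to the algorithm being protected, which is the implementation cost the paragraph above points to.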
My research is focused on the development of systems that provide high-level support for writing parallel programs with significant fault-tolerance capabilities. This research includes self-configuring or “Plug & Play” network architectures and the MOM system. MOM is a modification of the existing general tuple-space or LINDA parallel-programming model and system framework. The specific goals are to a) efficiently utilize the available resources in general MIMD networks without requiring processors to be dedicated to redundancy, b) provide a consistent mechanism for implementing fault tolerance that is relatively simple for the programmer, c) require little system overhead in the absence of faults, d) be applicable to a useful range of programming problems, and e) provide high-bandwidth inter-node communications.
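For readers unfamiliar with the tuple-space model that MOM builds on, the following is a minimal in-memory Python sketch of the three classic Linda operations (`out`, `rd`, `in`). It is not the MOM system and does not model MOM's fault-tolerance modifications, which are not described on this page. The class name `TupleSpace`, the use of `None` as a wildcard, and the `timeout` parameter are assumptions made for illustration.

```python
import threading


class TupleSpace:
    """Illustrative Linda-style tuple space shared by threads in one process."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *tup):
        """Deposit a tuple into the space and wake any waiting readers."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    @staticmethod
    def _matches(pattern, tup):
        # None in the pattern is treated as a wildcard (an assumption of this sketch).
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup)
        )

    def _find(self, pattern):
        for tup in self._tuples:
            if self._matches(pattern, tup):
                return tup
        return None

    def rd(self, *pattern, timeout=None):
        """Block until a matching tuple exists; return it without removing it."""
        with self._cond:
            if not self._cond.wait_for(
                lambda: self._find(pattern) is not None, timeout
            ):
                raise TimeoutError("no matching tuple")
            return self._find(pattern)

    def in_(self, *pattern, timeout=None):
        """Block until a matching tuple exists; remove and return it."""
        with self._cond:
            if not self._cond.wait_for(
                lambda: self._find(pattern) is not None, timeout
            ):
                raise TimeoutError("no matching tuple")
            tup = self._find(pattern)
            self._tuples.remove(tup)
            return tup


if __name__ == "__main__":
    space = TupleSpace()
    space.out("task", 1, 10)                 # a producer posts work
    _, task_id, value = space.in_("task", None, None)  # a worker claims it
    space.out("result", task_id, value * 2)  # and posts the result
    print(space.rd("result", task_id, None))
```

In plain Linda, a worker that withdraws a task tuple with `in` and then fails takes that work with it, which is one reason fault tolerance in tuple-space systems is a research problem in its own right.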
Department of Computer Science, USU
UMC 4205
Office Phone: 797-2015
Fax: (801)797-3265
E-Mail: scott.cannon@usu.edu
Last updated: Aug 2007