Remote Store Architecture

A new communication scheme for building high-performance parallel computers 

Let us look at the architecture of a state-of-the-art computer system (see below). A parallel computer system contains n processing elements, where n is at least two. The processing elements are connected to each other by a communication network, which should offer high bandwidth and low latency. No further assumptions are made concerning the communication network.

State of the art multiprocessor system

For example, Fast Ethernet, ATM, Gigabit Ethernet, Fibre Channel or any other fast network may be used. Again, no assumptions are made concerning topology: buses, stars, rings, 2-D or 3-D networks (torus) or any other topology may be used. The cost and performance of such networks differ and must be matched to the needs of the application. A single processing element consists of a control and computation unit (CPU), a memory and a communication unit. Typically the memory is divided into an application area and a system area.

Assume that a first processor intends to transmit a message to a second processor. The message exchange takes place as follows. An application on the first processor has generated data it wishes to communicate. For that purpose it stores the data in the memory of the first processor. The operating system must then be notified that the data is to be communicated. To ensure that the data remains unchanged during communication, the operating system copies it into the system memory of the first processor. Once the communication unit is ready, the data is read out again by the control and computation unit (CPU) and transferred to the communication unit. With a DMA (Direct Memory Access) controller this last step can be simplified: the DMA controller autonomously retrieves the data from the system memory and writes it into the communication unit.

On the receiving side the data arrives in the communication unit. From there it is fetched by the control and computation unit (CPU) and stored in the system memory. This interim storage is required because the application may not yet be ready to receive the data. As soon as the application is able to receive the data, the operating system copies it from the system memory into the application memory.

These many transfers place a heavy load on the data transfer system: the data is moved up to 5 times per processor. If error detection requires additional checksums, the number of data movements is higher still. Moreover, such systems exhibit high latencies, which may exceed 1000 microseconds.
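The conventional path described above can be sketched as a small simulation that records how often the payload is moved on each side. This is only an illustrative model, not code from any real operating system: the class name ConventionalNode and the attribute names are assumptions made for this sketch, and each step of the description is simply modelled as one full copy.

```python
class ConventionalNode:
    """Toy model of one processing element in a conventional system."""

    def __init__(self):
        self.app_memory = None      # application area of the memory
        self.system_memory = None   # system (operating system) area
        self.comm_unit = None       # buffer inside the communication unit
        self.transfers = 0          # how often the payload was moved

    def _move(self, data):
        # Assumption of this sketch: every step is one full copy.
        self.transfers += 1
        return bytes(data)

    def send(self, data):
        self.app_memory = self._move(data)                # application stores its data
        self.system_memory = self._move(self.app_memory)  # OS copies it into system memory
        self.comm_unit = self._move(self.system_memory)   # CPU or DMA feeds the communication unit
        return self.comm_unit

    def receive(self, wire_data):
        self.comm_unit = self._move(wire_data)            # data arrives in the communication unit
        self.system_memory = self._move(self.comm_unit)   # CPU stores it in system memory
        self.app_memory = self._move(self.system_memory)  # OS copies it to the application
        return self.app_memory


sender, receiver = ConventionalNode(), ConventionalNode()
payload = b"result vector"
delivered = receiver.receive(sender.send(payload))
print(delivered == payload, sender.transfers, receiver.transfers)
```

The model is deliberately coarse; it is not meant to reproduce the exact count of up to 5 movements per processor given above, only to show that every stage touches the whole payload again.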

The Remote Store Architecture drastically reduces the number of copies; as a result the communication bandwidth increases by a factor of more than 4. Moreover, the latency may be reduced by two orders of magnitude, to less than 10 microseconds. The configuration of a computer with the Remote Store Architecture is shown below.

RSA communication

In this case too, the parallel computer system comprises n processing elements. These processing elements are connected to each other by a common communication network, which should offer high bandwidth and low latency. In addition to the components of a standard computer system, a communication manager unit is inserted between the communication unit and the local data transfer system. The local data memory is given a new segment: a communication buffer memory is introduced in addition to the system memory and the application memory. Both the application and the communication manager unit have access to this segment. In systems running several applications, several application memories and communication buffer memories may be present.

With the Remote Store Architecture, the processor writes the results of its computations into the communication manager unit. This unit adds a global address. The data values and the address are transferred to the communication unit and, after passing through the conventional communication network, arrive at the receiving communication units. There the communication manager unit compares the global address of the incoming data with predefined values previously provided by the application. This comparison determines whether the processor is interested in the data at all. Irrelevant data is simply ignored by the communication manager unit. For relevant data, a local memory address in the communication buffer memory is computed and the data is stored there directly.
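The receive path of the communication manager unit can be illustrated with a minimal sketch. The text does not specify how the predefined address values are represented; the sketch assumes contiguous address ranges ("windows") with a base, a length and a local offset. The names Window, CommunicationManager, register and on_receive are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Assumed form of a 'predefined value': a range of global addresses."""
    global_base: int   # first global address of the range
    length: int        # number of words in the range
    local_base: int    # start of the range in the communication buffer


class CommunicationManager:
    def __init__(self, buffer_words):
        self.comm_buffer = [0] * buffer_words  # communication buffer memory
        self.windows = []                      # ranges provided by the application

    def register(self, window):
        # The application announces which global addresses it is interested in.
        if window.local_base + window.length > len(self.comm_buffer):
            raise ValueError("window does not fit into the communication buffer")
        self.windows.append(window)

    def on_receive(self, global_addr, value):
        """Handle one incoming (global address, value) pair."""
        for w in self.windows:
            offset = global_addr - w.global_base
            if 0 <= offset < w.length:                # address comparison
                local_addr = w.local_base + offset    # address computation
                self.comm_buffer[local_addr] = value  # stored directly, no extra copy
                return True
        return False                                  # irrelevant data is ignored


cm = CommunicationManager(buffer_words=8)
cm.register(Window(global_base=0x1000, length=4, local_base=2))
accepted = cm.on_receive(0x1002, 42)   # falls inside the registered window
ignored = cm.on_receive(0x2000, 99)    # matches no window
print(accepted, ignored, cm.comm_buffer)
```

In hardware the comparison and the address computation would be done by dedicated logic rather than a loop; the loop here only makes the two steps visible.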

With the Remote Store Architecture, shared data is always read locally and always written globally. In real applications, reading is 10 to 10,000 times more frequent than writing; as a result a striking gain in speed can be achieved. The data is not copied again in the Remote Store Architecture; this feature is also called ‘zero copying’. Because data exchange involves only writing to a ‘remote’ processor, the name ‘Remote Store’ was chosen. Each communication manager unit comprises an address comparator, which determines whether the particular processing element is interested in the data, and an address computation unit, which uses the global address to compute the local physical address in the communication memory.
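The principle of local reads and global writes can be shown with a toy model of three processing elements. It is a sketch under stated assumptions, not a description of Easynet or T-NET: the classes Node and Network are hypothetical, every write is assumed to be delivered to all nodes including the writer, and a dictionary stands in for the communication memory.

```python
class Node:
    """Toy processing element holding a local replica of the shared data."""

    def __init__(self, name, interests):
        self.name = name
        self.interests = set(interests)  # global addresses this node wants
        self.replica = {}                # stands in for the communication memory
        self.network = None              # set when the node joins a Network

    def store(self, global_addr, value):
        # Writing is global: the (address, value) pair goes onto the network.
        self.network.broadcast(global_addr, value)

    def load(self, global_addr):
        # Reading is local: only the node's own replica is accessed.
        return self.replica[global_addr]

    def deliver(self, global_addr, value):
        # Address comparator: keep the value only if this node is interested.
        if global_addr in self.interests:
            self.replica[global_addr] = value


class Network:
    def __init__(self, nodes):
        self.nodes = nodes
        self.messages = 0                # number of writes put on the network
        for node in nodes:
            node.network = self

    def broadcast(self, global_addr, value):
        self.messages += 1
        for node in self.nodes:
            node.deliver(global_addr, value)


a = Node("a", interests={100})
b = Node("b", interests={100, 200})
c = Node("c", interests={200})
net = Network([a, b, c])

a.store(100, 3.14)                           # one global write
reads = [b.load(100) for _ in range(1000)]   # many local reads, no traffic
print(net.messages, reads[0], 100 in c.replica)
```

However many times node b reads the value, the message counter stays at one, which is the effect the read-to-write ratio above relies on.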

Aside from its main functions of address comparison and address computation, the communication manager unit may be extended with further special functions that are useful in parallel processing.

The purpose of all these special functions is to relieve the processor (CPU), to simplify programming and to increase overall system performance.

The Remote Store Architecture has been implemented in Easynet and T-NET.


frey@scs.ch