2. Comparisons with other Database Machines

This section briefly comments on some aspects of IDIOMS, and makes comparisons with a few well-known database machines.

2.1 Database machines and dataflow models

A well-accepted logical model of a shared-nothing database machine (DBM) is that it consists of storage devices, processing elements, and an interconnection network, and the physical implementation of IDIOMS clearly follows this. It is possible to combine the logical functions of processor nodes and storage nodes on the same device; the Bubba team do just this [BORA88]. The advantage of reduced communications traffic is offset by the disadvantage that the processing power of the machine becomes tied to its storage requirements and capacity. In our design, we can scale storage and processing capacity independently.

As with other designs, such as Gamma [DEWI90] and Bubba [BORA88], MIS query execution follows the dataflow model. In other words, an operation normally commences as soon as it can read from its required input stream(s). The advantages of this strategy are that intermediate data does not have to be saved to disk, and that delays are minimal, because an operation need not wait for the previous operation to complete. To achieve this, sufficient processors must be available to execute the different operations concurrently. Additionally, we will physically optimise dataflow execution by ensuring, wherever possible, that if operation X sends data to operation Y, then X and Y are carried out on physically adjacent processors.
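
To make the dataflow style concrete, the sketch below (our illustration only, not IDIOMS code; the operators and data are hypothetical) pipelines three operators as Python generators. The downstream operator consumes tuples as soon as the upstream one produces them, so no intermediate result is written to disk.

    # Each operator consumes a tuple stream and produces one; a downstream
    # operator begins work as soon as the first tuple arrives (dataflow),
    # rather than after the upstream operator has finished.

    def scan(relation):
        """Source operator: stream tuples from a stored partition."""
        for t in relation:
            yield t

    def restrict(stream, predicate):
        """Filter operator: emits each qualifying tuple immediately."""
        for t in stream:
            if predicate(t):
                yield t

    def running_total(stream, column):
        """Aggregate operator: consumes tuples as they arrive."""
        total = 0
        for t in stream:
            total += t[column]
        return total

    # Hypothetical account data: (account_no, branch, balance)
    accounts = [(1, 'A', 500), (2, 'B', 250), (3, 'A', 750)]
    pipeline = restrict(scan(accounts), lambda t: t[1] == 'A')
    print(running_total(pipeline, 2))   # 1250 -- no intermediate file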

2.2 OLTP/MIS mix

The major aim of IDIOMS is to service both OLTP and MIS queries concurrently, even when the MIS queries are complex. Teradata, for example, deal with an OLTP/MIS mix [REED90], but only one MIS query can be serviced at a time, whereas IDIOMS can service multiple MIS queries concurrently.

Consider the Teradata machine, where each disk contains both MIS and OLTP data, with the result that "some DSU [disk] capacity must also be reserved for spool space for sorts" [REED90]. Clearly, this reserved space is wasted most of the time. Furthermore, this sharing of disks by OLTP and MIS data implies that while OLTP is reading/writing, MIS must wait, because there is only one physical disk arm on each disk.

By careful design we ensure that this problem cannot arise. We have two sets of disks, one reserved specifically for OLTP data, and the other for MIS data such as summary tables, resulting in minimal interference between OLTP transactions and access to purely MIS data. The design ensures that the OLTP disks have spare access capacity, which is used by MIS queries that require access to OLTP data. Although MIS queries can read the OLTP disks, OLTP transactions have priority on disk access. Clearly, OLTP transactions can be accessing one set of disks while MIS queries requiring only MIS data access the other.
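
The priority rule could be realised in many ways; the following is a minimal sketch of one possibility (our assumption, not a description of the IDIOMS disk controller), in which a disk's request queue always serves pending OLTP requests before MIS requests, so MIS reads consume only spare access capacity:

    import heapq

    # Requests carry a priority class: 0 = OLTP, 1 = MIS. The heap orders
    # OLTP requests ahead of MIS ones; a sequence number gives FIFO order
    # within each class.

    OLTP, MIS = 0, 1

    class DiskQueue:
        def __init__(self):
            self._heap = []
            self._seq = 0

        def submit(self, priority, request):
            heapq.heappush(self._heap, (priority, self._seq, request))
            self._seq += 1

        def next_request(self):
            return heapq.heappop(self._heap)[2] if self._heap else None

    q = DiskQueue()
    q.submit(MIS, "MIS scan of partition 3")
    q.submit(OLTP, "OLTP read of account 42")
    print(q.next_request())   # the OLTP read is served first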

2.3 Partitioning and placement strategies

Teradata has a partitioning strategy that is superficially similar to that of IDIOMS, in that "data in each DBC/1012 table is stored evenly across all AMPs [access module processors]. This parallelism is, in fact one of the system's most significant benefits" [REED90]. However, Teradata choose a hybrid partitioning strategy based on hashing of the primary key, whereas our data is range partitioned. Partitioning strategies are discussed in detail in [KERR91b]; our conclusion is that range partitioning is superior to hash partitioning in our application domain. OLTP accesses are typically 'single hit', using a key value, and with a hash partitioning strategy we would still need a dense index for each partition if we wished to go directly to the tuple. Similarly, MIS accesses are not helped by hash partitioning, because MIS queries often involve ranges of values, and neither indices nor hashing techniques help with such queries.
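
The point about range predicates can be made concrete with a toy example (ours; the split points and the query are hypothetical). Under range partitioning, only the partitions whose ranges overlap the predicate need be scanned; under hash partitioning, qualifying tuples may lie anywhere, so every partition must be scanned.

    import bisect

    # Hypothetical split points on account_no, defining 4 partitions:
    # p0: key < 1000, p1: 1000-1999, p2: 2000-2999, p3: key >= 3000.
    bounds = [1000, 2000, 3000]
    n_partitions = len(bounds) + 1

    def partition_of(key):
        return bisect.bisect_right(bounds, key)

    def range_partitions(lo, hi):
        """Partitions a predicate BETWEEN lo AND hi touches (range case)."""
        return list(range(partition_of(lo), partition_of(hi) + 1))

    def hash_partitions(lo, hi):
        """Under hash partitioning, tuples in [lo, hi] may be anywhere."""
        return list(range(n_partitions))

    # MIS query: WHERE account_no BETWEEN 1500 AND 2500
    print(range_partitions(1500, 2500))  # [1, 2] -- 2 of 4 partitions
    print(hash_partitions(1500, 2500))   # [0, 1, 2, 3] -- full scan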

With reference to placement of partitions, the Bubba team state, "... we should identify the relations involved in this nonlinear operation (N-M join) and reduce their DegDecl [degree of declustering]" [COPE88]. Their reasoning is that message costs dominate, and are of the order O(DegDecl^2). We argue that this is not the answer. Message costs may be reduced by doing this, but the problem is inherent in the operation, i.e. an M-N join requires M*N comparisons. Indeed, in an attempt to overcome this, hash- and range-partition-based joins have been proposed, e.g. [RICH89], [DEWI90], where one logical join is physically implemented as a set of joins between partitions, and the results merged.
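
The following sketch (our illustration; relations R and S and the bucket count are hypothetical) shows the partitioned-join idea: both relations are split on the join attribute with the same hash function, so tuples that can possibly match fall into corresponding buckets, and one logical join becomes a set of smaller joins that could run on separate processors.

    from collections import defaultdict

    def hash_partition(relation, key, k):
        """Split a relation into k buckets on the join attribute."""
        buckets = defaultdict(list)
        for t in relation:
            buckets[hash(t[key]) % k].append(t)
        return buckets

    def partitioned_join(r, s, r_key, s_key, k=4):
        """Join bucket i of R only with bucket i of S; merge the results."""
        rb = hash_partition(r, r_key, k)
        sb = hash_partition(s, s_key, k)
        result = []
        for i in range(k):                 # each iteration could run on
            for rt in rb.get(i, []):       # a separate processor
                for st in sb.get(i, []):
                    if rt[r_key] == st[s_key]:
                        result.append(rt + st)
        return result

    # Hypothetical data: R(account_no, branch), S(account_no, balance)
    R = [(1, 'A'), (2, 'B'), (3, 'A')]
    S = [(2, 250), (3, 750)]
    print(partitioned_join(R, S, 0, 0))
    # [(2, 'B', 2, 250), (3, 'A', 3, 750)]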

Boral states that data placement must be static because of the amounts of data involved [BORA88]. As far as allocation of space to partitions is concerned, we agree. However, our placement strategy allows migration of data between partitions, with the constraint that the migration rate is low.

Some researchers have developed complex partitioning and placement strategies, e.g. [SACC85], [CERI82]. At this stage we consider that the complexity of these is such as to preclude their use in very large industrial applications.

Currently we are developing a cost model which will enable us to determine suitable partition ranges. The factors involved include data skew, dependencies between data values and tuple heat, update rate, and the update value/partition range ratio. The basic trade-off is that smaller partitions result in reduced scan time, but an increased migration rate. Migration necessitates updating the OLTP index, in addition to the raw migration costs. For this reason we cannot have a large number of partitions, and we should ensure that the columns on which we do partition are those which give the greatest scan-time saving over all MIS queries.
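
Purely as a back-of-envelope illustration of this trade-off (the cost functions and every constant below are our assumptions, not the cost model under development): scan cost falls roughly as 1/P with the number of partitions P, while migration cost grows with P, because updates can cross more partition boundaries.

    # Toy cost model: scan cost falls as 1/P, migration cost rises roughly
    # linearly with P. All constants are hypothetical.

    N = 1_000_000        # tuples in the table
    updates = 10_000     # updates per period
    p_cross = 0.001      # chance an update crosses one given boundary
    c_scan = 1.0         # cost per tuple scanned
    c_migrate = 500.0    # cost per migrated tuple (incl. OLTP index update)

    def period_cost(p):
        scan = c_scan * N / p                            # one localised MIS scan
        migrate = c_migrate * updates * p_cross * (p - 1)
        return scan + migrate

    for p in (2, 4, 8, 16, 32):
        print(p, round(period_cost(p)))
    # 505000, 265000, 160000, 137500, 186250: total cost falls, then
    # rises again as migration dominates, so the optimum P is modest.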