We now show our multi-column partitioning strategy in detail. Consider an artificially simple database for a bank, which consists of just two relations.
CUSTOMER( C#, BRANCH, INITIAL, LASTNAME, ADDR_L1, ADDR_L2, CITY, POSTCODE, AGE, SEX)
INSURANCE( POLICY#, BRANCH, C#, DATE, P_TYPE, PREMIUM)
The primary keys are underlined, and there is a foreign key relationship between INSURANCE and CUSTOMER on (BRANCH, C#). Note that BRANCH equates to the bank sort code: the first two digits identify the bank, the third and fourth identify the region, and the last two identify the branch within the region.
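The structure of the sort code can be made concrete with a small helper. This is a hypothetical sketch for illustration (the function name and dictionary keys are our own, not from the paper); it simply splits a six-digit BRANCH value into the three two-digit fields described above.

```python
# Hypothetical helper (not from the paper): decode a six-digit sort code
# (BRANCH) into the three two-digit fields described in the text.
def decode_branch(branch: str) -> dict:
    assert len(branch) == 6 and branch.isdigit()
    return {
        "bank": branch[0:2],    # first two digits identify the bank
        "region": branch[2:4],  # third and fourth identify the region
        "branch": branch[4:6],  # last two identify the branch within region
    }
```

For example, `decode_branch("204815")` yields bank `20`, region `48`, branch `15`.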
The first step to partitioning in our system is to decide how many Transaction Processors are required to service the OLTP queries (these are not shown on figure 8). Let us assume that we need three; this means that we need at least three physical partitions on three separate disks. To guarantee an even access rate for OLTP, we ensure that each of the OLTP partitions is of approximately equal size. We have no worries about overall partition "heat", as we assume that although the access frequency to individual tuples varies greatly, on average heat will balance out within and across partitions.
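One simple way to obtain approximately equal-sized partitions is to hash each tuple's key to a partition number. This is an illustrative sketch under our own assumptions, not the paper's mechanism: we hash the (BRANCH, C#) key with a deterministic CRC, which spreads tuples roughly uniformly across the three Transaction Processor partitions.

```python
import zlib
from collections import Counter

# Illustrative sketch (an assumption, not the paper's algorithm): assign each
# CUSTOMER tuple to one of three Transaction Processor partitions by hashing
# its (BRANCH, C#) key. A uniform hash keeps partition sizes roughly equal,
# which underpins the even OLTP access rate under the heat assumption.
N_PARTITIONS = 3

def oltp_partition(branch: str, c_no: int) -> int:
    key = f"{branch}:{c_no}".encode()
    return zlib.crc32(key) % N_PARTITIONS

# e.g. 9000 customers of one branch spread across the three partitions
sizes = Counter(oltp_partition("204815", c) for c in range(9000))
```

With 9000 keys, each of the three partitions receives close to 3000 tuples, so the sizes (and hence, under the heat assumption, the access rates) balance out.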
Nothing precludes further partitioning of the data. If we know that MIS queries on our database typically use age and sex in selection predicates, then we are justified in physically partitioning the data on these attributes. This does not mean that we are creating a non-relational system, as the concept "relational" refers entirely to the method of specifying queries. To improve MIS access further, we allocate each of the sub-partitions to its own disk, even though they are accessed by the OLTP system as if they were all on a single disk (the OLTP dense index spans a number of physical disks). This is illustrated by figure 4. In practice, we could have more age partitions, but for clarity and simplicity these are not shown. The DSDL allows us to specify such partitioning (figure 6). Note that this is not a complete specification; a full example is given in [KERR91b].
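The sub-partitioning on age and sex can be sketched as a routing function that maps a CUSTOMER tuple to the disk holding its (age-band, sex) cell. The age bands below are assumptions chosen for illustration; the actual partitioning would be specified in the DSDL.

```python
# Hypothetical routing rule: the age bands and the (age-band, sex) grid are
# illustrative assumptions; the real partitioning is specified in the DSDL.
AGE_BANDS = [(0, 25), (25, 45), (45, 65), (65, 200)]

def mis_subpartition(age: int, sex: str) -> tuple:
    # route a CUSTOMER tuple to the sub-partition for its (age-band, sex) cell
    for lo, hi in AGE_BANDS:
        if lo <= age < hi:
            return (f"{lo}-{hi}", sex)
    raise ValueError(f"age {age} out of range")
```

An MIS query with a predicate such as `AGE BETWEEN 25 AND 44 AND SEX = 'F'` then needs to scan only the single sub-partition `("25-45", "F")` rather than the whole relation, while OLTP continues to see one logical relation through its dense index.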
Any column that is used for partitioning and contains data that changes (e.g. age) necessitates data migration. The basic options are that either a process regularly scans the data and relocates the appropriate tuples, or data is moved when it is updated. The first option means that much work may be done only to find that no data needs to be moved. The second means that although OLTP would not be affected, MIS might not give as accurate a picture as under the first option. Zorner discusses the issues of data migration in the context of DSDLs [ZORN87]. Many of our ideas regarding the DSDL are based on the work of the BCS Database Administration Working Group, reported by Zorner.
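The second option (migrate at update time) can be sketched as follows. The routing rule and field names here are illustrative assumptions, not the paper's DSDL: after an update touches a partitioning column, the tuple is re-routed and relocated only if its target sub-partition has changed.

```python
# Sketch of migration at update time (the second option above). The routing
# rule and field names are illustrative assumptions, not the paper's DSDL.
def route(age: int) -> str:
    # assumed two-way age banding for the sub-partitions
    return "under45" if age < 45 else "over45"

def apply_update(tup: dict, new_age: int) -> dict:
    old_part = route(tup["AGE"])
    tup["AGE"] = new_age
    new_part = route(new_age)
    if new_part != old_part:
        tup["PARTITION"] = new_part  # relocate the tuple as part of the update
    return tup
```

A tuple whose age crosses a band boundary (44 to 45, say) migrates immediately, while all other updates leave placement untouched; this is why the update-time option does no unnecessary scanning but lets MIS placement lag behind a periodic-scan scheme only in the rare boundary-crossing cases.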
Note that any partitioning strategy we choose to aid optimisation of MIS queries has minimal effect on OLTP access (there is, of course, the issue of migration). We aim to keep the cardinalities of the sub-partitions approximately equal, so that the MIS scan times for them are equal. However, even if the cardinalities became unequal, this would not lead to any OLTP problems with differing access rates to partitions: what matters is the OLTP access rate to the Transaction Processing logical partitions, and this is unaffected by cardinality inequalities among sub-partitions because the OLTP index spans them.
Finally, we point out that this is a practical system intended for use in a commercial environment. We aim for robustness, and therefore do not envisage vast numbers of partitions as an appropriate strategy. The main reasons are that an increasing number of partitions raises the probability of migration, and that many partitions would require a multiplicity of physical join operations, with the resulting complexity of sequencing and routing.