Summary and conclusions

The kid looked up. "Is that it?"

It's something I keep asking myself.

Bob Geldof, Is that It?, 1986.

Following a brief summary of the work presented, the contribution to the field and the general conclusions are itemised. The final section presents areas of future research which stem from the work.

The fundamental limitations of the use of indices and space partitioning methods for reducing the workload of large MIS range queries have been shown. However, under conditions where indices are totally inappropriate, it has been shown that space partitioning may still be both beneficial and feasible.

There are, of course, limitations to the degree of partitioning in each of the dimensions; these are highlighted by a discussion of two representative multi-dimensional partitioning methods, namely the MDD file structure of Liou and Yao [LIOU77] and the Grid File [NIEV84]. It has been shown that the Grid File and related structures are not in themselves solutions to the problems being addressed, viz. how to minimise the MIS workload, and the determination of a suitable number of partitions in each dimension.

Analysis of Liou and Yao's heuristic method [LIOU77] of determination of the number of partitions (for point queries) in each dimension of a multi-column partitioned file has shown that the method is flawed, and an improved one is given in appendix 1.

Assumptions and limitations of current data partitioning and placement in shared-nothing multi-processor database machines have been discussed. A simple yet effective allocation strategy for the OLTP/MIS problem (as far as it relates to applications similar to the one used as a basis for the work) has been given, namely, the derivation of OLTP partitions based on the primary key (or fragments of it which do not contain any semantic information). Given that the OLTP transactions are a random event, the problem of load balancing for these transactions does not arise. Within each OLTP partition data is further partitioned in order to aid MIS queries by reducing the search space. If there is just a single storage disc for each of the OLTP partitions, then the assignment of MIS sub-partitions to disc is trivial i.e. they must all be placed on the disc associated with the OLTP partition. If there is more than one disc for each OLTP partition, then a strategy of round-robin (or other random) placement of the data from these MIS partitions onto the disc partitions guarantees load balancing of MIS queries within and across all OLTP partitions (cf. earlier work which suggested that for each OLTP partition, MIS sub-partitions could be placed on different discs [PEAR91, UNWA92]).

A new formula defining the number of partitions accessed by range queries in a single dimension when the number of partitions is small has been derived. Comparisons have been made with the formula defining the number of pages accessed by range queries in range partitioned files [EAST87]. The formulae are shown to produce converging results as the number of partitions increases. Eastman's formula implies that the size of a query does not affect the scan time reduction which can be obtained by increasing the number of partitions, whereas an equation can be derived from the new formula which explicitly states that the size of a query affects the scan time reduction that can be made by increasing the number of partitions. (The reader should note that the formula presented by Eastman [EAST87] is in no way being criticised, since it was derived under a totally different set of assumptions to that for the formulae derived in this thesis).

A model relating MIS savings due to partitioning and the resultant costs to OLTP from the partitioning has been presented which shows how to determine the columns to partition, and to what level. Both single and multi-column partitioning have been considered. A worked example has been given which shows the feasibility of the method.

The contribution of this work is itemised (in note form).

- A new formula describing the number of partitions accessed by (single dimension) range queries in range partitioned files. Further use of this formula in determining the cost savings which can be made by increasing the number of partitions shows that query size plays a role.
- An improvement to Liou and Yao's [LIOU77] heuristic method of determination of the number of partitions in each dimension in multi-dimensional partitioned files.
- A new proposed data allocation method for multi-processor database machines where large data mining queries are to be run against Transaction Processing data. By using a round-robin (or other random placement method) placement of MIS partition data to physical partitions in the case where data from one MIS partition is placed on more than one disc

a) the cost of partitioning can be related to the logical MIS partitioning, not the actual MIS disc partitions

b) load balancing over these disc partitions for the MIS queries is guaranteed. - A cost model for determination of which columns to partition in combined OLTP/MIS databases in order to aid MIS range queries. Both single and multi-column partitioning is catered for. This model shows

a) migration probability depends on update value, domain range and number of partitions

b) a heuristic method for determination of minimum and maximum migration rates in the multi-column case

c) a heuristic method for determination of maximum and minimum scan time saving and workload saving

d) how to determine an appropriate number of partitions for each partitioning dimension. - Evaluation of a suitable number of partitions for workload reduction for range queries in a single dimension for the case where partitioning costs are unknown; the conclusion is that large numbers of partitions are not appropriate.
- For a set of range queries of a given size on columns with large or continuous domains, have shown how wasted work in the general case is proportional to the inverse of the number of partitions (i.e.
*WW*@ 1/*n*). - Shown the invalidity in the case of large range queries of the suggestion by Copeland et al. [COPE88] that access frequency to columns could be used to determine which columns to partition; the size of queries must also be considered.

Multi-column partitioning to reduce the scan time of large MIS queries run against a Transaction Processing database is a viable proposition. However, there are many constraints and practical considerations; the major ones being re-iterated below.

- The combinatorial costs of multi-column partitioning mean that no more than about six columns can be partitioned in practice for the type of system under consideration (cf. [NIEV84] which suggests a limit of 10 columns).
- A large number of partitions does not aid the resolution of large MIS queries, but does increase the cost.
- The semantics of the database must be considered when determining the number of partitions in each dimension. For example, it would not be wise to partition if a small change in usage patterns would wipe out an otherwise large saving.
- For an MIS query set where there are many small queries and a few large ones it may not be appropriate to partition to aid the small queries, as the overall savings may be small compared to the total cost i.e. the few large queries swamp the many small ones.
- The work indicated how partitioning is beneficial to reduce wasted work, but in order to speed up retrieval of large queries parallel access is appropriate e.g. many discs for each Transaction Processor.
- The fact that the partitioning in IDIOMS is static (cf. the Grid File and other dynamic structures) makes no material difference to the basic results presented. However, these techniques should be investigated, as they might bring other benefits such as reduction in re-organisation frequency, or even elimination of the need for re-organisation.
- The requirement that OLTP is not unduly compromised puts an upper bound on the numbers of partitions in any given dimension.

This section discusses the limitations of the work presented which have not already been considered in the relevant chapters. Possible areas of research which stem directly from the work are presented.

One criticism that may be levelled at the work is that it is application dependent not only in terms of the semantics of the particular database under consideration, but also in terms of the general application domain. This criticism is rejected on the grounds that MIS/data mining applications on transaction processing databases are growing rapidly, and the trend is expected to continue, and further, that it is well accepted that there are no universal optimum file structures.

The investigation of retrieval of data for large MIS range queries has been done under the assumption that communications cost in sending this data for further processing will not be a limiting factor. A fully integrated model should take this limit into account when defining the proportion of spare disc capacity to allocate to MIS queries.

The model of partitions accessed is based on the notion of representative queries. Although this is not likely to affect unduly the estimation of the number of partitions accessed, it tells nothing of the access frequency to individual partitions. In the partitioning method presented here, this would not be a problem, but if partitions were not placed so regularly (e.g. as suggested in [PEAR91] and [UNWA92]) it is possible for load balancing problems to arise.

Two extreme cases are presented in chapter four, namely the case of partitions containing just a single data value, and those containing many possible data values; work focused on the second case. Between these extremes lies the case where there is a small number of possible data values in a partition. Analysis of the number of partitions accessed and the wasted work done in resolving range queries in this case poses an interesting problem.

Investigation into metrics for the determination of the "stability" of the partitioning with respect to changes in parameters is needed. For example, would a small change in OLTP:MIS ratios produce a large change in the overall savings that could be made by partitioning?

Chapter 5 shows how the number of migrations depends on update value, domain range and number of partitions in the case of a single update value. In practice, in some cases (e.g. for cash withdrawal, change of salary) update values are not identical. Chapter 6 simply used an average value where appropriate, which for demonstration purposes is sufficient. However, a detailed investigation of the variation in migration rate with a set of update values could be a fruitful area of research.

MIS partitioning has been predicated on the assumption that there are no indices for purely MIS purposes. This is a reasonable approach for our particular application, but in general, it could be considered an undue restriction, so future work could investigate an augmented cost model which caters for this requirement.

In some cases it may be impossible to attain the goal of equal sized partitions with low cardinality domains. For example, if the data space is partitioned on gender, and the data is skewed, there is nothing that can be done. However, nothing prevents us (at least in abstract) from having different partitioning values for sub-partitions of the male/female partitions. Kerridge and North's DSDL is designed to support any partitioning strategy we require, as well as any allocation of partitions to discs [KERR91b]. For example, if we had a male:female ratio of 3:4, we could create 3 sub partitions for the 'male' partition, and 4 sub-partitions for the 'female' partition, with the result that we have 7 equal sized partitions in total. However, this solution could pose analytic difficulties.

There is a trade-off between the number of partitions and the query size in each dimension, and the number of dimensions. For example, if *QS* is small in one dimension, *i*, but there are many dimensions, then increasing the partitioning of dimension_{i} is probably not a good idea, since it could increase directory costs to a greater extent than the savings, as well as degrading consecutive retrieval. Conversely, in low dimensional partitioning (the degenerate case being a single dimension), if *QS* is small, partitioning can in all probability safely be increased.

It has been shown why large MIS partitions are appropriate for the problem in hand. A static partitioning strategy is currently used in IDIOMS. As discussed, other researchers are investigating the use of the Grid File for databases on shared nothing multi-processor database machines, and it would be appropriate to do likewise, in spite of the known limitations (e.g. directory cost, not designed for low cardinality domains, no guaranteed consecutive retrieval).

As stated by Nievergelt et al. [NIEV84] usage patterns could be monitored and used (in conjunction with data distribution patterns) to dynamically alter the partitioning, so that columns which are not prominent in query predicates are removed from the partitioning scheme. A similar method, but with consideration given also to *QS* as well as just column reference, might allow a more flexible and yet still practical solution to MIS partitioning.

One issue not addressed in this work is the distribution of full and empty pages within a partition, and how this affects the metrics proposed. An implicit assumption has been that there are areas within each partition which are empty, i.e. the required data is not necessarily contiguous from the start of the partition (contiguous data could only improve matters).

Large joins can be a problem, and many methods have been presented to aid these (see Mishra & Eich [MISH92] for a survey). One option is partitioned joins (e.g. 2 × (0.5*n*)^{2} + merge cost is less than *n*^{2}), so a rather obvious addition to the MIS partitioning model would be to incorporate the requirements for joins. Join query predicates would need to be analysed in order to be able to determine a suitable partitioning (in conjunction with that for selection) for each of the possible partitioning dimensions.

Directory cost has not been included in the model since it will be small in the case that there are only a few thousand partitions. However, further research should be done to quantify the cost.

An implicit assumption underlying space partitioning is that data values of the partitioning columns are known. This is perhaps unrealistic, and future work could investigate how unknown (i.e. NULL in SQL terms) data values might be dealt with in a partitioned file structure.

"I also will here make an end of my book. And if I have written well and to the point in my story, this is what I myself have desired; but if meanly and indifferently, this is all I could attain unto." II. Maccabees, The Apocrypha.