Chapter 3. Research aims and associated problems

3.1 OVERVIEW

The control of DRAT is centralised in the Data Dictionary process. The dictionary is therefore more than just a repository of metadata. The dictionary placement strategies mentioned earlier (uninformed, partially informed, fully informed with replication at each node) are not relevant, since they deal with decentralised systems, and the DRAT dictionary is centralised. However, the dictionary process could become a bottleneck when multiple queries are being dealt with; this could be solved by splitting the dictionary into a number of separate parallel sub-processes [OATE89].

The OLTP part of the system is not directly relevant to the MIS part, inasmuch as OLTP queries are restricted to a limited and clearly defined set. Partitioning of (OLTP) data will be done with the optimisation of OLTP transactions in mind. The DD will need to know the partition predicates and placement strategy, so that this information can be used when MIS queries are satisfied.

Knowledge gained from the study of centralised control of distributed databases, and multiprocessor based database systems will be used in research. Development will incorporate the requirements of the ISO Information Schema for SQL2 [MELT90, IS090].

Some workers suggest that the cost/performance of semiconductor store will outstrip that of disc in the mid 1990s, and in fact, this is the basis of the EDS parallel relational database machine [ANDR90]. At first sight, this might appear to make some of the work related to the research (ie data placement/allocation) redundant, due to the possibility of storing it in RAM devices. However, this is not the case. There is no reason why RAM storage should not be partitioned in a manner analogous to disc, and assuming that data storage requirements continue to grow as they have done in the past, one might conclude that in fact, this would be just as necessary as it is now with conventional disc storage. (As an aside, the DSDL on which this aspect of the work is based allows for a variety of data storage media, and there is no apparent reason why new technologies should not be catered for).

3.2 AIMS OF RESEARCH, AND PROBLEMS TO ADDRESS

This section describes the objectives of the research, and the associated problems. Theoretical and practical constraints are discussed. How the objectives will be met are set out in chapter 4, Program of Research.

The primary task is to decide what information is needed for all aspects of control of the database machine. This information will be stored in the data dictionary. Broadly, this information can be split into the following five categories (discussed below):

data description
data partitioning and placement
machine description
allocation and control of system resource
statistics

3.2.1 DATA DESCRIPTION

This falls into two categories, namely User Data, and System Data.

User Data is the data contained in database that is defined by the Database Administrator using the SQL Schema Definition Language (also known as a Data Definition Language - DDL). This is not an issue in the research.

System Data is all the metadata that is required for the DD to function. Included in this is data contained in the SQL2 Information Schema (which now includes our additions) This data defines all the information that is contained within the five categories mentioned above.

An aim of the research is to define SQL tables that can be used to contain data that encapsulates the knowledge requirements. These tables will be an extension of SQL.

3.2.2 DATA PARTITIONING AND PLACEMENT

The DRAT design allows for a table to be spread across more than one data storage (DS) processor. Theoretically, we could have horizontal or vertical (or a combination) partitioning of tables. Only horizontal partitioning will be considered (at least initially), since data storage and selection is defined in terms of rows (or to be exact, values of columns within tuple instances). If vertical partitioning of files/relations were to occur, this inherent simplicity would be lost. Consideration of vertical partitioning would require additions to the DSDL that is being used as a definition of the functionality requirements of partitioning and placement.

Initially, data will be statically partitioned in DRAT. However, for a horizontally partitioned relation, it would be possible to introduce dynamic allocation (data migration); this is not practical for a table that is partitioned vertically, as it would require the addition of fields to one partition, and deletion of those fields in another.

Data partitioning and data placement strategies in distributed systems are clearly tightly bound (although some workers treat them as separate issues), and a model which unites the two would be advantageous. This has a corollary in our multiprocessor system. With a simple implementation of DRAT, where all the data storage elements are of the same type, the problem of the relationship between partitioning and placement does not arise. However, the DSDL which will be used in this research allows data storage elements to be of different types (eg disc, optical), and so we must consider the relationship between partitioning and placement on different storage media.

The placement of partitions will follow Kerridge's DSDL. In other words, a logical partition will be allocated to a number of pages within one or more data storage elements. The storage within each set of pages will be hash-based, indexed, or sequential. Hanson's round-robin, and VRP may be considered later. Note that the access mechanisms may involve the use of an index which may be binary, B, B+, or B*.

One of the aims of the research is to relate data placement partition predicates to query selection predicates. To do this, it will be necessary to devise some mechanism for representing predicates in some 'canonical form'.

3.2.3 MACHINE DESCRIPTION

A prerequisite for the allocation of system resource is that there is a description of the system. This includes both the physical resources available (eg relational processors,, data storage elements) , and software resources (Table Handlers, ie engines which deal with reading and writing).

The physical interconnection topology and mapping of logical processes to hardware needs to be described.

3.2.4 ALLOCATION AND CONTROL OF SYSTEM RESOURCE

In the DRAT machine, allocation and control of resource refers to the MIS part of the overall system. Allocation is necessary because there is not an infinite amount of resource available. in particular, in the hardware there is a limited number of relational operation engines, and the Front End is able to accept only a finite number of data streams. On the software side, although there is one Table Handler for each partition, within each Table Handler there is a finite number of Select, Update, Insert, and Delete engines. It may be necessary to sequence their use, if many queries all need to use one particular Table Handler.

3.2.5 STATISTICS

It is possible for OLTP or MIS data placement to become unbalanced, which could cause a reduction in the performance of the system. This is particularly problematical in the case of the OLTP sub-system. By running queries against the statistical data that will be generated, it should be possible to notice this, and take steps to rectify the problem. we envisage, initially, that the system would generate a message stating that, for example, the performance of certain transactions or queries has been reduced, and that re-placement of data on physical storage, under new partition or placement rules is suggested. A long-term goal of the research programme is for the system to do this re-placement automatically.

The purpose of keeping statistical information about the system is to allow us to do this. We need to know such things as query types; transaction times; numbers of MIS queries per unit time; number of OLTP transactions, and how this varies over time; system states - free Relational (R) processors, delayed MIS queries, rollbacks, commit/rollback ratio, tables accessed. Full requirements are not known at this stage.

3.3 AREAS OUTSIDE THE SCOPE OF THE RESEARCH

The data dictionary is just one small, though crucial, part of a database machine. There are many problems and/or design considerations which need to be addressed in the development of such a machine, but which are outside the remit of this research project. Some are listed below:

Reliability, data reduction (ie encryption), security of the data dictionary itself [JANS89].
Replication for availability (replication for archiving purposes is considered, due to its being in the DSDL).
Operation of the system under hardware and software fail conditions.
Metrics for evaluation of the design (those for evaluation of placement strategy, resource allocation, etc will be considered).
Commercial aspects, such as cost (however, the DRAT/IDIOMS architecture has been designed with the needs of the marketplace in mind).
Physical architecture (eg ring, star, MIN), (Suffice to say that the ring architecture proposed allows a direct mapping between logical processes and physical processors, with a maximum of concurrency. Additionally, some of the work itself is intended to be applicable to any loosely-coupled multiprocessor database machine).
Physical and electrical limitations to the size of the machine [HART87].
Algorithms for relational operations (eg hash join, or nested loop join).
A distributed DRAT system (ie many DRAT units functioning as a distributed database).
Investigation of limiting factors in the machine (eg is it interprocessor communication, data access, or data manipulation/processing). (Ideally, these would be balanced).
Deadlock detection.

Two areas in which related research is being conducted by others at Sheffield University are Recovery, and Query optimisation. The former is not directly relevant to this research, although one should note that there is a case for including certain aspects of recovery information, such as the location of Shadow Pages, or log files, in the Data Dictionary.

Logical Query optimisation, however, is related to this work, inasmuch as there will be a mapping from the logically optimised query to a physical access path. Under optimal conditions (ie when sufficient processing resource is available) there will be a simple mapping, all processes will proceed concurrently. On the other hand, a logically optimised query will have to be processed (at least in part) in a sequential manner if sufficient resource is not available.