Schema Management for Data Integration: A Short Survey

Schema management is a basic problem in many database application domains, such as data integration systems, where users need to access and manipulate data from several databases. In this context, in order to integrate data from distributed heterogeneous database sources, data integration systems demand the resolution of several issues that arise in managing schemas. In this paper, we present a brief survey of the problem of schema matching, which underlies schema integration. Moreover, we propose a technique for integrating and querying distributed heterogeneous XML schemas.


Introduction
Heterogeneous data sets contain data that may be represented using different data models and different structuring primitives. They may use different definition and manipulation facilities, and run under different operating systems and on different hardware [3]. Schemas have long been used in information systems to describe such data sets: they provide a structural representation of data or information. A schema is a model of a data set which can be used for both understanding and querying the data. As diverse data representation environments and application programs are developed, it is becoming increasingly difficult to share data across different platforms, primarily because the schemas developed for these purposes are designed independently and suffer from problems like data redundancy and incompatibility. When different systems interact with each other, it is very important to be able to transfer data from one system to another. This has led to research on heterogeneous database systems. (Multidatabase systems make up a subclass of heterogeneous database systems.) Heterogeneity in databases also leads to problems like schema matching and integration. The problem of schema matching is becoming an even more important issue in view of the new technologies for the Semantic Web [4].
The operation which produces a match between schemas in order to perform some sort of integration between them is known in the literature as a matching operation. Matching is intended to determine which attribute in one schema corresponds to which attribute in another. Performing a matching operation among schemas is useful for many applications, such as mediation, schema integration, electronic commerce, ontology integration, data warehousing, and schema evolution. Such an operation takes two schemas as input and produces a mapping between elements of the two schemas that correspond semantically to each other [29].
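As an illustration, such a match operation over two flat attribute lists can be sketched as follows. The schemas, the normalization rule, and all names here are simplifying assumptions for exposition, not a prescribed algorithm.

```python
# Minimal sketch of a match operation: it takes two schemas (here, simple
# attribute lists) and returns mapping elements, i.e. pairs of attributes
# assumed to correspond semantically.

def normalize(name: str) -> str:
    """Normalize an attribute label for comparison."""
    return name.lower().replace("_", "").replace("-", "")

def match(schema_a: list[str], schema_b: list[str]) -> list[tuple[str, str]]:
    """Return mapping elements: pairs of attributes with equal normalized names."""
    index_b = {normalize(b): b for b in schema_b}
    return [(a, index_b[normalize(a)]) for a in schema_a if normalize(a) in index_b]

pairs = match(["CustName", "Phone_No"], ["custname", "PhoneNo", "Address"])
# pairs == [("CustName", "custname"), ("Phone_No", "PhoneNo")]
```

Real matchers replace the name-equality test with linguistic and structural reasoning, but the input/output shape is the same: two schemas in, one mapping out.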
Until recently, schema matching operations have typically been performed manually, sometimes with some support from graphical tools, and therefore they are time-consuming and error-prone. Moreover, as systems become able to handle more complex databases and applications, their schemas become larger. This increases the number of matches to be performed. The main goal of this paper is to survey briefly the different issues that arise in managing schemas and to show how they are tackled from different perspectives.
The remainder of the paper is structured as follows. Section 2 describes schema heterogeneity. Section 3 presents schema matching approaches. Section 4 introduces schema integration methodologies. Section 5 describes data integration. In Section 6 we present our proposal for a data integration system in the context of heterogeneous XML data sources. Section 7 concludes the paper.

Schema heterogeneity
Schemas developed for different applications are heterogeneous in nature, i.e. although the data is semantically similar, the structure and syntax of its representation are different. Data heterogeneity is classified according to the level of abstraction at which it is detected and handled (data instance, schema or data model). Schema heterogeneity arises due to the different alternatives provided by one data model to develop schemas for the same part of the real world. For example, a data element modelled as an attribute in one relational schema may be modelled as a relation in another relational schema for the same application domain. The heterogeneity of schemas can be classified into three broad categories:
- Platform and system heterogeneity [22] - differences in operating systems, hardware, and DBMS systems.
- Syntactic and structural heterogeneity, which encompasses differences in data model, schema isomorphism [35], and domain, entity definition [14] and data value [10] incompatibilities.
- Semantic heterogeneity - this includes naming conflicts (synonyms and homonyms) and abstraction level conflicts [23] due to generalization and aggregation.

Schema matching
To integrate or reconcile schemas we must understand how they correspond. If the schemas are to be integrated, the corresponding information should be reconciled and modelled in a single consistent way. Methods for automating the discovery of correspondences use linguistic reasoning on schema labels and the syntactic structure of the schema. Such methods have come to be referred to as schema matching. Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing.
To motivate the importance of schema matching, we should understand the relation between a symbol and its meaning. We can consider a word to be a symbol that evokes a concept which refers to a thing. The meaning is in the application that deals with the symbol, and in general in the mind of the designer, not in the symbol itself. Hence, it is difficult to discover the meaning of a symbol. The problem gets more complicated as soon as we move to a more realistic situation in which, for example, an attribute in one schema is meant to be mapped to two more specialized attributes in another schema. In general, we can say that the difficulty of schema matching stems from the lack of any formal way to expose the intended semantics of a schema.
To define a match operation, a particular structure for its input schemas and output mapping must be chosen. Schemas can be represented by an entity-relationship model, an object-oriented model, XML, or directed graphs. In each representation, there is a correspondence among the sets of elements of the schemas: entities and attributes in an entity-relationship model; objects in an object-oriented model; elements in XML; and nodes and edges in graphs. A mapping is defined to be a set of mapping elements, each of which indicates how the elements of the schemas are related.
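The structures above can be made concrete with a small sketch: schemas as directed graphs (each element pointing to its sub-elements) and a mapping as a set of mapping elements. The class name, field names, and example schemas are illustrative assumptions, not a standard representation.

```python
# Illustrative data structures for a match operation's input and output.

from dataclasses import dataclass

# A schema as a directed graph: each element points to its sub-elements.
schema_s = {"Order": ["orderId", "customer"], "customer": ["name"]}
schema_t = {"Purchase": ["id", "buyer"], "buyer": ["fullName"]}

@dataclass(frozen=True)
class MappingElement:
    source: str    # element of the first schema
    target: str    # element of the second schema
    relation: str  # e.g. "=", "subset", "related"

# A mapping is a set of mapping elements relating the two schemas.
mapping = {
    MappingElement("Order", "Purchase", "="),
    MappingElement("customer.name", "buyer.fullName", "="),
}
```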
There are several classification criteria that must be considered when realizing an individual matcher. Matching techniques may consider instance-level data, as in [17,38], or schema-level information [12,15]. Such techniques can map one or more elements of one schema to one or more elements of the other.
Various approaches have been developed over the years; they can be grouped into classes according to the kind of information and the actual idea used:
- Manual approaches. These rely on an expert to solve the matching, for example via drag and drop.
- Schema based approaches. These are based on knowledge of the internal structure of a schema and its relation with other schemas.
- Data driven approaches. Here, the similarities are more likely to be observed in the data than in the schema.
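A schema based approach can be sketched with string similarity on attribute labels, using the standard library's `difflib`. The threshold of 0.6 and the example attribute names are arbitrary assumptions; production matchers combine several such heuristics.

```python
# A minimal schema-based matcher: compare attribute labels by string
# similarity and keep the best match above a (hypothetical) threshold.

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio between two labels, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def schema_match(source: list[str], target: list[str], threshold: float = 0.6):
    """For each source attribute, pick the most similar target attribute."""
    result = {}
    for s in source:
        best = max(target, key=lambda t: similarity(s, t))
        if similarity(s, best) >= threshold:
            result[s] = best
    return result

print(schema_match(["cust_name", "zip"], ["customerName", "zipCode", "city"]))
```

A data driven matcher would instead compare value distributions of the instance data; the interface would look the same.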

Schema integration
Schema integration is the process of combining database schemas into a coherent global view. It is necessary in order to reduce data redundancy in heterogeneous database systems. It is often hard to combine different database schemas because of different data models or structural differences in how the data is represented and stored; there are many factors that may cause schema diversity [6]. Several features of schema integration make it difficult. The key issue is the resolution of conflicts among the schemas, and a schema integration method can be viewed as a set of steps to identify and resolve such conflicts. Schema conflicts represent differences in the semantics that different schema designers associate with syntactic representations in the data definition language. Even when two schemas are in the same data model, naming and structural conflicts may arise.
Naming conflicts occur when the same data is stored in multiple databases but is referred to by different names. They arise when names are homonyms and when names are synonyms. The homonym naming problem is when the same name is used for two different concepts. The synonym naming problem occurs when the same concept is described using two or more different names.
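A common way to resolve synonym conflicts is a synonym table that rewrites attribute names to one canonical name before integration; the table contents below are a hypothetical example.

```python
# Sketch of naming-conflict resolution with a synonym table: synonymous
# attribute names are mapped to a single canonical name.

SYNONYMS = {
    "salary": "wage",      # synonym conflict: two names, one concept
    "wage": "wage",
    "telephone": "phone",
    "phone": "phone",
}

def canonical(attribute: str) -> str:
    """Return the canonical name for an attribute (identity if unknown)."""
    return SYNONYMS.get(attribute.lower(), attribute.lower())

# "Salary" and "wage" now integrate as the same attribute:
# canonical("Salary") == canonical("wage") == "wage"
```

Homonym conflicts go the other way and cannot be fixed by renaming alone; they require context (e.g. qualifying names by their entity) to tell the two concepts apart.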
Structural conflicts arise when data is organized using different model constructs or integrity constraints. Some common structural conflicts are:
- type conflicts - using different model constructs to represent the same data,
- dependency conflicts - a group of concepts related differently in different schemas (e.g. 1-to-1 participation versus 1-to-N participation),
- key conflicts - a different key for the same entity,
- interschema properties - schema properties that only arise when two or more schemas are combined.
The schema integration process involves three major steps:
1. Pre-integration, in which the input schemas are re-arranged in various ways to make them more homogeneous (both syntactically and semantically).
2. Correspondence identification, devoted to identifying related items in the input schemas and precisely describing these inter-schema relationships.
3. Unification, which actually merges the corresponding items into an integrated schema and produces the associated mappings.
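The three steps can be sketched as a pipeline over toy schemas represented as attribute sets; every function body here is a placeholder assumption standing in for much richer real techniques.

```python
# The three-step integration process as a toy pipeline.

def pre_integrate(schemas):
    """Step 1: homogenize, e.g. lowercase all attribute names."""
    return [{a.lower() for a in s} for s in schemas]

def find_correspondences(schemas):
    """Step 2: identify related items; here, attributes shared by name."""
    common = set.intersection(*schemas)
    return {a: a for a in common}

def unify(schemas, correspondences):
    """Step 3: merge into one integrated schema (union of attributes)."""
    return set.union(*schemas)

schemas = [{"ID", "Name"}, {"id", "name", "address"}]
homogeneous = pre_integrate(schemas)
mapping = find_correspondences(homogeneous)
integrated = unify(homogeneous, mapping)
# integrated == {"id", "name", "address"}
```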
A robust integration methodology must be able to handle both naming and structural conflicts. There have been various attempts from different perspectives. The work [25] broadly classifies these attempts into two categories:
- Structural approaches - also called the common data model approach. Here, the participating databases are mapped to a common data model. The problem with such systems is the amount of human participation required: human intervention is needed to qualify the mappings between the individual databases and the common model.
- Semantic approaches - these use a higher order language that can express information ranging over individual databases. Ontology based integration approaches belong to this category. Many research projects (SHOE [21], ONTOBroker [7], OBSERVER [19]) and others use ontologies to create a global schema [20,30].
In the past several years, many systems have been developed in various research projects on data integration using the techniques mentioned above. Here are some of the more prominent representative systems:
- Garlic [11,18] uses an ODMG-93 based object oriented model. It extends ODMG to allow modelling of data items in the case of a relational schema with weak entities.
- TSIMMIS [13,37] and MedMaker [31] were developed at Stanford around 1995. They use the Object Exchange Model (OEM) [32] as a common data model. OEM allows irregularity in data. The main focus is to generate mediators and wrappers based on application specifications.
- MIX [8,3], a successor of TSIMMIS, uses XML to provide the user with an integrated view of the underlying database systems. It provides a query/browsing interface called Blended Browsing and Querying.
These were the prominent techniques in the structural approach. There are many other techniques which use an ontology as a common data model or use ontologies to translate queries over component databases. Below we present some of these techniques:
- Information Manifold [24] employs a local-as-view approach. It has an explicit notion of a global schema/ontology.
- The OBSERVER [28] system uses a different strategy for information integration. It allows individual ontologies and defines terminological relationships between them, instead of creating a global ontology to support all the underlying source schemas.

Data integration
Data integration is the process of combining data at the entity level. After schema integration has been completed, a uniform global view has been constructed. However, it may be difficult to combine all the data instances in the combined schemas in a meaningful way. Combining the data instances is the focus of data integration. Data integration is difficult because similar data entities in different databases may not have the same key. Determining which instances in two databases are the same is a complicated task if they do not share the same key. Entity identification [27] is the process of determining the correspondence between object instances from more than one database. Data integration is further complicated because attribute values in different databases may disagree or be range values. Simply said, data integration is the process which:
- takes as input a set of databases (schemas), and
- produces as output a single unified description of the input schemas (the integrated schema), together with the associated mapping information supporting integrated access to existing data through the integrated schema.
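Entity identification without a shared key can be sketched by comparing non-key attributes. The similarity threshold, the compared fields, and the records below are illustrative assumptions; real entity identification uses far more robust record-linkage methods.

```python
# Sketch of entity identification: decide whether two instances from
# different databases denote the same real-world entity, given that
# their keys are incompatible.

from difflib import SequenceMatcher

def same_entity(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Match on fuzzy name similarity plus exact (case-insensitive) city."""
    score = SequenceMatcher(None, rec_a["name"].lower(),
                            rec_b["name"].lower()).ratio()
    return score >= threshold and rec_a["city"].lower() == rec_b["city"].lower()

a = {"cust_id": 17, "name": "J. Smith", "city": "Prague"}
b = {"kundennr": "A-9", "name": "J Smith", "city": "prague"}
# Different keys (cust_id vs kundennr), yet likely the same entity.
```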
Parent and Spaccapietra [33] present a general data integration process in their survey on database integration. First, they convert a heterogeneous schema to a homogeneous representation, using transformation rules that explain how to transform constructs of the source data models to the corresponding ones in the target common data model. The transformation specification produced by this step specifies how to transform instance data from the source schema to the corresponding target schema. Then, correspondences are investigated, using the semantic descriptions of the data to produce correspondence assertions. Finally, correspondence assertions and integration rules are used to produce the unified schema.
In general, data integration systems can be classified into data-warehouse and mediator-wrapper systems. A data warehouse [9] is a decision support database that is extracted from a set of data sources. The extraction process requires data to be transformed from the source format into the data warehouse format. The mediator-wrapper approach [39] is used to integrate data from different databases and other data sources by introducing a middleware virtual database, called a mediator, between the data sources and the applications using them. Wrappers are interfaces to data sources that translate data into a common data model used by the mediator.
Based on the direction of the mappings between the source schemas and the global (common) schema, mediator-wrapper systems can be classified into so-called global-as-view and local-as-view systems [19,26]. In global-as-view (GAV) approaches [16], each item in the global schema/ontology is defined in terms of the source schemas/ontologies. In local-as-view (LAV) approaches, each item in each source schema/ontology is defined in terms of the global schema/ontology. Methods for query rewriting and query answering using views, including the most important LAV techniques in the literature, are presented in [11].
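The GAV idea can be sketched concretely: each global relation is a view (a query) over the sources, so a global query is answered by unfolding that view definition. The source contents and the view below are toy assumptions.

```python
# GAV sketch: the global relation Person(pid, name, city) is defined as a
# view joining two toy sources.

source_db1 = [("p1", "Alice"), ("p2", "Bob")]   # (pid, name)
source_db2 = [("p1", "Berlin")]                 # (pid, city)

def global_person():
    """GAV view definition: join the sources on pid (city may be missing)."""
    cities = dict(source_db2)
    return [(pid, name, cities.get(pid)) for pid, name in source_db1]

# A query on the global schema unfolds into the view over the sources:
result = [row for row in global_person() if row[2] == "Berlin"]
# result == [("p1", "Alice", "Berlin")]
```

Under LAV the arrow is reversed: each source is described as a view over the global schema, and answering a query requires rewriting it in terms of those views, which is harder but makes adding sources easier.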

Integration and querying XML via mediation
In this section, we propose a general framework for a system for XML data Integration and Querying XML via Mediation (IQXM) [2]. The architecture of IQXM is shown in Fig. 1. IQXM mainly addresses the problem of integrating heterogeneous XML data sources, and can be used for resolving structural and semantic conflicts among distributed heterogeneous XML data. A global XML schema is specified by the designer to provide a homogeneous view over the heterogeneous XML data. An XML mediation layer is introduced to manage: (1) establishing appropriate mappings between the global schema and the schemas of the sources; (2) querying the XML data sources in terms of the global schema. The XML data sources are described by the XML Schema language. The former task is performed through a semi-automatic process that generates local and global paths. A tree structure for each XML schema is constructed and represented in a simple form. This is in turn used for assigning indices manually to match local paths to corresponding global paths. By gathering all paths with the same indices, the equivalent local and global paths are grouped automatically, and an XML Metadata Document is constructed. The Query Translator acts to decompose global queries into a set of subqueries: a global query from an end-user is translated into local queries for the XML data sources by looking up the corresponding paths in the XML Metadata Document.
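The path-grouping step above can be sketched as follows: local and global schema paths carry manually assigned indices, paths sharing an index are grouped into a metadata structure, and query translation is a lookup in that structure. The paths, indices, and function names are illustrative assumptions, not the actual IQXM implementation.

```python
# Sketch of IQXM-style path matching and query translation.

from collections import defaultdict

# (index, path) pairs produced by the semi-automatic assignment step.
global_paths = [(1, "/library/book/title"), (2, "/library/book/author")]
local_paths = [(1, "/catalog/item/name"), (2, "/catalog/item/writer")]

def build_metadata(global_paths, local_paths):
    """Group equivalent global and local paths by their shared index."""
    metadata = defaultdict(dict)
    for idx, path in global_paths:
        metadata[idx]["global"] = path
    for idx, path in local_paths:
        metadata[idx].setdefault("local", []).append(path)
    return dict(metadata)

def translate(global_path, metadata):
    """Rewrite a global query path into local paths via the metadata."""
    for entry in metadata.values():
        if entry.get("global") == global_path:
            return entry.get("local", [])
    return []

metadata = build_metadata(global_paths, local_paths)
print(translate("/library/book/title", metadata))   # ['/catalog/item/name']
```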

Conclusion
In this paper, we have presented some of the problems behind schema management, such as schema matching and schema integration. Schema matching is a basic problem in many database application domains. We have introduced some of the past and current approaches employed to solve these problems. Finally, we have described a framework for an XML data integration and querying system.

Fig. 1: System architecture

- Mermaid [36] uses a relational common data model and allows only relational schema integration.
- Clio [34] was developed by IBM around 2000. It involves transforming legacy data into a new target schema. Clio introduces an interactive schema mapping paradigm, based on value correspondences.
- Pegasus [1] takes advantage of object-oriented data modelling and programming capabilities. It allows the user to access and manipulate multiple autonomous heterogeneous databases.