Sunday, November 29, 2009

XML Data and Document Storage in SQL Server 2005

XML is a platform-independent data representation format based originally on SGML. Since its popularization, it is becoming used as a data storage format. It has its own type system, based on the XML Schema Definition language (XSD). Both XML and XSD are W3C standards at the Recommendation level. An XML schema defines the format of an XML document as a SQL Server schema defines the layout of a SQL Server database.

The XML type system is quite rigorous, enabling definition in XML Schema Definition language of almost all of the constructs available in a relational database. Because it was originally designed as a system that could represent documents with markup as well as what is traditionally thought of as data, the XML type system is somewhat bifurcated into attributes and elements. Attributes are represented in the XML serialization format as HTML attributes are, using the name=’value’ syntax.

Attributes can hold only simple data types, like traditional relational attributes. Elements can represent simple or complex types. An element can have multiple levels of nested sub elements. This means that an element can be used to represent a table in a relational database. Each row would be represented as a child element, with relational attributes (columns) represented as either attributes or sub elements.

The two ways of representing relational column data in XML are known as element-centric mapping (where each column is a nested subelement) and attribute-centric mapping (where each column is an attribute on an element row) Since sub elements can be nested in XML documents, a document more closely corresponds to a hierarchical form of data than a relational form.

This is reinforced by the fact that, by definition, an XML document must have a single root element. Sets of elements that do not have a single root element are called document fragments. Although document fragments are not well formed XML documents, multiple fragments can be composed together and wrapped with a root element, producing a well-formed document. In addition to being able to represent relational and hierarchical data, XML Schema Definition language can represent complex type relationships.

XSD supports the notion of type derivation, including derivation by both restriction and extension.
This means that XML can directly represent types in an object hierarchy. A single XML schema document (which itself is defined in an XML form specified by the XML Schema Definition language) represents data types that scope a single XML namespace, although you can use XML namespaces in documents without having the corresponding XML schema. An XML namespace is a convenient grouping of types, similar to a userschema in SQL Server.

An XML schema defines the namespace that its types belong to by specifying the target Namespace attribute on the schema element. An XML document that uses types from a namespace can indicate this by using a default namespace or explicitly using a namespace prefix on each element or attribute of a particular type. Namespace prefixes are arbitrary the xmlns attribute established the correspondence between  namespace prefix and namespace.

This is analogous to using SQL Server two-part or three-part names in SQL statements. Only when an XML document contains types defined by XML schemas is it possible to determine the exact data types of elements or attributes. XML elements and attributes are data type string by definition. A predecessor to XML schemas, known as Document Type Definition (DTD), was primarily concerned with defining document structure and allowed only limited information about data types.

XSD is a superset of the aforementioned type systems, including all the DTD structure types. Using an XSD schema or schemas to determine if a document is correct is known as schema validation. Schema validation can be thought of as applying type constraints and declarative integrity constraints to ensure that an XML document is correct.

A nonvalidated XML schema still must conform to XML well-formedness rules, and a single XML document adheres to a set of rules known as the XML Information Set (Infoset), consisting of structure and some content information. Validating an XML document against schemas produces what is called a Post-Schema-Validation InfoSet (PSVI). The PSVI information makes it possible to determine a strong, well-defined type for each XML element and attribute.

SQL Server 2005 introduces an XML data type. This data type can be used in table definitions to type a column, as a variable type in Transact-SQL procedural code, and as procedure parameters. In addition, columns, variables, and parameters of the XML data type can be constrained by an XML schema.
XML schemas are defined in the SQL Server catalog. XML, like relational databases, has its own query language optimized for the data format.

Since XML data is hierarchical, it’s reminiscent of a hierarchical file system. The archetypical query language for XML documents is known as XPath. Queries in XPath reflect the hierarchical nature of XML, since nodesets are selected by using syntax similar to that used to specify files in the UNIX file system. As an example, when a typical XML document is queried using a hierarchical XPath query, the result is a nodeset ontaining all the nodes at that level of hierarchy.

Like a SQL query, an XPath query simply produces a resultset consisting of possibly multiple instances of items. Unlike in SQL, these results are not always rectangular in shape. XPath results can consist of nodesets of any shape or even scalar values. In SQL, database vendors can implement a variation of SQL-PSM (persistent stored modules) that composes possibly multiple SQL queries and some procedural code to produce a more complex result.

SQL Server’s variation of SQL-PSM is known as Transact-SQL. XML processing libraries implement an analogous concept by using an XML-based nonprocedural language called XSLT. Originally meant to produce nice looking HTML pages from XML input, XSLT has evolved into an almost full-fledged programming language. Vendors have even added proprietary extensions to XSLT to allow it to execute code routines in procedural programming languages like Visual Basic or C#.

Since XPath and XSLT were not originally developed to process large amounts of data or data from multiple sources, a new programming language for XML, known as XQuery, has been developed. XQuery implements many of the best features of XPath and XSLT, and is developed from the ground up to allow queries that include multiple documents. It is also designed specifically to be optimizable. In addition, it adds some of the syntax features of SQL.

XQuery’s data can be strongly typed; this also assists in query optimization. XQuery includes a query language, the equivalent of SQL Server SELECT, but does not define a standard implementation of DML, SQL Server’s INSERT, UPDATE, and DELETE statements. SQL Server 2000 allowed users to define mapping schemas (normal XML schemas with extra annotations that mapped XML items and concepts to SQL items and concepts) that represented all or a portion of the database as a virtual XML document, and issue XPath queries against the resulting data structure.

In addition, SQL Server 2000 extended Transact-SQL to enable relational resultsets to be returned as XML. This consists of support for a FOR XML clause. Three different subcategories of FOR XML are supported. The SQL Server 2000 support allowed XML document composition from relational data and XML document decomposition into multiple relational tables.

SQL Server 2005 extends this support by adding direct support for XQuery. The XQuery engine runs directly inside SQL Server, as opposed to XPath support in SQL Server 2000. XPath support in SQL Server 2000 is accomplished by a portion of the SQL Server OLE DB provider (SQLOLEDB) that took a mapping schema and an XPath query, produced a SELECT... FOR XML query and sent that to SQL Server.

Native support for XQuery, combined with XQuery’s design for optimization, and support for multiple documents (a series of XML columns) should improve on the already good support for querying XML data.

Source: https://www.nilebits.com/blog/2009/12/xml-data-and-document-storage-in-sql-server/