XML Data

Storing and working with XML data in BlobCity

BlobCity offers exhaustive support for XML storage. It allows storage and retrieval of XML, along with running queries on XML as if the data was in a tabular structure.

A schema inference engine processes the XML document or object being submitted to BlobCity and maps to a tabular schema for appropriate indexing and search-ability. The XML elements are converted to column names, and the values encapsulated within the elements associate with the value of the particular column. Nesting of XML element is permitted with the nested tags forming a column name by concatenating the tags separated by a ‘.’ (dot).

Let us start with an empty table and insert the following XML document into it.

<?xml version="1.0" encoding="UTF-8"?>
<user>
  <name>John</name>
  <age>45</age>
</user>

The collection after taking the insert of the above XML appears as shown below.

_iduser.nameuser.age
xxxJohn45

The insert operation results in two columns getting automatically created within the collection. The _id column is a self created mandatory column in each table, and acts as a way for the database to differentiate between records. The columns created due to insert of the XML are named user.name and user.age. Although not visible within a collection schema, the name and age columns are actually placed within the user column. The user column will not show up in a collection schema but can be used as a column in a select query. The user column however may not be used as in a where clause

select `user` from ds.Users
select * from ds.Users where `user` = '....'

The response of all SQL queries is defaulted to a JSON array and is consistent with the response received when JSON was inserted into the table. The XML is automatically mapped to equivalent JSON when running a SQL query. The database however stores the XML, and it can be fetched in its original form if explicitly specified. However the record cannot be retrieved in XML form as part of the SQL search query.

SELECT * FROM ds.Users
[
  {
    "user": {
      "name": "John", 
      "age": 45
    }
  }
]

The response is a JSON array with each record being a JSON object entry within the array. A major difference between the working of the JSON and working of the XML document is that the name and age value actually gets nested within the user key as was present in the original document. This may happen to be the case for most XML documents or objects you use. A standard practice with XML encapsulates the entries within a root tag, which results in the nesting.

The root tag and the XML format specification tag is not mandatory for BlobCity to infer the XML document. Let us see the table with an insert without the xml element at the beginning of the document.

<user>
  <name>Tom</name>
  <age>26</age>
</user>
_iduser.nameuser.age
xxxJohn45
xxxTom26

The insert is successful with the collection getting an additional record. The xxx is the placeholder for auto-defined values, with the actual record containing a unique value that is auto generated by the database.

To explicitly specify the _id value for any element, the following XML structure needs to be used.

<?xml version="1.0" encoding="UTF-8"?>
<_id>1000</_id>
<user>
  <name>Mary</name>
  <age>30</age>
</user>
_iduser.nameuser.age
xxxJohn45
xxxTom26
1000Mary30

The _id value for the new record will corresponding to the _id passed at time of insert.

Nested XML's are also supported. An example of a nested XML is shown below.

<?xml version="1.0" encoding="UTF-8"?>
<user>
  <name>Stacy</name>
  <age>43</age>
  <addr>
    <line1>line1</line1>
    <line2></line2>
    <state>IL</state>
    <zip>60002</zip>
  </addr>
</user>
_iduser.nameuser.ageuser.addr.line1user.addr.line2user.addr.stateuser.addr
xxxJohn45
xxxTom26
1000Mary30
xxxStacy43line1IL60002

Collection post insertion of a nested XML

Four new columns are added into the table. These columns are added under a sub-column of user.addr. While user.addr is not an actual column that stores data.user.addr.line1, user.addr.line2, user.addr.state, user.addr.zip are actual columns. However user.addr can be used in a n SQL query to select all of its nested columns.

The created columns will remain in the schema unless explicitly deleted, even if the record for Stacy is deleted.

No root tag

While XML documents by general convention have a root tag, the same can be skipped when inserting an XML into BlobCity. The database supports interpretation of XML documents that have an XML like structure without a root tag present. The root tag in the XML that we used so far has been the <user></user> tag.

<?xml version="1.0" encoding="UTF-8"?>
<name>Stacy</name>
<age>43</age>
<addr>
  <line1>line1</line1>
  <line2></line2>
  <state>IL</state>
  <zip>60002</zip>
</addr>
_idnameageaddr.line1addr.line2addr.stateaddr.zip
xxxStacy43line1IL60002

When the above XML is inserted in a new empty table, the columns created are not under the user root tag as was previously the case. It is important to note that the structure of this table is exactly same as the table created by inserting an equivalent JSON object.

Arrays in XML
When XML documents have a tag that is repeated, depending on the position of the tag, the inference engine consider occurrence of the tag as separate records, or a collection field within a single record. If the root tag is repeated, and no other tag occurs at the root level, then each root tag and its contents are treated as a separate record. If sub-tags are repeated, or a tag at the root level is repeated, but there are other tags other than the repeated tag at the root level, then contents of each repeated tag are collected into the single column as a collection field.