Condensing meaning from the vapor of nuance.

Shredding XML into a SQL 2005 database

clock July 5, 2007 15:34 by author brian.kuhn

As you may or may not know, SQL 2005 comes with some really great support for handling XML natively. You can find a good overview of these new capabilities on MSDN at "Using XML in SQL Server". In this post I am going to show how you can easily take XML data like an RSS 2.0 feed and using the xml data type methods, select specific XML nodes as SQL columns that could then be inserted into a database. I have also provided a SQL script that demonstrates the concepts and tSQL code used in the post.

You will want to take a look at "XQuery Expressions" while reviewing the SQL code below, as it will help you better understand the query syntax used to select information from the XML data type.

Getting Started

The first step is to represent the XML data in SQL using the new XML data type. In a real world situation you would likely have a stored procedure that accepts a parameter of type XML, but for this post I will be just declaring a variable of type XML and selecting the RSS 2.0 feed XML data into an untyped XML variable like so:

/**************************************************************************************************************
    Define RSS 2.0 feed XML document to process
**************************************************************************************************************/
DECLARE     @Feed        XML

SET    @Feed=
'
<rss version="2.0">

    <channel>

        <title>My Channel Title</title>
        <link>http://localhost/rss</link>
        <description>This is a generic RSS 2.0 syndication feed.</description>

        <cloud 
            domain="rpc.sys.com" 
            port="80" 
            path="/RPC2" 
            registerProcedure="myCloud.rssPleaseNotify" 
            protocol="xml-rpc" 
        />
        <copyright>Copyright 2007, Oppositionally Defiant</copyright>
        <docs>http://www.rssboard.org/rss-specification</docs>
        <generator>Argotic Syndication Framework</generator>
        <image>
            <url>http://localhost/rss/images/image.gif</url>
            <title>Generic Channel Image</title>
            <description>Describes the web site this feed belongs to.</description>
            <link>http://localhost/rss</link>
            <width>88</width>
            <height>31</height>
        </image>
        <language>en-US</language>
        <lastBuildDate>Sat, 07 Sep 2002 09:42:31 GMT</lastBuildDate>
        <managingEditor>john.doe@domain.com (John Doe)</managingEditor>
        <pubDate>Sat, 07 Sep 2002 00:00:01 GMT</pubDate>
        <rating>PICS Rating</rating>
        <textInput>
            <title>The label of the Submit button in the text input area.</title>
            <description>Explains the text input area.</description>
            <name>The name of the text object in the text input area.</name>
            <link>http://localhost/rss/cgi.script</link>
        </textInput>
        <ttl>60</ttl>
        <webMaster>jane.doe@domain.com (Jane Doe)</webMaster>

        <item>

            <title>My Channel Item</title>
            <link>http://localhost/rss/items.aspx?id=1234</link>
            <description>This is a generic RSS 2.0 channel item.</description>

            <author>john.smith@domain.com (John Smith)</author>
            <comments>http://localhost/rss/comments.aspx?id=1234</comments>
            <enclosure 
                url="http://localhost/rss/mp3s/SomeSong.mp3" 
                length="12216320" 
                type="audio/mpeg" 
            />
            <guid isPermaLink="true">http://localhost/rss/2007/07/1234.aspx</guid>
            <pubDate>Sun, 19 May 2002 15:21:36 GMT</pubDate>
            <source url="http://www.domain.com/rss.xml">External Feed Source</source>
            
            <category domain="mscomdomain:ContentType">SQL 2005 XML </category>
            <category domain="mscomdomain:Audience">Database developers</category>
            <category domain="mscomdomain:Operating Systems">Windows</category>
            <category domain="mscomdomain:Subject">.NET development</category>
            
        </item>

        <item>

            <title>Another Channel Item</title>
            <link>http://localhost/rss/items.aspx?id=5678</link>
            <description>This is another generic RSS 2.0 channel item.</description>

            <author>joe.smith@domain.com (John Smith)</author>
            <comments>http://localhost/rss/comments.aspx?id=5678</comments>
            <enclosure 
                url="http://localhost/rss/mp3s/SomeOtherSong.mp3" 
                length="14216320" 
                type="audio/mpeg" 
            />
            <guid isPermaLink="false">http://localhost/rss/2007/07/04.aspx</guid>
            <pubDate>Sun, 20 May 2002 15:21:36 GMT</pubDate>
            <source url="http://www.domain.com/rss2.xml">Another Feed Source</source>
            
            <category domain="mscomdomain:ContentType">SQLXML</category>
            <category domain="mscomdomain:Audience">Application developers</category>
            <category domain="mscomdomain:Operating Systems">Linux</category>
            <category domain="mscomdomain:Subject">Web development</category>
            
        </item>
        
        <category domain="mscomdomain:ContentType">Announcement</category>
        <category domain="mscomdomain:Audience">Application developers</category>
        <category domain="mscomdomain:Audience">Component developers</category>
        <category domain="mscomdomain:Audience">Database developers</category>
        <category domain="mscomdomain:Audience">Developers (general)</category>
        <category domain="mscomdomain:Audience">Game developers</category>
        <category domain="mscomdomain:Audience">Web developers</category>
        <category domain="mscomdomain:Operating Systems">Windows</category>
        <category domain="mscomdomain:Subject">.NET development</category>
        <category domain="mscomdomain:Subject">Web development</category>

        <skipDays>
            <day>Monday</day>
            <day>Tuesday</day>
            <day>Wednesday</day>
            <day>Thursday</day>
            <day>Friday</day>
            <day>Saturday</day>
            <day>Sunday</day>
        </skipDays>

        <skipHours>
            <hour>0</hour>
            <hour>1</hour>
            <hour>2</hour>
            <hour>3</hour>
            <hour>4</hour>
            <hour>5</hour>
            <hour>6</hour>
            <hour>7</hour>
            <hour>8</hour>
            <hour>9</hour>
            <hour>10</hour>
            <hour>11</hour>
            <hour>12</hour>
            <hour>13</hour>
            <hour>14</hour>
            <hour>15</hour>
            <hour>16</hour>
            <hour>17</hour>
            <hour>18</hour>
            <hour>19</hour>
            <hour>20</hour>
            <hour>21</hour>
            <hour>22</hour>
            <hour>23</hour>
        </skipHours>
        
    </channel>
    
</rss>
'

Selecting scalar values

Now that we have an RSS 2.0 feed represented as an XML data type, we can utilize the various query methods available to select the information we are after from the XML data. To begin, lets select the <channel> node information using the value() method, which performs an XQuery against the XML and returns a value of SQL type. This method returns a scalar value. The value() method accepts an XQuery expression that defines what XML element we want to extract and the SQL data-type we wish to cast the element or attribute value to. When using the value() method, the static typing rules require you to explicitly specify that the path expression returns a singleton. Therefore, the additional [1] is specified at the end of the path expression.

Note: I am using a user-defined function called FromRfc822DateTime that takes the RFC 822 date time of the <lastBuildDate> and <pubDate> elements and converts them to a standard SQL date time. For details on RFC 822 date time functions, see my post SQL user-defined functions (UDF) for consuming and generating RFC-822 DateTime's.

 

/**************************************************************************************************************
    A. Process channel
**************************************************************************************************************/
SELECT 
         ---------------------------------------------------------------------------------------------------------------
         --    Extract feed information
         ---------------------------------------------------------------------------------------------------------------
         @Feed.value('(/rss/@version)[1]', 'nvarchar(7)' )                                        AS 'Version'
        
        ---------------------------------------------------------------------------------------------------------------
         --    Extract channel information
         ---------------------------------------------------------------------------------------------------------------
        ,@Feed.value('(/rss/channel/title)[1]', 'nvarchar(255)' )                                AS 'Title'
        ,@Feed.value('(/rss/channel/link)[1]', 'nvarchar(255)' )                                AS 'Link'
        ,@Feed.value('(/rss/channel/description)[1]', 'nvarchar(255)' )                            AS 'Description'
        ,@Feed.value('(/rss/channel/copyright)[1]', 'nvarchar(500)' )                            AS 'Copyright'
        ,@Feed.value('(/rss/channel/docs)[1]', 'nvarchar(255)' )                                AS 'Documentation'
        ,@Feed.value('(/rss/channel/generator)[1]', 'nvarchar(255)' )                            AS 'Generator'
        ,@Feed.value('(/rss/channel/language)[1]', 'nchar(5)' )                                    AS 'LanguageCode'
        ,dbo.FromRfc822DateTime(@Feed.value('(/rss/channel/lastBuildDate)[1]', 'nchar(70)' ))    AS 'LastBuildDate'
        ,@Feed.value('(/rss/channel/managingEditor)[1]', 'nvarchar(100)' )                        AS 'ManagingEditor'
        ,dbo.FromRfc822DateTime(@Feed.value('(/rss/channel/pubDate)[1]', 'nchar(70)' ))            AS 'PublicationDate'
        ,@Feed.value('(/rss/channel/rating)[1]', 'nvarchar(500)' )                                AS 'Rating'
        ,@Feed.value('(/rss/channel/ttl)[1]', 'int' )                                            AS 'TimeToLive'
        ,@Feed.value('(/rss/channel/webMaster)[1]', 'nvarchar(100)' )                            AS 'ManagingEditor'
         ---------------------------------------------------------------------------------------------------------------
         --    Extract channel cloud information
         ---------------------------------------------------------------------------------------------------------------
        ,@Feed.value('(/rss/channel/cloud/@domain)[1]', 'nvarchar(255)' )                        AS 'CloudDomain'
        ,@Feed.value('(/rss/channel/cloud/@path)[1]', 'nvarchar(255)' )                            AS 'CloudPath'
        ,@Feed.value('(/rss/channel/cloud/@port)[1]', 'int' )                                    AS 'CloudPort'
        ,@Feed.value('(/rss/channel/cloud/@protocol)[1]', 'nvarchar(20)' )                        AS 'CloudProtocol'
        ,@Feed.value('(/rss/channel/cloud/@registerProcedure)[1]', 'nvarchar(500)' )            AS 'CloudProcedure'
         ---------------------------------------------------------------------------------------------------------------
         --    Extract channel image information
         ---------------------------------------------------------------------------------------------------------------
        ,@Feed.value('(/rss/channel/image/url)[1]', 'nvarchar(255)' )                            AS 'ImageUrl'
        ,@Feed.value('(/rss/channel/image/title)[1]', 'nvarchar(255)' )                            AS 'ImageTitle'
        ,@Feed.value('(/rss/channel/image/link)[1]', 'nvarchar(255)' )                            AS 'ImageLink'
        ,@Feed.value('(/rss/channel/image/description)[1]', 'nvarchar(500)' )                    AS 'ImageDescription'
        ,@Feed.value('(/rss/channel/image/height)[1]', 'int' )                                    AS 'ImageHeight'
        ,@Feed.value('(/rss/channel/image/width)[1]', 'int' )                                    AS 'ImageWidth'
         ---------------------------------------------------------------------------------------------------------------
         --    Extract channel text input information
         ---------------------------------------------------------------------------------------------------------------
        ,@Feed.value('(/rss/channel/textInput/title)[1]', 'nvarchar(255)' )                        AS 'TextInputTitle'
        ,@Feed.value('(/rss/channel/textInput/description)[1]', 'nvarchar(500)' )                AS 'TextInputDescription'
        ,@Feed.value('(/rss/channel/textInput/name)[1]', 'nvarchar(255)' )                        AS 'TextInputName'
        ,@Feed.value('(/rss/channel/textInput/link)[1]', 'nvarchar(255)' )                        AS 'TextInputLink'

 

Selecting multiple rows

Next lets select the one-to-many relationship of the channel's <item> nodes using the nodes() method, which allows you to identify nodes that will be mapped into a new row. Note that the rowset returned by the nodes() method is an unnamed rowset. Therefore, it must be explicitly named by using aliasing. Please note that the "." designator indicates the current value of the node for the expression.

 

/**************************************************************************************************************
    B. Process channel items
**************************************************************************************************************/
SELECT 
         ---------------------------------------------------------------------------------------------------------------
         --    Extract item information
         ---------------------------------------------------------------------------------------------------------------
         T.c.query('title').value('.', 'nvarchar(255)')                                AS 'Title'
        ,T.c.query('description').value('.', 'nvarchar(2000)')                        AS 'Description'
        ,T.c.query('link').value('.', 'nvarchar(255)')                                AS 'Link'
        ,T.c.query('author').value('.', 'nvarchar(255)')                            AS 'Author'
        ,T.c.query('comments').value('.', 'nvarchar(255)')                            AS 'Comments'
        ,dbo.FromRfc822DateTime(T.c.query('pubDate').value('.', 'nvarchar(70)'))    AS 'PublicationDate'
        
---------------------------------------------------------------------------------------------------------------
         --    Extract item enclosure information
         ---------------------------------------------------------------------------------------------------------------
        ,T.c.query('enclosure').value('(enclosure/@url)[1]', 'nvarchar(255)' )        AS 'EnclosureUrl'
        ,T.c.query('enclosure').value('(enclosure/@type)[1]', 'nvarchar(100)' )        AS 'EnclosureType'
        ,T.c.query('enclosure').value('(enclosure/@length)[1]', 'bigint' )            AS 'EnclosureLength'
         ---------------------------------------------------------------------------------------------------------------
         --    Extract item guid information
         ---------------------------------------------------------------------------------------------------------------
        ,T.c.query('guid').value('(guid/@isPermaLink)[1]', 'bit' )                    AS 'GuidIsPermaLink'
        ,T.c.query('guid').value('.', 'nvarchar(255)' )                                AS 'GuidValue'
         ---------------------------------------------------------------------------------------------------------------
         --    Extract item source information
         ---------------------------------------------------------------------------------------------------------------
        ,T.c.query('source').value('(source/@url)[1]', 'nvarchar(255)' )            AS 'SourceUrl'
        ,T.c.query('source').value('.', 'nvarchar(255)' )                            AS 'SourceTitle'
        
FROM   @Feed.nodes('/rss/channel/item') T(c)

 

Finally lets select the rest of the elements associated to the <channel> and <item> nodes. Please note that in the select statement for the each item's <category> elements, I use the ".." designator in the XQuery expression to retrieve the parent <item> element for each category so that I can map the category to the item it belongs to.

 

/**************************************************************************************************************
    C. Process item categories
**************************************************************************************************************/
SELECT 
         ---------------------------------------------------------------------------------------------------------------
         --    Extract item guid information
         ---------------------------------------------------------------------------------------------------------------
         T.c.query('../guid').value('(guid/@isPermaLink)[1]', 'bit' )    AS 'GuidIsPermaLink'
        ,T.c.query('../guid').value('.', 'nvarchar(255)' )                AS 'GuidValue'
         ---------------------------------------------------------------------------------------------------------------
         --    Extract item category information
         ---------------------------------------------------------------------------------------------------------------
        ,T.c.value('@domain','nvarchar(255)')                            AS 'CategoryDomain'
        ,T.c.value('.', 'nvarchar(4000)')                                AS 'CategoryValue'
FROM   @Feed.nodes('/rss/channel/item/category') T(c)
/**************************************************************************************************************
    D. Process channel categories
**************************************************************************************************************/
SELECT 
        T.c.value('@domain','nvarchar(255)')        AS 'CategoryDomain'
       ,T.c.value('.', 'nvarchar(4000)')            AS 'CategoryValue'
FROM   @Feed.nodes('/rss/channel/category') T(c)
/**************************************************************************************************************
    E. Process channel skip days
**************************************************************************************************************/
SELECT 
         T.c.value('.', 'nvarchar(9)')            AS 'Day'
        ,CASE T.c.value('.', 'nvarchar(9)')
WHEN    'Sunday'    THEN 1
WHEN    'Monday'    THEN 2
WHEN    'Tuesday'    THEN 3
WHEN    'Wednesday'    THEN 4
WHEN    'Thursday'    THEN 5
WHEN    'Friday'    THEN 6
WHEN    'Saturday'    THEN 7
ELSE    0
END                                    AS 'DayOfWeek_EnumValue'
FROM   @Feed.nodes('/rss/channel/skipDays/day') T(c)
/**************************************************************************************************************
    F. Process channel skip hours
**************************************************************************************************************/
SELECT 
         T.c.value('.','int')        AS 'Hour'
FROM   @Feed.nodes('/rss/channel/skipHours/hour') T(c)

 

Conclusion

As you can see, once you have a basic understanding of the XQuery expression syntax and the methods available on the SQL xml data type, extracting information from XML data from within a SQL statement is very easy. There are many advantages to utilizing XMl from within SQL 2005, especially if you are storing information that maps to a commonly used XML schema specification like RSS.

 

Download the tSQL script that demonstrates shredding XML in SQL: SelectRssXmlDataExample.sql (13.91 kb)

Categories: SQL Server | XML
Actions: E-mail | Permalink | Comment Comments (0) | RSS comment feedComment RSS |

Related posts

Add comment


(Will show your Gravatar icon)  

  Country flag




Live preview

September 6. 2008 11:47

Gravatar

Calendar

<<  September 2008  >>
MoTuWeThFrSaSu
25262728293031
1234567
891011121314
15161718192021
22232425262728
293012345