Using Service Broker to move data asynchronously from one application to another, I've become comfortable with parsing and loading XML data into SQL Server tables. If the number of rows is relatively small, Xquery is fast and effective. In my experience, however, when the number of rows grows, so does the amount of time it takes to parse the XML, and so another solution is in order.
To understand the problem, let's look at a table called Products, which you're going to load in T-SQL using Xquery. Here's the DDL schema for the table.
CREATE TABLE [dbo].[Products]( [ProductID] [int] NOT NULL, [ProductName] [nvarchar](40) NOT NULL, [SupplierID] [int] NULL, [CategoryID] [int] NULL, [QuantityPerUnit] [nvarchar](20) NULL, [UnitPrice] [money] NULL, [UnitsInStock] [smallint] NULL, [UnitsOnOrder] [smallint] NULL, [ReorderLevel] [smallint] NULL, [Discontinued] [bit] NOT NULL, CONSTRAINT [PK_Products] PRIMARY KEY CLUSTERED ( [ProductID] ASC ) ) GO
The source XML file matches the table, with the column data defined as elements, and looks like this.
1 Bardudollantor 1 1 plurissimum transit. 338.3698 23645 -7654 26584 1
2 Grojubor 2 2 Id imaginator 284.4184 5101 24330 -21976 1
Now, using T-SQL, you need to define an XML variable, load it from the source file, then use Xquery to parse and load the file contents into the Products table. Here's the T-SQL code.
DECLARE @messagebody XML SELECT @messagebody = BulkColumn FROM OPENROWSET(BULK 'C:\Work\Products1000.XML', SINGLE_CLOB) AS X INSERT INTO [dbo].[Products] select a.value(N'(./ProductID)', N'int') as [ProductID], a.value(N'(./ProductName)', N'nvarchar(40)') as [ProductName], a.value(N'(./SupplierID)', N'int') as [SupplierID], a.value(N'(./CategoryID)', N'int') as [CategoryID], a.value(N'(./QuantityPerUnit)', N'nvarchar(20)') as [QuantityPerUnit], a.value(N'(./UnitPrice)', N'money') as [UnitPrice], a.value(N'(./UnitsInStock)', N'smallint') as [UnitsInStock], a.value(N'(./UnitsOnOrder)', N'smallint') as [UnitsOnOrder], a.value(N'(./ReorderLevel)', N'smallint') as [ReorderLevel], a.value(N'(./Discontinued)', N'bit') as [Discontinued] from @messagebody.nodes('/Products/row') as r(a);
Using SET STATISTICS TIME ON, you see that if the XML data has 1,000 rows, these are the times, first to load the file into the XML variable, then to parse and load the XML data into the Products table.
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 11 ms. SQL Server Execution Times: CPU time = 156 ms, elapsed time = 171 ms.
Not too bad, but it's only 1,000 rows. For comparison, let's look at the same process in PowerShell.
To begin, you need an XML schema file, to define the data types for each of the columns. If you don't define the schema, the columns will all be defined as String values, and you won't easily be able to load the data into the table. Here're the contents of the XML schema file for the Products.xml file.
Now, this may look complicated, but you can let PowerShell do most of the work to generate this schema file. Start by connecting to SQL Server and defining a query against the Products table that returns no results.
$cn = new-object System.Data.SqlClient.SqlConnection("Data Source=WS12SQL;Integrated Security=SSPI;Initial Catalog=TestDB"); $cn.Open() $cmd = New-Object System.Data.SqlClient.SqlCommand $cmd.CommandText = "SELECT * FROM dbo.Products WHERE 1 = 2" $cmd.Connection = $cn $da = New-Object System.Data.SqlClient.SqlDataAdapter $da.SelectCommand = $cmd
Then, fill the DataSet object using the query, and use the WriteXmlSchema() method to write out the file.
$ds = New-Object System.Data.DataSet $da.Fill($ds, "Products") | Out-Null $ds.WriteXmlSchema('C:\Work\ProdSchema.xsd')
You'll have to modify the xsd file to match the XML structure in your data file. The generated xsd file has a schema ID and top level element value of "NewDataSet," and that needs to be changed in this case to "Products." Then, change the element name of "Products" to a name of "row" for the generated schema to work. The column definitions will all work for you.
Use the same method to bulk load the data into the table used in the article "Bulk Copy Data into SQL Server with PowerShell." To begin, you need to have the data in an ADO.NET DataTable—this works in your favor, as the DataSet object, which contains the DataTable, has two useful methods called ReadXML() and ReadXMLSchema().
Next, you need to create a DataSet object and use the ReadXMLSchema() method to load the schema into the DataSet.
$ds = new-object "System.Data.DataSet" "dsServers" $ds.ReadXmlSchema('C:\Work\ProdSchema.xsd')
Then, you can use the ReadXML() method to load the XML data into the DataSet. Note that because you've already loaded the XML schema, the process ReadXML() uses to determine the schema is ignored. You'll then load the lone table in the DataSet into the $dtProd variable.
$ds.ReadXml("C:\Work\Products1000.XML") $dtProd = $ds.Tables
Now, you have a DataTable with our Product data and you can use the SqlBulkCopy method to quickly load the data into the table.
$cn = new-object System.Data.SqlClient.SqlConnection("Data Source=WS12SQL;Integrated Security=SSPI;Initial Catalog=TestDB"); $cn.Open() $bc = new-object ("System.Data.SqlClient.SqlBulkCopy") $cn $bc.DestinationTableName = "dbo.Products" $bc.WriteToServer($dtProd) $cn.Close()
Now, you need to measure the time it took to load the data. PowerShell provides a way to do this using the Measure-Command cmdlet. To simplify the process, I put the preceding PowerShell code into a script called Load-XmlData.ps1, and I executed the following command.
This returned some interesting results.
Days : 0 Hours : 0 Minutes : 0 Seconds : 0 Milliseconds : 62 Ticks : 628365 TotalDays : 7.27274305555556E-07 TotalHours : 1.74545833333333E-05 TotalMinutes : 0.001047275 TotalSeconds : 0.0628365 TotalMilliseconds : 62.8365
This shows that the total time for loading the 1,000 row XML file using T-SQL was 182ms (11ms to load the file, 171ms to parse and load the table), where the total time for running the script to load the XML file to the DataSet, then bulk copy the data into SQL Server was 63ms, roughly one third of the time, to do the same thing.
If you bump up the size of the XML file from 1,000 rows to 10,000 rows, the T-SQL timing returns these results.
SQL Server Execution Times: CPU time = 593 ms, elapsed time = 775 ms. SQL Server Execution Times: CPU time = 1607 ms, elapsed time = 1728 ms.
Running the same data through the PowerShell script using Measure-Command returns these times.
Days : 0 Hours : 0 Minutes : 0 Seconds : 0 Milliseconds : 742 Ticks : 7429740 TotalDays : 8.59923611111111E-06 TotalHours : 0.000206381666666667 TotalMinutes : 0.0123829 TotalSeconds : 0.742974 TotalMilliseconds : 742.974
So, the T-SQL solution ran in 2503ms total, where the PowerShell solution ran in 743ms, more than three-and-a-half times faster.
As you can see, PowerShell is more appropriate for loading large data sets into SQL Server from XML than T-SQL.