Use a VB COM component to break large XML files into manageable chunks
XML is the hottest new technology for copying or moving data from one system to another. With this universal markup language, an application can generate in an XML format a data stream that is readable by any other application that understands XML. In addition, the application reading the XML file or stream can be running on any OS, including minimal OSs that run on handheld devices.
However, as more organizations implement XML, they're finding that the applications they use are receiving XML streams that contain anywhere from a dozen to a million entries.
The applications need to read the XML code from a file, use the Microsoft XML Parser (MSXML) to parse the stream into individual fields, and insert that data into a SQL Server database. Because the applications must process such a wide range of data and the XML stream has no limit on the number of records it can contain, parsing the data can be difficult. An application can probably read between 20 and 1000 records at a time into the XML parser without taking too long or exceeding the system's memory. The parser takes an entire XML file or stream and loads the file into the XML Document Object Model (DOM). A program can then use the parser's methods and properties to work with the XML document. If the incoming XML file is large—say 200,000 records—the parser still loads the entire file, which can result in long delays and can even use up all the system's memory.
I attacked the problem of processing large XML files and streams by creating a Visual Basic (VB) COM component to read the XML stream into the parser in chunks. The component reads the XML stream, line by line, from a file, parses the stream into a specified number of lines, then loads those lines into the XML parser. The VB application uses the XML parser to walk through the data, extract individual fields, and insert the fields into a SQL Server database.
Defining the XML Stream
Listing 1 shows an example of an XML file. Organization of the XML data is the key to fast and accurate processing. The first two lines in the file are the XML version and opening tag for the stream:
The closing tag () identifies the end of the file. Each record in the file is a single XML entry named Book:
Attributes such as TitleID, Title, and Notes convey information about each book.
Now, let's look at the VB code that processes this file. Listing 2 shows the class TitleXMLLoader.cls, which processes the XML stream, and Listing 3 shows the VB form frmXMLTester.frm, which tests the class to make sure it works. In production, when you want to process the XML file, an Active Server Pages (ASP) script in a Web application calls the TitleXMLLoader class. When a user tests the application by clicking the Process File button, the VB class's cmdProcessFile_Click event procedure executes the class's ProcessInComingFile method, passing the XML filename to the method to load into the database. The ASP script later calls the same method to process the incoming XML file.
Callout A in Listing 3 shows the ProcessInComingFile method, which uses the FileSystemObject (ofs) to read the XML file from disk. The code in Callout B creates an instance of the FileSystemObject and opens the XML input file. The code then sets the sAllText variable to the XMLVersion and BookStartTag constants, which places the following text in sAllText:
This version tag sets the XML version of the stream that the VB program will load into the XML parser. You need this tag to create a well-formed XML stream; if you omit the tag from the XML stream that you feed to the parser, an error will occur. The Books tag is the parent tag for all items in the incoming XML stream. Setting both of these tags this way lets you break the incoming file into manageable chunks that you can feed into the parser. The ProcessInComingFile method will strip the tags from the incoming stream, then add them back to each chunk of XML that you pass to the parser.
The next few lines of code open the incoming XML file and start the Do While loop that processes the file. The If statement in Callout C compares the current record counter (i) against the iRecordsToProcess constant, which controls the number of records that the program processes from the file. The constant's default setting is five records, but a more reasonable production setting is probably between 100 and 1000 records, depending on each record's content, how much data is in each record, and how fast you can process that data and send it to SQL Server. (You can change the default by changing the iRecordsToProcess variable in Listing 3.) When the current record counter equals the constant you've set, the code resets the record counter (i) and appends the BookEndTag to the text in the sAllText variable. At this point in the application, the sAllText variable contains all the XML code for one chunk. The code then sets the XML parser's async property to False to force synchronous loading of the chunk into the parser and calls the loadXML method to load the XML chunk. After loadXML loads the XML chunk, the code resets the sAllText variable to the version and start tags. The rest of the code in ProcessInComingFile reads the incoming line from the file and checks the line for the version and start tag strings. If neither string appears in the line, the method appends the line to the sAllText variable.
The code in Callout D processes the XML chunk. First, the code sets the sNodeToFind variable to the constant BookTagName, which contains the name of the tag (in this example, Book) in the XML chunk. Next, the code searches for the first tag that matches the sNodeToFind text:
Set objNodeList = xmlDoc.getElementsByTagName(sNodeToFind)
When it finds this text, the code creates an IXMLDOMNodeList object that represents all the Book tags in the XML chunk. Now, the code can process the list of nodes and extract all the attributes.
The first line in the code's For loop creates objNodeList.nextNode, which represents the next node in the list. The second line creates an instance of IXMLDOM NamedNodeMap (objNode.Attributes) that represents the node's attributes. Next, the code executes the getNamedItem method of the IXMLDOMNamedNodeMap object, which extracts data for a specific node and creates a reference of IXMLDOMNode that represents the named item:
Set objNamedItem = objNodeMap.getNamedItem("TitleID")
The code then retrieves the Text property of IXMLDOMNode, extracting the text and setting it to the appropriate variable (sTitleID, in this example):
sTitleID = objNamedItem.Text
The program executes these last two lines of code for each attribute it extracts. The code then calls stored procedure IinsertTitleNew, which Listing 4 shows, to insert the extracted data into the SQL Server table.
New Ways to Move Data
This article's sample VB program shows one way to process XML data from a large XML file or stream into a SQL Sever database. The solution doesn't require any particular SQL Server version, although I used SQL Server 7.0 for testing. (Web Listing 1 shows the command file I used to create this example's sample tables. To access this listing, enter InstantDoc ID 9716 at http://www.sqlmag.com/.)
Related: Importing an XML File
XML is quickly becoming the dominant way to get data into and out of applications, requiring us to think of new ways to handle both small and large amounts of data. You should also investigate the Microsoft SAX2 API, which is better suited to processing large files than the standard MSXML parser. For more information about this API, see the Microsoft article "SAX2 Jumpstart for XML Developers" (http://msdn.microsoft.com/workshop/xml/articles/sax2jumpstart.asp). Microsoft is addressing the need for better XML management by introducing BizTalk Server 2000, which can process incoming and outgoing XML streams, works with a variety of different applications, and should ship later this year. In the meantime, you can use VB, ASP, and a parser (such as MSXML) to generate, process, and save XML data.
Learn more from XML Updategrams.