ScrapeText is a simple function that I use to pull plain text out of strings that contain markup-language formatting. I wrote this function to help build a report for a questionnaire system. The report needed to display only the plain text of each question. However, the text of each question was stored in heavily formatted HTML.
I searched online for a solution but didn’t find anything useful.Then, in a moment of clarity,a solution came to me:Write a function that accepts an HTML string as input and returns a plain text string by simply skipping over any text contained within the <and> tags. For example, if you run the code
SELECT dbo.ScrapeText (‘<a href=”www.sqlserver.com”> SQL Server</a>
the result is "SQL Server".
I wrote the ScrapeText function for SQL Server 2005 and SQL Server 2000. Listing 1 shows the code that does the main processing.This code first wraps the input string (which can be a string or an alphanumeric column in a table) with the > and <tags. The code then searches the input string for any nonbreaking spaces ( ), replacing them with standard spaces.
After the initial data cleansing is complete, the code examines the input string one character at a time from left to right. An imaginary pen is ready to start copying the desired characters to an output string. When a > tag is encountered, the pen is put to paper because it has reached the beginning of a desired string of text. The pen then writes each character to the output string until a <tag is encountered, at which point the pen stops writing. This process continues until the end of the input string is reached.