When working with SOAP web services or RSS feeds, you are often dealing with XML documents. Some services send XML content with values enclosed within <![CDATA[ some text ]]>. I recently worked on an integration project syncing FogBugz, whose API returns XML, with Service Cloud, and I had to parse CDATA sections. It was not trivial, but I got it to work!
Here’s an example of what XML looks like with CDATA tags:
<?xml version="1.0" encoding="UTF-8"?>
<Root>
    <SomeNode><![CDATA[ <b>contains HTML tags</b> ]]></SomeNode>
</Root>
In XML documents, CDATA (Character Data) is content within a tag that itself could be interpreted as XML tags but that you want to be interpreted as plain text.
In the above example, when parsing the value of <SomeNode> I would expect to retrieve the value <b>contains HTML tags</b>. Without the enclosing CDATA tags, an XML parser would treat <b> and </b> as more XML tags rather than as plain text content, and I would retrieve only the value contains HTML tags (no <b> </b> tags).
Reading XML in Apex
Salesforce provides two ways to read XML in Apex: DOM parsing (the Dom.Document and Dom.XMLNode classes) and stream reading (the XmlStreamReader class).
DOM parsing, in my opinion, is by far the easiest and least verbose option. You ask the document to give you elements by name and most of the heavy lifting of parsing the content is done for you. A developer’s paradise!
Stream reading is more low-level: your code reacts to events like the start or end of a tag, the start or end of content, etc., and it must maintain context to know what to do with the content it just consumed. This is too low-level for my patience threshold.
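To make the contrast concrete, here is a minimal sketch of both styles reading the same small document. It assumes a document with no CDATA and no namespaces, the variable names are purely illustrative, and it is written to run as anonymous Apex rather than as a definitive implementation.

```apex
String xml = '<Root><SomeNode>plain text</SomeNode></Root>';
String target = 'SomeNode'; // illustrative element name

// DOM parsing: load the document, then ask for the element by name.
Dom.Document doc = new Dom.Document();
doc.load(xml);
Dom.XmlNode someNode = doc.getRootElement().getChildElement(target, null);
String domValue = someNode.getText();

// Stream reading: walk the events and keep track of where we are ourselves.
XmlStreamReader reader = new XmlStreamReader(xml);
Boolean insideTarget = false;
String streamValue = '';
while (reader.hasNext()) {
    XmlTag eventType = reader.getEventType();
    if (eventType == XmlTag.START_ELEMENT && target.equals(reader.getLocalName())) {
        insideTarget = true;
    } else if (eventType == XmlTag.END_ELEMENT && target.equals(reader.getLocalName())) {
        insideTarget = false;
    } else if (eventType == XmlTag.CHARACTERS && insideTarget) {
        // Text may arrive in more than one chunk, so append rather than assign.
        streamValue += reader.getText();
    }
    reader.next();
}

System.debug(domValue);
System.debug(streamValue);
```

The DOM half is three lines of real work; the stream half needs a flag and a loop just to know which text belongs to which element.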
Trying to Read CDATA in XML in Apex
When it comes to parsing CDATA sections in XML, there is a feature disparity between the DOM and stream options. DOM parsing ignores CDATA sections (likely a bug in the platform), so you cannot retrieve the values of those nodes. Stream parsing, however, does recognize CDATA sections, and your code can parse the value out.
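For reference, a stream-based reader that picks up CDATA content might look like the sketch below. The class and method names are hypothetical. Since I am not assuming whether the reader reports a CDATA section as a CDATA event or as ordinary CHARACTERS, the code accepts both.

```apex
public class CdataStreamSketch {
    // Hypothetical helper: returns the text found inside the first element
    // named elementName, including the content of any CDATA sections.
    // Text of nested child elements, if any, is included as well.
    public static String readElementText(String xml, String elementName) {
        XmlStreamReader reader = new XmlStreamReader(xml);
        Boolean insideTarget = false;
        String text = '';
        while (reader.hasNext()) {
            XmlTag eventType = reader.getEventType();
            if (eventType == XmlTag.START_ELEMENT && elementName.equals(reader.getLocalName())) {
                insideTarget = true;
            } else if (eventType == XmlTag.END_ELEMENT && elementName.equals(reader.getLocalName())) {
                break; // finished with the first matching element
            } else if (insideTarget
                    && (eventType == XmlTag.CHARACTERS || eventType == XmlTag.CDATA)) {
                text += reader.getText();
            }
            reader.next();
        }
        return text;
    }
}
```

Calling CdataStreamSketch.readElementText with the example document and the name SomeNode is meant to return the markup inside the CDATA section as plain text.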
But as I mentioned earlier, DOM parsing is much easier and my preferred method of reading XML. So we push on trying to solve this dilemma.
In fact, this is an old problem and has plagued developers for several years.
Almost as soon as the feature was available, Thomas Hoban submitted an IdeaExchange request to support CDATA with the Dom.Document and Dom.XMLNode classes.
Just a few years ago, Martin Verdejo asked on Salesforce Stack Exchange for help in his struggle to read CDATA using the Dom.XMLNode class.
Is there any hope?
ARE WE DOOMED TO PARSE ALL XML USING STREAMS?!?!
But no, we are not doomed. Otherwise I would not be writing this blog post =)
There Must Be a Simple Solution!
Martin’s approach was to replace the <![CDATA[ and ]]> text with an empty string so that the markers are removed from the XML content. That’s a straightforward approach; however, it defeats the purpose and protections of the CDATA tags in the first place.
Remember, in XML documents, CDATA (Character Data) is content within a tag that itself could be interpreted as XML tags but that you want to be interpreted as plain text.
As long as the values within the CDATA sections do not themselves contain any HTML or XML tags, you have nothing to worry about. But if they do, as mine did in my project, then a simple search and replace doesn’t suffice: the XML document now appears to have more tags in it than the schema allows.
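The problem is easy to reproduce. The snippet below is my own illustration of the search-and-replace idea applied to the example document, not Martin's actual code, and is meant to run as anonymous Apex.

```apex
String xml = '<Root><SomeNode><![CDATA[ <b>contains HTML tags</b> ]]></SomeNode></Root>';

// Naive approach: delete the CDATA markers and keep whatever was inside.
String stripped = xml.replace('<![CDATA[', '').replace(']]>', '');

// stripped is now:
// <Root><SomeNode> <b>contains HTML tags</b> </SomeNode></Root>
// The <b> element is real markup again, so SomeNode gains a child element
// that the original document never declared.
Dom.Document doc = new Dom.Document();
doc.load(stripped);
Dom.XmlNode someNode = doc.getRootElement().getChildElement('SomeNode', null);
Dom.XmlNode bold = someNode.getChildElement('b', null);
System.debug(bold != null);
```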
Regular Expressions to the Rescue
My approach builds on Martin’s idea of removing the CDATA markers, but I also escape any XML tags that were enclosed by the CDATA protections.
In this manner I remove the CDATA markers that the Dom.XMLNode class can’t parse while keeping their protection: content that the parser could otherwise interpret as more XML tags is still treated as plain text. Win!
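A minimal sketch of that idea follows. The class and method names are hypothetical, and this is one way to express the technique rather than the exact code from my project. It leans on the String.escapeXml method to do the escaping, and the (?s) flag in the pattern is there so that CDATA sections spanning multiple lines still match.

```apex
public class CdataEscapeSketch {
    // Hypothetical helper: replaces every <![CDATA[ ... ]]> section with its
    // content, XML-escaped, so a DOM parser sees plain text instead of tags.
    public static String escapeCdataSections(String xml) {
        // '.*?' is reluctant, so each match ends at the first ']]>' it meets.
        Pattern cdataPattern = Pattern.compile('(?s)<!\\[CDATA\\[(.*?)\\]\\]>');
        Matcher m = cdataPattern.matcher(xml);
        String result = '';
        Integer last = 0;
        while (m.find()) {
            // Copy the text before this CDATA section unchanged.
            result += xml.substring(last, m.start());
            // Escape <, >, &, and quotes in the protected content.
            result += m.group(1).escapeXml();
            last = m.end();
        }
        // Copy whatever follows the final CDATA section.
        result += xml.substring(last);
        return result;
    }
}
```

For the example document, the SomeNode content becomes &lt;b&gt;contains HTML tags&lt;/b&gt; (with the surrounding spaces preserved). Loading the result with Dom.Document and reading SomeNode with getText should then hand back the original markup as plain text.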