When working with SOAP web services or RSS feeds, you are often interacting with XML documents. Some services send XML content with values enclosed within <![CDATA[ some text ]]>. I recently worked on an integration project syncing FogBugz, whose API returns XML, with Service Cloud, and I had to parse CDATA sections. It was not trivial, but I got it to work!
Here’s an example of what XML looks like with CDATA tags:
<?xml version="1.0" encoding="UTF-8"?>
<Root>
    <SomeNode><![CDATA[ <b>contains HTML tags</b> ]]></SomeNode>
</Root>
In XML documents, CDATA (Character Data) is content within a tag that itself could be interpreted as XML tags but that you want to be interpreted as plain text.
In the above example, when parsing the value of <SomeNode> I would expect to retrieve the value <b>contains HTML tags</b>. Without the enclosing CDATA tags, an XML parser would treat <b> and </b> as more XML tags rather than as plain text content to return to me, and I would retrieve the value contains HTML tags (no <b> </b> tags).
Reading XML in Apex
Salesforce provides two ways to read XML in Apex: DOM parsing with the Dom.Document and Dom.XMLNode classes, and stream reading with the XmlStreamReader class.
DOM parsing, in my opinion, is by far the easiest and least verbose option. You ask the document to give you elements by name and most of the heavy lifting of parsing the content is done for you. A developer’s paradise!
Stream reading is more low-level and your code reacts to events like the start or end of a tag, the start or end of content, etc. And your code must maintain context to know what to do with the content it just consumed. This is too low level for my patience threshold.
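To make the contrast concrete, here is a minimal sketch that reads the same element both ways. The sample document and the variable names (domValue, streamValue, insideSomeNode) are made up for illustration, and the element holds ordinary text rather than CDATA.

```apex
// Sample document (hypothetical); the element holds ordinary text, no CDATA.
String xml =
    '<?xml version="1.0" encoding="UTF-8"?>' +
    '<Root><SomeNode>plain text</SomeNode></Root>';

// DOM parsing: load the whole document, then ask for the element by name.
Dom.Document doc = new Dom.Document();
doc.load( xml );
String domValue = doc.getRootElement().getChildElement( 'SomeNode', null ).getText();

// Stream reading: walk the events and track whether we are inside <SomeNode>.
XmlStreamReader reader = new XmlStreamReader( xml );
reader.setCoalescing( true ); // ask for adjacent text to be returned as one block

Boolean insideSomeNode = false;
String streamValue = '';

while ( reader.hasNext() ) {
    XmlTag eventType = reader.getEventType();
    if ( eventType == XmlTag.START_ELEMENT && reader.getLocalName() == 'SomeNode' ) {
        insideSomeNode = true;
    } else if ( eventType == XmlTag.END_ELEMENT && reader.getLocalName() == 'SomeNode' ) {
        insideSomeNode = false;
    } else if ( insideSomeNode && ( eventType == XmlTag.CHARACTERS || eventType == XmlTag.CDATA ) ) {
        // text events inside the element we care about
        streamValue += reader.getText();
    }
    reader.next();
}

System.debug( domValue );
System.debug( streamValue );
```

The DOM version is a single chained call, while the stream version needs a flag to remember where it is in the document. That bookkeeping is the "maintain context" cost described above.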
Trying to Read CDATA in XML in Apex
When it comes to parsing CDATA sections in XML there is a feature disparity between the DOM and Stream options. DOM parsing (likely a bug in the platform) ignores the CDATA sections and you cannot retrieve the values of those nodes. Stream parsing, however, does recognize the CDATA sections and your code can parse the value out.
But as I mentioned earlier, DOM parsing is much easier and my preferred method of reading XML. So we push on trying to solve this dilemma.
In fact, this is an old problem and has plagued developers for several years.
Almost as soon as the feature was available, Thomas Hoban submitted an IdeaExchange request to support CDATA in the Dom.Document and Dom.XMLNode classes.
A little while later, Salesforce MVP Abhinav Gupta shared on his blog how this bug hampered his open source projects.
Just a few years ago, Martin Verdejo asked on Salesforce Stack Exchange for help in his struggle to read CDATA using the Dom.XMLNode class.
Is there any hope?
ARE WE DOOMED TO PARSE ALL XML USING STREAMS?!?!
But no, we are not doomed. Otherwise I would not be writing this blog post =)
There Must Be a Simple Solution!
Martin’s approach was to replace the <![CDATA[ and ]]> text with an empty string so that it is removed from the XML content. That’s a straightforward approach; however, it forgets the purpose and protections of the CDATA tags to begin with.
Remember, in XML documents, CDATA (Character Data) is content within a tag that itself could be interpreted as XML tags but that you want to be interpreted as plain text.
As long as the values within the CDATA sections do not themselves contain any HTML or XML tags, you have nothing to worry about. But if they do, as in my project, a simple search and replace doesn’t suffice: the XML document now appears to contain more tags than the schema allows.
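A small sketch of that failure mode, assuming the same sample document as earlier in the post. The variable names (stripped, someNode, bold) are only illustrative.

```apex
String xml =
    '<?xml version="1.0" encoding="UTF-8"?>' +
    '<Root><SomeNode><![CDATA[<b>contains HTML tags</b>]]></SomeNode></Root>';

// Naive approach: delete the CDATA markers and keep whatever was inside them.
String stripped = xml.replace( '<![CDATA[', '' ).replace( ']]>', '' );

Dom.Document doc = new Dom.Document();
doc.load( stripped );

Dom.XMLNode someNode = doc.getRootElement().getChildElement( 'SomeNode', null );

// The <b> markup is now a real child element of SomeNode, not part of its text.
Dom.XMLNode bold = someNode.getChildElement( 'b', null );
System.debug( bold != null );
System.debug( bold.getText() );
```

Here the document still loads only because <b>...</b> happens to be well-formed XML. Content such as an unclosed <br> would likely make doc.load fail outright.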
Regular Expressions to the Rescue
In my approach I build on Martin’s idea of replacing the CDATA sections, but I then escape any XML tags that were enclosed by the CDATA protections.
In this manner I remove the CDATA markers that the Dom.XMLNode class can’t parse but keep the protection they provided: content that the parser could interpret as more XML tags is instead treated as plain text. Win!
String xml =
    '<?xml version="1.0" encoding="UTF-8"?>' +
    '<root>' +
    ' <SomeNode><![CDATA[<b>contains html tags</b>]]></SomeNode>' +
    '</root>';

// replace CDATA sections with parseable tokens
xml = xml.replaceAll( '<!\\[CDATA\\[', 'XML_CDATA_START' ).replaceAll( ']]>', 'XML_CDATA_END' );

// we will build up a map of original text and replacement text
Map<String, String> replacementMap = new Map<String, String>();

// regular expression to match sections we want to replace
Pattern myPattern = Pattern.compile( '(XML_CDATA_START)(.*?)(XML_CDATA_END)' );
Matcher myMatcher = myPattern.matcher( xml );

while ( myMatcher.find() ) {
    // the regex was too complicated for Matcher.replaceFirst(..)
    // so have to do it manually so just put in this map the
    // original text and the replacement text, we do replacing later
    replacementMap.put( myMatcher.group(), myMatcher.group(2).escapeXML() );
}

// replace in the xml each CDATA section with the escaped XML of its inner content
for ( String key : replacementMap.keySet() ) {
    xml = xml.replace( key, replacementMap.get( key ) );
}

// parse the xml like normal
Dom.Document doc = new Dom.Document();
doc.load( xml );

Dom.XMLNode rootNode = doc.getRootElement();
String text = rootNode.getChildElement( 'SomeNode', null ).getText();

System.debug( text ); // prints: <b>contains html tags</b>
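The regular expression above has two limits worth knowing about. The `.` in the pattern does not match line breaks unless the `(?s)` flag is added, and Apex can reject regex work on long strings. Below is a sketch of a regex-free variant of the same idea, using only indexOf and substring. It is an alternative I am assuming would sidestep both issues, not the code I used in the project; the names startToken, endToken, parts, cursor and escapedXml are made up for illustration.

```apex
// Sample document (hypothetical) with a line break inside the CDATA section.
String xml =
    '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<root>\n' +
    '  <SomeNode><![CDATA[<b>contains\nhtml tags</b>]]></SomeNode>\n' +
    '</root>';

String startToken = '<![CDATA[';
String endToken = ']]>';

List<String> parts = new List<String>();
Integer cursor = 0;

while ( true ) {
    Integer startIdx = xml.indexOf( startToken, cursor );
    if ( startIdx == -1 ) {
        break; // no more CDATA sections
    }
    Integer endIdx = xml.indexOf( endToken, startIdx + startToken.length() );
    if ( endIdx == -1 ) {
        break; // unterminated section: leave the remainder untouched
    }
    // copy the text before the section, then the escaped inner content
    parts.add( xml.substring( cursor, startIdx ) );
    parts.add( xml.substring( startIdx + startToken.length(), endIdx ).escapeXml() );
    cursor = endIdx + endToken.length();
}
// copy whatever follows the last CDATA section
parts.add( xml.substring( cursor ) );

String escapedXml = String.join( parts, '' );

// parse the xml like normal
Dom.Document doc = new Dom.Document();
doc.load( escapedXml );
String text = doc.getRootElement().getChildElement( 'SomeNode', null ).getText();
System.debug( text );
```

Because each section is copied by position, nothing depends on the placeholder tokens or on a map of replacements. The trade-off is a little more index arithmetic to get right.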
sadly, this does not work for large request bodies – line 8 will give you the classic ‘Regex too complicated’ if the xml string is sufficiently long. still trying to sort out a way around that.
Yeah, it’s certainly a workaround and not a silver bullet. Worst case is you have to go the streaming route 😦
This doesn’t work for XML that contains line breaks. The pattern must be modified as follows: (?s)(XML_CDATA_START)(.*?)(XML_CDATA_END)
Hi Doug – thanks for the blog. Works perfectly. For my challenge I had an embedded XML doc inside the CDATA. So with your bit of code, I was able to create a second DOM Document that contained the contents of the XML inside this CDATA and parse that.
Bill Convis
Integration Architect
ServiceMax