Removing CDATA tags from rss (xml) feed

bartlby37 · April 12, 2012, 12:00am

I'm displaying an rss feed from my blog in a block on some pages. By default the rss block only shows titles from the feed. I changed it to also show the text, or some of it anyway. The problem is that in the feed the text (description) is inside CDATA tags and so doesn't display properly inside the block. For instance, rather than an image in the block I get the image's HTML tag shown as text, pointing at http://www.blog.double-b-books.com/images/9084.jpg (see attached image).

The code that gets the rss feed is in core/fn.cms.php

```php
function fn_get_rss_feed($data)
{
    if (!empty($data['feed_url'])) {
        $data_key = 'rss_data_cache_' . (isset($data['block_data']['block_id']) ? $data['block_data']['block_id'] : 0);

        if (!empty($data['cache_time'])) {
            Registry::register_cache($data_key, $data['cache_time'], CACHE_LEVEL_TIME);
        }

        if (Registry::is_exist($data_key) == false) {
            $limit = !empty($data['max_item']) ? $data['max_item'] : 3;
            $rss_data = array();
            $rss = simplexml_load_string(fn_get_contents($data['feed_url']));

            if (!empty($rss)) {
                $it = 0;
                $items = array();

                foreach ($rss->channel->item as $item) {
                    if ($it > $limit) {
                        break;
                    }

                    $items[] = array(
                        'title' => (string)$item->title,
                        'description' => (string)$item->description, // the line I added
                        'pubDate' => (string)$item->pubDate,
                        'link' => (string)$item->link
                    );
                    $it++;
                }

                $rss_data = array(array(
                    $items,
                    (string)$rss->channel->link,
                    $data['feed_url']
                ));
                Registry::set($data_key, $rss_data);
            }

            return $rss_data;
        } else {
            return Registry::get($data_key);
        }
    }

    return array();
}
```

I added 'description' => (string)$item->description, to get the description into the feed. Can anyone tell me how I strip the CDATA tags from that?

I also wonder why a block that displays an rss feed isn't doing this automatically.

All help gratefully received.

blgscrn.png

tbirnseth · April 13, 2012, 12:00am

Try adding LIBXML_NOCDATA as an option to the simplexml_load_string() function as the 3rd argument. See the php manual for simplexml for what the default value for the 2nd argument should be so you can get to the 3rd.

You probably also want to 'trim' the resulting string since I believe that the LIBXML_NOCDATA preserves the newlines and tabs related to the tag which you probably don't want.

P.S. It took only a few minutes to find this through a Google search for “php simplexml cdata”.
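The suggestion above can be sketched roughly as follows. This is a minimal, standalone example rather than the cart's own code: the sample feed string is made up and stands in for whatever fn_get_contents($data['feed_url']) returns, and the $descriptions array is only there for illustration.

```php
<?php
// Hypothetical sample feed, standing in for fn_get_contents($data['feed_url']).
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample blog</title>
    <link>http://example.com/</link>
    <item>
      <title>Did that make a difference</title>
      <description>
        <![CDATA[<p>Lets see.</p>]]>
      </description>
    </item>
  </channel>
</rss>
XML;

// 2nd argument: the default class name, 'SimpleXMLElement'.
// 3rd argument: LIBXML_NOCDATA asks libxml to merge CDATA sections into text nodes.
$rss = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);

$descriptions = array();
foreach ($rss->channel->item as $item) {
    // trim() drops the newlines and indentation that surround the CDATA section.
    $descriptions[] = trim((string) $item->description);
}

echo $descriptions[0], "\n";
```

If the block still shows tags as text after this, the parsing step is probably not where the problem lies.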

bartlby37 · April 13, 2012, 12:00am

Well I've spent more than a few minutes searching Google.

I'd already tried replacing this

```php
$rss = simplexml_load_string(fn_get_contents($data['feed_url']));
```

with this

```php
$rss = simplexml_load_string(fn_get_contents($data['feed_url']), 'SimpleXMLElement', LIBXML_NOCDATA);
```

I also found a function someone had written to strip CDATA tags and tried that but couldn't make it work. I know absolutely nothing about php and so possibly the syntax in the statement above is wrong or I'm looking in the wrong place.

Thanks

tbirnseth · April 13, 2012, 12:00am

Your simplexml_load_string() looks correct. Are you saying that you still have <![CDATA[whatever the data is]]> returned as the string value of your reference to the element (i.e. (string)$rss->item->description)?

Have you dumped the results of the simplexml pointer, as in

echo "<pre>".print_r($rss,true)."</pre>";

to see what you have?

It is possible that your provider of the rss has the data encoded in a different format than what you're expecting or what's correct for the xml file.

Have you tried just referencing the URL in a browser to verify that the CDATA is in fact stripped out of the feed by a browser?

bartlby37 · April 13, 2012, 12:00am

Thanks for taking the time to respond.

The feed URL is [link no longer available] and it seems to display fine in a browser. But in the rss block it shows all the html as text, so rather than starting a paragraph I get <p>abcd… and rather than displaying an image I get the <img> tag as text.

The only place I've found to view the xml file is if I put the feed through feedburner and what it shows is this

[xml]

Did that make a difference
http://www.blog.double-b-books.com/index.php/about-authors/item/26-did-that-make-a-difference
Lets see.

]]>
bruce@double-b-books.co.uk (Super User)
http://www.blog.double-b-books.com/index.php/about-authors/item/26-did-that-make-a-difference
[/xml]

The display in the block isn't changed in any way by passing the feed through feedburner so I assume that the code is the same.

I assume this, "Have you dumped the results of the simplexml pointer as in echo "<pre>".print_r($rss,true)."</pre>" to see what you have?", is to show the contents of the xml feed.

Where would I need to place that code to make that happen?

Bruce

tbirnseth · April 13, 2012, 12:00am

You can just place the dump code after the call to the xml parser. This should be done on a test site. If this is a production site, modify it to look like:

@file_put_contents("./debug.txt", print_r($rss,true));

This will put the data in a file in the root of your store called debug.txt.

Since you say that all the HTML is visible, it could be that there is further encoding of the data. You might have to wrap the input xml data in the html_entity_decode() function. That would also explain why the xml parser is not getting rid of the CDATA: it doesn't see it as CDATA. But your site should behave the same as a browser.

tony

tbirnseth · April 13, 2012, 12:00am

What you're looking for in the dump is a bunch of &lt; and &gt; sequences in place of < and > characters.
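As a rough sketch of that check: the helper name looks_entity_encoded() and the sample string below are made up for illustration and are not part of CS-Cart. The idea is simply to test a description value from the dump for entity-encoded angle brackets and, if they are there, decode it with html_entity_decode().

```php
<?php
// Hypothetical helper, not part of CS-Cart: reports whether a string
// contains entity-encoded angle brackets such as &lt; or &gt;.
function looks_entity_encoded($text)
{
    return strpos($text, '&lt;') !== false || strpos($text, '&gt;') !== false;
}

// Hypothetical description value as it might appear in debug.txt.
$raw = '&lt;p&gt;Lets see.&lt;/p&gt;';

// Decode only when the value really is entity-encoded.
$decoded = looks_entity_encoded($raw)
    ? html_entity_decode($raw, ENT_QUOTES, 'UTF-8')
    : $raw;

echo $decoded, "\n";
```

If the dump shows plain < and > characters instead, the feed data itself is fine and the encoding is happening later, at display time.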

bartlby37 · April 13, 2012, 12:00am

Hi Tony

Many thanks for that - I shall give it a go 1st thing in the morning and see what it produces. I'm also going to try and find if there's any other encoding on the xml file - It seems unlikely though - I can get the date and title to show fine.

Bruce

tbirnseth · April 13, 2012, 12:00am

I don’t see any encoding in the link you provided. It should just work!

bartlby37 · April 14, 2012, 12:00am

I ran this

```php
echo "<pre>".print_r($rss,true)."</pre>";
```

and everything displayed fine - images were there, no html showing up, no strange code in the xml file. Took that line out and went back to look at the site and there's the rss block showing all the html tags.

As for technology just working.... well I've told myself that so often that I almost believe it ;-).

Could it be something in the template file?

```smarty
<li>
  {if $item.pubDate}
  <strong>{$item.pubDate|date_format:$settings.Appearance.date_format}</strong>
  {/if}
  <a href="{$item.link}" target="_blank">{$item.title}</a>
  <p>{$item.description}</p>
</li>
```

All I've added to that is the <p>{$item.description}</p>

I'm baffled.

tbirnseth · April 14, 2012, 12:00am

Then the data must be getting escaped within the cart. I have no idea where, and I've never looked at any of the rss code.

If the rss is being displayed within a 'description' field within products or something, that would explain the escape.

You can use the 'unescape' smarty modifier (look in the smarty docs) to essentially do an html_entity_decode() on the data.
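To illustrate what is probably going on, here is a small PHP sketch of the escape and unescape round trip. It assumes the cart escapes template variables with something equivalent to htmlspecialchars(); that is a guess, not something confirmed from the cart's source, and the sample description and image URL are made up.

```php
<?php
// Hypothetical description as returned by the feed parser.
$description = '<p>Lets see.</p><img src="http://example.com/images/sample.jpg">';

// Assumption: the cart escapes template variables roughly like this,
// which is why the tags show up as visible text in the block.
$escaped = htmlspecialchars($description, ENT_QUOTES, 'UTF-8');

// Decoding the entities restores the original markup so the browser renders it.
$restored = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";
echo $restored, "\n";
```

In the template, the corresponding change would presumably be to apply the modifier to the variable, something like {$item.description|unescape}, in the same way date_format is applied to pubDate.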

bartlby37 · April 15, 2012, 12:00am

FANTASTIC - thanks so much for your help. It's working now and you can see it here

http://www.double-b-books.com