How to avoid PHP breaking umlaut characters when loading an XML file? - php

I've got an XML file to parse with PHP that contains some umlaut characters.
Every node that contains a string has the string wrapped in a CDATA section, but my problem starts before parsing the XML: when I load the file (I've also tried printing its contents with file_get_contents, same result), the umlaut characters get broken, so for example ü becomes Ã¼. Running htmlentities() is futile, as the characters are already broken at that point. The XML encoding is UTF-8, so I don't know what else to do to avoid this problem. Can anyone help me?
Edit:
xml sample 'locations.xml':
<?xml version="1.0" encoding="utf-8"?>
<locations>
<location>
<id>481</id>
<city><![CDATA[Zürich]]></city>
</location>
</locations>
php code:
function parseLocations(){
$xml = new DOMDocument();
$xml->load('locations.xml');
$xml->preserveWhiteSpace = false;
$data = array();
$locations = $xml->childNodes->item(0);
for($i=0; $i<$locations->childNodes->length; $i++){
$location = $locations->childNodes->item($i);
if($location->nodeName=="location"){
$tmp = parseVenue($location);
$data[] = $tmp;
}
}
echo var_export($data, true);
}
function parseVenue($location){
//I need to exclude some of the nodes
$exclude = array('#text');
$data = array();
for($i=0; $i<$location->childNodes->length; $i++){
$tag = $location->childNodes->item($i);
if(!in_array($tag->nodeName, $exclude)){
$data[$tag->nodeName] = $tag->nodeValue;
}
}
return $data;
}
echoed output:
array ( 0 => array ( 'id' => '481', 'city' => 'ZÃ¼rich'), )
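No answer is recorded here, but one common cause of this symptom is worth checking: DOMDocument hands node values back as UTF-8, and UTF-8 bytes displayed as ISO-8859-1 show ü as Ã¼. The sketch below is only an assumption about the cause. It inlines the sample XML as a string (the variable names are illustrative), checks the raw bytes of the parsed value, and declares the output charset explicitly.

```php
<?php
// Sample document inlined as a string; "\u{00FC}" is the UTF-8 encoded "ü".
$source = '<?xml version="1.0" encoding="utf-8"?>'
    . '<locations><location><id>481</id>'
    . "<city><![CDATA[Z\u{00FC}rich]]></city></location></locations>";

$doc = new DOMDocument();
$doc->loadXML($source);
$city = $doc->getElementsByTagName('city')->item(0)->nodeValue;

// In UTF-8, "ü" is the two bytes C3 BC. If they are present here, the
// parser did not break anything and the damage happens at display time.
$hex = bin2hex($city);

// Tell the client that the bytes being sent are UTF-8.
if (!headers_sent()) {
    header('Content-Type: text/plain; charset=utf-8');
}
echo $city, "\n";
```

If the hex dump contains c3bc, the data is intact and only the declared output encoding (HTTP header, HTML meta tag, or terminal setting) needs fixing.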

Related

Trouble creating a valid RSS feed in PHP

I'm trying to get an RSS feed, change some text, and then serve it again as an RSS feed. However, the code I've written doesn't validate properly. I get these errors:
line 3, column 0: Missing rss attribute: version
line 14, column 6: Undefined item element: content (10 occurrences)
Here is my code:
<?php
header("Content-type: text/xml");
echo "<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet type='text/xsl'?>
<?xml-stylesheet type='text/xsl' media='screen'
href='/~d/styles/rss2full.xsl'?>
<rss xmlns:content='http://purl.org/rss/1.0/modules/content/'>
<channel>
<title>Blaakdeer</title>
<description>Blog RSS</description>
<language>en-us</language>
";
$html = "";
$url = "http://feeds.feedburner.com/vga4a/mPSm";
$xml = simplexml_load_file($url);
for ($i = 0; $i < 10; $i++){
$title = $xml->channel->item[$i]->title;
$description = $xml->channel->item[$i]->description;
$content = $xml->channel->item[$i]->children("content", true);
$content = preg_replace("/The post.*/","", $content);
echo "<item>
<title>$title</title>
<description>$description</description>
<content>$content</content>
</item>";
}
echo "</channel></rss>";
Just as you don't treat XML as a string when parsing it, you don't treat it as a string when you create it. Use the proper tools to create your XML; in this case, the DomDocument class.
You had a number of problems with your XML; the biggest is that you were creating a <content> element, but the original RSS had a <content:encoded> element. That means the element name is encoded, but it lives in the content namespace. There is a big difference between that and an element named content. I've added comments to explain the other steps.
<?php
// create the XML document with version and encoding
$xml = new DomDocument("1.0", "UTF-8");
$xml->formatOutput = true;
// add the stylesheet PI
$xml->appendChild(
$xml->createProcessingInstruction(
'xml-stylesheet',
'type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"'
)
);
// create the root element
$root = $xml->appendChild($xml->createElement('rss'));
// add the version attribute
$v = $root->appendChild($xml->createAttribute('version'));
$v->appendChild($xml->createTextNode('2.0'));
// add the namespace
$root->setAttributeNS(
'http://www.w3.org/2000/xmlns/',
'xmlns:content',
'http://purl.org/rss/1.0/modules/content/'
);
// create some child elements
$ch = $root->appendChild($xml->createElement('channel'));
// specify the text directly as second argument to
// createElement because it doesn't need escaping
$ch->appendChild($xml->createElement('title', 'Blaakdeer'));
$ch->appendChild($xml->createElement('description', 'Blog RSS'));
$ch->appendChild($xml->createElement('language', 'en-us'));
$url = "http://feeds.feedburner.com/vga4a/mPSm";
$rss = simplexml_load_file($url);
for ($i = 0; $i < 10; $i++) {
if (empty($rss->channel->item[$i])) {
continue;
}
$title = $rss->channel->item[$i]->title;
$description = $rss->channel->item[$i]->description;
$content = $rss->channel->item[$i]->children("content", true);
$content = preg_replace("/The post.*/","", $content);
$item_el = $ch->appendChild($xml->createElement('item'));
$title_el = $item_el->appendChild($xml->createElement('title'));
// this stuff is unknown so it has to be escaped
// so have to create a separate text node
$title_el->appendChild($xml->createTextNode($title));
$desc_el = $item_el->appendChild($xml->createElement('description'));
// the other alternative is to create a cdata section
$desc_el->appendChild($xml->createCDataSection($description));
// the content:encoded element is not the same as a content element
// the element must be created with the proper namespace prefix
$cont_el = $item_el->appendChild(
$xml->createElementNS(
'http://purl.org/rss/1.0/modules/content/',
'content:encoded'
)
);
$cont_el->appendChild($xml->createCDataSection($content));
}
header("Content-type: text/xml");
echo $xml->saveXML();
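To see the namespace distinction from the reading side, here is a minimal sketch using an inline feed (the feed content and variable names are made up for illustration). It selects the encoded element inside the content namespace explicitly, once by prefix and once by namespace URI.

```php
<?php
$feed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Hello</title>
      <content:encoded><![CDATA[<p>Body</p>]]></content:encoded>
    </item>
  </channel>
</rss>
XML;

$rss = simplexml_load_string($feed);
$item = $rss->channel->item[0];

// children('content', true) selects child elements by namespace prefix.
$encoded = (string) $item->children('content', true)->encoded;

// Selecting by namespace URI keeps working even if the feed changes its prefix.
$sameByUri = (string) $item
    ->children('http://purl.org/rss/1.0/modules/content/')
    ->encoded;

echo $encoded, "\n";
```

There is no plain content child on the item at all, which is why writing a <content> element produced the "Undefined item element" error.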
The first error is just a missing attribute, easy enough:
<rss version="2.0" ...>
For the <p> and other HTML elements, you need to escape them. The file should look like this:
&lt;p&gt;...
There are other ways, but this is the easiest way. In PHP you can just call a function to encode entities.
$output .= htmlspecialchars(" <p>Paragraph</p> ");
As for the <content> tag problem, it should be <description> instead. The <content> tag currently generates two errors. Changing it to <description> in both places should fix both errors.
Otherwise it looks like you understand the basics. You <open> and </close> tags, and those have to match. You can also use what are called empty tags: <empty/>, which exist on their own, contain no content, and have no closing tag.
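As a rough illustration of the escaping step in this answer, the sketch below shows what htmlspecialchars() returns for a small HTML fragment. The sample string is invented, and passing ENT_XML1 with an explicit UTF-8 charset is one reasonable choice for XML output rather than the only one.

```php
<?php
// Hypothetical fragment that should appear as text inside an RSS element.
$html = '<p>Tom & Jerry</p>';

// Escape the markup characters so the XML parser sees text, not tags.
$escaped = htmlspecialchars($html, ENT_XML1, 'UTF-8');

echo '<description>' . $escaped . '</description>', "\n";
// prints: <description>&lt;p&gt;Tom &amp; Jerry&lt;/p&gt;</description>
```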

Writing a string with tags in an XML file turns into gibberish

I tried writing to my XML file with SimpleXML. I wanted to write a string with the value "<test>asd</test>", but it turned into total gibberish. (I know this is related to encoding formats, but I don't know how to fix it. I tried setting encoding="UTF-8", but it still yielded a similar result.)
My XML File:
<?xml version="1.0"?>
<userinfos>
<userinfo>
<account>
<user>TIGERBOY-PC</user>
<toDump>2014-02-04 22:17:22</toDump>
<nextToDump>2014-02-05 00:17:22</nextToDump>
<lastChecked>2014-02-04 16:17:22</lastChecked>
<isActive>0</isActive>
<upTime>2014-02-04 16:17:22</upTime>
<toDumpDone>1</toDumpDone>
<systemInfo>&lt;test&gt;asd&lt;/test&gt;</systemInfo>
</account>
<account>
<user>TIGERBOY-PCV</user>
<toDump>2014-02-04 22:17:22</toDump>
<nextToDump>2014-02-05 00:17:22</nextToDump>
<lastChecked>2014-02-04 16:17:22</lastChecked>
<isActive>1</isActive>
<upTime>2014-02-04 16:17:22</upTime>
<toDumpDone>1</toDumpDone>
</account>
</userinfo>
</userinfos>
My PHP File:
<?php
//Start of Functions
function changeAgentInfo()
{
$userorig = $_POST['user'];
$userinfos = simplexml_load_file('userInfo.xml'); // Opens the user XML file
$flag = false;
foreach ($userinfos->userinfo->account as $account)
{
// Checks if the user in this iteration of the loop is the same as $userorig (the user i want to find)
if($account->user == $userorig)
{
$flag = true; // Flag that user is found
$meow = "<test>asd</test>";
$account->addChild('systemInfo',$meow);
}
}
$userinfos->saveXML('userInfo.xml');
echo "Success";
}
//End of Functions
// Start of Program
changeAgentInfo();
?>
Thank you and have a nice day =)
This isn't gibberish; it is simply the XML entities for < (&lt;) and > (&gt;). To add nested XML elements with SimpleXML, you can do the following:
$node = $account->addChild('systemInfo');
$node->addChild('test', 'asd');
You'll see that first we add a node to <account>, then add a child to that newly created node.
If you plan on adding several children to the <systemInfo> element, you could perhaps do the following:
$items = array(
'os' => 'Windows 7',
'ram' => '8GB',
'browser' => 'Google Chrome'
);
$node = $account->addChild('systemInfo');
foreach ($items as $key => $value) {
$node->addChild($key, $value);
}
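The difference between the two approaches can be checked with a small self-contained sketch. It builds an <account> element in memory instead of loading userInfo.xml, and the element name raw is made up for the comparison.

```php
<?php
$account = new SimpleXMLElement('<account/>');

// Passing markup as the value stores it as text, so it is written escaped.
$account->addChild('raw', '<test>asd</test>');

// Adding a child to the new node creates a real nested element.
$node = $account->addChild('systemInfo');
$node->addChild('test', 'asd');

$out = $account->asXML();
echo $out;

// Reading the text back returns the original string, so nothing was lost.
$roundTrip = (string) $account->raw;
```

The escaped form is therefore valid XML that decodes to the same string on reading; it only matters when real child elements are wanted.</systemInfo>') !== false);
assert($roundTrip === '<test>asd</test>');
</test>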
The addChild function is used to add a child element to an XML node. Its second argument is treated as text, but you are trying to pass XML markup instead.
You have
$meow = "<test>asd</test>";
$account->addChild('systemInfo',$meow);
You should change it to
$account->addChild('systemInfo','my system info text');

Cannot parse XML file with HTML umlauts

I parse an xml file with this code:
$file = file_get_contents('test.xml');
$xml = $file;
echo '<pre>';
$xml = htmlentities_decode ($xml);
print_r (simplexml_load_string($xml));
function htmlentities_decode( $string ){
$trans = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES);
$trans = array_flip($trans);
return strtr($string, $trans);
}
My XML file contains umlauts encoded as HTML entities, like this: &auml; or &szlig;.
How do I have to decode/encode my output so that they are shown in the same way as above (&auml; or &szlig;)?
SimpleXML cannot read them directly, so I have to decode them first so that SimpleXML can work with the file.
Afterwards (after the parsing) I want to save the data as UTF-8 to the database.
What is the best way to do that?
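No answer is recorded here. One possible approach, sketched below under the assumption that the file is otherwise well-formed XML, is to decode only the HTML named entities and leave XML's own &amp;, &lt; and &gt; alone, since decoding those would corrupt the markup. The function name and the sample string are hypothetical.

```php
<?php
// Replace HTML named entities with UTF-8 characters, but keep the
// entities that XML itself needs (&amp;, &lt;, &gt;).
function decodeHtmlEntitiesForXml(string $xml): string
{
    // character => entity, flipped to entity => character
    $table = array_flip(
        get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES, 'UTF-8')
    );
    unset($table['&amp;'], $table['&lt;'], $table['&gt;']);

    return strtr($xml, $table);
}

$raw = '<names><name>K&auml;se &amp; Stra&szlig;e</name></names>';
$xml = simplexml_load_string(decodeHtmlEntitiesForXml($raw));

// SimpleXML returns UTF-8 strings, ready to be stored in a UTF-8 database.
$name = (string) $xml->name;
echo $name, "\n";
```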

Sitemap creation with DOMDocument throws parsing error

I'm creating a sitemap in XML. It works well with one record, but when I include more than one record, it throws an error:
XML Parsing Error: junk after document element
Which shows this code here:
<?xml version="1.0" encoding="UTF-8"?>
<url><loc>http://www.mywebsite.com/page/1</loc><changefreq>daily</changefreq><priority>0.6</priority></url>
<url><loc>http://www.mywebsite.com/page/2</loc><changefreq>daily</changefreq><priority>0.6</priority></url>
My code:
$xml = new DOMDocument('1.0', 'UTF-8');
for($i = 0; $i < 2; $i++)
{
$url = $xml->createElement('url');
$xml->appendChild($url);
$website_url = 'http://www.mywebsite.com/page/' . $i;
$loc = $xml->createElement('loc', $website_url);
$url->appendChild($loc);
$change = $xml->createElement('changefreq', 'daily');
$url->appendChild($change);
$priority = $xml->createElement('priority', '0.6');
$url->appendChild($priority);
}
header('Content-type: text/xml');
echo $xml->saveXML();
Why is it throwing this kind of error when the XML seems valid to me?
At least in your example, you have two root nodes (<url>). XML allows only one root element, so the second <url> is the "junk after document element".
You're missing the <urlset> root node, see: http://www.sitemaps.org/protocol.php
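A minimal sketch of the fix, reusing the URLs from the question: a single <urlset> root is created in the sitemap namespace and every <url> is appended to it rather than to the document. Creating the children with createElementNS in the same namespace is one way to keep the tree consistent; the variable names are illustrative.

```php
<?php
$ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';

$xml = new DOMDocument('1.0', 'UTF-8');

// The single root element that holds every <url> entry.
$urlset = $xml->appendChild($xml->createElementNS($ns, 'urlset'));

for ($i = 1; $i <= 2; $i++) {
    // Append to the root element, not to the document itself.
    $url = $urlset->appendChild($xml->createElementNS($ns, 'url'));
    $url->appendChild(
        $xml->createElementNS($ns, 'loc', 'http://www.mywebsite.com/page/' . $i)
    );
    $url->appendChild($xml->createElementNS($ns, 'changefreq', 'daily'));
    $url->appendChild($xml->createElementNS($ns, 'priority', '0.6'));
}

$sitemap = $xml->saveXML();
echo $sitemap;
```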

Numeric Values in xml tag

I am trying to convert an array to XML data in PHP. I am using the XML_Serializer PEAR package for this. My array is:
$arr=array(1000=>'name is john');
When I convert it to xml using this code:
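$options = array('mode'=>'simplexml','addDecl'=>true,'indent'=>' ','rootName'=>'names');
$serializer = new XML_Serializer($options);
$result = $serializer->serialize($arr);
// serialize() returns true on success and a PEAR_Error object on failure,
// so compare strictly before reading the result
if ($result === true) {
$data = $serializer->getSerializedData();
echo $data;
}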
I get following response:
<?xml version="1.0"?>
<names>name is john</names>
But I want this kind of response:
<?xml version="1.0"?>
<names>
<1000>name is john</1000>
</names>
Can anyone tell me where my mistake is?
I guess this is because XML element names may not start with a digit, so 1000 is not a valid element name. However, if you really want "XML-style" output like the above (even though it is not real XML), you must bypass the library and build it by hand. I think this will do it for you:
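// plain function: the "public" keyword is only valid inside a class
function xml_encode($array, $tag = "root"){
$result = '<'.$tag.'>';
foreach($array as $key => $value){
if(is_array($value)){
// nested arrays become nested elements
$result .= xml_encode($value, $key);
}else{
// escape the text so characters like < and & cannot break the markup
$result .= '<'.$key.'>'.htmlspecialchars($value).'</'.$key.'>';
}
}
$result .= '</'.$tag.'>';
return $result;
}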
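If well-formed XML is required, one alternative (an assumption about what the consumer can accept, not something the question asked for) is to keep a fixed element name and carry the numeric key in an attribute. The names name and id below are hypothetical. The sketch also shows that DOMDocument refuses a numeric element name outright.

```php
<?php
$arr = array(1000 => 'name is john');

$doc = new DOMDocument('1.0');
$root = $doc->appendChild($doc->createElement('names'));

// An element name such as "1000" is rejected with a DOMException.
$rejected = false;
try {
    $doc->createElement('1000');
} catch (DOMException $e) {
    $rejected = true;
}

foreach ($arr as $key => $value) {
    // Fixed element name, numeric key stored as an attribute value.
    $el = $root->appendChild($doc->createElement('name'));
    $el->setAttribute('id', (string) $key);
    $el->appendChild($doc->createTextNode($value));
}

$out = $doc->saveXML();
echo $out;
```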
