PHP: Converting xml to array - php

I have an XML string that has to be converted into a PHP array so that it can be processed by other parts of the software my team is working on.
For the xml -> array conversion I'm using something like this:
if(get_class($xmlString) != 'SimpleXMLElement') {
$xml = simplexml_load_string($xmlString);
}
if(!$xml) {
return false;
}
It works fine - most of the time :) The problem arises when my "xmlString" contains something like this:
<Line0 User="-5" ID="7436194"><Node0 Key="<1" Value="0"></Node0></Line0>
Then, simplexml_load_string won't do its job (and I know that's because of the character "<").
As I can't influence any other part of the code (I can't open up the module that's generating the XML string and tell it "encode special characters, please!"), I need your suggestions on how to fix that problem BEFORE calling "simplexml_load_string".
Do you have any ideas? I've tried
str_replace("<","&lt;",$xmlString)
but that simply ruins the entire "xmlString"... :(

Well, then you can just replace the special characters inside the quoted attribute values of $xmlString with their HTML entity counterparts using htmlspecialchars() and preg_replace_callback().
I know this is not performance friendly, but it does the job :)
<?php
$xmlString = '<Line0 User="-5" ID="7436194"><Node0 Key="<1" Value="0"></Node0></Line0>';
$xmlString = preg_replace_callback('~(?:").*?(?:")~',
function ($matches) {
return htmlspecialchars($matches[0], ENT_NOQUOTES);
},
$xmlString
);
header('Content-Type: text/plain');
echo $xmlString; // you will see the special characters are converted to HTML entities :)
echo PHP_EOL . PHP_EOL; // tidy :)
$xmlobj = simplexml_load_string($xmlString);
var_dump($xmlobj);
?>
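A slightly more defensive variant of the same idea, as a minimal sketch. The helper name escapeAttributeValues() is made up for illustration, and the pattern assumes every attribute value is double-quoted and that the text between tags never contains ="..." sequences. It also switches libxml to internal error collection so a failed parse can be inspected rather than only returning false.

```php
<?php
// Hypothetical helper (not from the thread): escape "&", "<" and ">" inside
// double-quoted attribute values, leaving the markup between attributes alone.
function escapeAttributeValues(string $xml): string
{
    return preg_replace_callback(
        '~="([^"]*)"~',
        function (array $m): string {
            // The final "false" stops already-escaped entities from being encoded twice.
            return '="' . htmlspecialchars($m[1], ENT_NOQUOTES, 'UTF-8', false) . '"';
        },
        $xml
    );
}

$raw = '<Line0 User="-5" ID="7436194"><Node0 Key="<1" Value="0"></Node0></Line0>';

libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
$xml = simplexml_load_string(escapeAttributeValues($raw));

$key = null;
if ($xml === false) {
    // Print whatever libxml complained about
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), PHP_EOL;
    }
} else {
    $key = (string) $xml->Node0['Key']; // the parser decodes &lt; back to "<"
    echo $key, PHP_EOL;
}
```

If the generating module ever starts escaping its own output, the double_encode flag should keep this from mangling the already-correct entities.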

Related

simplexml_load_string and the unwelcome parse error

Update: Casting as an array does the trick. See this response, since I don't have enough clout to upvote :)
I started on this problem with many potential culprits, but after lots of diagnostics the problem is still there and no obvious answers remain.
I want to print the placename "Gaborone", which is located at the first tag under the first tag under the first tag of this API-loaded XML file. How can I parse this to return that content?
<?php
# load the XML file
$test1 = (string)file_get_contents('http://www.afdb.org/fileadmin/uploads/afdb/Documents/Generic-Documents/IATIBotswanaData.xml');
#throw it into simplexml for parsing
$xmlfile = simplexml_load_string($test1);
#output the parsed text
echo $xmlfile->iati-activity[0]->location[0]->gazetteer-entry;
?>
Which never fails to return this:
Parse error: syntax error, unexpected '[', expecting ',' or ';'
I've tried changing the syntax to avoid the hyphens in the tag names as such:
echo $xmlfile["iati-activity"][0]["location"][0]["gazetteer-entry"];
. . . but that returns complete nothingness; no error, no source.
I've also tried debugging based on these otherwise-helpful threads, but none of the solutions have worked. Is there an obvious error in my simplexml addressing?
"I've tried changing the syntax to avoid the hyphens in the tag names as such: echo $xmlfile["iati-activity"][0]["location"][0]["gazetteer-entry"];"
Your problem here is that array-style access with a string key on a SimpleXMLElement reads attributes, not child elements, which is why that line prints nothing. Native casting of the object to an array isn't recursive either, so it would only convert the top level. And yes, your guess is correct: hyphenated names can't be used as plain object properties on the value returned by simplexml_load_string() because of the syntax issues. Instead, you can cast the returned value (a SimpleXMLElement) into an array recursively. You can use this function for that:
function object2array($object) {
return json_decode(json_encode($object), true);
}
The rest:
// load the XML file
$test1 = file_get_contents('http://www.afdb.org/fileadmin/uploads/afdb/Documents/Generic-Documents/IATIBotswanaData.xml');
$xml = simplexml_load_string($test1);
// Cast an object into array, that makes it much easier to work with
$data = object2array($xml);
$data = $data['iati-activity'][0]['location'][0]['gazetteer-entry']; // Works
var_dump($data); // string(8) "Gaborone"
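One thing to watch with the json_encode()/json_decode() round trip is that the shape of the result depends on how many sibling elements share a name. The sketch below uses two tiny made-up documents (not the IATI feed) to show the difference; object2array() is the function from the answer above.

```php
<?php
// Same helper as in the answer: recursive object-to-array conversion via JSON
function object2array($object)
{
    return json_decode(json_encode($object), true);
}

// A single <item> becomes a plain value...
$one = object2array(simplexml_load_string('<root><item>a</item></root>'));

// ...while repeated <item> elements become a numerically indexed list.
$many = object2array(simplexml_load_string('<root><item>a</item><item>b</item></root>'));

var_dump($one['item']);
var_dump($many['item']);
```

So an index such as ['iati-activity'][0] only works when the document really contains more than one element of that name; with a single element there is no [0] level.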
I had a similar problem parsing XML using the simpleXML command until I did the following string replacements:
//$response contains the XML string
$response = str_replace(array("\n", "\r", "\t"), '', $response); //eliminate newlines, carriage returns and tabs
$response = trim(str_replace('"', "'", $response)); // turn double quotes into single quotes
$simpleXml = simplexml_load_string($response);
$json = json_decode(json_encode($simpleXml)); // an extra step I took so I got it into a nice object that is easy to parse and navigate
If that doesn't work, there's some talk over at PHP about CDATA not always being handled properly - PHP version dependent.
You could try this code prior to calling the simplexml_load_string function:
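if (strpos($content, '<![CDATA[') !== false) { // !== false also catches a match at position 0
// Escape the contents of each CDATA section so they survive as plain text
function parseCDATA($data) {
return htmlentities($data[1]);
}
$content = preg_replace_callback(
'#<!\[CDATA\[(.*?)\]\]>#', // non-greedy, so each CDATA section is matched separately
'parseCDATA',
str_replace("\n", " ", $content)
);
}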
I've reread this, and I think your error is happening on your final line - try this:
echo $xmlfile->{'iati-activity'}[0]->location[0]->{'gazetteer-entry'};
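To try the brace syntax without hitting the remote feed, here is a small sketch against a made-up sample document. The root element name iati-activities and the structure are assumptions for illustration only. It also shows an XPath lookup as an alternative, and confirms that the array-style string key looks for an attribute rather than a child element.

```php
<?php
// Made-up sample that mimics the nesting described in the question
$sample = '<iati-activities>'
    . '<iati-activity><location><gazetteer-entry>Gaborone</gazetteer-entry></location></iati-activity>'
    . '</iati-activities>';

$doc = simplexml_load_string($sample);

// Hyphenated element names need the {'...'} property syntax
$viaBraces = (string) $doc->{'iati-activity'}[0]->location[0]->{'gazetteer-entry'};

// XPath sidesteps the naming problem entirely (positions are 1-based)
$nodes = $doc->xpath('/iati-activities/iati-activity[1]/location[1]/gazetteer-entry');
$viaXpath = isset($nodes[0]) ? (string) $nodes[0] : null;

// A string key in square brackets means "attribute", and there is none by that name
$asAttribute = isset($doc['iati-activity']);

echo $viaBraces, PHP_EOL;
echo $viaXpath, PHP_EOL;
var_dump($asAttribute);
```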

I can't get asXml to return like it should

I have some straightforward code which should return an XML string, but when I run it, it returns only '132963660013292910001330196400', which is the second child of all three concerts concatenated together. This is literally copied straight out of a textbook, so I don't see how I could be doing it wrong to get this result. What am I misunderstanding here?
$simplexml = new SimpleXMLElement(
'<?xml version="1.0"?><concerts />');
$concert1 = $simplexml->addChild('concert');
$concert1->addChild("title", "The Magic Flute");
$concert1->addChild("time", 1329636600);
$concert2 = $simplexml->addChild('concert');
$concert2->addChild("title", "Vivaldi Four Seasons");
$concert2->addChild("time", 1329291000);
$concert3 = $simplexml->addChild('concert');
$concert3->addChild("title", "Mozart's Requiem");
$concert3->addChild("time", 1330196400);
echo $simplexml->asXML();
/* SHOULD output:
<concerts><concert><title>The Magic Flute</title><time>1329636600➥
</time></concert><concert><title>Vivaldi Four Seasons</title>➥
<time>1329291000</time></concert><concert><title>Mozart's Requiem➥
</title><time>1330196400</time></concert></concerts>
*/
Sounds like you're viewing the output of your script via a browser and it's attempting to interpret the document as HTML (the default content-type for PHP scripts).
Add this anywhere before your last line (the echo line)
header('Content-type: text/xml');
Alternatively, if you would like to see this as an HTML document, try this one
echo '<pre>', htmlspecialchars($simplexml->asXML()), '</pre>';
Demo here - http://codepad.viper-7.com/BJ2adH
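A quick way to convince yourself that asXML() does return the full markup, independent of any browser, is to inspect the string directly. This is only a sketch with a single concert; the variable names $markup and $escaped are illustrative.

```php
<?php
$simplexml = new SimpleXMLElement('<?xml version="1.0"?><concerts />');
$concert = $simplexml->addChild('concert');
$concert->addChild('title', 'The Magic Flute');
$concert->addChild('time', '1329636600');

// The raw string still contains every tag
$markup = $simplexml->asXML();

// Escaped copy that is safe to show inside an HTML page
$escaped = htmlspecialchars($markup);

var_dump(strpos($markup, '<title>The Magic Flute</title>') !== false);
echo $escaped, PHP_EOL;
```

Run from the command line, the first line prints bool(true): the tags are there, and it is the browser's HTML rendering that hides them.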

Ampersands in database

I am trying to write a php function that goes to my database and pulls a list of URLS and arranges them into an xml structure and creates an xml file.
The problem is that some of these URLs contain ampersands that ARE HTML encoded. So the database is good, but currently, when my function tries to grab these URLs, the script stops at the ampersands and does not finish.
One example link from the database:
http://www.mysite.com/myfile.php?select=on&amp;league_id=8&amp;sport=15
function buildXML($con) {
//build xml file
$sql = "SELECT * FROM url_links";
$res = mysql_query($sql,$con);
$gameArray = array ();
while ($row = mysql_fetch_array($res))
{
array_push($row['form_link']);
}
$xml = '<?xml version="1.0" encoding="utf-8"?><channel>';
foreach ($gameArray as $link)
{
$xml .= "<item><link>".$link."</link></item>";
}
$xml .= '</channel>';
file_put_contents('../xml/full_rankings.xml',$xml);
}
mysql_close($con);
session_write_close();
If I need to alter the links in the database, that can be done.
You can use PHP's html_entity_decode() on the $link to convert &amp; back to &.
In your XML, you could also wrap the link in <![CDATA[]]> to allow it to contain the characters.
$xml .= "<item><link><![CDATA[" . html_entity_decode($link) . "]]></link></item>";
UPDATE
Just noticed you're actually not putting anything into the $gameArray:
array_push($row['form_link']);
Try:
$gameArray[] = $row['form_link'];
* @Musa looks to have noticed it first, for due credit.
Look at this line
array_push($row['form_link']);
you never put anything in the $gameArray array, it should be
array_push($gameArray, $row['form_link']);
You need to use htmlspecialchars_decode. It will decode any encoded special characters in string passed to it.
This is most likely what you are looking for:
http://www.php.net/manual/en/function.mysql-real-escape-string.php
Read the documentation, there are examples at the bottom of the page...
'&' in oracleSQL and MySQL are used in queries as a logical operator which is why it is tossing an error.
You may also want to decode the HTML...
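Putting the suggestions together, here is a minimal sketch of the XML-building step that does not depend on the database. The $links array stands in for the rows fetched from url_links (an assumption for illustration), and instead of CDATA it escapes each link for XML after decoding, which is another common option.

```php
<?php
// Stand-in for the values of form_link fetched from the database
$links = ['http://www.mysite.com/myfile.php?select=on&amp;league_id=8&amp;sport=15'];

$xml = '<?xml version="1.0" encoding="utf-8"?><channel>';
foreach ($links as $link) {
    // Normalise to a plain URL first, whether or not it was stored HTML encoded...
    $plain = html_entity_decode($link, ENT_QUOTES, 'UTF-8');
    // ...then escape it exactly once for use as XML text
    $xml .= '<item><link>' . htmlspecialchars($plain, ENT_XML1 | ENT_QUOTES, 'UTF-8') . '</link></item>';
}
$xml .= '</channel>';

// Round trip: a parser should hand back the plain URL
$parsed = simplexml_load_string($xml);
$roundTrip = (string) $parsed->item[0]->link;

echo $xml, PHP_EOL;
echo $roundTrip, PHP_EOL;
```

Decoding before escaping means the output is the same for links stored with & and links stored with &amp;, so the database does not need to be altered.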

Reading php files with special tags in php

I have a file which reads as follows
<<row>> 1|test|20110404<</row>>
<<row>> 1|test|20110404<</row>>
<<row>> and <</row>> indicate the start and end of a line. I want to read the line between these tags and also check whether the tags are present.
The first thing you need to do is locate the position of this "tag". The strpos() function does just that.
$tag_pos=strpos('<> 1|test|20110404<> <> 1|test|20110404<>', '<>');
if ($tag_pos===false) {
//The tag was not found!
} else {
//$tag_pos equals the numeric position of the first character of your tag
}
If these are truly lines, an efficient way to get them all is just to split on <>.
$lines=explode('<>', '<> 1|test|20110404<> <> 1|test|20110404<>');
$lines=array_filter($lines); //Removes blank strings from array
You could improve this by adding a callback function to the array_filter() call that uses trim() to remove any whitespace and then see if it is blank or not.
Edit: Great, I see that your "tags" were missing from your post. Since your start and end tags do not match, the code above will be of little use to you. Let me try again...
function strbetweenstrs($source, $tag1, $tag2, $casesensitive=true) {
$results=array(); //collects every string found between $tag1 and $tag2
$whatsleft=$source;
while ($whatsleft!='') {
//stripos() is the case-insensitive counterpart of strpos()
$pos1=$casesensitive ? strpos($whatsleft, $tag1) : stripos($whatsleft, $tag1);
if ($pos1===false) {
break;
}
$start=$pos1+strlen($tag1); //first character after the opening tag
$pos2=$casesensitive ? strpos($whatsleft, $tag2, $start) : stripos($whatsleft, $tag2, $start);
if ($pos2===false) {
break;
}
$results[]=substr($whatsleft, $start, $pos2-$start);
$whatsleft=substr($whatsleft, $pos2+strlen($tag2));
}
return $results;
}
Note that I haven't tested this... but you get the general idea. There is probably a much more efficient way to go about doing it.
Creating your own format is not so hard, but creating a script to read it can be difficult.
The advantage of using standardized formats is that most programming languages have support for them already. For example:
XML: You can use the simplexml_load_string() function, which lets you navigate easily through your content.
$str = '<?xml version="1.0" encoding="utf-8"?>
<data>
<row>1|test|20110404</row>
<row>1|test|20110404</row>
</data>';
$xml = simplexml_load_string($str);
Now you can access your data
echo $xml->row[0];
echo $xml->row[1];
I'm sure you get the idea.
There is also very good support for JSON (JavaScript Object Notation) using the json_decode() function.
Check it on php.net for more details.
I would suggest using preg_match:
preg_match('#<<row>>(.*?)<</row>>#', $line, $matches);
if( ! empty($matches))
{
// line was found
print_r( $matches[1] ); // will contain the content between the start and end row tags
}
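For a whole file rather than one line, preg_match_all() can collect every row in one pass. The sketch below is only an illustration: $contents stands in for the result of file_get_contents() on the real file, and it assumes the fields inside a row are separated by "|" as in the sample.

```php
<?php
// Stand-in for file_get_contents('yourfile') with the two sample rows
$contents = "<<row>> 1|test|20110404<</row>>\n<<row>> 1|test|20110404<</row>>\n";

// Non-greedy capture between each opening and closing tag; "s" lets a row span lines
$count = preg_match_all('#<<row>>(.*?)<</row>>#s', $contents, $matches);

$rows = [];
foreach ($matches[1] as $line) {
    // Split each row into its pipe-separated fields
    $rows[] = explode('|', trim($line));
}

if ($count === 0) {
    echo 'No <<row>> tags were found', PHP_EOL;
} else {
    print_r($rows);
}
```

A count of zero doubles as the "are the tags present" check asked for in the question.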

PHP SimpleXML get innerXML

I need to get the HTML contents of answer in this bit of XML:
<qa>
<question>Who are you?</question>
<answer>Who who, <strong>who who</strong>, <em>me</em></answer>
</qa>
So I want to get the string "Who who, <strong>who who</strong>, <em>me</em>".
If I have the answer as a SimpleXMLElement, I can call asXML() to get "<answer>Who who, <strong>who who</strong>, <em>me</em></answer>", but how to get the inner XML of an element without the element itself wrapped around it?
I'd prefer ways that don't involve string functions, but if that's the only way, so be it.
function SimpleXMLElement_innerXML($xml)
{
$innerXML= '';
foreach (dom_import_simplexml($xml)->childNodes as $child)
{
$innerXML .= $child->ownerDocument->saveXML( $child );
}
return $innerXML;
};
This works (although it seems really lame):
echo (string)$qa->answer;
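It may help to run the DOM-based function and the string cast side by side on the sample from the question. This is a sketch under the assumption that <qa> is the document root; innerXmlViaDom() is just a renamed copy of the function above so the example is self-contained.

```php
<?php
// Same approach as SimpleXMLElement_innerXML(): serialise each child node in turn
function innerXmlViaDom(SimpleXMLElement $element): string
{
    $dom = dom_import_simplexml($element);
    $inner = '';
    foreach ($dom->childNodes as $child) {
        $inner .= $dom->ownerDocument->saveXML($child);
    }
    return $inner;
}

$qa = simplexml_load_string(
    '<qa><question>Who are you?</question>'
    . '<answer>Who who, <strong>who who</strong>, <em>me</em></answer></qa>'
);

$inner = innerXmlViaDom($qa->answer);  // keeps the nested markup
$textOnly = (string) $qa->answer;      // the cast yields text only, no tags

echo $inner, PHP_EOL;
echo $textOnly, PHP_EOL;
```

The cast is convenient when the element holds plain text, but for mixed content like this it does not return the child elements.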
To the best of my knowledge, there is no built-in way to get that. I'd recommend trying SimpleDOM, which is a PHP class extending SimpleXMLElement that offers convenience methods for most of the common problems.
include 'SimpleDOM.php';
$qa = simpledom_load_string(
'<qa>
<question>Who are you?</question>
<answer>Who who, <strong>who who</strong>, <em>me</em></answer>
</qa>'
);
echo $qa->answer->innerXML();
Otherwise, I see two ways of doing that. The first would be to convert your SimpleXMLElement to a DOMNode then loop over its childNodes to build the XML. The other would be to call asXML() then use string functions to remove the root node. Attention though, asXML() may sometimes return markup that is actually outside of the node it was called from, such as XML prolog or Processing Instructions.
The most straightforward solution is to implement a custom innerXML getter with SimpleXML:
function simplexml_innerXML($node)
{
$content="";
foreach($node->children() as $child)
$content .= $child->asXml();
return $content;
}
In your code, replace $body_content = $el->asXml(); with $body_content = simplexml_innerXML($el);
However, you could also switch to another API that distinguishes between innerXML (what you are looking for) and outerXML (what you get for now). The Microsoft DOM library offers this distinction but unfortunately PHP DOM doesn't.
I found that the PHP XMLReader API offers this distinction. See readInnerXML(). This API takes quite a different approach to processing XML, though. Try it.
Finally, I would stress that XML is not meant for extracting data as subtrees but rather as values. That's why you are running into trouble finding the right API. It would be more 'standard' to store the HTML subtree as a value (and escape all tags) rather than as an XML subtree. Also beware that some HTML syntax is not always XML compatible ( i.e. vs , ). Anyway, in practice, your approach is definitely more convenient for editing the XML file.
I would extend the SimpleXMLElement class:
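The XMLReader route mentioned above could look roughly like this. It is a sketch only, using the sample document from the question; XMLReader is a forward-only cursor, so the loop advances until it sits on the <answer> element and then asks for its inner markup.

```php
<?php
$xml = '<qa><question>Who are you?</question>'
    . '<answer>Who who, <strong>who who</strong>, <em>me</em></answer></qa>';

$reader = new XMLReader();
$reader->XML($xml); // load the document from a string

$inner = null;
while ($reader->read()) {
    // Stop at the opening <answer> tag and grab everything inside it
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'answer') {
        $inner = $reader->readInnerXml();
        break;
    }
}
$reader->close();

echo $inner, PHP_EOL;
```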
class MyXmlElement extends SimpleXMLElement{
final public function innerXML(){
$tag = $this->getName();
$value = $this->__toString();
if('' === $value){
return null;
}
return preg_replace('!<'. $tag .'(?:[^>]*)>(.*)</'. $tag .'>!Ums', '$1', $this->asXml());
}
}
and then use it like this:
echo $qa->answer->innerXML();
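For the method to be available, the document has to be loaded with the subclass, which simplexml_load_string() accepts as its second argument. The sketch below uses a hypothetical class name, InnerXmlElement, and a simpler string-slicing body than the regex above, purely to keep the example short.

```php
<?php
// Hypothetical subclass for illustration; child elements are instances of it too
class InnerXmlElement extends SimpleXMLElement
{
    public function innerXML(): string
    {
        $outer = $this->asXML();
        $start = strpos($outer, '>') + 1; // just past the opening tag
        $end = strrpos($outer, '<');      // start of the closing tag
        // A self-closing element such as <answer/> has nothing inside
        return $end > $start ? substr($outer, $start, $end - $start) : '';
    }
}

$qa = simplexml_load_string(
    '<qa><question>Who are you?</question>'
    . '<answer>Who who, <strong>who who</strong>, <em>me</em></answer></qa>',
    'InnerXmlElement' // tell SimpleXML which class to instantiate
);

$inner = $qa->answer->innerXML();
echo $inner, PHP_EOL;
```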
<?php
function getInnerXml($xml_text) {
//strip the first element
//check if the strip tag is empty also
$xml_text = trim($xml_text);
$s1 = strpos($xml_text,">");
$s2 = trim(substr($xml_text,0,$s1)); //get the head with ">" and trim (note that string is indexed from 0)
if ($s2[strlen($s2)-1]=="/") //tag is empty
return "";
$s3 = strrpos($xml_text,"<"); //get last closing "<"
return substr($xml_text,$s1+1,$s3-$s1-1);
}
var_dump(getInnerXml("<xml />"));
var_dump(getInnerXml("<xml / >faf < / xml>"));
var_dump(getInnerXml("<xml >< / xml>"));
var_dump(getInnerXml("<xml>faf < / xml>"));
var_dump(getInnerXml("<xml > faf < / xml>"));
?>
After searching for a while, I found no satisfying solution, so I wrote my own function.
This function gets the exact innerXml content (including whitespace, of course).
To use it, pass it the result of asXML(), like this: getInnerXml($e->asXML()). This function works for elements with many prefixes as well (as in my case, since I could not find any existing method that converts all child nodes with different prefixes).
Output:
string '' (length=0)
string '' (length=0)
string '' (length=0)
string 'faf ' (length=4)
string ' faf ' (length=6)
function get_inner_xml(SimpleXMLElement $SimpleXMLElement)
{
$element_name = $SimpleXMLElement->getName();
$inner_xml = $SimpleXMLElement->asXML();
$inner_xml = str_replace('<'.$element_name.'>', '', $inner_xml);
$inner_xml = str_replace('</'.$element_name.'>', '', $inner_xml);
$inner_xml = trim($inner_xml);
return $inner_xml;
}
If you don't want to strip CDATA section, comment out lines 6-8.
function innerXML($i){
$text=$i->asXML();
$sp=strpos($text,">");
$ep=strrpos($text,"<");
$text=trim(($sp!==false && $sp<=$ep)?substr($text,$sp+1,$ep-$sp-1):'');
$sp=strpos($text,'<![CDATA[');
$ep=strrpos($text,"]]>");
$text=trim(($sp==0 && $ep==strlen($text)-3)?substr($text,$sp+9,-3):$text);
return($text);
}
You can just use this function :)
function innerXML( $node )
{
$name = $node->getName();
return preg_replace( '/((<'.$name.'[^>]*>)|(<\/'.$name.'>))/UD', "", $node->asXML() );
}
Here is a very fast solution I created:
function InnerHTML($Text)
{
// take everything between the first '>' and the last '<'
$PosStart = strpos($Text,'>')+1;
return substr($Text, $PosStart, strrpos($Text,'<')-$PosStart);
}
echo InnerHTML($yourXML->qa->answer->asXML());
Using a regex you could do this:
preg_match('/<answer[^>]*>(.*?)<\/answer>/s', $xml, $match);
$result=$match[1]; // the first capture group holds the inner XML
print_r($result);
