Parsing XML with keys that have spaces in their names - php

Does anyone know how to parse an xml string in php using SimpleXMLElement when the key has a space in it ?
For example,
$xmlString = "<test><this is>a</this is></test>";
$xml = new SimpleXMLElement($xmlString);
In the above example, 'this is' causes the parser to go bananas. I'm guessing it's because it treats 'is' as an attribute and is expecting a value for it?
(The same thing happens if the key is a number, like '1'.)

That's because neither is well-formed XML: an element name cannot contain spaces and cannot start with a digit. You're probably better off running a replacement function on your XML string first, converting <(\d+)> to <el_$1> and replacing spaces in element names with underscores.
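A minimal sketch of such a replacement function, assuming the broken tags carry no attributes and are not self-closing. The name sanitizeTagNames and the el_ prefix for numeric names are only illustrative:

```php
<?php
// Hypothetical helper: rewrites tag names so that SimpleXMLElement accepts them.
// Assumption: tags have no attributes, otherwise the spaces inside them would be mangled too.
function sanitizeTagNames(string $xml): string
{
    return preg_replace_callback(
        '#<(/?)([^<>/]+)>#',
        function (array $m): string {
            // Replace runs of whitespace inside the name with a single underscore
            $name = preg_replace('/\s+/', '_', trim($m[2]));
            // Names may not start with a digit, so prefix them
            if (preg_match('/^\d/', $name)) {
                $name = 'el_' . $name;
            }
            return '<' . $m[1] . $name . '>';
        },
        $xml
    );
}

$fixed = sanitizeTagNames('<test><this is>a</this is><1>b</1></test>');
$xml = new SimpleXMLElement($fixed);
echo $xml->this_is, "\n"; // element formerly named "this is"
echo $xml->el_1, "\n";    // element formerly named "1"
```

If the original names matter later, the mapping has to be remembered separately, since the rewrite is not reversible on its own.');
assert((string) $xml->this_is === 'a');
assert((string) $xml->el_1 === 'b');
</test>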


php regular expression breaks

I have the following string in an HTML page.
BookSelector.load([{"index":25,"label":"Science","booktype":"pdf","payload":"<script type=\"text\/javascript\" charset=\"utf-8\" src=\"\/\/www.\/libs\/js\/books.min.js\" publisher_id=\"890\"><\/script>"}]);
I want to find the src and the publisher_id in the string.
For this I'm trying the following code:
$regex = '#\BookSelector.load\(.*?src=\"(.*?)\"}]\)#s';
preg_match($regex, $html, $matches);
$match = $matches[1];
But it's always returning null.
What would my regex be to select the src only?
What would my regex be if I need to parse the whole string inside BookSelector.load()?
Why your regex isn't working?
First, I'll answer why your regex isn't working:
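You're using \B in your regex. In \BookSelector, the \B is not a literal B but an assertion that matches any position that is not a word boundary (\b), so the pattern really looks for "ookSelector". That happens to match inside "BookSelector", but it is not what you want; write BookSelector\.load instead.
Your original text contains escaped quotes (\"), but your regex doesn't account for the backslashes: the text has src=\" while the pattern only matches src=", so the match fails.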
The correct approach to solve this problem
Split this task into several parts, and solve it one by one, using the best tool available.
The data you need is encapsulated within a JSON structure. So the first step is obviously to extract the JSON content. For this purpose, you can use a regex.
Once you have the JSON content, you need to decode it to get the data in it. PHP has a built-in function for that purpose: json_decode(). Use it with the input string and set the second parameter as true, and you'll have a nice associative array.
Once you have the associative array, you can easily get the payload string, which contains the <script> tag contents.
If you're absolutely sure that the order of attributes will always be the same, you can use a regex to extract the required information. If not, it's better to use an HTML parser such as PHP's DOMDocument to do this.
The whole code for this looks like:
// Extract the JSON string from the whole block of text
if (preg_match('/BookSelector\.load\((.*?)\);/s', $text, $matches)) {
    // Get the JSON string and decode it using json_decode()
    $json = $matches[1];
    $content = json_decode($json, true)[0]['payload'];
    // Use DOMDocument to load the string, and get the required values
    $dom = new DOMDocument;
    $dom->loadHTML($content);
    $script_tag = $dom->getElementsByTagName('script')->item(0);
    $script_src = $script_tag->getAttribute('src');
    $publisher_id = $script_tag->getAttribute('publisher_id');
    var_dump($script_src, $publisher_id);
}
Output:
string(40) "//www."
string(3) "890"
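For the case mentioned above where the attribute order is guaranteed, the regex alternative could be sketched as follows. The $payload value is a hypothetical stand-in for the decoded payload string (the example.com host is made up), and the pattern assumes publisher_id directly follows src:

```php
<?php
// Hypothetical payload, as it would look after json_decode()
$payload = '<script type="text/javascript" charset="utf-8" '
    . 'src="//example.com/libs/js/books.min.js" publisher_id="890"></script>';

$src = null;
$publisherId = null;

// Only works while src is immediately followed by publisher_id
if (preg_match('/\bsrc="([^"]*)"\s+publisher_id="([^"]*)"/', $payload, $m)) {
    $src = $m[1];
    $publisherId = $m[2];
}

var_dump($src, $publisherId);
```

As soon as the attribute order can change, this breaks silently, which is why the DOMDocument route is the safer default.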

regex to explode a string json into values

I'm trying to build a PHP regex that validates this type of input string:
{name:'something name here',type:'',id:''},{name:'other name',type:'small',id:34},{name:'orange',type:'weight',id:28}
So it is a list of JSON-like objects that each contain 3 fields: name, type, id. The name field is always present, while type and id can both be empty strings (''). If the string has a valid format, I can then explode it by comma and obtain an array of JSON strings.
How can I do this?
It isn't valid JSON, as you can tell, but I have an input field where the user enters tags, and I want to track the name, type and id of those tags.
tag1 (has name, type, id), tag2 (has only name), tag3 (has name, type, id).
So I think that I can post a string in this format:
{'name':'test','type':'first','id':3},{'name':'other','type':'second','id':45}, etc
But I must validate this string with a regex. I can do
$data = explode(',',$list);
and then:
foreach($data as $d){
    $tmp = json_decode($d);
    if($tmp == false) echo 'error invalid data';
}
As Gubo pointed out: this is not a valid JSON encoded string. If the actual data you want to process in your script is valid, however, you're barking up the wrong tree looking for a regular expression... PHP has built-in functions that will parse JSON strings much faster than a regular expression.
$string1 = "{name:'something name here',type:'',id:''},{name:'othername',type:'small',id:34},{name:'orange',type:'weight',id:28}";
$string2 = '[{"name":"something name here","type":"","id":""},{"name":"othername","type":"small","id":"34"},{"name":"orange","type":"weight","id":"28"}]';
Where $string2 is the data in valid JSON format. If your data is a valid JSON string, the following code will suffice:
$parsed = json_decode($string2);
// $parsed[0]->name returns 'something name here' (pass true as the second argument to get arrays instead)
If, however, you're dealing with invalid JSON strings, things get a bit more complicated... First off: if your object properties (or array keys, as they will become in PHP) are quoted and you're only lacking the enclosing brackets, a quick fix would be this:
$parsed = json_decode('['.$string1.']');
If you really want to parse them separately:
$separated= preg_split('/(?<=[\}]),/',$string1);
But I can't see why you would want to do that. The biggest issue here is the absence of quotes on the property strings (or keys). I have put together a regex (untested) that could quote those strings:
$parsed = json_decode('['.preg_replace('/(?<=[\{,])([a-z]+)(?=:)/','"$1"',str_replace('\'','"',$string1)).']');
Keep in mind, the last regex is untested, so it might not perform as you expect it to... but it should help you on your way... for the last example, the same rules apply for all the other examples I gave: if the quotes and brackets are there, just use json_decode, if the brackets are missing, add them, too...
It's getting rather late here, so I'm off to bed now... I hope this answer isn't packed with typo's and sentences that nobody can understand. If it is, I do apologize.
You don't need a regex for that. Just use this:
var_dump(json_decode($json, true));
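To validate the posted list rather than just decode it, something like the following sketch could work. It assumes the client sends properly quoted JSON objects separated by commas (the format of $string2 without the outer brackets); parseTagList is a hypothetical helper name:

```php
<?php
// Hypothetical helper: returns the decoded tags, or null when the input is invalid.
function parseTagList(string $list): ?array
{
    // Wrap the comma-separated objects in brackets to form a JSON array
    $items = json_decode('[' . $list . ']', true);
    if (!is_array($items)) {
        return null; // not valid JSON at all
    }
    foreach ($items as $item) {
        // Every tag must have exactly the keys name, type and id (in that order),
        // and name must be a non-empty string
        if (!is_array($item)
            || array_keys($item) !== ['name', 'type', 'id']
            || !is_string($item['name'])
            || $item['name'] === '') {
            return null;
        }
    }
    return $items;
}

$tags = parseTagList('{"name":"test","type":"first","id":3},{"name":"other","type":"","id":""}');
$bad  = parseTagList("{name:'test',type:'first',id:3}");
```

Here $tags is an array of two associative arrays, while $bad is null because the keys are unquoted. No regular expression is involved; json_decode() does the syntax check and the loop checks the shape.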

How do I troubleshoot "simplexml_load_file() parser error: Entity 'nbsp' not defined"?

I use PHP to generate XML files. I use the code below to avoid errors:
$str = str_ireplace(array('<','>','&','\'','"'),array('&lt;','&gt;','&amp;','&apos;','&quot;'),$str);
but it still fails with:
simplexml_load_file() [function.simplexml-load-file] *[file name]* parser error : Entity 'nbsp' not defined in *[file name] [line]*
The error text here:
Dallas Dallas () is the third-largest city in Texas and the ninth-largest in the United States.
In IE8, it seems to fail at (&nbsp;). So which characters do I need to watch out for?
HTML-specific entities (&nbsp; in this case) are not valid XML entities, and that is what simplexml complains about; it reads the file as XML (not HTML) and finds entities which are not valid. You need to convert HTML entities back to their character representation first (you can use html_entity_decode() to do that):
$str = "some string containing html";
// this line will convert back html entities to regular characters
$str = html_entity_decode($str, ...);
// now convert special characters to their XML entities (& first, so the other entities are not escaped twice)
$str = str_ireplace(array('&','<','>','\'','"'),array('&amp;','&lt;','&gt;','&apos;','&quot;'),$str);
Note that if you use htmlentities() on your string before saving it in the XML, then that is the source of your problem (you are converting characters to their respective HTML entities, which simplexml does not recognize as XML entities).
// this won't work, the html entities it will uses are not valid xml entities
$str = htmlentities($str, ...)
If you have trouble understanding it, think of it as two different languages, like Spanish (HTML) and English (XML): a valid word in Spanish (&nbsp;) isn't necessarily valid in English, no matter the similarities between the two languages.
&nbsp; is an HTML entity, but doesn't exist in XML.
Either get rid of it (you're not saying where it comes from, so it's hard to give any more specific advice), or wrap your HTML data in CDATA blocks so the parser ignores them.
&nbsp; is a non-breaking space. You have to replace it.
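The decode-then-escape approach described above could be sketched like this. It swaps the manual str_ireplace() call for htmlspecialchars() with the ENT_XML1 flag, which is a substitution on my part rather than what the answers used. The sample string and the <city> wrapper element are made up for illustration:

```php
<?php
// Hypothetical input containing HTML entities and raw special characters
$str = 'Dallas&nbsp;&amp; "Fort Worth" <b>';

// 1. Turn HTML entities such as &nbsp; back into real characters
$decoded = html_entity_decode($str, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// 2. Escape only the five characters XML cares about
$xmlSafe = htmlspecialchars($decoded, ENT_XML1 | ENT_QUOTES, 'UTF-8');

// The result can now be embedded in an XML document
$xml = simplexml_load_string('<city>' . $xmlSafe . '</city>');
var_dump($xml !== false);
```

The non-breaking space survives as an actual UTF-8 character, which XML has no problem with; only the named entity was the issue.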

Regular expression to match ">", "<", "&" chars that appear inside XML nodes

I'm trying to write a regular expression using the PCRE library in PHP.
I need a regex to match only &, > and < chars that exist within string part of any XML node and not the tag declaration themselves.
Input XML:
<cnode>This string contains > and < and & chars.</cnode>
The idea is to do a search and replace on these chars and convert them to their XML entity equivalents.
If I was to convert the entire XML to entities the XML would look like this:
Entire XML converted to entities
&lt;cnode&gt;This string contains &gt; and &lt; and &amp; chars.&lt;/cnode&gt;
I need it to look like this:
Correct XML
<cnode>This string contains &gt; and &lt; and &amp; chars.</cnode>
I have tried to write a regular expression to match these chars using look-ahead but I don't know enough to get this to work. My attempt (currently only attempting to match > symbols):
Just to make it clear, the XML I'm trying to fix comes from a 3rd party and they seem unable to fix it on their end, hence my attempt to fix it.
In the end I've opted to use the Tidy library in PHP. The code I used is shown below:
// Specify configuration
$config = array(
'input-xml' => true,
'show-warnings' => false,
'numeric-entities' => true,
'output-xml' => true);
$tidy = new tidy();
$tidy->parseFile('feed.xml', $config, 'latin1');
This works perfectly correcting all the encoding errors and converting invalid characters to XML entities.
Classic example of garbage in, garbage out. The real solution is to fix the broken XML exporter, but obviously that's out of the scope of your problem. Sounds like you might have to manually parse the XML, run htmlentities() on the contents, then put the XML tags back.
I'm reasonably certain it's simply not possible. You need something that keeps track of nesting, and there's no way to get a regular expression to track nesting. Your choices are to fix the text first (when you probably can use an RE) or use something that's at least vaguely like an XML parser, specifically to the extent of keeping track of how the tags are nested.
There's a reason XML demands that these characters be escaped though -- without that, you can only guess about whether something is really a tag or not. For example, given something like:
<tag>Text containing < and > characters</tag>
you and I can probably guess that the result should be: ...containing &lt; and &gt;... but I'm pretty sure the XML specification allows the extra whitespace, so officially "< and >" should be treated as a tag. You could, I suppose, assume that anything that looks like an un-matched tag really isn't intended to be a tag, but that's going to take some work too.
Would it be possible to intercept the text before it tries to become part of your XML? A few ounces of prevention might be worth pounds of cure.
This should do it for ampersands:
This means you're only looking for those characters when they have whitespace characters on both sides.
Just make sure the replacement expression is "$1$2amp;$3";
The others would go like this, with their replacement expressions on the right
/(\s+)(>)(\s+)/gim "$1&gt;$3"
/(\s+)(<)(\s+)/gim "$1&lt;$3"
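In PHP, the whitespace-delimited replacement described above could be sketched as a single call. This is my own condensed variant using lookarounds, not the exact expressions from the answer, and escapeLooseSpecialChars is a hypothetical name:

```php
<?php
// Hypothetical helper: escapes &, < and > only when they have whitespace on both sides.
function escapeLooseSpecialChars(string $xml): string
{
    return preg_replace_callback(
        '/(?<=\s)[&<>](?=\s)/',
        function (array $m): string {
            // htmlspecialchars() maps & to &amp;, < to &lt; and > to &gt;
            return htmlspecialchars($m[0], ENT_XML1);
        },
        $xml
    );
}

$fixed = escapeLooseSpecialChars('<cnode>This string contains > and < and & chars.</cnode>');
echo $fixed, "\n";
```

Characters that touch other text, as in a<b, are left alone by this pattern, so it only covers the simple case.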
As stated by others, regular expressions don't do well with hierarchical data. Besides, if the data is improperly formatted, you can't guarantee that you'll get it right. Consider:
<tag>Something<br/>Something Else</tag>
Is that <br/> supposed to read &lt;br/&gt;? There's no way to know because it's validly formatted XML.
If you have arbitrary data that you wish to include in your XML tree, consider using a <![CDATA[ ... ]]> block instead. It's treated the same as a text node, and the only thing you can't include in it as-is is the character sequence ]]>.
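When you generate the XML yourself, a CDATA block can be produced with DOMDocument instead of string concatenation. A minimal sketch, reusing the cnode element name from the question:

```php
<?php
$text = 'This string contains > and < and & chars.';

$dom = new DOMDocument('1.0', 'UTF-8');
$node = $dom->createElement('cnode');
// createCDATASection() wraps the text in <![CDATA[ ... ]]> without escaping it
$node->appendChild($dom->createCDATASection($text));
$dom->appendChild($node);

// Serialize just the element, without the XML declaration
$xml = $dom->saveXML($node);
echo $xml, "\n";

// Reading it back yields the original text
$roundTrip = (string) simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);
```

This does not help with the third-party feed in the question, since that XML is already broken by the time it arrives.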
What you have there is not, of course, XML. In XML, the characters '<' and '&' may not occur (unescaped) inside text: only inside a comment, CDATA section, or processing instruction. Actually, '>' can occur in text, except as part of the string ']]>'. In well-formed XML, literal '<' and '&' characters signal the start of markup: '<' signals the start of a start tag, end tag, or empty element tag, and '&' signals the start of an entity reference. In both these cases, the next character may NOT be whitespace. So using an RE like Robusto's suggestion would find all such occurrences. You might also need to catch corner cases like '<<', '<\', or '&<'. In this case you don't need to try to parse your input, an RE will work fine.
If the source contains strings like '<something ' where 'something' matches the production for a Name:
Name ::= NameStartChar (NameChar)*
Then you have more of a problem. You are going to have to (try to) parse your input as if it were real XML, and detect the error cases of malformed Names, non-matching start & end tags, malformed attributes, and undefined entity references (to name a few). Unfortunately the error condition isn't guaranteed to happen at the location of the error.
Your best bet may be to use an RE to catch 90% of the errors and fix the rest manually. You need to look for a '<' or '&' followed by anything other than a NameStartChar.
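A rough sketch of that last idea in PHP. The character classes are an ASCII-only approximation of NameStartChar and NameChar (the real productions also allow many non-ASCII characters), and the pattern for '<' additionally lets '/', '!' and '?' through so end tags, comments and processing instructions survive:

```php
<?php
// Hypothetical helper: escapes & and < that cannot be the start of markup.
function escapeStrayMarkupChars(string $xml): string
{
    return preg_replace(
        [
            // & that does not start a named or numeric reference
            '/&(?!(?:[A-Za-z_:][\w:.-]*|#\d+|#x[0-9A-Fa-f]+);)/',
            // < that is not followed by a name start character, /, ! or ?
            '/<(?![A-Za-z_:\/!?])/',
        ],
        ['&amp;', '&lt;'],
        $xml
    );
}

$fixed = escapeStrayMarkupChars('<cnode>1 < 2 & 3 &amp; 4</cnode>');
echo $fixed, "\n";
```

The ampersand pattern runs first so that the &lt; inserted by the second pattern is not escaped again. Input such as '<something ' followed by plain text still slips through, which is the remaining portion that needs manual fixing.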

xsl to php array

I have an XML file that contains hierarchical data. Now I need to get some of that data into a PHP array. I use XSL to get the data I want formatted as a PHP array. But when I print it, it keeps all the tabs, extra spaces, line breaks and so on, which I need to get rid of to turn it into a flat string (I suppose!) and then convert that string into an array.
In the xsl I output as text and have indent="no" (which does nothing). I've tried to strip \t \n \r etc but it doesn't affect the output at all.
Is there a really good php function out there that can strip out all formatting except single spaces? Or is there more going on here I don't know about or another way of doing the same thing?
First off, using XSL output to form your PHP array is fairly inelegant and inefficient. I would highly suggest going with something like the DOMDocument class available in PHP. If you must stick with your current method, try using regular expressions to remove any unnecessary whitespace:
$string = preg_replace('/\s+/', '', $string);
to remove all whitespace, or
$string = preg_replace('/\s\s+/', ' ', $string);
to preserve single white space.
I've created a class for an open-source library that you're welcome to use, and to look at as an example of how to create an array from XML (and just take out the "good" parts).
So the crux of the problem is probably keeping the data in XML as long as possible. After the XSL transformation you would then have something like:
<data>value
with newline</data>
<data>with lots of whitespace</data>
Then you could loop through that data like:
$xml = simplexml_load_string($xml_string);
$data_array = array();
foreach ($xml as $data) {
    // use str_replace or a regular expression to replace the values...
    $data_array[] = str_replace(array(" ", "\n"), "", (string) $data);
}
// $data_array is the array you want!
However, if you can't output the XSL into XML and then loop through it, you may want to use XSL to create a JSON string and convert that to an array, so the XSL output would look like:
"0" : "value
with newline",
"1" : "with lots of whitespace"
Then you could loop through that data like:
$json_array = json_decode($json_string, TRUE); // the TRUE is to make an array
foreach ($json_array as $key => $value) {
    // use str_replace or a regular expression to replace the values...
    $json_array[$key] = str_replace(array(" ", "\n"), "", $value);
}
Either way you'll have to pull the values in PHP because XSLT's handling of spaces and newlines is pretty rudimentary.
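A small sketch of the "keep it in XML, clean up in PHP" route. The <root> wrapper and the sample values are assumptions for illustration; unlike the str_replace() calls above, this collapses whitespace to single spaces instead of removing it entirely:

```php
<?php
// Hypothetical XSL output with stray newlines and runs of spaces
$xmlString = "<root>\n"
    . "  <data>value\n   with newline</data>\n"
    . "  <data>with    lots   of whitespace</data>\n"
    . "</root>";

$values = [];
foreach (simplexml_load_string($xmlString)->data as $data) {
    // Collapse every run of whitespace (tabs, newlines, spaces) into one space
    $values[] = trim(preg_replace('/\s+/', ' ', (string) $data));
}

print_r($values);
```

On the XSLT side, the normalize-space() function does much the same per value, if changing the stylesheet is an option.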
