I have an XML file that contains hierarchical data, and I need to get some of that data into a PHP array. I use XSL to output the data I want, formatted as a PHP array. But when I print it, the output keeps all the tabs, extra spaces, and line breaks, which I need to get rid of so I can turn it into a flat string (I suppose!) and then convert that string into an array.
In the XSL I output as text and have indent="no" (which does nothing). I've tried to strip \t, \n, \r, etc., but it doesn't affect the output at all.
Is there a really good PHP function out there that can strip out all formatting except single spaces? Or is there more going on here that I don't know about, or another way of doing the same thing?
First off, using XSL output to form your PHP array is fairly inelegant and inefficient. I would highly suggest going with something like the DOMDocument class available in PHP (http://www.php.net/manual/en/class.domdocument.php). If you must stick with your current method, try using regular expressions to remove any unnecessary whitespace.
$string = preg_replace('/\s+/', '', $string);
or
$string = preg_replace('/\s\s+/', ' ', $string);
to preserve single white space.
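A minimal sketch of the DOMDocument route, assuming a hypothetical document shape (an `<items>` root containing `<item>` elements), since the real XML is not shown here. It reads each element's text and collapses every run of whitespace to a single space:

```php
<?php
// Hypothetical input: the element names are assumptions for illustration.
$xmlString = "<items>\n\t<item>value\n\t\twith newline</item>\n\t<item>with   lots   of   whitespace</item>\n</items>";

$dom = new DOMDocument();
$dom->loadXML($xmlString);

$values = [];
foreach ($dom->getElementsByTagName('item') as $item) {
    // Collapse every run of whitespace to one space, then trim the ends.
    $values[] = trim(preg_replace('/\s+/', ' ', $item->textContent));
}

print_r($values);
```

This skips the intermediate "PHP array as text" step entirely, so there is no formatting left to strip afterwards.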
I've created a class for an open-source library that you're welcome to use, and to look at as an example of how to create an array from XML (and just take out the "good" parts).
USING XML
So the crux of the problem is probably keeping the data in XML as long as possible. After the XSL transformation you would then have something like:
<xml>
<data>value
with newline
</data>
<data>with lots of whitespace</data>
</xml>
Then you could loop through that data like:
$xml = simplexml_load_string($xml_string);
foreach($xml as $data)
{
// collapse each run of whitespace (spaces, tabs, newlines) to a single space
$data_array[] = trim(preg_replace('/\s+/', ' ', (string) $data));
}
// $data_array is the array you want!
USING JSON
However, if you can't output XML from the XSL and then loop through it, you may want to use XSL to create a JSON object string and convert that to an array. The XSL output would look like:
{
"0" : "value
with newline",
"1" : "with lots of whitespace"
}
Then you could loop through that data like:
$json_array = json_decode($json_string, TRUE); // the TRUE is to make an array
foreach($json_array as $key => $value)
{
// collapse each run of whitespace (spaces, tabs, newlines) to a single space
$json_array[$key] = trim(preg_replace('/\s+/', ' ', $value));
}
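One caveat with the JSON route: a raw line break inside a JSON string is not valid JSON, and json_decode() returns null for it. A minimal sketch, assuming the XSL can emit line breaks as the `\n` escape (or keep each value on one line); the sample string is made up for illustration:

```php
<?php
// Assumed input: the newline is written as the two-character escape \n,
// because a raw line break inside a JSON string makes json_decode() fail.
$jsonString = '{"0": "value\n   with newline", "1": "with   lots   of   whitespace"}';

$data = json_decode($jsonString, true); // true => associative array
if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

foreach ($data as $key => $value) {
    // Collapse whitespace runs (including the decoded newline) to one space.
    $data[$key] = trim(preg_replace('/\s+/', ' ', $value));
}

print_r($data);
```

Checking json_last_error() makes a malformed string fail loudly instead of silently producing null.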
Either way you'll have to pull the values in PHP because XSLT's handling of spaces and newlines is pretty rudimentary.
I have the following string in an HTML page:
BookSelector.load([{"index":25,"label":"Science","booktype":"pdf","payload":"<script type=\"text\/javascript\" charset=\"utf-8\" src=\"\/\/www.192.168.10.85\/libs\/js\/books.min.js\" publisher_id=\"890\"><\/script>"}]);
I want to find the src and the publisher_id in the string.
For this I'm trying the following code:
$regex = '#\BookSelector.load\(.*?src=\"(.*?)\"}]\)#s';
preg_match($regex, $html, $matches);
$match = $matches[1];
But it always returns null.
What would my regex be to select only the src?
What would my regex be if I need to parse the whole string inside BookSelector.load();?
Why isn't your regex working?
First, I'll answer why your regex isn't working:
You're using \B in your regex (the pattern starts with \BookSelector). \B matches any position that is not a word boundary (\b), which is not what you want. When that condition fails, the entire regex fails.
Your original text contains escaped quotes, but your regex doesn't account for those.
The correct approach to solve this problem
Split this task into several parts, and solve them one by one, using the best tool available for each.
The data you need is encapsulated within a JSON structure. So the first step is obviously to extract the JSON content. For this purpose, you can use a regex.
Once you have the JSON content, you need to decode it to get the data in it. PHP has a built-in function for that purpose: json_decode(). Use it with the input string and set the second parameter as true, and you'll have a nice associative array.
Once you have the associative array, you can easily get the payload string, which contains the <script> tag contents.
If you're absolutely sure that the order of attributes will always be the same, you can use a regex to extract the required information. If not, it's better to use an HTML parser such as PHP's DOMDocument to do this.
The whole code for this looks like:
// Extract the JSON string from the whole block of text
if (preg_match('/BookSelector\.load\((.*?)\);/s', $text, $matches)) {
    // Get the JSON string and decode it using json_decode()
    $json = $matches[1];
    $content = json_decode($json, true)[0]['payload'];
    // Use DOMDocument to load the string, and get the required values
    $dom = new DOMDocument;
    $dom->loadHTML($content);
    $script_tag = $dom->getElementsByTagName('script')->item(0);
    $script_src = $script_tag->getAttribute('src');
    $publisher_id = $script_tag->getAttribute('publisher_id');
    var_dump($script_src, $publisher_id);
}
Output:
string(40) "//www.192.168.10.85/libs/js/books.min.js"
string(3) "890"
<?php
$html = file_get_contents('http://hypermedia.ids-mannheim.de/');
?>
This code returns the HTML of the website as a string. How do I separate the string into individual words? After getting the words in an array, I would like to detect which ones are German...
$words = explode(' ', strip_tags($html));
or
$words = preg_split("/[\s,]+/", strip_tags($html));
The second one will consider not just the space character as a delimiter, but tabs and commas as well.
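For German text it can help to split on anything that is not a letter, so that punctuation is dropped and umlauts stay inside words. A minimal sketch, assuming the page is UTF-8 encoded; the sample markup is made up for illustration:

```php
<?php
// Hypothetical sample; in the question this would be the fetched page.
$html = '<p>Guten Tag, schöne Grüße aus&nbsp;Mannheim!</p>';

// Strip tags, then decode entities such as &nbsp; into real characters.
$text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Split on runs of non-letters; the /u flag makes \p{L} Unicode-aware.
// PREG_SPLIT_NO_EMPTY drops the empty strings at the edges.
$words = preg_split('/[^\p{L}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```

Note that this also discards digits, which may or may not be what you want.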
Work with a regex, something like this:
#([\w]+)#i
A code example:
if(preg_match_all('#([\w]+)\b#i', $text, $matches)) {
foreach($matches[1] as $key => $word) {
echo $word."\n";
}
}
Then you have to compare each with some kind of dictionary.
I think you need to separate your problem into steps.
First, parse the returned HTML string to find which parts are tags and structure. You can use DOM for that purpose.
Then you can separate the text content from the tags and split that text into tokens to obtain an array. I don't know the best way, but a simple regex split into an array can do the job.
The interesting part, finding German words, could be done by matching your word list against a dictionary, again using arrays or maps, or, better, using a DB (SQLite may be a better fit than a full RDBMS like MySQL).
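The dictionary matching step can be sketched with a plain array used as a hash set. The word list below is a tiny made-up sample, and mb_strtolower() assumes the mbstring extension is available:

```php
<?php
// Hypothetical word list; a real one would come from a file or a database.
$germanWords = ['und', 'der', 'die', 'das', 'nicht', 'schöne'];

// Flip so that each lookup is a hash-key check rather than a linear search.
$dictionary = array_flip($germanWords);

$words = ['The', 'schöne', 'house', 'und', 'garden'];

$german = [];
foreach ($words as $word) {
    // Compare case-insensitively; mb_strtolower handles umlauts correctly.
    if (isset($dictionary[mb_strtolower($word, 'UTF-8')])) {
        $german[] = $word;
    }
}

print_r($german);
```

A pure dictionary lookup will misclassify words shared between languages, so treat the result as a rough filter.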
Does anyone know how to parse an XML string in PHP using SimpleXMLElement when the key has a space in it?
For example,
$xmlString = "<test><this is>a</this is></test>";
$xml = new SimpleXMLElement($xmlString);
print_r($xml);
In the above example, 'this is' causes the parser to go bananas. I'm guessing it's because it thinks it's a property and is expecting something like ??
For a bonus point: the same thing happens if the key is a number, like '1'.
That's because element names that start with a digit or contain spaces are not valid XML. You're probably better off running a replacement function on your XML string, converting <(\d+)> to <el_$1>, and replacing spaces in element names with underscores as well.
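A minimal sketch of such a replacement function. It assumes the tags carry no attributes, so any space between < and > belongs to the element name; the el_ prefix is just the naming used above:

```php
<?php
$xmlString = '<test><this is>a</this is><1>b</1></test>';

// Assumption: tags have no attributes, so spaces inside <...> are part of the name.
$fixed = preg_replace_callback(
    '#<(/?)([^<>/]+)>#',
    function (array $m) {
        $name = str_replace(' ', '_', trim($m[2]));
        // XML names may not start with a digit, so prefix those.
        if (preg_match('/^\d/', $name)) {
            $name = 'el_' . $name;
        }
        return '<' . $m[1] . $name . '>';
    },
    $xmlString
);

$xml = new SimpleXMLElement($fixed);
echo $xml->this_is, "\n";
echo $xml->el_1, "\n";
```

If the real input does have attributes, this pattern would mangle them, and the cleanup would need to happen where the XML is generated instead.');
assert((string) $xml->this_is === 'a');
assert((string) $xml->el_1 === 'b');
</test>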
I'm trying to parse an XML file and one of the fields looks like the following:
<link>http://foo.com/this-platform/scripts/click.php?var_a=a&var_b=b&varc=http%3A%2F%2Fwww.foo.com%2Fthis-section-here%2Fperf%2F229408%3Fvalue%3D0222%26some_variable%3Dmeee</link>
This seems to break the parser. I think it might have something to do with the & in the link?
My code is quite simple:
<?
$xml = simplexml_load_file("files/this.xml");
echo $xml->getName() . "<br />";
foreach($xml->children() as $child) {
echo $child->getName() . ": " . $child . "<br />";
}
?>
Any ideas how I can resolve this?
The XML snippet you posted is not valid. Ampersands have to be escaped; this is why the parser complains.
Your XML feed is not valid XML: the & should be escaped as &amp;
This means you cannot use an XML parser on it :-(
A possible "solution" (feels wrong, but should work) would be to replace each '&' that is not part of an entity with '&amp;', to get a valid XML string before loading it with an XML parser.
In your case, considering this :
$str = <<<STR
<xml>
<link>http://foo.com/this-platform/scripts/click.php?var_a=a&var_b=b&varc=http%3A%2F%2Fwww.foo.com%2Fthis-section-here%2Fperf%2F229408%3Fvalue%3D0222%26some_variable%3Dmeee</link>
</xml>
STR;
You might use a simple call to str_replace, like this :
$str = str_replace('&', '&amp;', $str);
And, then, parse the string (now XML-valid) that's in $str :
$xml = simplexml_load_string($str);
var_dump($xml);
In this case, it should work...
But note that you must be careful with entities: if you already have an entity like '&gt;', you must not turn it into '&amp;gt;'!
Which means that such a simple call to str_replace is not the right solution: it will probably break things on many XML feeds!
Up to you to find the right way to do that replacement -- maybe with some kind of regex...
It breaks the parser because your XML is invalid: & should be encoded as &amp;.
If your XML already has some escaping, this way it will be preserved and unescaped ampersands will be fixed:
$brokenXmlText = file_get_contents("files/this.xml");
$fixed = preg_replace('/&(?!lt;|gt;|quot;|apos;|amp;|#)/', '&amp;', $brokenXmlText);
$xml = simplexml_load_string($fixed);
The comment by mjv resolved it:
Alternatively to using &amp;, you may consider putting the urls and other XML-unfriendly content in <![CDATA[ ... ]]>, i.e. a Character Data block
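A minimal sketch of how a CDATA-wrapped link parses, assuming you control how the feed is generated. The shortened URL is a made-up stand-in for the one in the question:

```php
<?php
// Nowdoc, so nothing inside the XML is treated as a PHP variable.
$str = <<<'XML'
<xml>
<link><![CDATA[http://foo.com/click.php?var_a=a&var_b=b]]></link>
</xml>
XML;

// LIBXML_NOCDATA merges CDATA sections into ordinary text nodes.
$xml = simplexml_load_string($str, 'SimpleXMLElement', LIBXML_NOCDATA);
$link = (string) $xml->link;

echo $link, "\n";
```

Inside the CDATA block the raw & is accepted as-is, so no escaping or pre-processing is needed on the PHP side.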
I think this will help you
http://www.php.net/manual/en/simplexml.examples-errors.php#96218
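Along the same lines, libxml's error collection can show exactly where the parser gives up. A minimal sketch; the broken sample string is made up for illustration:

```php
<?php
// Hypothetical feed with an unescaped ampersand in the URL.
$broken = '<xml><link>http://foo.com/?a=1&b=2</link></xml>';

// Collect parser errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$xml = simplexml_load_string($broken);

$messages = [];
if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();
}

print_r($messages);
```

The collected messages point at the offending line, which makes it easier to decide whether a pre-processing fix is safe for a given feed.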
I'm writing a function that fishes out the src from the first image tag it finds in an html file. Following the instructions in this thread on here, I got something that seemed to be working:
preg_match_all('#<img[^>]*>#i', $content, $match);
foreach ($match as $value) {
$img = $value[0];
}
$stuff = simplexml_load_string($img);
$stuff = $stuff[src];
return $stuff;
But after a few minutes of using the function, it started returning errors like this:
warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1: parser error : Premature end of data in tag img line 1 in path/to/script on line 42.
and
warning: simplexml_load_string() [function.simplexml-load-string]: tp://feeds.feedburner.com/~f/ChicagobusinesscomBreakingNews?i=KiStN" border="0"> in path/to/script on line 42.
I'm kind of new to PHP but it seems like my regex is chopping up the HTML incorrectly. How can I make it more "airtight"?
These two lines of PHP code should give you a list of all the values of the src attribute in all img tags in an HTML file:
preg_match_all('/<img\s+[^<>]*src=["\']?([^"\'<>\s]+)["\']?/i', $content, $result, PREG_PATTERN_ORDER);
$result = $result[1];
To keep the regex simple, I'm not allowing file names to have spaces in them. If you want to allow this, you need to use separate alternatives for quoted attribute values (which can have spaces), and unquoted attribute values (which can't have spaces).
Most likely because the "XML" being picked up by the regex isn't proper XML for whatever reason. I would probably go for a more complicated regex that pulls out the src attribute directly, instead of using SimpleXML to get the src. This regex might be close to what you need:
<img[^>]*\ssrc\s*=\s*["']?([^"'\s>]+)["']?[^>]*>
You could also use a real HTML Parsing library, but I'm not sure which options exist in PHP.
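One option that ships with PHP is DOMDocument, whose loadHTML() tolerates unclosed tags such as <img>. A minimal sketch; the sample markup and URL are made up for illustration:

```php
<?php
// Hypothetical content with an unclosed <img> tag, which is fine in HTML.
$content = '<p>Intro</p><img src="http://example.com/a.png" border="0"><img src="b.png">';

// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($content);
libxml_clear_errors();

// Take the src of the first image, or null when the page has none.
$images = $dom->getElementsByTagName('img');
$src = $images->length > 0 ? $images->item(0)->getAttribute('src') : null;

var_dump($src);
```

Because the HTML parser does not require well-formed XML, this avoids the "Premature end of data in tag img" error that SimpleXML raises.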
An ampersand by itself in an attribute is invalid XML (it should be encoded as “&amp;”), but some people still put it that way in URLs on HTML pages (and all browsers support it). Maybe there lies your problem.
If that is the case, you can sanitize your string before parsing it, substituting “&(?!amp;)” with “&amp;”.
On a different subject:
foreach ($match as $value) {
$img = $value[0];
}
can be replaced with
$img = $match[count($match) - 1][0];
Something like this:
if (preg_match('#<img\s[^>]*>#i', $content, $match)) {
$img = $match[0]; //first image in file only
$stuff = simplexml_load_string($img);
$stuff = (string) $stuff['src']; // quote the key; a bare src is treated as an undefined constant
return $stuff;
} else {
return null; //no match found
}