I have the following string in an HTML page:
BookSelector.load([{"index":25,"label":"Science","booktype":"pdf","payload":"<script type=\"text\/javascript\" charset=\"utf-8\" src=\"\/\/www.192.168.10.85\/libs\/js\/books.min.js\" publisher_id=\"890\"><\/script>"}]);
I want to extract the src and the publisher_id from the string.
For this I'm trying the following code:
$regex = '#\BookSelector.load\(.*?src=\"(.*?)\"}]\)#s';
preg_match($regex, $html, $matches);
$match = $matches[1];
But it always returns null.
What would my regex be to select only the src?
What would my regex be if I need to capture the whole string inside BookSelector.load();?
Why isn't your regex working?
First, I'll answer why your regex isn't working:
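You're using \B in your regex. It matches any position that is not a word boundary (\b), so \BookSelector means "ookSelector at a non-boundary position", which is not what you want. It happens to match inside BookSelector here, but only by accident; write BookSelector\.load instead.
Your original text contains escaped quotes (\"), but your regex doesn't account for the backslashes. In the pattern, \" matches a plain ", which never follows src= in the text, so the whole match fails and $matches[1] is never set.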
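Both points can be checked in isolation. The following is a minimal sketch, not the original page: the string is a shortened stand-in and the host example.test is made up for illustration.

```php
<?php
// Shortened stand-in for the page content (hypothetical host).
// Inside single quotes PHP keeps \" and \/ as written, so the backslashes are really in the text.
$html = 'BookSelector.load([{"payload":"<script src=\"\/\/example.test\/books.min.js\" publisher_id=\"890\"><\/script>"}]);';

// \B followed by "ookSelector" still finds a match inside "BookSelector".
$accidental = preg_match('#\BookSelector\.load#', $html);

// A bare quote in the pattern cannot match the escaped \" in the text.
$plainQuote = preg_match('#src="(.*?)"#', $html, $m1);

// \\\\ in PHP source is one literal backslash for the regex engine, so this matches \" explicitly.
$escapedQuote = preg_match('#src=\\\\"(.*?)\\\\"#', $html, $m2);

var_dump($accidental, $plainQuote, $escapedQuote, $m2[1] ?? null);
```

The captured value still contains the JSON escapes (\/), which is one more reason to decode the JSON instead of patching the regex.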
The correct approach to solve this problem
Split this task into several parts, and solve it one by one, using the best tool available.
The data you need is encapsulated within a JSON structure. So the first step is obviously to extract the JSON content. For this purpose, you can use a regex.
Once you have the JSON content, you need to decode it to get the data in it. PHP has a built-in function for that purpose: json_decode(). Use it with the input string and set the second parameter as true, and you'll have a nice associative array.
Once you have the associative array, you can easily get the payload string, which contains the <script> tag contents.
If you're absolutely sure that the order of attributes will always be the same, you can use a regex to extract the required information. If not, it's better to use an HTML parser such as PHP's DOMDocument to do this.
The whole code for this looks like:
// Extract the JSON string from the whole block of text
if (preg_match('/BookSelector\.load\((.*?)\);/s', $text, $matches)) {
    // Get the JSON string and decode it using json_decode()
    $json = $matches[1];
    $content = json_decode($json, true)[0]['payload'];

    // Use DOMDocument to load the string, and get the required values
    $dom = new DOMDocument;
    $dom->loadHTML($content);
    $script_tag = $dom->getElementsByTagName('script')->item(0);
    $script_src = $script_tag->getAttribute('src');
    $publisher_id = $script_tag->getAttribute('publisher_id');

    var_dump($script_src, $publisher_id);
}
Output:
string(40) "//www.192.168.10.85/libs/js/books.min.js"
string(3) "890"
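For the case where the attribute order is guaranteed, the regex-only variant of the last step might look like the sketch below. It assumes src always comes immediately before publisher_id, which is true for the sample string but is an assumption about the real data; the variable names are illustrative.

```php
<?php
$text = 'BookSelector.load([{"index":25,"label":"Science","booktype":"pdf","payload":"<script type=\"text\/javascript\" charset=\"utf-8\" src=\"\/\/www.192.168.10.85\/libs\/js\/books.min.js\" publisher_id=\"890\"><\/script>"}]);';

$src = null;
$publisherId = null;

if (preg_match('/BookSelector\.load\((.*?)\);/s', $text, $matches)) {
    // json_decode() removes the \" and \/ escapes, leaving plain HTML.
    $items = json_decode($matches[1], true);
    $payload = $items[0]['payload'] ?? '';

    // Assumption: src is directly followed by publisher_id.
    if (preg_match('~\bsrc="([^"]*)"\s+publisher_id="([^"]*)"~', $payload, $attr)) {
        $src = $attr[1];
        $publisherId = $attr[2];
    }
}

var_dump($src, $publisherId);
```

If the attributes can appear in any order, this pattern silently stops matching, which is why the DOMDocument version is the safer default.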
Related
I could use some help writing a regular expression for this dictionary string (I don't use them all that often).
This is an example of the string dictionary:
O:8:"stdClass":5:{s:4:"sent";i:0;s:6:"graded";i:0;s:5:"score";i:0;s:6:"answer";s:14:"<p>Johnson</p>";s:8:"response";s:0:"";}
I want to extract Johnson from the string dictionary.
Any help would be appreciated, thanks.
This is a PHP serialized object. Don't use a regular expression. unserialize() the data and display the answer property accordingly.
$obj = unserialize($data);
echo $obj->answer;
$str = 'O:8:"stdClass":5:{s:4:"sent";i:0;s:6:"graded";i:0;s:5:"score";i:0;s:6:"answer";s:14:"<p>Johnson</p>";s:8:"response";s:0:"";}';
$obj = unserialize($str);
echo $obj->answer;
This would be the correct answer, no regex needed. You may need some additional HTML parsing if you'd want the <p> tags removed. If the format will always remain the same (and only then!) simply remove the <p> and </p> tags.
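A minimal sketch of the tag removal, using strip_tags() instead of a regex. The allowed_classes option is an addition not mentioned above; it restricts unserialize() to stdClass, which is a sensible precaution if the data comes from outside your application.

```php
<?php
$str = 'O:8:"stdClass":5:{s:4:"sent";i:0;s:6:"graded";i:0;s:5:"score";i:0;s:6:"answer";s:14:"<p>Johnson</p>";s:8:"response";s:0:"";}';

// Only allow stdClass to be instantiated while unserializing.
$obj = unserialize($str, ['allowed_classes' => ['stdClass']]);

// strip_tags() removes the <p> wrapper regardless of which tags are used.
$answer = trim(strip_tags($obj->answer));

echo $answer;
```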
It looks like you should be using unserialize() instead, and then you can use preg_match to extract the text inside the <p> tags.
$obj = (unserialize('O:8:"stdClass":5:{s:4:"sent";i:0;s:6:"graded";i:0;s:5:"score";i:0;s:6:"answer";s:14:"<p>Johnson</p>";s:8:"response";s:0:"";}'));
preg_match('~<p>([^<]*)</p>~', $obj->answer, $ans);
print_r($ans[1]); //prints Johnson
I'm trying to build a PHP regex that validates this type of input string:
{name:'something name here',type:'',id:''},{name:'other name',type:'small',id:34},{name:'orange',type:'weight',id:28}
etc...
So, it is a list of JSON-like objects that each contain 3 fields: name, type, id. The name field is always present, whereas type and id can both be empty strings (''). If the format is valid, I can then explode it by comma and obtain an array of JSON strings.
How can I do this?
UPDATE
It isn't valid JSON, as you say, but I have an input field where the user enters tags, and I want to track the name, type and id of those tags.
Example:
tag1 (has name, type, id), tag2 (has only name), tag3 (has name, type, id).
So, I think that I can post a string in this format:
{'name':'test','type':'first','id':3},{'name':'other','type':'second','id':45}, etc
But I must validate this string with a regex. I can do:
$data = explode(',',$list);
and then I do:
foreach($data as $d){
$tmp = json_decode($d);
if($tmp == false) echo 'error invalid data';
}
As Gubo pointed out: this is not a valid JSON encoded string. If the actual data you want to process in your script is valid, however, you're barking up the wrong tree looking for a regular expression... PHP has built-in functions that will parse JSON strings much faster than a regular expression.
$string1 = "{name:'something name here',type:'',id:''},{name:'othername',type:'small',id:34},{name:'orange',type:'weight',id:28}";
$string2 = '[{"name":"something name here","type":"","id":""},{"name":"othername","type":"small","id":"34"},{"name":"orange","type":"weight","id":"28"}]';
Where $string2 is the data in valid JSON format. If your data is a valid JSON string, the following code will suffice:
$parsed = json_decode($string2);
// $parsed[0]->name returns 'something name here' (pass true as the second argument to get arrays and use $parsed[0]['name'])
If, however, you're dealing with invalid JSON strings, things get a bit more complicated... First off: if only the enclosing brackets are missing and your object properties (or array keys as they will become in PHP) are quoted, a quick fix would be this:
$parsed = json_decode('['.$string1.']');
If you really want to parse them separately:
$separated= preg_split('/(?<=[\}]),/',$string1);
But I can't see why you would want to do that. The biggest issue here is the absence of quotes on the property strings (or keys). I have put together a regex (untested) that could quote those strings:
$parsed = json_decode('['.preg_replace('/(?<=[\{,])([a-z]+)(?=:)/', '"$1"', str_replace('\'', '"', $string1)).']');
Keep in mind, the last regex is untested, so it might not perform as you expect it to... but it should help you on your way... for the last example, the same rules apply for all the other examples I gave: if the quotes and brackets are there, just use json_decode, if the brackets are missing, add them, too...
It's getting rather late here, so I'm off to bed now... I hope this answer isn't packed with typo's and sentences that nobody can understand. If it is, I do apologize.
You don't need a regex for that. Just use this:
var_dump(json_decode($json, true));
See: http://us.php.net/manual/en/function.json-decode.php
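Once the client sends valid JSON, the validation the question asks for can be done on the decoded data instead of with a regex. The following is only a sketch: parseTags() is a hypothetical helper name, and the rule "name must be non-empty, type and id must be present but may be empty" is taken from the question.

```php
<?php
/**
 * Hypothetical helper: returns the decoded tags, or null when the input is invalid.
 */
function parseTags(string $json): ?array
{
    $tags = json_decode($json, true);
    if (!is_array($tags)) {
        return null; // not valid JSON, or not a list of objects
    }

    foreach ($tags as $tag) {
        // Each tag needs all three fields; only name has to be non-empty.
        if (!is_array($tag)
            || !isset($tag['name'], $tag['type'], $tag['id'])
            || $tag['name'] === '') {
            return null;
        }
    }

    return $tags;
}

$valid = parseTags('[{"name":"something name here","type":"","id":""},{"name":"orange","type":"weight","id":28}]');
$invalid = parseTags("{name:'orange',type:'weight',id:28}");

var_dump($valid !== null, $invalid);
```

Decoding the whole list at once also avoids explode(','), which would cut every object apart at the commas between its fields.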
<?php
$html = file_get_contents('http://hypermedia.ids-mannheim.de/');
?>
This code returns the HTML of the website as a string. How do I separate the string into individual words? After getting the words into an array, I would like to detect which ones are German...
$words = explode(' ', strip_tags($html));
or
$words = preg_split("/[\s,]+/", strip_tags($html));
The second one treats not just the space character as a delimiter, but any run of whitespace (tabs and newlines included) and commas as well.
work with a regex, something like this
#([\w]+)#i
A code example:
if (preg_match_all('#([\w]+)\b#i', $text, $matches)) {
    foreach ($matches[1] as $key => $word) {
        echo $word."\n";
    }
}
Then you have to compare each with some kind of dictionary.
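A rough sketch of the dictionary comparison follows. The word list here is a tiny made-up sample, and the HTML is a stand-in for the downloaded page. It also assumes UTF-8 input and the mbstring extension; \p{L} with the u modifier is used instead of \w so that umlauts are treated as letters.

```php
<?php
// Stand-in for the downloaded page; real code would use file_get_contents().
$html = '<p>Das ist ein schöner Tag, said the visitor.</p>';

// Hypothetical word list; a real one would come from a dictionary file or a database.
// array_flip() turns it into a lookup table so isset() can be used.
$germanWords = array_flip(['das', 'ist', 'ein', 'schöner', 'tag']);

// Collect runs of letters from the text with the tags removed.
preg_match_all('/\p{L}+/u', strip_tags($html), $matches);

$german = [];
foreach ($matches[0] as $word) {
    if (isset($germanWords[mb_strtolower($word)])) {
        $german[] = $word;
    }
}

print_r($german);
```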
I think you need to separate your problem into steps.
First parse your returned html string to find which part is html tags and structure. You can use DOM for such purpose.
Then, you can separate the text content from the tags and split that text into tokens to obtain an array. I don't know the best way, but a simple regex split into an array can do the job.
The interesting part, finding German words, could be done by matching your word list against a dictionary, again using arrays or maps, or, better, using a DB (SQLite may be a better fit than a full RDBMS like MySQL).
I have an XML file that contains hierarchical data. Now I need to get some of that data into a PHP array. I use XSL to get the data I want formatted as a PHP array. But when I print it, it keeps all the tabs, extra spaces, line breaks etc., which I need to get rid of to turn it into a flat string (I suppose!) and then convert that string into an array.
In the XSL I output as text and have indent="no" (which does nothing). I've tried to strip \t \n \r etc. but it doesn't affect the output at all.
Is there a really good PHP function out there that can strip out all formatting except single spaces? Or is there more going on here that I don't know about, or another way of doing the same thing?
First off, using XSL output to form your PHP array is fairly inelegant and inefficient. I would highly suggest going with something like the DOMDocument class available in PHP (http://www.php.net/manual/en/class.domdocument.php). If you must stick with your current method, try using regular expressions to remove any unnecessary whitespace.
$string = preg_replace('/\s+/', '', $string);
or
$string = preg_replace('/\s\s+/', ' ', $string);
to preserve single white space.
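Note that /\s\s+/ only touches runs of two or more whitespace characters, so a lone tab or newline survives. If the goal is a flat string with nothing but single spaces, a variant like the following sketch may be closer to it (the sample input is made up):

```php
<?php
// Made-up sample of XSL text output with tabs, blank lines and padding.
$string = "  array(\n\t'a' => 1,\n\n\t'b'   =>  2\n)  ";

// Replace every run of whitespace (including single tabs/newlines) with one space,
// then drop the leading and trailing space.
$flat = trim(preg_replace('/\s+/', ' ', $string));

echo $flat;
```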
I've created a class for an open-source library that you're welcome to use, and to look at as an example of how to create an array from XML (and just take out the "good" parts).
USING XML
So the crux of the problem is probably keeping the data in XML as long as possible. Therefore, after the XSL transformation you would have something like:
<xml>
<data>value
with newline
</data>
<data>with lots of whitespace</data>
</xml>
Then you could loop through that data like:
$xml = simplexml_load_string($xml_string);
$data_array = array();
foreach ($xml as $data)
{
    // use str_replace or a regular expression to replace the values...
    // the (string) cast turns the SimpleXMLElement into its text content
    $data_array[] = str_replace(array(" ", "\n"), "", (string) $data);
}
// $data_array is the array you want!
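As a self-contained variation on the loop above, the sketch below collapses whitespace to single spaces instead of deleting every space, which keeps multi-word values readable. The XML sample is made up to mirror the structure shown earlier, and the SimpleXML extension is assumed to be available.

```php
<?php
// Made-up sample mirroring the <xml><data>...</data></xml> structure.
$xml_string = "<xml>\n  <data>value\n    with newline\n  </data>\n  <data>with   lots of    whitespace</data>\n</xml>";

$xml = simplexml_load_string($xml_string);

$data_array = [];
foreach ($xml->data as $data) {
    // Cast the element to its text, then normalize all whitespace runs to one space.
    $data_array[] = trim(preg_replace('/\s+/', ' ', (string) $data));
}

print_r($data_array);
```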
USING JSON
However, if you can't output XML from the XSL and then loop through it, you may want to use XSL to create a JSON string instead and convert that to an array. The XSL output would then look like:
{
"0" : "value
with newline",
"1" : "with lots of whitespace"
}
Then you could loop through that data like:
$json_array = json_decode($json_string, TRUE); // the TRUE is to make an array
foreach ($json_array as $key => $value)
{
    // use str_replace or a regular expression to replace the values...
    $json_array[$key] = str_replace(array(" ", "\n"), "", $value);
}
Either way you'll have to pull the values in PHP because XSLT's handling of spaces and newlines is pretty rudimentary.
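One caveat for the JSON route: a raw line break inside a JSON string is not valid JSON, so json_decode() would return null for output that contains one. The sketch below therefore assumes the XSL emits line breaks as the escape sequence \n; the sample string is made up.

```php
<?php
// Made-up sample; inside single quotes PHP keeps \n as backslash + n,
// which is the JSON escape for a line break.
$json_string = '{"0": "value\n    with newline", "1": "with   lots of    whitespace"}';

$json_array = json_decode($json_string, true);
if ($json_array === null) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

foreach ($json_array as $key => $value) {
    // Collapse every run of whitespace to a single space.
    $json_array[$key] = trim(preg_replace('/\s+/', ' ', $value));
}

print_r($json_array);
```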
I am trying to parse a string of HTML tag attributes in PHP. There can be 3 cases:
attribute="value" //inside the quotes there can be everything also other escaped quotes
attribute //without the value
attribute=value //without quotes so there are only alphanumeric characters
Can someone help me find a regex that captures the attribute name in the first group and the attribute value (if present) in the second?
Never ever use regular expressions for processing html, especially if you're writing a library and don't know what your input will look like. Take a look at simplexml, for example.
Give this a try and see if it is what you want to extract from the tags.
preg_match_all('/( \\w{1,}="\\w{1,}"| \\w{1,}=\\w{1,}| \\w{1,})/i',
$content,
$result,
PREG_PATTERN_ORDER);
$result = $result[0];
The regex pulls each attribute, excludes the tag name, and puts the results in an array so you will be able to loop over the first and second attributes.
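For the three cases listed in the question, including escaped quotes inside a quoted value, a pattern along the following lines may work. It is a sketch under the assumption that the input is only the attribute part of a tag (no tag name, no single-quoted values); the sample string and variable names are made up.

```php
<?php
// Made-up attribute string covering all three cases from the question.
$content = 'class="a \"quoted\" b" disabled width=100';

// Group 1: attribute name.
// Group 2: quoted value, where \\\\. lets a backslash escape any character (so \" does not end the value).
// Group 3: unquoted value.
$pattern = '/([\w-]+)(?:\s*=\s*(?:"((?:[^"\\\\]|\\\\.)*)"|([^\s"]+)))?/';

preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

$attrs = [];
foreach ($matches as $m) {
    // Prefer the unquoted group, fall back to the quoted one, else null for a bare attribute.
    $attrs[$m[1]] = $m[3] ?? $m[2] ?? null;
}

var_dump($attrs);
```

The quoted value is returned with its backslashes intact; removing them, and handling real-world HTML in general, is better left to a parser as the other answer suggests.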