I have a string that is in this format.
<span class="amount">$25</span>–<span class="amount">$100</span>
What I need to do is split that into two strings. The string will remain in the same format but the prices will change. I tried using str_split() but because the price changes I wouldn't be able to always know how many characters to split the string at.
What I am trying to get is something like this.
String 1
<span class="amount">$25</span>–
String 2
<span class="amount">$100</span>
It seems the best option I have found is to use preg_split() but I don't know anything about regex so I'm not sure how to format the expression. There may also be a better way to handle this and I just don't know of it.
Could someone please help me format the regex, or let me know of a better way to split that string.
Edit
Thanks to #rm-vanda for helping me figure out that I don't need to use preg_split for this. I was able to split the string using explode(). The issue I was having was that the '-' was encoded oddly, so the delimiter I passed never matched it and the split did not return what I expected.
It might be better to translate this problem into DOM:
$html = <<<HTML
<span class="amount">$25</span>–<span class="amount">$100</span>
HTML;
$doc = new DOMDocument;
$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('span') as $span) {
// do stuff with $span
// e.g. this is how you would get the outer html
echo $doc->saveXML($span);
}
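If it always has the "–" separator, then this would be the simplest way. Note that the separator shown in the question is an en dash, not a plain hyphen, so the delimiter passed to explode() has to match it exactly:
$span = explode("–", $spans);
echo $span[0];
echo $span[1];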
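The question's desired output keeps the separator attached to the first string, which explode() drops. A minimal sketch of one way to keep it, assuming the separator is a literal en dash (–) in a UTF-8 string rather than an HTML entity; the variable names are illustrative:

```php
<?php
// Sample input in the question's format (single quotes keep "$25" literal).
$spans = '<span class="amount">$25</span>–<span class="amount">$100</span>';

// Split at the position right after the en dash. The lookbehind (?<=–) is
// zero-width, so the dash stays at the end of the first piece. The /u flag
// makes PCRE treat the pattern and the subject as UTF-8.
$parts = preg_split('/(?<=–)/u', $spans, 2);

echo $parts[0], "\n"; // first price, followed by the separator
echo $parts[1], "\n"; // second price
```

If the separator turns out to be encoded differently (for example as an entity), the lookbehind would need to be changed to match that encoding.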
Related
I have the following string in an HTML page.
BookSelector.load([{"index":25,"label":"Science","booktype":"pdf","payload":"<script type=\"text\/javascript\" charset=\"utf-8\" src=\"\/\/www.192.168.10.85\/libs\/js\/books.min.js\" publisher_id=\"890\"><\/script>"}]);
I want to find the src and the publisher_id in the string.
For this I'm trying the following code:
$regex = '#\BookSelector.load\(.*?src=\"(.*?)\"}]\)#s';
preg_match($regex, $html, $matches);
$match = $matches[1];
But it's always returning null.
What would my regex be to select only the src?
What would my regex be if I need to parse the whole string inside BookSelector.load()?
Why your regex isn't working?
First, I'll answer why your regex isn't working:
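You're using \B in your regex. \B matches any position that is not a word boundary (\b), so \Book is read as \B followed by ook rather than as the literal text Book. That is not what you want, even in cases where it happens to match.
Your original text contains escaped quotes (src=\" with a backslash before the quote), but your regex expects src=" directly. The match fails at that point, so $matches stays empty.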
The correct approach to solve this problem
Split this task into several parts, and solve it one by one, using the best tool available.
The data you need is encapsulated within a JSON structure. So the first step is obviously to extract the JSON content. For this purpose, you can use a regex.
Once you have the JSON content, you need to decode it to get the data in it. PHP has a built-in function for that purpose: json_decode(). Use it with the input string and set the second parameter as true, and you'll have a nice associative array.
Once you have the associative array, you can easily get the payload string, which contains the <script> tag contents.
If you're absolutely sure that the order of attributes will always be the same, you can use a regex to extract the required information. If not, it's better to use an HTML parser such as PHP's DOMDocument to do this.
The whole code for this looks like:
// Extract the JSON string from the whole block of text
if (preg_match('/BookSelector\.load\((.*?)\);/s', $text, $matches)) {
    // Get the JSON string and decode it using json_decode()
    $json = $matches[1];
    $content = json_decode($json, true)[0]['payload'];
    // Use DOMDocument to load the string, and get the required values
    $dom = new DOMDocument;
    $dom->loadHTML($content);
    $script_tag = $dom->getElementsByTagName('script')->item(0);
    $script_src = $script_tag->getAttribute('src');
    $publisher_id = $script_tag->getAttribute('publisher_id');
    var_dump($script_src, $publisher_id);
}
Output:
string(40) "//www.192.168.10.85/libs/js/books.min.js"
string(3) "890"
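For the regex route mentioned above, here is a rough sketch that decodes the JSON first and then matches each attribute on its own, so attribute order does not matter. It assumes the attribute values are always double-quoted; the variable names ($src, $publisherId) are illustrative:

```php
<?php
// The raw text as it appears in the page. Inside PHP single quotes, \" and \/
// stay as literal backslash sequences, exactly like the original input.
$text = 'BookSelector.load([{"index":25,"label":"Science","booktype":"pdf","payload":"<script type=\"text\/javascript\" charset=\"utf-8\" src=\"\/\/www.192.168.10.85\/libs\/js\/books.min.js\" publisher_id=\"890\"><\/script>"}]);';

$src = null;
$publisherId = null;

if (preg_match('/BookSelector\.load\((.*?)\);/s', $text, $m)) {
    // json_decode() undoes the \" and \/ escaping for us.
    $items = json_decode($m[1], true);
    $payload = $items[0]['payload'] ?? '';

    // Match each attribute independently of its position in the tag.
    if (preg_match('/\bsrc="([^"]*)"/', $payload, $s)) {
        $src = $s[1];
    }
    if (preg_match('/\bpublisher_id="([^"]*)"/', $payload, $p)) {
        $publisherId = $p[1];
    }
}

var_dump($src, $publisherId);
```

This still breaks on single-quoted or unquoted attribute values, which is where DOMDocument remains the safer choice.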
I need to do some cleanup on strings that look like this:
$author_name = '<a href="http://en.wikipedia.org/wiki/Robert_Jones_Burdette>Robert Jones Burdette </a>';
Notice the href attribute doesn't have a closing quote - I'm using the DOMParser on a large table of these to extract the text, and it borks on this.
I would like to look at the string in $author_name;
IF the first > does NOT have a " before it, replace it with "> to close the tag correctly. If it is okay, just skip and do the next step. Be sure not to replace the second > at all.
Using php regex, I haven't been able to find a working solution - I could chop up the whole thing and check its parts, but that would be slow and I think there must be a regex that can do what I want.
TIA
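What you can do is find the first closing angle bracket, with or without the double quote (") before it, and replace it with (">). Passing a limit of 1 makes sure only the first match is replaced, so the second > is never touched:
$author_name = preg_replace('/(.+?)"?>(.+?)/', '$1">$2', $author_name, 1);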
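A small sketch of the same idea wrapped in a function and checked against both a broken and an already valid string. The helper name closeOpeningTagQuote is hypothetical, and the pattern assumes the string starts with an <a ...> tag whose last attribute is the one that may be missing its quote:

```php
<?php
// Hypothetical helper: make sure the opening <a ...> tag has a quote before ">".
function closeOpeningTagQuote(string $html): string
{
    // ^(<a\s[^>]*?) captures the opening tag up to, but not including, an
    // optional trailing quote; "?> then consumes that quote (if any) and ">".
    // The ^ anchor and the limit of 1 keep the ">" of </a> untouched.
    return preg_replace('/^(<a\s[^>]*?)"?>/', '$1">', $html, 1);
}

$broken = '<a href="http://en.wikipedia.org/wiki/Robert_Jones_Burdette>Robert Jones Burdette </a>';
$valid  = '<a href="http://en.wikipedia.org/wiki/Robert_Jones_Burdette">Robert Jones Burdette </a>';

$fixed = closeOpeningTagQuote($broken); // quote added before the first ">"
$same  = closeOpeningTagQuote($valid);  // already valid, comes back unchanged

echo $fixed, "\n", $same, "\n";
```

Because a valid tag is rewritten to the identical text, the function can be run over the whole table without checking each row first.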
http://www.barattalo.it/html-fixer/
Download that, then include it in your php.
The rest is quite easy:
$dirty_html = ".....bad html here......";
$a = new HtmlFixer();
$clean_html = $a->getFixedHtml($dirty_html);
It's common for people to want to use regular expressions, but you must remember that HTML is not regular.
I could use some help writing a regular expression for this dictionary string (I don't use them all that often).
This is an example of the string dictionary:
O:8:"stdClass":5:{s:4:"sent";i:0;s:6:"graded";i:0;s:5:"score";i:0;s:6:"answer";s:14:"<p>Johnson</p>";s:8:"response";s:0:"";}
I want to extract Johnson from the string dictionary.
Any help would be appreciated, thanks.
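This is a PHP serialized object. Don't use a regular expression. unserialize() the data and display the answer property accordingly. Note that unserialize() returns the object rather than modifying its argument, so keep the return value:
$obj = unserialize($data);
echo $obj->answer;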
$str = 'O:8:"stdClass":5:{s:4:"sent";i:0;s:6:"graded";i:0;s:5:"score";i:0;s:6:"answer";s:14:"<p>Johnson</p>";s:8:"response";s:0:"";}';
$obj = unserialize($str);
echo $obj->answer;
This would be the correct answer, no regex needed. You may need some additional HTML parsing if you'd want the <p> tags removed. If the format will always remain the same (and only then!) simply remove the <p> and </p> tags.
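As a rough sketch of the tag removal mentioned above, strip_tags() can drop the <p> wrapper without a regex. The allowed_classes option (available since PHP 7) is an extra precaution added here, not part of the original answers:

```php
<?php
$str = 'O:8:"stdClass":5:{s:4:"sent";i:0;s:6:"graded";i:0;s:5:"score";i:0;s:6:"answer";s:14:"<p>Johnson</p>";s:8:"response";s:0:"";}';

// Only allow stdClass to be instantiated while unserializing.
$obj = unserialize($str, ['allowed_classes' => ['stdClass']]);

// strip_tags() removes all HTML tags and keeps the text between them.
$answer = strip_tags($obj->answer);

echo $answer, "\n";
```

strip_tags() removes every tag, so this only suits cases where none of the markup in the answer needs to be kept.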
It looks like you should be using unserialize() instead, and then you can use preg_match to extract the text inside the <p> tags.
$obj = (unserialize('O:8:"stdClass":5:{s:4:"sent";i:0;s:6:"graded";i:0;s:5:"score";i:0;s:6:"answer";s:14:"<p>Johnson</p>";s:8:"response";s:0:"";}'));
preg_match('~<p>([^<]*)</p>~', $obj->answer, $ans);
print_r($ans[1]); //prints Johnson
I am trying to grab the text inside an h4 tag:
$regex = '/<h4>([A-Za-z0-9\,\.])/';
I am just getting the first letter back, I cannot figure out how to use * to keep grabbing everything to the first < character.
I have made countless attempts and know I am overlooking something simple.
So I was making that much harder than I needed to. The following works:
$regex = '/<h4>.*?<\/h4>/';
If you can trust that grabbing all characters up to the first < is a good enough rule then use this:
$regex = '/<h4>([^<]*?)</';
Of course that definition will only grab 'The ' from <h4>The <b>Best</b> Book</h4>. You can fix that by changing it to:
$regex = '/<h4>(.*?)<\/h4>/';
Which will grab everything between a <h4> and a </h4>, but still isn't perfect because anything like <h4 > or <h4 style="..."> will break it, along with a million other valid HTML examples. If you know that the contents won't have any < though, and you know your tag will always be exactly <h4> the first one works well enough for your situation.
If your situation is more complex you will want to use something like PHP's DOM extension (DOMDocument) which is meant for parsing HTML and XML, since neither are regular languages and cannot be parsed error free with regex.
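A minimal sketch of the DOMDocument route, using a made-up h4 with an attribute and nested markup, which are the cases the regexes above struggle with:

```php
<?php
// Illustrative input: an attribute on the tag and nested <b> markup inside it.
$html = '<h4 class="title">The <b>Best</b> Book</h4>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($html);

$titles = [];
foreach ($doc->getElementsByTagName('h4') as $h4) {
    // textContent returns the text of the element and all its descendants.
    $titles[] = $h4->textContent;
}

print_r($titles);
```

textContent drops the inner <b> tags; if the inner markup is needed, $doc->saveHTML($h4) returns the element's outer HTML instead.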
You can use the function below to accomplish this task.
function getTextBetweenTags($string, $tagname) {
    // [^>]* skips any attributes; the lazy (.*?) stops at the first closing tag
    $pattern = "/<{$tagname}\b[^>]*>(.*?)<\/{$tagname}>/s";
    preg_match($pattern, $string, $matches);
    return $matches;
}
Pass the complete string as the first parameter and the tag name ("h4") as the second. If there is a match, the text between the tags is in element 1 of the returned array.
<?php
$html = file_get_contents('http://hypermedia.ids-mannheim.de/');
?>
This code returns the HTML of the website as a string. How do I separate the string into individual words? After getting the individual words in an array, I would like to detect which ones are German...
$words = explode(' ', strip_tags($html));
or
$words = preg_split("/[\s,]+/", strip_tags($html));
The second one treats not just the space character as a delimiter, but any run of whitespace (tabs, newlines) and commas as well.
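Since the target words are German, a Unicode-aware pattern may be a better fit, because umlauts and ß are letters too. A sketch, assuming the page is UTF-8 and using a made-up HTML snippet instead of the real site:

```php
<?php
// Made-up sample standing in for the downloaded page (assumed to be UTF-8).
$html = '<h1>Willkommen</h1> <p>Hypermedia und Texttechnologie, über 100 Seiten.</p>';

$text = strip_tags($html);

// \p{L}+ matches runs of letters in any script; /u enables UTF-8 mode.
// Digits and punctuation are skipped, so "100" and "." never become words.
preg_match_all('/\p{L}+/u', $text, $matches);
$words = $matches[0];

print_r($words);
```

strip_tags() does not insert spaces where tags were, so text from adjacent elements with no whitespace between them can run together into one word.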
Work with a regex, something like this:
#([\w]+)#i
A code example:
if(preg_match_all('#([\w]+)\b#i', $text, $matches)) {
foreach($matches[1] as $key => $word) {
echo $word."\n";
}
}
Then you have to compare each with some kind of dictionary.
I think you need to separate your problem into steps.
First, parse the returned HTML string to find which parts are tags and structure. You can use DOM for this purpose.
Then you can separate the text content from the tags and split that text into tokens to obtain an array. I don't know the best way, but a simple regex split into an array can do the job.
The interesting part, finding German words, could be done by matching your word list against a dictionary, again using arrays or maps, or, better, using a DB (SQLite might be a better fit than a full RDBMS like MySQL).
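The dictionary step could be sketched with plain arrays like this. The word list is a tiny hypothetical stand-in for a real German dictionary, and mb_strtolower() assumes the mbstring extension is available:

```php
<?php
// Hypothetical mini dictionary; a real one would be loaded from a file or DB.
// array_flip() turns the words into keys so each lookup is a fast isset().
$germanDictionary = array_flip(['die', 'universität', 'und', 'gegründet']);

// Tokens as they might come out of the splitting step.
$words = ['Die', 'Universität', 'Mannheim', 'was', 'gegründet'];

$germanWords = [];
foreach ($words as $word) {
    // Lower-case with multibyte support so "Ü" and friends are handled.
    $key = mb_strtolower($word, 'UTF-8');
    if (isset($germanDictionary[$key])) {
        $germanWords[] = $word;
    }
}

print_r($germanWords);
```

Proper nouns and inflected forms will be missed unless the dictionary contains them, so this is only a first approximation of language detection.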