This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 5 years ago.
I am using a regular expression to extract the price on the right from the following HTML:
<p class="pricing ats-product-price"><em class="old_price">$99.99</em>$94.99</p>
Using preg_match_all in PHP:
preg_match_all('!<p class="pricing ats-product-price"><em class="old_price">.*?<\/em>(.*?)<\/p>!', $output, $prices);
Except, I noticed that sometimes the HTML doesn't include an old price. So sometimes the HTML looks like this:
<p class="pricing ats-product-price">$129.99</p>
It seems like my goal should be to extract the last price from the expression, or in other words the text that directly follows after the last question mark and before the </p>. This sort of expression is way out of my league though - hoping for some help here. Thanks.
Use a regular expression in combination with a parser:
<?php
$data = <<<DATA
<p class="pricing ats-product-price">
<em class="old_price">$99.99</em>
$94.99
</p>
<p class="pricing ats-product-price">$129.99</p>
DATA;
# set up the dom
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
# set up the xpath
$xpath = new DOMXPath($dom);
$regex = '~\$\d+[\d.]*\b\s*\Z~';
foreach ($xpath->query("//p") as $line) {
if (preg_match($regex, $line->nodeValue, $match)) {
echo $match[0] . "\n";
}
}
This yields
$129.99
$129.99
The snippet sets up the DOM, queries it for p tags and searches for the last price within.
See a demo for the expression on regex101.com.
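A variant of the same idea, as a minimal sketch: it selects only the pricing paragraphs by class and lets libxml add the implied html/body wrapper, so each p is parsed as its own element. The function name extractCurrentPrices and the class-token XPath test are assumptions made for illustration, not part of the answer above.

```php
<?php
// Hypothetical helper: returns the last price found in each pricing paragraph.
function extractCurrentPrices(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Match "ats-product-price" as a whole class token, wherever it sits in the attribute.
    $query = '//p[contains(concat(" ", normalize-space(@class), " "), " ats-product-price ")]';

    $prices = [];
    foreach ($xpath->query($query) as $p) {
        // A price that is followed only by optional whitespace up to the end of the text.
        if (preg_match('~\$\d+(?:\.\d+)?(?=\s*\z)~', $p->textContent, $m)) {
            $prices[] = $m[0];
        }
    }
    return $prices;
}
```

Because the lookahead keeps trailing whitespace out of the match, the returned strings need no further trimming.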
This question already has answers here:
Regex select all text between tags
(23 answers)
Closed 2 years ago.
I'm trying to write a function that finds every substring in a string where the substring is some HTML tag, for example
<li>
But my regular expression doesn't work and I can't find my mistake.
$str = 'hello brbrbr <li> hello</li> <li>how are you?</li>';
$items = preg_match_all('/(<li>\w+<\/li>)', $str, $matches);
$items should be an array of the desired substrings.
Consider using DOMDocument to parse and manipulate HTML or XML tags. Do not reinvent the wheel with Regex.
$str = 'hello brbrbr <li> hello</li> <li>how are you?</li>';
$dom = new DOMDocument();
$dom->loadHTML($str);
$li = $dom->getElementsByTagName('li');
$value = $li->item(0)->nodeValue;
echo $value;
' hello'
Or, if you want to iterate over all of them:
foreach($li as $item)
echo $item->nodeValue, PHP_EOL;
' hello'
'how are you?'
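Since the question asks for an array of substrings, here is a minimal sketch that collects the markup of every li element. The function name listItems is hypothetical; saveHTML($node) serializes a single node, and trim() is applied in case the serializer adds surrounding whitespace.

```php
<?php
// Hypothetical helper: returns the outer HTML of every <li> in the input.
function listItems(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();

    $items = [];
    foreach ($dom->getElementsByTagName('li') as $li) {
        // saveHTML() with a node argument serializes just that node.
        $items[] = trim($dom->saveHTML($li));
    }
    return $items;
}

$str = 'hello brbrbr <li> hello</li> <li>how are you?</li>';
print_r(listItems($str));
```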
Markus' answer is correct, but in case you just want the quick and dirty regex, here it is:
$str = 'hello brbrbr <li> hello</li> <li>how are you?</li>';
preg_match_all('/(<li>.+<\/li>)/U', $str, $items);
The U modifier makes the quantifiers ungreedy (lazy) by default.
This question already has answers here:
get everything between <tag> and </tag> with php [duplicate]
(7 answers)
Regex for script tag in PHP
(1 answer)
Closed 3 years ago.
I would like to extract all JavaScript from a string with preg_match_all.
$pattern = '~<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>~su';
$success = preg_match_all($pattern, $str, $matches, PREG_SET_ORDER);
array(0 => '<script>alert("Hallo Welt 1");</script>');
The result now contains the script tag as well.
I would like to exclude this tag.
My Sample Online Regex with Sample Code.
Regex is the wrong tool for parsing XML/HTML. You should use a DOM parser instead. XPath is a language specialized for querying DOM structures.
$html = <<<_EOS_
<script>alert("Hallo Welt 1");</script>
<div>Hallo Welt</div>
<script type ="text/javascript">alert("Hallo Welt 2");</script>
<div>Hallo Welt 2</div>
<script type ="text/javascript">
alert("Hallo Welt 2");
</script>
_EOS_;
$doc = new DOMDocument();
$doc->loadHTML("<!DOCTYPE html><html>$html</html>");
$xpath = new DOMXPath($doc);
$scripts = $xpath->query('//script/text()');
foreach ($scripts as $script)
var_dump($script->data);
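If a regex has to be used anyway, a capture group around the script body leaves the tags out of the captured text. This is only a sketch: the helper name scriptBodies is made up, and the pattern will cut a script short if the literal text </script> appears inside a JavaScript string.

```php
<?php
// Hypothetical helper: returns only the code between <script ...> and </script>.
function scriptBodies(string $html): array
{
    // (.*?) is lazy; the s flag lets "." span newlines, the i flag ignores tag case.
    preg_match_all('~<script\b[^>]*>(.*?)</script>~si', $html, $matches);
    return $matches[1];   // group 1 holds the bodies without the surrounding tags
}
```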
This question already has answers here:
PHP parse/syntax errors; and how to solve them
(20 answers)
Closed 6 years ago.
I want to extract data from a web page, but I am getting an error in preg_match:
<?php
$html=file_get_contents("https://www.instagram.com/p/BJz4_yijmdJ/?taken-by=the.witty");
preg_match("("instapp:owner_user_id" content="(.*)")", $html, $match);
$title = $match[1];
echo $title;
?>
This is the error I get:
Parse error: syntax error, unexpected 'instapp' (T_STRING) in
/home/ubuntu/workspace/test.php on line 4
How can I do this? I also want to extract more data from the page with regex: is it possible to extract everything at once with a single pattern, or do I have to call preg_match several times?
The main problem is that you did not form a valid string literal. Note that PHP supports both single- and double-quoted string literals, and you may use that to your advantage:
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $match);
While it is OK to use paired (...) symbols as regex delimiters, I'd suggest using a more conventional delimiter such as /, ~ or #.
Also, (.*) is too generic a pattern and may match more than you need, since . also matches " and * is a greedy quantifier. A negated character class is better: ([^"]*) matches 0+ chars other than ".
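A small sketch of the difference, using a made-up one-line input (the id 12345 and the second meta tag are hypothetical):

```php
<?php
// Hypothetical one-line input; the id 12345 is invented for the example.
$html = '<meta property="instapp:owner_user_id" content="12345" /><meta property="x" content="y" />';

// Greedy (.*) runs to the LAST double quote on the line.
preg_match('~"instapp:owner_user_id" content="(.*)"~', $html, $greedy);

// The negated class stops at the FIRST double quote after content=".
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $negated);

echo $greedy[1], "\n";    // 12345" /><meta property="x" content="y
echo $negated[1], "\n";   // 12345
```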
HOWEVER, to parse HTML in PHP, you may use a DOM parser, like DOMDocument.
Here is a sample that gets all meta tags that have a content attribute, extracts the value of that attribute, and saves it in an array:
$html = "<html><head><meta property=\"al:ios:url\" content=\"instagram://media?id=1329656989202933577\" /></head><body><span/></body></html>";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@content]');
$res = array();
foreach($metas as $m) {
array_push($res, $m->getAttribute('content'));
}
print_r($res);
See the PHP demo
And to only get the id in the content attribute value of a meta tag whose property attribute is equal to al:ios:url, use
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@property="al:ios:url"]');
$id = "";
if (preg_match('~[?&]id=(\d+)~', $metas->item(0)->getAttribute('content'), $match))
{
$id = $match[1];
}
See another PHP demo
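The lookup can also be wrapped in a function that copes with a missing meta tag. This is a sketch under assumptions: the name iosMediaId is hypothetical, and it returns null when either the tag or the id parameter is absent.

```php
<?php
// Hypothetical helper: returns the numeric id from the al:ios:url meta tag, or null.
function iosMediaId(string $html): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world pages rarely parse without warnings
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $meta = $xpath->query('//meta[@property="al:ios:url"]')->item(0);

    // item(0) is null when no such tag exists, so check before reading the attribute.
    if ($meta instanceof DOMElement
        && preg_match('~[?&]id=(\d+)~', $meta->getAttribute('content'), $match)) {
        return $match[1];
    }
    return null;
}
```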
This question already has answers here:
Find an element by id and replace its contents with php
(3 answers)
Closed 8 years ago.
I have a string that I want to search through using regex, and not any other method such as DOMDocument.
Example:
<div class="form-item form-type-textarea form-item-answer2">
<div class="form-textarea-wrapper resizable"><textarea id="edit-answer2" name="answer2" cols="60" rows="5" class="form-textarea">
this is some text
</textarea>
</div>
</div>
Desired:
this is some text
What I want is a single regex that leaves me with 'this is some text'; that text is not fixed and will vary. I will then pass it through preg_replace to get the desired outcome.
Current regex is
div class="form-item.*class="form-textarea">$\A<\/textarea>.*<\/div>/gU
I have tried using the end of string and start of string anchors, but to no avail.
Don't parse HTML with regexes. Use a DOM parser:
$doc = new DOMDocument();
$doc->loadHTML($html);
$textarea = $doc->getElementById("edit-answer2");
echo $textarea->nodeValue;
If you want to modify the value:
$textarea->nodeValue = "foo bar";
$html = $doc->saveHTML();
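Note that the textarea in the question contains line breaks around the text, so the raw nodeValue carries them too. A minimal sketch that trims them and handles a missing element (the function name textareaValue is an assumption):

```php
<?php
// Hypothetical helper: returns the trimmed text of the element with the given id, or null.
function textareaValue(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // suppress warnings about imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $node = $doc->getElementById($id);
    // trim() drops the newlines that surround the text inside the textarea.
    return $node === null ? null : trim($node->textContent);
}
```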
Your regex would be,
/<textarea id[^>]*>\n([^\n]*)/gs
OR
/<textarea id[^>]*>(.*?)(?=<\/textarea>)/gs
Captured group 1 contains the string this is some text
OR
you could use the below regex to match only the string this is some text.
/div class="form-item.*class="form-textarea">[^\n]*\n\K[^\n]*/s
This question already has answers here:
Parse All Links That Contain A Specific Word In "href" Tag [duplicate]
(4 answers)
Closed 9 years ago.
Hello, I have a problem with the regex I use to get a value out of an HTML tag in PHP. The following strings are possible:
<span class="down last_position">xyz</span>
<span class="up last_position">xyz</span>
<span class="last_position new">xyz</span>
And I have the following preg_match command:
preg_match('#<span class="last_position.*?">(.+)</span>#', $string, $matches);
Which pretty much just covers case #3. So I was wondering what I would need to add in front of last_position to get all cases possible..?
Thanks a lot..
Edit: For all who are wondering what value is to be matched: "xyz"
Avoid using regex to parse HTML, as it can be error-prone. Your specific use case is better solved with a DOM parser:
$html = <<< EOF
<span class="down last_position">xyz</span>
<span class="up last_position">xyz</span>
<span class="last_position new">xyz</span>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query("//span[contains(@class, 'last_position')]/text()");
for($i=0; $i < $nodeList->length; $i++) {
$node = $nodeList->item($i);
var_dump($node->nodeValue);
}
OUTPUT:
string(3) "xyz"
string(3) "xyz"
string(3) "xyz"
Try the following (and yes you can use regex to match data from HTML):
$string = '<span class="down last_position">xyz</span>
<span class="up last_position">xyz</span>
<span class="last_position new">xyz</span>';
preg_match_all('#<span\s.*?class=".*?last_position.*?".*?>(.+?)</span>#i', $string, $m);
print_r($m);
Online demo.
Try to use this
preg_match('#<span class="?(.*)last_position.*?">(.+)</span>#', $string, $matches);
You could try this:
preg_match_all('#<span class="[^"]*last_position[^"]*">(.+)</span>#', $string, $matches, PREG_PATTERN_ORDER);
You'll then find the values in $matches[1][0], $matches[1][1], $matches[1][2] ....
The part I added inside the class attribute's value, [^"]*, matches any number of characters other than a double quote. Thus it matches anything inside the attribute's value.
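A runnable sketch of that pattern against all three cases. The helper name lastPositionValues is hypothetical, and two details differ from the pattern above and are my own assumptions: the word boundaries (\b) keep a class like my_last_position_old from matching, and (.+?) is lazy so that several spans on one line are captured separately.

```php
<?php
// Hypothetical helper: returns the text of every span whose class list contains last_position.
function lastPositionValues(string $html): array
{
    // [^"]* allows other class names before and after last_position.
    $pattern = '#<span class="[^"]*\blast_position\b[^"]*">(.+?)</span>#';
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

$string = '<span class="down last_position">xyz</span>
<span class="up last_position">xyz</span>
<span class="last_position new">xyz</span>';
print_r(lastPositionValues($string));
```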
Sure, parsing XML is not possible using RegEx, because XML is not regular. But in many real-world cases, XML documents used as input are limited and predictable enough to simply be treated as text.
Something like this should work for you:
preg_match('#<span class="[^>"]*?last_position[^>"]*">(.+)</span>#', $string, $matches);