This question already has answers here:
Parse All Links That Contain A Specific Word In "href" Tag [duplicate]
(4 answers)
Closed 9 years ago.
Hello, I have a problem with the regex I use to get a value out of an HTML tag in PHP. The following strings are possible:
<span class="down last_position">xyz</span>
<span class="up last_position">xyz</span>
<span class="last_position new">xyz</span>
And I have the following preg_match command:
preg_match('#<span class="last_position.*?">(.+)</span>#', $string, $matches);
This pretty much only covers case #3. So I was wondering what I need to add in front of last_position to cover all three cases?
Thanks a lot..
Edit: For all who are wondering what value is to be matched: "xyz"
Avoid using regex to parse HTML, as it is error-prone. Your specific use case is better solved with a DOM parser:
$html = <<< EOF
<span class="down last_position">xyz</span>
<span class="up last_position">xyz</span>
<span class="last_position new">xyz</span>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query("//span[contains(@class, 'last_position')]/text()");
for ($i = 0; $i < $nodeList->length; $i++) {
    $node = $nodeList->item($i);
    var_dump($node->nodeValue);
}
OUTPUT:
string(3) "xyz"
string(3) "xyz"
string(3) "xyz"
Try the following (and yes you can use regex to match data from HTML):
$string = '<span class="down last_position">xyz</span>
<span class="up last_position">xyz</span>
<span class="last_position new">xyz</span>';
preg_match_all('#<span\s.*?class=".*?last_position.*?".*?>(.+?)</span>#i', $string, $m);
print_r($m);
Online demo.
Try to use this
preg_match('#<span class="?(.*)last_position.*?">(.+)</span>#', $string, $matches);
You could try this:
preg_match_all('#<span class="[^"]*last_position[^"]*">(.+)</span>#', $string, $matches, PREG_PATTERN_ORDER);
You'll then find the values in $matches[1][0], $matches[1][1], $matches[1][2] ....
The part I added to the class attribute's value, [^"]*, matches any number of characters that are not a double quote. Thus it matches anything inside the attribute's value.
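As a rough illustration of the negated character class, here is a minimal sketch that runs the pattern over the three sample spans from the question. The spans are joined on one line on purpose, and the capture group is made lazy with (.+?), which is an assumption added here so that each match stops at its own closing tag.

```php
<?php
// The three sample strings from the question, joined on a single line.
$string = '<span class="down last_position">xyz</span>'
        . '<span class="up last_position">xyz</span>'
        . '<span class="last_position new">xyz</span>';

// [^"]* on both sides allows other class names before and after last_position,
// but can never run past the closing quote of the class attribute.
preg_match_all('#<span class="[^"]*last_position[^"]*">(.+?)</span>#', $string, $matches);

// $matches[1] holds the captured text of every matching span.
print_r($matches[1]);
```

Note that this still matches last_position as a plain substring, so a class such as not_last_position would also qualify.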
Sure, regular expressions cannot parse XML in general, because XML is not a regular language. But in many real-world cases, the XML documents used as input are limited and predictable enough to simply be treated as text.
Something like this should work for you:
preg_match('#<span class="[^>"]*?last_position[^>"]*">(.+)</span>#', $string, $matches);
Related
I have the following regex to find anchor tag that has 'Kontakt' as the anchor text:
#<a.*href="[^"]*".*>Kontakt<\/a>#
Here is the string to find from:
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</a></li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</a></li></ul>
So the result should be:
<a href="/kontakt" >Kontakt</a>
But the result I get is:
Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt
And here is my PHP code:
$pattern = '#<a.*href="[^"]*".*>Kontakt<\/a>#';
preg_match_all($pattern, $string, $matches);
You are using preg_match_all() so I assume you are willing to receive multiple qualifying anchor tags. Parsing valid HTML with a legitimate DOM parser will always be more stable and easier to read than the equivalent regex technique. It's just not a good idea to rely on regex for DOM parsing because regex is "DOM-unaware" -- it just matches things that look like HTML entities.
In the XPath query, search for <a> tags (existing at any depth in the document) which have the qualifying string as the whole text.
Code: (Demo)
$html = <<<HTML
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</a></li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</a></li></ul>
HTML;
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query('//a[text() = "Kontakt"]') as $a) {
    $result[] = $dom->saveHTML($a);
}
var_export($result);
Output:
array (
0 => '<a href="/kontakt">Kontakt</a>',
)
Is it more concise to use regex? Yes, but it is also less reliable for general use.
You will notice that the DOMDocument also automatically cleans up the unnecessary spacing in your markup.
If you can trust your input will always have <a href in every anchor tag then try:
'#<a href="[^"]*"[^>]*>Kontakt<\/a>#';
// Instead of what you have:
'#<a.*href="[^"]*".*>Kontakt<\/a>#';
.* is the "wildcard" meta-character . and the "zero or more times" quantifier * together.
.* matches anything any number of times.
Try it https://regex101.com/r/qxnRZv/1
Your regex:
...a.*href...
is greedy, which means: "after a, match as many characters as possible before a href". That causes your regex to return multiple hrefs.
You can use the lazy-mode operator ? :
...a.*?href....
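which means "after a, match as few characters as possible before a href". Note that the match still begins at the first <a in the string, so excluding > with [^>]*, as in the previous answer, is the more robust fix.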
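The greedy and lazy variants can be compared side by side. The sketch below uses a shortened, hypothetical version of the menu markup (the $string value is an assumption, not the original page). In it, the lazy pattern still begins at the first <a, because the regex engine reports the leftmost match, whereas excluding > with [^>]* keeps the match inside a single tag.

```php
<?php
// Hypothetical, shortened menu markup with two anchors on one line.
$string = '<li><a href="/team">Team</a></li><li><a href="/kontakt">Kontakt</a></li>';

// Greedy: the match starts at the first <a and runs through to Kontakt</a>.
preg_match('#<a.*href="[^"]*".*>Kontakt</a>#', $string, $greedy);

// Lazy: .*? grows one character at a time, but the match still starts
// at the first <a and keeps growing until it reaches >Kontakt</a>.
preg_match('#<a.*?href="[^"]*".*?>Kontakt</a>#', $string, $lazy);

// Negated class: [^>]* cannot leave the opening tag, so only the
// anchor whose text is Kontakt can match.
preg_match('#<a\s[^>]*href="[^"]*"[^>]*>Kontakt</a>#', $string, $strict);

var_dump($greedy[0], $lazy[0], $strict[0]);
```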
This question already has answers here:
Regex select all text between tags
(23 answers)
Closed 2 years ago.
I'm trying to write a function, which will find each substring in string, where substring is some html tag, for example
<li>
But my regular expression doesn't work and I can't find my mistake.
$str = 'hello brbrbr <li> hello</li> <li>how are you?</li>';
$items = preg_match_all('/(<li>\w+<\/li>)', $str, $matches);
$items must be an array of the desired substrings
Consider using DOMDocument to parse and manipulate HTML or XML tags. Do not reinvent the wheel with Regex.
$str = 'hello brbrbr <li> hello</li> <li>how are you?</li>';
$dom = new DOMDocument();
$dom->loadHTML($str);
$li = $dom->getElementsByTagName('li');
$value = $li->item(0)->nodeValue;
echo $value;
' hello'
Or if you want to iterate over all
foreach ($li as $item) {
    echo $item->nodeValue, PHP_EOL;
}
' hello'
'how are you?'
Markus' answer is correct but in case you just want the fast and dirty regex one, here it is:
$str = 'hello brbrbr <li> hello</li> <li>how are you?</li>';
preg_match_all('/(<li>.+<\/li>)/U', $str, $items);
The U modifier inverts greediness, so .+ becomes lazy and stops at the first </li>.
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 5 years ago.
I am using a regular expression to extract the price on the right from the following HTML:
<p class="pricing ats-product-price"><em class="old_price">$99.99</em>$94.99</p>
Using preg match in PHP:
preg_match_all('!<p class="pricing ats-product-price"><em class="old_price">.*?<\/em>(.*?)<\/p>!', $output, $prices);
Except, I noticed that sometimes the HTML doesn't include an old price. So sometimes the HTML looks like this:
<p class="pricing ats-product-price">$129.99</p>
It seems like my goal should be to extract the last price from the expression, or in other words the text that directly follows after the last question mark and before the </p>. This sort of expression is way out of my league though - hoping for some help here. Thanks.
Use a regular expression in combination with a parser:
<?php
$data = <<<DATA
<p class="pricing ats-product-price">
<em class="old_price">$99.99</em>
$94.99
</p>
<p class="pricing ats-product-price">$129.99</p>
DATA;
# set up the dom
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
# set up the xpath
$xpath = new DOMXPath($dom);
$regex = '~\$\d+[\d.]*\b\s*\Z~';
foreach ($xpath->query("//p") as $line) {
    if (preg_match($regex, $line->nodeValue, $match)) {
        echo $match[0] . "\n";
    }
}
This yields
$129.99
$129.99
The snippet sets up the DOM, queries it for p tags and searches for the last price within.
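If the current price is always the last text node directly inside the <p>, XPath alone may be enough and the regex can be dropped. This is a minimal sketch under that assumption, using the two markup variants from the question. It calls loadHTML() without extra flags, so libxml wraps both paragraphs in its own html and body elements.

```php
<?php
// Nowdoc, so the dollar signs are taken literally.
$html = <<<'HTML'
<p class="pricing ats-product-price"><em class="old_price">$99.99</em>$94.99</p>
<p class="pricing ats-product-price">$129.99</p>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings about imperfect markup
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$prices = [];
// text()[last()] selects the last text node that is a direct child of each <p>,
// which skips the old price because that text lives inside the <em>.
foreach ($xpath->query('//p[contains(@class, "ats-product-price")]/text()[last()]') as $text) {
    $prices[] = trim($text->nodeValue);
}

print_r($prices);
```

If a price could be wrapped in another element, or followed by trailing text, this query would need adjusting.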
See a demo for the expression on regex101.com.
This question already has answers here:
PHP parse/syntax errors; and how to solve them
(20 answers)
Closed 6 years ago.
I want to extract data from a web source, but I am getting an error in preg_match:
<?php
$html=file_get_contents("https://www.instagram.com/p/BJz4_yijmdJ/?taken-by=the.witty");
preg_match("("instapp:owner_user_id" content="(.*)")", $html, $match);
$title = $match[1];
echo $title;
?>
This is the error i get
Parse error: syntax error, unexpected 'instapp' (T_STRING) in
/home/ubuntu/workspace/test.php on line 4
How can I do this? I also want to extract more data from the page with regex. Is it possible to extract everything at once with a single call, or do I need to call preg_match several times?
The main problem is that you did not form a valid string literal. Note that PHP supports both single- and double-quoted string literals, and you may use that to your advantage:
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $match);
While it is OK to use paired (...) symbols as regex delimiters, I'd suggest using a more conventional delimiter such as /, ~ or #.
Also, (.*) is too generic a pattern and may match more than you need, since . also matches " and * is a greedy quantifier. A negated character class is better: ([^"]*) matches zero or more characters other than ".
HOWEVER, to parse HTML in PHP, you may use a DOM parser, like DOMDocument.
Here is a sample to get all meta tags that have content attribute and extracting the value of that attribute and saving in an array:
$html = "<html><head><meta property=\"al:ios:url\" content=\"instagram://media?id=1329656989202933577\" /></head><body><span/></body></html>";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@content]');
$res = array();
foreach ($metas as $m) {
    array_push($res, $m->getAttribute('content'));
}
print_r($res);
See the PHP demo
And to only get the id in the content attribute value of a meta tag whose property attribute is equal to al:ios:url, use
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@property="al:ios:url"]');
$id = "";
if (preg_match('~[?&]id=(\d+)~', $metas->item(0)->getAttribute('content'), $match))
{
    $id = $match[1];
}
See another PHP demo
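To collect several values in one pass, as the question asks, one option is to build a map from each meta tag's property to its content. This is only a sketch: the markup and values below are made up for illustration and are not taken from the real page.

```php
<?php
// Hypothetical sample markup; the values are invented.
$html = '<html><head>'
      . '<meta property="instapp:owner_user_id" content="12345" />'
      . '<meta property="al:ios:url" content="instagram://media?id=678" />'
      . '</head><body></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings about imperfect markup
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Keep only meta tags that carry both attributes, keyed by property name.
$meta = [];
foreach ($xpath->query('//meta[@property and @content]') as $node) {
    $meta[$node->getAttribute('property')] = $node->getAttribute('content');
}

print_r($meta);
```

Individual values can then be read by key, for example $meta['instapp:owner_user_id'].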
This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I'd love some help with this small problem.
I need PHP to collect HTML. Let's say this is a part of the full HTML code:
<div class="inner">
<p>Hi there. I am text! I'm playing hide and seek with PHP.</p>
</div>
My goal is to collect everything between <p> and </p>. This is the PHP I've got so far:
$file = file_get_contents($link); //Import le HTML
preg_match('<div class="inner">
<p>(.*?)</p>
</div>si', $file, $k); //Play find & seek
$k_out = $k[1];
$name = strtok($k , '#'); //Remove everything behind the hashtags
echo $name;
But - sadly - PHP error'd me:
*Warning: preg_match(): Unknown modifier '<' in /home/fourwonders/alexstuff/vinedownloader/public_html/v/index.php on line 131*
Can you help me out? At least, thanks for reading!
In this case it's because you didn't specify delimiters. You always need delimiters, and you must escape the delimiter character wherever it appears in your expression:
preg_match('#<div class="inner">
<p>(.*?)</p>
</div>#si', $file, $k);
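The effect of the delimiter choice can be seen in a small comparison. The sketch below uses a shortened version of the sample markup (the $file value here is an assumption standing in for the downloaded page) and runs the same pattern once with / and once with # as the delimiter.

```php
<?php
// Shortened stand-in for the downloaded page.
$file = '<div class="inner">
<p>Hi there. I am text!</p>
</div>';

// With "/" as the delimiter, the "/" in </p> has to be escaped.
preg_match('/<p>(.*?)<\/p>/si', $file, $a);

// With "#" as the delimiter, "/" is an ordinary character.
preg_match('#<p>(.*?)</p>#si', $file, $b);

// Both calls capture the same paragraph text in group 1.
var_dump($a[1], $b[1]);
```

When part of a pattern comes from a variable, preg_quote() accepts the delimiter as its second argument and escapes it as well.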
Don't use regular expressions to parse HTML. Use a DOM Parser instead:
$doc = new DOMDocument();
$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('p');
foreach ($tags as $tag) {
    echo $tag->nodeValue;
}
Output:
Hi there. I am text! I'm playing hide and seek with PHP.
Demo!