This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I'd love some help with this small problem.
I need PHP to collect HTML. Let's say this is part of the full HTML code:
<div class="inner">
<p>Hi there. I am text! I'm playing hide and seek with PHP.</p>
</div>
My goal is to collect everything between <p> and </p>. This is the PHP I've got so far:
$file = file_get_contents($link); //Import le HTML
preg_match('<div class="inner">
<p>(.*?)</p>
</div>si', $file, $k); //Play find & seek
$k_out = $k[1];
$name = strtok($k , '#'); //Remove everything behind the hashtags
echo $name;
But - sadly - PHP error'd me:
*Warning: preg_match(): Unknown modifier '<' in /home/fourwonders/alexstuff/vinedownloader/public_html/v/index.php on line 131*
Can you help me out? At least, thanks for reading!
In this case it's because you don't specify delimiters. You always need delimiters, and you always need to escape the delimiter character if it appears in your expression:
preg_match('#<div class="inner">
<p>(.*?)</p>
</div>#si', $file, $k);
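A minimal sketch of the delimiter rule, assuming a sample string modelled on the question's HTML (the variables `$needle` and `$pattern` are illustrative, not from the question). `preg_quote()` accepts the delimiter as its second argument and escapes it along with the other special characters:

```php
<?php
// Assumed sample input, modelled on the question's HTML.
$file = "<div class=\"inner\">\n<p>Hi there. I am text!</p>\n</div>";

// '#' is the delimiter. It does not occur in the pattern, so nothing needs escaping.
// \s* tolerates the line breaks between the tags.
preg_match('#<div class="inner">\s*<p>(.*?)</p>\s*</div>#si', $file, $k);
echo $k[1], "\n";

// When literal text may contain the delimiter, let preg_quote() escape it.
$needle  = 'a#b';                                   // hypothetical literal text
$pattern = '#' . preg_quote($needle, '#') . '#';
var_dump(preg_match($pattern, 'xx a#b yy'));
```

Picking a delimiter that never occurs in the pattern (such as `#` or `~` for HTML) is usually simpler than escaping every `/`.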
Don't use regular expressions to parse HTML. Use a DOM Parser instead:
$doc = new DOMDocument();
$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('p');
foreach ($tags as $tag) {
echo $tag->nodeValue;
}
Output:
Hi there. I am text! I'm playing hide and seek with PHP.
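The loop above collects every p element in the document. If only the paragraphs inside div class="inner" are wanted, an XPath query can narrow the selection. This is a sketch under that assumption, with made-up sample HTML containing a second div:

```php
<?php
// Assumed sample: one wanted paragraph and one that should be ignored.
$html = '<div class="inner"><p>Hi there. I am text!</p></div>'
      . '<div class="other"><p>Not me.</p></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$texts = [];
// Only <p> elements that are direct children of <div class="inner">.
foreach ($xpath->query('//div[@class="inner"]/p') as $p) {
    $texts[] = $p->nodeValue;
}
print_r($texts);
```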
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Replace URLs in text with HTML links
(17 answers)
Closed 2 years ago.
I am using the following regex to replace plain URLs with html links in a text:
preg_replace('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', '<a href="$1">$1</a> ', $text_msg);
Now I want to modify the regex so that it replaces the URL only if there is no double quote directly before it, meaning the URL is not part of a tag (i.e. the URL is at the start of the string, at the start of a line, or after a space).
Examples:
This is the link <a href="http://test.com"> ... (URL should not be replaced)
http://test.com (at the beginning of a line or of the whole multi-line string, should be replaced)
This is the site: http://test.com (URL should be replaced)
Thanks.
Your question actually breaks down into two smaller problems. You've already solved one of them, which is parsing the URL with a regular expression. The second part is extracting text from HTML, which isn't easily solved by a regular expression at all. The confusion comes from trying to do both at the same time with one regular expression (parsing HTML and parsing the URL). See the parsing HTML with regex SO Answer for more details on why this is a bad idea.
So instead, let's just use an HTML parser (like DOMDocument) to extract text nodes from the HTML and parse URLs inside those text nodes.
Here's an example
<?php
$html = <<<'HTML'
<p>This is a URL http://abcd/ims in text</p>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Let's walk the entire DOM tree looking for text nodes
function walk(DOMNode $node, $skipParent = false) {
if (!$skipParent) {
yield $node;
}
if ($node->hasChildNodes()) {
foreach ($node->childNodes as $n) {
yield from walk($n);
}
}
}
foreach (walk($dom->firstChild) as $node) {
if ($node instanceof DOMText) {
// lets find any links and change them to HTML
if (preg_match('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', $node->nodeValue, $match)) {
$node->nodeValue = preg_replace('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', "\xff ",
$node->nodeValue);
$nodeSplit = explode("\xff", $node->nodeValue, 2);
$node->nodeValue = $nodeSplit[1];
$newNode = $dom->createTextNode($nodeSplit[0]);
$href = $dom->createElement('a', $match[1]);
$href->setAttribute('href', $match[1]);
$node->parentNode->insertBefore($newNode, $node);
$node->parentNode->insertBefore($href, $node);
}
}
}
echo $dom->saveHTML();
Which gives you the desired HTML as output:
<p>This is a URL <a href="http://abcd/ims">http://abcd/ims</a> in text</p>
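The same idea can be sketched so that it handles several URLs per text node and skips text that is already inside a link. The function name `linkifyTextNodes`, the wrapping div, and the simplified URL pattern are my assumptions, not part of the answer above:

```php
<?php
// Hypothetical helper: wrap bare URLs found in text nodes in <a> elements.
function linkifyTextNodes(string $html): string
{
    $dom = new DOMDocument();
    // Wrap the input in one <div> so the fragment has a single root element.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($dom);
    // Snapshot the text nodes first, because the loop below changes the tree.
    $textNodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));

    foreach ($textNodes as $textNode) {
        // With PREG_SPLIT_DELIM_CAPTURE, odd indexes hold URLs, even indexes plain text.
        $parts = preg_split('~(https?://\S+)~i', $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $a = $dom->createElement('a');
                $a->setAttribute('href', $part);
                $a->appendChild($dom->createTextNode($part));
                $fragment->appendChild($a);
            } else {
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize the children of the wrapper <div>, not the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$result = linkifyTextNodes('<p>See http://test.com now, or <a href="http://x.org">here</a></p>');
echo $result, "\n";
```

Because attribute values are never text nodes, a URL inside href="..." is left untouched without any lookbehind in the regex.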
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 5 years ago.
I am using a regular expression to extract the price on the right from the following HTML:
<p class="pricing ats-product-price"><em class="old_price">$99.99</em>$94.99</p>
Using preg match in PHP:
preg_match_all('!<p class="pricing ats-product-price"><em class="old_price">.*?<\/em>(.*?)<\/p>!', $output, $prices);
Except, I noticed that sometimes the HTML doesn't include an old price. So sometimes the HTML looks like this:
<p class="pricing ats-product-price">$129.99</p>
It seems like my goal should be to extract the last price from the expression, or in other words the text that directly follows after the last question mark and before the </p>. This sort of expression is way out of my league though - hoping for some help here. Thanks.
Use a regular expression in combination with a parser:
<?php
$data = <<<DATA
<p class="pricing ats-product-price">
<em class="old_price">$99.99</em>
$94.99
</p>
<p class="pricing ats-product-price">$129.99</p>
DATA;
# set up the dom
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
# set up the xpath
$xpath = new DOMXPath($dom);
$regex = '~\$\d+[\d.]*\b\s*\Z~';
foreach ($xpath->query("//p") as $line) {
if (preg_match($regex, $line->nodeValue, $match)) {
echo $match[0] . "\n";
}
}
This yields
$129.99
$129.99
The snippet sets up the DOM, queries it for p tags and searches for the last price within.
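As an alternative sketch that avoids the regular expression, XPath can select the last non-blank text node that is a direct child of each p, which is the current price in both layouts. The wrapping div and the `$prices` array are my assumptions. The default `loadHTML()` flags are used here so the result does not depend on how several top-level elements are handled:

```php
<?php
// Nowdoc, so the dollar signs are never treated as variables.
$html = <<<'HTML'
<div>
<p class="pricing ats-product-price"><em class="old_price">$99.99</em>$94.99</p>
<p class="pricing ats-product-price">$129.99</p>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$prices = [];
foreach ($xpath->query("//p[contains(@class, 'ats-product-price')]") as $p) {
    // Direct text children only, blank ones filtered out, then the last one.
    $last = $xpath->query('text()[normalize-space()][last()]', $p)->item(0);
    if ($last !== null) {
        $prices[] = trim($last->nodeValue);
    }
}
print_r($prices);
```

The old price sits inside the em element, so it is never a direct text child of the p and is skipped automatically.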
See a demo for the expression on regex101.com.
This question already has answers here:
PHP parse/syntax errors; and how to solve them
(20 answers)
Closed 6 years ago.
I want to extract data from a web source, but I am getting an error in preg_match:
<?php
$html=file_get_contents("https://www.instagram.com/p/BJz4_yijmdJ/?taken-by=the.witty");
preg_match("("instapp:owner_user_id" content="(.*)")", $html, $match);
$title = $match[1];
echo $title;
?>
This is the error I get:
Parse error: syntax error, unexpected 'instapp' (T_STRING) in
/home/ubuntu/workspace/test.php on line 4
How can I do this? I also want to extract more data from the page with regex. Is it possible to extract everything at once with a single pattern, or do I need to call preg_match several times?
The main problem is that you did not form a valid string literal. Note that PHP supports both single- and double-quoted string literals, and you may use that to your advantage:
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $match);
While it is OK to use paired (...) symbols as regex delimiters, I'd suggest a more conventional delimiter such as /, ~ or #.
Also, (.*) is too generic a pattern and may match more than you need, since . also matches " and * is a greedy quantifier. A negated character class is better: ([^"]*) matches zero or more characters other than ".
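A small sketch of the difference, assuming a made-up line with two meta tags (the id value 12345 is invented for illustration):

```php
<?php
// Assumed sample: two meta tags on one line, so .* can run past the first value.
$html = '<meta property="instapp:owner_user_id" content="12345" />'
      . '<meta property="og:title" content="Hello" />';

// Greedy .* extends to the LAST double quote on the line.
preg_match('~"instapp:owner_user_id" content="(.*)"~', $html, $greedy);

// [^"]* stops at the first double quote after the value starts.
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $negated);

var_dump($greedy[1]);
var_dump($negated[1]);
```

The greedy capture swallows the second tag up to its final quote, while the negated class returns only the attribute value.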
HOWEVER, to parse HTML in PHP, you may use a DOM parser, like DOMDocument.
Here is a sample that finds all meta tags that have a content attribute, extracts the value of that attribute and saves it in an array:
$html = "<html><head><meta property=\"al:ios:url\" content=\"instagram://media?id=1329656989202933577\" /></head><body><span/></body></html>";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@content]');
$res = array();
foreach($metas as $m) {
array_push($res, $m->getAttribute('content'));
}
print_r($res);
See the PHP demo
And to only get the id in the content attribute value of a meta tag whose property attribute is equal to al:ios:url, use
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@property="al:ios:url"]');
$id = "";
if (preg_match('~[?&]id=(\d+)~', $metas->item(0)->getAttribute('content'), $match))
{
$id = $match[1];
}
See another PHP demo
This question already has answers here:
Parse All Links That Contain A Specific Word In "href" Tag [duplicate]
(4 answers)
Closed 9 years ago.
Hello I have a problem with my Regex code I use to get a value out of a HTML-tag using PHP. I have the following strings possible:
<span class="down last_position">xyz</span>
<span class="up last_position">xyz</span>
<span class="last_position new">xyz</span>
And I have the following preg_match command:
preg_match('#<span class="last_position.*?">(.+)</span>#', $string, $matches);
Which pretty much just covers case #3. So I was wondering what I would need to add in front of last_position to get all cases possible..?
Thanks a lot..
Edit: For all who are wondering what value is to be matched: "xyz"
Avoid using regex to parse HTML, as it can be error-prone. Your specific use case is better solved with a DOM parser:
$html = <<< EOF
<span class="down last_position">xyz</span>
<span class="up last_position">xyz</span>
<span class="last_position new">xyz</span>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query("//span[contains(@class, 'last_position')]/text()");
for($i=0; $i < $nodeList->length; $i++) {
$node = $nodeList->item($i);
var_dump($node->nodeValue);
}
OUTPUT:
string(3) "xyz"
string(3) "xyz"
string(3) "xyz"
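Note that contains() on the class attribute is a substring test, so it would also accept a class such as not_last_position_here. A common XPath 1.0 idiom matches the class as a whole token instead. The sketch below assumes sample spans with distinct values so the result is easy to check:

```php
<?php
$html = <<<'HTML'
<span class="down last_position">a</span>
<span class="last_position new">b</span>
<span class="not_last_position_here">c</span>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Pad the normalized class list with spaces, then look for " last_position ".
$query = "//span[contains(concat(' ', normalize-space(@class), ' '), ' last_position ')]";

$values = [];
foreach ($xpath->query($query) as $span) {
    $values[] = $span->nodeValue;
}
print_r($values);
```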
Try the following (and yes you can use regex to match data from HTML):
$string = '<span class="down last_position">xyz</span>
<span class="up last_position">xyz</span>
<span class="last_position new">xyz</span>';
preg_match_all('#<span\s.*?class=".*?last_position.*?".*?>(.+?)</span>#i', $string, $m);
print_r($m);
Online demo.
Try to use this
preg_match('#<span class="?(.*)last_position.*?">(.+)</span>#', $string, $matches);
You could try this:
preg_match_all('#<span class="[^"]*last_position[^"]*">(.+)</span>#', $string, $matches, PREG_PATTERN_ORDER);
You'll then find the values in $matches[1][0], $matches[1][1], $matches[1][2] ....
The part I added inside the class attribute's value, [^"]*, matches any number of characters that are not a double quote. Thus it matches anything inside the attribute's value without running past its closing quote.
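The same negated-class idea can be applied to the captured text. As a sketch, assuming two spans that sit on the same line (sample data of my own), a greedy (.+) runs on to the last closing tag, while ([^<]+) stops at the first tag:

```php
<?php
// Assumed sample: two matching spans on one line.
$string = '<span class="up last_position">one</span><span class="last_position new">two</span>';

// Greedy capture: a single match that swallows the markup between the spans.
preg_match_all('#<span class="[^"]*last_position[^"]*">(.+)</span>#', $string, $greedy);

// Negated class: the capture cannot cross a "<", so each span matches separately.
preg_match_all('#<span class="[^"]*last_position[^"]*">([^<]+)</span>#', $string, $strict);

print_r($greedy[1]);
print_r($strict[1]);
```

This only works when the span contains plain text; nested markup inside the span is one of the cases where a DOM parser is the safer choice.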
Sure, parsing XML is not possible using RegEx, because XML is not regular. But in many real-world cases, XML documents used as input are limited and predictable enough to simply be treated as text.
Something like this should work for you:
preg_match('#<span class="[^>"]*?last_position[^>"]*">(.+)</span>#', $string, $matches);
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
How to parse and process HTML with PHP?
How can I convert ereg expressions to preg in PHP?
here is an example
echo "<div id='spaced' class='romaji'><span class='spaced orig word'>neko</span><span class='space'>";
Please ignore the "echos"; it's the only way I could get the HTML to show.
I need a regular expression that can select whatever is between the
echo "<span class='spaced orig word'>";
tag and its ending tag
echo "</span>";
i tried
$pattern = "span class='spaced orig word'>(.+?)</s";
preg_match_all ($pattern, $jp_page, $result_ro);
if ($result_ro[1])
$results[] = implode(' ', $result_ro[1]);
else
return null; // Failed to retrieve Hiragana, so abort
and some other things, but I can't get it right. I get nothing most of the time because I don't really know what I'm doing with regular expressions.
currently getting a warning with this code
Warning: preg_match_all(): Delimiter must not be alphanumeric or backslash
THE PONY HE COMES!
Instead, try using a DOM parser:
$dom = new DOMDocument();
$dom->loadHTML($jp_page);
$xpath = new DOMXPath($dom);
$spans = $xpath->query("//span[@class='spaced orig word']");
$results = "";
foreach($spans as $span) {
$results .= " ".$span->textContent; // append, so earlier matches are not overwritten
}
$results = trim($results);
return $results;
No delimiters
try this reg
<?php
$pattern = '#<span.*?>(.*?)</span>#';