Parse HTML in PHP and extract value [duplicate]

This question already has answers here:
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
(10 answers)
What to do Regular expression pattern doesn't match anywhere in string?
(8 answers)
Closed 6 years ago.
I'm trying to extract some information from a website.
There is a section that looks like this:
<th>Some text here</th><td>text to extract</td>
I would like to find (with a regexp or another solution) the part starting with "Some text here" and extract "text to extract" from it.
I was trying to use the following regexp solution:
$reg = '/<th>Some text here<\/th><td>(.*)<\/td>/';
preg_match_all($reg, $content, $result, PREG_PATTERN_ORDER);
print_r($result);
but it gives me just an empty array:
Array ( [0] => Array ( ) [1] => Array ( ) )
How should I construct my regular expression to extract the wanted value? Or what other solution can I use to extract it?

Using XPath:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($dom);
$content = $xp->evaluate('string(//th[.="Some text here"]/following-sibling::*[1][name()="td"])');
echo $content;
XPath query details:
string( # return a string instead of a node list
// # anywhere in the DOM tree
th # a th node
[.="Some text here"] # predicate: its content is "Some text here"
/following-sibling::*[1] # first following sibling
[name()="td"] # predicate: must be a td node
)
The reason your pattern doesn't work is probably that the td content contains newline characters, which the dot . does not match unless the s modifier is set.
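To illustrate the newline point, here is a minimal sketch. The input string is a made-up example (not the asker's real page), and the pattern is a simplified variant of the one in the question with a lazy quantifier:

```php
<?php
// Hypothetical input: the <td> content spans two lines.
$content = "<th>Some text here</th><td>text to\nextract</td>";

// Without the s modifier the dot stops at the newline, so nothing matches.
$plain = preg_match('~<th>Some text here</th><td>(.*?)</td>~', $content, $m1);

// With the s (DOTALL) modifier the dot also matches newlines.
$dotall = preg_match('~<th>Some text here</th><td>(.*?)</td>~s', $content, $m2);

var_dump($plain);  // int(0)
var_dump($m2[1]);  // the two-line td content
```

This only explains why the original pattern came back empty; the XPath approach above remains the more robust option for real pages.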

You could use DOMDocument for this.
$domd=new DOMDocument();
@$domd->loadHTML($content); // @ suppresses warnings about malformed markup
$extractedText=NULL;
foreach($domd->getElementsByTagName("th") as $ele){
if($ele->textContent!=='Some text here'){continue;}
$extractedText=$ele->nextSibling->textContent;
break;
}
if($extractedText===NULL){
//extraction failed
} else {
//extracted text is in $extractedText
}
(Regex is generally a bad tool for parsing HTML, as someone in the comments has already pointed out.)

Related

Regex to match URLs in a text which are not part of an html tag [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Replace URLs in text with HTML links
(17 answers)
Closed 2 years ago.
I am using the following regex to replace plain URLs with html links in a text:
preg_replace('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', '$1 ', $text_msg);
Now I want to modify the regex so that it replaces the URL only if there is no double quote directly before it, meaning it is not part of a tag (i.e. the URL is at the start of the string, at the start of a line, or after a space).
Examples:
This is the link <a href="http://test.com"> ... (URL should not be replaced)
http://test.com (at the beginning of a line or of the whole multi-line string; URL should be replaced)
This is the site: http://test.com (URL should be replaced)
Thanks.

Your question actually breaks down into two smaller problems. You've already solved one of them, which is parsing the URL with a regular expression. The second part is extracting text from HTML, which isn't easily solved by a regular expression at all. The confusion comes from trying to do both at the same time with a regular expression (parsing HTML and parsing the URL). See the parsing HTML with regex SO Answer for more details on why this is a bad idea.
So instead, let's just use an HTML parser (like DOMDocument) to extract text nodes from the HTML and parse URLs inside those text nodes.
Here's an example
<?php
$html = <<<'HTML'
<p>This is a URL http://abcd/ims in text</p>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Let's walk the entire DOM tree looking for text nodes
function walk(DOMNode $node, $skipParent = false) {
if (!$skipParent) {
yield $node;
}
if ($node->hasChildNodes()) {
foreach ($node->childNodes as $n) {
yield from walk($n);
}
}
}
foreach (walk($dom->firstChild) as $node) {
if ($node instanceof DOMText) {
// let's find any links and change them to HTML
if (preg_match('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', $node->nodeValue, $match)) {
$node->nodeValue = preg_replace('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', "\xff ",
$node->nodeValue);
$nodeSplit = explode("\xff", $node->nodeValue, 2);
$node->nodeValue = $nodeSplit[1];
$newNode = $dom->createTextNode($nodeSplit[0]);
$href = $dom->createElement('a', $match[1]);
$href->setAttribute('href', $match[1]);
$node->parentNode->insertBefore($newNode, $node);
$node->parentNode->insertBefore($href, $node);
}
}
}
echo $dom->saveHTML();
Which gives you the desired HTML as output:
<p>This is a URL <a href="http://abcd/ims">http://abcd/ims</a> in text</p>
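The example above appears to handle one URL per text node. If a text node may contain several URLs, one possible variation is to select the text nodes with XPath and split each one with preg_split. This is only a sketch: the sample HTML and the simplified URL pattern are assumptions, and a URL followed directly by punctuation would still need extra handling.

```php
<?php
// Hypothetical input: one URL already inside a link, two plain ones.
$html = '<p>See <a href="http://a.example/x">http://a.example/x</a>, '
      . 'http://b.example/page and http://c.example/page here</p>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Snapshot the text nodes first, because the loop modifies the tree.
// Text that already sits inside an <a> element is skipped.
$textNodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));

foreach ($textNodes as $node) {
    // The capture group keeps each URL as its own piece of the split result.
    $parts = preg_split('~(https?://\S{4,})~i', $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    if (count($parts) < 2) {
        continue; // no URL in this text node
    }
    $fragment = $dom->createDocumentFragment();
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // Odd positions hold the captured URLs.
            $a = $dom->createElement('a');
            $a->setAttribute('href', $part);
            $a->appendChild($dom->createTextNode($part));
            $fragment->appendChild($a);
        } elseif ($part !== '') {
            $fragment->appendChild($dom->createTextNode($part));
        }
    }
    $node->parentNode->replaceChild($fragment, $node);
}

$out = $dom->saveHTML();
echo $out;
```

Building the link text with createTextNode rather than passing it to createElement avoids trouble with URLs that contain an ampersand.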

Select text following last $ in an expression? [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 5 years ago.
I am using a regular expression to extract the price on the right from the following HTML:
<p class="pricing ats-product-price"><em class="old_price">$99.99</em>$94.99</p>
Using preg match in PHP:
preg_match_all('!<p class="pricing ats-product-price"><em class="old_price">.*?<\/em>(.*?)<\/p>!', $output, $prices);
Except, I noticed that sometimes the HTML doesn't include an old price. So sometimes the HTML looks like this:
<p class="pricing ats-product-price">$129.99</p>
It seems like my goal should be to extract the last price from the expression, or in other words the text that directly follows the last dollar sign and comes before the </p>. This sort of expression is way out of my league though, so I'm hoping for some help here. Thanks.

Use a regular expression in combination with a parser:
<?php
$data = <<<DATA
<p class="pricing ats-product-price">
<em class="old_price">$99.99</em>
$94.99
</p>
<p class="pricing ats-product-price">$129.99</p>
DATA;
# set up the dom
$dom = new DOMDocument();
# load without LIBXML_HTML_NOIMPLIED: with that flag and several top-level
# elements, libxml nests the second p inside the first one
$dom->loadHTML($data);
# set up the xpath
$xpath = new DOMXPath($dom);
$regex = '~\$\d+[\d.]*\b\s*\Z~';
foreach ($xpath->query("//p") as $line) {
if (preg_match($regex, $line->nodeValue, $match)) {
# trim() drops the trailing whitespace captured by \s*
echo trim($match[0]) . "\n";
}
}
This yields
$94.99
$129.99
The snippet sets up the DOM, queries it for p tags and searches for the last price within.
See a demo for the expression on regex101.com.
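If the current price is always the last text node directly inside the p element, as in both samples from the question, XPath alone may be enough. A minimal sketch under that assumption (the wrapping div is added here only to have a single root element):

```php
<?php
// The two shapes from the question, wrapped in a hypothetical <div>.
$html = '<div>'
      . '<p class="pricing ats-product-price"><em class="old_price">$99.99</em>$94.99</p>'
      . '<p class="pricing ats-product-price">$129.99</p>'
      . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$prices = [];
foreach ($xpath->query('//p[contains(@class, "ats-product-price")]') as $p) {
    // text()[last()] is the last text node that is a direct child of this <p>,
    // so the text inside <em class="old_price"> is never considered.
    $prices[] = trim($xpath->evaluate('string(text()[last()])', $p));
}
print_r($prices);
```

This would not work if the current price were wrapped in its own element, in which case the regex on nodeValue is the safer fallback.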

PHP PregMatch Error with spaces on extract [duplicate]

This question already has answers here:
PHP parse/syntax errors; and how to solve them
(20 answers)
Closed 6 years ago.
I want to extract data from a web source, but I am getting an error in preg_match:
<?php
$html=file_get_contents("https://www.instagram.com/p/BJz4_yijmdJ/?taken-by=the.witty");
preg_match("("instapp:owner_user_id" content="(.*)")", $html, $match);
$title = $match[1];
echo $title;
?>
This is the error I get:
Parse error: syntax error, unexpected 'instapp' (T_STRING) in
/home/ubuntu/workspace/test.php on line 4
How can I do this? I also want to extract more data from the page with regex. Is it possible to extract everything at once with a single call, or do I need to call preg_match several times?

The main problem is that you did not form a valid string literal. Note that PHP supports both single- and double-quoted string literals, and you may use that to your advantage:
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $match);
While it is OK to use paired (...) symbols as regex delimiters, I'd suggest a more conventional delimiter such as /, ~ or #.
Also, (.*) is too generic a pattern and may match more than you need, since . also matches " and * is a greedy quantifier. A negated character class is better: ([^"]*) matches zero or more characters other than ".
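The difference is easy to see on a line that has more attributes after the one you want. A small sketch with a made-up meta tag (the data-x attribute and the value 12345 are assumptions for illustration):

```php
<?php
// Hypothetical line with another quoted attribute after content.
$html = '<meta property="instapp:owner_user_id" content="12345" data-x="y" />';

// Greedy (.*) runs on to the LAST double quote on the line.
preg_match('~"instapp:owner_user_id" content="(.*)"~', $html, $greedy);

// [^"]* stops at the first closing quote.
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $negated);

var_dump($greedy[1]);  // 12345" data-x="y
var_dump($negated[1]); // 12345
```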
However, to parse HTML in PHP, you may use a DOM parser, like DOMDocument.
Here is a sample that finds all meta tags with a content attribute, extracts the value of that attribute, and saves the values in an array:
$html = "<html><head><meta property=\"al:ios:url\" content=\"instagram://media?id=1329656989202933577\" /></head><body><span/></body></html>";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@content]');
$res = array();
foreach($metas as $m) {
array_push($res, $m->getAttribute('content'));
}
print_r($res);
And to only get the id in the content attribute value of a meta tag whose property attribute is equal to al:ios:url, use
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@property="al:ios:url"]');
$id = "";
if (preg_match('~[?&]id=(\d+)~', $metas->item(0)->getAttribute('content'), $match))
{
$id = $match[1];
}

Regular expression to convert mailto links

I've been using this regular expression (probably found on Stack Overflow a few years back) to convert mailto tags in PHP:
preg_match_all("/<a([ ]+)href=([\"']*)mailto:(([[:alnum:]._\-]+)@([[:alnum:]._\-]+\.[[:alnum:]._\-]+))([\"']*)([[:space:][:alnum:]=\"_]*)>([^<|@]*)(@?)([^<]*)<\/a>/i",$content,$matches);
I pass it $content = '<a href="mailto:name@domain.com">somename@domain.com</a>'
It returns these matched pieces:
0 <a href="mailto:name@domain.com">somename@domain.com</a>
1
2 "
3 name@domain.com
4 name
5 domain.com
6 "
7
8 somename
9 @
10 domain.com
Example usage: ucwords($matches[8][0])
My problem is that some links contain nested tags. The expression looks for "<" to delimit pieces 8, 9 and 10, so nested tags throw it off.
Example:
<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>
I need to ignore the nested tags and just extract the "somename" piece:
match part 8 = <span><b>
match part 9 = somename
match part 10 = @
match part 11 = domain.com
match part 12 = </b></span>
I've tried to get it to work by tweaking ([^<|@]*)(@?)([^<]*) but I can't figure out the right syntax to match or ignore the nested tags.

You could match everything between the <a> tags with .*?: replace ([^<|@]*)(@?)([^<]*) with (.*?) and the group will capture everything inside the <a> tag, including nested tags. You can remove the nested tags afterwards with strip_tags() or another regex.
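As a rough sketch of that idea, using a deliberately simplified pattern (not the full expression from the question) and a made-up link:

```php
<?php
// Hypothetical input: a mailto link whose text is wrapped in nested tags.
$content = '<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>';

// Group 2 is the address, group 3 is everything inside the link (lazy match).
preg_match_all('~<a\s+href=(["\'])mailto:([^"\']+)\1[^>]*>(.*?)</a>~is', $content, $matches);

$address = $matches[2][0];
// strip_tags() removes the nested <span> and <b>, leaving only the text.
$text = strip_tags($matches[3][0]);

var_dump($address, $text);
```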
However, regular expressions do not cope well with nested HTML tags. You are better off using something like DOMDocument, which is made exactly for parsing HTML. Something like:
<?php
$DOM = new DOMDocument();
$DOM->loadXML('<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>');
$list = $DOM->getElementsByTagName('a');
foreach($list as $link){
$href = $link->getAttribute('href');
$text = $link->nodeValue;
//only match if href starts with mailto:
if(stripos($href, 'mailto:') === 0){
var_dump($href);
var_dump($text);
}
}
http://codepad.viper-7.com/SqDKgr

To only get access to the part within the link, try
[^>]*>([^>]+)@.*
What you need should be in the first group of the result.

You can try this pattern:
$pattern = '~\bhref\s*+=\s*+(["\'])mailto:\K(?<mail>(?<name>[^@]++)@(?<domain>.*?))\1[^>]*+>(?:\s*+</?(?!a\b)[^>]*+>\s*+)*+(?<content>[^<]++)~i';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
echo '<pre>' . print_r($matches, true) . '</pre>';
You can then access your data like this:
echo $matches[0]['name'];

Try this regex
/^(<.*>)(.*)(@)/
/^/ - Start of string
/(<.*>)/ - First match group, starts with < then anything in between until it hits >
/(.*)(@)/ - Match anything up to the @ sign

How to get the value using Regex? [duplicate]

This question already has answers here:
Parse All Links That Contain A Specific Word In "href" Tag [duplicate]
(4 answers)
Closed 9 years ago.
Hello, I have a problem with the regex I use to get a value out of an HTML tag in PHP. The following strings are possible:
<span class="down last_position">xyz</span>
<span class="up last_position">xyz</span>
<span class="last_position new">xyz</span>
And I have the following preg_match command:
preg_match('#<span class="last_position.*?">(.+)</span>#', $string, $matches);
This pretty much only covers case #3, so I was wondering what I would need to add in front of last_position to cover all possible cases.
Thanks a lot..
Edit: For all who are wondering what value is to be matched: "xyz"

Avoid using regex to parse HTML as it can be error-prone. Your specific use case is better solved with a DOM parser:
$html = <<< EOF
<span class="down last_position">xyz</span>
<span class="up last_position">xyz</span>
<span class="last_position new">xyz</span>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query("//span[contains(@class, 'last_position')]/text()");
for($i=0; $i < $nodeList->length; $i++) {
$node = $nodeList->item($i);
var_dump($node->nodeValue);
}
OUTPUT:
string(3) "xyz"
string(3) "xyz"
string(3) "xyz"
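One caveat with contains(@class, ...) is that it is a substring test, so it would also match a class such as last_position_old. If that matters, a common XPath 1.0 idiom is to pad the attribute with spaces and test for the whole token. A sketch with a made-up second span to show the difference:

```php
<?php
// Hypothetical input: the second span only contains the substring.
$html = '<div>'
      . '<span class="down last_position">xyz</span>'
      . '<span class="last_position_old">abc</span>'
      . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Substring test: matches both spans.
$loose = $xpath->query("//span[contains(@class, 'last_position')]");
// Token test: matches only the span whose class list has the exact token.
$strict = $xpath->query("//span[contains(concat(' ', normalize-space(@class), ' '), ' last_position ')]");

var_dump($loose->length);  // int(2)
var_dump($strict->length); // int(1)
```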

Try the following (and yes you can use regex to match data from HTML):
$string = '<span class="down last_position">xyz</span>
<span class="up last_position">xyz</span>
<span class="last_position new">xyz</span>';
preg_match_all('#<span\s.*?class=".*?last_position.*?".*?>(.+?)</span>#i', $string, $m);
print_r($m);

Try to use this
preg_match('#<span class="?(.*)last_position.*?">(.+)</span>#', $string, $matches);

You could try this:
preg_match_all('#<span class="[^"]*last_position[^"]*">(.+)</span>#', $string, $matches, PREG_PATTERN_ORDER);
You'll then find the values in $matches[1][0], $matches[1][1], $matches[1][2] ....
The part I added in the class attribute's value, [^"]*, matches any number of characters other than a double quote. Thus it matches anything inside the attribute's value.

Sure, parsing XML is not possible using RegEx, because XML is not regular. But in many real-world cases, XML documents used as input are limited and predictable enough to simply be treated as text.
Something like this should work for you:
preg_match('#<span class="[^>"]*?last_position[^>"]*">(.+)</span>#', $string, $matches);
