PHP PregMatch Error with spaces on extract [duplicate] - php

This question already has answers here:
PHP parse/syntax errors; and how to solve them
(20 answers)
Closed 6 years ago.
I want to extract data from a web page, but I am getting an error in preg_match:
<?php
$html=file_get_contents("https://www.instagram.com/p/BJz4_yijmdJ/?taken-by=the.witty");
preg_match("("instapp:owner_user_id" content="(.*)")", $html, $match);
$title = $match[1];
echo $title;
?>
This is the error I get:
Parse error: syntax error, unexpected 'instapp' (T_STRING) in
/home/ubuntu/workspace/test.php on line 4
Please help me: how can I do this? I also want to extract more data from the page with regex. Is it possible to extract it all at once with a single call, or do I need to use preg_match many times?

The main problem is that you did not form a valid string literal. Note that PHP supports both single- and double-quoted string literals, and you may use that to your advantage:
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $match);
While it is OK to use paired (...) symbols as regex delimiters, I'd suggest a more conventional delimiter such as /, ~ or #.
Also, (.*) is too generic a pattern and may match more than you need: . also matches ", and * is a greedy quantifier. A negated character class is better: ([^"]*) matches zero or more characters other than ".
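To make the difference between the greedy and the negated-class pattern concrete, here is a minimal sketch. The sample markup is hypothetical (the second meta tag and the value 12345 are invented for illustration), not taken from the real page.

```php
<?php
// Hypothetical sample markup; the second meta tag is invented for illustration.
$html = '<meta property="instapp:owner_user_id" content="12345" />'
      . '<meta property="og:title" content="Example" />';

// Greedy (.*) runs on to the LAST double quote in the string.
preg_match('~"instapp:owner_user_id" content="(.*)"~', $html, $greedy);

// ([^"]*) cannot cross a double quote, so it stops at the first closing quote.
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $negated);

echo $greedy[1], "\n";   // spills over into the second meta tag
echo $negated[1], "\n";  // only the attribute value
```

Note the single-quoted PHP string around the pattern: it lets the double quotes inside the regex stand unescaped.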
However, to parse HTML in PHP, you may use a DOM parser such as DOMDocument.
Here is a sample that gets all meta tags that have a content attribute, extracts the value of that attribute, and saves it in an array:
$html = "<html><head><meta property=\"al:ios:url\" content=\"instagram://media?id=1329656989202933577\" /></head><body><span/></body></html>";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@content]');
$res = array();
foreach($metas as $m) {
    array_push($res, $m->getAttribute('content'));
}
print_r($res);
See the PHP demo
And to only get the id in the content attribute value of a meta tag whose property attribute is equal to al:ios:url, use
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@property="al:ios:url"]');
$id = "";
// Check that a matching meta tag exists before reading its attribute.
if ($metas->length > 0 && preg_match('~[?&]id=(\d+)~', $metas->item(0)->getAttribute('content'), $match))
{
    $id = $match[1];
}
See another PHP demo
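As for extracting several values at once: one possible approach is a single pass that collects every meta tag into a property => content map. This is only a sketch; the sample markup and the instapp:owner_user_id value are assumptions standing in for the downloaded page.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = '<html><head>'
      . '<meta property="al:ios:url" content="instagram://media?id=1329656989202933577" />'
      . '<meta property="instapp:owner_user_id" content="12345" />'
      . '</head><body></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);  // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$meta = array();
// Select only meta tags that carry both attributes.
foreach ($xpath->query('//meta[@property and @content]') as $node) {
    $meta[$node->getAttribute('property')] = $node->getAttribute('content');
}
print_r($meta);
```

Each value can then be looked up by name, for example $meta['instapp:owner_user_id'], without running a separate preg_match per field.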

Related

Select text following last $ in an expression? [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 5 years ago.
I am using a regular expression to extract the price on the right from the following HTML:
<p class="pricing ats-product-price"><em class="old_price">$99.99</em>$94.99</p>
Using preg match in PHP:
preg_match_all('!<p class="pricing ats-product-price"><em class="old_price">.*?<\/em>(.*?)<\/p>!', $output, $prices);
Except, I noticed that sometimes the HTML doesn't include an old price. So sometimes the HTML looks like this:
<p class="pricing ats-product-price">$129.99</p>
It seems like my goal should be to extract the last price from the expression, or in other words the text that directly follows after the last question mark and before the </p>. This sort of expression is way out of my league though - hoping for some help here. Thanks.
Use a regular expression in combination with a parser:
<?php
$data = <<<DATA
<p class="pricing ats-product-price">
<em class="old_price">$99.99</em>
$94.99
</p>
<p class="pricing ats-product-price">$129.99</p>
DATA;
# set up the dom
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
# set up the xpath
$xpath = new DOMXPath($dom);
$regex = '~\$\d+[\d.]*\b\s*\Z~';
foreach ($xpath->query("//p") as $line) {
if (preg_match($regex, $line->nodeValue, $match)) {
echo $match[0] . "\n";
}
}
This yields
$129.99
$129.99
The snippet sets up the DOM, queries it for p tags and searches for the last price within.
See a demo for the expression on regex101.com.

Parse HTML in PHP and extract value [duplicate]

This question already has answers here:
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
(10 answers)
What to do Regular expression pattern doesn't match anywhere in string?
(8 answers)
Closed 6 years ago.
I'm trying to extract some information from a website.
There is a section looking like that:
<th>Some text here</th><td>text to extract</td>
I would like to find (with regexp or other solution) the part starting with some text here and extract the text to extract from that.
I was trying to use following regexp solution:
$reg = '/<th>Some text here<\/th><td>(.*)<\/td>/';
preg_match_all($reg, $content, $result, PREG_PATTERN_ORDER);
print_r($result);
but it gives me just empty array:
Array ( [0] => Array ( ) [1] => Array ( ) )
How should I construct my regular expression to extract wanted value? Or what other solution can I use to extract it?
Using XPath:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($dom);
$content = $xp->evaluate('string(//th[.="Some text here"]/following-sibling::*[1][name()="td"])');
echo $content;
XPath query details:
string( # return a string instead of a node list
// # anywhere in the DOM tree
th # a th node
[.="Some text here"] # predicate: its content is "Some text here"
/following-sibling::*[1] # first following sibling
[name()="td"] # predicate: must be a td node
)
The reason your pattern doesn't work is probably that the td content contains newline characters, which the dot (.) does not match by default.
You could also use DOMDocument for this:
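A minimal sketch of that newline behaviour, assuming a hypothetical input in which the td content spans two lines. The s modifier (PCRE_DOTALL) is the standard way to let the dot match newlines.

```php
<?php
// Hypothetical input: the td content spans two lines.
$content = "<th>Some text here</th><td>text to\nextract</td>";

// Without the s modifier, "." cannot cross the newline, so the pattern fails.
$plain = preg_match('/<th>Some text here<\/th><td>(.*)<\/td>/', $content, $m1);

// With s, the dot also matches newlines; the lazy .*? stops at the first </td>.
$dotall = preg_match('/<th>Some text here<\/th><td>(.*?)<\/td>/s', $content, $m2);

var_dump($plain);   // number of matches without s
var_dump($dotall);  // number of matches with s
```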
$domd = new DOMDocument();
@$domd->loadHTML($content); // @ suppresses warnings about malformed HTML
$extractedText = NULL;
foreach ($domd->getElementsByTagName("th") as $ele) {
    if ($ele->textContent !== 'Some text here') { continue; }
    $extractedText = $ele->nextSibling->textContent;
    break;
}
if ($extractedText === NULL) {
    //extraction failed
} else {
    //extracted text is in $extractedText
}
(Regex is generally a bad tool for parsing HTML, as someone in the comments has already pointed out.)

php regex for parsing stock symbols in html code [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
I have the following PHP regex code. I want to extract the stock symbol from some HTML output.
The stock symbol I want to extract is in /q?s=XXXX, where XXXX (the stock symbol) could be 1 to 5 characters long.
if(preg_match_all('~(?<=q\?s=)[-A-Z.]{1,5}~', $html, $out))
{
$out[0] = array_unique($out[0]);
} else {
echo "FAIL";
}
HTML code below (case 1 and case that i applied this to)
case #1 (does *not* work)
Bellicum Pharmaceuticals, Inc.
case #2 (does work correctly)
NYLD
I am looking for suggestions on how I can update my PHP regex code to make it work for both case 1 and case 2. Thanks.
Instead of using regex, make effective use of DOM and XPath to do this for you.
$doc = new DOMDocument;
@$doc->loadHTML($html); // load the HTML data; @ suppresses parser warnings
$xpath = new DOMXPath($doc);
$links = $xpath->query('//a[substring(@href, 1, 5) = "/q?s="]');
$results = array();
foreach ($links as $link) {
    $results[] = str_replace('/q?s=', '', $link->getAttribute('href'));
}
print_r($results);
The answer seems nice, but it seems like a lot of work and code to maintain, no?
if (preg_match_all('/q\?s=(\S{1,5})\"/', $html, $match)) {
$symbols = array_unique($match[1]);
}
or even shorter... '/q\?s=(\S+)\"/'

what reg expression would select all the text between these tags? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
How to parse and process HTML with PHP?
How can I convert ereg expressions to preg in PHP?
here is an example
echo "<div id='spaced' class='romaji'><span class='spaced orig word'>neko</span><span class='space'>";
Please ignore the echo statements; they were the only way I could get the HTML to show.
I need a regular expression that can select whatever is between the
echo "<span class='spaced orig word'>";
tag and its ending tag
echo "</span>";
i tried
$pattern = "span class='spaced orig word'>(.+?)</s";
preg_match_all ($pattern, $jp_page, $result_ro);
if ($result_ro[1])
$results[] = implode(' ', $result_ro[1]);
else
return null; // Failed to retrieve Hiragana, so abort
and some other things, but I can't get it right. I get nothing most of the time because I don't really know what I'm doing with regular expressions.
I am currently getting a warning with this code:
Warning: preg_match_all(): Delimiter must not be alphanumeric or backslash
THE PONY HE COMES!
Instead, try using a DOM parser:
$dom = new DOMDocument();
$dom->loadHTML($jp_page);
$xpath = new DOMXPath($dom);
$spans = $xpath->query("//span[@class='spaced orig word']");
$results = "";
foreach($spans as $span) {
    // Append each word; plain assignment would keep only the last one.
    $results .= " ".$span->textContent;
}
$results = trim($results);
return $results;
Your pattern has no delimiters.
Try this regex:
<?php
$pattern = '#<span.*?>(.*?)</span>#';
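For the delimiter warning specifically, here is a minimal sketch that keeps the question's own pattern and only wraps it in delimiters. The choice of ~ as the delimiter and the $words variable are assumptions made for illustration; the sample markup is the line from the question.

```php
<?php
// Sample markup taken from the question.
$jp_page = "<div id='spaced' class='romaji'><span class='spaced orig word'>neko</span><span class='space'>";

// The original pattern started with the letter "s", which PHP tried to read
// as the delimiter. Wrapping the pattern in ~...~ makes it valid.
$pattern = "~<span class='spaced orig word'>(.+?)</span>~";

$words = array();
if (preg_match_all($pattern, $jp_page, $result_ro)) {
    $words = $result_ro[1]; // captured text of each matching span
}
echo implode(' ', $words), "\n";
```

Choosing ~ avoids having to escape the / in the closing span tag, which a / delimiter would require.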

PHP parser ASP page [duplicate]

This question already has an answer here:
Closed 11 years ago.
Possible Duplicate:
PHP : Parser asp page
I have this tag in an ASP page:
<a class='Lp' href="javascript:prodotto('Prodotto.asp?C=3')">AMARETTI VICENZI GR. 200</a>
How can I parse this ASP page to get the text AMARETTI VICENZI GR. 200?
This is the code that I use, but it doesn't work:
<?php
$page = file_get_contents('http://www.prontospesa.it/Home/prodotti.asp?c=12');
preg_match_all('#(.*?)#is', $page, $matches);
$count = count($matches[1]);
for($i = 0; $i < $count; $i++){
echo $matches[2][$i];
}
?>
Your regular expression (in preg_match_all) is wrong. It should be #<a class='Lp' href="(.*?)">(.*?)</a>#is, since the class attribute comes first, not last, and is wrapped in single quotes, not double quotes.
You should strongly consider using DOMDocument and DOMXPath to parse your document instead of regular expressions.
DOMDocument/DOMXPath Example:
<?php
// ...
$doc = new DOMDocument;
$doc->loadHTML($html); // $html is the content of the website you're trying to parse.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//a[@class="Lp"]');
foreach ( $nodes as $node )
echo $node->textContent . PHP_EOL;
You have to modify the regular expression a little based on the HTML code of the page you are getting the content from:
'#<a class=\'Lp\' href="(.*?)">(.*?)</a>#is'
Note that the class is first and it is surrounded by single quotes not double. I tested and it works for me.
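A short sketch of that pattern applied to the single tag from the question. The $page value here is just that one tag; on the real site it would be the result of file_get_contents().

```php
<?php
// Sample tag from the question, standing in for the downloaded page.
$page = "<a class='Lp' href=\"javascript:prodotto('Prodotto.asp?C=3')\">AMARETTI VICENZI GR. 200</a>";

preg_match_all('#<a class=\'Lp\' href="(.*?)">(.*?)</a>#is', $page, $matches);

// $matches[1] holds the href values, $matches[2] the link texts.
foreach ($matches[2] as $text) {
    echo $text, "\n";
}
```

Iterating over $matches[2] directly avoids counting a different group than the one being printed, as the loop in the question did.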
