PHP parser ASP page [duplicate] - php

I have this tag in an ASP page:
<a class='Lp' href="javascript:prodotto('Prodotto.asp?C=3')">AMARETTI VICENZI GR. 200</a>
How can I parse this ASP page to get the text AMARETTI VICENZI GR. 200?
This is the code that I use, but it doesn't work:
<?php
$page = file_get_contents('http://www.prontospesa.it/Home/prodotti.asp?c=12');
preg_match_all('#<a href="(.*?)" class="Lp">(.*?)</a>#is', $page, $matches);
$count = count($matches[1]);
for($i = 0; $i < $count; $i++){
echo $matches[2][$i];
}
?>

Your regular expression (in preg_match_all) is wrong. It should be #<a class='Lp' href="(.*?)">(.*?)</a>#is, since the class attribute comes first, not last, and is wrapped in single quotes, not double quotes.
You should strongly consider using DOMDocument and DOMXPath to parse your document instead of regular expressions.
DOMDocument/DOMXPath Example:
<?php
// ...
$doc = new DOMDocument;
$doc->loadHTML($html); // $html is the content of the website you're trying to parse.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//a[@class="Lp"]');
foreach ( $nodes as $node )
echo $node->textContent . PHP_EOL;

You have to modify the regular expression a little based on the HTML code of the page you are getting the content from:
'#<a class=\'Lp\' href="(.*?)">(.*?)</a>#is'
Note that the class is first and it is surrounded by single quotes not double. I tested and it works for me.
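As a quick check of the corrected pattern, here is a minimal sketch that runs it against the anchor tag quoted in the question rather than the live page. The names $sample and $names are illustrative, and the live markup is assumed to match the quoted tag exactly.

```php
<?php
// Sample markup copied from the question; the live page is assumed to look the same.
$sample = '<a class=\'Lp\' href="javascript:prodotto(\'Prodotto.asp?C=3\')">AMARETTI VICENZI GR. 200</a>';

// Group 1 captures the href value, group 2 the link text.
preg_match_all('#<a class=\'Lp\' href="(.*?)">(.*?)</a>#is', $sample, $matches);

$names = $matches[2];
foreach ($names as $name) {
    echo $name, PHP_EOL;
}
```

If the site changes the attribute order or the quoting again, the pattern stops matching, which is the main argument for the DOM approach.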

Related

Difficulties with the function preg_match_all

I would like to extract the number inside the span tag. The number may change!
<span class="topic-count">
::before
"
24
"
::after
</span>
I've tried the following code:
preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
But it doesn't work.
Entire code:
$result=array();
$page = 201;
while ($page>=1) {
$source = file_get_contents ("http://www.jeuxvideo.com/forums/0-27047-0-1-0-".$page."-0-counter-strike-global-offensive.htm");
preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
$result = array_merge($result, $nombre[$i][1]);
print("Page : ".$page ."\n");
$page-=25;
}
print_r ($nombre);
You can do this with
preg_match_all(
'#<span class="topic-count">[^\d]*(\d+)[^\d]*?</span>#s',
$html,
$matches
);
which would capture any digits before the end of the span.
However, note that this regex will only work for exactly this piece of html. If there is a slight variation in the markup, for instance, another class or another attribute, the pattern will not work anymore. Writing reliable regexes for HTML is hard.
Hence the recommendation to use a DOM parser instead, e.g.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.jeuxvideo.com/forums/0-27047-0-1-0-1-0-counter-strike-global-offensive.htm');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate('//span[contains(@class, "topic-count")]') as $node) {
if (preg_match_all('#\d+#s', $node->nodeValue, $topics)) {
echo $topics[0][0], PHP_EOL;
}
}
DOM will parse the entire page into a tree of nodes, which you can then query conveniently via XPath. Note the expression
//span[contains(@class, "topic-count")]
which will give you all the span elements with a class attribute containing the string topic-count. The loop then echoes the first run of digits found in each of these nodes.
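One caveat: a contains() test on the class attribute is a substring match, so it would also select a class such as topic-count-label. A stricter variant, sketched below on hypothetical markup, pads the class list with spaces so that only whole class names match. The markup and the $counts variable are assumptions made for illustration.

```php
<?php
// Hypothetical markup: the second span only contains "topic-count" as a prefix.
$html = '<div><span class="topic-count">24</span>'
      . '<span class="topic-count-label">replies</span>'
      . '<span class="big topic-count">7</span></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Pad the class list with spaces so that only whole class tokens match.
$query = '//span[contains(concat(" ", normalize-space(@class), " "), " topic-count ")]';

$counts = [];
foreach ($xpath->query($query) as $node) {
    $counts[] = (int) trim($node->nodeValue);
}
print_r($counts);
```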

PHP PregMatch Error with spaces on extract [duplicate]

I want to extract data from a web page, but I am getting an error in preg_match:
<?php
$html=file_get_contents("https://www.instagram.com/p/BJz4_yijmdJ/?taken-by=the.witty");
preg_match("("instapp:owner_user_id" content="(.*)")", $html, $match);
$title = $match[1];
echo $title;
?>
This is the error I get:
Parse error: syntax error, unexpected 'instapp' (T_STRING) in
/home/ubuntu/workspace/test.php on line 4
How can I fix this? I also want to extract more data from the page with regex. Is it possible to extract everything at once with a single call, or do I have to call preg_match several times?
The main problem is that you did not form a valid string literal. Note that PHP supports both single- and double-quoted string literals, and you may use that to your advantage:
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $match);
While it is OK to use paired (...) symbols as regex delimiters, I'd suggest using a more conventional delimiter such as /, ~ or #.
Also, (.*) is too generic a pattern and may match more than you need, since . also matches " and * is a greedy quantifier. A negated character class is better: ([^"]*) matches zero or more characters other than ".
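To make the difference concrete, here is a small sketch comparing the greedy group with the negated character class. The meta tag, including its extra data-x attribute and the value 12345, is invented for the demonstration.

```php
<?php
// Hypothetical meta tag followed by another quoted attribute.
$html = '<meta property="instapp:owner_user_id" content="12345" data-x="y" />';

// Greedy: (.*) runs on to the LAST double quote in the string.
preg_match('~"instapp:owner_user_id" content="(.*)"~', $html, $greedy);

// Negated class: ([^"]*) stops at the first closing quote.
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $strict);

echo $greedy[1], PHP_EOL; // 12345" data-x="y
echo $strict[1], PHP_EOL; // 12345
```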
HOWEVER, to parse HTML in PHP, you may use a DOM parser, like DOMDocument.
Here is a sample that finds all meta tags with a content attribute and saves the value of that attribute in an array:
$html = "<html><head><meta property=\"al:ios:url\" content=\"instagram://media?id=1329656989202933577\" /></head><body><span/></body></html>";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@content]');
$res = array();
foreach($metas as $m) {
array_push($res, $m->getAttribute('content'));
}
print_r($res);
And to only get the id in the content attribute value of a meta tag whose property attribute is equal to al:ios:url, use
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@property="al:ios:url"]');
$id = "";
if (preg_match('~[?&]id=(\d+)~', $metas->item(0)->getAttribute('content'), $match))
{
$id = $match[1];
}
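On the question of extracting everything at once: one possible approach is to collect every property/content pair in a single pass and look values up by name afterwards. This is only a sketch; the markup below, including the owner id 12345, is made up, and the real page may use different attributes.

```php
<?php
// Hypothetical head section; real pages will carry more tags.
$html = '<html><head>'
      . '<meta property="al:ios:url" content="instagram://media?id=1329656989202933577" />'
      . '<meta property="instapp:owner_user_id" content="12345" />'
      . '<meta name="viewport" content="width=device-width" />'
      . '</head><body></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$props = [];
// Only meta tags that carry both attributes are collected.
foreach ($xpath->query('//meta[@property and @content]') as $meta) {
    $props[$meta->getAttribute('property')] = $meta->getAttribute('content');
}
print_r($props);
```

Each value is then available as, for example, $props['instapp:owner_user_id'], with no further regex calls needed.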

php: Extract text between specific tags from a webpage [duplicate]

I understand I should be using a html parser like php domdocument (http://docs.php.net/manual/en/domdocument.loadhtml.php) or tagsoup.
How would I use DOMDocument to extract the text between specific tags, for example h1, h2, h3, p and table? It seems I can only do this for one tag at a time with getElementsByTagName.
Is there a better html parser for such task? Or how would I loop through the php domdocument?
You are correct: use DOMDocument, since regex is not a good idea for parsing HTML.
getElementsByTagName gives you a DOMNodeList that you can iterate over to get the text of all the found elements. So, your code could look something like:
$document = new \DOMDocument();
$document->loadHTML($html);
$tags = array ('h1', 'h2', 'h3', 'h4', 'p');
$texts = array ();
foreach($tags as $tag)
{
$elementList = $document->getElementsByTagName($tag);
foreach($elementList as $element)
{
$texts[$element->tagName][] = $element->textContent;
}
}
return $texts;
Note that you should probably have some error handling in there, and you will also lose the context of the texts, but you can probably edit this code as you see fit.
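If the order of the texts matters, one alternative is a single XPath union query, which in practice returns the matching nodes in document order instead of grouping them by tag. The sketch below uses an invented document, and the list-of-pairs shape of $texts is just one possible choice.

```php
<?php
// Hypothetical document used only to show the ordering.
$html = '<html><body><h1>Title</h1><p>Intro</p><h2>Part</h2><p>Body</p></body></html>';

libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);
$texts = [];
// The union (|) selects all listed tags with one query.
foreach ($xpath->query('//h1 | //h2 | //h3 | //h4 | //p') as $element) {
    $texts[] = [$element->tagName, $element->textContent];
}
print_r($texts);
```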
You can do so with a regex.
preg_match_all('#<h1>([^<]*)</h1>#Usi', $html_string, $matches);
foreach ($matches[1] as $match) // $matches[1] holds the captured h1 texts
{
// do something with $match
}
I am not sure what your source is, so I added a call to fetch the content from the URL.
$file = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($file);
$body = $doc->getElementsByTagName('body')->item(0); // getElementsByTagName returns a list, so take the first <body>
$items = $body->getElementsByTagName('h1');
I am not sure of this part:
for ($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "\n";
}
Or:
foreach ($items as $item) {
echo $item->nodeValue . "\n";
}
Here is more info on nodeValue: http://docs.php.net/manual/en/function.domnode-node-value.php
Hope it helps!

extract url using PHP [duplicate]

Hi, I have this string in PHP:
<iframe frameborder="0" width="320" height="179" src="http://www.dailymotion.com/embed/video/xinpy5?width=320&wmode=transparent"></iframe><br />Le buzz Pippa Middleton agace la Reine ! <i>par direct8</i>
I would like to extract the URL from the anchor's href attribute using preg_match or other PHP functions.
Don't use regexes to parse HTML. Use the PHP DOM:
$DOM = new DOMDocument;
$DOM->loadHTML($str); // Your string
//get all anchors
$anchors = $DOM->getElementsByTagName('a');
//display all hrefs
for ($i = 0; $i < $anchors->length; $i++)
echo $anchors->item($i)->getAttribute('href') . "<br />";
You can check if the node has a href using hasAttribute() first if necessary.
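A minimal sketch of that hasAttribute() check follows. The input string is hypothetical (one link and one named anchor without an href), and $hrefs is an illustrative variable name.

```php
<?php
// Hypothetical string: one anchor with an href and one named anchor without.
$str = '<p><a href="http://www.example.com/video">watch</a> <a name="top">top</a></p>';

libxml_use_internal_errors(true);
$DOM = new DOMDocument;
$DOM->loadHTML($str);
libxml_clear_errors();

$hrefs = [];
foreach ($DOM->getElementsByTagName('a') as $anchor) {
    // Skip anchors that have no href at all.
    if ($anchor->hasAttribute('href')) {
        $hrefs[] = $anchor->getAttribute('href');
    }
}
print_r($hrefs);
```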
You can use
if (preg_match('#<a\s*[^>]*href="([^"]+)"#i', $string, $matches))
echo $matches[1]; // group 1 holds the href value; $matches[0] is the whole match
Try this regex:
(?<=href=\")[\w://\.\-]+

Matching everything between html <body> tags using PHP

I have a script that returns the following in a variable called $content
<body>
<p><span class=\"c-sc\">dgdfgdf</span></p>
</body>
However, I need to place everything between the body tags into an array called matches.
I do the following to match the content between the body tags:
preg_match('/<body>(.*)<\/body>/',$content,$matches);
but the $matches array is empty. How can I get it to return everything inside the body tag?
Don't try to process HTML with regular expressions! Use PHP's built-in parser instead:
$dom = new DOMDocument;
$dom->loadHTML($string);
$bodies = $dom->getElementsByTagName('body');
assert($bodies->length === 1);
$body = $bodies->item(0);
// Serialize each child node of <body> to collect its inner HTML.
$string = '';
foreach ($body->childNodes as $child) {
$string .= $dom->saveHTML($child);
}
You should not use regular expressions to parse HTML.
Your particular problem in this case is you need to add the DOTALL modifier so that the dot matches newlines.
preg_match('/<body>(.*)<\/body>/s', $content, $matches);
But seriously, use an HTML parser instead. There are so many ways that the above regular expression can break.
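The effect of the s modifier can be checked directly on the content from the question. This is only a sketch; the $without, $with, $m1 and $m2 names are illustrative.

```php
<?php
// Content from the question, spread over several lines.
$content = "<body>\n<p><span class=\"c-sc\">dgdfgdf</span></p>\n</body>";

// Without /s the dot stops at newlines, so nothing matches.
$without = preg_match('/<body>(.*)<\/body>/', $content, $m1);

// With /s (DOTALL) the dot also matches newlines.
$with = preg_match('/<body>(.*)<\/body>/s', $content, $m2);

var_dump($without, $with);
echo trim($m2[1]), PHP_EOL;
```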
If for some reason you don't have DOMDocument installed, try this
Step 1. Download simple_html_dom
Step 2. Read the documentation about how to use its selectors
require_once("simple_html_dom.php");
$doc = new simple_html_dom();
$doc->load($someHtmlString);
$body = $doc->find("body", 0)->innertext; // index 0 returns the first match instead of an array
