I am making a PHP scraper and have the following piece of code that grabs the title from the page by looking inside the span uiButtonText. Now I also want to scan for a hyperlink and match it with preg_match using (.*).
I want the (.*) parts to act as wildcards so that I can get the hyperlink from the page even if the href and onclick values differ for each one.
if (preg_match("/<span class=\"uiButtonText\">(.*)<\/span>/i", $cache, $matches)){print($matches[1] . "\n");}else {}
My Full Code:
<?php
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$url = "http://www.facebook.com/MauiNuiBotanicalGardens/info";
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
$cache = $html;
if (preg_match("/<span class=\"uiButtonText\">(.*)<\/span>/i", $cache, $matches)) {print($matches[1] . "\n");}else {}
?>
If you want to stick with your regex, try this:
$html = '<span class="uiButtonText"><a href="http://www.google.com" class="thelink" onclick="return false;">Google!</a></span>';
preg_match("/<span class=\"uiButtonText\"><a href=\"[^\"]*\" class=\"thelink\" onclick=\"[^\"]*\">(.*?)<\/a><\/span>/i", $html, $matches);
print_r($matches[1]);
Output: Google!
A better way would be to use PHP Simple HTML DOM Parser and do something like this:
$html = file_get_html("http://www.facebook.com/MauiNuiBotanicalGardens/info");
foreach($html->find("a.thelink") as $link){
echo $link->innertext . "<BR>";
}
The above is not tested, but it should work.
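For comparison, here is a minimal sketch of the same idea using PHP's built-in DOM extension instead of a third-party parser. It assumes the link sits inside the uiButtonText span, as the regex above does. The helper name links_in_button and the sample markup are made up for illustration and are not taken from the real page.

```php
<?php
// Hypothetical helper: returns href/text pairs for every <a> inside <span class="uiButtonText">.
function links_in_button(string $html): array
{
    $dom = new DOMDocument();
    // Scraped markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // The href and onclick values never appear in the query, so they are free to change.
    foreach ($xpath->query('//span[@class="uiButtonText"]//a') as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

// Made-up sample markup.
$sample = '<span class="uiButtonText"><a href="http://www.google.com" class="thelink" onclick="track(1)">Google!</a></span>';
$links = links_in_button($sample);
print_r($links);
```

With the scraper from the question, the call would be links_in_button($cache).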
I'm trying to scrape an HTML page using this PHP code:
<?php
ini_set('display_errors', 1);
$url = 'http://www.cittadellasalute.to.it/index.php?option=com_content&view=article&id=6786:situazione-pazienti-in-pronto-soccorso&catid=165:pronto-soccorso&Itemid=372';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$greenWaitingNumber = $xpath->query('/html/body/div/div/div[4]/div[3]/section/p');
foreach( $greenWaitingNumber as $node )
{
echo "Number first green line: " .$node->nodeValue;
echo '<br>';
echo '<br>';
}
?>
Everything seems to work (no errors, and in my browser console I can see '200' as the return code), but nothing is printed in my HTML page.
The problem is probably the XPath /html/body/div/div/div[4]/div[3]/section/p, which refers to the first green line in the source HTML page, but this is what Firefox Firebug gives me for that page section.
Suggestions / examples?
!!! UPDATE !!!!
As Santosh Sapkota suggested in his reply, the first problem is that the text inside that green box is loaded from an iframe. I found the URL of the HTML page inside the iframe and tried to use it in my code, which is now:
<?php
ini_set('display_errors', 1);
$url = 'http://listeps.cittadellasalute.to.it/?id=01090101';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$greenWaitingNumber = $xpath->query('/html/body/div/div/div[4]/div[3]/section/p');
foreach( $greenWaitingNumber as $node )
{
echo "Number first green line: " .$node->nodeValue;
echo '<br>';
echo '<br>';
}
?>
Unfortunately, still nothing is printed in my output HTML page.
Other suggestions / examples?
It must be a problem with your XPath. Also check whether the content is loaded from an iframe.
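One way to check the XPath is to test it against the HTML that cURL actually returns, since a path copied from a browser inspector describes the browser's DOM after scripts have run and can differ from the raw source. The sketch below uses a made-up document and a hypothetical helper, first_text, to show an absolute path silently matching nothing while a shorter relative path works.

```php
<?php
// Hypothetical helper: returns the text of the first node matching $query, or null if nothing matches.
function first_text(string $html, string $query): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($dom))->query($query);
    // query() returns false for an invalid expression and an empty list when nothing matches.
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

// Made-up markup standing in for the fetched page.
$data = '<html><body><div><section><p>12</p></section></div></body></html>';

// The inspector-style absolute path does not exist in this document.
var_dump(first_text($data, '/html/body/div/div/div[4]/div[3]/section/p'));

// A relative path is less sensitive to wrapper elements.
var_dump(first_text($data, '//section/p'));
```

A null result on the real page would point at the path (or at content that is loaded later) rather than at the cURL request.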
I'm trying to find a regular expression that is able to change all URLs of a curl'ed document from relative to absolute.
One way I found is the post here, but it works only for the first URL and not for all of them.
This is the code I'm using:
$url="http://www.example.com";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_DNS_USE_GLOBAL_CACHE, 0);
curl_setopt($ch, CURLOPT_DNS_CACHE_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$result=curl_exec($ch);
curl_close($ch);
$result = preg_replace('~(href|src)=(["\'])(?!#)(?!http://)([^\2]*)\2~i','$1="http://www.example.com$3"', $result);
echo $result;
What am I doing wrong?
EDIT
Just to explain better: I don't have an array of URLs. I have an entire document fetched with curl, so I need a preg_replace approach.
I'm not exactly sure why it replaces only once (maybe it has something to do with the backreference), but if you wrap it in a while loop, it should work.
$pattern = '~(href|src)=(["\'])(?!#|//|http)([^\2]*)\2~i';
while (preg_match($pattern, $result)) {
$result = preg_replace($pattern,'$1="http://www.example.com$3"', $result);
}
(I also changed the pattern slightly.)
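A likely explanation for the single replacement: inside a character class, \2 is not a backreference but an octal escape, so [^\2]* greedily consumes almost everything, quotes included, and the first match stretches from the first attribute to the last matching quote in the document. The sketch below avoids the loop by excluding quote characters explicitly. The helper name absolutize_urls is hypothetical, and it simply treats every relative path as relative to the site root, which is a simplification.

```php
<?php
// Hypothetical helper: prefixes relative href/src values with $base in a single pass.
function absolutize_urls(string $html, string $base): string
{
    // [^"\']* stops at the first quote, so every attribute is matched on its own.
    // Skipped: fragments (#...), protocol-relative URLs (//...) and anything with a scheme (http:, mailto:, ...).
    $pattern = '~\b(href|src)=(["\'])(?!#|//|[a-z][a-z0-9+.-]*:)([^"\']*)\2~i';

    return preg_replace_callback($pattern, function (array $m) use ($base): string {
        // Join base and path with exactly one slash.
        $url = rtrim($base, '/') . '/' . ltrim($m[3], '/');
        return $m[1] . '=' . $m[2] . $url . $m[2];
    }, $html);
}

// Made-up sample document.
$result = '<a href="/a.html">A</a><img src=\'img/b.png\'><a href="http://x.org/">X</a><a href="#top">T</a>';
echo absolutize_urls($result, 'http://www.example.com'), "\n";
```

Paths such as ../page.html or a document served from a subdirectory would need real URL resolution, which this sketch does not attempt.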
I've been playing with PHP Simple HTML DOM Parser, using the manual at http://simplehtmldom.sourceforge.net/manual.htm, and most of my tests succeeded except this one:
The page has nested tables and spans, and I would like to parse the outer text of the span with class mynum.
<?php
require_once 'simple_html_dom.php';
$url = 'http://relumastudio.com/test/target.html';
$ch = curl_init();
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21");
curl_setopt($ch, CURLOPT_URL, $url);
$result = curl_exec($ch);
$DEBUG = 1;
if($DEBUG){
$html = new simple_html_dom();
$html->load($url);
echo $html->find('span[class=mynum]',0)->outertext; // I should get 123456
}else{
echo $result;
}
curl_close($ch);
I thought I could get away with a single call to echo $html->find('span[class=mynum]',0)->outertext; to get the text 123456, but I can't.
Any ideas? Any help is greatly appreciated. Thank You.
Load the url properly first. Then use ->innertext in this case:
$url = 'http://relumastudio.com/test/target.html';
$html = file_get_html($url);
$num = $html->find('span.mynum', 0)->innertext;
echo $num;
You need innertext.
$html = new simple_html_dom();
$html->load_file($url);
echo $html->find('span[class=mynum]',0)->innertext;
outertext returns <span class="mynum">123456</span>
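The same inner/outer distinction exists in PHP's built-in DOM extension, which may help if the Simple HTML DOM library is not available. This is a sketch with a different API, not the library used above, and the nested-table markup is made up: textContent plays the role of innertext, and saveHTML($node) plays the role of outertext.

```php
<?php
// Made-up markup with the span nested inside a table.
$markup = '<table><tr><td><span class="mynum">123456</span></td></tr></table>';

$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($markup);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$span = $xpath->query('//span[@class="mynum"]')->item(0);

// Text inside the element, without the tags.
$inner = $span->textContent;
// The element serialized together with its own tags.
$outer = $dom->saveHTML($span);

echo $inner, "\n";
echo $outer, "\n";
```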
How can I get the link address after a URL has been redirected?
Take for example this URL: http://www.boligsiden.dk/viderestilling/992cff55882a40f79e64b0a25e847a69
How can I make a PHP script echo the final URL? (http://www.eltoftnielsen.dk/default.aspx?side=sagsvisning&AutoID=125125&DID=140 in this case)
Note: The following solution isn't ideal for high traffic situations.
$url = 'http://www.boligsiden.dk/viderestilling/992cff55882a40f79e64b0a25e847a69';
file_get_contents($url);
preg_match('/(Location:|URI:)(.*?)\n/', implode("\n", $http_response_header), $matches);
if (isset($matches[2]))
{
echo trim($matches[2]);
}
Here's what happens: file_get_contents() follows the redirect and downloads the target page, but also writes the response headers of the original request into $http_response_header.
The preg_match finds the first "Location: x" header, and the script prints its URL.
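If the URL redirects more than once, the first Location header is only the first hop. As far as I know, $http_response_header collects the headers of every response in the chain, so taking the last Location header gets closer to the final URL. The sketch below shows that on made-up header lines. The helper name last_location is hypothetical.

```php
<?php
// Hypothetical helper: returns the value of the last Location/URI header in a list of raw header lines.
function last_location(array $headers): ?string
{
    $location = null;
    foreach ($headers as $line) {
        if (preg_match('/^(?:Location|URI):\s*(.+)$/i', trim($line), $m)) {
            // Keep overwriting so the last hop wins.
            $location = trim($m[1]);
        }
    }
    return $location;
}

// Made-up header lines in the shape PHP stores in $http_response_header.
$headers = [
    'HTTP/1.1 301 Moved Permanently',
    'Location: http://example.com/step-two',
    'HTTP/1.1 302 Found',
    'Location: http://example.com/final?id=1',
    'HTTP/1.1 200 OK',
    'Content-Type: text/html',
];
echo last_location($headers), "\n";
```

A Location value may also be a relative path, in which case it still has to be resolved against the URL that produced it. With cURL and CURLOPT_FOLLOWLOCATION enabled, curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) reports the final URL directly.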
Use this:
<?php
$name="19875379";
$url = "http://www.ikea.co.il/default.asp?strSearch=".$name;
$ch = curl_init();
$timeout = 0;
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_HEADER, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$header = curl_exec($ch);
$redir = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
//print_r($header);
$x = preg_match("/<script>location.href=(.|\n)*?<\/script>/", $header, $matches);
$script = $matches[0];
$redirect = str_replace("<script>location.href='", "", $script);
$redirect = "http://www.ikea.co.il" . str_replace("';</script>", "", $redirect);
echo $redirect;
?>
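The extraction step in the code above can also be written with a capture group, which avoids the two str_replace calls and the undefined-index notice when the page contains no such script. This is a sketch on a made-up response body. The helper name js_redirect_target is hypothetical.

```php
<?php
// Hypothetical helper: returns the target of <script>location.href='...';</script>, or null if absent.
function js_redirect_target(string $html): ?string
{
    // Group 1 is the quote character, group 2 the URL between the quotes.
    if (preg_match('~<script[^>]*>\s*location\.href\s*=\s*([\'"])(.*?)\1~is', $html, $m)) {
        return $m[2];
    }
    return null;
}

// Made-up body in the shape the code above expects.
$body = "<html><head><script>location.href='/catalog/item.asp?id=1';</script></head></html>";
$target = js_redirect_target($body);
if ($target !== null) {
    echo 'http://www.ikea.co.il' . $target, "\n";
}
```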
I'm trying to practice cURL, but it isn't going well.
Please tell me what's wrong.
Here is my code:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://xxxxxxx.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, "Google Bot");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$downloaded_page = curl_exec($ch);
curl_close($ch);
preg_match_all('/<div\s* class =\"abc\">(.*)<\/div>/', $downloaded_page, $title);
echo "<pre>";
print($title[1]);
echo "</pre>";
The warning is: Notice: Array to string conversion
The HTML I want to parse looks like this:
<div class="abc">
<ul> blablabla </ul>
<ul> blablabla </ul>
<ul> blablabla </ul>
</div>
preg_match_all fills its third argument with an array of arrays.
If your code is:
preg_match_all('/<div\s+class="abc">(.*)<\/div>/', $downloaded_page, $title);
you actually want to do the following:
echo "<pre>";
foreach ($title[1] as $realtitle) {
echo $realtitle . "\n";
}
echo "</pre>";
This prints the content of every div that has class "abc". I also suggest you harden your regex to be more robust:
preg_match_all('/<div[^>]+class="abc"[^>]*>(.*)<\/div>/', $downloaded_page, $title);
This will match <div class="abc"> as well as a div that carries other attributes before or after the class.
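One caveat, shown as a sketch on made-up markup: by default the dot does not match newlines, so a div whose content spans several lines (as in the question) needs the s modifier. A lazy .*? also keeps one match from running on to the last closing tag in the page.

```php
<?php
// Made-up page with one multi-line div and one single-line div.
$downloaded_page = <<<HTML
<div class="abc">
<ul> one </ul>
<ul> two </ul>
</div>
<div id="x" class="abc"><ul> three </ul></div>
HTML;

// s: let . match newlines. .*? stops at the first </div> instead of the last one.
preg_match_all('/<div[^>]+class="abc"[^>]*>(.*?)<\/div>/s', $downloaded_page, $title);

// $title[0] holds the full matches, $title[1] the content of the capture group.
foreach ($title[1] as $realtitle) {
    echo trim($realtitle), "\n---\n";
}
```

Nested divs inside the matched div would still cut the match short, which is the usual limit of this approach.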
BTW: DomDocument is slow as hell, I found out that regexes sometimes (depending on the size of your document) can give 40x speed increase. Just keep it simple.
Best,
Nicolas
Don't parse HTML with regex.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.lipsum.com/');
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument;
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
# foreach ($xpath->query('//div') as $div) { // all div's in html
foreach ($xpath->query('//div[contains(@class, "abc")]') as $div) { // all div's that have "abc" classname
// $div->nodeValue contains fetched DIV content
}
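Note that contains(@class, "abc") is a substring test, so it would also match class="abcd". A common XPath 1.0 workaround is to pad the attribute with spaces and look for the padded name. The sketch below applies that to made-up markup and collects the text of each ul inside the matching div.

```php
<?php
// Made-up markup: only the first div has the class "abc".
$html = '<div class="abc"><ul> first </ul><ul> second </ul></div>'
      . '<div class="abcd"><ul> other </ul></div>';

$dom = new DOMDocument;
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
// Padding with spaces makes the test match whole class names only.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " abc ")]/ul';

$items = [];
foreach ($xpath->query($query) as $ul) {
    $items[] = trim($ul->nodeValue);
}
print_r($items);
```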