Scrape external web page and get all outbound links - php

What I want is simple
Get webpage HTML and scrape all outbound links
what I have so far is
<?php
function get_content($URL){
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $URL);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = get_content('http://example.com');
?>

Make use of DOMDocument
$dom = new DOMDocument;
$dom->loadHTML($html); // <----------- Pass the HTML content you retrieved from get_content()
foreach ($dom->getElementsByTagName('a') as $tag) {
echo $tag->getAttribute('href');
}

Related

Cant use loadHTMLfile or file_get_contents for external URL

I want to know Groupon active deals so I write a scraper like:
libxml_use_internal_errors(true);
$dom = new DOMDocument();
#$dom->loadHTMLFile('https://www.groupon.com/browse/new-york?category=food-and-drink&minPrice=1&maxPrice=999');
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//li[#class='slot']//a/#href");
foreach($entries as $e) {
echo $e->textContent . '<br />';
}
but when I run this function browser loading all time, just loading something but don't show any error.
How can I fix it? Not just case with Groupon - I also try other websites but also don't work. WHy?
What about using CURL to loading page data.
Not just case with Groupon - I also try other websites but also don't work
I think this code will help you but you should expect unexpected situations for each website which you want to scrap.
<?php
$dom = new DOMDocument();
$data = get_url_content('https://www.groupon.com', true);
#$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//label");
foreach($entries as $e) {
echo $e->textContent . '<br />';
}
function get_url_content($url = null, $justBody = true)
{
/* Init CURL */
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_HTTPHEADER, []);
$data = curl_exec($ch);
if ($justBody)
$data = #(explode("\r\n\r\n", $data, 2))[1];
var_dump($data);
return $data;
}

Getting title of a webpage issue

I want to get the title of a webpage with file_get_contents(),
I tried:
$get=file_get_meta_tags("http://example.com");
echo $get["title"];
but it doesn't match.
What is wrong with it?
Title tag is not part of match in get_meta_tags() function and it is also not a meta tag.
Try this:
$get=file_get_contents("http://example.com");
preg_match("#<title>(.*?)</title>#i,$get,$matches);
print_r($matches);
Regex #<title>(.*?)</title>#i matches the title string.
Use the Below Code snipet to get the webpage title.
<?php
function curl_file_get_contents($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$targetUrl = "http://google.com/";
$html = curl_file_get_contents($targetUrl);
$doc = new DOMDocument();
#$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$page_title = $nodes->item(0)->nodeValue;
echo "Title: $page_title". '<br/><br/>';
?>

Getting site title in unknown format using Php Curl and Dom-Document

I want to get site title using site url with most of the site it is working but it is getting some not readable text with japennese and chinnese site.
Here is my function
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
Use
use--------
$html = $this->file_get_contents_curl($url);
Parsing
$doc = new DOMDocument();
#$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
I am getting this ouput "ã¢ã¡ã¼ãIDç»é² ã¡ã¼ã«ã®ç¢ºèªï½Ameba(ã¢ã¡ã¼ã)"
Site URL : https://user.ameba.jp/regist/registerIntro.do?campaignId=0053&frmid=3051
Please help me out suggest some way to get exact site title in any language.
//example
/* MEthod----------4 */
function file_get_contents_curl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$uurl="http://www.piaohua.com/html/xuannian/index.html";
$html = file_get_contents_curl($uurl);
//parsing begins here:
$doc = new DOMDocument();
#$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
if(!empty($nodes->item(0)->nodeValue)){
$title = utf8_decode($nodes->item(0)->nodeValue);
}else{
$title =$uurl;
}
echo $title;
Make sure your script is using utf-8 encoding by adding following line to the begining of the file
mb_internal_encoding('UTF-8');
After doing so, remove utf8_decode function from your code. Everything should work fine without it
[DOMDocument::loadHtml]1 function gets encoding from html page meta tag. So you could have problems if page do not excplicitly specifies its encoding.
Simply add this line on top of your PHP Code.
header('Content-Type: text/html;charset=utf-8');
The code..
<?php
header('Content-Type: text/html;charset=utf-8');
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl('http://www.piaohua.com/html/lianxuju/2013/1108/27730.html');
$doc = new DOMDocument();
#$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
echo $title = $nodes->item(0)->nodeValue;

how can i parse images with cURL?

how can I parse images on this site with cURL?
with this code I can show the whole site's html, but I need only images:
$ch = curl_init('http://www.lamoda.ru/shoes/sapogi/?sitelink=leftmenu&sf=16&rdr565=1#sf=16');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, '1');
$text = curl_exec($ch);
curl_close($ch);
if (!preg_match('/src="https?:\/\/"/', $text))
$text = preg_replace('/src="(.*)"/', "src=\"$MY_BASE_URL\\1\"", $text);
echo $text;
thank you!
I tried this:
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, '1');
$text = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
#$doc->loadHTML($text->content);
$imgs = $doc->getElementsByTagName('img');
foreach ($imgs as $img)
{
$imgarray[] = $img -> getAttribute('src');
}
return $imgarray;
BUT: on this site images uploaded via JS and it doesn't show images at all =((
You can use a DOM Parser to achieve this:
$ch = curl_init('URL_GOES_HERE');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, '1');
$text = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument;
$dom->loadHTML($text);
foreach ($dom->getElementsByTagName('img') as $img) {
echo $img->getAttribute('src');
}
you can use html parse simple_html_dom:
http://simplehtmldom.sourceforge.net/manual.htm
// Create DOM from URL or file
$url = 'http://www.lamoda.ru/shoes/sapogi/?sitelink=leftmenu&sf=16&rdr565=1#sf=16';
$html = file_get_html($url);
// Find all images
foreach($html->find('img') as $element)
echo $element->src;

php dom not accepting url

I am trying to create a program that will open a text file with urls seperated by |. It will then take the first line of the text document, crawl that url and remove it from the text file. Each url is to be scraped by a basic crawler. I know the crawler part works because if I enter in one of the urls in quotations, rather than a variable from the text file, it will work. I am at the point where it will not return anything because the url simply will not be accepted.
this is a basic version of my code because I had to break it down alot to iscolate the problem.
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
$url = $urlarray[0];
$dom = new DOMDocument('1.0');
#$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
$title = $element->getAttribute('title');
$class = $element->getAttribute('class');
if($class == 'result_link')
{
$title = str_replace('Synonyms of ', '', $title);
echo $title . "<br />";
}
}`
The code below works like a champ tested with your example data:
<?php
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
$url = $urlarray[0];
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
$dom = new DOMDocument();
#$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
$title = $element->getAttribute('title');
$class = $element->getAttribute('class');
if($class == 'result_link')
{
$title = str_replace('Synonyms of ', '', $title);
echo $title . "<br />";
}
}
?>
ALMOST FORGOT: LETS NOW PUT IT IN A LOOP TO LOOP THROUGH ALL URLS:
<?php
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
$url = $urlarray[0];
foreach($urlarray as $url) {
if(!empty($url)) {
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,trim($url));
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
$dom = new DOMDocument();
#$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
$title = $element->getAttribute('title');
$class = $element->getAttribute('class');
if($class == 'result_link')
{
$title = str_replace('Synonyms of ', '', $title);
echo $title . "<br />";
}
}
echo '<hr />';
}
}
?>
So if you put in a URL manually $url = 'http://www.mywebsite.com'; every thing works as expected?
If so there is a problem here:
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
are you sure urls.txt is loading? Are you sure it contains http://a.com|http://b.com etc?
I would var dump
$contents = file_get_contents('urls.txt') before the explode statement to see if it is loading in.
If yes, then I would explode the into $urlarray, and var dump $urlarray[0]
if it looks right I would trim it before being sent to dom with trim($urlarray[0])
I may even go as far as using valid regex to make sure these URL's are in fact URL's before sending it to dom.
Let me know the results and I will try to help further, or post all sample code including URLS.txt
And I will run it locally

Categories