Xpath scraping further link from subpages? - php

I finally managed to write a PHP script that scrapes basic elements from other websites. It is very simple; this example shows how to get the title and URL of each link.
ini_set('display_errors', 1);
$url = 'http://test123cxqwq12.000webhostapp.com/mainpage.php';
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$title = $xpath->query('/html/body/a/h1');
$source = $xpath->query('/html/body/a/@href');
for ($i = 0; $i < $source->length; $i++) {
    $new = $source->item($i)->nodeValue;
    $text = $title->item($i)->nodeValue;
    // Print each title with its URL
    echo $text . ' - ' . $new . '<br>';
}
Page with results: http://test123cxqwq12.000webhostapp.com/scrap.php
Page being scraped: http://test123cxqwq12.000webhostapp.com/mainpage.php
Subpage: http://test123cxqwq12.000webhostapp.com/subpage.php
Now I would like to go a step further and take data from the subpage. Instead of stopping at the link found on the main page, as the script does now, I would like to follow that link and extract another link from the subpage (in this example, the google.com link). I'm out of ideas. Is it possible to do this with XPath in a similar way to what I am doing now? Any tips would be appreciated.

I think a solution could be to store the URLs in a database and then apply your cURL and XPath functions to each of them:
<?php
function curlGet($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_URL, $url);
$results = curl_exec($ch);
curl_close($ch);
return $results;
}
function returnXPathObject($item) {
$xmlPageDom = new DOMDocument();
@$xmlPageDom->loadHTML($item);
$xmlPageXPath = new DOMXPath($xmlPageDom);
return $xmlPageXPath;
}
// $cxn is assumed to be an existing PDO connection; the "url" column name is a placeholder
$allUrl = $cxn->query("SELECT url FROM yourDatabaseUrl");
$allUrl = $allUrl->fetchAll(PDO::FETCH_COLUMN);
foreach ($allUrl as $url) {
    $getDom = curlGet($url);
    $getDomXpath = returnXPathObject($getDom);
    $title = $getDomXpath->query('/html/body/a/h1');
    $source = $getDomXpath->query('/html/body/a/@href');
}
I'm not sure about this answer; it's just a suggestion.
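A database is not strictly needed for this: the href values found on the main page can be fed straight back into the same fetch-and-query step. The following is only a minimal sketch of that idea. `fetchPage()` and `queryValues()` are hypothetical helper names, the `example.test` URLs and their markup are made up, and `fetchPage()` returns canned HTML so the sketch runs without network access; in a real script it would wrap the cURL code from the question. It also assumes the links are absolute URLs, so relative links would need resolving first.

```php
<?php
// Hypothetical stand-in for the cURL helper: returns canned HTML per URL.
function fetchPage(string $url): string
{
    $pages = [
        'http://example.test/mainpage.php' =>
            '<html><body><a href="http://example.test/subpage.php">Title 1</a></body></html>',
        'http://example.test/subpage.php' =>
            '<html><body><a href="https://www.google.com">Google</a></body></html>',
    ];
    return $pages[$url] ?? '';
}

// Run an XPath expression against an HTML string and return the node values.
function queryValues(string $html, string $expression): array
{
    if ($html === '') {
        return []; // nothing fetched, nothing to parse
    }
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $values = [];
    foreach ((new DOMXPath($dom))->query($expression) as $node) {
        $values[] = $node->nodeValue;
    }
    return $values;
}

// Step 1: collect the links on the main page.
// Step 2: fetch each linked subpage and collect the links found there.
$results = [];
foreach (queryValues(fetchPage('http://example.test/mainpage.php'), '/html/body/a/@href') as $subpageUrl) {
    $results[$subpageUrl] = queryValues(fetchPage($subpageUrl), '/html/body/a/@href');
}
print_r($results);
```

The same XPath expression is reused for both levels here only because the made-up pages share a structure; a real subpage would likely need its own expression.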

Related

Scrape Amazon.com webpage with PHP

I'm trying to simply fetch the html of a remote Amazon url. I had working code, but maybe they changed something? Not sure. I've spent hours trying code samples and plugins from here and there, but nothing is working. Here's what I have right now, but of course it doesn't work either:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $item['URL']);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
$output = json_decode(curl_exec($curl));
//echo curl_getinfo($curl, CURLINFO_HTTP_CODE);
curl_close($curl);
#file_put_contents($graphics_file_root.'rps/amazon/temp2.html',$output);
$html = new DOMDocument();
@$html->loadHTML($output);
#file_put_contents($graphics_file_root.'rps/amazon/temp.html',$html->saveHTML());
$temp = $html->getElementsByTagName('img');
$html = file_get_contents($item['URL']);
#file_put_contents($graphics_file_root.'rps/amazon/temp2.html',$html);
$temp = $html->getElementsByTagName('img');
echo count($temp);
print_r($temp);
This doesn't work. simple_html_dom doesn't work. Nothing does that I can find.
It looks like some of the code I found online was JSON-specific; removing the json_decode() call fixed it:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $item['URL']);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
$output = curl_exec($curl);
//echo curl_getinfo($curl, CURLINFO_HTTP_CODE);
curl_close($curl);
//file_put_contents($graphics_file_root.'rps/amazon/temp2.html',$output);
$html = new DOMDocument();
@$html->loadHTML($output);
//file_put_contents($graphics_file_root.'rps/amazon/temp.html',$html->saveHTML());
$temp = $html->getElementsByTagName('img');
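To illustrate why the json_decode() call broke things, here is a small sketch using a made-up HTML string in place of the Amazon response: HTML is not valid JSON, so json_decode() returns null and the parser then has nothing to work with.

```php
<?php
// Made-up response body standing in for the fetched page.
$body = '<html><body><img src="a.png"><img src="b.png"></body></html>';

// HTML is not JSON, so decoding fails and yields null.
$decoded = json_decode($body);
$jsonError = json_last_error();
var_dump($decoded);
echo json_last_error_msg(), "\n";

// Passing the raw body to DOMDocument works as intended.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($body);
libxml_clear_errors();
$imageCount = $doc->getElementsByTagName('img')->length;
echo $imageCount, " img element(s) found\n";
```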

Can't use loadHTMLFile or file_get_contents for external URL

I want to find active Groupon deals, so I wrote a scraper like this:
libxml_use_internal_errors(true);
$dom = new DOMDocument();
@$dom->loadHTMLFile('https://www.groupon.com/browse/new-york?category=food-and-drink&minPrice=1&maxPrice=999');
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//li[@class='slot']//a/@href");
foreach($entries as $e) {
echo $e->textContent . '<br />';
}
But when I run this, the browser just keeps loading and never shows a result or an error.
How can I fix it? It's not only Groupon; I have tried other websites and they don't work either. Why?
What about using cURL to load the page data?
"Not just case with Groupon - I also try other websites but also don't work"
I think this code will help you, but you should expect site-specific surprises for each website you want to scrape.
<?php
$dom = new DOMDocument();
$data = get_url_content('https://www.groupon.com', true);
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//label");
foreach($entries as $e) {
echo $e->textContent . '<br />';
}
function get_url_content($url = null, $justBody = true)
{
/* Init CURL */
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_HTTPHEADER, []);
$data = curl_exec($ch);
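curl_close($ch);
if ($justBody) {
    // The response headers end at the first blank line; keep only what follows
    $data = explode("\r\n\r\n", (string) $data, 2)[1] ?? '';
}
var_dump($data);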
return $data;
}
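The question's code combines libxml_use_internal_errors(true) with an XPath query filtered by class. The sketch below shows that combination on a made-up HTML string (hypothetical markup, not Groupon's real page), so the query itself can be checked separately from the download step.

```php
<?php
// Hypothetical markup standing in for a fetched page.
$html = '<html><body><ul>'
      . '<li class="slot"><a href="/deals/one">One</a></li>'
      . '<li class="other"><a href="/deals/skip">Skip</a></li>'
      . '<li class="slot"><div><a href="/deals/two">Two</a></div></li>'
      . '</ul></body></html>';

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$warnings = libxml_get_errors(); // available for inspection or logging
libxml_clear_errors();

// Select the href attribute of every link nested inside an li with class "slot".
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query("//li[@class='slot']//a/@href") as $attribute) {
    $hrefs[] = $attribute->nodeValue;
}
print_r($hrefs);
```

Note that `[@class='slot']` only matches when the class attribute is exactly `slot`; an element with several classes would need something like `contains(@class, 'slot')` instead.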

Getting title of a webpage issue

I want to get the title of a webpage with file_get_contents(),
I tried:
$get=get_meta_tags("http://example.com");
echo $get["title"];
but it doesn't match.
What is wrong with it?
The title tag is not matched by get_meta_tags(), because it is not a meta tag.
Try this:
$get=file_get_contents("http://example.com");
preg_match("#<title>(.*?)</title>#i", $get, $matches);
print_r($matches);
The regex #<title>(.*?)</title>#i matches the title; the captured text is in $matches[1].
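As a hedged variation on that regex, the sketch below runs it against a made-up HTML string rather than a live page. Two things are my additions and not part of the answer above: the `s` modifier, so that a title spanning several lines still matches, and `[^>]*`, so that a title tag carrying attributes is tolerated.

```php
<?php
// Made-up page source; note the upper-case tag and the line breaks inside it.
$html = "<html><head><TITLE>\n  Example Domain\n</TITLE></head><body></body></html>";

$title = null;
// i: case-insensitive tag name, s: let "." match newlines as well.
if (preg_match('#<title[^>]*>(.*?)</title>#is', $html, $matches)) {
    $title = trim($matches[1]); // group 1 holds the text between the tags
}
echo $title, "\n";
```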
Use the code snippet below to get the webpage title.
<?php
function curl_file_get_contents($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$targetUrl = "http://google.com/";
$html = curl_file_get_contents($targetUrl);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$page_title = $nodes->item(0)->nodeValue;
echo "Title: $page_title". '<br/><br/>';
?>

Getting site title in unknown format using Php Curl and Dom-Document

I want to get a site's title from its URL. It works for most sites, but for Japanese and Chinese sites I get unreadable text.
Here is my function
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
Usage:
$html = $this->file_get_contents_curl($url);
Parsing:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
I am getting this output: "ã¢ã¡ã¼ãIDç»é² ã¡ã¼ã«ã®ç¢ºèªï½Ameba(ã¢ã¡ã¼ã)"
Site URL : https://user.ameba.jp/regist/registerIntro.do?campaignId=0053&frmid=3051
Please help me by suggesting a way to get the exact site title in any language.
//example
/* MEthod----------4 */
function file_get_contents_curl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$uurl="http://www.piaohua.com/html/xuannian/index.html";
$html = file_get_contents_curl($uurl);
//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
if(!empty($nodes->item(0)->nodeValue)){
$title = utf8_decode($nodes->item(0)->nodeValue);
}else{
$title =$uurl;
}
echo $title;
Make sure your script is using UTF-8 encoding by adding the following line to the beginning of the file:
mb_internal_encoding('UTF-8');
After doing so, remove the utf8_decode function from your code. Everything should work fine without it.
The DOMDocument::loadHTML function gets the encoding from the page's meta tag, so you could have problems if the page does not explicitly specify its encoding.
Simply add this line at the top of your PHP code:
header('Content-Type: text/html;charset=utf-8');
The full code:
<?php
header('Content-Type: text/html;charset=utf-8');
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl('http://www.piaohua.com/html/lianxuju/2013/1108/27730.html');
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
echo $title = $nodes->item(0)->nodeValue;

Why does this regex not match the URLs in this Google results page?

I'm having trouble scraping the URLs out of the Google results. This code worked for me for a long time, but it seems Google changed a few things this week, and now I'm getting a ton of extra characters surrounding the actual URL I want.
preg_match_all('#<h3\s*class="r">\s*<a[^<>]*href="([^<>]*)"[^<>]*>(.*)</a>\s*</h3>#siU', $results, $matches[$key]);
EDIT: All links come out like this when scraped with the above code:
/url?url=http://cooksandtravelbooks.com/write-for-us/&rct=j&sa=U&ei=XdayUNnHBIqDiwKZuYEY&ved=0CBQQFjAA&q=cooking+%5C%22Write+for+Us%5C%22&usg=AFQjCNGMiCiWYY_8JDAhqJggVDW2qHRMfw
<?php
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
foreach($dom->getElementsByTagName('a') as $link) {
echo $link->getAttribute('href');
echo "<br />";
}
?>
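The hrefs collected this way would still be the redirect-style links shown in the question. A possible follow-up step, sketched here under the assumption that the target always sits in the `url` query parameter as in that sample, is to parse the query string and pull the parameter out. `extractTargetUrl()` is a hypothetical helper name, and the sample href is a shortened copy of the one in the question.

```php
<?php
// Hypothetical helper: return the value of the "url" query parameter, or null.
function extractTargetUrl(string $href): ?string
{
    // Everything after the first "?" is the query string.
    $query = explode('?', $href, 2)[1] ?? '';
    parse_str($query, $params); // also URL-decodes each value
    return isset($params['url']) && is_string($params['url']) ? $params['url'] : null;
}

// Shortened copy of the link from the question.
$href = '/url?url=http://cooksandtravelbooks.com/write-for-us/&rct=j&sa=U';
$target = extractTargetUrl($href);
echo $target, "\n";
```

Links without a `url` parameter return null, so they can be skipped or kept as-is depending on what the scraper needs.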
