I'm getting the content of a site using the following code:
function get_content($url){
    $content = @file_get_contents($url);
    if( empty($content) ){
        $content = get_url_contents($url);
    }
    return $content;
}
function get_url_contents($url){
    $crl = curl_init();
    $timeout = 90;
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}
$url = "http://www.site.com";
$html = get_content($url);
echo $html;
Everything is OK, but I need to get, for example, all the div elements, the title of the page, or all the images.
How can I do that?
Thanks
Use an HTML parsing library. While many of them exist, I have personally used SimpleHTMLDom and had a good experience. It uses jQuery-style selectors, making it easy to learn.
Some code samples:
To get the title of the page:
$html = str_get_html($html);
$title = $html->find('title',0);
echo $title->plaintext;
For all div elements:
$html = str_get_html($html);
$divs = $html->find('div');
foreach($divs as $div) {
    // do something
}
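For the images, the same find() call should work; a small sketch, assuming the same $html object as above:
// each img element exposes its src attribute directly
foreach($html->find('img') as $img) {
    echo $img->src . "<br>";
}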
You can use DOMDocument. For example:
$dom = new DOMDocument;
$dom->loadHTML($html);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
    echo $div->nodeValue . PHP_EOL;
}
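If you prefer DOMDocument for the title and image sources as well, something along these lines should work (a sketch using only standard DOM methods, reusing the $dom above):
$title = $dom->getElementsByTagName('title')->item(0);
echo ($title !== null) ? $title->textContent : '';

foreach ($dom->getElementsByTagName('img') as $img) {
    echo $img->getAttribute('src') . PHP_EOL;
}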
I'm trying to get the HTML code of Instagram's embed pages for my API, but it returns a strange error and I do not know what to do now, because I'm new to PHP. The code works on other websites.
I tried it already on other websites like apple.com, and the strange thing is that when I call this function on the 'normal' post page it works; the error only appears when I call it on the '/embed' URL.
This is my PHP Code:
<?php
if (isset($_GET['url'])) {
    $filename = $_GET['url'];
    $file = file_get_contents($filename);
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    $dom->loadHTML($file);
    libxml_use_internal_errors(false);
    $bodies = $dom->getElementsByTagName('body');
    assert($bodies->length === 1);
    $body = $bodies->item(0);
    for ($i = 0; $i < $body->children->length; $i++) {
        $body->remove($body->children->item($i));
    }
    $stringbody = $dom->saveHTML($body);
    echo $stringbody;
}
?>
I call the API like this:
https://api.com/get-website-body.php?url=http://instagr.am/p/BoLVWplBVFb/embed
My goal is to get the body of the website, as it is when I call this code on the https://apple.com URL, for example.
You can use the direct URL to scrape the data if you use cURL, and it's faster than file_get_contents(). Here is the cURL code for the different URLs; this will scrape the body data alone.
if (isset($_GET['url'])) {
    // $website_url = 'https://www.instagram.com/instagram/?__a=1';
    // $website_url = 'https://apple.com';
    // $website_url = $_GET['url'];
    $website_url = 'http://instagr.am/p/BoLVWplBVFb/embed';

    $curl = curl_init();
    //curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_HEADER, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_URL, $website_url);
    curl_setopt($curl, CURLOPT_REFERER, $website_url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0(Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/66.0');
    $str = curl_exec($curl);
    curl_close($curl);

    $json = json_decode($str, true); // only meaningful for the ?__a=1 JSON endpoint above
    print_r($str); // just taking the page as it is

    // Taking the body part alone so you can work with it as you wish
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    $dom->loadHTML($str);
    libxml_use_internal_errors(false);
    $bodies = $dom->getElementsByTagName('body');
    foreach ($bodies as $key => $value) {
        print_r($value); // you will get all the content of the body here
    }
}
NOTE: Here you don't need to use https://api.com/get-website-body.php?url=....
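If the goal is the body markup as a string rather than a print_r() dump of the DOMElement, saveHTML() can be pointed at the body node. A minimal sketch, building on the same $dom as above:
$body = $dom->getElementsByTagName('body')->item(0);
if ($body !== null) {
    echo $dom->saveHTML($body); // serializes just the <body> subtree as a string
}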
I've written a script in PHP to fetch links from the main page of Wikipedia and write them to a CSV file. The script does fetch the links accordingly. However, I can't write the populated results to a CSV file. When I execute my script, it does nothing, and there is no error either. Any help will be highly appreciated.
My attempt so far:
<?php
include "simple_html_dom.php";

$url = "https://en.wikipedia.org/wiki/Main_Page";

function fetch_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);

    $links = array();
    foreach ($dom->find('a') as $link) {
        $links[] = $link->href . '<br>';
    }
    return implode("\n", $links);

    $file = fopen("itemfile.csv", "w");
    foreach ($links as $item) {
        fputcsv($file, $item);
    }
    fclose($file);
}
fetch_content($url);
?>
1. You are using return in your function; that's why nothing gets written to the file, as the code stops executing after that.
2. Simplified your logic with the code below:
$file = fopen("itemfile.csv", "w");
foreach ($dom->find('a') as $link) {
    fputcsv($file, array($link->href));
}
fclose($file);
So the full code needs to be:
<?php
// comment these two lines out once the script is working properly;
// they are here to check for and display all errors
error_reporting(E_ALL);
ini_set('display_errors', 1);

include "simple_html_dom.php";

$url = "https://en.wikipedia.org/wiki/Main_Page";

function fetch_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);

    $file = fopen("itemfile.csv", "w");
    foreach ($dom->find('a') as $link) {
        fputcsv($file, array($link->href));
    }
    fclose($file);
}
fetch_content($url);
?>
The reason the file does not get written is that you return out of the function before that code can even be executed.
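If you want fetch_content() to both write the CSV and still return the links, one option is to move the return after the file has been closed. A small sketch of just that part, assuming the same $dom and itemfile.csv as above:
$links = array();
$file = fopen("itemfile.csv", "w");
foreach ($dom->find('a') as $link) {
    $links[] = $link->href;
    fputcsv($file, array($link->href)); // one href per CSV row
}
fclose($file);
return implode("\n", $links); // only return after the file has been written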
I've written a script in PHP to scrape the titles and their links from a webpage. The webpage displays its content across multiple pages. My script below can parse the titles and links from its landing page.
How can I modify my existing script to get data from multiple pages, say up to 10 pages?
This is my attempt so far:
<?php
include "simple_html_dom.php";

$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=2";

function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);

    foreach($dom->find('.question-summary') as $file){
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
    }
}
get_content($link);
?>
The site increments its pages like ?page=2, ?page=3, etc.
This is how I got it to work (following Nima's suggestion).
<?php
include "simple_html_dom.php";

$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=";

function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);

    foreach($dom->find('.question-summary') as $file){
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
    }
}

// $i < 10 covers pages 1 through 9; use $i <= 10 if you want ten pages
for($i = 1; $i < 10; $i++){
    get_content($link . $i);
}
?>
Here is how I would do it with XPath:
// loadUrlSource() is your own page-fetching helper; see the sketch after this block.
$url = 'https://stackoverflow.com/questions/tagged/web-scraping';

$dom = new DOMDocument();
$source = loadUrlSource($url);
$dom->loadHTML($source);
$xpath = new DOMXPath($dom);

$domPage = new DOMDocument();
$domPage->loadHTML($source);
$xpath_page = new DOMXPath($domPage);

// Find page links with the title "go to page" within the div container that has the "pager" class.
$pageItems = $xpath_page->query("//div[contains(@class, 'pager')]//a[contains(@title, 'go to page')]");

// Get the last page number.
// Since you look up the page number once at the beginning, subtract 2 because the "next" link has a "go to page" title as well.
$pageCount = (int)$pageItems[$pageItems->length - 2]->textContent;

// Loop over every page (use <= $pageCount if you want to include the last page)
for($page = 1; $page < $pageCount; $page++) {
    $source = loadUrlSource($url . "?page={$page}");

    // Do whatever you want with the source. You can also call simple_html_dom on the content.
    // $dom = new simple_html_dom();
    // $dom->load($source);
}
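Since loadUrlSource() is not a built-in PHP function, here is a minimal cURL-based sketch of what such a helper might look like (the name and behaviour are assumed from the answer above):
function loadUrlSource($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $source = curl_exec($ch);
    curl_close($ch);
    return $source;
}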
I just learned what scraping and cURL are a few hours ago, and I have been playing with them since. Nevertheless, I am now facing something strange. The code below works fine with some sites and not with others (of course I modified the URL and the XPath...). Note that no error is raised when I test whether curl_exec was executed properly, so the problem must come from somewhere after that. My questions are as follows:
How can I check if the new DOMDocument as been created properly: if(??)
How can I check if the new DOMDocument has been populated properly with html?
...if a new DOMXPath object has been created?
Hope I was clear. Thank you in advance for your replies. Cheers. Marc
My php:
<?php
$target_url = "http://www.somesite.com";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);

if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->query('somepath');
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo "<br />Link: $url";
}
?>
Use a try/catch to check if the document object was created, then check the return value of loadHTML() to determine if the HTML was loaded into the document. You can use a try/catch on the XPath object as well.
try
{
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    if($loaded)
    {
        // loaded OK
    }
    else
    {
        // could not load HTML
    }
}
catch(Exception $e)
{
    // document could not be created, see $e->getMessage()
}
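To see the specific parse problems as well (your second question above), libxml's error collection can be used around loadHTML(). A minimal sketch using only standard libxml functions:
libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);
foreach (libxml_get_errors() as $error) {
    echo "Parse problem on line {$error->line}: {$error->message}";
}
libxml_clear_errors();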
Problem solved. The error came from Firebug, which gave a wrong path. Big thanks to MrCode for his support...
How can I get the blog ID from a given blogspot.com URL?
I looked at the source code of a webpage from blogspot.com, and it looks like this:
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://www.blogger.com/rsd.g?blogID=4899870735344410268" />
How can I parse this to get the number 4899870735344410268?
Use DOMDocument to parse the document and then use its methods to retrieve the wanted element.
I cannot stress this enough: never use regular expressions to parse an HTML document.
function getBlogId($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    $page = curl_exec($ch);
    curl_close($ch);

    $doc = new DOMDocument();
    @$doc->loadHTML($page);

    $links = $doc->getElementsByTagName('link');
    foreach($links as $link) {
        $rel = $link->attributes->getNamedItem('rel');
        if($rel && $rel->nodeValue == 'EditURI') {
            $href = $link->attributes->getNamedItem('href')->nodeValue;
            $query = parse_url($href, PHP_URL_QUERY);
            if($query) {
                $queryComp = array();
                parse_str($query, $queryComp);
                if($queryComp['blogID']) {
                    return $queryComp['blogID'];
                }
            }
        }
    }
    return false;
}
Example use:
$id = getBlogId('http://thehouseinmarrakesh.blogspot.com/');
echo $id; // 483911541311389592
$pageContents = file_get_contents('blogspot_url');
preg_match('~<link rel="EditURI" type="application/rsd\+xml" title="RSD" href="http://www.blogger.com/rsd.g\?blogID=([0-9]+)" />~', $pageContents, $matches);
echo $matches[1];