I've written a script in PHP to scrape the titles and their links from a webpage. The webpage spreads its content across multiple pages. My script below can parse the titles and links from the landing page.
How can I modify my existing script to get data from multiple pages, say up to 10 pages?
This is my attempt so far:
<?php
include "simple_html_dom.php";

$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=2";

function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
    }
}

get_content($link);
?>
The site increments its pages like ?page=2, ?page=3, etc.
This is how I got it working (following Nima's suggestion):
<?php
include "simple_html_dom.php";

$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=";

function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
    }
}

// Pages 1 through 10 (the original $i < 10 condition stopped at page 9).
for ($i = 1; $i <= 10; $i++) {
    get_content($link . $i);
}
?>
Here is how I would do it with XPath:
$url = 'https://stackoverflow.com/questions/tagged/web-scraping';

// loadUrlSource() stands for your own fetch helper (a sketch is given below).
$source = loadUrlSource($url);

$domPage = new DOMDocument();
$domPage->loadHTML($source);
$xpath_page = new DOMXPath($domPage);

// Find page links with the title "go to page" within the div container that has the "pager" class.
$pageItems = $xpath_page->query("//div[contains(@class, 'pager')]//a[contains(@title, 'go to page')]");

// Get the last page number.
// You only look this up once at the beginning; take the second-to-last match
// because the "next" link has a "go to page" title as well.
$pageCount = (int)$pageItems->item($pageItems->length - 2)->textContent;

// Loop over every page.
for ($page = 1; $page <= $pageCount; $page++) {
    $source = loadUrlSource($url . "?page={$page}");
    // Do whatever you want with the source. You can also hand it to simple_html_dom:
    // $dom = new simple_html_dom();
    // $dom->load($source);
}
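The loadUrlSource() function is not defined in the snippet above; it stands for whatever you use to fetch a page's HTML. A minimal sketch with cURL, under the assumption that returning the response body as a string is all it needs to do:

// Hypothetical helper assumed by the snippet above: fetch a URL's HTML with cURL.
function loadUrlSource($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $source = curl_exec($ch);
    curl_close($ch);
    return $source;
}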
I need to extract the dollar amount, e.g. $594, from the meta title of a URL. I am getting the full meta title; however, I just need the $594 from it, not the whole title. Here is my code. Thanks
<?php
// Web page URL
$url = 'https://www.cheapflights.com.au/flights-to-Delhi/Sydney/';

// Extract HTML using curl
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);

// Load HTML into a DOM object (@ suppresses warnings from malformed markup)
$dom = new DOMDocument();
@$dom->loadHTML($data);

// Parse DOM to get title data
$nodes = $dom->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;

// Parse DOM to get meta data
$metas = $dom->getElementsByTagName('meta');
$description = $keywords = '';
for ($i = 0; $i < $metas->length; $i++) {
    $meta = $metas->item($i);
    if ($meta->getAttribute('name') == 'description') {
        $description = $meta->getAttribute('content');
    }
    if ($meta->getAttribute('name') == 'keywords') {
        $keywords = $meta->getAttribute('content');
    }
}
echo "$title" . '<br/>';
?>
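To then pull just the dollar amount out of $title, a regular expression is one option. A minimal sketch, assuming the price always appears in the title as a $ followed by digits (e.g. $594):

// Extract the first dollar amount from the title, e.g. "$594".
if (preg_match('/\$\d+/', $title, $matches)) {
    echo $matches[0];
}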
I'm trying to get the HTML code of Instagram's embed pages for my API, but it returns a strange error and I don't know what to do now, because I'm new to PHP. The code works on other websites.
I have already tried it on other websites like apple.com, and the strange thing is that when I call this function on the 'normal' post page it works; the error only appears when I call it on the '/embed' URL.
This is my PHP code:
<?php
if (isset($_GET['url'])) {
    $filename = $_GET['url'];
    $file = file_get_contents($filename);
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    $dom->loadHTML($file);
    libxml_use_internal_errors(false);
    $bodies = $dom->getElementsByTagName('body');
    assert($bodies->length === 1);
    $body = $bodies->item(0);
    for ($i = 0; $i < $body->children->length; $i++) {
        $body->remove($body->children->item($i));
    }
    $stringbody = $dom->saveHTML($body);
    echo $stringbody;
}
?>
I call the API like this:
https://api.com/get-website-body.php?url=http://instagr.am/p/BoLVWplBVFb/embed
My goal is to get the body of the website, just as it is when I call this code on https://apple.com, for example.
You can scrape the data from the URL directly if you use cURL, and it's faster than file_get_contents(). Here is the cURL code for the different URLs; this will scrape the body data alone.
if (isset($_GET['url'])) {
    // $website_url = 'https://www.instagram.com/instagram/?__a=1';
    // $website_url = 'https://apple.com';
    // $website_url = $_GET['url'];
    $website_url = 'http://instagr.am/p/BoLVWplBVFb/embed';

    $curl = curl_init();
    //curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_HEADER, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_URL, $website_url);
    curl_setopt($curl, CURLOPT_REFERER, $website_url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/66.0');
    $str = curl_exec($curl);
    curl_close($curl);

    // Only needed for JSON endpoints such as the ?__a=1 URL above
    $json = json_decode($str, true);

    print_r($str); // Just dumping the page as it is

    // Take the body part alone and use it as you wish
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    $dom->loadHTML($str);
    libxml_use_internal_errors(false);
    $bodies = $dom->getElementsByTagName('body');
    foreach ($bodies as $key => $value) {
        print_r($value); // You will have all the content of the body here
    }
}
NOTE: With this approach you don't need to call it through https://api.com/get-website-body.php?url=....
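If what you actually need is the body's HTML as a string (as in the question) rather than print_r() dumps of DOM nodes, DOMDocument::saveHTML() can serialize the node. A small sketch, assuming $dom is loaded as above:

$bodies = $dom->getElementsByTagName('body');
if ($bodies->length > 0) {
    // Serialize the <body> element and its children back to an HTML string.
    echo $dom->saveHTML($bodies->item(0));
}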
I've written a script in PHP to scrape the titles and their links from a webpage and write them to a CSV file. As I'm dealing with a paginated site, only the content of the last page remains in the CSV file and the rest is overwritten when I use write mode w. However, when I do the same using append mode a, I find all the data in the CSV file.
As appending and writing make the script open and close the CSV file multiple times (because of my perhaps wrongly applied loops), the script becomes less efficient and time-consuming.
How can I do the same in an efficient manner and, of course, using (write) w mode?
This is what I've written so far:
<?php
include "simple_html_dom.php";

$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=";

function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    $infile = fopen("itemfile.csv", "a");
    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
        fputcsv($infile, [$itemTitle, $itemLink]);
    }
    fclose($infile);
}

for ($i = 1; $i < 10; $i++) {
    get_content($link . $i);
}
?>
If you don't want to open and close the file multiple times, move the fopen() call before your for loop and the fclose() after it:
function get_content($url, $infile)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
        fputcsv($infile, [$itemTitle, $itemLink]);
    }
}

$infile = fopen("itemfile.csv", "w");
for ($i = 1; $i < 10; $i++) {
    get_content($link . $i, $infile);
}
fclose($infile);
?>
I would consider not echoing or writing results to the file inside the get_content() function. I would rewrite it so it only gets content; then I can handle the extracted data any way I like. Something like this (please read the code comments):
<?php
include "simple_html_dom.php";

$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=";

// This function does not write data to a file or print it. It only extracts data
// and returns it as an array.
function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    // We don't need the following line anymore
    // $infile = fopen("itemfile.csv","a");
    // We will collect extracted data in an array
    $result = [];
    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        $result[] = [$itemTitle, $itemLink];
        // echo "{$itemTitle},{$itemLink}<br>";
        // No need to write to file, so we don't need the following either
        // fputcsv($infile,[$itemTitle,$itemLink]);
    }
    // No files opened, so the following line is no longer required
    // fclose($infile);
    // Return extracted data from this specific URL
    return $result;
}

// Merge all results (the result for each URL with a different page parameter).
// With a little refactoring, get_content() could handle this as well.
$result = [];
for ($page = 1; $page < 10; $page++) {
    $result = array_merge($result, get_content($link . $page));
}

// Now do whatever you want with $result, like writing its values to a file, printing it, etc.
// You might want to write a function for this.
// The file is opened once, in write mode (w), as the question asked.
$outputFile = fopen("itemfile.csv", "w");
foreach ($result as $row) {
    fputcsv($outputFile, $row);
}
fclose($outputFile);
?>
This is a script that gives you a direct MP3 download link from a YouTube video ID; the variable $gg is the video ID.
When I run this code locally on my XAMPP it runs fine and returns a direct download link, but when I try to run it on my host's server it returns a link that is not a direct link but a download page. What am I doing wrong?
<?php
$gg = '6Y1Emb7Jyks';
$site = 'http://www.youtubeinmp3.com/widget/button/?video=https://www.youtube.com/watch?v=' . $gg;
$html = file_get_contents($site);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
}
$lol = 'http://www.youtubeinmp3.com/' . $url;
echo $lol;
Figured it out myself: I had to follow the redirect.
Final code:
<?php
$gg = '6Y1Emb7Jyks';
$site = 'http://www.youtubeinmp3.com/widget/button/?video=https://www.youtube.com/watch?v=' . $gg;
$html = file_get_contents($site);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
}
$lol = 'http://www.youtubeinmp3.com/' . $url;
$url = $lol;

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$a = curl_exec($ch);
$url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
echo $url;
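A side note: with CURLOPT_RETURNTRANSFER set, curl_exec() above downloads the whole file just to learn the final URL. Since only the redirect target is needed, a HEAD request avoids the download. A sketch of that variation, using the same variables:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $lol);
curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request: headers only, no body
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow the redirect chain
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
$finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
echo $finalUrl;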
I'm getting the content of the site using the following code:
function get_content($url)
{
    $content = @file_get_contents($url);
    if (empty($content)) {
        $content = get_url_contents($url);
    }
    return $content;
}

function get_url_contents($url)
{
    $crl = curl_init();
    $timeout = 90;
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}

$url = "http://www.site.com";
$html = get_content($url);
echo $html;
Everything is OK, but I need to get, for example, all the div elements, the title of the page, or all the images.
How can I do that?
Thanks
Use an HTML parsing library. While many of them exist, I have personally used SimpleHTMLDom and had a good experience. It uses jQuery-style selectors, making it easy to learn.
Some code samples:
To get the title of the page:
$html = str_get_html($html);
$title = $html->find('title', 0);
echo $title->plaintext;
For all div elements:
$html = str_get_html($html);
$divs = $html->find('div');
foreach ($divs as $div) {
    // do something;
}
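The question also asked about images; the same selector approach covers those. A sketch, assuming $html is the str_get_html() object from above:

foreach ($html->find('img') as $img) {
    echo $img->src . '<br>';
}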
You can use DOMDocument, e.g.:
$dom = new DOMDocument;
$dom->loadHTML($html);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
    echo $div->nodeValue . PHP_EOL;
}
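And the DOMDocument equivalent for the images, assuming $dom is loaded as above:

$imgs = $dom->getElementsByTagName('img');
foreach ($imgs as $img) {
    echo $img->getAttribute('src') . PHP_EOL;
}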