Script writes partial content to a csv file - php

I've written a script in PHP to scrape titles and their links from a webpage and write them to a csv file. As I'm dealing with a paginated site, only the content of the last page remains in the csv file when I use write mode w; the rest gets overwritten. However, when I use append mode a instead, I find all the data in the file.
Appending opens and closes the csv file multiple times (because of my perhaps wrongly applied loops), which makes the script less efficient and more time-consuming.
How can I do the same thing efficiently, and ideally using write mode w?
This is what I've written so far:
<?php
include "simple_html_dom.php";

$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=";

function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    $infile = fopen("itemfile.csv", "a");
    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
        fputcsv($infile, [$itemTitle, $itemLink]);
    }
    fclose($infile);
}

for ($i = 1; $i < 10; $i++) {
    get_content($link . $i);
}
?>

If you don't want to open and close the file multiple times, move the fopen() call before your for-loop, pass the handle into the function, and close the file after the loop:
<?php
include "simple_html_dom.php";

$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=";

function get_content($url, $infile)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
        fputcsv($infile, [$itemTitle, $itemLink]);
    }
}

$infile = fopen("itemfile.csv", "w");
for ($i = 1; $i < 10; $i++) {
    get_content($link . $i, $infile);
}
fclose($infile);
?>

I would consider not echoing or writing results to the file inside the get_content() function. I would rewrite it so that it only gets content; the extracted data can then be handled any way you like. Something like this (please read the code comments):
<?php
include "simple_html_dom.php";

$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=";

// This function does not write data to a file or print it. It only extracts
// data and returns it as an array.
function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    // Instead of opening "itemfile.csv" here, we collect the extracted
    // data in an array.
    $result = [];
    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        $result[] = [$itemTitle, $itemLink];
    }
    // Return the data extracted from this specific URL.
    return $result;
}

// Merge the results for each URL (one per page parameter).
// With a little refactoring, get_content() could handle this as well.
$result = [];
for ($page = 1; $page < 10; $page++) {
    $result = array_merge($result, get_content($link . $page));
}

// Now do whatever you want with $result: write its values to a file, print
// it, etc. You might want to write a function for this. Since the file is
// opened exactly once, plain write mode "w" is sufficient here.
$outputFile = fopen("itemfile.csv", "w");
foreach ($result as $row) {
    fputcsv($outputFile, $row);
}
fclose($outputFile);
?>
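To address the original question's wish for plain w mode directly: once all rows are collected first and the file is opened exactly once, w works fine. A minimal, self-contained sketch of that final writing step (the sample rows below stand in for scraped data and are not from the site):

```php
<?php
// Sample rows standing in for the scraped [title, link] pairs.
$result = [
    ["How to scrape a paginated site?", "https://stackoverflow.com/q/1"],
    ["fputcsv quoting, explained",      "https://stackoverflow.com/q/2"],
];

// The file is opened once, so plain write mode "w" is sufficient:
// it truncates the file and writes every collected row in one pass.
$outputFile = fopen("itemfile.csv", "w");
foreach ($result as $row) {
    fputcsv($outputFile, $row);
}
fclose($outputFile);

echo file_get_contents("itemfile.csv");
```

Note that fputcsv handles quoting for you: the second title contains a comma, so it is written wrapped in double quotes.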

Related

Trouble writing results in a csv file

I've written a script in PHP to fetch links from the main page of Wikipedia and write them to a csv file. The script does fetch the links, but I can't write the results to the csv file. When I execute the script, it does nothing; there is no error either. Any help will be highly appreciated.
My try so far:
<?php
include "simple_html_dom.php";

$url = "https://en.wikipedia.org/wiki/Main_Page";

function fetch_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    $links = array();
    foreach ($dom->find('a') as $link) {
        $links[] = $link->href . '<br>';
    }
    return implode("\n", $links);
    $file = fopen("itemfile.csv", "w");
    foreach ($links as $item) {
        fputcsv($file, $item);
    }
    fclose($file);
}

fetch_content($url);
?>
1. You are using return inside your function; the code after it never executes, which is why nothing gets written to the file.
2. The writing logic can be simplified to the code below:
$file = fopen("itemfile.csv", "w");
foreach ($dom->find('a') as $link) {
    fputcsv($file, array($link->href));
}
fclose($file);
So the full code becomes:
<?php
// Comment out these two lines once the script works properly;
// they enable checking and displaying all errors.
error_reporting(E_ALL);
ini_set('display_errors', 1);

include "simple_html_dom.php";

$url = "https://en.wikipedia.org/wiki/Main_Page";

function fetch_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    $file = fopen("itemfile.csv", "w");
    foreach ($dom->find('a') as $link) {
        fputcsv($file, array($link->href));
    }
    fclose($file);
}

fetch_content($url);
?>
The reason the file does not get written is that you return out of the function before that code can even be executed.
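A tiny, hypothetical function that mirrors the bug makes the point concrete:

```php
<?php
// Hypothetical function mirroring the bug: return ends the function,
// so nothing after it ever runs.
function demo()
{
    $links = array("a", "b");
    return implode("\n", $links);
    // Dead code: this echo is unreachable, just like the fopen/fputcsv
    // block after the return in the question's fetch_content().
    echo "never printed";
}

echo demo();   // prints the two links, never the echo below the return
```

Moving the file-writing code above the return (or returning the array and writing outside the function) fixes the problem.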

Unable to grab content traversing multiple pages

I've written a script in PHP to scrape titles and their links from a webpage. The webpage displays its content across multiple pages. My script below can parse the titles and links from the landing page.
How can I modify my existing script to get data from multiple pages, say up to 10 pages?
This is my attempt so far:
<?php
include "simple_html_dom.php";

$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=2";

function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
    }
}

get_content($link);
?>
The site increments its pages like ?page=2, ?page=3, etc.
This is how I got it working (following Nima's suggestion):
<?php
include "simple_html_dom.php";

$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=";

function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
    }
}

for ($i = 1; $i < 10; $i++) {
    get_content($link . $i);
}
?>
Here is how I would do it with XPath:
$url = 'https://stackoverflow.com/questions/tagged/web-scraping';

$domPage = new DOMDocument();
// loadUrlSource() is a helper that fetches the page source for a URL.
$source = loadUrlSource($url);
$domPage->loadHTML($source);
$xpath_page = new DOMXPath($domPage);

// Find page links with the title "go to page" within the div container
// that has the "pager" class.
$pageItems = $xpath_page->query("//div[contains(@class, 'pager')]//a[contains(@title, 'go to page')]");

// Get the last page number. Since you only look for the page count once at
// the beginning, subtract 2 because the "next" link has the title
// "go to page" as well.
$pageCount = (int)$pageItems[$pageItems->length - 2]->textContent;

// Loop over every page.
for ($page = 1; $page <= $pageCount; $page++) {
    $source = loadUrlSource($url . "?page={$page}");
    // Do whatever you want with the source. You can also feed it to
    // simple_html_dom:
    // $dom = new simple_html_dom();
    // $dom->load($source);
}
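The snippet above assumes a loadUrlSource() helper that the answer never defines. A minimal cURL-based version might look like this (the function name comes from the answer; the body is a sketch, not the author's actual implementation):

```php
<?php
// Sketch of the loadUrlSource() helper the XPath answer assumes:
// fetch a URL with cURL and return the response body as a string.
function loadUrlSource($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $source = curl_exec($ch);
    curl_close($ch);
    // curl_exec() returns false on failure; normalize that to an
    // empty string so callers can always treat the result as HTML.
    return $source === false ? '' : $source;
}
```

Returning an empty string on failure keeps DOMDocument::loadHTML() from receiving a boolean, though you may prefer to throw or log instead.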

extract specific data from webpage using php

I want to create a PHP script that alerts me when a new notice is published on my work website, starting from this page URL:
http://www.mahapwd.com/nit/ueviewnotice.asp?noticeid=1767
From this page I want variables for the Date & Time of Meeting (date and time as two separate variables), the Place of Meeting, and Published On.
Please help me create a working PHP script.
I tried to write the following script, but it gives too many errors:
<?php
$url1 = "http://www.mahapwd.com/nit/ueIndex.asp?district=12";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);

preg_match('/href="(.*?)"/', $data, $urldata);
$url2 = "http://www.mahapwd.com/nit/" . $urldata[1];
curl_setopt($ch, CURLOPT_URL, $url2);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data2 = curl_exec($ch);

preg_match('~Published On:</b>(.*?)</font>~', $data2, $pubDt);
$PubDate = $pubDt[1];
preg_match('~Time of Meeting:</b>(.*?)&nbsp~', $data2, $MtDt);
$MeetDate = $MtDt[1];
preg_match('~Time of Meeting:</b>' . preg_quote($MtDt[1], '~') . '&nbsp(.*?)</font>~', $data2, $MtTime);
$MeetTime = $MtTime[1];
preg_match('~Place of Meeting:</b>(.*?)</font>~', $data2, $place);
$MeetPlace = $place[1];
?>
Hello, here is some simple code for you. You can download simple_html_dom.php from http://simplehtmldom.sourceforge.net/
require_once "simple_html_dom.php";

$url = 'http://www.mahapwd.com/nit/ueviewnotice.asp?noticeid=1767';

// Parse the URL.
$html1 = file_get_html($url);
if (!$html1) {
    echo "no content";
} else {
    // Here is the parsed HTML. Now you need to find the table.
    $element1 = $html1->find('table');
    // Here is the table you need.
    $input = $element1[2];
    // Now you can select cells from it.
    foreach ($input->find('td') as $element) {
        // In here you can read the value, save it to a database, and
        // compare it against the previous run to detect new notices.
    }
}
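Once you have the page HTML, the specific fields can be pulled out with a small parsing helper. The sketch below is a hypothetical approach, not code from either answer: it assumes the notice page renders each field as a bold "Label:" followed by the value, which matches the patterns the question's regexes were aiming at.

```php
<?php
// Hypothetical helper: given the HTML of a notice page, pull out the
// fields the question asks for by matching "<b>Label:</b> value" pairs.
// The exact labels are assumptions based on the question text.
function extract_notice_fields($html)
{
    $fields = [];
    $labels = ['Date & Time of Meeting', 'Place of Meeting', 'Published On'];
    foreach ($labels as $label) {
        // Escape the label for use inside the pattern, then capture the
        // text between the closing </b> and the next tag.
        $pattern = '~' . preg_quote($label, '~') . '\s*:\s*</b>\s*([^<]+)~i';
        if (preg_match($pattern, $html, $m)) {
            $fields[$label] = trim($m[1]);
        }
    }
    return $fields;
}

// Example with sample markup (standing in for the fetched page):
$sample = "<b>Date & Time of Meeting:</b> 01/02/2018 11:00 AM<br>"
        . "<b>Place of Meeting:</b> Pune<br>"
        . "<b>Published On:</b> 25/01/2018<br>";
print_r(extract_notice_fields($sample));
```

To split the combined date/time value into two variables, something like explode(' ', $value, 2) on the extracted string would give the date and the time separately.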

php curl search list of webpage

I have a list of URLs in a txt file, separated by \n.
I want to fetch every page with PHP cURL and search it for a specific string.
The specific string looks like:
http://www.....com/xyz/...png or .gif
$ch = curl_init("ARRAY of page from txt????");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$text = curl_exec($ch);
$test = strpos($text, "HOW CREATE THE SPECIFIC STRING???");
if ($test == false) {
    echo $test;
} else {
    echo "no exist";
}
<?php
$array = explode("\n", file_get_contents('fileName.txt'));
$result = array();
foreach ($array as $url) {
    $ch = curl_init();
    // Set the URL for this iteration.
    curl_setopt($ch, CURLOPT_URL, $url);
    // Return the transfer as a string.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // $output contains the output string.
    $output = curl_exec($ch);
    // Close the curl resource to free up system resources.
    curl_close($ch);
    $result[] = $output;
}
?>
Finally, the array $result contains the HTML of every link that was inside your text file.
Do an explode on the URL string like:
$urls = explode("\n", $urlstring);
Now you can loop through the array of URLs.
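Putting the pieces together, here is a hedged sketch of the whole task. The file name fileName.txt is taken from the thread; the image-URL pattern is an assumption based on the "…png or .gif" example in the question, and the search is factored into a small function so the fetch loop stays simple.

```php
<?php
// Sketch: read URLs (one per line) from a text file, fetch each page,
// and report whether it contains a link to a .png or .gif image.

// Pure helper: does this HTML contain an absolute http(s) URL
// ending in .png or .gif? (Pattern is an assumption.)
function contains_image_url($html)
{
    return (bool) preg_match('~https?://\S+\.(png|gif)~i', $html);
}

// Guard against a missing file so the sketch degrades gracefully.
$urls = file_exists('fileName.txt')
    ? array_filter(array_map('trim', file('fileName.txt')))
    : [];

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    echo $url . ': '
        . (contains_image_url((string) $html) ? 'found' : 'not found')
        . "\n";
}
```

Note that the question's original if/else was inverted: strpos() returning false means the string was *not* found, so that branch should print "no exist", not the position.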

CURL alternative to the built-in "file_get_contents()" function

So from my understanding this should be fairly simple: I should only need to replace the original file_get_contents() call, and the rest of the script should still work. I have commented out the old file_get_contents() and added the cURL version below.
After changing from file_get_contents() to cURL, the code below produces no output.
//$data = @file_get_contents("http://www.city-data.com/city/".$cityActualURL."-".$stateActualURL.".html");
//$data = file_get_contents("http://www.city-data.com/city/Geneva-Illinois.html");

// Initialize the cURL session (once, not twice as in the original).
$url = "http://www.city-data.com/city/" . $cityActualURL . "-" . $stateActualURL . ".html";
//echo "$url<br>";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
//echo $data;

$details = str_replace("\n", "", $data);
$details = str_replace("\r", "", $details);

$detailsBlock = <<<HTML
~<div style='clear:both;'></div><br/><b>(.*?) on our <a href='http://www.city-data.com/top2/toplists2.html'>top lists</a>: </b><ul style='margin:10px;'>(.*?)<div style='bp_bindex'>~
HTML;
$detailsBlock2 = <<<HTML
~<br/><br/><b>(.*?) on our <a href='http://www.city-data.com/top2/toplists2.html'>top lists</a>: </b><ul style='margin:10px;'>(.*?)</ul></td>~
HTML;
$detailsBlock3 = <<<HTML
~<div style='clear:both;'></div><br/><b>(.*?) on our <a href='http://www.city-data.com/top2/toplists2.html'>top lists</a>: </b><ul style='margin:10px;'>(.*?)</ul></td>~
HTML;

preg_match($detailsBlock, $details, $matches);
preg_match($detailsBlock2, $details, $matches2);
preg_match($detailsBlock3, $details, $matches3);

if (isset($matches[2])) {
    $facts = "<ul style='margin:10px;'>" . $matches[2];
} elseif (isset($matches2[2])) {
    $facts = "<ul style='margin:10px;'>" . $matches2[2];
} elseif (isset($matches3[2])) {
    $facts = "<ul style='margin:10px;'>" . $matches3[2];
} else {
    $facts = "More Information to Come...";
}
If you have a problem with your script, you need to debug it. For example:
$data = curl_exec($ch);
var_dump($data); die();
Then you will see what $data actually is. Depending on that output, you can decide where to look next for the cause of the malfunction.
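Beyond var_dump, cURL itself can report why a transfer failed. A hedged sketch (the function name is hypothetical) that wraps the fetch so failures surface instead of silently producing empty output:

```php
<?php
// Sketch: wrap the fetch so failures surface instead of silently
// producing empty output. The function name is hypothetical.
function fetch_or_report($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $data = curl_exec($ch);
    // curl_error() returns a human-readable message for the last
    // failure, or an empty string when the transfer succeeded.
    $error = curl_error($ch);
    curl_close($ch);
    if ($data === false) {
        return "cURL error: " . $error;
    }
    return $data;
}
```

Common causes surfaced this way include DNS failures, timeouts, and blocked requests, any of which would explain a script that "does not output".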
The following function works great, just pass it a URL.
function file_get_data($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Make cURL return the data instead of printing it to the browser.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
TIP: New lines and carriage returns can be replaced with one line of code.
$details = str_replace(array("\r\n","\r","\n"), '', $data);
