I've written a script in PHP to fetch links from the main page of Wikipedia and write them to a CSV file. The script does fetch the links correctly; however, I can't get the collected results written to the CSV file. When I execute the script, it does nothing and shows no error either. Any help will be highly appreciated.
My attempt so far:
<?php
include "simple_html_dom.php";

$url = "https://en.wikipedia.org/wiki/Main_Page";

function fetch_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);

    $links = array();
    foreach ($dom->find('a') as $link) {
        $links[] = $link->href . '<br>';
    }
    return implode("\n", $links);

    $file = fopen("itemfile.csv", "w");
    foreach ($links as $item) {
        fputcsv($file, $item);
    }
    fclose($file);
}

fetch_content($url);
?>
1. You are using return in your function; that's why nothing gets written to the file, as the code stops executing after that.
2. I simplified your logic with the code below:
$file = fopen("itemfile.csv","w");
foreach ($dom->find('a') as $link) {
fputcsv($file,array($link->href));
}
fclose($file);
So the full code needs to be:
<?php
// These two lines enable checking and displaying all errors;
// comment them out once the script works properly.
error_reporting(E_ALL);
ini_set('display_errors', 1);

include "simple_html_dom.php";

$url = "https://en.wikipedia.org/wiki/Main_Page";

function fetch_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);

    $file = fopen("itemfile.csv", "w");
    foreach ($dom->find('a') as $link) {
        fputcsv($file, array($link->href));
    }
    fclose($file);
}

fetch_content($url);
?>
The reason the file does not get written is that you return from the function before the file-writing code can even be executed.
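Move the file-writing above the return (or drop the return entirely). Here is a minimal sketch of the reordered tail of fetch_content(), keeping the original file name and variables:
// Write the CSV first...
$file = fopen("itemfile.csv", "w");
foreach ($links as $item) {
    fputcsv($file, array($item)); // fputcsv() expects an array per row
}
fclose($file);

// ...and only then return the joined list to the caller.
return implode("\n", $links);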
I've written a script in PHP to scrape titles and their links from a webpage and write them to a CSV file. As I'm dealing with a paginated site, only the content of the last page remains in the CSV file; the rest gets overwritten. That happens with write mode w. However, when I do the same using append mode a, I find all the data in the CSV file.
Because appending makes the script open and close the CSV file multiple times (perhaps due to my wrongly applied loops), it becomes less efficient and time-consuming.
How can I do the same thing efficiently, and preferably using write mode w?
This is what I've written so far:
<?php
include "simple_html_dom.php";

$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=";

function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);

    $infile = fopen("itemfile.csv", "a");
    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink  = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
        fputcsv($infile, [$itemTitle, $itemLink]);
    }
    fclose($infile);
}

for ($i = 1; $i < 10; $i++) {
    get_content($link . $i);
}
?>
If you don't want to open and close the file multiple times, move the fopen() call before your for-loop, pass the handle into the function, and close it after:
function get_content($url, $infile)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);

    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink  = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
        fputcsv($infile, [$itemTitle, $itemLink]);
    }
}

$infile = fopen("itemfile.csv", "w");
for ($i = 1; $i < 10; $i++) {
    get_content($link . $i, $infile);
}
fclose($infile);
?>
I would consider not echoing or writing results to the file inside the get_content() function. I would rewrite it so it only gets content, letting me handle the extracted data any way I like. Something like this (please read the code comments):
<?php
include "simple_html_dom.php";

$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=";

// This function does not write data to a file or print it. It only extracts
// data and returns it as an array.
function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);

    // We don't need the following line anymore:
    // $infile = fopen("itemfile.csv","a");

    // We will collect the extracted data in an array
    $result = [];
    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink  = $file->find('.question-hyperlink', 0)->href;
        $result[] = [$itemTitle, $itemLink];
        // echo "{$itemTitle},{$itemLink}<br>";
        // No need to write to the file, so the following is not needed either:
        // fputcsv($infile,[$itemTitle,$itemLink]);
    }
    // No files were opened, so the following line is no longer required:
    // fclose($infile);

    // Return the data extracted from this specific URL
    return $result;
}

// Merge the results for all URLs (one per page parameter).
// With a little refactoring, get_content() could handle this as well.
$result = [];
for ($page = 1; $page < 10; $page++) {
    $result = array_merge($result, get_content($link . $page));
}

// Now do whatever you want with $result, like writing its values to a file
// or printing them. You might want to write a function for this as well.
// Write mode "w" is enough here, since the file is opened exactly once.
$outputFile = fopen("itemfile.csv", "w");
foreach ($result as $row) {
    fputcsv($outputFile, $row);
}
fclose($outputFile);
?>
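As the comment above suggests, the pagination loop could also move into a small helper with a little refactoring. A sketch of that idea; the name get_all_pages and its $pages parameter are mine, not from the original answer:
// Hypothetical helper: fetch pages 1..$pages and merge the results.
function get_all_pages($baseUrl, $pages)
{
    $result = [];
    for ($page = 1; $page <= $pages; $page++) {
        $result = array_merge($result, get_content($baseUrl . $page));
    }
    return $result;
}

// Usage: one call replaces the manual loop above.
$result = get_all_pages($link, 9);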
I want to create a PHP script that alerts me when a new notice is published on my work website. Here is the page URL:
http://www.mahapwd.com/nit/ueviewnotice.asp?noticeid=1767
From this page I want variables for the Date & Time of Meeting (date and time separately, as two variables), the Place of Meeting, and Published On.
Please help me create a working PHP script.
I tried to write the following script, but it gives too many errors:
<?php
$url1 = "http://www.mahapwd.com/nit/ueIndex.asp?district=12";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
preg_match("/href=\"(.*?)\"/", $data, $urldata);

$url2 = "http://www.mahapwd.com/nit/$urldata[1]";
curl_setopt($ch, CURLOPT_URL, $url2);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data2 = curl_exec($ch);

preg_match("/Published On:<\/b>(.*?)<\/font>/", $data2, $pubDt);
$PubDate = $pubDt[1];
preg_match("/Time of Meeting:<\/b>(.*?) /", $data2, $MtDt);
$MeetDate = $MtDt[1];
preg_match("/Time of Meeting:<\/b>$MtDt[1] (.*?)<\/font>/", $data2, $MtTime);
$MeetTime = $MtTime[1];
preg_match("/Place of Meeting:<\/b>(.*?)<\/font>/", $data2, $placeDt);
$MeetPlace = $placeDt[1];
?>
Hello, I have written some simple code for you. You can download simple_html_dom.php from http://simplehtmldom.sourceforge.net/:
require_once "simple_html_dom.php";

$url = 'http://www.mahapwd.com/nit/ueviewnotice.asp?noticeid=1767';

// Parse the URL
$html1 = file_get_html($url);
if (!$html1) {
    echo "no content";
} else {
    // Here is the parsed HTML
    $string1 = $html1;
    // Now you need to find the table
    $element1 = $html1->find('table');
    // Here is the table you need
    $input = $element1[2];
    // Now you can select rows from it
    foreach ($input->find('td') as $element) {
        // In here you can find the value, save it to a database, then check it
    }
}
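To get the specific variables the question asked for (Published On, Place of Meeting, and the meeting date and time as two values), you can scan the parsed text for those labels. This is only a sketch, under the assumption that the notice page renders each field as "Label: value" text; the patterns are illustrative and may need adjusting to the real markup:
// Assumption: the notice page exposes "Label: value" text pairs.
$text = $html1 ? $html1->plaintext : '';

$fields = [];
foreach (['Published On', 'Date & Time of Meeting', 'Place of Meeting'] as $label) {
    if (preg_match('/' . preg_quote($label, '/') . '\s*:?\s*([^\r\n]+)/i', $text, $m)) {
        $fields[$label] = trim($m[1]);
    }
}

// Split "Date & Time of Meeting" into separate date and time variables.
if (isset($fields['Date & Time of Meeting'])) {
    $parts = preg_split('/\s+/', $fields['Date & Time of Meeting'], 2);
    $MeetDate = $parts[0];
    $MeetTime = isset($parts[1]) ? $parts[1] : '';
}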
I have the following code:
<?php
error_reporting(E_ALL & ~E_NOTICE);
set_time_limit(1000);

$f = $_GET['location'] . '.txt';
if (!file_exists($f)) {
    die('Location unavailable');
}

$file = fopen($f, "r");
while (!feof($file)) {
    $members[] = fgets($file);
}
fclose($file);

function get_thumbs($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Capture curl's printed output via output buffering
    ob_start();
    curl_exec($ch);
    curl_close($ch);
    $string = ob_get_contents();
    ob_end_clean();
    return $string;
}

foreach ($members as $id) {
    // Do something with each line from the text file here
    $id = preg_replace('/\s+/', '', $id);
    $link = 'http://localhost/cvr/get.php?id=' . $id . '&thumb=yes&title=yes';
    $content = get_thumbs($link);
    echo $content;
}
?>
get.php uses almost the same cURL function as above to grab data from a website and parse it.
In the txt file I have about 20 IDs to fetch data for, but the foreach seems to take a very long time, 30+ seconds. Any advice?
I am a PHP beginner, so please don't be too hard on me.
Thanks!
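Since the ~20 requests run one after another, most of those 30+ seconds is network wait. One common way to cut it is to issue the requests in parallel with PHP's curl_multi API. A sketch (the helper name get_thumbs_parallel is illustrative):
// Fetch many URLs concurrently with curl_multi instead of one at a time.
function get_thumbs_parallel(array $urls)
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // capture the body instead of printing it
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh);
        }
    } while ($active && $status == CURLM_OK);

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}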
I want to parse the following XML file.
What I have so far is:
$xml = new SimpleXMLElement('http://smarkets.s3.amazonaws.com/oddsfeed.xml', LIBXML_NOCDATA, true);
foreach ($xml->odds->event as $item) {
    echo (string)$item->market;
}
But this does not work. Can you help me?
You can try it with PHP cURL:
$ch = curl_init();
$url = 'http://smarkets.s3.amazonaws.com/oddsfeed.xml';
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// An empty string lets curl advertise and transparently decode all
// supported encodings (the feed is served gzip-compressed).
curl_setopt($ch, CURLOPT_ENCODING, '');
$data = curl_exec($ch);
curl_close($ch);

$xml = simplexml_load_string($data);
print_r($xml);
I have no idea which information you want to extract, so here is an example of how to get the attributes id and slug from all market nodes.
Just prepend compress.zlib:// to your URL to get the (gzip-compressed) XML; this works in PHP 4.3.0 and up:
<?php
$xml = simplexml_load_file('compress.zlib://http://smarkets.s3.amazonaws.com/oddsfeed.xml')
    or die("Error: Cannot create object");

foreach ($xml->event as $item) {
    echo $item->market['id'] . "<br>" . $item->market['slug'] . "<br><br>";
}
?>
I'm trying to parse an XML file, starting with simplexml_load_file to load the contents. The file comes from a WordPress site, using an XML feed generated by a .php file.
The problem is that it can never load the XML file. I'm not sure what I can do to make this work. Here is the code:
<?php
$url = "http://marshallmashup.usc.edu/feed.php";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$result = curl_exec($ch);
curl_close($ch);

$rss = simplexml_load_string($result);
if (!$rss = simplexml_load_file($url, NULL, LIBXML_NOERROR | LIBXML_NOWARNING)) {
    echo 'unable to load XML file';
} else {
    echo 'XML file loaded successfully';
}
?>
First of all, after this line:
$result = curl_exec($ch);
you should add this one:
$result = utf8_encode($result);
That said, you'll have no problems with simplexml_load_string($result), which will correctly build the document from the string you pass it, i.e. the feed fetched from the PHP page. You can inspect the result with var_dump($rss); after the statement $rss = simplexml_load_string($result);.
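Putting the answer's two pieces together, here is a minimal end-to-end sketch (same URL and variables as the question; the redundant second load via simplexml_load_file is dropped, since the string is already parsed):
<?php
$url = "http://marshallmashup.usc.edu/feed.php";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$result = curl_exec($ch);
curl_close($ch);

// Re-encode the body as UTF-8 before handing it to SimpleXML,
// as suggested above, then parse it from the string.
$result = utf8_encode($result);
$rss = simplexml_load_string($result);

if ($rss === false) {
    echo 'unable to load XML file';
} else {
    var_dump($rss);
}
?>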