I'm trying to scrape a webpage with PHP Simple HTML DOM,
but the loop runs only once and then stops, when it should iterate several times.
Here is the URL I want to scrape: http://anymart.c1.biz/index.html.
Please show me where the problem is and what the correct code looks like.
Here's my code:
require('simple_html_dom.php');
$url=('http://anymart.c1.biz/index.html');
$html = file_get_html($url);
$articles = [];
$i = 0;
foreach ($html->find('div.gallery-a table[width=745]') as $data)
{
if ($i > 5)
{
break;
}
$title = $data->find('tr.*[id] > td > a.gal_title', $i);
$item['title'] = $title->plaintext;
$url = $data->find('tr.*[id] > td > a.gal_title', $i);
$item['url'] = $url->href;
$image = $data->find('img.gal_thumb', $i);
$item['image'] = $image->src;
$articles[] = $item;
$i++;
}
$result = json_encode($articles, JSON_PRETTY_PRINT);
header('Access-Control-Allow-Origin: *');
header('Content-type: Application/JSON');
echo $result;
Thank you.
I want to load a CSV file, extract the URLs from it, fetch the title tag for each URL, and write the URLs with their corresponding titles to a new CSV. When I write the data, all the URLs are listed, but only the title of the last URL appears in the CSV. I have tried several ways to overcome this problem without success.
Here is my code:
<?php
ini_set('max_execution_time', '300'); //300 seconds = 5 minutes
ini_set('max_execution_time', '0');
include('simple_html_dom.php');
// if (isset($_POST['resurl'])) {
// $url = $_POST['resurl'];
if (($csv_file = fopen("old.csv", "r", 'a')) !== FALSE) {
$arraydata = array();
while (($read_data = fgetcsv($csv_file, 1000, ",")) !== FALSE) {
$column_count = count($read_data);
for ($c = 0; $c < $column_count; $c++) {
array_push($arraydata, $read_data[$c]);
}
}
fclose($csv_file);
}
$title = [];
foreach ($arraydata as $ad) {
$ard = [];
$ard = $ad;
$html = file_get_html($ard);
if ($html) {
$title = $html->find('title', 0)->plaintext;
// echo '<pre>';
// print_r($title);
}
}
$ncsv = fopen("updated.csv", "a");
$head = "Url,Title";
fwrite($ncsv, "\n" . $head);
foreach ($arraydata as $value) {
// $ar[]=$value;
$csvdata = "$value,$title";
fwrite($ncsv, "\n" . $csvdata);
}
fclose($ncsv);
I've changed the code so that you write the CSV file as you read the HTML pages. This saves having another loop and an extra array of titles.
I've also changed it to use fputcsv to write the data out, as it sorts out things like escaping values.
// Open file, using w to clear the old file down
$ncsv = fopen('updated.csv', 'w');
// Write the header row once, on its own line
fputcsv($ncsv, ['Url', 'Title']);
foreach ($arraydata as $ad) {
$html = file_get_html($ad);
// Fetch title, or set to blank if html is not loaded
if ($html) {
$title = $html->find('title', 0)->plaintext;
} else {
$title = '';
}
// Write record out
fputcsv($ncsv, [$ad, $title]);
}
fclose($ncsv);
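As a rough illustration of the escaping that fputcsv handles, here is a minimal sketch. The rows are made-up sample data, and it writes to an in-memory stream (php://memory) instead of updated.csv so it can run without touching any file.

```php
<?php
// Hypothetical sample rows; the second title contains a comma and quotes.
$rows = [
    ['http://example.com/a', 'Plain title'],
    ['http://example.com/b', 'Title, with "quotes"'],
];

// Write to an in-memory stream so the sketch needs no real file.
$out = fopen('php://memory', 'w+');

// Separator, enclosure and escape are passed explicitly rather than
// relying on the defaults.
fputcsv($out, ['Url', 'Title'], ',', '"', '\\');
foreach ($rows as $row) {
    fputcsv($out, $row, ',', '"', '\\');
}

// Read back everything that was written.
rewind($out);
$csv = stream_get_contents($out);
fclose($out);

echo $csv;
```

The title that contains a comma and quotes is wrapped in quotes with the inner quotes doubled, which a hand-built "$value,$title" string would not do.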
I was able to solve it finally.
Here is the updated code:
<?php
ini_set('max_execution_time', '300'); //300 seconds = 5 minutes
ini_set('max_execution_time', '0');
include('simple_html_dom.php');
// if (isset($_POST['resurl'])) {
// $url = $_POST['resurl'];
if (($csv_file = fopen("ntsurl.csv", "r", 'a')) !== FALSE) {
$arraydata = array();
while (($read_data = fgetcsv($csv_file, 1000, ",")) !== FALSE) {
$column_count = count($read_data);
for ($c = 0; $c < $column_count; $c++) {
array_push($arraydata, $read_data[$c]);
}
}
fclose($csv_file);
}
// print_r($arraydata);
$title=[];
$ncsv=fopen("ntsnew.csv","a");
$head="Website Url,title";
fwrite($ncsv,"\n".$head);
foreach($arraydata as $ad)
{
$ard = [];
$ard = $ad;
$html = file_get_html($ard);
if ($html) {
$title = $html->find('title', 0)->plaintext;
echo '<pre>';
print_r($title);
$csvdata="$ard,$title ";
fwrite($ncsv,"\n".$csvdata);
}
}
// fclose($ncsv);
I'm setting up a new server and want to scrape some information from a website.
This is my code. I tried to scrape the pages one by one, but I only get 2 of the pages.
$result = array();
function scrapingAnimelist($url, $page)
{
$res = array();
$urlParsed = $url . "&page=" . $page;
$html = file_get_html($urlParsed);
$pageData = array();
foreach ($html->find('div[class=body]') as $item) {
$metaData = array();
$metaData['title'] = $item->find('h2[class=title]', 0)->innertext;
$metaData['img'] = $item->find('img[class=img]', 0)->src;
$metaData['url'] = $item->find('a', 0)->href;
array_push($pageData, $metaData);
}
$res[$page] = $pageData;
if (sizeof($pageData) == 20) {
$page++;
$res[$page] = scrapingAnimelist($url, $page);
}
global $result;
$result = $res;
return $pageData;
}
I expect the JSON output to contain 3 arrays (one per page), but I get only 2. The link is: https://anime-list2.cf/anime-search?s=mag
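Your $result is overwritten on every call: each call assigns its own local $res, which only holds that call's page and the next one, so the final value contains just 2 pages.
You should write it like this: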
$result = array();
function scrapingAnimelist($url, $page) {
global $result;
$urlParsed = $url . "&page=" . $page;
$html = file_get_html($urlParsed);
$pageData = array();
foreach ($html->find('div[class=body]') as $item) {
$metaData = array();
$metaData['title'] = $item->find('h2[class=title]', 0)->innertext;
$metaData['img'] = $item->find('img[class=img]', 0)->src;
$metaData['url'] = $item->find('a', 0)->href;
array_push($pageData, $metaData);
}
$result[$page] = $pageData;
if (sizeof($pageData) == 20) {
return scrapingAnimelist($url, $page + 1);
}
return $result;
}
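The same "keep going while the page is full" rule can also be written as a plain loop, which avoids both the recursion and the global. This is only a sketch: fetchPage() is a hypothetical stand-in for the file_get_html() scraping step and returns fake items, and scrapeAllPages() is an invented name.

```php
<?php
// Hypothetical stand-in for the real scraping step: returns the items of one page.
// Pages 1 and 2 are full (20 items), page 3 is the shorter last page.
function fetchPage(int $page): array
{
    $sizes = [1 => 20, 2 => 20, 3 => 7];
    $count = $sizes[$page] ?? 0;
    $items = [];
    for ($n = 1; $n <= $count; $n++) {
        $items[] = ['title' => "Item $page-$n"];
    }
    return $items;
}

// Collect pages until one comes back with fewer than $pageSize items.
function scrapeAllPages(int $pageSize = 20): array
{
    $result = [];
    $page = 1;
    do {
        $pageData = fetchPage($page);
        $result[$page] = $pageData;
        $page++;
    } while (count($pageData) === $pageSize);
    return $result;
}

$result = scrapeAllPages();
echo json_encode(array_map('count', $result)), "\n";
```

With the fake data above, $result ends up with three entries keyed 1, 2 and 3.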
I want to get all <p> elements from the 1st joke, so I wrote this script:
<?php
$url = "http://sms.hindijokes.co";
$html = file_get_contents($url);
$doc = new DOMDocument;
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTML("<html><body>".$html."</body></html>");
$xpath = new DOMXPath($doc);
$query1 = "//h2[@class='entry-title']/a";
$query2 = "//div[@class='entry-content']/p";
$entries1 = $xpath->query($query1);
$entries2 = $xpath->query($query2);
$var1 = $entries1->item(0)->textContent;
$var2 = $entries2->item(0)->textContent;
echo "$var1";
echo "<br>";
$f = 5;
for($i = 0; $i < $f; $i++){
echo $entries2->item($i)->textContent."\n";
}
?>
This time I knew that there are five <p> elements in the first joke, but if I want to automate the script, there will sometimes be more or fewer than five <p> elements, which would break it.
You need only the first div's p elements, so your query would be:
$entries2 = $xpath->query("(//div[@class='entry-content'])[1]/p");
Now you can iterate over all the p elements with a foreach() loop (extracting their HTML contents):
$innerHtml = '';
foreach ($entries2 as $entry) {
$children = $entry->childNodes;
foreach ($children as $child) {
$innerHtml .= $child->ownerDocument->saveXML($child);
}
}
$innerHtml = str_replace(["\r\n", "\r", "\n", "\t"], '', $innerHtml);
DOMXPath::query returns DOMNodeList object. Use DOMNodeList::length property.
$f = $entries2->length;
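Putting the two ideas together (restrict the query to the first div, then loop up to length), a minimal sketch might look like this. The markup is a made-up stand-in for the real page, so the class name and paragraph texts are assumptions.

```php
<?php
// Hypothetical markup standing in for the real page.
$html = '<html><body>'
      . '<div class="entry-content"><p>One</p><p>Two</p><p>Three</p></div>'
      . '<div class="entry-content"><p>Other joke</p></div>'
      . '</body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from imperfect HTML
$xpath = new DOMXPath($doc);

// Only the <p> children of the first matching div.
$paragraphs = $xpath->query("(//div[@class='entry-content'])[1]/p");

// Loop bound comes from the node list, not a hard-coded 5.
$lines = [];
for ($i = 0; $i < $paragraphs->length; $i++) {
    $lines[] = $paragraphs->item($i)->textContent;
}

echo implode("\n", $lines), "\n";
```

Here the loop runs three times because the first div has three paragraphs; the paragraph in the second div is not included.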
Try it this way: the loop runs until item() returns null. Some jokes have multiple p tags, though, so it is better to find them by your own custom class/id.
$i = 0;
while (($entry = $entries2->item($i)) !== null) {
echo "<br>";
echo $i." ".$entry->textContent;
$i++;
}
Forgive me, my knowledge of PHP is limited. I have this code, which retrieves all the items from an RSS feed, but I now need it to use a for loop instead of a foreach loop so that I can limit the number of times it runs and choose which item number it starts from. How would I go about doing this? Thank you in advance for your help.
$urls = array("WordlideVideo" => "http://feeds.reuters.com/reuters/USVideoWorldNews");
$rss = fetch_rss($urls[$_GET['url']]);
foreach ($rss->items as $item) {
$href = $item['link'];
$title = $item['title'];
$video = $item['video'];
$titleLength = strlen($title);
if ($titleLength > 180) {
$title = substr($title, 0, 177);
$title = $title . "...";
} else {
$title = $title;
}
}
Assuming $rss->items is an array, you should be able to do something like the following:
$items = $rss->items;
$limit = count($items);
// put any logic here to reduce $limit if it's greater than your threshold
for($i=0; $i<$limit; $i++) {
$item = $items[$i];
// code as before inside foreach loop
}
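To make the "start from item N, run at most M times" part concrete, here is a small sketch. The $items array is fake data standing in for $rss->items, and $start and $max are assumed values chosen for illustration.

```php
<?php
// Hypothetical feed items standing in for $rss->items.
$items = [
    ['title' => 'First'],
    ['title' => 'Second'],
    ['title' => 'Third'],
    ['title' => 'Fourth'],
    ['title' => 'Fifth'],
];

$start = 1; // index of the first item to process (assumed value)
$max   = 3; // maximum number of items to process (assumed value)

// Stop at whichever comes first: the end of the array or the limit.
$end = min(count($items), $start + $max);

$titles = [];
for ($i = $start; $i < $end; $i++) {
    $titles[] = $items[$i]['title'];
}

echo implode(', ', $titles), "\n"; // prints "Second, Third, Fourth"
```

Taking the minimum of the two bounds keeps the loop from reading past the end of the array when the feed has fewer items than expected.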
So far my script is working fine: it gets all the .htm files and outputs the results. However, I'm using DOM to get the HTML title tag from each file, and that's where I can't get it to work with the random array. (Image base names and .htm base names are the same: firstresult.htm has the picture firstresult.jpg.)
I hope the code I provide will be useful.
<?php
// loop through the images
$count = 0;
$filenamenoext = array();
foreach (glob("/mydirectory/*.htm") as $filename) {
$filenamenoext[$count] = basename($filename, ".htm");
$count++;
}
for ($i = 0; $i < 10; $i++) {
$random = mt_rand(1, $count - 1);
$cachefile = "$filename";
$contents = file($cachefile);
$string = implode($contents);
$doc = new DOMDocument();
@$doc->loadHTML($string);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
$title = $nodes->item(0)->nodeValue;
echo '<img class="image" src="'.$filenamenoext[$random].'.jpg" " />"'.$title.'"<BR><BR>';
}
?>
It looks like the $filename variable you use on the line $cachefile = "$filename"; is never updated inside the for loop. After the foreach loop ends it still holds the last file name, so every iteration reads the same file.
You should change it to
$cachefile = '/mydirectory/' . $filenamenoext[$random] . '.htm';
Note that mt_rand(1, $count - 1) never picks index 0, so the first file is always skipped; start the range at 0.
Also, it's better practice to use the array_push() and count() functions instead of using a counter and filling the array manually. The code looks cleaner and is more readable.
<?php
// collect the base names of the .htm files
$filenamenoext = array();
foreach (glob("/mydirectory/*.htm") as $filename) {
array_push($filenamenoext, basename($filename, ".htm"));
}
for ($i = 0; $i < 10; $i++) {
// pick a random index, including the first entry
$random = mt_rand(0, count($filenamenoext) - 1);
// basename() stripped the directory, so add it back
$cachefile = '/mydirectory/' . $filenamenoext[$random] . '.htm';
$contents = file($cachefile);
$string = implode($contents);
$doc = new DOMDocument();
@$doc->loadHTML($string);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
$title = $nodes->item(0)->nodeValue;
echo '<img class="image" src="' . $filenamenoext[$random] . '.jpg" />"' . $title . '"<BR><BR>';
}
?>
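The loop above can pick the same file more than once. If distinct picks are wanted (an assumption about the intent, not something the question states), one option is to shuffle the list and take the first few entries instead of calling mt_rand() each time. The base names below are made up for illustration.

```php
<?php
// Hypothetical base names standing in for the glob() results.
$filenamenoext = ['alpha', 'beta', 'gamma', 'delta'];

$wanted = 10; // same count as the for loop in the answer

// Shuffle a copy, then keep at most $wanted entries; no name repeats.
$picked = $filenamenoext;
shuffle($picked);
$picked = array_slice($picked, 0, $wanted);

foreach ($picked as $name) {
    echo $name, '.htm -> ', $name, ".jpg\n";
}
```

With fewer than ten files available, this simply returns all of them once, in random order.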