Collect web data using Simple HTML Dom from multiple pages - php

I used the below code and successfully collected the data from a specific page as follows:
include 'simplehtmldom/simple_html_dom.php';
$html = file_get_html('http://test.com/file/1209i0329/');
// Find all article blocks
foreach($html->find('div.Content') as $file) {
$item['date'] = $file->find('id.article-date', 0)->plaintext;
$item['location'] = $file->find('id.article-location', 0)->plaintext;
$item['price'] = $file->find('div.article', 0)->plaintext;
$files[] = $item;
}
print_r($files);
The code works well for http://test.com/file/1209i0329.php, but my goal is to collect data from all pages starting with http://test.com/file/ on this domain (for example, http://test.com/file/1209i0329/, http://test.com/file/120dnkj329/, etc.). Is there a way to do this using simple_html_dom?

I don't know where the pages you want to scrape live (same domain or elsewhere), but you can loop over an array containing the URLs you want to process.
Consider this example:
include 'simplehtmldom/simple_html_dom.php';
// most likely this process will take some time
$files = array();
$urls = array(
'http://test.com/file/1209i0329/',
'http://test.com/file/120dnkj329/',
'http://en.wikipedia.org/wiki/',
);
foreach($urls as $url) {
$html = file_get_html($url);
// file_get_html() returns false when a page can't be loaded; skip it
if (!$html) {
continue;
}
// Find all article blocks
foreach($html->find('div.Content') as $file) {
$item['date'] = $file->find('id.article-date', 0)->plaintext;
$item['location'] = $file->find('id.article-location', 0)->plaintext;
$item['price'] = $file->find('div.article', 0)->plaintext;
$files[] = $item;
}
}
print_r($files);
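If the URLs aren't known up front, one option is to read them from a page that links to them and keep only the ones under http://test.com/file/. This is a minimal sketch, assuming such an index page exists (the question doesn't say so). It uses PHP's built-in DOMDocument so it runs on its own, and the function name and the sample HTML are made up for illustration:

```php
<?php
// Hypothetical helper: return every link in $html that starts with $prefix.
function collect_links_with_prefix(string $html, string $prefix): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid, so keep parser warnings quiet.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $urls = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (strncmp($href, $prefix, strlen($prefix)) === 0) {
            $urls[$href] = true; // array keys drop duplicate links
        }
    }
    return array_keys($urls);
}

// Made-up index page standing in for a real listing on the site.
$indexHtml = '<html><body>
<a href="http://test.com/file/1209i0329/">one</a>
<a href="http://test.com/file/120dnkj329/">two</a>
<a href="http://test.com/about/">about</a>
<a href="http://test.com/file/1209i0329/">one again</a>
</body></html>';

$urls = collect_links_with_prefix($indexHtml, 'http://test.com/file/');
print_r($urls);
```

The resulting $urls array could then be passed to the foreach ($urls as $url) loop from the example.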

Related

PHP-MySQL inserts "Array" into a table in a foreach loop with Simple HTML Dom library

I have a piece of code similar to below:
include 'simplehtmldom/simple_html_dom.php';
...
...
foreach ($files as $file){
$results= array();
if(substr($file->getAttribute('href'),0,strlen($lookfor))==$lookfor){
$URLs= $file->getAttribute('href');
echo $URLs ."<br>";
$html = file_get_html($URLs);
foreach($html->find('div.postDisplay') as $post) {
$item['date'] = $post->find('p.id.post-date', 0)->plaintext;
$item['location'] = $post->find('p.id.post-location', 0)->plaintext;
$title = $item['title'] = $post->find('h1.id.post-title', 0)->plaintext;
$item['post'] = $post->find('div.post', 0)->plaintext;
$results[] = $item;
}
print_r($results) ."</br>";
...
...
...
$my_id ="1";
$photos = "1";
$insert_query = mysqli_query($db_connect, "INSERT INTO jackson.data (
my_id, photos, results) VALUES (
'$my_id', '$photos', '$results')");
The code echoes the $results values in the browser perfectly fine; however, when I insert the data into the database, the results field only stores "Array" as its value. Is there something I'm missing? And how can I insert the HTML-formatted $results values that are echoed in my browser rather than the plain text?
You are using print_r, which outputs the array with its indexes; that's why the browser displays the result perfectly. In your insert query, however, you use the variable $results directly, and it fails because $results contains an array. Try something like this:
Change your table structure to
jackson.data (my_id, photos, title, date, location, post)
and put the insert statement inside the foreach loop, inserting the values accordingly.
Example:
foreach($html->find('div.postDisplay') as $post) {
$item['date'] = $post->find('p.id.post-date', 0)->plaintext;
$item['location'] = $post->find('p.id.post-location', 0)->plaintext;
$title = $item['title'] = $post->find('h1.id.post-title', 0)->plaintext;
$item['post'] = $post->find('div.post', 0)->plaintext;
$query=mysqli_query($db_connect,"INSERT INTO jackson.data (
my_id, photos, title,date,location,post) VALUES (
'$my_id', '$photos', '{$item['title']}','{$item['date']}',.....)");
}
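Since scraped text often contains quotes, building the query by string interpolation tends to break (and is open to SQL injection). Here is a sketch of the same per-post insert using a prepared statement. It uses PDO with an in-memory SQLite table purely so the example runs without a server; the question uses mysqli and MySQL, where mysqli_prepare() and mysqli_stmt_bind_param() play the same role. The table and the sample rows are invented:

```php
<?php
// Assumption: in-memory SQLite table standing in for jackson.data.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE data (my_id TEXT, photos TEXT, title TEXT, date TEXT, location TEXT, post TEXT)');

// Invented rows standing in for what the scraping loop would produce.
$results = array(
    array('title' => "Tom's flat", 'date' => '2014-05-01', 'location' => 'Jackson', 'post' => 'Two rooms, "sunny".'),
    array('title' => 'Bike', 'date' => '2014-05-02', 'location' => 'Madison', 'post' => 'Barely used.'),
);

$my_id = '1';
$photos = '1';

// Prepare once, execute once per post; values are bound, not interpolated.
$stmt = $db->prepare(
    'INSERT INTO data (my_id, photos, title, date, location, post)
     VALUES (?, ?, ?, ?, ?, ?)'
);
foreach ($results as $item) {
    $stmt->execute(array($my_id, $photos, $item['title'], $item['date'], $item['location'], $item['post']));
}

$count = (int) $db->query('SELECT COUNT(*) FROM data')->fetchColumn();
$titles = $db->query('SELECT title FROM data ORDER BY rowid')->fetchAll(PDO::FETCH_COLUMN);
echo $count, " rows inserted\n";
```

Because the values are bound as parameters, the apostrophe in the first title and the double quotes in the first post need no manual escaping.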
For html formatting:
Do something like this:
echo "<html><body>";
foreach($html->find('div.postDisplay') as $post) {
$item['date'] = $post->find('p.id.post-date', 0)->plaintext;
$item['location'] = $post->find('p.id.post-location', 0)->plaintext;
$title = $item['title'] = $post->find('h1.id.post-title', 0)->plaintext;
$item['post'] = $post->find('div.post', 0)->plaintext;
$query=mysqli_query($db_connect,"INSERT INTO jackson.data (
my_id, photos, title,date,location,post) VALUES (
'$my_id', '$photos', '{$item['title']}','{$item['date']}',.....)");
echo "<div class=\"my_post\"><h1>".$item['title']."</h1>"."<br />Published:". $item['date']."<br />".$item['location']."<br /><br />".$item['post']."</div>";
}
echo "</body></html>";
In your CSS you can have something like this:
.my_post
{
margin:0 auto; /* centers the contents */
font-weight:bold;
font-family:fontname;
font-size:16px;
color:brown;
padding-top:15px; /* adjusts the gap between two posts */
}
You can use
"<pre>".print_r($results,true)."</pre>"
to store it in the db and display HTML output similar to what the browser shows.
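For what it's worth, the second argument to print_r makes it return the text instead of printing it, which is why this gives you a string you can store. A small sketch with made-up data; the JSON part is my own addition, as an alternative in case the value has to be turned back into an array later:

```php
<?php
// Made-up data in the shape the scraping loop builds.
$results = array(
    array('title' => 'First post', 'date' => '2014-05-01'),
);

// With true as the second argument, print_r returns a string instead of echoing.
$asText = '<pre>' . print_r($results, true) . '</pre>';

// Alternative (not from the answer): JSON can be decoded back into an array.
$asJson = json_encode($results);
$restored = json_decode($asJson, true);

echo $asText, "\n", $asJson, "\n";
```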

Scraping html tables with different number of rows

I'm trying to pull data from this site http://www.citizencorps.fema.gov/cc/CertIndex.do?reportsForState&cert=&state=IN using PHP. Can anyone please tell me why my code below isn't working? Ideally I want to pull the name, point of contact, phone number, email, and brief description (if one exists), then convert that data into a CSV file.
<?php
require_once "support/simple_html_dom.php";
$url = "http://www.citizencorps.fema.gov/cc/CertIndex.do?reportsForState&cert=&state=IN";
$html = file_get_html($url);
foreach($html->find('tr') as $row) {
$name = $row->find('td', 0)->plaintext;
$poc = $row->find('td', 1)->plaintext;
$phone = $row->find('td', 2)->plaintext;
$email = $row->find('td', 3)->plaintext;
if(count($row->find('td', 4)->plaintext) > 0) {
$desc = find('td', 4)->plaintext;
}
print_r($name.'<br/>'. $poc.'<br/>'.$phone.'<br/>'.$email.'<br/>'.$desc);
}
?>
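One way to cope with rows that have a different number of cells is to collect whatever cells a row has and pad the rest, rather than indexing cells that may not exist. This is a sketch under assumptions: it uses PHP's built-in DOMDocument instead of Simple HTML DOM so it runs standalone, and the sample table is invented, not the real FEMA markup:

```php
<?php
// Invented sample; the real page's markup may differ.
$html = '<table>
<tr><th>Name</th><th>POC</th><th>Phone</th><th>Email</th><th>Description</th></tr>
<tr><td>Team A</td><td>Ann Lee</td><td>555-0100</td><td>ann@example.com</td><td>County team</td></tr>
<tr><td>Team B</td><td>Bob Ray</td><td>555-0101</td><td>bob@example.com</td></tr>
</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate sloppy HTML
$doc->loadHTML($html);
libxml_clear_errors();

$rows = array();
foreach ($doc->getElementsByTagName('tr') as $tr) {
    $cells = array();
    foreach ($tr->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->textContent);
    }
    if (count($cells) === 0) {
        continue; // header rows only contain <th> cells
    }
    // Pad short rows so every CSV line has five columns.
    $rows[] = array_pad($cells, 5, '');
}

// Write the CSV to memory here; a real script would fopen() a file instead.
$out = fopen('php://memory', 'w+');
foreach ($rows as $row) {
    fputcsv($out, $row, ',', '"', '\\');
}
rewind($out);
$csv = stream_get_contents($out);
fclose($out);
echo $csv;
```

With Simple HTML DOM the same idea would be to loop over $row->find('td') and count the result, instead of asking for cell 4 directly.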

PHP Simple HTML DOM Parser - Combining Two Arrays

What I am trying to do is scrape a page on TripAdvisor. I have what I need from the first page, and then I do another loop to get the contents from the next page, but when I try to add these details to the existing array, it doesn't work for some reason.
error_reporting(E_ALL);
include_once('simple_html_dom.php');
$html = file_get_html('http://www.tripadvisor.co.uk/Hotels-g186534-c2-Glasgow_Scotland-Hotels.html');
$articles = '';
// Find all article blocks
foreach($html->find('.listing') as $hotel) {
$item['name'] = $hotel->find('.property_title', 0)->plaintext;
$item['link'] = $hotel->find('.property_title', 0)->href;
$item['rating'] = $hotel->find('.sprite-ratings', 0)->alt;
$item['rating'] = explode(' ', $item['rating']);
$item['rating'] = $item['rating'][0];
$articles[] = $item;
}
foreach($articles as $article) {
echo '<pre>';
print_r($article);
echo '</pre>';
$hotel_html = file_get_html('http://www.tripadvisor.co.uk'.$article['link'].'/');
foreach($hotel_html->find('#MAIN') as $hotel_page) {
$article['address'] = $hotel_page->find('.street-address', 0)->plaintext;
$article['extendedaddress'] = $hotel_page->find('.extended-address', 0)->plaintext;
$article['locality'] = $hotel_page->find('.locality', 0)->plaintext;
$article['country'] = $hotel_page->find('.country-name', 0)->plaintext;
echo '<pre>';
print_r($article);
echo '</pre>';
$articles[] = $article;
}
}
echo '<pre>';
print_r($articles);
echo '</pre>';
Here is all the debugging output that I get: http://pastebin.com/J0V9WbyE
URL: http://www.4playtheband.co.uk/scraper/
I would change
$articles = '';
to:
$articles = array();
Before foreach():
$articlesNew = array();
When iterating over the array, insert in the new array
$articlesNew[] = $article;
At the end merge the arrays
$articles = array_merge($articles, $articlesNew);
Source: http://php.net/manual/en/function.array-merge.php for more on merging and combining arrays in PHP.
I have never tried to alter an array while iterating over it in PHP, but doing that improperly with C++ collections would crash the program unless you handle the resulting exceptions. My guess is that you shouldn't alter the array while iterating over it. I would never do that myself; work with another variable.
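To add to this: a plain foreach hands you a copy of each element, so changing $article inside the loop doesn't touch $articles, and $articles[] = $article appends a second entry rather than updating the first one. Here is a sketch of writing the extra fields back by key, with hard-coded data and a stub function standing in for the scraped values:

```php
<?php
// Hard-coded stand-ins for the first-pass scrape.
$articles = array(
    array('name' => 'Hotel One', 'link' => '/Hotel_Review-1'),
    array('name' => 'Hotel Two', 'link' => '/Hotel_Review-2'),
);

// Hypothetical stand-in for fetching and parsing each hotel page.
function fetch_address(string $link): array
{
    return array('address' => '1 Example St ' . $link, 'country' => 'Scotland');
}

// Use the key to write back into the original array instead of appending.
foreach ($articles as $key => $article) {
    $details = fetch_address($article['link']);
    $articles[$key] = array_merge($article, $details);
}

print_r($articles);
```

This keeps one entry per hotel, each carrying both the listing fields and the address fields.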

Loading content from remote site doesn't work, but why?

I'm still working on this catalogue for a client, which loads images from a remote site via PHP and the Simple DOM Parser.
// Code excerpt from http://internetvolk.de/fileadmin/template/res/scrape.php, this is just one case of a select
$subcat = $_GET['subcat'];
$url = "http://pinesite.com/meubelen/index.php?".$subcat."&lang=de";
$html = file_get_html(html_entity_decode($url));
$iframe = $html->find('iframe',0);
$url2 = $iframe->src;
$html->clear();
unset($html);
$fullurl = "http://pinesite.com/meubelen/".$url2;
$html2 = file_get_html(html_entity_decode($fullurl));
$pagecount = 1;
$titles = $html2->find('.tekst');
$images = $html2->find('.plaatje');
$output=array();
$i=0;
foreach ($images as $image) {
$item['title'] = $titles[$i]->find('p',0)->plaintext;
$imagePath = $image->find('img',0)->src;
$item['thumb'] = resize("http://pinesite.com".str_replace('thumb_','',$imagePath),array("w"=>225, "h"=>162));
$item['image'] = 'http://pinesite.com'.str_replace('thumb_','',$imagePath);
$fullurl2 = "http://pinesite.com/meubelen/prog/showpic.php?src=".str_replace('thumb_','',$imagePath)."&taal=de";
$html3 = file_get_html($fullurl2);
$item['size'] = str_replace(' ','',$html3->find('td',1)->plaintext);
unset($html3);
$output[] = $item;
$i++;
}
if (count($html2->find('center')) > 1) {
// ok, multi-page here, let's find out how many there are
$pagecount = count($html2->find('center',0)->find('a'))-1;
for ($i=1;$i<$pagecount; $i++) {
$startID = $i*20;
$newurl = html_entity_decode($fullurl."&beginrec=".$startID);
$html3 = file_get_html($newurl);
$titles = $html3->find('.tekst');
$images = $html3->find('.plaatje');
$a=0;
foreach ($images as $image) {
$item['title'] = $titles[$a]->find('p',0)->plaintext;
$item['image'] = 'http://pinesite.com'.str_replace('thumb_','',$image->find('img',0)->src);
$item['thumb'] = resize($item['image'],array("w"=>225, "h"=>150));
$output[] = $item;
$a++;
}
$html3->clear();
unset ($html3);
}
}
echo json_encode($output);
So what it should do (and does with some categories): output the images, the titles and the thumbnails from this page: http://pinesite.com
This works, for example, if you pass it a "?function=images&subcat=antiek", but not if you pass it a "?function=images&subcat=stoelen". I don't even think it's a problem with the remote page, so there has to be an error in my code.
Ehm..trying to state the obvious maybe but 'stoele'?
As it turns out, my code was completely fine; a missing space in the HTML of the remote site kept the Simple HTML DOM parser from recognizing the iframe I was looking for. I fixed it on my end by running a str_replace on the HTML first to replace the faulty markup.
I know it's a dirty solution, but it works :)
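The workaround boils down to fetching the raw HTML, patching it, and only then handing it to the parser. A rough sketch follows; the broken snippet is invented (the post doesn't show the actual faulty markup), and with Simple HTML DOM the repaired string would go to str_get_html() rather than calling file_get_html() on the URL directly:

```php
<?php
// Hypothetical repair: put back a space missing between two attributes.
function repair_markup(string $html): string
{
    return str_replace('"src=', '" src=', $html);
}

// Invented example of the kind of markup that can confuse a parser.
$raw = '<iframe name="content"src="prog/list.php?cat=stoelen"></iframe>';
$fixed = repair_markup($raw);

echo $fixed, "\n";
```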

PHP Simple DOM Parser to Scrape From Multiple URLs

Is it possible to use a foreach loop to scrape multiple URLs from an array? I've been trying, but for some reason it only pulls from the first URL in the array and then shows the results.
include_once('../../simple_html_dom.php');
$link = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($link as $links) {
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value; }
// get title
$ret['ASIN'] = end($values);
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] =$html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$ret = scraping_IMDB($links);
foreach($ret as $k=>$v)
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
Here is the code since the comment part didn't work. :) It's very dirty because I just edited one of the examples to play with it to see if I could get it to do what I wanted.
include_once('../../simple_html_dom.php');
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
// What is this spaghetti code good for?
/*
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value;
}
// get title
$ret['ASIN'] = end($values);
*/
foreach($html->find('input') as $element) {
if($element->id == 'ASIN') {
$ret['ASIN'] = $element->value;
}
}
// Or you could use the following instead of the whole foreach loop above
//
// $ret['ASIN'] = $html->find('input[id="ASIN"]', 0)->value;
//
// The 0 asks for the first match only.
// I just had a look at Amazon's source code, and it contains
// 2 HTML tags with id='ASIN'. If they were following the HTML rules,
// there should only be ONE element with a specific id.
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
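This should do the trick. In the original, the function was declared inside the foreach loop, so PHP tried to declare it again on the second iteration, which is a fatal error; that is why only the first URL was processed.
I have renamed the array to 'links' instead of 'link'. It's an array of links containing individual link(s), so foreach($link as $links) seemed wrong, and I changed it to foreach($links as $link).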
I really need to ask this question, as the answer will probably help a lot of people who read this thread. What if ... you used articles, like the examples on the Simple HTML DOM site?
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
return $ret;
}
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
What if it's $articles?
$articles[] = $item;
}
//print_r($articles);
$links = array (
'http://link1.com',
'http://link2.com',
'http://link3.com'
);
What would this area look like?
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
I've seen this multiple-links question all over Stack Overflow for the past 2 years, and I still cannot figure it out. It would be great to get a basic handle on it, in the style of the Simple HTML DOM examples.
Thanks.
First time posting; I'm sure I broke a bunch of rules and didn't do the code section right. I just had to ask this question badly.
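For the $articles variant, one possible shape is a function that returns a list of items for a single page, with the loop over the links collecting them into one array. This is a sketch only: the pages are invented strings standing in for what file_get_html($link) would fetch, the function name is made up, and it uses PHP's built-in DOMDocument so it runs without the library:

```php
<?php
// Invented pages keyed by URL, standing in for file_get_html($link).
$pages = array(
    'http://link1.com' => '<div class="article"><h2>A1</h2></div><div class="article"><h2>A2</h2></div>',
    'http://link2.com' => '<div class="article"><h2>B1</h2></div>',
);

// Hypothetical per-page scraper: returns a list of items found on one page.
function scrape_articles(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $articles = array();
    foreach ($doc->getElementsByTagName('h2') as $h2) {
        $item = array('title' => trim($h2->textContent));
        $articles[] = $item;
    }
    return $articles;
}

// Loop over the links and collect every item into one flat array.
$all = array();
foreach ($pages as $link => $html) {
    foreach (scrape_articles($html) as $item) {
        $item['source'] = $link; // remember which page it came from
        $all[] = $item;
    }
}
print_r($all);
```

The point is the same as in the answer above: declare the function once, outside the loop, and let the loop only call it and gather the results.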
