What I am trying to do is scrape a page on TripAdvisor. I have what I need from the first page, and then I run another loop to get the contents from the next page, but when I try to add these details to the existing array it doesn't work for some reason.
error_reporting(E_ALL);
include_once('simple_html_dom.php');
$html = file_get_html('http://www.tripadvisor.co.uk/Hotels-g186534-c2-Glasgow_Scotland-Hotels.html');
$articles = '';
// Find all article blocks
foreach($html->find('.listing') as $hotel) {
$item['name'] = $hotel->find('.property_title', 0)->plaintext;
$item['link'] = $hotel->find('.property_title', 0)->href;
$item['rating'] = $hotel->find('.sprite-ratings', 0)->alt;
$item['rating'] = explode(' ', $item['rating']);
$item['rating'] = $item['rating'][0];
$articles[] = $item;
}
foreach($articles as $article) {
echo '<pre>';
print_r($article);
echo '</pre>';
$hotel_html = file_get_html('http://www.tripadvisor.co.uk'.$article['link'].'/');
foreach($hotel_html->find('#MAIN') as $hotel_page) {
$article['address'] = $hotel_page->find('.street-address', 0)->plaintext;
$article['extendedaddress'] = $hotel_page->find('.extended-address', 0)->plaintext;
$article['locality'] = $hotel_page->find('.locality', 0)->plaintext;
$article['country'] = $hotel_page->find('.country-name', 0)->plaintext;
echo '<pre>';
print_r($article);
echo '</pre>';
$articles[] = $article;
}
}
echo '<pre>';
print_r($articles);
echo '</pre>';
Here is all the debugging output that I get: http://pastebin.com/J0V9WbyE
URL: http://www.4playtheband.co.uk/scraper/
I would change
$articles = '';
to:
$articles = array();
Before foreach():
$articlesNew = array();
When iterating over the array, insert in the new array
$articlesNew[] = $article;
At the end merge the arrays
$articles = array_merge($articles, $articlesNew);
Source: see http://php.net/manual/en/function.array-merge.php for more on merging and combining arrays in PHP.
I have never tried to alter an array while iterating over it in PHP, but if you did this improperly with C++ collections it would crash unless you handled the resulting errors. My wild guess is that you shouldn't alter the array while iterating over it. I know I would never do that. Work with another variable.
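As an alternative to collecting into a second array, the entries can be updated in place by key. The sketch below is not the original scraper: the sample data is made up and `fetch_details()` is a hypothetical stand-in for the detail-page lookup. It relies on `foreach` by value iterating over a snapshot of the array, so writing to `$articles[$key]` inside the loop does not disturb the iteration.

```php
<?php
// Made-up data standing in for the first-page results.
$articles = array(
    array('name' => 'Hotel A', 'link' => '/a'),
    array('name' => 'Hotel B', 'link' => '/b'),
);

// Hypothetical stand-in for scraping the detail page of one hotel.
function fetch_details($link) {
    return array('address' => 'Address for ' . $link);
}

foreach ($articles as $key => $article) {
    // Overwrite the existing entry instead of appending a new one,
    // so the array keeps one (now extended) entry per hotel.
    $articles[$key] = array_merge($article, fetch_details($article['link']));
}

print_r($articles);
```

With this shape the array still has two entries after the loop, each carrying both the first-page fields and the detail fields.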
Related
I am trying to get the data in <div id="listing-page-cart-inner">, <div id="description text"> and <div id="tags">, but I am finding it difficult to extract the data.
Can anyone guide me? I am able to scrape the first div I mentioned, but not the other divs. When I loop through the second foreach, it takes longer.
<?php
include_once('simple_html_dom.php');
$html = file_get_html('https://etsy.com/listing/107492702/');
//$val = $html->find('div[id=listing-page-cart-inner]');
function scraping_etsy() {
// create HTML DOM
$html = file_get_html('https://etsy.com/listing/107492702/');
foreach($html->find('div[id=listing-page-cart-inner]') as $article)
{
// get title
//$item['title'] = trim($article->find('h3', 0)->plaintext);
// get details
$item['details'] = trim($article->find('span', 0)->plaintext);
// get intro
//$lists = $articles->find('div[id=item-overview]');
$item['list1'] = trim($article->find('li',0)->plaintext);
$item['list2'] = trim($article->find('li',1)->plaintext);
$item['list3'] = trim($article->find('li',2)->plaintext);
$item['list4'] = trim($article->find('li',3)->plaintext);
$item['list5'] = trim($article->find('li',4)->plaintext);
/*foreach($article->find('li') as $al){
$item['lists'] =trim($al->find('li')->plaintext);
}*/
$ret[] = $item;
}
foreach($html->find('div[id=description]') as $content){
var_dump($content->find('text'));
// $item['content'] = trim($content->find('div[id=description]')->plaintext);
// $ret[] = $item;
}
// clean up memory
$html->clear();
unset($html);
return $ret ;
}
$ret = scraping_etsy();
var_dump($ret);
/*foreach($ret as $v) {
echo $v['title'].'<br>';
echo '<ul>';
echo '<li>'.$v['details'].'</li>';
echo '<li>Diggs: '.$v['diggs'].'</li>';
echo '</ul>';
}*/
?>
As for getting the children of those divs: once you have found the parent element, always pass an index, as in ->find('<the selector here>', 0), so that you point at that single element rather than at an array of matches.
$html = file_get_html('https://etsy.com/listing/107492702/');
// listings with description
$div = $html->find('div#listing-page-cart-inner', 0); // here index zero
$main_description = $div->find('h1', 0)->innertext;
echo $main_description . '<br/><br/>';
$div_item_overview = $div->find('div#item-overview ul.properties li');
foreach ($div_item_overview as $overview) {
echo $overview->innertext . '<br/>';
}
// tags
$div_tag = $html->find('div#tags', 0); // here index zero pointing to that element
$tags = array();
foreach($div_tag->find('ul li') as $li) {
$tags[] = $li->find('a', 0)->innertext;
}
echo '<pre>', print_r($tags, 1), '</pre>';
// description
$div_description = $html->find('div#description', 0)->plaintext; // here pointing to index zero
echo $div_description;
The easiest way to start is always to use a third-party library, e.g. Symfony DomCrawler.
Its usage is as easy as
use Symfony\Component\DomCrawler\Crawler;
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<p class="message">Hello World!</p>
<p>Hello Crawler!</p>
</body>
</html>
HTML;
$crawler = new Crawler($html);
foreach ($crawler as $domElement) {
print $domElement->nodeName;
}
And you can use filters like
$crawler = $crawler->filter('body > p');
I have a piece of code similar to below:
include 'simplehtmldom/simple_html_dom.php';
...
...
foreach ($files as $file){
$results= array();
if(substr($file->getAttribute('href'),0,strlen($lookfor))==$lookfor){
$URLs= $file->getAttribute('href');
echo $URLs ."<br>";
$html = file_get_html($URLs);
foreach($html->find('div.postDisplay') as $post) {
$item['date'] = $post->find('p.id.post-date', 0)->plaintext;
$item['location'] = $post->find('p.id.post-location', 0)->plaintext;
$title = $item['title'] = $post->find('h1.id.post-title', 0)->plaintext;
$item['post'] = $post->find('div.post', 0)->plaintext;
$results[] = $item;
}
print_r($results) ."</br>";
...
...
...
$my_id ="1";
$photos = "1";
$insert_query = mysqli_query($db_connect, "INSERT INTO jackson.data (
my_id, photos, results) VALUES (
'$my_id', '$photos', '$results')");
The code echoes the $results values in the browser perfectly fine; however, when I insert the data into the database, the results field only stores "Array" as its value. Is there something I'm missing? And how can I insert the HTML-formatted $results values that are echoed in my browser, rather than the plain text?
You are using print_r, which outputs the array with its indexes, and that's why the browser displays the result perfectly. I think the insert fails because you are using the variable $results directly in your query, and it contains an array. Try something like this:
Change your table structure to
jackson.data (my_id, photos, title,date,location,post)
and put the insert statement into the foreach loop and insert the values accordingly.
Example
foreach($html->find('div.postDisplay') as $post) {
$item['date'] = $post->find('p.id.post-date', 0)->plaintext;
$item['location'] = $post->find('p.id.post-location', 0)->plaintext;
$title = $item['title'] = $post->find('h1.id.post-title', 0)->plaintext;
$item['post'] = $post->find('div.post', 0)->plaintext;
$query=mysqli_query($db_connect,"INSERT INTO jackson.data (
my_id, photos, title,date,location,post) VALUES (
'$my_id', '$photos', '{$item['title']}', '{$item['date']}', .....)");
}
For html formatting:
Do something like this:
echo "<html><body>";
foreach($html->find('div.postDisplay') as $post) {
$item['date'] = $post->find('p.id.post-date', 0)->plaintext;
$item['location'] = $post->find('p.id.post-location', 0)->plaintext;
$title = $item['title'] = $post->find('h1.id.post-title', 0)->plaintext;
$item['post'] = $post->find('div.post', 0)->plaintext;
$query=mysqli_query($db_connect,"INSERT INTO jackson.data (
my_id, photos, title,date,location,post) VALUES (
'$my_id', '$photos', '{$item['title']}', '{$item['date']}', .....)");
echo "<div class=\"my_post\"><h1>".$item['title']."</h1>"."<br />Published:". $item['date']."<br />".$item['location']."<br /><br />".$item['post']."</div>";
}
echo "</body></html>";
In your CSS you can have something like this:
.my_post
{
margin:0 auto; /* centers the contents */
font-weight:bold;
font-family:fontname;
font-size:16px;
color:brown;
padding-top:15px; /* adjusts the gap between two posts */
}
You can use
"<pre>".print_r($results,true)."</pre>"
to build a string to store in the database; it will display as HTML output similar to what the browser shows.
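A small sketch of that idea follows, with made-up sample data in place of the scraped values. It only shows that `print_r()` with `true` as its second argument returns the text instead of printing it; the `htmlspecialchars()` call is an addition of mine so that markup inside the scraped text is not interpreted by the browser. Note that this stores a debug dump, not structured data.

```php
<?php
// Made-up sample data standing in for the scraped $results.
$results = array(
    array('title' => 'First post', 'date' => '2014-05-01'),
);

// With the second argument set to true, print_r() returns the dump
// as a string, so it can be stored like any other text value.
$asText = '<pre>' . htmlspecialchars(print_r($results, true)) . '</pre>';

echo $asText;
```

The resulting `$asText` is an ordinary string, so it can go into a text column without turning into the literal word "Array".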
I used the below code and successfully collected the data from a specific page as follows:
include 'simplehtmldom/simple_html_dom.php';
$html = file_get_html('http://test.com/file/1209i0329/');
// Find all article blocks
foreach($html->find('div.Content') as $file) {
$item['date'] = $file->find('id.article-date', 0)->plaintext;
$item['location'] = $file->find('id.article-location', 0)->plaintext;
$item['price'] = $file->find('div.article', 0)->plaintext;
$files[] = $item;
}
print_r($files);
The code works well for http://test.com/file/1209i0329.php, but my goal is to collect data from all pages starting with http://test.com/file/ on this domain (for example, http://test.com/file/1209i0329/, http://test.com/file/120dnkj329/, etc.). Is there a way to do this using simple_html_dom?
I don't know where the files you want to search are (same domain or outside), but you may need to loop over an array containing the URLs you want to scrape.
Consider this example:
include 'simplehtmldom/simple_html_dom.php';
// most likely this process will take some time
$files = array();
$urls = array(
'http://test.com/file/1209i0329/',
'http://test.com/file/120dnkj329/',
'http://en.wikipedia.org/wiki/',
);
foreach($urls as $url) {
$html = file_get_html($url);
// Find all article blocks
foreach($html->find('div.Content') as $file) {
$item['date'] = $file->find('id.article-date', 0)->plaintext;
$item['location'] = $file->find('id.article-location', 0)->plaintext;
$item['price'] = $file->find('div.article', 0)->plaintext;
$files[] = $item;
}
}
print_r($files);
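If the URLs are not known in advance, one option is to build the `$urls` array from the links on some index page, keeping only those that start with the wanted prefix. The sketch below assumes such an index page exists; the inline HTML is made up, and it uses PHP's built-in DOMDocument rather than simple_html_dom so that it runs on its own.

```php
<?php
// Hypothetical index page; in practice this would be fetched HTML.
$indexHtml = '<html><body>
<a href="http://test.com/file/1209i0329/">one</a>
<a href="http://test.com/about/">about</a>
<a href="http://test.com/file/120dnkj329/">two</a>
</body></html>';

$prefix = 'http://test.com/file/';

$dom = new DOMDocument();
// Suppress the warnings that real-world, invalid HTML tends to trigger.
@$dom->loadHTML($indexHtml);

$urls = array();
foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    // Keep only links that start with the prefix.
    if (strpos($href, $prefix) === 0) {
        $urls[] = $href;
    }
}

// Drop duplicates and reindex before looping over the list.
$urls = array_values(array_unique($urls));

print_r($urls);
```

The resulting `$urls` can then be fed to the `foreach($urls as $url)` loop shown above.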
I am creating a scraper for an automoto site. First I want to get all manufacturers, and after that all the model links for each manufacturer, but with the code below I get only the first model on the list. Why?
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.auto-types.com');
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//li[@class='clearfix_center']/a/@href");
$output = array();
foreach($entries as $e) {
$dom2 = new DOMDocument();
@$dom2->loadHTMLFile('http://www.auto-types.com' . $e->textContent);
$xpath2 = new DOMXPath($dom2);
$data = array();
$data['newLinks'] = trim($xpath2->query("//div[@class='modelImage']/a/@href")->item(0)->textContent);
$output[] = $data;
}
echo '<pre>' . print_r($output, true) . '</pre>';
?>
So I need to get mercedes/100, mercedes/200, mercedes/300, but with my script I only get the first link, mercedes/100.
Please help.
You need to iterate through the results instead of just taking the first item:
$items = $xpath2->query("//div[@class='modelImage']/a/@href");
$links = array();
foreach($items as $item) {
$links[] = $item->textContent;
}
$data['newLinks'] = implode(', ', $links);
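To see the difference between `item(0)` and iterating without hitting the real site, here is a self-contained sketch. The inline markup is made up to imitate a model listing page; only the class name and the XPath expression come from the question.

```php
<?php
// Made-up markup imitating a model listing page.
$html = '<html><body>
<div class="modelImage"><a href="/mercedes/100">100</a></div>
<div class="modelImage"><a href="/mercedes/200">200</a></div>
<div class="modelImage"><a href="/mercedes/300">300</a></div>
</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// The query returns a DOMNodeList holding every matching href attribute.
$nodes = $xpath->query("//div[@class='modelImage']/a/@href");

// item(0) is only the first match in that list.
$first = $nodes->item(0)->textContent;

// Iterating collects all of the matches.
$links = array();
foreach ($nodes as $node) {
    $links[] = $node->textContent;
}

echo $first, "\n";
echo implode(', ', $links), "\n";
```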
Is it possible to use a foreach loop to scrape multiple URLs from an array? I've been trying, but for some reason it only pulls from the first URL in the array and then shows the results.
include_once('../../simple_html_dom.php');
$link = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($link as $links) {
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value; }
// get title
$ret['ASIN'] = end($values);
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] =$html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$ret = scraping_IMDB($links);
foreach($ret as $k=>$v)
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
Here is the code since the comment part didn't work. :) It's very dirty because I just edited one of the examples to play with it to see if I could get it to do what I wanted.
include_once('../../simple_html_dom.php');
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
// What is this spaghetti code good for?
/*
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value;
}
// get title
$ret['ASIN'] = end($values);
*/
foreach($html->find('input') as $element) {
if($element->id == 'ASIN') {
$ret['ASIN'] = $element->value;
}
}
// Or you could use the following instead of the whole foreach loop above
//
// $ret['ASIN'] = $html->find('input[id="ASIN"]', 0)->value;
//
// assuming the 0 means "return the first match" or something similar.
// I just had a look at Amazon's source code, and it contains
// 2 HTML tags with id='ASIN'. If they were following the HTML rules,
// there would be only ONE element with a specific id.
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
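This should do the trick.
The function is now declared once, outside the loop. Declaring a named function inside the foreach body makes PHP try to redeclare it on the second iteration, which is a fatal error, so only the first URL was ever processed. I have also renamed the array to 'links' instead of 'link'. It's an array of links, containing link(s); foreach($link as $links) therefore seemed wrong, and I changed it to foreach($links as $link).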
I really need to ask this question, as the answer will settle many more questions for everyone who reads this thread. What if you collected articles into an array, as in the examples on the simple html dom site?
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
return $ret;
}
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
What if it's $articles?
$articles[] = $item;
}
//print_r($articles);
$links = array (
'http://link1.com',
'http://link2.com',
'http://link3.com'
);
What would this area look like?
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
I've seen this multiple-links question all over Stack Overflow for the past two years, and I still cannot figure it out. It would be great to get a basic handle on it, in the style of the simple html dom examples.
Thanks.
This is my first time posting; I'm sure I broke a bunch of rules and didn't format the code section right. I just really had to ask this question.
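One possible shape for that loop, as a sketch only: have the scraping function return an array of articles for one page, and merge each page's result into a single list. Everything here is hypothetical: the `$pages` array stands in for fetched HTML, `scrape_articles()` and the `article` class are invented for illustration, and built-in DOMDocument is used so the example runs without simple_html_dom. With simple_html_dom, the function body would use file_get_html() and find() instead.

```php
<?php
// Hypothetical pages keyed by URL, standing in for fetched HTML.
$pages = array(
    'http://link1.com' => '<div class="article"><h2>A1</h2></div>'
                        . '<div class="article"><h2>A2</h2></div>',
    'http://link2.com' => '<div class="article"><h2>B1</h2></div>',
);

// Return one item per article block found in the given HTML.
function scrape_articles($html, $url) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $articles = array();
    foreach ($xpath->query("//div[@class='article']") as $block) {
        // Look for the heading inside this block only.
        $title = $xpath->query('.//h2', $block)->item(0);
        $articles[] = array(
            'source' => $url,
            'title'  => $title ? trim($title->textContent) : null,
        );
    }
    return $articles;
}

// Collect the articles from every page into one flat array.
$all = array();
foreach ($pages as $url => $html) {
    $all = array_merge($all, scrape_articles($html, $url));
}

print_r($all);
```

Keeping the function declaration outside the loop and merging its return value means each link contributes its own articles to `$all`.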