I am trying to parse a movie website that uses pagination. I want to parse all movie items on page 1 and, once that is done, have the parser continue with the next page. I wrote a parser that works, but it does not parse all the movie items on a page and does not continue to the next page. I want to detect when parsing of one result is finished and move on to the next item, then detect when all movie items are parsed and move on to the next page. I expect the parser to display the movie title, year, etc. one by one and then continue with the next page. Currently it displays only one movie item from page 1 and then stops. Here's my code and an example:
Parsing Example: http://minerbitco.in/parse/parse.php
<?php
include_once 'simple_html_dom.php';
$page = (!isset($_GET['page'])) ? 1 : $_GET['page'];
echo '<br> Parsing Page #'.$page.'<br><br>';
$html = file_get_html('https://srulad.com/movies/type/movie#page-'.$page);
$obj = $html->find('div.movie_item');
$datas = [];
if($obj){
foreach ($obj as $key => $data) {
$movie_url = 'https://srulad.com/'.$data->find('div.poster a', 0)->href;
$html2 = file_get_html($movie_url);
$item['url'] = $movie_url;
$item['year'] = $html2->find('#movie_content > div', 0)->children(2)->find('div', 0)->children(0)->children(1)->plaintext;
$item['genre'] = $html2->find('#movie_content > div', 0)->children(1)->find('span', 0)->plaintext;
$item['description'] = $html2->find('#movie_content > div', 0)->children(1)->find('div.plot', 0)->plaintext;
$item['imdb_rating'] = $html2->find('#movie_content > div', 0)->children(2)->find('div', 0)->children(1)->children(1)->find('span', 0)->plaintext;
$item['englishtitle'] = $html2->find('#movie_content > div', 0)->children(1)->find('h2.newmt', 0)->plaintext;
$item['geotitle'] = $html2->find('#movie_content > div', 0)->children(1)->find('h3.newmt', 0)->plaintext;
$item['poster'] = $html2->find('#movie_content > div', 0)->children(0)->find('img', 0)->src;
$url = $item['url'];
$year = $item['year'];
$desc = $item['description'];
$rating = $item['imdb_rating'];
$poster = $item['poster'];
$engtitle = $item['englishtitle'];
$geotitle = $item['geotitle'];
$genre = $item['genre'];
}}
if ($data === end($obj)) {
echo '<META http-equiv="refresh" content="10;URL=#page-'.($page+1).'">';
}
else {
echo "dasrulebulia.";
}
echo 'URL: '.$url.'<br>';
echo 'Poster URL: '.$poster.'<br>';
echo 'Title in English: '.$engtitle.'<br>';
echo 'Title in Georgian: '.$geotitle.'<br>';
echo 'Year:'.$year.'<br>';
echo 'Genre:'.$genre.'<br>';
echo 'Description:'.$desc.'<br>';
echo 'Rating:'.$rating.'<br>';
?>
You can give the parser I have written a try:
https://github.com/sachinsinghshekhawat/simple-html-dom-parser-php
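One thing worth checking in the question's code: the page number is placed after a #, and a URL fragment is not sent as part of the HTTP request, so every request likely returns the same first page. The echo statements also sit after the foreach, so only the last parsed movie is printed. Below is a minimal sketch of the loop structure, assuming the site accepts the page number in a form the server can see. The scrapeAllPages function, the callback and the fake pages are hypothetical stand-ins for the real file_get_html calls.

```php
<?php
// Walk pages 1, 2, 3, ... until a page yields no items or $maxPages is hit.
// $fetchItems is a hypothetical callback: it receives a page number and
// returns the list of items found on that page (an empty array when done).
function scrapeAllPages(callable $fetchItems, int $maxPages = 50): array
{
    $all = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $items = $fetchItems($page);
        if (count($items) === 0) {
            break; // no more results: stop paginating
        }
        foreach ($items as $item) {
            // Handle each item inside the loop, not after it.
            $all[] = ['page' => $page, 'title' => $item];
        }
    }
    return $all;
}

// Stand-in data instead of real HTTP requests.
$fakeSite = [1 => ['Movie A', 'Movie B'], 2 => ['Movie C']];
$movies = scrapeAllPages(function (int $page) use ($fakeSite): array {
    return $fakeSite[$page] ?? [];
});

foreach ($movies as $movie) {
    echo 'Page ' . $movie['page'] . ': ' . $movie['title'] . "\n";
}
```

In a real run the callback would download and parse one listing page; a short sleep between requests is usually kinder to the target site than a meta refresh.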
I get
main(): Node no longer exists
while trying to display images (media:content) from an RSS URL. I have been trying to solve it without success. Can someone explain why this error appears and help me fix it? The code works well when I parse RSS feed URLs with a .xml extension, but not URLs with a .cms extension. I don't know why this is happening or what is wrong with my code.
My View
foreach ($data1 as $key1 => $value1) {
$image=$data1[$key1]['image']->children('http://search.yahoo.com/mrss/')->content->attributes()->url;
echo '<p>'.$image.'</p>';
echo '<div class="col-lg-6">';
echo '<img src="'.$image.'" /><strong style="color:pink;">'.$data1[$key1]["title"].'</strong>';
echo '<p>'.$data1[$key1]["description"].'</p>';
echo '<p>'.$data1[$key1]["pubDate"].'</p>';
echo '</div>';
}
My Controller
foreach($urls as $key => $url){
$xml = simplexml_load_file($url);
$result[$key]['title'] = $xml->channel->title;
$data = [];
for ($i=0; $i<5 ; $i++) {
# code...
$items = [];
if($xml->channel->item[$i] == null)
{
break;
}
$items['title'] = $xml->channel->item[$i]->title;
$items['link'] = $xml->channel->item[$i]->link;
$items['description'] = $xml->channel->item[$i]->description;
$items['pubDate'] = $xml->channel->item[$i]->pubDate;
$items['image'] = $xml->channel->item[$i];
$data[$i] = $items;
}
$result[$key]['data'] = $data;
//$entries = array_merge($worldFeed,$entries);
}
return View::make('index')->with('result',$result);
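A possible cause, offered as a guess rather than a diagnosis: SimpleXML raises "Node no longer exists" when a method such as attributes() is called on an element that is not present, which would happen for any item that has no media:content child. Below is a minimal sketch of a guard, using an inline feed as made-up sample data and a hypothetical helper named itemImageUrl.

```php
<?php
// Return the media:content URL of an RSS item, or null when the item has none.
function itemImageUrl(SimpleXMLElement $item): ?string
{
    $media = $item->children('http://search.yahoo.com/mrss/');
    if (!isset($media->content)) {
        return null; // no media:content element on this item
    }
    $url = (string) $media->content->attributes()->url;
    return $url !== '' ? $url : null;
}

// Made-up feed: the first item has an image, the second does not.
$feed = <<<'XML'
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item><title>With image</title><media:content url="http://example.com/a.jpg"/></item>
    <item><title>Without image</title></item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($feed);
$urls = [];
foreach ($xml->channel->item as $item) {
    $urls[] = itemImageUrl($item);
}
var_dump($urls);
```

In the view, the image tag would then be printed only when the helper returns a non-null value.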
Currently, I am able to scrape the content from my desired website without any problems, but if you view my demo, you can see that my array only ever contains The Source line, and nothing I change fixes it.
$page = (isset($_GET['p'])&&$_GET['p']!=0) ? (int) $_GET['p'] : '';
$html = file_get_html('http://screenrant.com/movie-news/'.$page);
foreach($html->find('#site-top ul h2 a') as $element)
{
print '<br><br>';
echo $url = ''.$element->href;
$html2 = file_get_html($url);
print '<br><br>';
$image = $html2->find('meta[property=og:image]',0);
print $news['image'] = $image->content;
print '<br><br>';
// Ending The Featured Image
$title = $html2->find(' header > h1',0);
print $news['title'] = $title->plaintext;
print '<br>';
// Ending the titles
print '<br>';
$articles = $html2->find('div.top-content > article > p');
foreach ($articles as $article) {
echo "$article->plaintext<p>";
}
$news['content'] = $article->plaintext;
print '<br><br>';
#post> div:nth-child(2) > header > p > time
$date = $html2->find('header > p > time',0);
$news['date'] = $date->plaintext;
$dexp = explode(', ',$date);
print $date = $dexp[0].', '.$dexp[1];
print '<br><br>';
$genre = "news";
print '<br>';
mysqli_query($DB,"INSERT INTO `wp_scraped_news` SET
`hash` = '".$news['title']."',
`title` = '".$news['title']."',
`image` = '".$news['image']."',
`content` = '".$news['content']."'");
print '<pre>';print_r($news);print '</pre>';
}
Currently using simple_html_dom.php to scrape.
If you take a look at this piece of code:
$articles = $html2->find('div.top-content > article > p');
foreach ($articles as $article) {
echo "$article->plaintext<p>";
//This is printing the article content line by line
}
$news['content'] = $article->plaintext;
//This is grabbing the last line of the article content AKA the source
//The last <p> as it's not in the foreach.
Effectively, you need to append to the content inside the loop:
$articles = $html2->find('div.top-content > article > p');
$news['content'] = ''; // start empty so the first append does not read an undefined index
foreach ($articles as $article) {
echo "$article->plaintext<p>";
$news['content'] .= $article->plaintext . "<p>";
}
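A small self-contained sketch of the same accumulate-inside-the-loop idea, with plain strings standing in for the $article->plaintext values. The sample paragraphs and the joinParagraphs name are made up for illustration.

```php
<?php
// Append every paragraph to one string, each followed by a <p> separator.
function joinParagraphs(array $paragraphs): string
{
    $content = '';
    foreach ($paragraphs as $text) {
        $content .= trim($text) . '<p>';
    }
    return $content;
}

$paragraphs = ['First paragraph.', 'Second paragraph.', 'Source: example'];
$news = ['content' => joinParagraphs($paragraphs)];
echo $news['content'];
```

Because the string is built inside the loop, the last paragraph no longer overwrites the earlier ones.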
I am trying to get the data in <div id="listing-page-cart-inner">, <div id="description text"> and <div id="tags">, but I am finding it difficult to extract.
Can anyone guide me? I am able to scrape the first div I mentioned, but not the others. Also, looping through the second foreach takes a long time.
<?php
include_once('simple_html_dom.php');
$html = file_get_html('https://etsy.com/listing/107492702/');
//$val = $html->find('div[id=listing-page-cart-inner]');
function scraping_etsy() {
// create HTML DOM
$html = file_get_html('https://etsy.com/listing/107492702/');
foreach($html->find('div[id=listing-page-cart-inner]') as $article)
{
// get title
//$item['title'] = trim($article->find('h3', 0)->plaintext);
// get details
$item['details'] = trim($article->find('span', 0)->plaintext);
// get intro
//$lists = $articles->find('div[id=item-overview]');
$item['list1'] = trim($article->find('li',0)->plaintext);
$item['list2'] = trim($article->find('li',1)->plaintext);
$item['list3'] = trim($article->find('li',2)->plaintext);
$item['list4'] = trim($article->find('li',3)->plaintext);
$item['list5'] = trim($article->find('li',4)->plaintext);
/*foreach($article->find('li') as $al){
$item['lists'] =trim($al->find('li')->plaintext);
}*/
$ret[] = $item;
}
foreach($html->find('div[id=description]') as $content){
var_dump($content->find('text'));
// $item['content'] = trim($content->find('div[id=description]')->plaintext);
// $ret[] = $item;
}
// clean up memory
$html->clear();
unset($html);
return $ret ;
}
$ret = scraping_etsy();
var_dump($ret);
/*foreach($ret as $v) {
echo $v['title'].'<br>';
echo '<ul>';
echo '<li>'.$v['details'].'</li>';
echo '<li>Diggs: '.$v['diggs'].'</li>';
echo '</ul>';
}*/
?>
As for getting the children of those divs: once you have found the parent element, always call ->find('<the selector here>', 0) with the index so that you actually point to that element (without an index, find() returns an array of matches).
$html = file_get_html('https://etsy.com/listing/107492702/');
// listings with description
$div = $html->find('div#listing-page-cart-inner', 0); // here index zero
$main_description = $div->find('h1', 0)->innertext;
echo $main_description . '<br/><br/>';
$div_item_overview = $div->find('div#item-overview ul.properties li');
foreach ($div_item_overview as $overview) {
echo $overview->innertext . '<br/>';
}
// tags
$div_tag = $html->find('div#tags', 0); // here index zero pointing to that element
$tags = array();
foreach($div_tag->find('ul li') as $li) {
$tags[] = $li->find('a', 0)->innertext;
}
echo '<pre>', print_r($tags, 1), '</pre>';
// description
$div_description = $html->find('div#description', 0)->plaintext; // here pointing to index zero
echo $div_description;
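One caveat with the indexed form: when nothing matches, simple_html_dom's find() with an index returns null, and reading ->plaintext from null fails. Below is a small hedged sketch of a guard. The textOf helper is hypothetical, and the stdClass object is only a stand-in for a real simple_html_dom node.

```php
<?php
// Read a text property from a node that may be null (no match found).
function textOf(?object $node, string $property = 'plaintext', string $default = ''): string
{
    if ($node === null || !isset($node->$property)) {
        return $default;
    }
    return trim((string) $node->$property);
}

// Stand-in for a matched node.
$found = new stdClass();
$found->plaintext = '  Handmade mug ';

echo textOf($found), "\n";
echo textOf(null, 'plaintext', 'n/a'), "\n";
```

With the real library this would be called as textOf($html->find('div#description', 0)), so a layout change on the page yields an empty string instead of a fatal error.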
The easiest way to start is always to use a third-party library, e.g. Symfony DomCrawler.
Its usage is as easy as:
use Symfony\Component\DomCrawler\Crawler;
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<p class="message">Hello World!</p>
<p>Hello Crawler!</p>
</body>
</html>
HTML;
$crawler = new Crawler($html);
foreach ($crawler as $domElement) {
print $domElement->nodeName;
}
And you can use filters like
$crawler = $crawler->filter('body > p');
I have a piece of code similar to below:
include 'simplehtmldom/simple_html_dom.php';
...
...
foreach ($files as $file){
$results= array();
if(substr($file->getAttribute('href'),0,strlen($lookfor))==$lookfor){
$URLs= $file->getAttribute('href');
echo $URLs ."<br>";
$html = file_get_html($URLs);
foreach($html->find('div.postDisplay') as $post) {
$item['date'] = $post->find('p.id.post-date', 0)->plaintext;
$item['location'] = $post->find('p.id.post-location', 0)->plaintext;
$title = $item['title'] = $post->find('h1.id.post-title', 0)->plaintext;
$item['post'] = $post->find('div.post', 0)->plaintext;
$results[] = $item;
}
print_r($results) ."</br>";
...
...
...
$my_id ="1";
$photos = "1";
$insert_query = mysqli_query($db_connect, "INSERT INTO jackson.data (
my_id, photos, results) VALUES (
'$my_id', '$photos', '$results')");
The code echoes the $results values in the browser perfectly fine; however, when I insert the data into the database, the results field only stores "Array" as its value. Is there something I'm missing? And how can I insert the HTML-formatted $results values that are displayed in my browser rather than the plain text?
You are using print_r, which outputs the array with its indexes, and that's why the browser displays the result perfectly. You are using the variable $results directly in your insert query, and that fails because it contains an array, which PHP converts to the string "Array". Try something like this:
Change your table structure to
jackson.data (my_id, photos, title,date,location,post)
and put the insert statement into the foreach loop and insert the values accordingly.
Example
foreach($html->find('div.postDisplay') as $post) {
$item['date'] = $post->find('p.id.post-date', 0)->plaintext;
$item['location'] = $post->find('p.id.post-location', 0)->plaintext;
$title = $item['title'] = $post->find('h1.id.post-title', 0)->plaintext;
$item['post'] = $post->find('div.post', 0)->plaintext;
$query=mysqli_query($db_connect,"INSERT INTO jackson.data (
my_id, photos, title,date,location,post) VALUES (
'$my_id', '$photos', '{$item['title']}', '{$item['date']}',.....)");
}
For html formatting:
Do something like this:
echo "<html><body>";
foreach($html->find('div.postDisplay') as $post) {
$item['date'] = $post->find('p.id.post-date', 0)->plaintext;
$item['location'] = $post->find('p.id.post-location', 0)->plaintext;
$title = $item['title'] = $post->find('h1.id.post-title', 0)->plaintext;
$item['post'] = $post->find('div.post', 0)->plaintext;
$query=mysqli_query($db_connect,"INSERT INTO jackson.data (
my_id, photos, title,date,location,post) VALUES (
'$my_id', '$photos', '{$item['title']}', '{$item['date']}',.....)");
echo "<div class=\"my_post\"><h1>".$item['title']."</h1>"."<br />Published:". $item['date']."<br />".$item['location']."<br /><br />".$item['post']."</div>";
}
echo "</body></html>";
In your css you can have something like this:
.my_post
{
margin:0 auto; /* centers the contents */
font-family:fontname; /* replace fontname with a real font */
font-weight:bold;
font-size:16px;
color:brown;
padding-top:15px; /* adjusts the gap between two posts */
}
You can use
"<pre>".print_r($results,true)."</pre>"
to store the output in the db so that it displays as HTML, similar to what you see in the browser.
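If changing the table structure is not an option, another approach is to serialize the array into a single string before inserting it. Below is a minimal sketch using json_encode with made-up sample data. The database calls are left as comments because the connection details are not known, and the prepared statement shown there is one possible way to run the insert, not the original poster's code.

```php
<?php
// Made-up sample of what $results might hold after scraping.
$results = [
    ['date' => '2015-01-01', 'location' => 'Jackson', 'title' => 'Sample post', 'post' => 'Body text'],
];

// Serialize the whole array to one string that fits in a text column.
$encoded = json_encode($results);

// With mysqli, a prepared statement avoids quoting problems, for example:
// $stmt = mysqli_prepare($db_connect, 'INSERT INTO jackson.data (my_id, photos, results) VALUES (?, ?, ?)');
// mysqli_stmt_bind_param($stmt, 'sss', $my_id, $photos, $encoded);
// mysqli_stmt_execute($stmt);

// Reading the column back restores the original array.
$decoded = json_decode($encoded, true);
echo $decoded[0]['title'], "\n";
```

The stored JSON can be decoded later and rendered with whatever HTML is needed, which keeps presentation out of the database.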
I'm trying to pull data from this site http://www.citizencorps.fema.gov/cc/CertIndex.do?reportsForState&cert=&state=IN using PHP. Can anyone please tell me why my code below isn't working? Ideally I want to pull the Name, Point of Contact, Phone Number, Email, and Brief Description (if one exists) and then convert that data into a CSV file.
<?php
require_once "support/simple_html_dom.php";
$url = "http://www.citizencorps.fema.gov/cc/CertIndex.do?reportsForState&cert=&state=IN";
$html = file_get_html($url);
foreach($html->find('tr') as $row) {
$name = $row->find('td', 0)->plaintext;
$poc = $row->find('td', 1)->plaintext;
$phone = $row->find('td', 2)->plaintext;
$email = $row->find('td', 3)->plaintext;
if(count($row->find('td', 4)->plaintext) > 0) {
$desc = find('td', 4)->plaintext;
}
print_r($name.'<br/>'. $poc.'<br/>'.$phone.'<br/>'.$email.'<br/>'.$desc);
}
?>
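Two things stand out in the code above: count() is applied to a string rather than to a list, and the last find() call is missing its $row-> prefix. Below is a hedged sketch of the row handling, using PHP's built-in DOMDocument on a made-up table instead of the live page. The real markup may differ, and extractRows is a hypothetical name.

```php
<?php
// Extract table rows as arrays; rows with fewer than four <td> cells
// (for example header rows) are skipped, and the fifth cell is optional.
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = $tr->getElementsByTagName('td');
        if ($cells->length < 4) {
            continue;
        }
        $row = [];
        for ($i = 0; $i < 5; $i++) {
            $cell = $cells->item($i);
            $row[] = $cell !== null ? trim($cell->textContent) : '';
        }
        $rows[] = $row;
    }
    return $rows;
}

// Made-up sample table: one header row, one full row, one row without a description.
$sample = '<table>'
    . '<tr><th>Name</th><th>POC</th><th>Phone</th><th>Email</th></tr>'
    . '<tr><td>Team A</td><td>Jane Doe</td><td>555-0100</td><td>jane@example.com</td><td>Sample team</td></tr>'
    . '<tr><td>Team B</td><td>John Roe</td><td>555-0101</td><td>john@example.com</td></tr>'
    . '</table>';

$rows = extractRows($sample);

// Write the rows as CSV to standard output.
$out = fopen('php://output', 'w');
foreach ($rows as $row) {
    fputcsv($out, $row, ',', '"', '\\');
}
fclose($out);
```

Swapping php://output for a file path would write the CSV to disk instead.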