Currently, I am able to scrape the content from my desired website without any problems, but if you view my demo, you can see that the content field in my array only ever contains the source line. No matter what I change, that does not get fixed.
$page = (isset($_GET['p'])&&$_GET['p']!=0) ? (int) $_GET['p'] : '';
$html = file_get_html('http://screenrant.com/movie-news/'.$page);
foreach($html->find('#site-top ul h2 a') as $element)
{
print '<br><br>';
echo $url = ''.$element->href;
$html2 = file_get_html($url);
print '<br><br>';
$image = $html2->find('meta[property=og:image]',0);
print $news['image'] = $image->content;
print '<br><br>';
// Ending The Featured Image
$title = $html2->find(' header > h1',0);
print $news['title'] = $title->plaintext;
print '<br>';
// Ending the titles
print '<br>';
$articles = $html2->find('div.top-content > article > p');
foreach ($articles as $article) {
echo "$article->plaintext<p>";
}
$news['content'] = $article->plaintext;
print '<br><br>';
#post> div:nth-child(2) > header > p > time
$date = $html2->find('header > p > time',0);
$news['date'] = $date->plaintext;
$dexp = explode(', ',$date);
print $date = $dexp[0].', '.$dexp[1];
print '<br><br>';
$genre = "news";
print '<br>';
mysqli_query($DB,"INSERT INTO `wp_scraped_news` SET
`hash` = '".$news['title']."',
`title` = '".$news['title']."',
`image` = '".$news['image']."',
`content` = '".$news['content']."'");
print '<pre>';print_r($news);print '</pre>';
}
Currently using simple_html_dom.php to scrape.
If you take a look at this piece of code:
$articles = $html2->find('div.top-content > article > p');
foreach ($articles as $article) {
echo "$article->plaintext<p>";
//This is printing the article content line by line
}
$news['content'] = $article->plaintext;
//This is grabbing the last line of the article content AKA the source
//The last <p> as it's not in the foreach.
Effectively, you need to be doing this:
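$articles = $html2->find('div.top-content > article > p');
$news['content'] = ''; // start empty so text from a previous article is not carried over
foreach ($articles as $article) {
echo "$article->plaintext<p>";
// Append every paragraph instead of keeping only the last one
$news['content'] .= $article->plaintext . "<p>";
}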
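If you would rather not build the string by hand, a minimal sketch of the same idea is to collect the paragraph texts and join them once. The helper name joinParagraphs is hypothetical, and the plain strings stand in for the $article->plaintext values from the loop:

```php
<?php
// Hypothetical helper: wraps each non-empty paragraph text in <p> tags
// and joins them into one string.
function joinParagraphs(array $paragraphs): string
{
    $parts = [];
    foreach ($paragraphs as $text) {
        $text = trim($text);
        if ($text !== '') {
            $parts[] = '<p>' . $text . '</p>';
        }
    }
    return implode("\n", $parts);
}

// Stand-in data; in the scraper these would be the plaintext values of $articles.
$content = joinParagraphs(['First paragraph.', ' Second paragraph. ', 'Source: Example']);
echo $content, "\n";
```

The result can then be assigned to $news['content'] once, after the loop.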
I am trying to parse a movie website with pagination. I want to parse all movie items on page 1 and, when that is done, have the parser continue on the next page. I wrote a parser which works, but it does not parse all movie items on the page and does not continue to the next page. I want to detect when parsing of one item is done and move on to the next item, then detect when all movie items are parsed and move on to the next page. I expect that when I run the parser, it displays the movie title, year, etc. one by one and then continues on the next page. Currently it parses and displays only one movie item on page 1 and does not continue. Here's my code and an example:
Parsing Example: http://minerbitco.in/parse/parse.php
<?php
include_once 'simple_html_dom.php';
$page = (!isset($_GET['page'])) ? 1 : $_GET['page'];
echo '<br> Parsing Page #'.$page.'<br><br>';
$html = file_get_html('https://srulad.com/movies/type/movie#page-'.$page);
$obj = $html->find('div.movie_item');
$datas = [];
if($obj){
foreach ($obj as $key => $data) {
$movie_url = 'https://srulad.com/'.$data->find('div.poster a', 0)->href;
$html2 = file_get_html($movie_url);
$item['url'] = $movie_url;
$item['year'] = $html2->find('#movie_content > div', 0)->children(2)->find('div', 0)->children(0)->children(1)->plaintext;
$item['genre'] = $html2->find('#movie_content > div', 0)->children(1)->find('span', 0)->plaintext;
$item['description'] = $html2->find('#movie_content > div', 0)->children(1)->find('div.plot', 0)->plaintext;
$item['imdb_rating'] = $html2->find('#movie_content > div', 0)->children(2)->find('div', 0)->children(1)->children(1)->find('span', 0)->plaintext;
$item['englishtitle'] = $html2->find('#movie_content > div', 0)->children(1)->find('h2.newmt', 0)->plaintext;
$item['geotitle'] = $html2->find('#movie_content > div', 0)->children(1)->find('h3.newmt', 0)->plaintext;
$item['poster'] = $html2->find('#movie_content > div', 0)->children(0)->find('img', 0)->src;
$url = $item['url'];
$year = $item['year'];
$desc = $item['description'];
$rating = $item['imdb_rating'];
$poster = $item['poster'];
$engtitle = $item['englishtitle'];
$geotitle = $item['geotitle'];
$genre = $item['genre'];
}}
if ($data === end($obj)) {
echo '<META http-equiv="refresh" content="10;URL=#page-'.($page+1).'">';
}
else {
echo "dasrulebulia.";
}
echo 'URL: '.$url.'<br>';
echo 'პოსტერის URL: '.$poster.'<br>';
echo 'სათაური ინგლისურად: '.$engtitle.'<br>';
echo 'სათაური ქართულად: '.$geotitle.'<br>';
echo 'წელი:'.$year.'<br>';
echo 'ჟანრი:'.$genre.'<br>';
echo 'აღწერა:'.$desc.'<br>';
echo 'რეიტინგი:'.$rating.'<br>';
?>
You can give the parser I have written a try:
https://github.com/sachinsinghshekhawat/simple-html-dom-parser-php
I'm trying to make a PHP slideshow and I'm almost done. I just need to implement the next and back buttons, which I thought were going to be easy, but apparently you can't increment indexes in PHP?
$sql = "SELECT pic_url FROM pic_info";
$result = $conn->query($sql);
$count = 0;
$dir = "http://dev2.matrix.msu.edu/~matrix.training/Holmberg_Dane/";
$source = "gallery.php";
if ($result->num_rows > 0) {
// output data of each row
$pic_array = array();
while ($row = $result->fetch_assoc()) {
$pic_array[$count] = $row['pic_url'];
$count++;
}
$index = 1;
echo "<img src= ' $dir$pic_array[$index]' />";
echo "<a href= '$dir$pic_array[$index + 1]'>next</a>";
echo "<a href= '$dir$pic_array[$index - 1]'>back</a>";
}
$conn->close();
?>
Try this
<?php
echo '<img src="'.$dir.$pic_array[$index].'">
<a href="'.$dir.$pic_array[$index + 1].'">next</a>
<a href="'.$dir.$pic_array[$index - 1].'">back</a>';
?>
I would suggest placing the URLs in an array and looping over them.
$urls = ['website/url/for/image/image.jpg', 'another/url/image.jpg'];
foreach ($urls as $url) {
echo '<a href="http://'.$url.'"> Click </a>';
}
It is definitely more readable that way.
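On the "can't increment indexes" part: inside a double-quoted string PHP only accepts a plain variable or literal as the array index, so an expression like $index + 1 has to be computed outside the string. Here is a minimal sketch, assuming the current position is passed as a query parameter to gallery.php; the helper name slideshowNav and the index parameter are my own, not from the question:

```php
<?php
// Hypothetical helper: computes wrapped current/next/back indexes for a slideshow.
function slideshowNav(int $index, int $count): array
{
    if ($count < 1) {
        throw new InvalidArgumentException('Need at least one picture.');
    }
    // Normalise the index into the range 0..$count-1 (also handles negatives).
    $index = (($index % $count) + $count) % $count;
    return [
        'current' => $index,
        'next' => ($index + 1) % $count,
        'back' => ($index - 1 + $count) % $count,
    ];
}

$pics = ['a.jpg', 'b.jpg', 'c.jpg'];   // stand-in for $pic_array
$nav = slideshowNav(0, count($pics));  // on the page the index could come from (int) $_GET['index']
echo '<img src="' . htmlspecialchars($pics[$nav['current']]) . '">', "\n";
echo '<a href="gallery.php?index=' . $nav['next'] . '">next</a>', "\n";
echo '<a href="gallery.php?index=' . $nav['back'] . '">back</a>', "\n";
```

The modulo arithmetic makes "next" on the last picture wrap to the first one, and "back" on the first wrap to the last.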
Is there any way to define a variable inside a while loop and print it above the while loop?
Something like this:
print $catName;
while ($b = mysqli_fetch_assoc($results)) {
$catName = $b['cat'];
}
EDIT
$getBooks = $db->prepare('SELECT
id, cat, instituteId, title, cover, authorName FROM books_library WHERE instituteId=? AND cat=?');
$getList = array();
$getBooks->bind_param('ii', $inId, $cid);
if ($getBooks->execute()) {
$libResults = $getBooks->get_result();
$getList[] = $libResults;
?>
HTML CODE COME HERE
<h3 class="marginBottom8px margin-top_30px text-bold text-danger"><?php echo catName ?></h3>
AND THEN THE WHILE LOOP
while ($book = mysqli_fetch_assoc($libResults)) {
$bokId = $book['id'];
$bookTitle = $bokId['title'];
$catName = $bokId['cat'];
$catName = $bokId['cat'];
Now, inside the while loop I have $catName. How can I echo it above, in the <h3></h3>?
After the while block, put your h3 html in a variable and print it.
$myHtml = '<h3 class="marginBottom8px margin-top_30px text-bold text-danger">'
. $catName . '</h3>';
print $myHtml;
Don't mix html and php like that.
I think this is what you're after based on the edited question
while ($book = mysqli_fetch_assoc($libResults)) {
echo '<h3>' . $book['cat'] . '</h3>';
}
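Another option, if the heading really has to come before the list, is to read all rows into an array first and only then produce the output. This is a rough sketch with plain arrays standing in for the query result; renderBooks is a hypothetical name, and the rows are assumed to have cat and title keys as in the question:

```php
<?php
// Hypothetical helper: prints a heading from the first row, then one line per book.
// $rows stands in for something like $libResults->fetch_all(MYSQLI_ASSOC).
function renderBooks(array $rows): string
{
    if ($rows === []) {
        return '<p>No books found.</p>';
    }
    // The category is known before the loop because the rows are already in memory.
    $html = '<h3>' . htmlspecialchars((string) $rows[0]['cat']) . '</h3>' . "\n";
    foreach ($rows as $book) {
        $html .= '<p>' . htmlspecialchars((string) $book['title']) . '</p>' . "\n";
    }
    return $html;
}

echo renderBooks([
    ['cat' => 'Science', 'title' => 'First book'],
    ['cat' => 'Science', 'title' => 'Second book'],
]);
```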
I am getting JSON data by visiting a link using PHP HTML DOM, but sometimes I get an empty page, so I want to know how I can reliably check whether the page is empty, so that I can skip it by using continue in the for loop.
I am checking it with:
if (empty($jsondata))
But I always get TRUE, never FALSE, even if the page is returned empty.
Here is my code:
<?php
$prefix = $_POST['prefix'];
$start_product = $_POST['start_product'];
$end_product = $_POST['end_product'];
set_time_limit(0);
for ($i=$start_product; $i <= $end_product; $i++) {
include('simple_html_dom.php');
$prefix ="00";
$i= "11";
$jsondata = file_get_html('http://www.ewallpk.com/index.php?controller=search&q=A'.$prefix.$i.'&limit=10&timestamp=1445547668758&ajaxSearch=1&id_lang=1');
if (!empty($jsondata)) {
$data = json_decode($jsondata, true);
$product = file_get_html($data[0]["product_link"]);
$product_name= "";
foreach($product->find('div[id=pb-left-column] h1') as $element) {
$product_name.=$element->innertext . '<br>';
}
$product_name = explode("_", $product_name);
$count = count($product_name);
if ($count < 3) {
$product_name=$product_name[0];
} else {
$product_name = "Error";
}
$product_description= "";
foreach($product->find('div[id=short_description_content]') as $element) {
$product_description.=$element->plaintext . '<br>';
}
$product_price= "";
foreach($product->find('p[class=our_price_display] span') as $element) {
$product_price.=$element->innertext . '<br>';
}
$image_link= "";
foreach($product->find('img[id=bigpic]') as $element) {
$image_link.=$element->src;
}
$content = file_get_contents($image_link);
file_put_contents('item_images/A'.$prefix.$i.'.jpg', $content);
echo "<strong>Product No : </strong> A".$prefix.$i."</br>";
echo "<strong>Product Name : </strong>".$product_name."</br>";
echo "<strong>Product Description : </strong>".$product_description;
echo "<strong>Product Price : </strong>".$product_price."</br></br></br>";
} else {
continue;
}
}
?>
You're probably getting some whitespace in the empty response, so trim it off before testing. You also should be using file_get_contents, since the response is not HTML.
$jsondata = file_get_contents('http://www.ewallpk.com/index.php?controller=search&q=A'.$prefix.$i.'&limit=10&timestamp=1445547668758&ajaxSearch=1&id_lang=1');
$jsondata = trim($jsondata);
if (!empty($jsondata)) {
...
}
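To make the check a bit more robust, it may help to also treat a failed request and invalid JSON as "empty". A minimal sketch of that idea follows; decodeJsonOrNull is a hypothetical helper name, and the strings passed in stand in for the response body:

```php
<?php
// Hypothetical helper: returns the decoded array, or null when the body is
// missing, blank, not valid JSON, or an empty list.
function decodeJsonOrNull($body): ?array
{
    if (!is_string($body)) {   // file_get_contents() returns false on failure
        return null;
    }
    $body = trim($body);
    if ($body === '') {
        return null;
    }
    $data = json_decode($body, true);
    return (is_array($data) && $data !== []) ? $data : null;
}

// Stand-in responses; in the loop this would be the result of file_get_contents($url).
foreach (["  \n", '[{"product_link":"http://example.com/p1"}]'] as $body) {
    $data = decodeJsonOrNull($body);
    if ($data === null) {
        continue; // nothing usable, skip this product number
    }
    echo $data[0]['product_link'], "\n";
}
```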
I am trying to get the data in <div id="listing-page-cart-inner">, <div id="description text"> and <div id="tags">, but I am finding it difficult to extract the data.
Can anyone guide me? I am able to scrape the first div I mentioned, but not the other ones. When I loop through the second foreach, it also takes a long time.
<?php
include_once('simple_html_dom.php');
$html = file_get_html('https://etsy.com/listing/107492702/');
//$val = $html->find('div[id=listing-page-cart-inner]');
function scraping_etsy() {
// create HTML DOM
$html = file_get_html('https://etsy.com/listing/107492702/');
foreach($html->find('div[id=listing-page-cart-inner]') as $article)
{
// get title
//$item['title'] = trim($article->find('h3', 0)->plaintext);
// get details
$item['details'] = trim($article->find('span', 0)->plaintext);
// get intro
//$lists = $articles->find('div[id=item-overview]');
$item['list1'] = trim($article->find('li',0)->plaintext);
$item['list2'] = trim($article->find('li',1)->plaintext);
$item['list3'] = trim($article->find('li',2)->plaintext);
$item['list4'] = trim($article->find('li',3)->plaintext);
$item['list5'] = trim($article->find('li',4)->plaintext);
/*foreach($article->find('li') as $al){
$item['lists'] =trim($al->find('li')->plaintext);
}*/
$ret[] = $item;
}
foreach($html->find('div[id=description]') as $content){
var_dump($content->find('text'));
// $item['content'] = trim($content->find('div[id=description]')->plaintext);
// $ret[] = $item;
}
// clean up memory
$html->clear();
unset($html);
return $ret ;
}
$ret = scraping_etsy();
var_dump($ret);
/*foreach($ret as $v) {
echo $v['title'].'<br>';
echo '<ul>';
echo '<li>'.$v['details'].'</li>';
echo '<li>Diggs: '.$v['diggs'].'</li>';
echo '</ul>';
}*/
?>
As for getting the children of those divs, just remember that once you have found the parent element, you should always call ->find('<the selector here>', 0) with the index, so that you actually point to that one element rather than to an array of matches.
$html = file_get_html('https://etsy.com/listing/107492702/');
// listings with description
$div = $html->find('div#listing-page-cart-inner', 0); // here index zero
$main_description = $div->find('h1', 0)->innertext;
echo $main_description . '<br/><br/>';
$div_item_overview = $div->find('div#item-overview ul.properties li');
foreach ($div_item_overview as $overview) {
echo $overview->innertext . '<br/>';
}
// tags
$div_tag = $html->find('div#tags', 0); // here index zero pointing to that element
$tags = array();
foreach($div_tag->find('ul li') as $li) {
$tags[] = $li->find('a', 0)->innertext;
}
echo '<pre>', print_r($tags, 1), '</pre>';
// description
$div_description = $html->find('div#description', 0)->plaintext; // here pointing to index zero
echo $div_description;
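One caveat with the index form: as far as I know, find() with an index gives back null when nothing matches, so reading ->plaintext directly can fail on pages that lack the element. A small guard can be sketched like this; plaintextOr is a hypothetical helper, and the stdClass object merely stands in for a node returned by $html->find('div#description', 0):

```php
<?php
// Hypothetical helper: reads ->plaintext from a node, or returns a default
// when the lookup found nothing (null).
function plaintextOr($node, string $default = ''): string
{
    if ($node === null) {
        return $default;
    }
    return trim((string) $node->plaintext);
}

// Stand-in node for illustration only.
$found = new stdClass();
$found->plaintext = '  Handmade item  ';

echo plaintextOr($found), "\n";
echo plaintextOr(null, '(no description)'), "\n";
```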
The easiest way to start is always to use a third-party library, e.g. Symfony DomCrawler.
Its usage is as simple as:
use Symfony\Component\DomCrawler\Crawler;
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<p class="message">Hello World!</p>
<p>Hello Crawler!</p>
</body>
</html>
HTML;
$crawler = new Crawler($html);
foreach ($crawler as $domElement) {
print $domElement->nodeName;
}
And you can use filters like
$crawler = $crawler->filter('body > p');