Getting span contents with “Simple HTML DOM” - php

I'm trying to scrape Twitter tweets from a user page using “Simple HTML DOM”.
I can get the tweets but not their timestamp.
The HTML seems to be like this:
<p class="ProfileTweet-text js-tweet-text u-dir" lang="en" dir="ltr" data-aria-label-part="0">Tweet content<a href="/hashtag/TweetContent?src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr" ><s>#</s><b>TweetContent</b></a> <a href="http://t.co/JFredfvgYs" class="twitter-timeline-link u-hidden" data-pre-embedded="true" dir="ltr" >pic.twitter.com/JFredfvgYs</a></p>
The UNIX timestamp is in this:
<span class="js-short-timestamp "
data-aria-label-part="last"
data-time="1411584273"
data-long-form="true" >
Sep 24
</span>
So I'm doing:
<?php
include 'simple_html_dom.php';
$html = file_get_html('https://twitter.com/UserName');
$tweets = $html->find('div.ProfileTweet-contents');
foreach ($tweets as $tweet) {
$tweetText = $tweet->find('p.ProfileTweet-text', 0)->plaintext;
echo $tweetText;
}
?>
... which is fine for getting the tweet text but I don't know how to approach getting that Unix timestamp.
I thought maybe:
<?php
include 'simple_html_dom.php';
$html = file_get_html('https://twitter.com/UserName');
$tweets = $html->find('div.ProfileTweet-contents');
foreach ($tweets as $tweet) {
$tweetText = $tweet->find('p.ProfileTweet-text', 0)->plaintext;
$tweetDate = $tweet->find('span.js-short-timestamp ', 0);
echo $tweetText.' '.$tweetDate->data-time;
?>
... but that's all wrong. Any help?

This is most likely because of the property you're trying to access. Wrap the hyphenated property name in curly braces, like this:
$tweetDate->{'data-time'};
Rough example:
$html = file_get_html('https://twitter.com/katyperry');
$tweet_block = $html->find('div.ProfileTweet');
foreach($tweet_block as $tweet) {
// get tweet text
$tweetText = $tweet->find('p.ProfileTweet-text text', 0)->innertext;
echo 'Tweet: ' . $tweetText . '<br/>';
// get tweet stamp
$tweetDate = $tweet->find('a.ProfileTweet-timestamp span.js-short-timestamp', 0);
echo 'Timestamp: ' .$tweetDate->{'data-time'} . '<br/>';
echo '<hr/>';
}
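Why the braces are needed can be seen without the scraping library at all. The following is a minimal sketch that uses a plain stdClass object as a hypothetical stand-in for the element returned by find(); the timestamp value is the one from the question.

```php
<?php
// Hypothetical stand-in for the element object returned by find(..., 0).
$tweetDate = new stdClass();
$tweetDate->{'data-time'} = '1411584273';

// Written as $tweetDate->data-time, PHP would parse the expression as
// ($tweetDate->data) - time, so the hyphenated name must go inside {'...'}.
$timestamp = (int) $tweetDate->{'data-time'};

// Format the Unix timestamp as a UTC date string.
$formatted = gmdate('Y-m-d H:i:s', $timestamp);
echo $formatted, "\n";
```

The same curly-brace syntax works for any attribute name that is not a valid PHP identifier.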

Related

HTML Pagination Parsing with PHP Simple HTML DOM Parser

I am trying to parse a movie website that uses pagination. I want to parse all the movie items on page 1 and, once that is done, have the parser continue to the next page. I wrote a parser that works, but it does not parse all the movie items on a page and does not continue to the next page. I want to detect when parsing of one item is done and move on to the next item, then detect when all movie items are parsed and move on to the next page. I expect the parser to display the movie title, year, etc. one by one and then continue to the next page. Currently it displays only one movie item from page 1 and does not continue. Here's my code and an example:
Parsing Example: http://minerbitco.in/parse/parse.php
<?php
include_once 'simple_html_dom.php';
$page = (!isset($_GET['page'])) ? 1 : $_GET['page'];
echo '<br> Parsing Page #'.$page.'<br><br>';
$html = file_get_html('https://srulad.com/movies/type/movie#page-'.$page);
$obj = $html->find('div.movie_item');
$datas = [];
if($obj){
foreach ($obj as $key => $data) {
$movie_url = 'https://srulad.com/'.$data->find('div.poster a', 0)->href;
$html2 = file_get_html($movie_url);
$item['url'] = $movie_url;
$item['year'] = $html2->find('#movie_content > div', 0)->children(2)->find('div', 0)->children(0)->children(1)->plaintext;
$item['genre'] = $html2->find('#movie_content > div', 0)->children(1)->find('span', 0)->plaintext;
$item['description'] = $html2->find('#movie_content > div', 0)->children(1)->find('div.plot', 0)->plaintext;
$item['imdb_rating'] = $html2->find('#movie_content > div', 0)->children(2)->find('div', 0)->children(1)->children(1)->find('span', 0)->plaintext;
$item['englishtitle'] = $html2->find('#movie_content > div', 0)->children(1)->find('h2.newmt', 0)->plaintext;
$item['geotitle'] = $html2->find('#movie_content > div', 0)->children(1)->find('h3.newmt', 0)->plaintext;
$item['poster'] = $html2->find('#movie_content > div', 0)->children(0)->find('img', 0)->src;
$url = $item['url'];
$year = $item['year'];
$desc = $item['description'];
$rating = $item['imdb_rating'];
$poster = $item['poster'];
$engtitle = $item['englishtitle'];
$geotitle = $item['geotitle'];
$genre = $item['genre'];
}}
if ($data === end($obj)) {
echo '<META http-equiv="refresh" content="10;URL=#page-'.($page+1).'">';
}
else {
echo "dasrulebulia.";
}
echo 'URL: '.$url.'<br>';
echo 'Poster URL: '.$poster.'<br>';
echo 'Title in English: '.$engtitle.'<br>';
echo 'Title in Georgian: '.$geotitle.'<br>';
echo 'Year:'.$year.'<br>';
echo 'Genre:'.$genre.'<br>';
echo 'Description:'.$desc.'<br>';
echo 'Rating:'.$rating.'<br>';
?>
You can try a parser I have written:
https://github.com/sachinsinghshekhawat/simple-html-dom-parser-php

Content working fine in print but not in array

Currently, I am able to scrape the content from my desired website without any problems, but if you view my demo, you can see that the content in my array is only showing The Source, and no matter what I change, it isn't fixed.
$page = (isset($_GET['p'])&&$_GET['p']!=0) ? (int) $_GET['p'] : '';
$html = file_get_html('http://screenrant.com/movie-news/'.$page);
foreach($html->find('#site-top ul h2 a') as $element)
{
print '<br><br>';
echo $url = ''.$element->href;
$html2 = file_get_html($url);
print '<br><br>';
$image = $html2->find('meta[property=og:image]',0);
print $news['image'] = $image->content;
print '<br><br>';
// Ending The Featured Image
$title = $html2->find(' header > h1',0);
print $news['title'] = $title->plaintext;
print '<br>';
// Ending the titles
print '<br>';
$articles = $html2->find('div.top-content > article > p');
foreach ($articles as $article) {
echo "$article->plaintext<p>";
}
$news['content'] = $article->plaintext;
print '<br><br>';
#post> div:nth-child(2) > header > p > time
$date = $html2->find('header > p > time',0);
$news['date'] = $date->plaintext;
$dexp = explode(', ',$date);
print $date = $dexp[0].', '.$dexp[1];
print '<br><br>';
$genre = "news";
print '<br>';
mysqli_query($DB,"INSERT INTO `wp_scraped_news` SET
`hash` = '".$news['title']."',
`title` = '".$news['title']."',
`image` = '".$news['image']."',
`content` = '".$news['content']."'");
print '<pre>';print_r($news);print '</pre>';
}
Currently using simple_html_dom.php to scrape.
If you take a look at this piece of code:
$articles = $html2->find('div.top-content > article > p');
foreach ($articles as $article) {
echo "$article->plaintext<p>";
//This is printing the article content line by line
}
$news['content'] = $article->plaintext;
//This is grabbing the last line of the article content AKA the source
//The last <p> as it's not in the foreach.
Effectively, you need to be doing this:
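$articles = $html2->find('div.top-content > article > p');
// Start with an empty string so the first append does not hit an undefined index.
$news['content'] = '';
foreach ($articles as $article) {
echo "$article->plaintext<p>";
// Append each paragraph instead of overwriting the previous one.
$news['content'] .= $article->plaintext . "<p>";
}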
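The difference between assigning after the loop and appending inside it can be reproduced with plain strings. This is a minimal sketch; the $paragraphs array is a hypothetical stand-in for the plaintext of the matched <p> elements.

```php
<?php
// Hypothetical paragraph texts standing in for $article->plaintext values.
$paragraphs = ['First paragraph.', 'Second paragraph.', 'Source: Example'];

// Appending inside the loop keeps every paragraph.
$news = ['content' => ''];
foreach ($paragraphs as $text) {
    $news['content'] .= $text . "<p>";
}

// Assigning after the loop only sees the last value of the loop variable.
$lastOnly = $text;

echo $news['content'], "\n";
echo $lastOnly, "\n";
```

Collecting the texts in an array and joining them with implode() would be another way to get the same result.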

How to scrape page using simple htmldom and PHP?

I am trying to get the data in <div id="listing-page-cart-inner">, <div id="description text"> and <div id="tags">, but I am finding it difficult to extract the data.
Can anyone guide me? I am able to scrape the first div I mentioned, but not the others. When I loop through the second foreach, it takes a long time.
<?php
include_once('simple_html_dom.php');
$html = file_get_html('https://etsy.com/listing/107492702/');
//$val = $html->find('div[id=listing-page-cart-inner]');
function scraping_etsy() {
// create HTML DOM
$html = file_get_html('https://etsy.com/listing/107492702/');
foreach($html->find('div[id=listing-page-cart-inner]') as $article)
{
// get title
//$item['title'] = trim($article->find('h3', 0)->plaintext);
// get details
$item['details'] = trim($article->find('span', 0)->plaintext);
// get intro
//$lists = $articles->find('div[id=item-overview]');
$item['list1'] = trim($article->find('li',0)->plaintext);
$item['list2'] = trim($article->find('li',1)->plaintext);
$item['list3'] = trim($article->find('li',2)->plaintext);
$item['list4'] = trim($article->find('li',3)->plaintext);
$item['list5'] = trim($article->find('li',4)->plaintext);
/*foreach($article->find('li') as $al){
$item['lists'] =trim($al->find('li')->plaintext);
}*/
$ret[] = $item;
}
foreach($html->find('div[id=description]') as $content){
var_dump($content->find('text'));
// $item['content'] = trim($content->find('div[id=description]')->plaintext);
// $ret[] = $item;
}
// clean up memory
$html->clear();
unset($html);
return $ret ;
}
$ret = scraping_etsy();
var_dump($ret);
/*foreach($ret as $v) {
echo $v['title'].'<br>';
echo '<ul>';
echo '<li>'.$v['details'].'</li>';
echo '<li>Diggs: '.$v['diggs'].'</li>';
echo '</ul>';
}*/
?>
As for getting the children of those divs, remember that once you have found the parent element, you should always pass an index, as in ->find('<the selector here>', 0), so that you get that single element rather than an array of matches.
$html = file_get_html('https://etsy.com/listing/107492702/');
// listings with description
$div = $html->find('div#listing-page-cart-inner', 0); // here index zero
$main_description = $div->find('h1', 0)->innertext;
echo $main_description . '<br/><br/>';
$div_item_overview = $div->find('div#item-overview ul.properties li');
foreach ($div_item_overview as $overview) {
echo $overview->innertext . '<br/>';
}
// tags
$div_tag = $html->find('div#tags', 0); // here index zero pointing to that element
$tags = array();
foreach($div_tag->find('ul li') as $li) {
$tags[] = $li->find('a', 0)->innertext;
}
echo '<pre>', print_r($tags, 1), '</pre>';
// description
$div_description = $html->find('div#description', 0)->plaintext; // here pointing to index zero
echo $div_description;
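The same "take the first match, then read its children" idea can be tried without any external library. The sketch below swaps in PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM, and the markup is an invented sample rather than the real Etsy page.

```php
<?php
// Invented sample markup; the real page structure may differ.
$markup = '<div id="tags"><ul><li><a href="#">wool</a></li>'
        . '<li><a href="#">scarf</a></li></ul></div>'
        . '<div id="description">Hand-knitted scarf.</div>';

// Suppress libxml warnings that real-world HTML tends to trigger.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXPath($dom);

// Collect the link text of every tag inside div#tags.
$tags = [];
foreach ($xpath->query('//div[@id="tags"]//li/a') as $a) {
    $tags[] = trim($a->textContent);
}

// item(0) plays the role of the index 0 in find(..., 0).
$descNode = $xpath->query('//div[@id="description"]')->item(0);
$description = $descNode !== null ? trim($descNode->textContent) : '';

print_r($tags);
echo $description, "\n";
```

Checking the result of item(0) for null before using it avoids a fatal error when the selector matches nothing.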
The easiest way to start is always to use a third-party library, e.g. Symfony DomCrawler.
Its usage is as easy as:
use Symfony\Component\DomCrawler\Crawler;
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<p class="message">Hello World!</p>
<p>Hello Crawler!</p>
</body>
</html>
HTML;
$crawler = new Crawler($html);
foreach ($crawler as $domElement) {
print $domElement->nodeName;
}
And you can use filters like
$crawler = $crawler->filter('body > p');

Reading XML POST data from URL

I am working with a third-party SMS supplier that sends me the SMS delivery report via a URL, as below:
http://www.mydomain.com/dlr.php <DeliveryReport><message id="024042313063119191" sentdate="2014/04/23 15:06:31" donedate="2014/04/23 15:06:35" status="DELIVERED" gsmerror="0" price="7.0" /></DeliveryReport>
And I am trying to read the XML data in dlr.php as below:
<?php
// read raw POST data
$postData = file_get_contents("php://input");
$dom = new DOMDocument();
$dom->loadXML($postData);
// create new XPath object for quering XML elements (nodes)
$xPath = new domxpath($dom);
// query “message” element
$reports = $xPath->query("/DeliveryReport/message");
// write out attributes of each “message” element
foreach ($reports as $node) {
echo "<br>id: " . $node->getAttribute('id');
echo "<br>sent: " . $node->getAttribute('sentdate');
echo "<br>done: " . $node->getAttribute('donedate');
echo "<br>status: " . $node->getAttribute('status');
echo "<br>gsmerrorcode: " . $node->getAttribute('gsmerrorcode');
}
?>
I am getting this error:
Warning: DOMDocument::loadXML(): Empty string supplied as input in dlr.php
Any help on how I can read the posted data correctly?
Thanks,
You can simply use this function to get data from XML:
function getFeed($feed_url)
{
$content = file_get_contents($feed_url);
$x = new SimpleXmlElement($content);
foreach($x->channel->item as $entry) : ?>
<?php
$pdate = $entry->pubDate;
$pdate = rtrim($pdate,' -500');
$pdate = explode(', ',$pdate);
?>
<div >
<a href="<?php echo $entry->link; ?>" target="_blank">
<span > <?php echo $entry->title;?></span></a> <?php echo $pdate[1]; ?>
</div>
<?php
endforeach;
}
getFeed("// Your URL");
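For the delivery report format shown in the question, the attributes can also be read with SimpleXML. This is only a sketch: parseDeliveryReport() is a hypothetical helper name, and the sample string stands in for the request body, which in dlr.php would come from file_get_contents('php://input'). The "Empty string supplied" warning means that body was empty, so the helper checks for that case first.

```php
<?php
// Sample payload copied from the question; in dlr.php this would be the request body.
$postData = '<DeliveryReport><message id="024042313063119191" '
          . 'sentdate="2014/04/23 15:06:31" donedate="2014/04/23 15:06:35" '
          . 'status="DELIVERED" gsmerror="0" price="7.0" /></DeliveryReport>';

// Hypothetical helper: returns one array per <message> element, or [] on empty/invalid input.
function parseDeliveryReport(string $xml): array
{
    if (trim($xml) === '') {
        return []; // nothing was posted
    }
    libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return []; // not well-formed XML
    }
    $reports = [];
    foreach ($doc->message as $message) {
        $reports[] = [
            'id'       => (string) $message['id'],
            'sent'     => (string) $message['sentdate'],
            'done'     => (string) $message['donedate'],
            'status'   => (string) $message['status'],
            'gsmerror' => (string) $message['gsmerror'],
        ];
    }
    return $reports;
}

$reports = parseDeliveryReport($postData);
print_r($reports);
```

If the helper returns an empty array for real requests, it may be worth logging $_GET and $_POST as well, on the assumption that the supplier might be passing the XML as a parameter rather than as the raw body.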

Removing wrapping HTML elements inside a RSS XML node

I have a fetch function that injects RSS content into a page for me. It returns XML containing the usual RSS elements such as title, link and description, but the problem is that the returned description is a table with two tds, one containing an image and the other the text. I am not sure how to remove the table, img and tds and be left with only the text, using PHP and not JavaScript.
Any help is much appreciated.
<?php
require_once('rss_fetch.inc');
$url = 'http://www.domain.com/rss.aspx?typeid=0&imagesize=120&topcount=20';
if ( $url ) {
$rss = fetch_rss( $url );
//echo "Channel: " . $rss->channel['title'] . "<p>";
echo "<ul>";
foreach ($rss->items as $item) {
$href = $item['link'];
$title = $item['title'];
$description = $item['description'];
$pubdate = date('F dS, Y', strtotime($item['pubdate']));
echo "<li><h3>$title<em>$pubdate</em></h3>$description <p><a href='$href' target='_blank'>Read more</a></p><br/></li>";
}
echo "</ul>";
}
?>
strip_tags() will do the job.
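As a minimal sketch, assuming the description looks roughly like the table described in the question (the markup below is invented for illustration):

```php
<?php
// Invented sample of a description wrapped in a table with an image cell.
$description = '<table><tr><td><img src="thumb.jpg" alt=""></td>'
             . '<td>Some article text.</td></tr></table>';

// strip_tags() removes every tag, including <table>, <td> and <img>;
// trim() drops any whitespace left around the remaining text.
$text = trim(strip_tags($description));

echo $text, "\n";
```

In the loop from the question, this would be applied to $item['description'] before it is echoed.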
