using simple html dom to scrape

I am trying to scrape some content using simple_html_dom without luck.
I am trying to grab the title, image path and the link and display it.
The HTML structure is:
<div class="article_item clearfix">
<h2 class="title">My amazing Title</h2>
<p class="date">September 22 2014</p>
<p class="image_left">
<a href="http://www.demodomain/articleid=1">
<img src="http://www.demodomain/photos/cef78533cd5.jpg" alt="My amazing post ">
</a>
</p>
<p>This is a demo description<strong>of this amazing</strong> article</p>
<p class="more">Read more...</p>
</div>
My code so far:
foreach($html->find('article_item') as $article) {
$item['title'] = $article->find('.title, a', 0)->plaintext;
$item['thumb'] = $article->find('.image_left img', 0)->src;
$item['details'] = $article->find('p', 0)->plaintext;
$item['url'] = $article->find('.more, a', 0)->plaintext;
echo 'Title: ' . $item['title'];
echo "</br>";
echo "image url: " . $item['thumb'];
echo "</br>";
echo "Description: " . $item['details'];
echo "</br>";
echo "Read More Url: " . $item['url'];
}
// Clear dom object
$html->clear();
unset($html);

You didn't state what's not working, but consider this example:
foreach ($html->find('div.article_item') as $div) {
// select each div tag with the class name article_item
$title = $div->find('h2.title a', 0)->innertext;
// target the anchor inside the h2 tag with class title,
// much like selecting elements with jQuery
$thumb = $div->find('p.image_left img', 0)->src;
// children() is zero-indexed, so 3 is the fourth child (the description paragraph)
$details = $div->children(3)->plaintext;
// $url = $div->find('p.more', 0)->plaintext;
$url = $div->find('p.more a', 0)->href;
echo $title . '<br/>';
echo $thumb . '<br/>';
echo $details . '<br/>';
echo $url . '<br/>';
}
Basically, find() takes CSS-style selectors, so this works much like selecting elements with jQuery.

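Can you try it like this? Note that find() returns an array unless you pass an index, so ask for the first match with 0:
$item['title'] = $article->find('h2.title', 0)->plaintext;
$item['thumb'] = $article->find('p.image_left', 0)->find('img', 0)->src;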
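
For comparison, here is a minimal sketch of the same extraction using PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom. It is a swapped-in technique, not what the answers above use. The markup is the sample from the question, and the variable names ($markup, $items) are illustrative assumptions. Using string() in the XPath expressions returns an empty string for a missing node rather than failing on a null object.

```php
<?php
// Sample markup copied from the question (a single article block).
$markup = <<<'HTML'
<div class="article_item clearfix">
<h2 class="title">My amazing Title</h2>
<p class="date">September 22 2014</p>
<p class="image_left">
<a href="http://www.demodomain/articleid=1">
<img src="http://www.demodomain/photos/cef78533cd5.jpg" alt="My amazing post ">
</a>
</p>
<p>This is a demo description<strong>of this amazing</strong> article</p>
<p class="more">Read more...</p>
</div>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match divs whose class list contains the exact token "article_item".
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' article_item ')]";

$items = [];
foreach ($xpath->query($query) as $article) {
    $items[] = [
        // string(...) yields '' when the node is missing.
        'title'   => trim($xpath->evaluate("string(h2[@class='title'])", $article)),
        'thumb'   => $xpath->evaluate("string(p[@class='image_left']//img/@src)", $article),
        'url'     => $xpath->evaluate("string(p[@class='image_left']/a/@href)", $article),
        // First paragraph without a class attribute is the description.
        'details' => trim($xpath->evaluate("string(p[not(@class)][1])", $article)),
    ];
}

foreach ($items as $item) {
    echo 'Title: ' . $item['title'] . "<br/>\n";
    echo 'Image url: ' . $item['thumb'] . "<br/>\n";
    echo 'Description: ' . $item['details'] . "<br/>\n";
    echo 'Read more url: ' . $item['url'] . "<br/>\n";
}
```

Here the link is taken from the anchor around the image, since that is the only href present in the sample markup.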

Related

RSS and RSS page for each post

This is my RSS feed format:
<item>
<title></title>
<link></link>
<description></description>
<pubDate></pubDate>
<guid></guid>
<dc:date></dc:date>
</item>
I want to display the last 7 posts with CSS styling, so I use this code:
<?php
$url = "**THE URL I AM SCRAPING DATA FROM**";
$rss = simplexml_load_file($url);
$i = 0;
if (!empty($rss))
{
$site = $rss->channel->title;
$sitelink = $rss->channel->link;
foreach ($rss->channel->item as $item)
{
$title = $item->title;
$link = $item->link;
$description = $item->description;
$item->description = strip_tags($item->description);
$date = $item->pubDate;
$pubDate = date('d.m.Y', strtotime($date));
if ($i >= 7) break;
?>
<div class="post-item">
<div class="post-item-wrap">
<div class="post-image">
<a href="<?php echo $link;?>">
<img alt="" src="images/news/nra.jpg">
</a>
</div>
<div class="post-item-description">
<span class="post-meta-date"><?php echo $pubDate;?></span>
<h2><a href="<?php echo $link ?>" target="_blank"><?php echo $title;?>
</a></h2>
<p><?php echo implode(' ', array_slice(explode(' ', $description), 0, 30)) . "..";?></p>
learn more <i class="icon-chevron-right"></i>
</div>
</div>
</div>
<?php
$i++;
}
}
?>
Now I want each of these 7 posts to have a unique id.
I need a script that generates item -> title and item -> description depending on the item -> link.
(For example, if I click the first post, it should take me to a page that displays the title and description of the post I clicked.)
Thanks in advance.

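Use this code to get the last 7 items. array_slice() needs a real array, so fetch the items with xpath(), which returns one:
array_slice($rss->xpath('channel/item'), -7);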

If the URL is "rss.php?feed=1" or "rss.php?feed=2":
$rssSno = (int) $_GET['feed'];
$output = array_slice($rss->xpath('channel/item'), $rssSno - 1, 1);
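
As a rough sketch of the idea: array_slice() expects a plain array, while $rss->channel->item is a SimpleXMLElement, so the items are first fetched with xpath(), which does return an array. The sample feed and the helper name pickItem() are assumptions made for illustration; in a real page the second argument would come from $_GET['feed'].

```php
<?php
// Hypothetical three-item feed standing in for the real RSS URL.
$xml = <<<'XML'
<rss version="2.0"><channel><title>Demo</title>
<item><title>Post 1</title><link>http://example.com/1</link><description>First</description></item>
<item><title>Post 2</title><link>http://example.com/2</link><description>Second</description></item>
<item><title>Post 3</title><link>http://example.com/3</link><description>Third</description></item>
</channel></rss>
XML;

$rss = simplexml_load_string($xml);

// xpath() returns a plain PHP array of SimpleXMLElement objects.
$items = $rss->xpath('channel/item');

// The first two items, in feed order.
$firstTwo = array_slice($items, 0, 2);

/**
 * Return the item for a 1-based position such as ?feed=2,
 * or null when the value is missing, non-numeric, or out of range.
 */
function pickItem(array $items, $feedParam): ?SimpleXMLElement
{
    $index = filter_var($feedParam, FILTER_VALIDATE_INT);
    if ($index === false || $index < 1 || $index > count($items)) {
        return null;
    }
    return $items[$index - 1];
}

$post = pickItem($items, '2'); // e.g. pickItem($items, $_GET['feed'] ?? null)
if ($post !== null) {
    echo htmlspecialchars((string) $post->title) . "<br/>\n";
    echo htmlspecialchars((string) $post->description) . "<br/>\n";
}
```

Each listed post can then link to something like rss.php?feed=N, with N being its position in the feed. A position is only stable while the feed does not change, so a guid-based lookup may be the safer choice.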

Using Xpath to return multiple elements value

I have a result from a curl request from a page like this:
$result =
<div class="c-wrapper">
<a href="link-to-a-page.php">
<div class="c-content-img">
<img src="...">
</div>
<div class="c-link-data">
<div class="c-link-data-title">
<h4>TITLE</h4>
</div>
</div>
</a>
<div>
<div class="c-wrapper">
<div class="c-content-img">
<img src="...">
</div>
<div class="c-link-data">
<div class="c-link-data-title">
<h4>TITLE 2</h4>
</div>
</div>
<div>
Now I have to count how many c-wrapper elements are present.
This works correctly:
$doc = new DOMDocument();
@$doc->loadHTML($result);
$xpath = new DOMXPath($doc);
$divs = $xpath->query("//div[contains(@class, 'c-wrapper')]");
echo $divs->length; //<--- printed: 2
Then I have to print all the titles.
This also works correctly:
$titles = $xpath->query("//div[contains(@class, 'c-link-data-title')]/h4");
foreach ($titles as $title) {
echo $title->textContent . "<br>";
}
Now the part I don't know: the first div contains a link, but the second one does not. I'd like to change how I print the titles, like this:
foreach ($titles as $title) {
if ( $link_extracted !="" )
echo "<a href='" . $link_extracted . "'>" . $title->textContent . "</a><br>";
else
echo $title->textContent . "<br>";
}
How can I edit $titles = $xpath->query("//div[contains(@class, 'c-link-data-title')]/h4"); to achieve this?

Rather than doing this in separate stages, the code below finds the c-wrapper elements and then uses XPath again to find the parts you want inside each one. So in
$link_extracted = $xpath->evaluate("a/@href", $div)[0];
it looks for an <a> element relative to the $div element. The [0] is there because you want only the first match.
$doc = new DOMDocument();
@$doc->loadHTML($result);
$xpath = new DOMXPath($doc);
$divs = $xpath->query("//div[contains(@class, 'c-wrapper')]");
echo $divs->length;
foreach ( $divs as $div ) {
$link_extracted = $xpath->evaluate("a/@href", $div)[0];
$title = $xpath->evaluate("descendant::div[contains(@class, 'c-link-data-title')]/h4/text()"
, $div)[0];
if ( !empty($link_extracted->nodeValue) ) {
echo "<a href='" . $link_extracted->nodeValue . "'>" . $title->textContent . "</a><br>";
}
else {
echo $title->textContent . "<br>";
}
}
which for your test HTML gives...
2<a href='link-to-a-page.php'>TITLE</a><br>TITLE 2<br>

Extract full list of rss feed using simplexml_load_file

I'm using some code to extract the items from an eBay RSS feed. The only problem is that it extracts just one item.
I suspected the foreach was the cause, but after searching this whole site I couldn't find a solution. The feed URL outputs 8 items (entriesPerPage=8). If you open the feed, you'll see that the full XML is there, but the parser only gets one item.
<?php
$feedurl = "http://rest.ebay.com/epn/v1/find/item.rss?keyword=%28jewelry%2Ccraft%2Cclothing%2Cshoes%2Cdiy%29&sortOrder=BestMatch&programid=1&campaignid=5337945426&toolid=10039&listingType1=All&lgeo=1&topRatedSeller=true&hideDuplicateItems=true&entriesPerPage=8&feedType=rss";
$rss = simplexml_load_file($feedurl);
foreach ($rss->channel->item as $item) {
$link = $item->link;
$title = $item->title;
$description = $item->description;
}
?>
<div class="mainproductebayfloatright-bottom">
<div class="aroundebay">
<?
print "<div class=\"titleebay\">" . $title . "</div>";
print $description;
?>
</div>
</div>
?>

Move your HTML inside the loop. Currently your variables are overwritten on each iteration, so once the loop is over they hold only the values of the last XML item:
foreach ($rss->channel->item as $item) {
$link = $item->link;
$title = $item->title;
$description = $item->description;?>
<div class="mainproductebayfloatright-bottom">
<div class="aroundebay">
<?php
// simple title
print "<div class=\"titleebay\">" . $title . "</div>";
// title-link
print "<a href=\"" . $link . "\">" . $title . "</a>";
print $description;
?>
</div>
</div>
<?php
}
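
A self-contained sketch of the same idea, building the markup for every item inside the loop. The two-item feed is a hypothetical stand-in for the eBay URL, and escaping with htmlspecialchars() is an addition that the answer above does not include.

```php
<?php
// Hypothetical two-item feed standing in for the real eBay RSS URL.
$xml = <<<'XML'
<rss version="2.0"><channel>
<item><title>Silver ring</title><link>http://example.com/item/1</link><description>A ring</description></item>
<item><title>Craft kit &amp; tools</title><link>http://example.com/item/2</link><description>A kit</description></item>
</channel></rss>
XML;

$rss = simplexml_load_string($xml);

$html = '';
foreach ($rss->channel->item as $item) {
    // Cast to string and escape before placing the values into HTML.
    $title = htmlspecialchars((string) $item->title, ENT_QUOTES, 'UTF-8');
    $link  = htmlspecialchars((string) $item->link, ENT_QUOTES, 'UTF-8');

    // One block of markup per item, appended on every iteration.
    $html .= '<div class="aroundebay">'
           . '<div class="titleebay"><a href="' . $link . '">' . $title . '</a></div>'
           . "</div>\n";
}

echo $html;
```

Because the markup is appended on every pass, all items end up in the output rather than only the last one.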

simplehtmldom return the content of 2 divs inside another

Here is my HTML:
<div id="main">
<div id="child1">
child1
link1
</div>
<div id="child2">
child2
link2
</div>
</div>
I am trying to output (echo in PHP) child1 and child2 as links.
This is part of a HUGE file, so I need to loop through it.
This is what I have so far, but it's not working:
$linkObjs = $html->find('#main');
foreach ($linkObjs as $linkObj) {
$title = trim($linkObj->fildchild()->plaintext);
$link = trim($linkObj->fildchild()->href);
echo '<p class="titro" ><a href="' . $link . '" >' . $title . '</a></p>';
}

Not sure exactly which part of the elements you needed so here's everything dissected.
// Find all divs in #main
foreach ($html -> find('#main div') as $div)
{
// Find plain text in div
foreach ($div -> find('text') as $text)
{
echo $text;
}
// Find <a> tags and href
foreach ($div -> find('a') as $a)
{
echo $a -> href;
}
}

Accessing URL within Twitter XML Tag

I'm currently fetching tweets by geocode and displaying them with PHP. At the moment I'm unable to access the URL of the tweet itself, so I have opted to link to the user's Twitter profile instead.
The URL I'm trying to access is in the href attribute of the link tag:
<link type="text/html" href="http://twitter.com/pmhigham/statuses/166271863331368961" rel="alternate" >
Its location within the XML file is Feed->Entry->Link.
I have thought about using a regular expression, but I don't know how to go about it.
Here is my code.
<h3><?php
if (isset($column1Heading)) {
echo $column1Heading;
}
?></h3>
<p>
<strong>Tweets within 10 miles radius of London Eye</strong></br>
<?php
$feed = simplexml_load_file ('http://search.twitter.com/search.atom?geocode=51.5069999695%2C-0.142489999533%2C10.0mi%22london%22&lang=en&rpp=5');
if ($feed){foreach ($feed->entry as $item) {
echo '<a href=\'' . $item->author->uri. '\'>' . $item->title . '</a>', '</br>' . $item->published, '</br>';
}
}
else echo "Cannot find Twitter feed!"
?>
</br>
<strong>London Hashtag</strong>
</br>
<?php
$feed = simplexml_load_file ('http://search.twitter.com/search.atom?q=%23london&lang=en&rpp=5');
if ($feed){foreach ($feed->entry as $item) {
echo '<a href=\'' . $item->author->uri. '\'>' . $item->title . '</a>', '</br>' . $item->published, '</br>';
}
}
else echo "Cannot find Twitter feed!"
?>
</p>
</li>
<li class="col2">
<h3><?php
if (isset($column2Heading)) {
echo $column2Heading;
}
?></h3>
<p>
<?php
$feed = simplexml_load_file ('http://search.twitter.com/search.atom?geocode=40.744544%2C-74.027593%2C10.0mi%22manhattan%2C+ny%22&lang=en&rpp=5');
if ($feed){foreach ($feed->entry as $item) {
echo '<a href=\'' . $item->author->uri. '\'>' . $item->title . '</a>', '</br>' . $item->published, '</br>';
}
}
else echo "Cannot find Twitter feed!"
?>
<strong>NYC Hashtag</strong>
</br>
<?php
$feed = simplexml_load_file ('http://search.twitter.com/search.atom?q=%23nyc&lang=en&rpp=5');
if ($feed){foreach ($feed->entry as $item) {
echo '<a href=\'' . $item->author->uri. '\'>' . $item->title . '</a>', '</br>' . $item->published, '</br>';
}
}
else echo "Cannot find Twitter feed!"
?>
</p>
</li>
<li class="col3">
<h3><?php
if (isset($column3Heading)) {
echo $column3Heading;
}
?></h3>
<p>
<?php
$feed = simplexml_load_file ('http://search.twitter.com/search.atom?lang=en&geocode=48.861%2C2.336%2C5.0mi%22paris%2C+fr%22&rpp=5&lang=en');
if ($feed){foreach ($feed->entry as $item) {
echo '<a href=\'' . $item->author->uri. '\'>' . $item->title . '</a>', '</br>' . $item->published, '</br>';
}
}
else echo "Cannot find Twitter feed!"
?>
</p>
</li>
</ul>
<div class="clear"></div>
</div>
<?php require_once 'footer.php'; ?>

If I understand you correctly:
You can access the attribute with the built in attributes() method of SimpleXML.
$item->link->attributes()->href
Otherwise the regular expression should look like this
$string = '<link type="text/html" href="http://twitter.com/pmhigham/statuses/166271863331368961" rel="alternate" >';
preg_match('/href\=\"([^\"]+)\"/i', $string, $matches);
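
A minimal sketch of the attribute approach. The sample feed is hypothetical: it reuses the link tag from the question and adds a second, made-up link element to show why checking rel can matter when an entry has more than one link. The helper name alternateLink() is also an assumption.

```php
<?php
// Hypothetical Atom snippet; only the first <link> comes from the question.
$atom = <<<'XML'
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<title>Demo tweet</title>
<link type="text/html" href="http://twitter.com/pmhigham/statuses/166271863331368961" rel="alternate"/>
<link type="image/png" href="http://example.com/avatar.png" rel="image"/>
</entry>
</feed>
XML;

$feed = simplexml_load_string($atom);

/**
 * Return the href of the entry's rel="alternate" link, or null if none exists.
 */
function alternateLink(SimpleXMLElement $entry): ?string
{
    foreach ($entry->link as $link) {
        // Attributes are read with array syntax and cast to string.
        if ((string) $link['rel'] === 'alternate') {
            return (string) $link['href'];
        }
    }
    return null;
}

$urls = [];
foreach ($feed->entry as $entry) {
    $url = alternateLink($entry);
    $urls[] = $url;
    if ($url !== null) {
        echo '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">'
           . htmlspecialchars((string) $entry->title, ENT_QUOTES, 'UTF-8')
           . "</a><br/>\n";
    }
}
```

$link['href'] is shorthand for $link->attributes()->href; both read the same attribute.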
