php xpath get contents of div with class - php

What is the right syntax to use XPath to get the contents of all divs with a certain class? I seem to be getting the divs, but I don't know how to get their innerHTML.
$url = "http://www.vanityfair.com/politics/2012/10/michael-lewis-profile-barack-obama";
$ctx = stream_context_create(array('http' => array('timeout' => 10)));
libxml_use_internal_errors(TRUE);
$num = 0;
if ($html = @file_get_contents($url, false, $ctx)) {
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//div[@class="page-display"]') as $div) {
        $num++;
        echo "$num. ";
        //????
        echo "<br/>";
    }
    echo "<br/>FINISHED";
} else {
    echo "FAIL";
}

There is no HTML in the class="page-display" divs, so you're not going to get anything at all.
Did you mean to get the class="parbase cn_text" divs?
foreach ($xpath->query('//div[@class="parbase cn_text"]') as $div) {
    $num++;
    echo "$num. ";
    // textContent returns the text of the div with the tags stripped
    echo $div->textContent;
    echo "<br/>";
}
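Note that textContent drops the markup. If you really want the innerHTML of each div, one option is to serialize its child nodes with DOMDocument::saveHTML(). The following is a minimal sketch, not a definitive implementation: the innerHTML() helper is a hypothetical name (it is not part of the DOM extension), and the inline sample markup is an assumption that stands in for the fetched page.

```php
<?php
// Hypothetical helper: returns the serialized markup of a node's children.
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes only that node and its descendants.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

libxml_use_internal_errors(true);
$doc = new DOMDocument();
// Inline sample markup (an assumption) so the sketch runs without a network request.
$doc->loadHTML('<html><body><div class="page-display"><p>Hello <b>world</b></p></div></body></html>');
$xpath = new DOMXPath($doc);

foreach ($xpath->query('//div[@class="page-display"]') as $div) {
    echo innerHTML($div), "\n";
}
```

Inside the original loop you would call innerHTML($div) in place of $div->textContent.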

Related

Get Div Class data-files as php value

So, I have been trying to get the data-files attribute as a PHP variable, but have not been able to.
This is from the source code:
<div class="videoplayer" id="video1" data-files="files.mp4">
This is the code I'm having the most success with, but I don't get the data-files value.
<?php
$doc = new DOMDocument();
@$doc->loadHTML($url);
$doc->validateOnParse = true;
libxml_use_internal_errors(true);
$doc->loadHtml(file_get_contents($url));
libxml_use_internal_errors(false);
$classname="videoplayer";
$finder = new DomXPath($doc);
$result = $finder->query("//*[contains(@class, '$classname')]");
// There's actually something in the list
if($result->length > 0) {
$node = $result->item(0);
echo "{$node->nodeName} - {$node->nodeValue}";
}
else {
echo "Empty";
}
?>
Any ideas how to achieve this?
You get the value of attributes using DOMElement::getAttribute. So to get the data-files attribute, use:
$file = $node->getAttribute("data-files");
echo "$node->nodeName - $file";
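Put together, the lookup might look like the sketch below. It assumes the div from the question, inlined as a string so that it runs without fetching a page; the hasAttribute() guard is an addition for elements that lack the attribute.

```php
<?php
libxml_use_internal_errors(true);
$doc = new DOMDocument();
// The div from the question, inlined (an assumption) instead of loading a URL.
$doc->loadHTML('<div class="videoplayer" id="video1" data-files="files.mp4"></div>');

$finder = new DOMXPath($doc);
$result = $finder->query("//*[contains(@class, 'videoplayer')]");

$file = null;
if ($result->length > 0) {
    $node = $result->item(0);
    // getAttribute() returns an empty string for a missing attribute,
    // so check with hasAttribute() first to tell the two cases apart.
    if ($node->hasAttribute('data-files')) {
        $file = $node->getAttribute('data-files');
    }
}
echo $file ?? 'Empty', "\n";
```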

How to scrape page using simple htmldom and PHP?

I am trying to get the data in <div id="listing-page-cart-inner">, <div id="description text"> and <div id="tags">, but I am finding it difficult to extract.
Can anyone guide me? I am able to scrape the first div I mentioned, but not the others. The second foreach loop also takes a long time to run.
<?php
include_once('simple_html_dom.php');
$html = file_get_html('https://etsy.com/listing/107492702/');
//$val = $html->find('div[id=listing-page-cart-inner]');
function scraping_etsy() {
// create HTML DOM
$html = file_get_html('https://etsy.com/listing/107492702/');
foreach($html->find('div[id=listing-page-cart-inner]') as $article)
{
// get title
//$item['title'] = trim($article->find('h3', 0)->plaintext);
// get details
$item['details'] = trim($article->find('span', 0)->plaintext);
// get intro
//$lists = $articles->find('div[id=item-overview]');
$item['list1'] = trim($article->find('li',0)->plaintext);
$item['list2'] = trim($article->find('li',1)->plaintext);
$item['list3'] = trim($article->find('li',2)->plaintext);
$item['list4'] = trim($article->find('li',3)->plaintext);
$item['list5'] = trim($article->find('li',4)->plaintext);
/*foreach($article->find('li') as $al){
$item['lists'] =trim($al->find('li')->plaintext);
}*/
$ret[] = $item;
}
foreach($html->find('div[id=description]') as $content){
var_dump($content->find('text'));
// $item['content'] = trim($content->find('div[id=description]')->plaintext);
// $ret[] = $item;
}
// clean up memory
$html->clear();
unset($html);
return $ret ;
}
$ret = scraping_etsy();
var_dump($ret);
/*foreach($ret as $v) {
echo $v['title'].'<br>';
echo '<ul>';
echo '<li>'.$v['details'].'</li>';
echo '<li>Diggs: '.$v['diggs'].'</li>';
echo '</ul>';
}*/
?>
As for getting the children of those divs: once you have found the parent element, always pass an index, as in ->find('<the selector here>', 0), so that you get that single element rather than an array of matches.
$html = file_get_html('https://etsy.com/listing/107492702/');
// listings with description
$div = $html->find('div#listing-page-cart-inner', 0); // here index zero
$main_description = $div->find('h1', 0)->innertext;
echo $main_description . '<br/><br/>';
$div_item_overview = $div->find('div#item-overview ul.properties li');
foreach ($div_item_overview as $overview) {
echo $overview->innertext . '<br/>';
}
// tags
$div_tag = $html->find('div#tags', 0); // here index zero pointing to that element
$tags = array();
foreach($div_tag->find('ul li') as $li) {
$tags[] = $li->find('a', 0)->innertext;
}
echo '<pre>', print_r($tags, 1), '</pre>';
// description
$div_description = $html->find('div#description', 0)->plaintext; // here pointing to index zero
echo $div_description;
The easiest way to start is usually to use a third-party library, e.g. Symfony DomCrawler.
Using it is as easy as:
use Symfony\Component\DomCrawler\Crawler;
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<p class="message">Hello World!</p>
<p>Hello Crawler!</p>
</body>
</html>
HTML;
$crawler = new Crawler($html);
foreach ($crawler as $domElement) {
print $domElement->nodeName;
}
And you can use filters like
$crawler = $crawler->filter('body > p');

How can I get a child element by class using PHP DOMXPath?

I want to get a child element with a specific class from HTML. I have managed to find the elements by tag name, but I can't figure out how to get the child element with a specific class.
Here is my code:
<?php
$html = file_get_contents('myfileurl'); //get the html returned from the following url
$pokemon_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if (!empty($html)) { //if any html is actually returned
$pokemon_doc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$pokemon_xpath = new DOMXPath($pokemon_doc);
//get all the li's with class "content"
$pokemon_row = $pokemon_xpath->query("//li[@class='content']");
if ($pokemon_row->length > 0) {
foreach ($pokemon_row as $row) {
$title = $row->getElementsByTagName('h3');
foreach ($title as $a) {
echo "Title: ";
echo strip_tags($a->nodeValue). '<br>';
}
$links = $row->getElementsByTagName('a');
foreach ($links as $l) {
echo "Link: ";
echo strip_tags($l->nodeValue). '<br>';
}
$desc = $row->getElementsByTagName('span');
//I tried this but it didn't work; I want to get the span with class desc
//$desc = $row->query("//span[@class='desc']");
foreach ($desc as $d) {
echo "DESC: ";
echo strip_tags($d->nodeValue) . '<br><br>';
}
// echo $row->nodeValue . "<br/>";
}
}
}
?>
Please let me know in the comments if this is a duplicate that I couldn't find, or if you think the question is poor or not explained well.
Thanks.
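One approach that may work here: DOMElement has no query() method, so the commented-out attempt cannot run as written. DOMXPath::query() does accept a context node as its second argument, and an expression that starts with "." is evaluated relative to that node. The sketch below assumes a simplified page structure, since the real markup is not shown in the question.

```php
<?php
libxml_use_internal_errors(true);
$doc = new DOMDocument();
// Sample markup is an assumption; the real page structure is not shown.
$doc->loadHTML(
    '<ul>'
    . '<li class="content"><h3>First</h3><span class="meta">x</span><span class="desc">One</span></li>'
    . '<li class="content"><h3>Second</h3><span class="desc">Two</span></li>'
    . '</ul>'
);
$xpath = new DOMXPath($doc);

$descriptions = [];
foreach ($xpath->query("//li[@class='content']") as $row) {
    // The leading "." keeps the search inside $row;
    // a bare "//span" would scan the whole document again.
    foreach ($xpath->query(".//span[@class='desc']", $row) as $span) {
        $descriptions[] = trim($span->nodeValue);
    }
}
print_r($descriptions);
```

In the original loop this would replace the getElementsByTagName('span') call, with $pokemon_xpath in place of $xpath.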

Simple HTML Dom

Thanks for taking the time to read my post... I'm trying to extract some information from my website using Simple HTML Dom...
I have it reading from the HTML source ok, now I'm just trying to extract the information that I need. I have a feeling I'm going about this in the wrong way... Here's my script...
<?php
include_once('simple_html_dom.php');
// create doctype
$dom = new DOMDocument("1.0");
// display document in browser as plain text
// for readability purposes
//header("Content-Type: text/plain");
// create root element
$xmlProducts = $dom->createElement("products");
$dom->appendChild($xmlProducts);
$html = file_get_html('http://myshop.com/small_houses.html');
$html .= file_get_html('http://myshop.com/medium_houses.html');
$html .= file_get_html('http://myshop.com/large_houses.html');
//Define my variable for later
$product['image'] = '';
$product['title'] = '';
$product['description'] = '';
foreach($html->find('img') as $src){
if (strpos($src->src,"http://myshop.com") === false) {
$src->src = "http://myshop.com/$src->src";
}
$product['image'] = $src->src;
}
foreach($html->find('p[class*=imAlign_left]') as $description){
$product['description'] = $description->innertext;
}
foreach($html->find('span[class*=fc3]') as $title){
$product['title'] = $title->innertext;
}
echo $product['img'];
echo $product['description'];
echo $product['title'];
?>
I put echo's on the end for sake of testing...but I'm not getting anything... Any pointers would be a great HELP!
Thanks
Charles
file_get_html() returns a simple_html_dom object, and you cannot concatenate objects. Although the class has a __toString method, concatenating the results gives you a plain string that is more than likely corrupt in some way. Try the following:
<?php
include_once('simple_html_dom.php');
// create doctype
$dom = new DOMDocument("1.0");
// display document in browser as plain text
// for readability purposes
//header("Content-Type: text/plain");
// create root element
$xmlProducts = $dom->createElement("products");
$dom->appendChild($xmlProducts);
$pages = array(
'http://myshop.com/small_houses.html',
'http://myshop.com/medium_houses.html',
'http://myshop.com/large_houses.html'
);
foreach($pages as $page)
{
$product = array();
$source = file_get_html($page);
foreach($source->find('img') as $src)
{
if (strpos($src->src,"http://myshop.com") === false)
{
$product['image'] = "http://myshop.com/$src->src";
}
}
foreach($source->find('p[class*=imAlign_left]') as $description)
{
$product['description'] = $description->innertext;
}
foreach($source->find('span[class*=fc3]') as $title)
{
$product['title'] = $title->innertext;
}
//debug purposes!
echo "Current Page: " . $page . "\n";
print_r($product);
echo "\n\n\n"; //Clear separator
}
?>

creating xml with php and 'if changed' option

I am creating an address book xml feed from a MySQL database, everything is working fine, but I have a section tag which gets the first letter of the surname and pops it in that tag. I only want this to display if it has changed, but for some reason my brain isn't working this morning!
Current code:
<?php
echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
echo "<data>";
do {
$char = $row_fetch["surname_add"];
$section = $char[0];
//if(changed???){
echo "<section><character>".$section."</character>";
//}
echo "<person>";
echo "<name>".$row_fetch["firstname_add"]." ".$row_fetch["surname_add"]."</name>";
echo "<title>".$row_fetch["title_add"]."</title>";
echo "</person>";
//if(){
echo "</section>";
//}
} while ($row_fetch = mysql_fetch_assoc($fetch));
echo "</data>";
?>
Any help on this welcome, don't know why I can't think of it!
If you still want to generate the XML manually, something like this should work:
$section = "NoSectionStartedYet";
while ($row_fetch = mysql_fetch_assoc($fetch)) {
$char = $row_fetch["surname_add"];
if ($char[0] != $section)
{
if ($section != "NoSectionStartedYet")
{
echo "</section>";
}
$section = $char[0];
echo "<section><character>".$section."</character>";
}
echo "<person>";
echo "<name>".$row_fetch["firstname_add"]." ".$row_fetch["surname_add"]."</name>";
echo "<title>".$row_fetch["title_add"]."</title>";
echo "</person>";
}
echo "</section>";
To be sure that your XML is valid, it is better to build a DOM tree. Here is an example from the PHP manual:
<?php
$doc = new DOMDocument;
$node = $doc->createElement("para");
$newnode = $doc->appendChild($node);
echo $doc->saveXML();
?>
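Applied to the address book, the same "only when the letter changes" check could be combined with DOMDocument roughly as follows. This is a sketch under stated assumptions: the $rows array is made-up sample data that stands in for the database result, and the rows are assumed to be sorted by surname already.

```php
<?php
// Sample rows (an assumption) in place of the database result.
// They must already be sorted by surname for the grouping to work.
$rows = [
    ['firstname_add' => 'Ann', 'surname_add' => 'Adams', 'title_add' => 'Dr'],
    ['firstname_add' => 'Bob', 'surname_add' => 'Avery', 'title_add' => 'Mr'],
    ['firstname_add' => 'Cy',  'surname_add' => 'Brown', 'title_add' => 'Mr'],
];

$doc = new DOMDocument('1.0', 'UTF-8');
$data = $doc->appendChild($doc->createElement('data'));

$current = null; // first letter of the section currently being filled
$section = null; // the <section> element that new persons are appended to

foreach ($rows as $row) {
    // substr() is byte-based; mb_substr() would be needed for multibyte surnames.
    $letter = substr($row['surname_add'], 0, 1);
    if ($letter !== $current) {
        // The letter changed, so start a new <section>.
        $current = $letter;
        $section = $data->appendChild($doc->createElement('section'));
        $character = $section->appendChild($doc->createElement('character'));
        $character->appendChild($doc->createTextNode($letter));
    }
    $person = $section->appendChild($doc->createElement('person'));
    // createTextNode() escapes characters such as & and < on output.
    $name = $person->appendChild($doc->createElement('name'));
    $name->appendChild($doc->createTextNode($row['firstname_add'] . ' ' . $row['surname_add']));
    $title = $person->appendChild($doc->createElement('title'));
    $title->appendChild($doc->createTextNode($row['title_add']));
}

echo $doc->saveXML();
```

Because each section is a node rather than a pair of echoed tags, there is no closing tag to forget, and an empty result simply produces an empty data element.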
