Thanks for taking the time to read my post... I'm trying to extract some information from my website using Simple HTML Dom...
I have it reading from the HTML source ok, now I'm just trying to extract the information that I need. I have a feeling I'm going about this in the wrong way... Here's my script...
<?php
include_once('simple_html_dom.php');
// create doctype
$dom = new DOMDocument("1.0");
// display document in browser as plain text
// for readability purposes
//header("Content-Type: text/plain");
// create root element
$xmlProducts = $dom->createElement("products");
$dom->appendChild($xmlProducts);
$html = file_get_html('http://myshop.com/small_houses.html');
$html .= file_get_html('http://myshop.com/medium_houses.html');
$html .= file_get_html('http://myshop.com/large_houses.html');
//Define my variable for later
$product['image'] = '';
$product['title'] = '';
$product['description'] = '';
foreach($html->find('img') as $src){
if (strpos($src->src,"http://myshop.com") === false) {
$src->src = "http://myshop.com/$src->src";
}
$product['image'] = $src->src;
}
foreach($html->find('p[class*=imAlign_left]') as $description){
$product['description'] = $description->innertext;
}
foreach($html->find('span[class*=fc3]') as $title){
$product['title'] = $title->innertext;
}
echo $product['img'];
echo $product['description'];
echo $product['title'];
?>
I put echo's on the end for sake of testing...but I'm not getting anything... Any pointers would be a great HELP!
Thanks
Charles
file_get_html() returns a simple_html_dom object, and you cannot concatenate objects. Although the object has a __toString method, using .= converts everything to one plain string, which has no find() method and is more than likely corrupt in some way. Try the following:
<?php
include_once('simple_html_dom.php');
// create doctype
$dom = new DOMDocument("1.0");
// display document in browser as plain text
// for readability purposes
//header("Content-Type: text/plain");
// create root element
$xmlProducts = $dom->createElement("products");
$dom->appendChild($xmlProducts);
$pages = array(
'http://myshop.com/small_houses.html',
'http://myshop.com/medium_houses.html',
'http://myshop.com/large_houses.html'
);
foreach($pages as $page)
{
$product = array();
$source = file_get_html($page);
foreach($source->find('img') as $src)
{
$product['image'] = $src->src;
if (strpos($product['image'], "http://myshop.com") === false)
{
// relative path: prefix it with the shop's base URL
$product['image'] = "http://myshop.com/" . $src->src;
}
}
foreach($source->find('p[class*=imAlign_left]') as $description)
{
$product['description'] = $description->innertext;
}
foreach($source->find('span[class*=fc3]') as $title)
{
$product['title'] = $title->innertext;
}
//debug purposes!
echo "Current Page: " . $page . "\n";
print_r($product);
echo "\n\n\n"; //Clear separator
}
?>
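To see why the question's `$html .= file_get_html(...)` breaks, here is a minimal sketch. FakeDom is a hypothetical stand-in class written for this example; it is not the real simple_html_dom class, it only shares the idea of having a __toString method.

```php
<?php
// Hypothetical stand-in for a parsed document (NOT the real simple_html_dom class).
class FakeDom
{
    private $markup;

    public function __construct($markup)
    {
        $this->markup = $markup;
    }

    // Pretend "find": counts opening tags of the given name.
    public function find($tag)
    {
        return substr_count($this->markup, "<$tag");
    }

    public function __toString()
    {
        return $this->markup;
    }
}

$html = new FakeDom('<img src="a.png">');
// .= casts both operands to strings, so $html stops being an object here.
$html .= new FakeDom('<img src="b.png">');

var_dump(is_object($html)); // bool(false)
var_dump(is_string($html)); // bool(true)
```

After the concatenation, calling $html->find('img') would be a fatal error because $html is a string, which is why each page is parsed and queried separately in the loop above.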
In my table, I have a row that contains a string like this:
<p>hello</p><p>this is patrick</p><p><img src="/assets/img/myface.jpg" width="320" height="320"/></p>
and I want to give the <img> tag an alt attribute. I've got quite close now, but somehow my output still shows 2 <img> tags although the string only has 1. Can anyone tell me what I'm doing wrong?
This is my code so far:
$str = '<p>hello</p><p>this is patrick</p><p><img src="/assets/img/myface.jpg" width="320" height="320"/></p>';
$new_html = '';
$dom = new DOMDocument();
$dom->loadHTML($str);
$content = $dom->getElementsByTagName('*');
foreach ($content as $i => $node)
{
if ($node->nodeName == 'html' || $node->nodeName == 'body')
{
continue; // dont need to process these tags, right?
}
if ($node->nodeName == 'img')
{
$img_src = $node->getAttribute('src');
$path_arr = explode('/', $img_src);
$filename = $path_arr[count($path_arr)-1]; // myface.jpg
$alt = 'blah';
$node->setAttribute('alt', $alt);
}
echo $dom->saveXML($node);
}
$content = $dom->getElementsByTagName('img');
foreach ($content as $node) {
$img_src = $node->getAttribute('src');
$filename = basename($img_src);
$node->setAttribute('alt', $filename);
}
echo $dom->saveHTML();
Loop only through images with $content = $dom->getElementsByTagName('img');
Move $dom->saveHTML(); after the loop.
Get filename with $filename = basename($img_src);
The slightly changed code below does the work. It only gets the img tags and saves the HTML outside the loop. Note that I changed the way that HTML was loaded, to not include the wrapper tags.
<?php
$str = '<p>hello</p><p>this is patrick</p><p><img src="/assets/img/myface.jpg" width="320" height="320"/></p>';
$new_html = '';
$dom = new DOMDocument();
$dom->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$content = $dom->getElementsByTagName('img');
foreach ($content as $i => $node)
{
$img_src = $node->getAttribute('src');
$path_arr = explode('/', $img_src);
$filename = $path_arr[count($path_arr)-1]; // myface.jpg
$alt = 'blah';
$node->setAttribute('alt', $alt);
}
echo $dom->saveHTML();
The problem is that when you use
echo $dom->saveXML($node);
in the loop, it will output for various tags and so the output is not the end result, but a combination of other parts of the document.
Try changing it to
echo $node->nodeName."=>".$dom->saveXML($node).PHP_EOL;
to see what it does.
You could just remove the current echo and add
echo $dom->saveXML();
after the end of the loop.
Alternatively, if you just want to process the <img> tags, you can limit the loop more specifically...
$content = $dom->getElementsByTagName('img');
foreach ($content as $i => $node)
{
$img_src = $node->getAttribute('src');
$path_arr = explode('/', $img_src);
$filename = $path_arr[count($path_arr)-1]; // myface.jpg
$alt = 'blah';
$node->setAttribute('alt', $alt);
}
echo $dom->saveXML();
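The duplication can be checked with a small sketch. It uses a shortened version of the string from the question and simply counts how many times `<img` appears in each kind of output; the variable $perNode is introduced only for this illustration.

```php
<?php
$str = '<p>hello</p><p><img src="/assets/img/myface.jpg"/></p>';

$dom = new DOMDocument();
$dom->loadHTML($str);

// Serialise every element separately, as the loop in the question does.
$perNode = '';
foreach ($dom->getElementsByTagName('*') as $node) {
    if ($node->nodeName == 'html' || $node->nodeName == 'body') {
        continue;
    }
    $perNode .= $dom->saveXML($node);
}

// The <img> is serialised once inside its parent <p> and once on its own.
echo substr_count($perNode, '<img'), "\n";         // 2
// Serialising the whole document once contains it a single time.
echo substr_count($dom->saveHTML(), '<img'), "\n"; // 1
```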
I am trying to get the data in <div id="listing-page-cart-inner">, <div id="description text">, and <div id="tags">, but I am finding it difficult to extract the data.
Can anyone guide me? I am able to scrape the first div that I mentioned, but not the other divs. When I loop through the second foreach, it takes much longer.
<?php
include_once('simple_html_dom.php');
$html = file_get_html('https://etsy.com/listing/107492702/');
//$val = $html->find('div[id=listing-page-cart-inner]');
function scraping_etsy() {
// create HTML DOM
$html = file_get_html('https://etsy.com/listing/107492702/');
foreach($html->find('div[id=listing-page-cart-inner]') as $article)
{
// get title
//$item['title'] = trim($article->find('h3', 0)->plaintext);
// get details
$item['details'] = trim($article->find('span', 0)->plaintext);
// get intro
//$lists = $articles->find('div[id=item-overview]');
$item['list1'] = trim($article->find('li',0)->plaintext);
$item['list2'] = trim($article->find('li',1)->plaintext);
$item['list3'] = trim($article->find('li',2)->plaintext);
$item['list4'] = trim($article->find('li',3)->plaintext);
$item['list5'] = trim($article->find('li',4)->plaintext);
/*foreach($article->find('li') as $al){
$item['lists'] =trim($al->find('li')->plaintext);
}*/
$ret[] = $item;
}
foreach($html->find('div[id=description]') as $content){
var_dump($content->find('text'));
// $item['content'] = trim($content->find('div[id=description]')->plaintext);
// $ret[] = $item;
}
// clean up memory
$html->clear();
unset($html);
return $ret ;
}
$ret = scraping_etsy();
var_dump($ret);
/*foreach($ret as $v) {
echo $v['title'].'<br>';
echo '<ul>';
echo '<li>'.$v['details'].'</li>';
echo '<li>Diggs: '.$v['diggs'].'</li>';
echo '</ul>';
}*/
?>
As for getting the children of those divs, just remember that once you have found the parent element, you should always pass an index, as in ->find('<the selector here>', 0), so that you get that single element rather than an array of matches.
$html = file_get_html('https://etsy.com/listing/107492702/');
// listings with description
$div = $html->find('div#listing-page-cart-inner', 0); // here index zero
$main_description = $div->find('h1', 0)->innertext;
echo $main_description . '<br/><br/>';
$div_item_overview = $div->find('div#item-overview ul.properties li');
foreach ($div_item_overview as $overview) {
echo $overview->innertext . '<br/>';
}
// tags
$div_tag = $html->find('div#tags', 0); // here index zero pointing to that element
$tags = array();
foreach($div_tag->find('ul li') as $li) {
$tags[] = $li->find('a', 0)->innertext;
}
echo '<pre>', print_r($tags, 1), '</pre>';
// description
$div_description = $html->find('div#description', 0)->plaintext; // here pointing to index zero
echo $div_description;
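The same "pick one element, then search inside it" idea can be sketched without Simple HTML DOM, using PHP's built-in DOMDocument and DOMXPath instead (a swapped-in technique, not the library used above). The markup is hypothetical and only mimics the ids mentioned in the question.

```php
<?php
// Hypothetical markup that only mimics the ids used above.
$markup = '<div id="tags"><ul><li><a href="#">wood</a></li><li><a href="#">house</a></li></ul></div>'
        . '<div id="description">Hand made.</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings quiet
$doc->loadHTML($markup);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// item(0) plays the role of the index 0 in ->find('div#tags', 0).
$divTag = $xpath->query('//div[@id="tags"]')->item(0);

$tags = array();
if ($divTag !== null) {
    // The leading dot keeps the search inside $divTag.
    foreach ($xpath->query('.//ul/li/a', $divTag) as $a) {
        $tags[] = trim($a->textContent);
    }
}

$descNode = $xpath->query('//div[@id="description"]')->item(0);
$description = $descNode !== null ? trim($descNode->textContent) : '';

print_r($tags);
echo $description, "\n";
```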
The easiest way to start is usually to use a third-party library, e.g. Symfony DomCrawler.
Its usage is as easy as:
use Symfony\Component\DomCrawler\Crawler;
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<p class="message">Hello World!</p>
<p>Hello Crawler!</p>
</body>
</html>
HTML;
$crawler = new Crawler($html);
foreach ($crawler as $domElement) {
print $domElement->nodeName;
}
And you can use filters like
$crawler = $crawler->filter('body > p');
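If installing a library is not an option, roughly the same selection as 'body > p' can be sketched with the built-in DOMXPath class. This is a swapped-in alternative rather than part of DomCrawler, and the $texts array is introduced only for the example.

```php
<?php
$html = '<!DOCTYPE html><html><body>'
      . '<p class="message">Hello World!</p><p>Hello Crawler!</p>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// /html/body/p selects only <p> elements that are direct children of <body>.
$texts = array();
foreach ($xpath->query('/html/body/p') as $p) {
    $texts[] = $p->textContent;
}

print_r($texts);
```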
I want to get a child element with a specific class from HTML. I have managed to find the element using its tag name, but I can't figure out how to get the child element with a specific class.
Here is my CODE:
<?php
$html = file_get_contents('myfileurl'); //get the html returned from the following url
$pokemon_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if (!empty($html)) { //if any html is actually returned
$pokemon_doc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$pokemon_xpath = new DOMXPath($pokemon_doc);
//get all the li elements with class "content"
$pokemon_row = $pokemon_xpath->query("//li[@class='content']");
if ($pokemon_row->length > 0) {
foreach ($pokemon_row as $row) {
$title = $row->getElementsByTagName('h3');
foreach ($title as $a) {
echo "Title: ";
echo strip_tags($a->nodeValue). '<br>';
}
$links = $row->getElementsByTagName('a');
foreach ($links as $l) {
echo "Link: ";
echo strip_tags($l->nodeValue). '<br>';
}
$desc = $row->getElementsByTagName('span');
//I tried that but didnt work..... iwant to get the span with class desc
//$desc = $row->query("//span[@class='desc']");
foreach ($desc as $d) {
echo "DESC: ";
echo strip_tags($d->nodeValue) . '<br><br>';
}
// echo $row->nodeValue . "<br/>";
}
}
}
?>
Please let me know in the comments if this is a duplicate (I couldn't find one), or if you think the question is not good or not explained well.
Thanks.
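One possible approach, sketched under assumptions: DOMElement has no query() method, so the lookup has to go through the DOMXPath object, with the current row passed as the context node and a relative path starting with a dot. The markup below is hypothetical and stands in for the real page.

```php
<?php
// Hypothetical markup standing in for the real page.
$html = '<ul><li class="content"><h3>Bulbasaur</h3><a href="/1">More</a>'
      . '<span class="desc">Grass type</span><span class="other">ignored</span></li></ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$descriptions = array();
foreach ($xpath->query("//li[@class='content']") as $row) {
    // The second argument limits the search to this row; ".//" means "below this node".
    foreach ($xpath->query(".//span[@class='desc']", $row) as $span) {
        $descriptions[] = trim($span->textContent);
    }
}

print_r($descriptions);
```

Note that @class='desc' only matches when the class attribute is exactly "desc"; an element with several classes would need a contains()-style test instead.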
I have code that tries to extract the Event SKU from the Robot Events page; here is an example. The code that I am using doesn't find any SKU on the page. The SKU is on line 411, in a div with the class "product-sku". My code doesn't even find the div on the page and just downloads all the events. Here is my code:
<?php
require('simple_html_dom.php');
$html = new simple_html_dom();
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = file_get_html($event[4]);
$html->load($htmldown);
echo "Downloaded";
foreach ($html->find('div[class=product-sku]') as $row) {
$sku = $row->plaintext;
echo $sku;
}
}
?>
Can anyone help me fix my code?
This code uses PHP's DOMDocument class. It works for the sample HTML below. Please try it.
// new dom object
$dom = new DOMDocument();
// HTML string
$html_string = '<html>
<body>
<div class="product-sku1" name="div_name">The this the div content product-sku</div>
<div class="product-sku2" name="div_name">The this the div content product-sku</div>
<div class="product-sku" name="div_name">The this the div content product-sku</div>
</body>
</html>';
//load the html
$html = $dom->loadHTML($html_string);
//keep white space (TRUE is the default)
$dom->preserveWhiteSpace = TRUE;
//get all the divs by tag name
$divs = $dom->getElementsByTagName('div');
// loop over the all DIVs
foreach ($divs as $div) {
if ($div->hasAttributes()) {
foreach ($div->attributes as $attribute){
if($attribute->name === 'class' && $attribute->value == 'product-sku'){
// Print DIV class name and content
echo 'DIV Class Name: '.$attribute->value.PHP_EOL;
echo 'DIV Content: '.$div->nodeValue.PHP_EOL;
}
}
}
}
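As a possible shortcut to the nested attribute loop above, the same filter can be sketched as a single XPath expression. The sample markup is shortened from the one above, and the $found array is introduced only for this illustration.

```php
<?php
$html_string = '<html><body>
<div class="product-sku1">first</div>
<div class="product-sku2">second</div>
<div class="product-sku">third</div>
</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html_string);
$xpath = new DOMXPath($dom);

// Matches "product-sku" as a whole class token, so product-sku1 and product-sku2 are skipped.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' product-sku ')]";

$found = array();
foreach ($xpath->query($query) as $div) {
    $found[] = trim($div->nodeValue);
}

print_r($found);
```

Unlike the exact comparison in the loop above, this form would also match a div whose class attribute lists several classes, one of which is product-sku.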
I would use a regex (regular expression) to accomplish pulling skus out.
The regex:
preg_match('~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~',$html,$matches);
See php regex docs.
New code:
<?php
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = curl_init($event[4]);
curl_setopt($htmldown, CURLOPT_RETURNTRANSFER, true);
$html=curl_exec($htmldown);
curl_close($htmldown);
echo "Downloaded";
preg_match('~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~',$html,$matches);
foreach ($matches as $row) {
echo $row;
}
}
?>
And actually in this case (using that webpage) being that there is only one sku...
instead of:
foreach ($matches as $row) {
echo $row;
}
You could just use: echo $matches[1]; (The reason for array index 1 is because the whole regex pattern plus the sku will be in $matches[0] but just the subgroup containing the sku is in $matches[1].)
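A minimal sketch of the difference between the two indexes. The input string is hypothetical: it is shaped like the markup the pattern expects, and the SKU value in it is made up.

```php
<?php
// Hypothetical snippet shaped like the markup the pattern expects; the SKU value is made up.
$html = '<div class="product-sku"><b>Event Code:</b> RE-TEST-0001</div>';

$sku = null;
if (preg_match('~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~', $html, $matches)) {
    // $matches[0] is the whole matched text, $matches[1] only the captured group.
    $sku = trim($matches[1]);
    echo $sku, "\n";
}
```

Checking the return value of preg_match() first avoids reading $matches[1] when the pattern did not match.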
try to use
require('simple_html_dom.php');
$html = new simple_html_dom();
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = file_get_html($event[4]); // $event[4] holds a URL, so fetch and parse that page
echo "Downloaded";
foreach ($htmldown->find('div[class=product-sku]') as $row) {
$sku = $row->plaintext;
echo $sku;
}
}
And if the class "product-sku" is only used on divs, then you can use
$htmldown->find('.product-sku')
<?
$file = "http://www.google.com";
$doc = new DOMDocument();
echo @$doc->loadHTML(file_get_contents($file));
$element = $doc->getElementsbyTagName('span');
echo trim($element->item(0)->nodeValue);
echo trim($element->item(0)->textContent);
if (!is_null($element)) {
$content = $element->nodeValue;
if (empty($content)) {
$content = $element->textContent;
}
echo $content . "\n";
}
?>
I am trying to test this script and am wondering why I can't parse Google. If you look at the page source, hit Ctrl+F, and type in span, there is obviously a span tag. Why isn't it giving me results?
<?php
$file = 'http://www.google.com';
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($file));
$element = $doc->getElementsByTagName('span');
if (0 != $element->length)
{
$content = trim($element->item(0)->nodeValue);
if (empty($content))
{
$content = trim($element->item(0)->textContent);
}
echo $content . "\n";
}
?>
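The length check matters because getElementsByTagName() returns a DOMNodeList even when nothing matches, so an is_null() test on it never fires, and the list itself has no nodeValue. A small sketch with made-up markup that contains no span:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>no spans here</p></body></html>');

$spans = $doc->getElementsByTagName('span');

var_dump(is_null($spans));  // bool(false): the list object always exists
var_dump($spans->length);   // int(0)
var_dump($spans->item(0));  // NULL: there is no first element to read from
```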
Not 100% sure, but doesn't allow_url_fopen need to be enabled in php.ini for this to work?
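A quick way to check that setting from a script, as a sketch; the $enabled variable is introduced only for this example.

```php
<?php
// ini_get() returns the setting as a string; normalise it to a real boolean.
$enabled = filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);

if ($enabled) {
    echo "allow_url_fopen is on\n";
} else {
    echo "allow_url_fopen is off, so file_get_contents() cannot open URLs\n";
}
```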