I'm trying to learn the simple_html_dom syntax, but I'm not having much luck. Could someone show me an example based on this:
<div id="container">
<span>Apples</span>
<span>Oranges</span>
<span>Bananas</span>
</div>
How would I return just the values Apples, Oranges and Bananas?
Can I simply use the PHP simple_html_dom class, or will I also have to use XPath, cURL, etc.?
UPDATE:
I was able to get this to work, but I'm not convinced it's the most efficient way of getting what I need:
foreach ($html->find('div[id=container]') as $div) {
    foreach ($div->find('span') as $element) {
        echo $element->innertext . '<br>';
    }
}
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Your suggestion is correct:
foreach ($html->find('div[id=container]') as $div) {
    foreach ($div->find('span') as $element) {
        echo $element->innertext . '<br>';
    }
}
More simply:
foreach ($html->find('div#container span') as $element) {
    echo $element->innertext;
}
That selector matches any span that is a descendant of the div with id container.
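Put together as a runnable sketch (assuming simple_html_dom.php is on your include path, and loading the question's markup from a string instead of a URL):
<?php
require('simple_html_dom.php');

// The markup from the question, loaded from a string for the example
$html = str_get_html('<div id="container">
    <span>Apples</span>
    <span>Oranges</span>
    <span>Bananas</span>
</div>');

// Every <span> that descends from the div with id "container"
foreach ($html->find('div#container span') as $element) {
    echo $element->innertext . '<br>'; // Apples, Oranges, Bananas
}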
I am building a web scraper using the simple HTML DOM parser, but I ran into some issues figuring out how to store HTML elements from a web page as objects. I would like to take an input URL and turn all of the HTML elements (tags, divs, fields, etc.) into objects that get output onto a page. I have written some code that currently works when I type in a URL, but the output is not what I am trying to achieve. Below I have attached the code I have worked out so far, and I am looking for a way to achieve what I described.
I have tried finding all images and links as well as creating a DOM object. I can't seem to figure out how to convert these elements into objects that I can use to learn more about a website, and possibly store that data in a database.
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
$url = $_POST["url"];
$html = file_get_html($url);
echo $html;
// Find all images
$element = new simple_html_dom();
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
$element = new simple_html_dom();
foreach($html->find('a') as $element)
echo $element->href . '<br>';
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a URL
$html->load_file($url);
echo $html;
?>
I am expecting an output of objects, but instead I am getting the actual images and links from the web page printed in the output.
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
// $url = $_POST["url"];
$url = 'Your-Url'; // Your url: 'www.example.com'
$html = file_get_html($url);
// Find all images
$images = []; //create empty images array
foreach($html->find('img') as $element){
$images[] = $element->src; //Store the found src values in the images array
}
echo '<pre>Output $images: '; var_dump($images); echo '</pre>'; //An output from the images array
// Find all links
$links = []; //create empty links array
foreach($html->find('a') as $element){
$links[] = $element->href; //Store the found href values in the links array
}
echo '<pre>Output $links: '; var_dump($links); echo '</pre>'; //An output from the links array
The echoes display the arrays filled with the src and href values of the img and a tags from your page.
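If the goal is really to end up with objects you can store in a database rather than printed markup, one option is to collect only the attributes you care about into plain objects. A rough sketch, assuming https://www.example.com stands in for your URL:
<?php
require('simple_html_dom.php');

$url = 'https://www.example.com'; // placeholder URL
$html = file_get_html($url);

$elements = [];
foreach ($html->find('img, a') as $element) {
    // Keep only the attributes of interest; a missing attribute comes back as false
    $elements[] = (object) [
        'tag'  => $element->tag,
        'src'  => $element->src ?: null,
        'href' => $element->href ?: null,
        'text' => trim($element->plaintext),
    ];
}

// $elements is now an array of plain objects; json_encode() each one if you
// want to store it in a database column
echo '<pre>'; var_dump($elements); echo '</pre>';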
I wrote the code below to get all unique links from a URL:
include_once ('simple_html_dom.php');
$html = file_get_html('http://www.example.com');
foreach ($html->find('a') as $element) {
    $input = array($element->href = $element->href . '<br />');
    print_r(array_unique($input));
}
But I really can't understand why it shows the duplicated links too! Is there any problem with the function array_unique and simple html dom?
There's another thing I guess is related to the problem: when you execute this, you see that all of the links it extracted are in one key, I mean like this:
array(key => all values)
Is there anyone who can solve this?
Your loop builds a brand-new one-element array on every iteration, so array_unique() only ever sees a single link at a time and prints a separate array for each one. I believe you want it more like this:
$temp = array();
foreach($html->find('a') as $element) {
$temp[] = $element->href;
}
echo '<pre>' . print_r(array_unique($temp), true) . '</pre>';
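Alternatively, you can de-duplicate while you collect by using the href itself as the array key; since keys are unique, repeats simply overwrite each other. A small sketch:
$unique = array();
foreach ($html->find('a') as $element) {
    $unique[$element->href] = $element->href; // duplicate hrefs collapse onto one key
}
echo '<pre>' . print_r(array_values($unique), true) . '</pre>';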
I'm aware that there is an Amazon API for pulling their data, but I'm just trying to learn to scrape for my own knowledge, and pulling data from Amazon seems like a good test.
<?php
ini_set('display_errors',1);
ini_set('display_startup_errors',1);
error_reporting(-1);
include('../includes/simple_html_dom.php');
$html = file_get_html('http://www.amazon.co.uk/gp/product/B00AZYBFGY/ref=s9_simh_gw_p86_d0_i1?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1MP0FXRF8V70NWAN3ZWW&pf_r$');
foreach($html->find('a-section') as $element) {
echo $element->plaintext . '<br />';
}
echo $ret;
?>
All I'm trying to do is pull the product description from the link, but I'm not sure why it isn't working. I'm not getting any errors or any data at all, really.
The class for the product description is simply productDescriptionWrapper, so in your sample code use that CSS selector:
foreach($html->find('.productDescriptionWrapper') as $element) {
echo $element->plaintext . '<br />';
}
simplehtmldom uses CSS selectors very similar to jQuery's: if you want all divs, say ->find('div'); if you want all anchors with a class of 'hotProduct', say ->find('a.hotProduct'); and so on.
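For example (a sketch; hotProduct and productDescriptionWrapper are just the class names used above):
// All <div> elements on the page
$divs = $html->find('div');

// All anchors with the class "hotProduct"
$hotLinks = $html->find('a.hotProduct');

// Pass an index as the second argument to get only the first match (or null)
$description = $html->find('div.productDescriptionWrapper', 0);
if ($description) {
    echo $description->plaintext;
}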
It doesn't work because the product description is being added by JavaScript into an iFrame.
First, check whether you actually get any HTML back from Amazon; it might block your request.
$url = "https://www.amazon.co.uk/gp/product/B00AZYBFGY/ref=s9_simh_gw_p86_d0_i1?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1MP0FXRF8V70NWAN3ZWW&pf_r$"
$htmlContent = file_get_contents($url);
echo $htmlContent;
$html = str_get_html($htmlContent);
Note the https://; you have http://, and maybe that is why you get nothing.
Once you get HTML, you can go forward.
Try different selectors:
foreach ($html->find('div[id=productDescription]') as $element) {
    echo $element->plaintext . '<br />';
}

foreach ($html->find('div[id=content]') as $element) {
    echo $element->plaintext . '<br />';
}

foreach ($html->find('div[id=feature-bullets]') as $element) {
    echo $element->plaintext . '<br />';
}
The echo should display the page itself, maybe with some missing CSS. If the HTML is in place, you can try the selectors above.
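Amazon also tends to reject requests that don't look like they come from a browser, so it can help to send a User-Agent header when fetching the page. A sketch using a stream context (the header value here is just an example):
$context = stream_context_create([
    'http' => [
        'header' => "User-Agent: Mozilla/5.0 (X11; Linux x86_64)\r\n",
    ],
]);

$htmlContent = file_get_contents($url, false, $context);
if ($htmlContent === false) {
    die('Request failed or was blocked');
}
$html = str_get_html($htmlContent);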
I want to extract all text links from a webpage using the simplehtmldom class, but I don't want image links.
<?
foreach($html->find('a[href]') as $element)
echo $element->href . '<br>';
?>
The code above shows all anchor links containing an href attribute. Given markup like this:
<a href="/contact">contact</a>
<a href="/about">about</a>
<a href="/home"><img src="logo.png" /></a>
I want only /contact and /about, not /home, because it contains an image instead of text.
<?php
foreach ($html->find('a[href]') as $element) {
    if (trim($element->plaintext) === '') {
        continue; // skip anchors with no visible text (e.g. image-only links)
    }
    echo $element->href . '<br>';
}
?>
<?php
foreach ($html->find('a[href]') as $element) {
    // Check the link's inner HTML (not its href) for an <img> tag
    if (!preg_match('%<img%', $element->innertext)) {
        echo $element->href . '<br>';
    }
}
?>
It is possible to do that with a CSS-style selector in phpQuery:
$html->find('a:not(:has(img))')
This is not a feature that is likely to ever come to simple_html_dom, though.
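You can get much the same effect with simple_html_dom by filtering inside the loop, since find() also works on an individual node. A sketch:
foreach ($html->find('a[href]') as $element) {
    // Skip any anchor that contains an <img> somewhere inside it
    if ($element->find('img', 0)) {
        continue;
    }
    echo $element->href . '<br>';
}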
I'm using SimpleHTMLDOM Parser to scrape a website, and I would like to know if there's any error handling method. For example, if the link is broken, there is no use advancing in the code and searching the document.
Thank you.
<?php
$html = file_get_html('http://www.google.com/');
foreach($html->find('a') as $element)
{
if(empty($element->href))
{
continue; //will skip <a> without href
}
echo $element->href . "<br>\n";
}
?>
A loop and continue?
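For the broken-link case specifically, file_get_html() returns false when the page can't be fetched (or is too large to parse), so you can bail out before searching the document. A minimal sketch:
<?php
require('simple_html_dom.php');

$html = @file_get_html('http://www.example.com/'); // false on failure
if ($html === false) {
    exit('Could not load the page, so there is nothing to search.');
}

foreach ($html->find('a') as $element) {
    echo $element->href . "<br>\n";
}
?>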