I'm using Simple HTML DOM Parser to scrape a website, and I would like to know whether it offers any error handling. For example, if the link is broken, there is no point in continuing and searching the document.
Thank you.
<?php
$html = file_get_html('http://www.google.com/');
foreach($html->find('a') as $element)
{
if(empty($element->href))
{
continue; //will skip <a> without href
}
echo $element->href . "<br>\n";
}
?>
a loop and continue?
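For the broken-link case itself, one option is to check that the fetch succeeded before searching anything. The following is only a sketch using PHP's built-in file_get_contents(); the helper name fetch_markup_or_null and the missing path are made up for illustration. As far as I know, file_get_html() also returns false when nothing could be loaded, so a similar `=== false` check should apply there, but verify that against your copy of the library.

```php
<?php
// Hypothetical helper: returns the fetched markup, or null when the
// source cannot be read (broken link, missing file, empty response).
function fetch_markup_or_null(string $source): ?string
{
    // @ silences the warning file_get_contents() raises on failure;
    // the false return value is what gets tested.
    $markup = @file_get_contents($source);
    if ($markup === false || $markup === '') {
        return null;
    }
    return $markup;
}

// Assumed-missing path, standing in for a broken link.
$markup = fetch_markup_or_null('/no/such/page.html');

if ($markup === null) {
    $status = 'skipped'; // nothing to parse, so stop here
} else {
    $status = 'parsed';  // safe to hand $markup to the parser
}
echo $status, "\n";
```

When the fetch succeeds, the string can be passed on to the parser, so the search code only runs on a document that actually loaded.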
php simple html DOM has some problem with parentheses in href
If you have a sample.php page and it contains:
If you query it like this:
$html = file_get_html('sample.php');
foreach($html->find('a[href*=(parentheses)]') as $element)
{
echo $element->href;
}
or like this:
$html = file_get_html('sample.php');
foreach($html->find('a[href*=this-href]') as $element)
{
echo $element->href;
}
It works.
But if you write something after or before the parentheses it doesn't work:
This:
$html = file_get_html('sample.php');
foreach($html->find('a[href*=contains-(parentheses)]') as $element)
{
echo $element->href;
}
Or this:
$html = file_get_html('sample.php');
foreach($html->find('a[href*=(parentheses)-and-more]') as $element)
{
echo $element->href;
}
It doesn't work.
That does not work because of a glaring error in the Simple HTML DOM code (well, one of many):
On line 673 of simple_html_dom.php you will see the line:
return preg_match("/".$pattern."/i", $value);
Change it to:
return preg_match("/".preg_quote($pattern, "/")."/i", $value);
Presto, problem solved.
You can report the error here: https://sourceforge.net/p/simplehtmldom/bugs/ but with all the errors about the find method and others it is likely already reported.
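To see why the escaping matters, here is a small standalone sketch that does not need the library. The sample href value is assumed for illustration; only preg_match() and preg_quote() are used.

```php
<?php
$value   = 'contains-(parentheses)-and-more'; // sample href, assumed
$pattern = 'contains-(parentheses)';          // text from the selector

// Unescaped: ( ) form a capture group, so the regex actually
// searches for the literal text "contains-parentheses".
$unquoted = preg_match('/' . $pattern . '/i', $value);

// Escaped: preg_quote() makes ( ) literal characters. The second
// argument also escapes the "/" delimiter if it appears in the text.
$quoted = preg_match('/' . preg_quote($pattern, '/') . '/i', $value);

var_dump($unquoted, $quoted); // int(0) then int(1)
```

This also explains why a selector containing only `(parentheses)` appeared to work: the group matches the bare word "parentheses", which is present in the href.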
I have this code that outputs all URLs containing ?tid=someNumbers
<?php
include 'simple_html_dom.php';
// Create DOM from URL or file
$html = file_get_html('http://news.sinchew.com.my/node');
// Find all links
foreach($html->find('a') as $element) {
$tid = '?tid';
$url = 'news.sinchew.com.my/node';
if(strpos($element->href,$tid) !== false && strpos($element->href,$url) !== false) {
echo $element->href . '<br>';
}
}
?>
What I want to do is change ?tid=someNumbers to ?tid=1234 and then output all URLs with ?tid=1234. I have been stuck on this for hours. Can someone help me with this?
Try preg_replace to perform substitutions based on regular expressions:
<?php
//...
echo preg_replace("/\\?tid=[0-9]+/", "?tid=1234", $element->href);
//...
?>
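Put together with the filter from the question, it might look like the sketch below. The $hrefs array is made-up sample data standing in for the values of $element->href.

```php
<?php
// Sample hrefs, assumed for illustration.
$hrefs = [
    'http://news.sinchew.com.my/node/123?tid=45',
    'http://news.sinchew.com.my/node/456',
    'http://example.com/?tid=9',
];

$out = [];
foreach ($hrefs as $href) {
    // strpos() can return 0 for a match at the start of the string,
    // so compare against false explicitly.
    if (strpos($href, '?tid=') !== false
        && strpos($href, 'news.sinchew.com.my/node') !== false) {
        // Replace whatever number follows ?tid= with 1234.
        $out[] = preg_replace('/\?tid=[0-9]+/', '?tid=1234', $href);
    }
}

echo implode("\n", $out), "\n";
```

Only the first sample URL passes both checks, so it is the only one printed, now ending in ?tid=1234.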
I want to extract all text links from a webpage using the simplehtmldom class, but I don't want image links.
<?
foreach($html->find('a[href]') as $element)
echo $element->href . '<br>';
?>
The code above shows all anchor links that have an href attribute.
<a href="/contact">contact</a>
<a href="/about">about</a>
<a href="/home"><img src="logo.png" /></a>
I want only /contact and /about, not /home, because that link contains an image instead of text.
<?php
foreach($html->find('a[href]') as $element)
{
if (empty(trim($element->plaintext)))
continue;
echo $element->href . '<br>';
}
<?
foreach($html->find('a[href]') as $element){
if(!preg_match('%<img%', $element->innertext)){
echo $element->href . '<br>';
}
}
?>
With CSS selectors that support it, for example in phpQuery, this can be written as:
$html->find('a:not(:has(img))')
Simple HTML DOM is unlikely ever to support this, though.
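If switching libraries is an option, the same "anchor without an image inside" condition can also be expressed in XPath with PHP's bundled DOM extension. This is a sketch of a different technique from the Simple HTML DOM answers above, and the sample markup is assumed.

```php
<?php
// Sample markup, assumed for illustration.
$markup = '<a href="/contact">contact</a>'
        . '<a href="/about">about</a>'
        . '<a href="/home"><img src="logo.png" /></a>';

libxml_use_internal_errors(true); // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($markup);

$xpath = new DOMXPath($doc);

// Anchors that have an href and no <img> anywhere inside them.
$textLinks = [];
foreach ($xpath->query('//a[@href][not(.//img)]') as $anchor) {
    $textLinks[] = $anchor->getAttribute('href');
}

echo implode("\n", $textLinks), "\n";
```

For the sample markup this prints /contact and /about and leaves out /home.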
I'm trying to learn the simple_html_dom syntax, but I'm not having much luck. Could someone show me an example based on this:
<div id="container">
<span>Apples</span>
<span>Oranges</span>
<span>Bananas</span>
</div>
I just want to return the values Apples, Oranges and Bananas.
Can I simply use the PHP simple_html_dom class, or will I also have to use xcode, curl, etc.?
UPDATE:
I was able to get this to work, but not convinced it's the most efficient way of getting what I need:
foreach ($html->find('div[id=cont]') as $div);
foreach($div->find('span') as $element)
echo $element->innertext . '<br>';
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Your suggestion is correct:
foreach ($html->find('div[id=cont]') as $div);
foreach($div->find('span') as $element)
echo $element->innertext . '<br>';
More simply:
foreach($html->find('div#container span') as $element)
echo $element->innerText();
That means any span that descends from a div with id: container
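On the question of needing extra tools: for markup you already have as a string, nothing beyond a parser is required. As a comparison, here is the same descendant lookup sketched with PHP's built-in DOM extension instead of simple_html_dom. The extra span outside the div is added to the sample only to show that it is not matched.

```php
<?php
// Markup from the question, plus one span outside the container.
$markup = '<div id="container">'
        . '<span>Apples</span><span>Oranges</span><span>Bananas</span>'
        . '</div><span>Pears</span>';

libxml_use_internal_errors(true); // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($markup);

$xpath = new DOMXPath($doc);

// XPath counterpart of the CSS selector "div#container span".
$values = [];
foreach ($xpath->query('//div[@id="container"]//span') as $span) {
    $values[] = $span->textContent;
}

echo implode(', ', $values), "\n"; // Apples, Oranges, Bananas
```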
Example: the page at http://www.example.com/234234/go.html contains only a single iframe.
How can I get the URL from that iframe's src attribute?
go.html:
<iframe style="width: 99%;height:80%;margin:0 auto;border:1px solid grey;" src="i want this url" scrolling="auto" id="iframe_content"></iframe>
I have this snippet, but it is very badly coded.
function downloadlink ($d_id)
{
$res = @get_url ('' . 'http://www.example.com/' . $d_id . '/go.html');
$re = explode ('<iframe', $res);
$re = explode ('src="', $re[1]);
$re = explode ('"', $re[1]);
$url = $re[0];
return $url;
}
thank you!
Use an HTML parser such as simple_html_dom to parse HTML.
$html = file_get_html('http://www.example.com/');
// Find all iframes
foreach($html->find('iframe') as $element)
echo $element->src . '<br>';
I don't know what scope you have here - is it just that snippet, or are you browsing whole pages?
If you're browsing whole pages, you could use the PHP Simple HTML DOM Parser.
A slightly modified example from their site:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all iframes
foreach($html->find('iframe') as $element)
echo $element->src . '<br>';
This sample code goes through all iframes on the page, and outputs their src property.
PHP has built-in functions for this as well (like SimpleXML), but I find the DOM Parser very nice and easy to handle (as you can see).
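For completeness, here is a sketch of the built-in route using DOMDocument, which ships with PHP. The iframe markup and its src URL are placeholders assumed for illustration; in practice the string would be the fetched go.html.

```php
<?php
// Placeholder page content; the src URL is hypothetical.
$markup = '<iframe style="width: 99%;" '
        . 'src="http://www.example.com/embedded.html" '
        . 'scrolling="auto" id="iframe_content"></iframe>';

libxml_use_internal_errors(true); // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($markup);

// Collect the src attribute of every iframe in the document.
$sources = [];
foreach ($doc->getElementsByTagName('iframe') as $iframe) {
    $sources[] = $iframe->getAttribute('src');
}

echo implode("\n", $sources), "\n";
```

This replaces the chain of explode() calls from the question with an attribute lookup, so it keeps working if the attribute order or quoting in the iframe tag changes.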