PHP code to extract all text links, not image links - php

I want to extract all text links from a webpage using the simplehtmldom class, but I don't want image links.
<?php
foreach($html->find('a[href]') as $element)
echo $element->href . '<br>';
?>
The code above prints the href of every anchor that has an href attribute. For example, given:
<a href="/contact">contact</a>
<a href="/about">about</a>
<a href="/home"><img src="logo.png" /></a>
I want only /contact and /about, not /home, because that link contains an image instead of text.

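<?php
foreach($html->find('a[href]') as $element)
{
    // Skip anchors with no visible text, such as image-only links.
    // A strict comparison is used because empty('0') is also true.
    if (trim($element->plaintext) === '')
        continue;
    echo $element->href . '<br>';
}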

<?php
foreach($html->find('a[href]') as $element){
    // Look for an <img> tag in the anchor's inner HTML, not in its href
    if(!preg_match('%<img%i', $element->innertext)){
        echo $element->href . '<br>';
    }
}
?>

It is possible to do this in CSS, and with phpQuery, as:
$html->find('a:not(:has(img))')
It is unlikely that simple_html_dom will ever support this, though.
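For comparison, the same filter can be written with PHP's built-in DOM extension instead of simple_html_dom. This is only a minimal sketch: the helper name textLinkHrefs and the sample markup are made up for illustration, and the XPath expression stands in for the a:not(:has(img)) selector above.

```php
<?php
// Hypothetical helper: returns the href of every <a> that contains no <img>.
function textLinkHrefs(string $markup): array
{
    $doc = new DOMDocument();
    // @ suppresses the warnings that imperfect real-world HTML tends to trigger.
    @$doc->loadHTML($markup);
    $xpath = new DOMXPath($doc);

    $hrefs = [];
    // Anchors that have an href attribute and no <img> descendant.
    foreach ($xpath->query('//a[@href][not(.//img)]') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

// Sample markup modelled on the question (an assumption, not the real page).
$sample = '<a href="/contact">contact</a>'
        . '<a href="/about">about</a>'
        . '<a href="/home"><img src="logo.png" /></a>';

foreach (textLinkHrefs($sample) as $href) {
    echo $href . "\n";
}
```

Unlike the plaintext check, this keeps links that mix text and an image out of the result too, which may or may not be what you want.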

Related

PHP - Help scraping content from web page using simplehtmldom

I need to get the content marked in red in the images:
The link of the logo with 200px
The link of the ground
The capacity number
This is what I have tried so far:
include("simple_html_dom.php");
//Wikipedia page to parse
$html = file_get_html('https://en.wikipedia.org/wiki/Alloa_Athletic_F.C.');
foreach($html->find('.label') as $element)
echo $element->href . "\n";
Second attempt:
include("simple_html_dom.php");
$aHtml = file('https://en.wikipedia.org/wiki/Alloa_Athletic_F.C.');
foreach($aHtml as $id => $element):
if( strpos($element, 'logo') !== false ):
echo $id .':' .htmlspecialchars($element) . "<br><br>\n";
endif;
endforeach;
But I can't get the results I need.

Scraping data from amazon

I'm aware that there is an amazon API for pulling their data but I'm just trying to learn to scrape for my own knowledge and pulling data from amazon seems like a good test.
<?php
ini_set('display_errors',1);
ini_set('display_startup_errors',1);
error_reporting(-1);
include('../includes/simple_html_dom.php');
$html = file_get_html('http://www.amazon.co.uk/gp/product/B00AZYBFGY/ref=s9_simh_gw_p86_d0_i1?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1MP0FXRF8V70NWAN3ZWW&pf_r$');
foreach($html->find('a-section') as $element) {
echo $element->plaintext . '<br />';
}
echo $ret;
?>
All I'm trying to do is pull the product description from the link, but I'm not sure why it isn't working. I'm not getting any errors, or any data at all.
The class for the product description is simply productDescriptionWrapper, so use that CSS selector in your sample code:
foreach($html->find('.productDescriptionWrapper') as $element) {
echo $element->plaintext . '<br />';
}
simplehtmldom uses CSS selectors very similar to jQuery's. If you want all divs, use ->find('div'); if you want all anchors with a class of 'hotProduct', use ->find('a.hotProduct'); and so on.
It doesn't work because the product description is added by JavaScript into an iframe.
First, check whether you get any HTML back from Amazon at all. It might be blocking your request.
$url = "https://www.amazon.co.uk/gp/product/B00AZYBFGY/ref=s9_simh_gw_p86_d0_i1?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1MP0FXRF8V70NWAN3ZWW&pf_r$";
$htmlContent = file_get_contents($url);
echo $htmlContent;
$html = str_get_html($htmlContent);
Note the https://. You have http://, which may be why you get nothing.
Once you have the HTML, you can go further.
Try different selectors:
foreach($html->find('div[id=productDescription]') as $element) {
    echo $element->plaintext . '<br />';
}
foreach($html->find('div[id=content]') as $element) {
    echo $element->plaintext . '<br />';
}
foreach($html->find('div[id=feature-bullets]') as $element) {
    echo $element->plaintext . '<br />';
}
Echoing $htmlContent should display the page itself, maybe with some missing CSS.
If the HTML is there, you can try the selectors shown above.
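The "try different selectors" step can also be done in one pass. The sketch below uses PHP's built-in DOM extension rather than simple_html_dom, and looks elements up by id only. The helper name firstTextById and the sample markup are hypothetical; the ids are the ones suggested above, and nothing here guarantees that Amazon's pages actually contain them.

```php
<?php
// Hypothetical helper: returns the trimmed text of the first element whose id
// appears in $ids and whose text is non-empty, or null if nothing matches.
function firstTextById(string $markup, array $ids): ?string
{
    $doc = new DOMDocument();
    // @ suppresses warnings caused by imperfect HTML.
    @$doc->loadHTML($markup);
    $xpath = new DOMXPath($doc);

    foreach ($ids as $id) {
        // The ids are fixed literals in this sketch, so plain concatenation is acceptable.
        foreach ($xpath->query('//*[@id="' . $id . '"]') as $node) {
            $text = trim($node->textContent);
            if ($text !== '') {
                return $text;
            }
        }
    }
    return null;
}

// Made-up markup standing in for a fetched page.
$sample = '<div id="content"></div>'
        . '<div id="feature-bullets"><ul><li>Light</li></ul></div>';

$found = firstTextById($sample, ['productDescription', 'content', 'feature-bullets']);
echo $found === null ? 'no match' : $found;
```

A null result on the real page would suggest that the content is not in the HTML you fetched, for example because it is added later by JavaScript.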

simple_html_dom example help for a newbie

I'm trying to learn the simple_html_dom syntax, but I'm not having much luck. Could someone show me an example based on this:
<div id="container">
<span>Apples</span>
<span>Oranges</span>
<span>Bananas</span>
</div>
I just want to return the values Apples, Oranges and Bananas.
Can I simply use the PHP simple_html_dom class, or will I also have to use xcode, curl, etc.?
UPDATE:
I was able to get this to work, but I'm not convinced it's the most efficient way of getting what I need:
foreach ($html->find('div[id=container]') as $div) {
    foreach ($div->find('span') as $element)
        echo $element->innertext . '<br>';
}
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Your suggestion is correct:
foreach ($html->find('div[id=container]') as $div) {
    foreach ($div->find('span') as $element)
        echo $element->innertext . '<br>';
}
More simply:
foreach($html->find('div#container span') as $element)
    echo $element->innertext . '<br>';
This selects any span that descends from the div with the id container.
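On the question of needing extra tools: the same descendant lookup also works with nothing but PHP's built-in DOM extension. A minimal sketch, assuming the markup from the question; the XPath expression plays the role of the div#container span selector.

```php
<?php
// Markup copied from the question.
$markup = '<div id="container">
<span>Apples</span>
<span>Oranges</span>
<span>Bananas</span>
</div>';

$doc = new DOMDocument();
// @ suppresses warnings caused by imperfect HTML.
@$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

$values = [];
// Every <span> that descends from the <div> with id "container".
foreach ($xpath->query('//div[@id="container"]//span') as $span) {
    $values[] = trim($span->textContent);
}

echo implode(', ', $values) . "\n";
```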

PHP Explode: How to Download and Split a Specific HTML Part?

Example: the page at http://www.example.com/234234/go.html contains only a single iframe element.
How can I get the URL from the iframe's src attribute?
go.html:
<iframe style="width: 99%;height:80%;margin:0 auto;border:1px solid grey;" src="i want this url" scrolling="auto" id="iframe_content"></iframe>
I have this snippet, but it's very badly coded:
function downloadlink ($d_id)
{
$res = @get_url ('' . 'http://www.example.com/' . $d_id . '/go.html');
$re = explode ('<iframe', $res);
$re = explode ('src="', $re[1]);
$re = explode ('"', $re[1]);
$url = $re[0];
return $url;
}
thank you!
Use an HTML parser such as simple_html_dom to parse the HTML.
$html = file_get_html('http://www.example.com/');
// Find all iframes
foreach($html->find('iframe') as $element)
echo $element->src . '<br>';
I don't know what scope you have here - is it just that snippet, or are you browsing whole pages?
If you're browsing whole pages, you could use the PHP Simple HTML DOM Parser.
A slightly modified example from their site:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all iframes
foreach($html->find('iframe') as $element)
echo $element->src . '<br>';
This sample code goes through all iframes on the page, and outputs their src property.
PHP has built-in functions for this as well (like SimpleXML), but I find the DOM Parser very nice and easy to handle (as you can see).
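As a rough illustration of the built-in route, here is a sketch using DOMDocument instead of the explode() chain. The helper name firstIframeSrc and the sample src value are made up for the example; the real page would be fetched first and passed in as a string.

```php
<?php
// Hypothetical helper: returns the src of the first <iframe>, or null if the
// document has no iframe or the iframe has no src.
function firstIframeSrc(string $markup): ?string
{
    $doc = new DOMDocument();
    // @ suppresses warnings caused by imperfect HTML.
    @$doc->loadHTML($markup);

    $iframes = $doc->getElementsByTagName('iframe');
    if ($iframes->length === 0) {
        return null;
    }

    // getAttribute() returns an empty string when the attribute is missing.
    $src = $iframes->item(0)->getAttribute('src');
    return $src === '' ? null : $src;
}

// Sample modelled on go.html from the question, with a placeholder URL.
$sample = '<iframe style="width: 99%;" src="http://www.example.com/target" '
        . 'scrolling="auto" id="iframe_content"></iframe>';

echo firstIframeSrc($sample) . "\n";
```

Returning null rather than an empty string makes the "no iframe found" case easy to tell apart in the caller.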

Simple HTML DOM Parser error handling

I'm using the SimpleHTMLDOM parser to scrape a website, and I would like to know if there is any way to handle errors. For example, if the link is broken, there is no point in carrying on and searching the document.
Thank you.
<?php
$html = file_get_html('http://www.google.com/');
foreach($html->find('a') as $element)
{
if(empty($element->href))
{
continue; //will skip <a> without href
}
echo $element->href . "<br>\n";
}
?>
How about a loop with continue?
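For the broken-link case itself, one option is to check the fetch before searching the document. The sketch below uses only built-in functions, and the helper name loadDocumentOrNull and the sample path are made up. With simple_html_dom, the analogous guard would be to check the return value of file_get_html() before calling find() on it; as far as I know it returns false when the page cannot be loaded, but treat that as an assumption and verify it against your version.

```php
<?php
// Hypothetical helper: loads an HTML document from a URL or file path, or
// returns null when the source cannot be read or is empty.
function loadDocumentOrNull(string $source): ?DOMDocument
{
    // file_get_contents() returns false on failure; @ hides the warning it emits.
    $markup = @file_get_contents($source);
    if ($markup === false || trim($markup) === '') {
        return null;
    }

    $doc = new DOMDocument();
    // loadHTML() returns false on failure; @ hides warnings about imperfect HTML.
    if (!@$doc->loadHTML($markup)) {
        return null;
    }
    return $doc;
}

// A path that does not exist, standing in for a broken link.
$doc = loadDocumentOrNull('/nonexistent/path/page.html');

if ($doc === null) {
    echo "Could not load the page, skipping.\n";
} else {
    foreach ($doc->getElementsByTagName('a') as $a) {
        echo $a->getAttribute('href') . "\n";
    }
}
```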
