Get specific URLs from a web page using simplehtmldom - PHP

I am trying to build a simple PHP crawler.
For this purpose
I am getting the contents of a web page using
http://simplehtmldom.sourceforge.net/
After fetching the page, I list its links as below:
include('simplehtmldom/simple_html_dom.php');
$html = file_get_html('http://www.mypage.com');
foreach($html->find('a') as $e)
echo $e->href . '<br>';
This works perfectly and prints all links on that page.
I only want to get URLs like
/view.php?view=open&id=
I have written a function for this purpose:
function starts_text_with($s, $prefix){
return strpos($s, $prefix) === 0;
}
and use this function as
include('simplehtmldom/simple_html_dom.php');
$html = file_get_html('http://www.mypage.com');
foreach($html->find('a') as $e) {
if (starts_text_with($e->href, "/view.php?view=open&id=")))
echo $e->href . '<br>';
}
But nothing is returned.
I hope you understand what I need.
I need to print only the URLs that match that criterion.
Thanks

include('simplehtmldom/simple_html_dom.php');
$html = file_get_html('http://www.mypage.com');
foreach($html->find('a') as $e) {
if (preg_match('/view\.php\?view=open&id=/', $e->href))
echo $e->href . '<br>';
}
Try this.
See the preg_match documentation.
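Two things are worth checking in the question's code. First, the if line has one closing parenthesis too many, which is a parse error. Second, if the page emits absolute URLs, a starts-with test fails and a substring test is needed instead. A minimal sketch of the substring approach, assuming the links have already been collected into a plain array (the helper name filter_hrefs and the sample URLs are hypothetical):

```php
<?php
// Hypothetical helper: keep only hrefs that contain $needle anywhere.
function filter_hrefs(array $hrefs, string $needle): array
{
    $matches = [];
    foreach ($hrefs as $href) {
        // strpos() returns 0 for a match at the start, so compare with false explicitly.
        if (strpos($href, $needle) !== false) {
            $matches[] = $href;
        }
    }
    return $matches;
}

// Sample data standing in for $e->href values (made up for illustration).
$hrefs = [
    '/view.php?view=open&id=12',
    'http://www.mypage.com/view.php?view=open&id=34',
    '/about.php',
];
foreach (filter_hrefs($hrefs, 'view.php?view=open&id=') as $href) {
    echo $href . "<br>\n";
}
```

Unlike preg_match, strpos needs no escaping of the ? and . characters, so it is harder to get wrong for a fixed string.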

Related

PHP file_get_html() $html->find('a') not working for certain URL

For some reason this link doesn't work, but the other one does. I tried expanding the size, and it seemed to max out. I don't see where it could be looping infinitely.
The error said:
Fatal error: Call to a member function find() on boolean in /Applications/XAMPP/xamppfiles/htdocs/simplehtmldom_1_5/example/test.php on line 7
Doesn't work:
<?php
include('../simple_html_dom.php');
$html = file_get_html('http://www.watchepisodes4.com/the-flash');
foreach($html->find('a') as $e) {
echo $e->href . '<br>';
}
?>
Works:
<?php
include('../simple_html_dom.php');
$html = file_get_html('http://www.watchepisodes4.com');
foreach($html->find('a') as $e) {
echo $e->href . '<br>';
}
?>
Thanks!
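The error message says find() was called on a boolean, which means file_get_html() returned false for that URL instead of a DOM object. In the 1.5 source this appears to happen when the fetched content is empty or longer than the MAX_FILE_SIZE constant, though that is worth verifying against your copy. A minimal sketch of guarding against the failure, assuming a hypothetical helper named collect_hrefs; it does not explain why the load fails, it only avoids the fatal error:

```php
<?php
// Hypothetical helper (not part of simple_html_dom): collect link targets,
// returning an empty array when the page could not be loaded.
function collect_hrefs($html): array
{
    // file_get_html() hands back false on failure, not an object.
    if (!is_object($html)) {
        return [];
    }
    $hrefs = [];
    foreach ($html->find('a') as $e) {
        $hrefs[] = $e->href;
    }
    return $hrefs;
}

// Simulates the failing case from the question.
$html = false;
$hrefs = collect_hrefs($html);
echo count($hrefs) === 0 ? "Could not load page\n" : implode("<br>\n", $hrefs);
```

In real use you would pass the result of file_get_html() straight into the helper.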

I want to parse span innertext using simple_html_dom in php

<span class="contact-seller-name">Enda</span>
Now I want to echo 'Enda' inside this span tag using php
Here's my php code
$url="http://website.example.com";
$html = file_get_html( $url );
$value = $html->find('span.contact-seller-name');
echo $value->innertext;
From their documentation it looks like find returns an array of found values matching filter parameters:
From:
http://simplehtmldom.sourceforge.net/
Code:
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
They also provide another example for getting a specific element:
$html->find('div[id=hello]', 0)->innertext = 'foo';
So my guess would be that something like this will get you what you want:
$value = $html->find('span.contact-seller-name', 0);
echo $value->innertext;
By adding the 0 as a parameter it returns the first found instance of that filter.
Take a look at their API here:
http://simplehtmldom.sourceforge.net/manual_api.htm
It describes what the find method returns (an array of element objects, or a single element object if the second parameter is given).
Then using any of the provided methods for the element object you can get the desired text.
Full working example tested on a live site:
$url = "http://fleeceandthankyou.org/";
$html = file_get_html($url);
$value = $html->find('span.givecamp-header-wide', 0);
// find() with an index returns null when nothing matches,
// and reading a property of null does not throw, so check explicitly
if ($value !== null)
{
echo $value->innertext;
}
else
{
echo "Couldn't find the element";
}

array_unique() in php simple html dom

I wrote the code below to get all unique links from a URL:
include_once ('simple_html_dom.php');
$html = file_get_html('http://www.example.com');
foreach($html->find('a') as $element){
$input = array($element->href = $element->href . '<br />');
print_r(array_unique($input));}
But I really can't understand why it shows the duplicated links too.
Is there a problem with the function array_unique and simple html dom?
There's another thing that I guess is related to the problem: when you run this, all of the extracted links appear under a single key, like this:
array(key => all values)
Can anyone solve this?
I believe you want it more like this:
$temp = array();
foreach($html->find('a') as $element) {
$temp[] = $element->href;
}
echo '<pre>' . print_r(array_unique($temp), true) . '</pre>';
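The question's loop builds a new one-element array on every iteration, so array_unique() never sees more than one link at a time. A small sketch of collecting first and deduplicating once, using made-up link values in place of $element->href:

```php
<?php
// Sample hrefs standing in for $element->href values (made up for illustration).
$links = ['/a', '/b', '/a', '/c', '/b'];

// Deduplicate once over the whole list; the first occurrence of each value is kept.
$unique = array_unique($links);

// array_unique() preserves the original keys (0, 1, 3 here);
// array_values() reindexes them to 0, 1, 2.
$unique = array_values($unique);

print_r($unique);
```

The array_values() call is optional and only matters if you later rely on consecutive numeric keys.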

Scraping data from amazon

I'm aware that there is an amazon API for pulling their data but I'm just trying to learn to scrape for my own knowledge and pulling data from amazon seems like a good test.
<?php
ini_set('display_errors',1);
ini_set('display_startup_errors',1);
error_reporting(-1);
include('../includes/simple_html_dom.php');
$html = file_get_html('http://www.amazon.co.uk/gp/product/B00AZYBFGY/ref=s9_simh_gw_p86_d0_i1?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1MP0FXRF8V70NWAN3ZWW&pf_r$');
foreach($html->find('a-section') as $element) {
echo $element->plaintext . '<br />';
}
echo $ret;
?>
All I'm trying to do is pull the product description from the link, but I'm not sure why it isn't working. I'm not getting any errors or any data at all, really.
The class for the Product Description is simply productDescriptionWrapper, so in your sample code use that CSS selector:
foreach($html->find('.productDescriptionWrapper') as $element) {
echo $element->plaintext . '<br />';
}
simplehtmldom uses CSS selectors very similar to jQuery's. So if you want all divs, use ->find('div'); if you want all anchors with a class of 'hotProduct', use ->find('a.hotProduct'); and so on.
It doesn't work because the product description is added by JavaScript into an iframe.
You can first check whether any HTML is returned from Amazon at all. It might be blocking your request.
$url = "https://www.amazon.co.uk/gp/product/B00AZYBFGY/ref=s9_simh_gw_p86_d0_i1?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1MP0FXRF8V70NWAN3ZWW&pf_r$";
$htmlContent = file_get_contents($url);
echo $htmlContent;
$html = str_get_html($htmlContent);
Note the https:// here; you have http://, which may be why you get nothing.
Once you get the HTML, you can go forward.
Try different selectors:
foreach($html->find('div[id=productDescription]') as $element) {
echo $element->plaintext . '<br />';
}
foreach($html->find('div[id=content]') as $element) {
echo $element->plaintext . '<br />';
}
foreach($html->find('div[id=feature-bullets]') as $element) {
echo $element->plaintext . '<br />';
}
The echo above should display the page itself, maybe with some missing CSS.
If the HTML is in place, you can try these selectors.

How to change url in html parser

I have this code that output all url with ?tid=someNumbers
<?php
include 'simple_html_dom.php';
// Create DOM from URL or file
$html = file_get_html('http://news.sinchew.com.my/node');
// Find all links
foreach($html->find('a') as $element) {
$tid = '?tid';
$url = 'news.sinchew.com.my/node';
if(strpos($element->href,$tid) && (strpos($element->href,$url))) {
echo $element->href . '<br>';
}
}
?>
What I want to do is change ?tid=someNumbers to ?tid=1234 and then output all URLs with ?tid=1234. I have been stuck on this for hours; can someone help me with this?
Try preg_replace to perform substitutions based on regular expressions:
<?php
//...
echo preg_replace("/\\?tid=[0-9]+/", "?tid=1234", $element->href);
//...
?>
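Here is how the substitution might fit together with the filtering, sketched on plain strings so it runs without the parser. The function name rewrite_tid and the sample URLs are assumptions for illustration:

```php
<?php
// Hypothetical helper: replace the numeric tid value in a URL with $newTid.
function rewrite_tid(string $href, string $newTid): string
{
    return preg_replace('/\?tid=[0-9]+/', '?tid=' . $newTid, $href);
}

// Sample data standing in for $element->href values (made up for illustration).
$hrefs = [
    'http://news.sinchew.com.my/node/1?tid=7',
    'http://news.sinchew.com.my/node/2',
];
foreach ($hrefs as $href) {
    // Only print links that actually carry a tid parameter.
    // Compare with false: strpos() returns 0 for a match at position 0.
    if (strpos($href, '?tid=') !== false) {
        echo rewrite_tid($href, '1234') . "<br>\n";
    }
}
```

URLs without a tid parameter pass through rewrite_tid unchanged, so the filter decides what gets printed and the replacement stays a pure string operation.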
