Printing out links from a site using PHP web crawler - php

I have written this code to print the links found at a given URL/site, but unfortunately it prints them as plain text instead of anchor links.
So far this code works, printing the links as text:
<?php
include_once('simple_html_dom.php');
$url = "http://www.sitename.com";
$html = new simple_html_dom();
$html->load_file($url);
foreach($html->find("a") as $link) {
    echo $url.$link->href."<br/>";
}
?>
When I tried this, to print them as anchor links, it prints a blank page (not even an error):
<?php
include_once('simple_html_dom.php');
$url = "http://www.sitename.com";
$html = new simple_html_dom();
$html->load_file($url);
foreach($html->find("a") as $link){
    echo "<a href='".$url.$link->href."'>"."</a>"."<br/>";
}
?>
Also, I was wondering how to extract only the links from a particular area (a div element) instead of the whole page?
Any help is greatly appreciated.
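A minimal sketch of one way to address both points (this is an assumption, not a posted answer): the second snippet shows a blank page because the generated <a> tags have no text between the opening and closing tags, so the links render invisibly; printing the href as the link text makes them visible, and scoping find() to a container selector limits the crawl to one div. The div#content selector is a placeholder for whatever container you actually want.
<?php
include_once('simple_html_dom.php');
$url = "http://www.sitename.com";
$html = new simple_html_dom();
$html->load_file($url);
// "div#content" is a placeholder id; only anchors inside that container are matched
foreach($html->find("div#content a") as $link) {
    $href = $url.$link->href;
    // put the URL between the tags so the anchor is actually visible
    echo "<a href='".$href."'>".$href."</a>"."<br/>";
}
?>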

Related

simple_html_dom trying to parse the site, but the loop doesn't output anything

I'm trying to parse the site, but the loop doesn't output anything:
<?php
include_once 'simple_html_dom.php';
$html = file_get_html('https://teleprogramma.pro/headlines');
foreach($html->find('.text-part') as $element) {
    echo $element->outertext;
}
?>
There are no elements in the document that match the class .text-part. You can inspect the source code by saving the HTML into a file:
<?php
include_once 'simple_html_dom.php';
$html = file_get_html('https://teleprogramma.pro/headlines');
file_put_contents('htmlData.html', $html);
When you try, for example, to find .block-top-section-posts, you'll get a result:
<?php
include_once 'simplehtmldom_1_9_1/simple_html_dom.php';
$html = file_get_html('https://teleprogramma.pro/headlines');
foreach($html->find('.block-top-section-posts') as $element) {
    echo $element->outertext;
}
// Outputs
/*
<div class="vue-container block-top-section-posts"> <div id="vuetag-top-section-posts" class="vue-tag news-line" data-url="/rest/term-additional-loading/section/76217/0/0" data-max-width="0" data-min-width="0" data-information=""> </div> </div>
*/
When you look up the site in a browser, you get redirected to another URL. If you want to use that, have a look at "PHP get URL of redirect from source URL" to get the final address.
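As a hedged sketch of that approach (this snippet is an assumption, not part of the original answer), cURL can follow the redirects and report the effective URL, which can then be fed to file_get_html():
<?php
// Follow redirects with cURL and read the final (effective) URL
$ch = curl_init('https://teleprogramma.pro/headlines');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
$finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);

echo $finalUrl; // pass this to file_get_html() if needed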

Get the html from another webpage from another machine

I am trying to get the HTML of this page: http://213.177.10.50:6060/itn/default.asp and, from this page, go to 'Drumuri', where the cars are listed.
In short, I am trying to get that table from the 'Drumuri' page.
I have tried this code:
<?php
$DOM = new DOMDocument;
$DOM->loadHTML('http://213.177.10.50:6060/itn/default.asp');
$items = $DOM->getElementsByTagName('a');
print_R($items);
?>
I also tried with cURL, but with no results. This is my county's website and I think it is heavily secured, which is why I cannot get its HTML. Can you please try it and tell me whether this is possible or not, and how or why?
You cannot load http://213.177.10.50:6060/itn/default.asp directly because it contains an iframe. The iframe source is http://213.177.10.50:6060/itn/dreapta.asp.
Here is how to go through the links and find the DRUMURI link:
<?php
$baseUrl = 'http://213.177.10.50:6060/itn/';
$DOM = new DOMDocument;
$DOM->loadHTMLFile($baseUrl.'dreapta.asp');
foreach($DOM->getElementsByTagName('a') as $link) {
    if ($link->nodeValue == 'DRUMURI') {
        echo "Label -> ".$link->nodeValue."\n";
        echo "Link -> ".$baseUrl.$link->getAttribute('href')."\n";
    }
}
Output:
Label -> DRUMURI
Link -> http://213.177.10.50:6060/itn/drumuri.asp
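As a hedged follow-up sketch (not part of the original answer), once the DRUMURI link is known, the table on drumuri.asp could be read with the same DOMDocument approach; the exact table markup is an assumption:
<?php
$baseUrl = 'http://213.177.10.50:6060/itn/';
$DOM = new DOMDocument;
// @ suppresses warnings about malformed HTML on old ASP pages
@$DOM->loadHTMLFile($baseUrl.'drumuri.asp');

// Walk every table row and print its cells
foreach ($DOM->getElementsByTagName('tr') as $row) {
    $cells = array();
    foreach ($row->getElementsByTagName('td') as $cell) {
        $cells[] = trim($cell->nodeValue);
    }
    if ($cells) {
        echo implode(' | ', $cells)."\n";
    }
}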

PHP Simple HTML DOM Parser some pages issue

I have code that reads all the inputs in a form.
The code works on my demo page and others, but it does not work on some pages.
An example of the issue, Facebook:
$url = 'https://www.facebook.com';
$html = file_get_html($url);
$post = $html->find('form[id=reg]'); //id for the register facebook page
print_r($post);
This prints an empty array.
A working example:
$url = 'http://www.monografias.com/usuario/registro';
$html = file_get_html($url);
$post = $html->find('form[name=myform]');
print_r($post);
This prints the form content.
Facebook won't give you the registration form directly; it only responds with basic HTML, and the rest is created with JavaScript. See for yourself:
$url = 'https://www.facebook.com';
$html = file_get_html($url);
echo htmlspecialchars($html);
There is no form with the "reg" ID in the HTML they send you.
simple_html_dom.php contains a line limiting the max file size it will parse:
define('MAX_FILE_SIZE', 600000);
For files larger than this size, file_get_html() will just return false.
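As a hedged workaround sketch (an assumption, not part of the original answer): some releases of simple_html_dom.php guard that define, in which case the limit can be raised before the include; older copies define it unconditionally, and then the constant has to be edited in the library file itself.
<?php
// Only works if the library uses: defined('MAX_FILE_SIZE') || define('MAX_FILE_SIZE', 600000);
define('MAX_FILE_SIZE', 5000000);
include_once 'simple_html_dom.php';

$html = file_get_html('https://www.facebook.com');
if ($html === false) {
    // Either the download failed or the page is still larger than MAX_FILE_SIZE
    echo 'Page could not be loaded';
}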

PHP Simple HTML DOM Scrape External URL

I'm trying to build a personal project of mine; however, I'm a bit stuck when using the Simple HTML DOM class.
What I'd like to do is scrape a website and retrieve all the content, and its inner HTML, that matches a certain class.
My code so far is:
<?php
error_reporting(E_ALL);
include_once("simple_html_dom.php");
//use simple_html_dom to get the html content
$url = 'http://www.peopleperhour.com/freelance-seo-jobs';
$html = file_get_html($url);
//Get all data inside the <div class="item-list">
foreach($html->find('div[class=item-list]') as $div) {
    //get all div's inside "item-list"
    foreach($div->find('div') as $d) {
        //get the inner HTML
        $data = $d->outertext;
    }
}
print_r($data);
echo "END";
?>
All I get with this is a blank page with "END", nothing else outputted at all.
It seems your $data variable is being overwritten on each iteration. Try this instead:
$data = "";
foreach($html->find('div[class=item-list]') as $div) {
    //get all divs inside "item-list"
    foreach($div->find('div') as $d) {
        //get the inner HTML
        $data .= $d->outertext;
    }
}
print_r($data);
I hope that helps.
I think you may want something like this:
$url = 'http://www.peopleperhour.com/freelance-seo-jobs';
$html = file_get_html($url);
foreach ($html->find('div.item-list div.item') as $div) {
    echo $div . '<br />';
}
This will give you the matching items' HTML (if you add the proper style sheet, it'll be displayed nicely).

PHP DOMDocument error handling

I'm having trouble writing an if statement for the DOM that checks whether $html is blank. However, whenever the HTML page ends up blank, the script simply stops outputting everything below the DOM code (including the check I added for whether it was blank).
$html = file_get_contents("http://example.com/");
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementById('dividhere')->getElementsByTagName('img');
foreach ($links as $link)
{
    echo $link->getAttribute('src');
}
All this does is grab the image URLs in the specified div, which works perfectly until the page is a blank HTML page.
I've also tried Simple HTML DOM, which didn't work either (it didn't even fetch the image on working pages). Did I miss something with this one, or am I just missing something in both?
include_once('simple_html_dom.php');
$html = file_get_html("http://example.com/");
foreach($html->find('div[id="dividhere"]') as $div)
{
    if(empty($div->src))
    {
        continue;
    }
    echo $div->src;
}
Get rid of the $html variable and just load the file into $dom by doing @$dom->loadHTMLFile("http://example.com/");, then have an if statement below that to check whether $dom is empty.
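A minimal sketch of that suggestion (assuming the "dividhere" id and URL from the question; this is not code from the original answer): loadHTMLFile() returns false on failure and getElementById() returns null when the div is missing, so both cases can be checked before reading the images.
<?php
$dom = new DOMDocument;
// @ silences warnings about malformed HTML; the return value signals failure
if (!@$dom->loadHTMLFile("http://example.com/")) {
    exit('Page could not be loaded');
}

$div = $dom->getElementById('dividhere');
if ($div === null) {
    exit('Page is blank or the div is missing');
}

foreach ($div->getElementsByTagName('img') as $img) {
    echo $img->getAttribute('src');
}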
