I have code that reads all the inputs in a form. It works on my demo page and on other sites, but it does not work on some pages.
An example of the problem, with Facebook:
$url = 'https://www.facebook.com';
$html = file_get_html($url);
$post = $html->find('form[id=reg]'); // id of the form on the Facebook registration page
print_r($post);
This prints an empty array.
A working example:
$url = 'http://www.monografias.com/usuario/registro';
$html = file_get_html($url);
$post = $html->find('form[name=myform]');
print_r($post);
This prints the form content.
Facebook won't give you the registration form directly; it responds with only basic HTML, and the rest is created with JavaScript. See for yourself:
$url = 'https://www.facebook.com';
$html = file_get_html($url);
echo htmlspecialchars($html);
There is no form with the ID "reg" in the HTML they send you.
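If you want to confirm this, here is a minimal sketch (not part of the original answer) that lists the id of every form actually present in the HTML Facebook serves:

include 'simple_html_dom.php';

// List the id attribute of every <form> in the HTML Facebook actually returns
$html = file_get_html('https://www.facebook.com');
foreach ($html->find('form') as $form) {
    echo $form->id . "<br>\n"; // none of them will be "reg"
}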
simple_html_dom.php contains a line limiting the max file size it will parse:
define('MAX_FILE_SIZE', 600000);
For files larger than this size, file_get_html() will just return false.
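If your page is simply too big, you may be able to raise that limit. A small sketch, assuming a recent copy of the library that only defines the constant when it is not already set (older copies require editing the define() inside simple_html_dom.php itself):

// Raise the parser's size limit *before* including the library
define('MAX_FILE_SIZE', 1200000); // hypothetical 1.2 MB limit

include 'simple_html_dom.php';

$html = file_get_html('https://www.facebook.com');
var_dump($html === false); // false here means the page fit under the limit and was parsed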
My situation: I want to scrape a website, which succeeds, and I'm using PHP cURL. The problem starts when I use the DOM parser to get the content I want. Here is the warning that came out:
the error image is here
And here is the code I use. Before this code I scrape the website with cURL, which works; only this part throws the error:
include 'simple_html_dom.php';
// Here is where I scrape; no need to show it
$fp = fopen(dirname(__FILE__) . '/airpaz.html', 'w');
// $html contains the page I scraped
fwrite($fp, $html);
fclose($fp);
$html_content = file_get_contents(dirname(__FILE__) . '/airpaz.html');
echo $html_content;
$html2 = new simple_html_dom();
$html2->load_file($html_content);
Hope you guys can help, thanks
It looks like you are trying to read a file 3 times:
$read_file = fread($fr, filesize(dirname(__FILE__) . '/airpaz.html'));
and:
$html_content = file_get_contents($read_file);
and:
$html2->load_file($html_content);
In the last two instances you pass HTML contents to the function instead of a file name, so that will not work.
You should read the file only once and use string functions on the contents you receive, or open the URL directly with $html2->load_file().
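As a rough sketch of both options (a name like $page_url is a placeholder, not from the original question):

include 'simple_html_dom.php';

// Option A: parse the string you already scraped with cURL, no temp file needed
// ($html is assumed to hold the raw page source, as in the question)
$html2 = str_get_html($html);

// Option B: let the parser open the URL (or file path) itself
$html3 = new simple_html_dom();
$html3->load_file($page_url);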
Try this code:
include 'simple_html_dom.php';

// file_get_html() already returns a parsed simple_html_dom object
$html_content = file_get_html(dirname(__FILE__) . '/airpaz.html');
echo $html_content;

// or, equivalently, pass the file path (not the parsed object) to load_file()
$html2 = new simple_html_dom();
$html2->load_file(dirname(__FILE__) . '/airpaz.html');
I have written this code to print the links on a given URL/site, but unfortunately it prints them as plain text instead of anchor links.
So far, this code works and prints the links as text:
<?php
include_once('simple_html_dom.php');
$url = "http://www.sitename.com";
$html = new simple_html_dom();
$html->load_file($url);
foreach($html->find("a") as $link) {
echo $url.$link->href."<br/>";
}
?>
I then tried this to print them as anchor links, but it prints a blank page (not even an error):
<?php
include_once('simple_html_dom.php');
$url = "http://www.sitename.com";
$html = new simple_html_dom();
$html->load_file($url);
foreach($html->find("a") as $link){
echo "<a href='".$url.$link->href."'>"."</a>"."<br/>";
}
?>
Also, I was wondering how to extract only the links from a particular area (a div element) instead of the whole page.
Any help is greatly appreciated.
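For what it's worth, a guess at a fix (not from the original post): the second loop writes anchor tags with nothing between the opening and closing tag, so the browser has nothing visible to render. Printing the URL as the link text, and optionally restricting the search to one div, might look like this (the div id "content" is a made-up example):

foreach ($html->find('a') as $link) {
    $href = $url . $link->href;
    echo "<a href='" . $href . "'>" . $href . "</a><br/>"; // link text makes the link visible
}

// Only links inside a particular div (hypothetical id "content")
foreach ($html->find('div#content a') as $link) {
    echo $link->href . "<br/>";
}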
I have a small problem with my code. I accessed the src value of the script tags to get the content of the JavaScript files found on the server side. That works and I get what I wanted, but the problem is that I am also getting the HTML code, which I don't want. Below is what I have done. Please help.
<?php
//simple_html_dom.php caters for malformed html
include('simple_html_dom.php');
$html = new simple_html_dom();
//load the All code file
$html->load_file("test.txt");
$file = fopen("externalScript.txt","w");
$Script=$html->find("script");
$temp="";
$url="http://www.xyz.com";
foreach($Script AS $Spt){
    $src = $Spt->src;
    // check if the script src has "http://" prefix
    if(strpos($src,'http://') !== 0){
        $src = $url."/".$src;
    }
    $get_script = file_get_contents($src);
    $temp .= $get_script.PHP_EOL;
}
fwrite($file,($temp));
fclose($file);
?>
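One possible cause, as a guess (the question doesn't say): inline <script> blocks have no src attribute, so $src stays empty, the http:// check prepends the site URL, and file_get_contents() ends up fetching the site's HTML page. A sketch of the loop with that case skipped:

foreach ($Script as $Spt) {
    $src = $Spt->src;
    if (empty($src)) {
        continue; // inline script: no external file to fetch
    }
    // prefix relative paths with the site URL
    if (strpos($src, 'http://') !== 0 && strpos($src, 'https://') !== 0) {
        $src = $url . '/' . ltrim($src, '/');
    }
    $temp .= file_get_contents($src) . PHP_EOL;
}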
I have this function to get the title of a website:
function getTitle($Url){
    $str = file_get_contents($Url);
    if(strlen($str)>0){
        preg_match("/\<title\>(.*)\<\/title\>/",$str,$title);
        return $title[1];
    }
}
However, this function makes my page take too long to respond. Someone told me to get the title from the request headers of the website only, without reading the whole file, but I don't know how. Can anyone tell me which code and functions I should use to do this? Thank you very much.
Using regex is not a good idea for HTML; use a DOM parser instead.
$html = new simple_html_dom();
$html->load_file('****'); // put a URL or filename here
$title = $html->find('title', 0); // find() with an index returns a single element instead of an array
echo $title->plaintext;
or
// Create DOM from URL or file
$html = file_get_html('*****');
// Find the title element(s)
foreach($html->find('title') as $element)
    echo $element->plaintext . '<br>';
Good read
RegEx match open tags except XHTML self-contained tags
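If the real concern is response time, one option (not mentioned in the answers above) is to read only the first few kilobytes of the page before looking for the title, since it normally sits near the top of the document. A minimal sketch, assuming allow_url_fopen is enabled:

function getTitleFast($Url) {
    $fp = @fopen($Url, 'r');
    if ($fp === false) {
        return null;
    }
    $chunk = fread($fp, 4096); // only the first 4 KB of the response
    fclose($fp);

    if (preg_match('/<title>(.*?)<\/title>/is', $chunk, $m)) {
        return trim($m[1]);
    }
    return null;
}

echo getTitleFast('http://www.example.com');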
Use jQuery instead to get the title of your page:
$(document).ready(function() {
    alert($("title").text());
});
Demo : http://jsfiddle.net/WQNT8/1/
Try this; it should work:
include_once 'simple_html_dom.php';

$oHtml = file_get_html($url); // $url holds the page URL
$Title = $oHtml->find('title', 0)->innertext;
$Description = $oHtml->find("meta[name='description']", 0)->content;
$keywords = $oHtml->find("meta[name='keywords']", 0)->content;

echo $Title;
echo $Description;
echo $keywords;
I am using the PHP Simple DOM parser to extract all of the image sources on a given page like so:
// Include the library
include('simple_html_dom.php');
// Retrieve the DOM from a given URL
$html = file_get_html('http://google.com/');
// Retrieve all images and print their SRCs
foreach($html->find('img') as $e)
    echo $e->src . '<br>';
Instead of using Google.com, I wish to use a page in WordPress's admin (backend) area. These pages are PHP pages, not HTML files (but the pages contain standard HTML throughout). How would I use the current page as the $html variable? PHP newbie over here.
Using this library, dxtool, found here.
Login
require 'WebGet.php';
$w = new WebGet();
// using cache to prevent repetitive download
$w->useCache = true;
$w->cacheLocation = '/tmp';
$w->cacheMaxAge = 3600;
$w->cookieFile = '/tmp/cookie.txt';
// $login_get_data and $login_post_data are associative arrays
$login = $w->requestContent($login_url, $login_get_data, $login_post_data);
Visiting the image-containing page
// $image_page_url is the url of the page where your images exist.
$image_page = $w->requestContent($image_page_url);
Parse images and display
$dom = new DOMDocument();
$dom->loadHTML($image_page);
$imgs = $dom->getElementsByTagName("img");
foreach($imgs as $img){
    echo $img->getAttribute("src");
}
Disclaimer: I am the author of this class
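A small usage note, not part of the original answer: real pages (WordPress admin screens included) often contain markup that DOMDocument's HTML parser complains about, so it can be worth silencing libxml warnings around loadHTML():

libxml_use_internal_errors(true); // collect parse warnings instead of printing them

$dom = new DOMDocument();
$dom->loadHTML($image_page);
libxml_clear_errors();

foreach ($dom->getElementsByTagName('img') as $img) {
    echo $img->getAttribute('src'), "\n";
}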