I am scraping a website with PHP cURL, and that part works. The problem starts when I try to use the DOM parser to extract the content I want. This warning appears:
[image of the warning]
Here is the code I use. The cURL scraping that runs before it works fine; only this part produces the error:
include 'simple_html_dom.php';
//Here is where I scraping, no need to show it
$fp = fopen(dirname(__FILE__) . '/airpaz.html', 'w');
//$html contain the page I scrap
fwrite($fp, $html);
fclose($fp);
$html_content = file_get_contents(dirname(__FILE__) . '/airpaz.html');
echo $html_content;
$html2 = new simple_html_dom();
$html2->load_file($html_content);
Hope you guys can help, thanks
It looks like you are trying to read a file 3 times:
$read_file = fread($fr, filesize(dirname(__FILE__) . '/airpaz.html'));
and:
$html_content = file_get_contents($read_file);
and:
$html2->load_file($html_content);
In the last two cases you pass HTML contents to the function instead of a file name, so that will not work.
You should read the file only once and then work on the contents you receive as a string. Alternatively, open the URL directly with $html2->load_file().
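The difference between a loader that takes markup and one that takes a path can be sketched with PHP's built-in DOMDocument. This is a swapped-in library, chosen only because it ships with PHP: loadHTML() expects the HTML string and loadHTMLFile() expects a file name. In simple_html_dom the corresponding pair is load() and load_file(). The sample markup and the temporary file are made up for illustration.

```php
<?php
// Sample markup standing in for the scraped page (hypothetical content).
$html = '<html><head><title>Demo</title></head><body><p>42</p></body></html>';

// Option 1: parse the string that is already in memory.
$fromString = new DOMDocument();
$fromString->loadHTML($html);

// Option 2: save it, then give the parser the *path*, not the contents.
$path = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($path, $html);
$fromFile = new DOMDocument();
$fromFile->loadHTMLFile($path);
unlink($path);

// Both documents contain the same first paragraph.
$textFromString = $fromString->getElementsByTagName('p')->item(0)->textContent;
$textFromFile = $fromFile->getElementsByTagName('p')->item(0)->textContent;
echo $textFromString, "\n", $textFromFile, "\n";
```

Passing the contents to the file-based loader is the mistake in the question: the parser then tries to open a file whose name is the entire HTML document.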
Try this code:
include 'simple_html_dom.php';
// file_get_html() takes a file path or URL and returns the parsed DOM object
$html2 = file_get_html(dirname(__FILE__) . '/airpaz.html');
// Echoing the object prints the document's HTML
echo $html2;
Related
I have code that reads all the inputs in a form.
It works on my demo page and on others, but not on some pages.
An example that fails:
facebook:
$url = 'https://www.facebook.com';
$html = file_get_html($url);
$post = $html->find('form[id=reg]'); //id for the register facebook page
print_r($post);
This prints an empty array.
A working example:
$url = 'http://www.monografias.com/usuario/registro';
$html = file_get_html($url);
$post = $html->find('form[name=myform]');
print_r($post);
This prints the form content.
Facebook won't give you the registration form directly. It responds with only basic HTML, and the rest is created with JavaScript. See for yourself:
$url = 'https://www.facebook.com';
$html = file_get_html($url);
echo htmlspecialchars($html);
There is no form with the ID "reg" in the HTML they send you.
simple_html_dom.php contains a line limiting the max file size it will parse:
define('MAX_FILE_SIZE', 600000);
For files larger than this size, file_get_html() will just return false.
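One way to avoid a silent false is to check the length of the markup yourself before handing it to the parser. The sketch below assumes the limit is compared against the byte length of the HTML. The constant PARSER_MAX_BYTES and the helper fitsParserLimit() are hypothetical names that only mirror the 600000 value quoted above; they are not part of simple_html_dom.

```php
<?php
// Hypothetical constant mirroring the 600000-byte limit quoted above.
const PARSER_MAX_BYTES = 600000;

// Hypothetical helper: true when the markup is non-empty and within the limit.
function fitsParserLimit(string $html, int $limit = PARSER_MAX_BYTES): bool
{
    return $html !== '' && strlen($html) <= $limit;
}

$small = '<p>hello</p>';
$large = str_repeat('a', PARSER_MAX_BYTES + 1);

var_dump(fitsParserLimit($small));
var_dump(fitsParserLimit($large));
```

If a page you need is larger than the limit, the options are to raise MAX_FILE_SIZE in simple_html_dom.php or to parse the page with a different library.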
I'm parsing some iTunes links with a DOM parser in PHP. It works perfectly with most of the links, but not with others of exactly the same type. I need the "img" tag and its "src-swap-high-dpi" attribute. It drives me nuts. Here is part of my PHP code:
$url = "https://itunes.apple.com/us/podcast/id278981407";
$htmlContent = str_get_html(file_get_contents($url));
foreach ($htmlContent->find("img") as $element) {
$value = $element->getAttribute("src-swap-high-dpi");
echo $value;
}
For example, I can parse the following links:
https://itunes.apple.com/us/podcast/id201671138
https://itunes.apple.com/us/podcast/id523121474
https://itunes.apple.com/us/podcast/id152249110
But not this one:
https://itunes.apple.com/us/podcast/id278981407
I do not get any output.
Edit:
The new code doesn't work for me either. Very strange. This is my complete code now:
<?php
ini_set("display_errors",1); error_reporting(E_ALL);
require_once ('simple_html_dom.php');
$url = "https://itunes.apple.com/us/podcast/id278981407";
$htmlContent = str_get_html(file_get_contents($url));
foreach($htmlContent->find("div.artwork") as $div) {
$value = $div->find("img",0)->getAttribute("src-swap-high-dpi");
echo $value."<br/>";
}
?>
I get this output:
Fatal error: Call to a member function find() on a non-object in /home/www/whatever/delete.php on line 10
Line 10 is the line starting with "foreach". Your code works fine with the links I listed above as working. But as soon as I use one of the links that doesn't work, I get the error message above.
I think this is one of the cases where Simple DOM gets a bit confused and you need to provide it with a parent:
$url = "https://itunes.apple.com/us/podcast/id278981407";
$htmlContent = str_get_html(file_get_contents($url));
foreach($htmlContent->find("div.artwork") as $div) {
$value = $div->find("img",0)->getAttribute("src-swap-high-dpi");
echo $value."<br/>";
}
UPDATE
Here are the results using the above fragment:
http://a3.mzstatic.com/us/r30/Podcasts/v4/61/cc/7f/61cc7f25-131f-7616-6549-5553e6444b87/mza_7489225285918350214.150x150-75.jpg
http://a2.mzstatic.com/us/r30/Podcasts6/v4/04/a9/64/04a964d7-7c10-72d6-871b-97619cf89066/mza_1416781107029663068.150x150-75.jpg
http://a5.mzstatic.com/us/r30/Podcasts4/v4/bb/a6/f4/bba6f4b6-eeab-d7d9-8591-adb2bd277ccb/mza_5223368352447971673.150x150-75.jpg
http://a1.mzstatic.com/us/r30/Podcasts5/v4/aa/54/16/aa541600-cc8b-772b-9c0a-824efe8fdc42/mza_6772270613386652594.150x150-75.jpg
http://a2.mzstatic.com/us/r30/Podcasts3/v4/95/3d/2f/953d2f75-c2c2-4815-a752-f30fdcc0b9fb/mza_9037746738018570312.150x150-75.jpg
http://a4.mzstatic.com/us/r30/Podcasts4/v4/a2/1c/f5/a21cf5a4-2d8d-1ed7-983f-1c90f2f4f948/mza_7120473049241631392.340x340-75.jpg
http://a2.mzstatic.com/us/r30/Podcasts4/v4/5d/21/8d/5d218d2a-2980-0ac9-0bc7-9321ea6eb334/mza_6358466742996313573.150x150-75.jpg
http://a1.mzstatic.com/us/r30/Podcasts/b2/bb/bf/ps.ykmejwzs.150x150-75.jpg
http://a4.mzstatic.com/us/r30/Podcasts6/v4/17/ea/31/17ea3187-ef8c-4756-e488-0c65adced988/mza_7931750363714403933.150x150-75.jpg
http://a1.mzstatic.com/us/r30/Podcasts2/v4/0b/3c/7d/0b3c7d2b-19bf-f7a2-7c50-ca15338b8316/mza_2792239161425784587.150x150-75.jpg
Can you verify that you're not getting any errors at all? For instance, put some invalid characters in your PHP file: does PHP show an error? If not, try adding this to your .htaccess file:
<IfModule mod_php5.c>
# display errors
php_value display_errors 1
</IfModule>
UPDATE 2
$url = "https://itunes.apple.com/us/podcast/id278981407";
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER,FALSE);
$html = curl_exec($ch);
curl_close($ch);
//$htmlContent = str_get_html(file_get_contents($url));
$htmlContent = str_get_html($html);
foreach($htmlContent->find("div.artwork") as $div) {
$value = $div->find("img",0)->getAttribute("src-swap-high-dpi");
echo $value."<br/>";
}
The reason I didn't use Simple DOM's file_get_html() is that it simply uses file_get_contents() internally.
I have been working with this PHP code, which is supposed to modify Google Calendar's layout. But when I put the code on a page, everything below it disappears. What's wrong with it?
<?php
$your_google_calendar=" PAGE ";
$url= parse_url($your_google_calendar);
$google_domain = $url['scheme'].'://'.$url['host'].dirname($url['path']).'/';
// Load and parse Google's raw calendar
$dom = new DOMDocument;
$dom->loadHTMLfile($your_google_calendar);
// Change Google's CSS file to use absolute URLs (assumes there's only one <link> element)
$css = $dom->getElementByTagName('link')->item(0);
$css_href = $css->getAttributes('href');
$css->setAttributes('href', $google_domain . $css_href);
// Change Google's JS file to use absolute URLs
$scripts = $dom->getElementByTagName('script')->item(0);
foreach ($scripts as $script) {
$js_src = $script->getAttributes('src');
if ($js_src) { $script->setAttributes('src', $google_domain . $js_src); }
}
// Create a link to a new CSS file called custom_calendar.css
$element = $dom->createElement('link');
$element->setAttribute('type', 'text/css');
$element->setAttribute('rel', 'stylesheet');
$element->setAttribute('href', 'custom_calendar.css');
// Append this link at the end of the <head> element
$head = $dom->getElementByTagName('head')->item(0);
$head->appendChild($element);
// Export the HTML
echo $dom->saveHTML();
?>
When I test your code, I get errors because of incorrect method names:
->getElementByTagName should be ->getElementsByTagName, with an s after Element
and
->setAttributes and ->getAttributes should be ->setAttribute and ->getAttribute, without the s at the end.
I'm guessing that you don't have error_reporting turned on, and so you never saw that anything went wrong?
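The corrected method names can be sketched as follows, with error reporting switched on so that a misspelled method shows up immediately. The base URL and the inline sample page are hypothetical stand-ins for the calendar page. The sketch also loops over the whole list of script elements, since the question's code takes ->item(0) first and then tries to iterate over that single element.

```php
<?php
// Show every notice and warning while debugging.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical stand-ins for the calendar page and its base URL.
$base = 'https://example.com/calendar/';
$page = '<html><head><link rel="stylesheet" href="style.css"></head>'
      . '<body><script src="app.js"></script><script>var inline = 1;</script></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($page);

// getElementsByTagName() returns a list; item(0) picks the first <link>.
$css = $dom->getElementsByTagName('link')->item(0);
$css->setAttribute('href', $base . $css->getAttribute('href'));

// Iterate over the list of <script> elements itself, not over item(0).
foreach ($dom->getElementsByTagName('script') as $script) {
    $src = $script->getAttribute('src'); // empty string for inline scripts
    if ($src !== '') {
        $script->setAttribute('src', $base . $src);
    }
}

$out = $dom->saveHTML();
echo $out;
```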
I have a small problem with my code. I read the src value of each script tag to fetch the content of the JavaScript file from the server. That works and I get what I wanted, but I am also getting HTML code, which I don't want. Below is what I have done. Please help.
<?php
//simple_html_dom.php caters for malformed html
include('simple_html_dom.php');
$html = new simple_html_dom();
//load the All code file
$html->load_file("test.txt");
$file = fopen("externalScript.txt","w");
$Script=$html->find("script");
$temp="";
$url="http://www.xyz.com";
foreach($Script AS $Spt){
$src=$Spt->src;
//check if the script src has "http://" prefix
if(strpos($src,'http://')!==0){
$src=$url."/".$src;
}
$get_script=file_get_contents($src);
$temp.=$get_script.PHP_EOL;
}
fwrite($file,($temp));
fclose($file);
?>
I have this function to get the title of a website:
function getTitle($Url){
$str = file_get_contents($Url);
if(strlen($str)>0){
preg_match("/\<title\>(.*)\<\/title\>/",$str,$title);
return $title[1];
}
}
However, this function makes my page take too long to respond. Someone told me to get the title from the request header of the website only, which would avoid reading the whole file, but I don't know how. Can anyone please tell me which code and function I should use to do this? Thank you very much.
Using regex is not a good idea for HTML; use a DOM parser instead:
$html = new simple_html_dom();
$html->load_file('****'); //put url or filename
// find() with an index returns a single element rather than an array
$title = $html->find('title', 0);
echo $title->plaintext;
or
// Create DOM from URL or file
$html = file_get_html('*****');
// Find the title element(s) and print their text
foreach($html->find('title') as $element)
echo $element->plaintext . '<br>';
Good read
RegEx match open tags except XHTML self-contained tags
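The same idea can be sketched with PHP's built-in DOMDocument, a swapped-in parser used here only because it needs no extra include. The helper extractTitle() is a hypothetical name, and the sample markup is deliberately cut off to imitate a page of which only the first bytes were read.

```php
<?php
// Hypothetical helper: return the <title> text of an HTML string, or null.
function extractTitle(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $dom = new DOMDocument();
    // Truncated or real-world markup is often malformed, so collect
    // parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $dom->getElementsByTagName('title')->item(0);
    return $node === null ? null : trim($node->textContent);
}

// Truncated sample markup, as if only the start of a page had been read.
$partial = '<html><head><title> Example Domain </title></head><body><p>cut off he';
$title = extractTitle($partial);
echo $title, "\n";
```

The title lives in the response body, not in the HTTP headers, so it cannot be read from headers alone. A possible compromise is to read only the beginning of the page, for example with file_get_contents($url, false, null, 0, 8192), where 8192 is an arbitrary guess at how many bytes are enough to cover the head section.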
Use jQuery instead to get the title of your page:
$(document).ready(function() {
alert($("title").text());
});
Demo : http://jsfiddle.net/WQNT8/1/
Try this; it should work:
include_once 'simple_html_dom.php';
// file_get_html() fetches and parses a URL; str_get_html() expects an HTML string
$oHtml = file_get_html($url);
// find() with index 0 returns the first match, or null if there is none
$Title = $oHtml->find('title', 0)->innertext;
$Description = $oHtml->find("meta[name='description']", 0)->content;
$keywords = $oHtml->find("meta[name='keywords']", 0)->content;
echo $Title;
echo $Description;
echo $keywords;