PHP Simple HTML DOM Parser error handling - php

I am trying to use PHP Simple HTML DOM Parser on a bunch of pages, but I have a problem with SOME of them. While 90% of the pages work fine, for some URLs I can't save the cURL output to a string... The URLs exist, of course...
// $ch is assumed to be created with curl_init() earlier, with
// CURLOPT_RETURNTRANSFER set so curl_exec() returns the body as a string
curl_setopt($ch, CURLOPT_POST, false);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
$data = curl_exec($ch);
$html = str_get_html($data);
if ($html) {
    // ...
}
My code is something like this, but it never gets inside the if. I can echo $data without any problem, but I can't echo $html. I also tried file_get_html, but nothing. The weird thing is that I don't get any error. How can I configure PHP Simple HTML DOM Parser to report the error?

Oh my GOD! It was the size of the data... I changed the limit in simple_html_dom.php to something bigger and now I am fine...
define('MAX_FILE_SIZE', 800000);
Such poor error handling :(
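For anyone hitting the same silent failure: str_get_html() simply returns false when the input is longer than MAX_FILE_SIZE, so a guard like the following (a minimal sketch, assuming simple_html_dom.php is already included and $ch is set up as above) turns the silence into a message:
$data = curl_exec($ch);
if ($data === false) {
    die('cURL error: ' . curl_error($ch));
}
// str_get_html() silently returns false when the input is longer
// than MAX_FILE_SIZE, so check the length ourselves and say so
if (strlen($data) > MAX_FILE_SIZE) {
    die('Page is ' . strlen($data) . ' bytes, above MAX_FILE_SIZE (' . MAX_FILE_SIZE . ')');
}
$html = str_get_html($data);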

Related

Getting HTML data from a PHP page

I have a URL like this: https://facebook.com/5 . I want to get the HTML of that page, just like View Source.
I tried using file_get_contents, but that didn't return the correct stuff.
Am I missing something?
Is there any other function that I can utilize?
If I can't get the HTML of that page, what special thing did the developer do while coding the site to prevent this?
Warning for being off topic:
But does this task have to be done using PHP?
Since this sounds like a web-scraping task, I think you would get more use out of CasperJS.
With it, you can target precisely what you want to retrieve from the Facebook page rather than grabbing the whole content, which, as of this writing, I assume is generated by multiple requests and rendered through a virtual DOM.
Please note that I haven't tried retrieving content from Facebook, but I've done this with multiple other services.
Good luck!
You may want to use curl instead: http://php.net/manual/en/curl.examples.php
Edit:
Here is an example of mine:
$url = 'https://facebook.com/5';
$ssl = true;
$timeout = 3;

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, $ssl);     // keep certificate verification on
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow redirects
$data = curl_exec($ch);
curl_close($ch);
Note that depending on the website's vhost configuration, a trailing slash at the end of the URL can make a difference.
Edit: Sorry for the undefined variable... I copied the code out of a helper method I use. Now it should be alright.
Yet another edit:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
By adding this option you follow the redirects that are apparently happening in your example. Since you said it was an example, I didn't actually run it before. Now I did, and it works.

Get Content from Web Pages with PHP

I am working on a small project to get information from several webpages based on the HTML markup of each page, and I do not know where to start at all.
The basic idea is to get the title from the <h1></h1> tags, the content from the <p></p> tags, and whatever other important information is required.
I would have to set up a case for each source for it to work the way it needs to. I believe the right method is using the $_GET method with PHP. The goal of the project is to build a database of information.
What is the best method to grab the information I need?
First of all: PHP's $_GET is not a method. As you can see in the documentation, $_GET is simply an array initialized with the GET parameters your web server received during the current request. As such, it is not what you want to use for this kind of thing.
What you should look into is cURL, which allows you to compose even fairly complex requests, send them to the destination server, and retrieve the response. For example, for a POST request you could do something like:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.mysite.com/tester.phtml");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS,
    "postvar1=value1&postvar2=value2&postvar3=value3");
// in real life you should use something like:
// curl_setopt($ch, CURLOPT_POSTFIELDS,
//     http_build_query(array('postvar1' => 'value1')));

// receive the server response as a string ...
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec($ch);
curl_close($ch);
Source
Of course, if you don't have to do any complex requests but only simple GET requests, you can go with the PHP function file_get_contents.
After you have received the web page content, you have to parse it. IMHO the best way to do this is with PHP's DOM functions. How to use them should really be another question, but you can find tons of examples without much effort.
<?php
$remote = file_get_contents('http://www.remote_website.html');
$doc = new DOMDocument();
// @ suppresses the warnings that loadHTML() emits on malformed markup
@$doc->loadHTML($remote);

$titles = array();
foreach ($doc->getElementsByTagName('h1') as $cell) {
    $titles[] = $cell->nodeValue;
}

$content = array();
foreach ($doc->getElementsByTagName('p') as $cell) {
    $content[] = $cell->nodeValue;
}
?>
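If tag names alone are too coarse, DOMXPath (part of the same DOM extension) lets you query by structure instead. A small sketch, where the div id "article" is just a hypothetical example of a container you might target:
$xpath = new DOMXPath($doc);
// grab only the <p> elements inside a hypothetical <div id="article">
foreach ($xpath->query('//div[@id="article"]//p') as $p) {
    $content[] = $p->nodeValue;
}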
You can get the HTML source of a page with:
<?php
$html = file_get_contents('http://www.example.com/');
echo $html;
?>
Then, once you have the structure of the page, you can pull out the tag you want with substr() and strpos(), as sketched below.
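A minimal sketch of that approach, assuming the page has exactly one plain <h1> with no attributes (it breaks on anything fancier, which is why the DOM approach above is usually safer):
$html  = file_get_contents('http://www.example.com/');
// take the text between the first <h1> and the following </h1>
$start = strpos($html, '<h1>') + strlen('<h1>');
$end   = strpos($html, '</h1>', $start);
$title = substr($html, $start, $end - $start);
echo $title;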

Timeout when passing variables with the URL and fopen() in PHP

I have done quite a bit of searching and cannot quite find my answer. My problem is that I am trying to call a link with GET variables attached to it, and it just hangs and hangs until the connection times out. When I literally call the same link in a web browser, it works fine, no problem.
Here is the fopen() PHP code example:
<?php
$url = "https://www.mysite.com/folder/second_folder/file.php?varA=val1&varB=val2&varC=val3&varD=val4&varE=val5";
$ch = fopen($url, 'r');
if (!$ch) {
    echo "could not open!!! $url";
} else {
    echo "Success! ($url)";
}
?>
I can call file.php without the GET variables just fine; it returns with no error.
NOTE: file.php, given one of the vars that gets passed, does some work and then issues a header('Location: ...') redirect. I do not think the script even gets that far before the connection times out, though, because when I had problems I put a checkpoint before the header Location line that should email me, and it does not.
Again, if I run the URL in a web browser, it works just fine.
So what is going on, if anyone can help me? I just need to request the URL as if PHP were clicking the link. I have used fopen before, but for some reason it does not work now. cURL did not work on this either.
Try changing the single quotes to double quotes in this case.
My working code is:
<?php $handle = fopen("c:\\folder\\resource.txt", "r"); ?>
I think you want to be using
$ch = file_get_contents($url);
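If you go the file_get_contents() route, a stream context also lets you cap how long PHP waits, so the request fails fast instead of hanging (a sketch; the 10-second value is arbitrary):
// give up after 10 seconds instead of waiting on the default socket timeout
$context = stream_context_create(array(
    'http' => array('timeout' => 10),
));
$data = file_get_contents($url, false, $context);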
Edit: cURL option
// open
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow the header('Location: ...') redirect
curl_setopt($ch, CURLOPT_MAXREDIRS, 1);      // but at most one hop
curl_setopt($ch, CURLOPT_FORBID_REUSE, 1);   // do not reuse the connection
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body as a string
$page_data = curl_exec($ch);
$page_info = curl_getinfo($ch);
// close
curl_close($ch);
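Since the symptom here is a hang until the connection times out, it may also be worth adding explicit timeouts to the block above so the script fails fast (the values are arbitrary):
// fail after 5 seconds connecting and 15 seconds total,
// rather than hanging on the defaults
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);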

PHP cURL incorrect download

I'm attempting to use YouTube's API to pull a list of videos and display them. To do this, I need to cURL their API and get the XML file it returns, which I will then parse.
When I run the following cURL function
function get_url_contents($url) {
    $crl = curl_init();
    $timeout = 5;
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}
against the URL
http://gdata.youtube.com/feeds/api/videos?q=Apple&orderby=relevance
the string that is saved is horribly screwed up. There are no < > tags, and half the characters are missing in most of it. It looks 100% different from what I see if I view it in a browser.
I tried print, echo, and var_dump, and they all show it as completely different, which makes parsing it impossible.
How do I get the file properly from the server?
It's working for me. I'm pretty sure the file is returned without errors, but when you print it, the <> tags aren't shown because the browser interprets them as HTML rather than displaying them. If you look at the page source, you can see them.
Try this, and you can see it work:
$content = get_url_contents('http://gdata.youtube.com/feeds/api/videos?q=Apple&orderby=relevance');
$xml = simplexml_load_string($content);
print_r($xml);
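To confirm in the browser that the raw XML really is intact, you can also escape it before echoing so the tags become visible:
// htmlspecialchars() turns < and > into entities, so the browser
// displays the raw XML instead of interpreting it as markup
echo '<pre>' . htmlspecialchars($content) . '</pre>';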
Make use of the client library that Google provides; it'll make your life easier.
http://code.google.com/apis/youtube/2.0/developers_guide_php.html

Execute PHP code returned by a cURL query

I am trying to execute code that is returned by a cURL query.
The following code queries a page on my webserver:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://web.com/foo.php");
curl_setopt($ch, CURLOPT_HEADER, 0);
$res = curl_exec($ch);
curl_close($ch);
echo $res;
I would like to do so by only modifying the code in foo.php.
I have tried returning PHP code as the result from foo.php and running it through eval(), but it doesn't seem to work.
Any ideas?
EDIT: Guys, I am not doing this for a public website. It is for a private project, and I will be the only user. I know it's a huge security concern; I would never do something like this on a site that is live on the internet.
Disclaimer: this is a terrible idea for security, and you shouldn't do it.
That said:
ensure that the allow_url_fopen and allow_url_include options are enabled in your php.ini (these are system-level settings, so ini_set() will not work for them)
ensure that web.com is actually returning PHP source code, not executing the PHP and returning the output - that won't give you anything useful to run (unless the PHP code is generating other PHP code as output, but then you're really getting too far down the rabbit hole)
then just include "http://web.com/foo.php"
Now, to reiterate, don't do it unless you're really very sure of yourself, or you really like having your site hacked.
Note that eval() does not need the leading <?php to work.
An alternative to eval() would be to write the code into a file and then include said file, as sketched below.
Also, make sure you set the CURLOPT_RETURNTRANSFER option to true; otherwise you might just display the code instead of capturing it.
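A minimal sketch of that write-and-include alternative (just as dangerous; the 'remote_' prefix is arbitrary):
// assume $res holds the PHP source, fetched with CURLOPT_RETURNTRANSFER set
$tmp = tempnam(sys_get_temp_dir(), 'remote_');
file_put_contents($tmp, $res);
include $tmp; // unlike eval(), include expects the leading <?php tag
unlink($tmp);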
DISCLAIMER: THIS IS A HORRIBLE IDEA. I HIGHLY RECOMMEND THAT YOU USE SOME OTHER APPROACH.
I'm going to guess that the files on your server are being interpreted by the server they are on, so you get the parsed PHP response. Try renaming them to something else, like .phtml (assuming the remote server isn't configured to parse that extension too), or turn PHP off on the remote server. Then it should just be a matter of:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://web.com/foo.phtml");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$res = curl_exec($ch);
curl_close($ch);
// note: eval() only returns a value if the evaluated code
// itself contains a return statement
$parsed = eval($res);
// echo or do whatever with $parsed
But as I said in my disclaimer, and as everyone commenting on and answering this question has said... this is a security risk and, even beyond that, has all kinds of gotchas. If you elaborate on why you want to do this, we can probably find a better solution that doesn't make Jon Skeet cry.
