Using cURL and PHP Simple HTML DOM Parser gives me a 500 Internal Server Error - PHP

I am using PHP Simple HTML DOM Parser; you can read more about it here: http://simplehtmldom.sourceforge.net/
I am also using cURL, because the web address http://www.sportsdirect.com does not load with the standard examples from Simple HTML DOM.
So here is the code I use:
<?php
include_once('../simple_html_dom.php');
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.sportsdirect.com/');
curl_setopt($curl, CURLOPT_HEADER, 0);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
$str = curl_exec($curl);
curl_close($curl);
$html= str_get_html($str);
echo $html->plaintext;
?>
When I try to load the script, it gives me a 500 Internal Server Error:
Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.
Please contact the server administrator, webmaster#superweb.bg and inform them of the time the error occurred, and anything you might have done that may have caused the error.
More information about this error may be available in the server error log.
Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.
This script just does not work for this web address; when I try to load another website, like mandmdirect.com, it works fine!
Where is my mistake, and how can I make this work?

Try this for the cURL fetch; it works for me in this case. This is a standard set of cURL options and settings I use that work well:
include_once('simple_html_dom.php');
$url = "http://www.sportsdirect.com";
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_SSLVERSION, 3);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$str = curl_exec($curl);
curl_close($curl);
$html = str_get_html($str);
echo $html->plaintext;
I believe the issue with your original cURL settings was the missing user agent. Try the same script with the CURLOPT_USERAGENT line commented out to see what I mean.
Many servers have firewall settings that reject cURL requests made without a proper user agent. The user agent set here is a fairly generic Firefox one, so feel free to experiment with something else.
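If the request still fails, it also helps to check cURL's own error state before handing the result to the parser - passing false into str_get_html() fails silently. A minimal error-checking sketch (the option set mirrors the answer above; the messages and the newer user-agent string are illustrative):

```php
<?php
// Fetch a page and surface cURL/HTTP failures instead of silently
// passing false into str_get_html().
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.sportsdirect.com');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0');

$str = curl_exec($curl);
if ($str === false) {
    // Transport-level failure (DNS, timeout, SSL, ...)
    die('cURL error ' . curl_errno($curl) . ': ' . curl_error($curl));
}
$status = curl_getinfo($curl, CURLINFO_HTTP_CODE); // e.g. 200, 403, 500
curl_close($curl);
if ($status >= 400) {
    // The server answered, but with an error page (like the 500 above)
    die("HTTP error: $status");
}
```

This way you see whether the 500 comes back as an HTTP status on an otherwise successful transfer, or whether the connection itself is being refused.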

Try setting a Host header in the request. The target domain may be on a shared server, and without a Host header the server doesn't know which site to serve.
curl_setopt($curl, CURLOPT_HTTPHEADER, array('Host: www.sportsdirect.com'));

Related

How can I get HTML data from a site that uses Cloudflare?

First of all, sorry for my bad English.
I'm trying to get the HTML code from https://www.uasd.edu.do/, but when I try to fetch it with the PHP function file_get_contents() or with cURL, it simply doesn't work.
With file_get_contents() I get a 403 HTTP error. With cURL, I get a CAPTCHA page whose CAPTCHA never actually appears.
I tried sending cookies with cURL and setting a user agent, but I'm still stuck at the same point. I also tried to find the real IP address of the site, without success. Please help me! I'd really appreciate it.
The code:
$curl = curl_init();
if (!$curl) {
    die("Is not working");
}
curl_setopt($curl, CURLOPT_URL, "https://uasd.edu.do/");
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:64.0) Gecko/20100101 Firefox/64.0');
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_FAILONERROR, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_TIMEOUT, 50);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$html = curl_exec($curl);
echo $html;
curl_close($curl);
The output:
Please enable cookies. One more step: please complete the security check to access www.uasd.edu.do. Why do I have to complete a CAPTCHA? Completing the CAPTCHA proves you are a human and gives you temporary access to the web property. What can I do to prevent this in the future?
If you are on a personal connection, like at home, you can run an anti-virus scan on your device to make sure it is not infected with malware.
If you are at an office or shared network, you can ask the network administrator to run a scan across the network looking for misconfigured or infected devices.
Cloudflare Ray ID: 4fcbf50d18679f88 • Your IP: ... • Performance & security by Cloudflare
Note: The "please enable cookies" message appears whether or not I send cookies.

cURL error 35 when connecting to an http website

I couldn't find an answer to this question. Sometimes* while trying to retrieve data from an http (NOT https) site, I get error 35 - SSL connect error.
The URL I'm trying to reach is, e.g., http://www.aliexpress.com/item//32566080839.html. I then get redirected to the "full URL": http://www.aliexpress.com/item/NEW-Sport-Headband-Bike-Halloween-Skull-face-mask-balaclava-Skull-Bandana-Paintball-Ski-Motorcycle-Helmet-Neck/32566080839.html
My cURL code:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://aliexpress.com/item//'. $id .'.html');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_TIMEOUT, 3);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0');
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($curl);
I've tried adding curl_setopt($curl, CURLOPT_SSLVERSION, 3); but it doesn't help.
Why does an http site give error 35? Is this normal?
Is it possible that AliExpress is blocking my requests?
Sometimes I also get error 28 (timeout reached), even with a 10-second timeout.
*By "sometimes" I mean it works for a few hours, then stops working for about 10 minutes, and then works again.
It looks like you are trying to spider their site using the ID, and as a consequence the site blocks you. Since you are seeing an SSL error, it is very likely that during the blocking period they redirect you to an error page that starts with https://.
For debugging purposes, you can enable verbose mode and inspect the headers; you'll see what is inside the Location: response header.
curl_setopt ($curl, CURLOPT_VERBOSE, true);
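By default the verbose log goes to STDERR. To inspect it from PHP, you can redirect it into a memory stream (the stream handling here is a sketch, not part of the original answer):

```php
<?php
// Redirect cURL's verbose log into a memory stream so the
// Location: response headers can be read back programmatically.
$curl = curl_init('http://www.aliexpress.com/item//32566080839.html');
$log = fopen('php://temp', 'w+');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // don't echo the body
curl_setopt($curl, CURLOPT_VERBOSE, true);
curl_setopt($curl, CURLOPT_STDERR, $log);         // verbose output goes here
curl_exec($curl);
curl_close($curl);
rewind($log);
echo stream_get_contents($log); // search this for "Location:" lines
fclose($log);
```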

cURL in PHP to get HLS returns 403 Forbidden

I'm trying to get an HLS stream from another site, but it always returns 403 Forbidden :( This is my function; it works well on localhost, but not on my server.
function getPage($url, $referer, $header){
    $timeout = 30;
    $curl = curl_init();
    if (strstr($referer, "://")) {
        curl_setopt($curl, CURLOPT_REFERER, $referer);
    }
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_TIMEOUT, $timeout);
    curl_setopt($curl, CURLOPT_USERAGENT, sprintf("Mozilla/%d.0", rand(4, 5)));
    curl_setopt($curl, CURLOPT_HEADER, $header);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    $html = curl_exec($curl);
    curl_close($curl);
    return $html;
}
echo getPage("http://www.wezatv.com/dooball/assp1.php", "http://www.wezatv.com/", 1);
Can anyone help me?
Thanks.
Make sure that your server's IP is not a blocked IP; use a cURL proxy to test.
Your link http://www.wezatv.com/dooball/assp1.php works normally on my server.
Your code is 100% correct; I think the problem is that your server blocks the outgoing request, or the target server has blocked your server's IP.
Make sure your server doesn't block outgoing connections on port 80.
If there is no firewall, could you try running the command wget http://www.wezatv.com/dooball/assp1.php to see whether it can download the content? If it also fails with a 403 error, it means wezatv.com has blocked every request from your server.
Some servers temporarily block all requests if they see too many requests in a short period. So if you want to crawl the site's data, you should reduce the number of requests, for example by implementing a delay.
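A crude throttle along those lines might look like this (the URL list and the 2-second pause are illustrative; getPage() is the function from the question above):

```php
<?php
// Space out requests so the target server is less likely to
// rate-limit or block the crawler.
$urls = [
    'http://www.wezatv.com/dooball/assp1.php',
    // ... more URLs ...
];
foreach ($urls as $url) {
    $html = getPage($url, 'http://www.wezatv.com/', 0); // getPage() from above
    // ... process $html ...
    sleep(2); // pause between requests; tune to the site's tolerance
}
```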

cURL shows a blank page

My cURL code shows me a blank page:
<?
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_URL,"http://mysite/scripts/showsomething.php");
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
$result=curl_exec ($ch);
curl_close ($ch);
$core = explode('<!--- BEGIN -->', $result);
$main = explode('<!--- END -->', $core[1]);
echo $main[0];
?>
This code works fine on localhost, but not on the server...
There can be several reasons for your problem.
1) Change <? to <?php and see whether it works.
2) As a quick test, run this code on your server and check whether it shows the output:
<?php
echo "sabuj";
?>
3) Some sites check for a user-agent string in the request. When they find no user agent, they return a blank page. You can work around this by setting a user agent, like below:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
4) Another thing you can do is access the server with an SSH client (if you have one), fetch the URL with the wget tool, and see whether you can download your page:
wget "http://yoursite/page/blabla/...php"
5) Finally, run your cURL code with verbose mode enabled. It will help you debug your cURL requests:
curl_setopt($ch, CURLOPT_VERBOSE,true);
Working Code...
$ch = curl_init();
$timeout = 10;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
var_dump($data);
Hope it helps...
Turn your error reporting all the way to 11.
error_reporting(E_ALL);
This will tell you a little more about the issue you are facing. The most common culprit when cURL code works in a dev environment but fails after moving to a server is a missing cURL package.
You can check to see if you have curl installed by doing the following:
if (!function_exists('curl_version')) {
    throw new Exception('Curl package missing');
}
Thus "PHP Fatal error: Call to undefined function curl_init() in..." is a very common error that is thrown.
Some additional debugging tips:
print_r the response of curl_exec($handle);
print_r curl_error($handle); this will give you the cURL error message (curl_errno($handle) gives the numeric code).
Set up a cURL proxy using curl_setopt and CURLOPT_PROXY, set the value to your ip:port, open port 8888 on your router, and install Charles Proxy. This will show you an exact printout of the request and response.
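Put together, the first two tips look something like this (the handle name and URL are placeholders):

```php
<?php
// Dump both the raw response and cURL's error state in one pass.
$handle = curl_init('http://mysite/scripts/showsomething.php');
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($handle);
print_r($response); // false on a transport-level failure
if ($response === false) {
    echo 'cURL error ' . curl_errno($handle) . ': ' . curl_error($handle);
}
curl_close($handle);
```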
I've read the previous answers, and it seems that your code is fine; the problem is connectivity to the remote server. Please try this:
<?php
//debug
error_reporting(E_ALL);
ini_set('display_errors', 1);
//debug end
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_URL,"http://mysite/scripts/showsomething.php");
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
$result=curl_exec ($ch);
curl_close ($ch);
echo $result;
?>
What does it output?
Notes:
Do you administer the other server?
If not, is there a possibility that your IP was blocked by the remote server?
Does the script http://mysite/scripts/showsomething.php contain any errors?
If you're able to edit it, please enable PHP errors on showsomething.php by adding the following code at the top of it:
I've solved the problem by adding the following code. The problem was SSL.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
I hope it helps.
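Note that disabling peer verification turns off certificate checking entirely. If the underlying problem is that the server cannot find its CA bundle, a safer fix is to point cURL at one explicitly (the path below is an assumption; it varies by distribution):

```php
// Keep verification on and tell cURL where the CA bundle lives.
// '/etc/ssl/certs/ca-certificates.crt' is a typical Debian/Ubuntu
// location; adjust for your system.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
curl_setopt($ch, CURLOPT_CAINFO, '/etc/ssl/certs/ca-certificates.crt');
```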

Checking a US-only website's status using cURL

[Problem]
There is a website which works for US citizens only (it shows info "A" to US citizens and info "B" to non-US citizens). I need to constantly monitor this webpage for changes to the "A" info - an email should be sent when something changes! How do I do it? The problem is that I live in Europe!
[Already accomplished]
I have a Linux server, a daemon, and a cURL PHP script which accomplish this task! It works great for all non-US-only websites.
[Question]
One way to solve the problem might be to rent a US server, but that's not acceptable at all and would cost a lot! Another way might be to use a US VPN on my server, but for various reasons I won't do that. Is there a way to run cURL through a proxy, maybe? Any ideas?
Current code is the following:
function getrequest($url_site/*,$post_data*/) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url_site);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3');
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_COOKIEJAR, COOKIE_FILE); // Cookie management.
    curl_setopt($ch, CURLOPT_COOKIEFILE, COOKIE_FILE);
    $result = curl_exec($ch); // run the whole process
    curl_close($ch);
    return $result;
}
and
$sleep_time = 1;
$login_wp_url = "http://www.mysite.com";
set_time_limit(60*10);
$result = getrequest($login_wp_url);
How do I grab content from a US-only website?
P.S. To get an idea of what I mean, try visiting Hulu from a European country.
P.P.S. It's not Hulu, and it's not homework.
Many cloud service providers, e.g. Heroku and Amazon, offer their smallest instances for free. You could simply set one of these up, make sure you are provisioned on a US-located server, and run your script there.
Another possibility would be to use a (free) proxy for these requests. Here is a list of free US proxy servers: http://www.xroxy.com/proxy-country-US.htm
curl_setopt($ch, CURLOPT_PROXY, "http://160.76.xxx.xxx:8080");
curl_setopt($ch, CURLOPT_PROXYPORT, 8080);
curl_setopt ($ch, CURLOPT_PROXYUSERPWD, "xxx:xxx");
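A complete request through such a proxy might look like this (the proxy address, port, and credentials are placeholders carried over from the snippet above, not a working proxy):

```php
<?php
// Route the monitoring request through a US-based proxy.
// Proxy details are placeholders; substitute a real proxy
// from the list linked above.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.mysite.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_PROXY, 'http://160.76.xxx.xxx:8080');
curl_setopt($ch, CURLOPT_PROXYPORT, 8080);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'user:password');
$result = curl_exec($ch);
if ($result === false) {
    echo 'Proxy request failed: ' . curl_error($ch);
}
curl_close($ch);
```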
