I have an XML file locally. It contains data from a marketplace.
It roughly looks like this:
<offer id="2113">
<picture>https://anotherserver.com/image1.jpg</picture>
<picture>https://anotherserver.com/image2.jpg</picture>
</offer>
<offer id="2117">
<picture>https://anotherserver.com/image3.jpg</picture>
<picture>https://anotherserver.com/image4.jpg</picture>
</offer>
...
What I want is to save the images referenced in the <picture> nodes locally.
There are about 9,000 offers and about 14,000 images.
When I iterate through them, I can see the images being copied from the other server, but at some point it returns 504 Gateway Timeout.
The thing is that sometimes the error appears after 2,000 images, sometimes after many more or fewer.
I tried fetching a single image 12,000 times from that server (i.e. only https://anotherserver.com/image3.jpg), but it still gave the same error.
From what I've read, the other server is blocking my requests after a certain number.
I tried calling PHP's sleep(20) after every 100th image, but it still gave me the same error (same with sleep(180)). When I tried a local image with its full path, it didn't give any errors. I tried a second (non-local) server and the same thing occurred.
I use PHP's copy() function to fetch each image from that server.
I also tried file_get_contents() for testing purposes, but got the same error.
I have
set_time_limit(300000);
ini_set('default_socket_timeout', 300000);
as well but no luck.
Is there any way to do this without chunking requests?
Does this error occur on one particular image? It would be great to catch this error, or to keep track of the response delay and send another request after some time, if that can be done.
Is there a fixed number of seconds that I have to wait in order to get those requests rolling again?
And please give me non-curl answers if possible.
UPDATE
Curl and exec(wget) didn't work either. They both ran into the same error.
Can the remote server be tweaked so it doesn't block me (if it does)?
P.S. If I do echo "<img src='https://anotherserver.com/image1.jpg' />" in a loop for all 12,000 images, they show up just fine.
Since you're accessing content on a server you have no control over, only the server administrators know the blocking rules in place.
But you have a few options, as follows:
Run batches of 1000 or so, then sleep for a few hours.
Split the request up between computers that are requesting the information.
Maybe even something as simple as changing the requesting user agent info every 1000 or so images would be good enough to bypass the blocking mechanism.
Or some combination of all of the above.
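Since the question asks about catching the error and waiting before retrying without curl, here is a minimal sketch of that idea. The function names `httpFetch` and `fetchWithBackoff`, the attempt count, and the delays are assumptions made for illustration; nothing here is known to match the remote server's actual limits.

```php
<?php
// Hypothetical helper: fetch a URL with file_get_contents() and report the HTTP status.
// Returns [statusCode, body]; statusCode is 0 when no response was received at all.
function httpFetch(string $url): array
{
    $context = stream_context_create([
        'http' => [
            'timeout'       => 30,   // seconds; an assumed value
            'ignore_errors' => true, // still return a body on 4xx/5xx so the status can be read
        ],
    ]);
    $body = @file_get_contents($url, false, $context);
    // The http wrapper fills $http_response_header in the local scope
    $status = 0;
    foreach ($http_response_header ?? [] as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1]; // keep the last status line (after redirects)
        }
    }
    return [$status, $body === false ? '' : $body];
}

// Retry on 504 (or no response), doubling the wait each time.
// $fetch and $sleep are injectable so the logic can be exercised without a network.
function fetchWithBackoff(
    string $url,
    callable $fetch,
    int $maxAttempts = 5,
    int $firstDelay = 10,
    ?callable $sleep = null
): ?string {
    $sleep = $sleep ?? 'sleep';
    $delay = $firstDelay;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        [$status, $body] = $fetch($url);
        if ($status === 200) {
            return $body;
        }
        if ($status !== 504 && $status !== 0) {
            return null; // some other error (404, 403, ...): retrying is unlikely to help
        }
        if ($attempt < $maxAttempts) {
            $sleep($delay);
            $delay *= 2;
        }
    }
    return null; // gave up; the caller can record the URL and try it again later
}

// Possible usage (performs real requests, so it is left commented out):
// $data = fetchWithBackoff($url, 'httpFetch');
// if ($data !== null) {
//     file_put_contents(basename(parse_url($url, PHP_URL_PATH)), $data);
// }
```

Whether any delay is long enough depends entirely on the remote server's rules, so URLs that still fail are probably best collected and retried in a later run.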
I would suggest trying the following:
1. Reuse the previously opened connection using cURL:
$imageURLs = array('https://anotherserver.com/image1.jpg', 'https://anotherserver.com/image2.jpg' /* ... */);
$notDownloaded = array();
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
foreach ($imageURLs as $URL) {
$filepath = parse_url($URL, PHP_URL_PATH);
$fp = fopen(basename($filepath), "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_URL, $URL);
curl_exec($ch);
fclose($fp);
if (curl_getinfo($ch, CURLINFO_RESPONSE_CODE) == 504) {
$notDownloaded[] = $URL;
}
}
curl_close($ch);
// check to see if $notDownloaded is empty
2. If the images are accessible via both https and http, try using http instead (this will at least speed up the downloading).
3. Check the response headers when 504 is returned, as well as when you load the URL in your browser. Make sure there are no X-RateLimit-* headers. By the way, what are the response headers, actually?
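To make the header check concrete, here is a small sketch that picks rate-limit related headers out of raw header lines, such as those PHP leaves in `$http_response_header` after `file_get_contents()`. The function name `rateLimitHeaders` is hypothetical. `Retry-After` is a standard HTTP header, while the `X-RateLimit-*` names are only a common convention, so this server may well send neither.

```php
<?php
// Pick out rate-limit related headers from raw header lines.
// Header names are lower-cased so the lookup is case-insensitive.
function rateLimitHeaders(array $headerLines): array
{
    $found = [];
    foreach ($headerLines as $line) {
        if (strpos($line, ':') === false) {
            continue; // status line, e.g. "HTTP/1.1 504 Gateway Timeout"
        }
        [$name, $value] = explode(':', $line, 2);
        $name = strtolower(trim($name));
        if ($name === 'retry-after' || strpos($name, 'x-ratelimit-') === 0) {
            $found[$name] = trim($value);
        }
    }
    return $found;
}
```

If the result is non-empty on a 504 response, its values would be a better guide for how long to wait than a guessed sleep().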
Related
I had previously asked a question, and got the answer, but I think I've run into another problem.
The php script I'm using does this:
1 - transfers a file to my server from my backup server
2 - when it's done transferring, it sends some POST data to it using curl, which creates a zip file
3 - when it's done, the result is echoed and, depending on what the result is, it transfers the file or does nothing.
My problem is this:
When the file is small enough (under 500MB) it creates it and transfers it back with no problem. When it's larger, it times out; the zip finishes being created on the remote server, but because of the timeout it doesn't get transferred.
I'm running this from a command line on the backup server. I have this in the php script:
set_time_limit(0); // ignore php timeout
ignore_user_abort(true); // keep on going even if user pulls the plug
while(ob_get_level())ob_end_clean(); // remove output buffers
But it still times out when I run sudo php backup.php
Is using curl making it time out like a browser would, on the other end where the zip is being made? I think the problem is that the response isn't being echoed out.
Edits:
(#symcbean)
I'm not seeing anything, which is why I'm struggling. When I run it from the browser, I see the loading thing in the address bar. After about 30 seconds it just stops. When I do it from the command line, same deal. 30 seconds and it just stops. This only happens when large zips need to be created.
It's being invoked via a file. The file loads a class and sends the connection information to it; the class then contacts the server to make the zip, transfers the zip back, does some stuff to it, then transfers it to S3 for archiving.
It logs into the remote server and uploads a file with curl. Upon a valid response, it curls again with the location of the file as a URL (I'll always know what it is), which fires up the PHP file I just transferred over. The zip ALWAYS gets created with no problem, even up to 22GB; it just sometimes takes a long time, of course. After that it waits for a response of "created". Waiting for that response is where it dies.
So the zip always gets created, but the waiting time is what "I think" is making it die.
Second Edit:
I tried this from the command line:
$ftp_connect= ftp_connect('domain.com');
$ftp_login = ftp_login($ftp_connect,'user','pass');
ftp_pasv($ftp_connect, true);
$upload = ftp_put($ftp_connect, 'filelist.php', 'filelist.php', FTP_ASCII);
$get_remote = 'filelist.php';
$post_data = array (
'last_bu' => '0'
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'domain.com/'.$get_remote);
curl_setopt($ch, CURLOPT_HEADER, 0 );
// adding the post variables to the request
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
//echo the following to get response
$response = curl_exec($ch);
curl_close($ch);
echo $response;
and got this:
<HTML>
<HEAD>
<TITLE>500 Internal Server Error</TITLE>
</HEAD><BODY>
<H1>Internal Server Error</H1>
The server encountered an internal error or
misconfiguration and was unable to complete
your request.<P>
Please contact the server administrator to inform of the time the error occurred
and of anything you might have done that may have
caused the error.<P>
More information about this error may be available
in the server error log.<P>
<HR>
<ADDRESS>
Web Server at domain.com
</ADDRESS>
</BODY>
</HTML>
Again, the error log is blank, the zip still gets created, but because of the timeout around 650MB of creation I can't get the response.
The problem is in the server code that generates the file to be returned.
Check the php error log
It may be timing out for a few reasons, but the log should tell you why.
I fixed it guys, thank you so much to everyone who helped me, it pointed me in the right directions.
In the end, the problem was on the remote server. What was happening was that it was timing out the cURL connection, which didn't send the result I needed back.
What I did to fix it was add a function to my class that (again using curl) checks the HTTP code of the zip file I know it's creating. When the zip is finished, the function returns the result locally. If it's not finished, it sleeps for a few seconds and checks again.
private function watchDog(){
    // poll in a loop instead of recursing, so the result is actually returned
    do {
        $curl = curl_init($this->host.'/'.$this->grab_file);
        //don't fetch the actual page, you only want to check the status
        curl_setopt($curl, CURLOPT_NOBODY, true);
        //do request
        $result = curl_exec($curl);
        $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
        //close the handle before returning or sleeping
        curl_close($curl);
        //if the request itself failed, give up
        if ($result === false) {
            return false;
        }
        //a 404 means the zip is not there yet: wait and check again
        if ($statusCode == 404) {
            sleep(7);
        }
    } while ($statusCode == 404);
    return 'zip created';
}
I have had an application running successfully for a couple months that relies on a cron job to get an xml feed of air pollution statistics. Since January it has run without error, but this morning from 7:00 it has not read the data. The relevant code is as follows:
<?php
define('FEED_URL', 'http://www.beijingaqifeed.com/BeijingAQI/BeijingAir.xml');
$contents = file_get_contents(FEED_URL);
if ($contents === false) echo "READ FAILED";
echo "FILE_GET_CONTENTS SIZE IS " . strlen($contents) . "<br>\n";
If I run this on my machine at home, it works:
FILE_GET_CONTENTS SIZE IS 21538
If it runs on my server it does not:
FILE_GET_CONTENTS SIZE IS 0
I have confirmed with support at the server site that they can browse the url and see the xml data, so there is no firewall or anything blocking this. And, as I say, this has worked successfully over 1000 times (as measured by entries in my database) until this morning, and now it always fails. I have no connection at the data supplier so I can't investigate from their side.
Can anyone suggest why this started failing, and what I could try doing? I have tried fread() and file(), with the same results.
Thanks...
(I have checked allow_url_fopen is turned on)
In this case, it is probably something on the server blocking your PHP; it might be an OS update or something like that. In the past I had similar problems, but mine were about an unkillable daemon linked to a cron job, so the support team and I had big headaches turning it off. What is crucial for further investigation here is this line: FILE_GET_CONTENTS SIZE IS 21538. If someone could obtain it and read it, there's the catch. This answer might not be helpful at all but, as I stated, that line is the key.
Odd, I've just checked the XML URL, and it works normally, as it should.
Probably a permission issue. Try adding the following after file_get_contents to see their response:
if (!empty($http_response_header))
{
var_dump($http_response_header);
//to see what you get back
}
First I thought it would be permissions but, that isn't the case.
Try changing server, maybe your IP is blocked or something?
<?php
function download($website){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$website);
curl_setopt($ch, CURLOPT_FAILONERROR,1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
$retValue = curl_exec($ch);
curl_close($ch);
return $retValue;
}
$XML = download('http://www.beijingaqifeed.com/BeijingAQI/BeijingAir.xml');
var_dump($XML);
Perform:
wget http://www.beijingaqifeed.com/BeijingAQI/BeijingAir.xml via SSH (if possible) and see the response.
It is most likely a 500 error, so the problem is on their side. It depends on what they use, but many admins (like me) avoid exposing server errors, replacing them with useless comments or simply removing them. This is done to deter intruders, as an error code could draw an attacker to a server under my administration, and if it goes down, it's my fault.
This is not a final answer, but it clarifies things somewhat. I tried uploading the file to the server and reading it from there the same way (http:/young-0/testfile.xml) and it succeeded. Then I tried getting "http://www.beijingaqifeed.com" from the server - and that failed. So the bom was a red herring, the connection is being blocked either by my provider (who says it is not them) or the site is refusing connections from my server - thanks to everyone who helped.
For now I have returned to using the twitter feed, which is far less reliable but does have the advantage that I am able to read it.
I'm working on a bit of PHP code that depends on a remote file which happens to be hosted on pastebin. The server I am working on has all the necessary functions enabled, as running it with FILE_URL set to http://google.com returns the expected results. I've also verified through php.ini for extra measure.
Everything should work, but it doesn't. Calling file() on a URL formed as such, http://pastebin.com/raw.php?i=<paste id here>, returns a 500 server error. Doing the same on the exact same file hosted locally or on google.com returns a reasonable result.
I have verified that the URL is set to the correct value and verified that the remote page is where I think that it is. I'm at a loss.
ini_set("allow_url_fopen", true);
// Prefer remote (up-to-date) file, fallback to local file
if( ini_get("allow_url_fopen") ){
$file = file( FILE_URL );
}
if(!isset( $file ) || !$file ) {
$file = file( LOCAL_FILE_PATH );
}
I wasn't able to test this, but you should use curl. Try something like this:
<?php
$url = "http://pastebin.com/2ZdFcEKh";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
Pastebin appears to use a protection system that will automatically block IP addresses that issue requests that are "bot-like".
In the case of your example, you will get a 500 server error since the file() command never completes (since their protection system never closes the connection) and there is no timeout facility in your call. The script is probably considered "bot-like" since file() does not pass through all the standard HTTP headers a typical browser would.
To solve this problem, I would recommend investigating cURL and perhaps looking at setting a browser user agent as a starting point to grant access to your script. I should also mention that it would be in your interests to investigate whether or not this is considered a breach of the Pastebin user agreement. While I cannot see any reference to using scripts in their FAQ (as of 2012/12/29), they have installed protection against scripts for a reason.
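As an alternative to cURL, the missing timeout and user agent can also be supplied to file() through a stream context. The sketch below assumes hypothetical helper names (`buildHttpContext`, `loadLines`) and illustrative values for the timeout and the User-Agent string; none of these settings are known to satisfy Pastebin's protection.

```php
<?php
// Build an http stream context with an explicit timeout and User-Agent.
// Both defaults are illustrative assumptions.
function buildHttpContext(float $timeoutSeconds = 10.0, string $userAgent = 'MyScript/1.0')
{
    return stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds, // give up instead of hanging on an open connection
            'user_agent' => $userAgent,
        ],
    ]);
}

// Prefer the remote (up-to-date) file; fall back to the local copy
// when the request fails, times out, or returns nothing.
function loadLines(string $remoteUrl, string $localPath, $context)
{
    $lines = @file($remoteUrl, 0, $context);
    if ($lines === false || $lines === []) {
        $lines = file($localPath);
    }
    return $lines;
}

// Possible usage with the constants from the question (performs a real request):
// $file = loadLines(FILE_URL, LOCAL_FILE_PATH, buildHttpContext());
```

With a timeout in place, a blocked request should fail within a bounded time and the local fallback can take over, rather than the whole script ending in a 500 error.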
I'm trying to make a PHP script that will check the HTTP status of a website as fast as possible.
I'm currently using get_headers() and running it in a loop of 200 random urls from mysql database.
To check all 200 - it takes an average of 2m 48s.
Is there anything I can do to make it (much) faster?
(I know about fsockopen - It can check port 80 on 200 sites in 20s - but it's not the same as requesting the http status code because the server may be responding on the port - but might not be loading websites correctly etc)
Here is the code..
<?php
function get_httpcode($url) {
$headers = get_headers($url, 0);
// Return http status code
return substr($headers[0], 9, 3);
}
###
## Grab task and execute it
###
// Loop through task
while($data = mysql_fetch_assoc($sql)):
$result = get_httpcode('http://'.$data['url']);
echo $data['url'].' = '.$result.'<br/>';
endwhile;
?>
You can try the cURL library. You can send multiple requests in parallel with curl_multi_exec().
Example:
$ch = curl_init('http_url');
curl_setopt($ch, CURLOPT_HEADER, 1);
$c = curl_exec($ch);
$info = curl_getinfo($ch, CURLINFO_HTTP_CODE);
print_r($info);
UPDATED
Look at this example: http://www.codediesel.com/php/parallel-curl-execution/
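For reference, a parallel status check with the curl_multi functions might look roughly like the sketch below. The function name `checkStatuses` and the timeout value are assumptions. It sends HEAD requests, which some servers answer differently from GET, and it keys results by URL, so duplicate URLs collapse into one entry.

```php
<?php
// Check the HTTP status of many URLs in parallel with curl_multi.
// Returns [url => statusCode]; a code of 0 means no HTTP response was received.
function checkStatuses(array $urls, int $timeoutSeconds = 10): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request: headers only
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // do not print anything
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
        }
    } while ($running > 0 && $status === CURLM_OK);

    $codes = [];
    foreach ($handles as $url => $ch) {
        $codes[$url] = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $codes;
}

// Possible usage (performs real requests):
// foreach (checkStatuses(['http://example.com/']) as $url => $code) {
//     echo $url . ' = ' . $code . "<br/>\n";
// }
```

Because all requests run concurrently, the total time should be close to that of the slowest single request (bounded by the timeout) rather than the sum of all of them.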
I don't know if this is an option you can consider, but you could run all of them at almost the same time using a fork; this way the script will take only a bit longer than one request:
http://www.php.net/manual/en/function.pcntl-fork.php
You could put this in a script that is run in CLI mode and launch all the requests at the same time, for example.
Edit: you say that you have 200 calls to make, so one thing you might run into is losing the database connection. The problem is caused by the fact that the link is destroyed when the first script completes. To avoid that, you could create a new connection for each child. I see that you are using the standard mysql_* functions, so be sure to pass the 4th parameter ($new_link) to mysql_connect() so that a new link is created each time. Also check the maximum number of simultaneous connections on your server.
I've got a simple php script to ping some of my domains using file_get_contents(), however I have checked my logs and they are not recording any get requests.
I have
$result = file_get_contents($url);
echo $url . " pinged ok\n";
where $url for each of the domains is just a simple string of the form http://mydomain.com/, echo verifies this. Manual requests made by myself are showing.
Why would the get requests not be showing in my logs?
Actually I've got it to register the hit when I send $result to the browser. I guess this means the webserver only records browser requests? Is there any way to mimic such in php?
OK, I tried curl in PHP:
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "getcorporate.co.nr");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
same effect though - no hit registered in logs. So far it only registers when I feed the http response back from my script to the browser. Obviously this will only work for a single request and not a bunch as is the purpose of my script.
If something else is going wrong, what debugging output can I look at?
Edit: D'oh! See comments below accepted answer for explanation of my erroneous thinking.
If the request is actually being made, it would be in the logs.
Your example code could be failing silently.
What happens if you do:
<?PHP
if ($result = file_get_contents($url)){
echo "Success";
}else{
echo "Epic Fail!";
}
If that's failing, you'll want to turn on some error reporting or logging and try to figure out why.
Note: if you're in safe mode, or otherwise have fopen url wrappers disabled, file_get_contents() will not grab a remote page. This is the most likely reason things would be failing (assuming there's not a typo in the contents of $url).
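One way to find out why the call is failing is to capture PHP's last error instead of discarding it. This is only a sketch: the function name `fetchOrExplain` and the shape of the returned array are assumptions, not anything from the original script.

```php
<?php
// Fetch a URL (or local path) and report why it failed instead of failing silently.
function fetchOrExplain(string $url): array
{
    // Remote URLs need the fopen URL wrappers to be enabled
    if (preg_match('#^https?://#i', $url) && !ini_get('allow_url_fopen')) {
        return ['ok' => false, 'body' => null, 'error' => 'allow_url_fopen is disabled'];
    }
    error_clear_last();
    $body = @file_get_contents($url);
    if ($body === false) {
        $last = error_get_last(); // the warning suppressed by @ is still recorded here
        return ['ok' => false, 'body' => null, 'error' => $last['message'] ?? 'unknown error'];
    }
    return ['ok' => true, 'body' => $body, 'error' => null];
}

// Possible usage (performs a real request):
// $r = fetchOrExplain($url);
// echo $r['ok'] ? "$url pinged ok\n" : "$url failed: {$r['error']}\n";
```

Unlike the plain `if ($result = file_get_contents($url))` test, this compares against false explicitly, so an empty but successful response is not reported as a failure.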
Use curl instead?
That's odd. Maybe there is some caching afoot? Have you tried changing the URL dynamically ($url = $url."?timestamp=".time() for example)?