UPDATE #2:
I have confirmed with my contacts at NOAA that they are having big time interconnectivity problems all across NOAA. For example, they are only getting precipitation data from 2 locations. I am sure this is related. I let NOAA know about this thread and the work you all did to identify this as a connectivity issue.
UPDATE: Now the wget command works from my local server but not from the 1and1.com server. I guess that explains why it works from my browser. Must be a connection issue back east as some of you are also having the same problem. Hopefully this will clear itself as it looks like I can't do anything about it.
EDIT: It is clear that the fetch problem I am having is unique to NOAA addresses: there is no problem with my code and other sites, all fetches work just fine in a normal browser, and nothing I have been able to try will fetch the file with code.
My question is: how can I make code fetch the file as reliably as the browser does?
I have used this command to get an external web page for almost 2 years now
wget -O <my web site>/data.txt http://www.ndbc.noaa.gov/data/latest_obs/latest_obs.txt
I have tried this from two different servers with the same result so I am sure I am not being blocked.
Suddenly this morning it quit working. To make matters worse, it would leave processes running on the server until there were enough of them that my account was shut down, and all my web sites were returning errors until we killed the 49 sleeping processes one at a time.
I got no help from 1and1 tech support. They said it was my cron script, which was just the one line above.
So I decided to rewrite the file fetch in PHP. I tried file_get_contents, and I have tried curl and fgets as well. None of this worked, so I tried lynx.
Nothing loads this particular URL but everything I tried works fine on other urls.
But if I just copy http://www.ndbc.noaa.gov/data/latest_obs/latest_obs.txt into a browser, no problem - the file displays promptly.
Obviously it is possible to read this file because the browser is doing it. I have tried Chrome, IE, and Firefox and none had a problem loading this page but nothing I have tried in code works.
What I want to do is read this file and write it to the local server to buffer it. Then my code can parse it for various data requests.
What is a reliable way to read this external web page?
It was suggested I add a user agent so I changed my code to the following
function read_url($url) {
    // Pretend to be a regular browser in case the server filters on User-Agent.
    $agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_URL, $url);

    $output = curl_exec($ch);
    if (curl_errno($ch)) {
        // Report the actual cURL error message as an HTML comment.
        echo "<!-- " . curl_error($ch) . " -->";
    }
    curl_close($ch);
    return $output;
}
Again, it works on other external web sites but not on this one.
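For reference, the buffering step described earlier (fetch once, write a local copy, let later requests parse that copy) might look roughly like this, assuming the read_url() function above; the buffer path is a placeholder, not my real location:
// Sketch only: fetch the NOAA file and buffer it locally.
$data = read_url('http://www.ndbc.noaa.gov/data/latest_obs/latest_obs.txt');
if ($data !== false && $data !== '') {
    // Write to a local buffer file so data requests can parse it
    // without hitting NOAA again.
    file_put_contents('/path/to/buffer/data.txt', $data, LOCK_EX);
}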
I tried running the wget manually. Here is what I got:
(uiserver):u49953355:~ > wget -O <my site>/ships_data.txt http://www.ndbc.noaa.gov/data/realtime2/ship_obs.txt
--2013-11-17 15:55:21-- http://www.ndbc.noaa.gov/data/realtime2/ship_obs.txt
Resolving www.ndbc.noaa.gov... 140.90.238.27
Connecting to www.ndbc.noaa.gov|140.90.238.27|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 690872 (675K) [text/plain]
Saving to: `<my site>/ships_data.txt'
0% [ ] 1,066 --.-K/s eta 7h 14m
It just stays at 0%
NOTE: <my-site> is the web address where my data is stored. I did not want to publish the address of my buffer area, but it is something like mydomain/buffer/.
I just tried the same thing from another server (not 1and1)
dad#myth_desktop:~$ wget -O ships_data.txt http://www.ndbc.noaa.gov/data/realtime2/ship_obs.txt
--13:14:32-- http://www.ndbc.noaa.gov/data/realtime2/ship_obs.txt
=> `ships_data.txt'
Resolving www.ndbc.noaa.gov... 140.90.238.27
Connecting to www.ndbc.noaa.gov|140.90.238.27|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 690,872 (675K) [text/plain]
3% [====> ] 27,046 --.--K/s ETA 34:18
It is stuck at 3% this time.
Both your wget commands worked for me.
It also seems that NOAA is not blocking your requests, since you get the 200 response code, the HTTP headers (content length, type, etc.), and part of the data (the 1,066 bytes received put you somewhere around rows 7-8 of the file).
It may be that your connection (in general, or specifically to NOAA) is slow or passing through some buffering proxy. Until the proxy gets all or most of the data, the connection will look to wget as if it is stalling. Does it work to retrieve this file: http://www.ndbc.noaa.gov/robots.txt?
wget's --debug option might also help to find the problem.
Anyway, regarding the hanging wget processes, you can use the --timeout option (e.g. --timeout=60) to limit the waiting time before failing (http://www.gnu.org/software/wget/manual/wget.html):
wget -O ships_data.txt http://www.ndbc.noaa.gov/data/realtime2/ship_obs.txt --timeout=10
If you want to set a user agent (like you did in the PHP script), you can use the --user-agent option:
wget -O ships_data.txt http://www.ndbc.noaa.gov/data/realtime2/ship_obs.txt "--user-agent=Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)"
As for curl vs. wget, you can simply replace the wget command with a curl command (instead of doing it in PHP):
curl -o ships_data.txt http://www.ndbc.noaa.gov/data/realtime2/ship_obs.txt --user-agent "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)"
Andrei
The file is available, but even though it's small, it is taking a long while to download. In a couple of attempts it took up to 3 minutes and 47 seconds to fetch this tiny 23 KB file.
It is clearly some issue with their network, not much you can do about it.
Consider using set_time_limit(600) to allow your PHP script up to 10 minutes to download the file, while still not letting it run so long that it gets stuck if the download fails.
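A rough sketch of how those two limits could be combined (the timeout values are only illustrative, and the buffer file name is a placeholder):
// Let the script run for up to 10 minutes...
set_time_limit(600);

// ...but give up on the HTTP transfer itself after 5 minutes, so a dead
// connection cannot hold the script open forever.
$context = stream_context_create(array(
    'http' => array('timeout' => 300),
));

$data = file_get_contents('http://www.ndbc.noaa.gov/data/latest_obs/latest_obs.txt', false, $context);
if ($data !== false) {
    file_put_contents('data.txt', $data, LOCK_EX);
}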
Since the OP was initially not able to get the wget command to run successfully by hand, my guess was that the server IP was blocked.
Manually running the following command on the hosted server hung, which added weight to that speculation.
wget -O <my web site>/data.txt http://www.ndbc.noaa.gov/data/latest_obs/latest_obs.txt
To check whether wget itself was working, the OP ran it against a dummy endpoint, wget -O <web-site>/google.log www.google.com, which worked.
Since the OP mentioned that downloads sometimes proceed but not always, and that it also worked from another server on the same hosted solution, I think we can now pin this down as an issue on the other website's network.
My guess is that the cron jobs are being run at a very short interval (say, every minute), something like
* * * * * wget -O <my web site>/data.txt http://www.ndbc.noaa.gov/data/latest_obs/latest_obs.txt
(or at a similarly short interval), and due to whatever server load the external website is under, the earlier requests either time out or do not finish within the interval allotted to them (1 minute).
Because of this, the OP is facing a race condition in which multiple cron processes try to write to the same file, but none of them can ever finish writing it because of the delay in receiving the file's packets (for example, one process hanging since 12:10 AM, another started at 12:11 AM, and one more at 12:12 AM, none of them done).
The solution would be to run them a little less frequently, or, if the OP wants to keep the same schedule, to start a new download only if a previous one is not still in progress. For checking whether a process is already running, check this; a minimal sketch is also shown below.
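One simple way to do that in PHP is an flock()-based lock file (a sketch; the lock file path and download step are placeholders):
// Sketch: skip this cron run if a previous download is still in progress.
$lock = fopen('/tmp/noaa_download.lock', 'c');
if ($lock === false || !flock($lock, LOCK_EX | LOCK_NB)) {
    // Another run still holds the lock; exit instead of piling up processes.
    exit(0);
}

// ... perform the wget/curl download here ...

flock($lock, LOCK_UN);
fclose($lock);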
Related
I set up a crontab to execute a php file every minute.
Now I need to create the php file but I’m clueless on what the contents should be.
All the code needs to do is visit the website url.
No need to save anything.
It just needs to mimic loading the home page, just like a browser would.
That in turn triggers a chain of events which are already in place.
It is an extremely low traffic site so that’s the reason for it.
I know, I could do it with curl.
But for reasons I won’t get into, it needs to be a php file.
Can anyone point me in the right direction please. Not expecting you to provide code, just direction.
Thanks!
You can use curl in PHP to just send a request to the page:
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, "the.url-of-the-page.here");
curl_exec($curl_handle);
curl_close($curl_handle);
You could also do it with one line (note that the whole HTML of the page is retrieved, which takes a bit longer):
file_get_contents('URL');
As Prince Dorcis stated, you could also use curl. If the website is not yours, you should maybe (or may have to) use curl and send the request with a user agent (you can find a list here):
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
Marco M is right, but there is a catch (it may not affect most people, but it does sometimes):
file_get_contents("https://example.com");
normally does the trick (I use it more than I should), BUT!
There is a setting in php.ini (allow_url_fopen) that needs to be On for that function to be allowed to open URLs!
I ran into that once with a web hoster; they did not allow it ;)
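A quick way to check whether a host has it enabled (a sketch; assuming the setting in question is allow_url_fopen):
// An empty string or "0" means the URL wrappers are disabled and
// file_get_contents('https://...') will not work.
var_dump(ini_get('allow_url_fopen'));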
I use curl in a custom Zend Framework library to make a GET request to a Drupal website. On the Drupal end I use REST export pages that receive the GET request and return some data.
This is my curl request structure in ZF2:
$this->chanel = curl_init();
curl_setopt($this->chanel, CURLOPT_URL, "SOME URL LINK");
curl_setopt($this->chanel, CURLOPT_TIMEOUT, 30);
curl_setopt($this->chanel, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($this->chanel, CURLOPT_USERAGENT, 'Mozilla/5.001 (windows; U; NT4.0; en-US; rv:1.0) Gecko/25250101');
$result = curl_exec($this->chanel);
curl_close($this->chanel);
Both Drupal and Zend Framework websites are located on my localhost.
The execution time normally takes around 15 seconds. This is too long.
I tried the same link with a Restlet Client (Chrome Extension) and it takes around 1 second or less to execute and retrieve the data.
Do you have any suggestions why it is so slow and how I can improve the speed?
Try adding some logging to your code: put timestamps in the various code blocks and inside functions, and check whether it is curl that is taking the time or something else. Put timestamps and log lines around each step to debug the performance issue.
Also try it from the command line as follows:
curl --get "URL HERE"
Check whether that is fast or not. If it is fast, then it is the code you wrote that is slow; as a workaround you could try executing the direct command from your code.
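A sketch of how the timing could be broken down with cURL itself (the URL is a placeholder for the Drupal REST export path):
// Sketch: time the same request from plain PHP and see where the time goes.
$ch = curl_init('http://localhost/rest/export/pages'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);

$start  = microtime(true);
$result = curl_exec($ch);
$wall   = microtime(true) - $start;

// cURL's own breakdown of the request phases.
printf(
    "total=%.3fs dns=%.3fs connect=%.3fs firstbyte=%.3fs wall=%.3fs\n",
    curl_getinfo($ch, CURLINFO_TOTAL_TIME),
    curl_getinfo($ch, CURLINFO_NAMELOOKUP_TIME),
    curl_getinfo($ch, CURLINFO_CONNECT_TIME),
    curl_getinfo($ch, CURLINFO_STARTTRANSFER_TIME),
    $wall
);
curl_close($ch);
If the DNS or connect phase dominates, the problem is in name resolution or networking rather than in Drupal.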
Try using the IP address instead of the hostname.
If your Drupal site is on the same machine as your ZF2 app, you can use 127.0.0.1.
I think the delay can be caused by the DNS lookup.
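If that is the cause, something like this should make the difference obvious (a sketch; the path and virtual-host name are placeholders, and the Host header is only needed if Drupal runs as a name-based virtual host):
// Sketch: hit Drupal by IP so no DNS resolution happens for the request.
$ch = curl_init('http://127.0.0.1/rest/export/pages'); // placeholder path
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Host: mydrupalsite.local')); // placeholder vhost name
$result = curl_exec($ch);
curl_close($ch);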
I have a few servers that fetch images from other sites.
After working for months, Apache started crashing every few hours (see the config at the bottom of the post).
Investigation using logging in the code shows that sometimes file_get_contents hangs, keeping the Apache process in the W state forever. Sample URL of a fetched file that hung: https://www.mxstore.com.au/assets/thumb/3104041-c.jpg
I have set timeouts in three places and the Apache process still hangs forever:
set_time_limit(10);
ini_set('default_socket_timeout', 10);
And also in the stream context (see 'timeout' => 3):
$opts = array(
    'http' => array(
        'header'  => "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0",
        'timeout' => 3,
    ),
);
$context = stream_context_create($opts);
$data = file_get_contents($product['p_img'], false, $context, -1, 1500000);
How can I make the timeout work, and/or understand why the image is not fetched?
Config:
PHP Version 5.5.9-1ubuntu4.19
Apache/2.4.7 (Ubuntu)
Apache API Version 20120211
Unfortunately, none of my searches yielded a solution. I ended up implementing the calls with cURL.
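For reference, the cURL replacement could look roughly like this (a sketch; the timeout values are illustrative, and the point is that CURLOPT_CONNECTTIMEOUT/CURLOPT_TIMEOUT are enforced by cURL itself rather than relying on PHP's socket timeout):
// Sketch: fetch the image with hard cURL timeouts so a stalled remote
// server cannot keep the Apache worker in the W state forever.
$ch = curl_init($product['p_img']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 3); // give up if no connection within 3 s
curl_setopt($ch, CURLOPT_TIMEOUT, 10);       // give up if the whole transfer takes longer than 10 s
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0');
$data = curl_exec($ch);
curl_close($ch);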
I am running a PHP script through cron every 30 minutes which parses and saves some pages of my site on the same server. I need to run the script with a Firefox or Chrome user agent, since the parsed pages have some interface dependency on CSS3 styles.
I tried this within my script:
curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13");
But the Firefox- or Chrome-dependent stylesheets don't load with it. I tried both double and single quotes.
My question is: is it possible to spoof the user agent for scripts run on the server rather than in a browser, and if so, how?
NOTE: I know that depending on the browser for the interface is bad, but I want to know whether this is even possible.
EDIT
My script runs through the sitemap on the server and creates an HTML cache of the pages in the sitemap. It doesn't need to execute any JS or CSS files. The only thing needed is to spoof the user agent so that the generated cache contains the extra JS and CSS files for that browser that are included in the header.
You can think of it as generating cache files for every browser type (IE, WebKit, and Firefox) so that I can serve the right cache file to each user based on their browser. At the moment I am serving the same files to all users, that is, without the extra CSS files.
I think I will need to hardcode the CSS file into my page so that it is always included in the cache (non-compatible browsers won't show any change; it will only add file overhead for them). Thanks anyway.
When you run a PHP script through cron, the idea is that it is a script, not a web page being requested. Even if you could spoof the user agent, the CSS and JavaScript are not going to be executed as they would be inside a real web browser. The point of cron is to run scripts, raw scripts, that do, for example, file operations.
Well, first I would look at your user agent string. I think it is unnecessarily complicated; try simply Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1).
If this does not work for you, you could try executing the curl call as a shell command with exec(). In that case you could run into the problem that the page is not actually rendered. You could work around this by using an X virtual framebuffer, which makes your page render in memory without showing any screen output, i.e. behave like a browser.
You could do it like this:
exec("xvfb-run curl [...]");
You can also set the user agent by using ini_set('user_agent', 'your-user-agent');
Maybe that will help you.
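For example (a sketch; the URL is a placeholder and any modern user-agent string would do):
// The 'user_agent' ini setting applies to PHP's URL wrappers, so
// file_get_contents() will send it as the User-Agent header.
ini_set('user_agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');
$html = file_get_contents('http://example.com/page-to-cache'); // placeholder URL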
I have written a PHP script based on a piece of code I found using Google. Its purpose is to check a particular site's position in Google for a given keyword. First, it prepares an appropriate URL to query Google (something like "http://www.google.com/search?q=the+keyword&ie=utf-8&oe=utf-8&num=50"), then it downloads the source of the page located at that URL. After that, it determines the position using regular expressions and knowledge of which div classes Google uses for its results.
The script works fine when the URL I want to download is in the "google.com" domain. But since it's intended to check positions for Polish users, I would like it to use "google.pl". I wouldn't mind, but the search results can really differ between the two (by more than 100 positions in some cases). Unfortunately, when I try to use the "pl" domain, cURL just doesn't return anything (it waits for the timeout first). However, when I ran my script on another server, it worked perfectly on both the "google.com" and "google.pl" domains. Do you have any idea why something like this can happen? Is it possible that my server was banned from querying the "google.pl" domain?
Here is my cURL code:
private function cURL($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    return curl_exec($ch);
    curl_close($ch);
}
First of all, I cannot reproduce your problem. I used the following 3 cURL commands to simulate your situation:
curl -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.5 (KHTML, like Gecko) Version/5.1 Safari/534.51.3" http://www.google.com/search?q=the+keyword
curl -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.5 (KHTML, like Gecko) Version/5.1 Safari/534.51.3" http://www.google.pl/search?q=the+keyword
curl -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.5 (KHTML, like Gecko) Version/5.1 Safari/534.51.3" http://www.google.nl/search?q=the+keyword
The first one is .com, because this should work as your reference point. Positive.
The second one is .pl, because this is where you are encountering problems with. This also just works for me.
The third one is .nl, because this is where I live (so basically what's .pl for you). This too just works for me.
I'm not sure, but this could be one possible explanation:
Google.com is international; when I enter something at google.nl, for example, I still end up at google.com/search?q=... (the only difference is an additional language parameter).
Since google.nl/search?q=... redirects to google.com (302), its actual body is empty.
I don't know for sure, but it is possible cURL isn't following the redirect, or that you need to set an additional flag.
If this is true (which I'll check now), you need to use google.com as the domain and add the language parameter, instead of using google.pl.
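If the redirect is indeed the issue, letting cURL follow it should be enough; a minimal sketch (the query URL is just an example):
$ch = curl_init('http://www.google.pl/search?q=the+keyword');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Follow the 302 from google.pl to its final location instead of
// getting back the empty redirect body.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
$html = curl_exec($ch);
curl_close($ch);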
The reason it works on your other server could be that cURL's configuration differs, or that the cURL version isn't the same.
Also, Google blocks cURL's default user-agent string, so I would also suggest changing it to something like:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.5 (KHTML, like Gecko) Version/5.1 Safari/534.51.3
This has nothing to do with the problem you're encountering, but you never actually close your cURL handle, since you return before closing it (everything after the return is skipped).
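For completeness, a corrected sketch of the method with the handle closed before returning:
private function cURL($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    $result = curl_exec($ch);
    curl_close($ch); // now actually reached
    return $result;
}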