I have a php script making requests to some web site. I run this script from command line so no web server on my side is involved. Just pure PHP and a shell.
The response is split into pages, so I need to make multiple requests to get all the data in one script run. Obviously, the request URLs are identical except for one parameter. Nothing complicated:
$base_url = '...';
$pages = ...; // a number I receive elsewhere
$delay = ...; // a delay to avoid too many requests
$p = 0;
while ($p < $pages) {
    $url = $base_url . "&some_param=$p";
    ... // Here cURL takes its turn because of cookies
    sleep($delay);
    $p++;
}
The pages I get this way look all the same - like the first one that was requested. (So I get just a repetitive list multiplied by the number of pages.)
I decided that it happens because of some caching on the web server's end which persists despite an additional random parameter I pass. Closing and reinitializing the cURL session doesn't help either.
I also noticed that if I quickly change the initial $p value manually (so requests start from a different page) and then launch the script again, the result changes. I do this faster than the $delay interval.
It means that two different requests made from the same script run give the same result, while two different requests made from two different script runs give different results, regardless of the delay between the requests. So it can't just be caching on the responding side.
I tried to work around that and wrapped the actual request in a separate script which I run using exec() from the main script. So there is (or should be, I believe) a separate shell instance for each single page request, and those requests should not share any kind of cache between them.
Despite that, I keep getting the same page again. The code looks something like this:
$pages = ...;
$delay = ...;
$p = 0;
$command_stub = 'php get_single_page.php';
while ($p < $pages) {
    $command = $command_stub . " $p";
    exec($command, $response);
    // $response is the same again for different $p's
    sleep($delay);
    $p++;
}
If I again change the starting page manually in the script, I get the result for that page all over again. Until I change it once more. And so on. Several minutes may pass between two runs of the main script, and it still yields an identical result until I switch the number by hand.
I can't comprehend why this is happening. Can somebody explain it?
The short answer is no. cURL certainly doesn't retain anything between executions unless configured to do so (e.g. by setting a cookie file).
I suspect the server is expecting a session token of some sort (a cookie or another HTTP header is my guess). Without the session token it will just ignore the request for subsequent pages.
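For example, here is a minimal sketch of carrying the site's session cookie across the paged requests with a cookie jar; the jar path is a placeholder, and whether this actually fixes it depends on the site's session mechanism:
<?php
// Reuse one cURL handle and persist cookies to a jar file, so the
// server-side session survives across page requests. $base_url,
// $pages and $delay are as in the question above.
$cookieJar = tempnam(sys_get_temp_dir(), 'jar');
$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);   // write cookies here on curl_close()
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);  // send stored cookies with each request
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
for ($p = 0; $p < $pages; $p++) {
    curl_setopt($ch, CURLOPT_URL, $base_url . "&some_param=$p");
    $html = curl_exec($ch);
    // ... parse $html ...
    sleep($delay);
}
curl_close($ch);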
I'm having a curiosity issue and don't seem to be able to find the correct phrases to express what I mean in a successful Google search query.
Some sites (mostly ones that do price queries) make an ajax query to something (let's assume it's a php script) with user-set criteria, and the data doesn't get displayed all at once when the query is finished; you see some parts of the response displayed earlier (as I assume they become available earlier) and some later.
I'd imagine the ajax request goes to a php script which in turn queries different sources and returns data as soon as possible, meaning quicker query responses get sent first.
Core question:
How would such a mechanism be built so that the php script can return data multiple times and the ajax script doesn't just wait for a single response?
I'm rather sure there's information about this available, but unfortunately I haven't even been able to figure out what terms to search for.
EDIT:
I thought of a good example being cheap flight ticket booking services, which query different sources and seem to output data as soon as it's available, meaning different offers from different airlines appear at different times.
Hope someone can relieve my curiosity.
Best,
Alari
On the client side you need onprogress. See the following example (copied from this answer):
var xhr = new XMLHttpRequest();
xhr.open("GET", "/test/chunked", true);
xhr.onprogress = function () {
    console.log("PROGRESS:", xhr.responseText);
};
xhr.send();
xhr.responseText will keep accumulating the response given by the server, so it always contains everything received so far. To get only the newest chunk, track the previously seen length and take a substring from there.
On the server side, you can flush the output buffer to send the response in chunks, e.g. like:
<?php
header('Content-type: text/html; charset=utf-8');
for ($i = 0; $i < 100; $i++) {
    echo "Current Response is {$i} \r\n";
    ob_flush(); // push PHP's output buffer to the server...
    flush();    // ...then push the server's buffer to the client
    // sleep for 2 seconds to simulate a slow data source
    sleep(2);
}
I have a simple script that redirects to the mobile version of a website if it finds that the user is browsing on a mobile phone. It uses the Tera-WURFL webservice to accomplish that, and it will be placed on different hosting than Tera-WURFL itself. I want to protect it in case of Tera-WURFL hosting downtime. In other words, if my script takes more than a second to run, it should stop executing and just redirect to the regular website. How do I do this effectively (so that the CPU is not overly burdened by the script)?
EDIT: It looks like the TeraWurflRemoteClient class has a timeout property. Read below. Now I need to find out how to include it in my script, so that it redirects to the regular website in case of this timeout.
Here is the script:
// Instantiate a new TeraWurflRemoteClient object
$wurflObj = new TeraWurflRemoteClient('http://my-Tera-WURFL-install.pl/webservicep.php');
// Define which capabilities you want to test for. Full list: http://wurfl.sourceforge.net/help_doc.php#product_info
$capabilities = array("product_info");
// Define the response format (XML or JSON)
$data_format = TeraWurflRemoteClient::$FORMAT_JSON;
// Call the remote service (the first parameter is the User Agent - leave it as null to let TeraWurflRemoteClient find the user agent from the server global variable)
$wurflObj->getCapabilitiesFromAgent(null, $capabilities, $data_format);
// Use the results to serve the appropriate interface
if ($wurflObj->getDeviceCapability("is_tablet") || !$wurflObj->getDeviceCapability("is_wireless_device") || $_GET["ver"] == "desktop") {
    header('Location: http://website.pl/'); //default index file
} else {
    header('Location: http://m.website.pl/'); //where to go
}
?>
And here is the relevant part of TeraWurflRemoteClient.php that is being included. It has an optional timeout argument, as mentioned in the documentation:
// The timeout in seconds to wait for the server to respond before giving up
$timeout = 1;
The TeraWurflRemoteClient class has a timeout property, and it is 1 second by default, as I see in the documentation.
So this script won't be executed for longer than a second.
Try achieving this by setting a very short timeout on the HTTP request to TeraWurfl inside their class, so that if the response doesn't come back in like 2-3 secs, consider the check to be false and show the full website.
The place to look for setting a shorter timeout might vary depending on the transport you use to make your HTTP request. Like in Curl you can set the timeout for the HTTP request.
After this, reset your HTTP request timeout back to what it was so that you don't affect any other code.
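For instance, if the request to the webservice were made with plain cURL, a short timeout plus a fallback redirect might look like the sketch below (the URLs are placeholders, and this bypasses the client class entirely):
<?php
// Cap the WURFL webservice request at a couple of seconds and fall
// back to the regular site if it fails or times out.
$ch = curl_init('http://my-Tera-WURFL-install.pl/webservicep.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1); // give up connecting after 1s
curl_setopt($ch, CURLOPT_TIMEOUT, 2);        // cap the whole request at 2s
$result = curl_exec($ch);
curl_close($ch);
if ($result === false) {
    // service down or too slow: serve the full website
    header('Location: http://website.pl/');
    exit;
}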
Also, I found this while researching; you might want to give it a read, though I would say stay away from forking unless you are very well aware of how things work.
And just now Adelf posted that the TeraWurflRemoteClient class has a timeout of 1 sec by default, so that solves your problem, but I will post my answer anyway.
I have to scrape a web site where I need to fetch multiple URLs and then process them one by one. The current process goes somewhat like this.
I fetch a base URL and get all secondary URLs from this page; then for each secondary URL I fetch it, process the found page, download some photos (which takes quite a long time) and store this data to the database, then fetch the next URL and repeat the process.
In this process, I think I am wasting some time fetching the secondary URL at the start of each iteration. So I am trying to fetch the next URLs in parallel while processing the first iteration.
The solution in my mind is to call, from the main process, a PHP script, say downloader, which will download all the URLs (with curl_multi or wget) and store them in some database.
My questions are:
How do I call such a downloader asynchronously? I don't want my main script to wait till the downloader completes.
Where can I store the downloaded data, such as in shared memory? Other than the database, of course.
Are there any chances that data gets corrupted while storing and retrieving, and how do I avoid this?
Also, please let me know if anyone has a better plan.
When I hear that someone uses curl_multi_exec, it usually turns out they just load it with, say, 100 URLs, then wait until all complete, then process them all, and then start over with the next 100 URLs... Blame me, I was doing so too, but then I found out that it is possible to remove/add handles to curl_multi while something is still in progress, and it really saves a lot of time, especially if you reuse already open connections. I wrote a small library to handle a queue of requests with callbacks; I'm not posting the full version here, of course ("small" is still quite a bit of code), but here's a simplified version of the main thing to give you the general idea:
public function launch() {
    $channels = $freeChannels = array_fill(0, $this->maxConnections, NULL);
    $activeJobs = array();
    $running = 0;
    do {
        // pick jobs for free channels:
        while ( !(empty($freeChannels) || empty($this->jobQueue)) ) {
            // take a free channel, (re)init the curl handle and let
            // the queued object set its options
            $chId = key($freeChannels);
            if (empty($channels[$chId])) {
                $channels[$chId] = curl_init();
            }
            $job = array_pop($this->jobQueue);
            $job->init($channels[$chId]);
            curl_multi_add_handle($this->master, $channels[$chId]);
            $activeJobs[$chId] = $job;
            unset($freeChannels[$chId]);
        }
        $pending = count($activeJobs);
        // launch them:
        if ($pending > 0) {
            // poke it while it wants
            while (($mrc = curl_multi_exec($this->master, $running)) == CURLM_CALL_MULTI_PERFORM);
            // wait for some activity, don't eat CPU
            curl_multi_select($this->master);
            while ($running < $pending && ($info = curl_multi_info_read($this->master))) {
                // some connection(s) finished; locate the job and run its response handler:
                $pending--;
                $chId = array_search($info['handle'], $channels);
                $content = curl_multi_getcontent($channels[$chId]);
                curl_multi_remove_handle($this->master, $channels[$chId]);
                // free up this channel
                $freeChannels[$chId] = NULL;
                if ( !array_key_exists($chId, $activeJobs) ) {
                    // impossible, but...
                    continue;
                }
                $activeJobs[$chId]->onComplete($content);
                unset($activeJobs[$chId]);
            }
        }
    } while ( ($running > 0 && $mrc == CURLM_OK) || !empty($this->jobQueue) );
}
In my version the $jobs are actually instances of a separate class, not of controllers or models. They just handle setting cURL options, parsing the response, and calling a given onComplete callback.
With this structure new requests will start as soon as something out of the pool finishes.
Of course it doesn't really save you if it's not just the retrieving that takes time but the processing as well... And it isn't true parallel handling. But I still hope it helps. :)
P.S. This did the trick for me. :) A once 8-hour job now completes in 3-4 minutes using a pool of 50 connections. Can't describe the feeling. :) I didn't really expect it to work as planned, because with PHP things rarely work exactly as supposed... That was like "ok, hope it finishes in at least an hour... Wha... Wait... Already?! 8-O"
You can use curl_multi: http://www.somacon.com/p537.php
You may also want to consider doing this client side and using Javascript.
Another solution is to write a hunter/gatherer that you submit an array of URLs to, then it does the parallel work and returns a JSON array after it's completed.
Put another way: if you had 100 URLs you could POST that array (probably as JSON as well) to mysite.tld/huntergatherer - it does whatever it wants in whatever language you want and just returns JSON.
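A minimal sketch of such an endpoint in PHP (the URL list arrives as a JSON POST body; the endpoint name and field layout are placeholders):
<?php
// huntergatherer sketch: read a JSON array of URLs, fetch them all in
// parallel with curl_multi, and return url => body as JSON.
$urls = json_decode(file_get_contents('php://input'), true);

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

$out = array();
foreach ($handles as $url => $ch) {
    $out[$url] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

header('Content-Type: application/json');
echo json_encode($out);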
Aside from the curl multi solution, another option is just having a batch of gearman workers. If you go this route, I've found supervisord a nice way to start a load of daemon workers.
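A rough sketch of the gearman route, assuming the pecl/gearman extension and a job server on localhost; 'fetch_url' is a hypothetical worker function name, and client and worker would normally live in separate scripts:
<?php
// client side: queue each URL as a fire-and-forget background job
$client = new GearmanClient();
$client->addServer(); // defaults to 127.0.0.1:4730
foreach ($urls as $url) {
    $client->doBackground('fetch_url', $url);
}

// worker side (start several of these under supervisord):
$worker = new GearmanWorker();
$worker->addServer();
$worker->addFunction('fetch_url', function (GearmanJob $job) {
    $body = file_get_contents($job->workload()); // fetch the URL
    // ... store $body somewhere ...
});
while ($worker->work());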
Things you should look at in addition to CURL multi:
Non-blocking streams (example: PHP-MIO)
ZeroMQ for spawning off many workers that do requests asynchronously
While node.js, ruby EventMachine or similar tools are quite great for doing this stuff, the things I mentioned make it fairly easy in PHP too.
Try executing python-pycurl scripts from PHP. It's easier and faster than PHP's curl.
Ok here is my problem.
I have a file which outputs an XML based on an input X
I have another file which calls the above (1) file 10000 (I mean many) times with different numbers for X.
When a user clicks "Go", it should go through all those 10000 Xs and simultaneously show him the progress of how many are done (hmm, maybe updated once every 10 sec).
How do I do it? I need ideas. I know how to do AJAX and stuff, but what's the structure my program should take?
EDIT
So according to the answer given below, I stored my output in a session variable. It then outputs the answer. What is happening is:
When I execute a long script, it gets executed within, say, 1 min. But in the meantime, if I open (in a new window) just the file which outputs my SESSION variable, it doesn't output anything until the first script has finished. Which is the complete opposite of what I want. What's the problem here? Is it my system/server which doesn't handle multiple requests, or what?
EDIT 2
I use the files approach:
To read what I want:
<?php
include_once '../includeTop.php';
echo util::readFromLog("../../Files/progressData.tmp");
?>
and in another script:
$processed++;
util::writeToLog($dir.'/progressData.tmp', "Files processed: $processed");
where the functions are:
public static function writeToLog($file, $data) {
    $f = fopen($file, "w");
    fwrite($f, $data);
    fclose($f);
}
public static function readFromLog($file) {
    return file_get_contents($file);
}
But the same problem still persists :(. I can manually see the file getting updated, like 1, 2, 3, etc. But when I read it from PHP, the script just waits until my original script has finished its output.
EDIT 3
Ok, I finally found the solution. Instead of seeking the output from the php file, I now go directly to the log and read it there.
Put the progress (i.e. how far you are into the 2nd file) into memcached directly from the background job, then deliver that value when requested by the javascript application (triggered by a timer, as long as you have not reached 100%). The only thing you need to figure out is how to pass some sort of "transaction ID" to both the background job and the javascript side, so they access the same key in memcached.
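A minimal sketch of that, assuming the pecl/memcached extension; $txid is the shared "transaction ID" mentioned above, and the key name is arbitrary:
<?php
// shared setup in both scripts
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

// in the background job, inside its loop:
$mc->set("progress_$txid", $done, 3600); // expire after an hour

// in the endpoint polled by the javascript timer:
echo (int)$mc->get('progress_' . $_GET['txid']);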
Edit: I was wrong about $_SESSION. It doesn't update asynchronously, i.e. the values you store in it are not accessible until the script has finished. Whoops.
So the progress needs to be stored in something that does update asynchronously: memory (as pyroscope suggests, which is still the best solution), a file, or the database.
In other words, instead of using $_SESSION to store the value, it should be stored by memcached, in a file or in the database.
I.e. using the database
$progress = 0;
mysql_query("INSERT INTO `progress` (`id`, `progress`) VALUES ($uid, $progress)");
# loop starts
# processing...
$progress += $some_increment;
mysql_query("UPDATE `progress` SET `progress`=$progress WHERE `id`=$uid");
# loop ends
Or using a file
$progress = 0;
file_put_contents("/path/to/progress_files/$uid", $progress);
# loop starts
# processing...
$progress += $some_increment;
file_put_contents("/path/to/progress_files/$uid", $progress);
# loop ends
And then read the file/select from the database, when requesting progress via ajax. But it's not a pretty solution compared to memcached.
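For the file variant, the polled endpoint can be as small as this sketch (the path and parameter name are placeholders):
<?php
// progress.php: return whatever the worker last wrote for this $uid
$uid = basename($_GET['uid']); // basename() keeps path traversal out
echo file_get_contents("/path/to/progress_files/$uid");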
Also, remember to remove the file/database row once it's all done.
You could put the progress in a $_SESSION variable (you'll need a unique name for it), and update it while the process runs. Meanwhile your ajax request simply gets that variable at a specific interval.
function heavy_process($input, $uid) {
    $_SESSION[$uid] = 0;
    # loop begins
    # processing...
    $_SESSION[$uid] += $some_increment;
    # loop ends
}
Then have a URL that simply spits out the $_SESSION[$uid] value when it's requested via ajax. Then use the returned value to update the progress bar. Use something like sha1(microtime()) to create the $uid.
Edit: pyroscope's solution is technically better, but if you don't have a server with memcached or the ability to run background processes, you can use $_SESSION instead
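If you do use $_SESSION, one possible workaround, assuming the staleness comes from PHP holding the session lock and only writing session data when the script ends, is to flush and release the session after every update; a sketch with hypothetical names:
<?php
// Persist the session immediately after each update so a concurrent
// ajax request can read it instead of blocking on the session lock.
function update_progress($uid, $value) {
    session_start();          // reacquire the session (and its lock)
    $_SESSION[$uid] = $value; // store the new progress
    session_write_close();    // write it out now and release the lock
}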
I have a php script that accepts a POST request as a listener to a web service, then processes all the data into two final arrays.
I'm looking for a way to initiate a second script that GETs those serialized arrays and does some more processing.
include() will not be good for me since I actually want to "free" or "end" the first script after passing the data.
Your help is much appreciated, as always :)
EDIT - OK, so it looks like a queue might be the solution! I never did anything like this before; are there any examples or references?
Does it need to happen immediately? Otherwise you could set up a cronjob that does that every X minutes. You'll have to make some kind of queue in which your first script sticks "requests" to the second script. The cronjob then processes the requests in the queue.
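A minimal sketch of such a queue using a single spool file (all paths and names are hypothetical; payloads are base64-encoded so each job stays on one line):
<?php
// in the listener: append one job per line
$payload = base64_encode(serialize(array($array_1, $array_2)));
file_put_contents('/var/spool/myapp/queue.txt', $payload . PHP_EOL,
    FILE_APPEND | LOCK_EX);

// in the cronjob: drain the spool and process each job
$lines = @file('/var/spool/myapp/queue.txt',
    FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
file_put_contents('/var/spool/myapp/queue.txt', '', LOCK_EX); // truncate (naive: can race with the listener)
foreach ((array)$lines as $line) {
    list($array_1, $array_2) = unserialize(base64_decode($line));
    // ... process the arrays ...
}
A real setup would use proper locking or a database table instead of the naive truncate, but the shape is the same.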
You should get into the habit of writing php scripts that are just a collection of functions (no auto-run scripts, per se). This way you can include a script file at the top of the script you're talking about and then call the function that does what you want.
For instance:
<?php
include('common_functions.php');
$array_1 = whatever_you_do_with_post_values();
$array_2 = other_thing_you_do_with_post_values();
// this function is located in 'common_functions.php'
do_stuff_with_arrays($array_1,$array_2);
?>
In Fact:
Just to be consistent with what I'm saying:
<?php
include('common_functions.php');
do_your_stuff();
function do_your_stuff() {
    $array_1 = whatever_you_do_with_post_values();
    $array_2 = other_thing_you_do_with_post_values();
    // this function is located in 'common_functions.php'
    do_stuff_with_arrays($array_1, $array_2);
}
?>
Obviously you should use better function & variable names, haha.
I'd do it all in one request. It cuts down on latency and makes the whole operation more efficient.
Remember you can have a long-running request but still service other requests. Apache will just spawn another php process to handle the other request from the webservice even though the first has not completed. As long as the script doesn't lock a shared resource (database, file, etc.) this will work just fine.
That said, you should use cURL to call the second script, then POST the serialized array. cURL will handle the rest.
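A sketch of that hand-off (the second script's URL and the field name are placeholders):
<?php
// in the listener, after building the two arrays: POST them to the
// second script, capped by a short timeout so the listener returns fast.
$ch = curl_init('http://localhost/second_script.php');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, array(
    'payload' => serialize(array($array_1, $array_2)),
));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 1); // don't hold the listener up for long
curl_exec($ch);
curl_close($ch);
Note the trade-off: a 1-second timeout may cut the connection before the second script finishes, so that script should call ignore_user_abort(true) if it needs to keep running after the listener disconnects.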