I want to send ~50 requests to different pages on the same domain, and then use the DOM object to extract the URLs of the articles.
The problem is that this number of requests takes over 30 sec.
for ($i = 1; $i < 51; $i++)
{
    $url = 'http://example.com/page/'.$i.'/';
    $client = new Zend_Http_Client($url);
    $response = $client->request();
    $dom = new Zend_Dom_Query($response); // even without these two lines, execution is too long
    $results = $dom->query('li');
}
Is there any way to speed this up?
It's a general problem by design, not the code itself. If you run a for-loop over 50 items, each opening a request to a remote URI, things get pretty slow, since every request waits for the response from the remote URI. E.g. if a request takes ~0.6 sec to complete, multiply this by 50 and you get an execution time of 30 seconds!
Another problem is that most web servers limit their (open) connections per client to a specific amount. So even if you were able to do 50 requests simultaneously (which you currently are not), things wouldn't speed up measurably.
In my opinion there is only one solution (without any deep-going changes):
Change the number of requests per execution. Make chunks of e.g. only 5-10 per (script) call and trigger them by an external call (e.g. run them via cron; a crontab sketch follows the example code below).
Todo:
Build a wrapper function which is able to save the state of its current run ("I did requests 1-10 on my last run, so now I have to call 11-20") to a file or database, and trigger this function via cron.
Example code (untested) for illustration:
[...]
private static $_chunks = 10; // amount of calls per run

public function cronAction() {
    $lastrun = 0; // placeholder: load the last-run value from a local file or database
    $this->crawl($lastrun);
}

private function crawl($lastrun) {
    $limit = self::$_chunks + $lastrun;
    for ($i = $lastrun; $i < $limit; $i++)
    {
        [...] // do stuff here
    }
    // here, save the new $lastrun value to the local file / database
}
[...]
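For completeness, a hedged example of how cron could trigger that action every few minutes (the URL, interval and log path are assumptions, not part of the original answer):

# assumed crontab entry: hit the controller action every 5 minutes
*/5 * * * * curl -s http://example.com/yourmodule/cron >> /var/log/crawler-cron.log 2>&1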
I can't think of a way to speed it up but you can increase the timeout limit in PHP if that is your concern:
for ($i = 1; $i < 51; $i++) {
    set_time_limit(30); // this restarts the timer to 30 seconds, starting now
    // do long things here
}
I have a web application that allows users to upload DBF files, and the app stores the contents in an SQL database. The row count ranges from a few thousand to about 80,000 rows, and I have the following code:
if ($file) {
    $totalRows = dbase_numrecords($file);
    for ($i = 1; $i <= $totalRows; $i++) {
        $row = dbase_get_record_with_names($file, $i);
        //echo $row["BILL_NO"]." ";
        if (!empty(trim($row["STATUS"]))) { // save to database if the column is not empty
            $data = [
                // array data from the row
            ];
            $db->table("item_menu")->replace($data);
        }
        if ($i % 1000 == 0) // sleep call here every 1000 rows
            sleep(1);
    }
    echo "done";
}
Once done, this function will be called once per day and ideally just called/run in the background. However, when I do not include the sleep call, the server doesn't serve any pages until the loop completes, which can mean from a few seconds to about a minute of unresponsiveness; when the sleep call is added, the server continuously serves pages to different users.
My question is, does the sleep function help free up the current thread and process other requests during the sleep period?
If you use the sleep() function, you will still end up executing everything in a single thread, which pauses the whole process. You should look at PHP 8.1 for that kind of process handling.
I am working in Yii and want to export large data, approximately 200,000 records, at a time. The problem is that when I try to export the data, the server stops working and hangs every process in the system. I have to kill all the services and restart the server again. Can anyone tell me an appropriate way to export the data to a CSV file?
$count = Yii::app()->db->createCommand('SELECT COUNT(*) FROM TEST_DATA')->queryScalar();
$maxRows = 1000;
$maxPages = ceil($count / $maxRows);
for ($i = 0; $i < $maxPages; $i++)
{
    $offset = $i * $maxRows;
    $rows = $connection->createCommand("SELECT * FROM TEST_DATA LIMIT $offset,$maxRows")->query();
    foreach ($rows as $row)
    {
        // Here your code
    }
}
Maybe it is because the code runs without closing the session. When you start the process and do not close the session, then for the duration of the processing you cannot load any page of the site (in the same browser), because the session is locked (it will be busy). This can look like the server is hanging, but the server is running as it should. You can check it by loading the site in a different browser; if it loads, it means the process is running as it should.
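If that is the case, a minimal sketch of the idea (assuming the export runs in a normal web request that has started a session) is to release the session lock before the long loop:

<?php
// Sketch: release the session lock before a long-running export so other
// requests from the same browser are not blocked while it runs.
session_start();
$userId = isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null; // read what you need first (key name assumed)

session_write_close(); // release the lock; note that $_SESSION writes after this are not persisted

// ... the long-running export loop goes here ...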
In my experience, I used a table to save processing data (the last successfully processed offset and last_iterated_time) and to see the current state of the processing. For example, a table "processing_data" with the columns 'id' (int), 'stop_request' (tinyint, for stopping the process; if 1, stop the iteration), 'offset' (int) and 'last_iterated_time' (datetime). Add only one record to this table, and on every iteration check the 'stop_request' value; if it is 1, you can break the iteration. Also on every iteration save the current offset and the current datetime. By doing this you can stop the processing and continue it later.
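A hedged sketch of creating such a table from Yii (the column types are assumptions based on the description above):

// Run once: a single-row bookkeeping table for the long-running export.
Yii::app()->db->createCommand("
    CREATE TABLE processing_data (
        id INT PRIMARY KEY,
        stop_request TINYINT NOT NULL DEFAULT 0,
        `offset` INT NOT NULL DEFAULT 0,
        last_iterated_time DATETIME NULL
    )
")->execute();
Yii::app()->db->createCommand("INSERT INTO processing_data (id) VALUES (1)")->execute();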
And you can use a while loop (to reduce memory usage) to iterate without counting:
set_time_limit(0);
$offset = 0;
$nextRow = $connection->createCommand("SELECT * FROM TEST_DATA LIMIT $offset, 1")->queryRow();
while ($nextRow) {
    // Here your code

    $processingData = ProcessingData::model()->findByPk(1);
    $processingData->offset = $offset;
    $processingData->last_iterated_time = new CDbExpression('NOW()');
    $processingData->save();
    if ($processingData->stop_request == 1) { break; }

    $offset++;
    $nextRow = $connection->createCommand("SELECT * FROM TEST_DATA LIMIT $offset, 1")->queryRow();
}
I have a simple script that counts from 1 to 5000 with a for loop. It flushes output in real time to the browser and shows a progress bar with a percentage.
What I have: if I leave the page, the process is interrupted, and if I come back, it starts from 0.
What I want to achieve: if I leave the page, the process continues, and if I come back, it shows the right percentage.
Example: I run the process, it counts up to 54, I leave the page for 10 seconds, and when I come back it shows me 140 and continues to flush.
Is it possible?
I would suggest you use server workers: scripts which are intended to run independently of the web server context.
The most common way of doing this is with message queues (RabbitMQ, Qless, etc.). The event should be initiated by the script in the web context, but the actual task should be executed by a queue listener in a different context.
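A minimal sketch of that split, assuming RabbitMQ with the php-amqplib library (the queue name, payload and connection details are made up for illustration):

<?php
// publisher.php - runs in the web context: it only enqueues the job and returns immediately
require 'vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();
$channel->queue_declare('count_jobs', false, true, false, false);
$channel->basic_publish(new AMQPMessage(json_encode(array('target' => 5000))), '', 'count_jobs');
$channel->close();
$connection->close();

The listener runs outside the web server (started from the CLI, supervisord, etc.) and writes its progress somewhere the page can read it (database, file, cache):

<?php
// worker.php - consumes jobs and does the actual counting
require 'vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();
$channel->queue_declare('count_jobs', false, true, false, false);

$channel->basic_consume('count_jobs', '', false, true, false, false, function ($msg) {
    $job = json_decode($msg->body, true);
    for ($i = 1; $i <= $job['target']; $i++) {
        // store $i as the current progress, e.g. in a database row the page polls
        sleep(1);
    }
});

while (count($channel->callbacks)) {
    $channel->wait();
}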
What you have asked seems quite simple to do with a session (purely going by the use case given). This does not run any process in the background; it just keeps track of the time and shows the progress. That's why I said "based on what you asked". If you want to keep track of a real background task, then the case would be totally different, and you would have to change the wording of your question as well ;)
Something like this would do.
<?php
session_start();
$s = &$_SESSION;
$sleep = 1; // seconds

// check if we have a value set in the session before; if not, set default = 0
if (!isset($s['last'])) {
    $s['last'] = 0;
}

// check if we have a last time set in the session before; if not, set default = current time
if (!isset($s['time'])) {
    $s['time'] = time();
}

// get the idle time of the user
$idle = time() - $s['time'];

// start the loop and set the starting point
$start = $s['last'] + ($idle / $sleep);
for ($i = $start; $i < 100; $i++) {
    echo $i . '<br />';
    $s['last']++;
    $s['time'] = time();
    flush();
    sleep($sleep);
}
Hope it helps!!
I have a very strange problem. I'm working on a shop based on Zend Framework. I'm creating an integration with an auction service (allegro.pl). I need to download all items via SOAP. After my function finishes its job I get "MySQL server has gone away".
Here is my code:
private function getProducts()
{
    $items = [];
    $filterOptions = /* doesn't matter for this question */;
    $allegroItems = $this->allegro->getCore()->doGetItemsList(0, 1, $filterOptions, 3, null);
    $itemsCount = $allegroItems->itemsCount;
    $perPage = 1000;
    $maxPage = ceil(round($itemsCount, 0) / $perPage);
    for ($i = 0; $i < $maxPage; $i++) {
        $allegroItems = $this->allegro->getCore()->doGetItemsList($perPage * $i, $perPage, $filterOptions, 3, null)->itemsList->item;
        if (!is_array($allegroItems)) {
            $allegroItems = [$allegroItems];
        }
        foreach ($allegroItems as $item) {
            $items[(string)$item->itemId] = $item;
        }
    }
    return $items;
}
There are ~3000 items currently. I get the error when I download over 2500-3000 items (I didn't calculate the exact number). It doesn't matter if I set $perPage to 1, 100 or 1000. It doesn't depend on execution time; I can add sleep(100) and still download 1000 products with no error. Just before the last line of this function I can run any DB query with no problems, but then, when the framework's built-in function tries to update the tasks table, I get the error.
The error seems to depend on nothing... not the execution time (it is ~30 sec and works fine with sleep(100)), not the memory limit (I can unset variables in each loop or set the memory limit to 1 GB; it didn't help), not the execution time of the SOAP functions (downloading 2000 items one by one works fine, although it takes a few minutes). And the strangest part for me: DB queries work inside that function just before the last line, like I said.
I'm not using plain Zend Framework but "Shoper", which is based on Zend.
Any ideas?
The problem was with BLOB data. Increasing max_allowed_packet solved it. However, I have no idea what was in that BLOB or how my function could affect it, because I wrote it as a completely independent function :-)
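For reference, a hedged sketch of that fix (the 64M value is an assumption; size it to the largest packet/BLOB you expect). In the [mysqld] section of my.cnf / my.ini:

[mysqld]
max_allowed_packet = 64M

Alternatively, SET GLOBAL max_allowed_packet = 67108864; changes it only for the running server (new connections, until restart) and requires sufficient privileges.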
I am working on a real estate website and we're about to get an external feed of ~1M listings. Assuming each listing has ~10 photos associated with it, that's about ~10M photos, and we're required to download each of them to our server so as to not "hot link" to them.
I'm at a complete loss as to how to do this efficiently. I played with some numbers and I concluded, based on a 0.5 second per image download rate, this could take upwards of ~58 days to complete (download ~10M images from an external server). Which is obviously unacceptable.
Each photo seems to be roughly ~50KB, but that can vary with some being larger, much larger, and some being smaller.
I've been testing by simply using:
copy('http://www.external-site.com/image1.jpg', '/path/to/folder/image1.jpg');
I've also tried cURL, wget, and others.
I know other sites do it, and at a much larger scale, but I haven't the slightest clue how they manage this sort of thing without it taking months at a time.
Here is pseudo code based on the XML feed we're set to receive. We're parsing the XML using PHP:
<listing>
    <listing_id>12345</listing_id>
    <listing_photos>
        <photo>http://example.com/photo1.jpg</photo>
        <photo>http://example.com/photo2.jpg</photo>
        <photo>http://example.com/photo3.jpg</photo>
        <photo>http://example.com/photo4.jpg</photo>
        <photo>http://example.com/photo5.jpg</photo>
        <photo>http://example.com/photo6.jpg</photo>
        <photo>http://example.com/photo7.jpg</photo>
        <photo>http://example.com/photo8.jpg</photo>
        <photo>http://example.com/photo9.jpg</photo>
        <photo>http://example.com/photo10.jpg</photo>
    </listing_photos>
</listing>
So my script will iterate through each photo for a specific listing and download the photo to our server, and also insert the photo name into our photo database (the insert part is already done without issue).
Any thoughts?
I am surprised the vendor is not allowing you to hot-link. The truth is you will not serve every image every month so why download every image? Allowing you to hot link is a better use of everyone's bandwidth.
I manage a catalog with millions of items where the data is local but the images are mostly hot linked. Sometimes we need to hide the source of the image or the vendor requires us to cache the image. To accomplish both goals we use a proxy. We wrote our own proxy but you might find something open source that would meet your needs.
The way the proxy works is that we encrypt and URL encode the encrypted URL string. So http://yourvendor.com/img1.jpg becomes xtX957z. In our markup the img src tag is something like http://ourproxy.com/getImage.ashx?image=xtX957z.
When our proxy receives an image request, it decrypts the image URL. The proxy first looks on disk for the image. We derive the image name from the URL, so it is looking for something like yourvendorcom.img1.jpg. If the proxy cannot find the image on disk, then it uses the decrypted URL to fetch the image from the vendor. It then writes the image to disk and serves it back to the client. This approach has the advantage of being on demand with no wasted bandwidth. I only get the images I need and I only get them once.
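A rough PHP sketch of that proxy idea (the file name getImage.php, the decryptToken() helper and the cache layout are all hypothetical; the proxy described above is a custom .ashx handler, not this code):

<?php
// getImage.php?image=<encrypted token>
$cacheDir = __DIR__ . '/image-cache'; // assumed cache location, must exist and be writable

// Hypothetical stand-in for the real decryption; here the "token" is just base64 of the URL.
function decryptToken($token) {
    $url = base64_decode($token, true);
    return ($url !== false && strpos($url, 'http') === 0) ? $url : false;
}

$token = isset($_GET['image']) ? $_GET['image'] : '';
$sourceUrl = decryptToken($token);
if ($sourceUrl === false) {
    http_response_code(400);
    exit;
}

// Derive a stable local file name from the URL, e.g. yourvendor.com.img1.jpg
$localFile = $cacheDir . '/' . preg_replace('/[^A-Za-z0-9._-]/', '.',
    parse_url($sourceUrl, PHP_URL_HOST) . '.' . basename($sourceUrl));

if (!is_file($localFile)) {
    // Cache miss: fetch the image from the vendor once and store it on disk.
    $image = @file_get_contents($sourceUrl);
    if ($image === false) {
        http_response_code(502);
        exit;
    }
    file_put_contents($localFile, $image);
}

header('Content-Type: image/jpeg'); // assumes JPEG; detect the real type in practice
readfile($localFile);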
You can save all the links into a database table (it will be your "job queue").
Then you can create a script which, in a loop, gets a job and does it (fetches the image for a single link and marks the job record as done); see the worker sketch at the end of this answer.
You can execute the script multiple times, e.g. using supervisord, so the job queue is processed in parallel. If it's too slow you can just start another worker script (as long as bandwidth does not slow you down).
If any script hangs for some reason, you can easily run it again to fetch only the images that haven't been downloaded yet. By the way, supervisord can be configured to automatically restart each script if it fails.
Another advantage is that at any time you can check the output of those scripts via supervisorctl. To see how many images are still waiting, you can simply query the "job queue" table.
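A minimal sketch of such a worker (the photo_jobs table, its columns and the connection details are assumptions; it also assumes InnoDB so the FOR UPDATE row lock works):

<?php
// worker.php - claim one pending job at a time, download the image, mark the job done.
// Run several copies of this script under supervisord to process the queue in parallel.
$pdo = new PDO('mysql:host=localhost;dbname=listings', 'user', 'pass', array(
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
));

while (true) {
    // Atomically claim one job so parallel workers never grab the same row.
    $pdo->beginTransaction();
    $job = $pdo->query("SELECT id, url, local_path FROM photo_jobs WHERE done = 0 LIMIT 1 FOR UPDATE")
               ->fetch(PDO::FETCH_ASSOC);
    if (!$job) {
        $pdo->commit();
        break; // nothing left to do
    }
    $pdo->prepare("UPDATE photo_jobs SET done = 1 WHERE id = ?")->execute(array($job['id']));
    $pdo->commit();

    // Download the image; on failure, release the job so a later run retries it.
    $data = @file_get_contents($job['url']);
    if ($data === false) {
        $pdo->prepare("UPDATE photo_jobs SET done = 0 WHERE id = ?")->execute(array($job['id']));
        continue;
    }
    file_put_contents($job['local_path'], $data);
}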
Before you do this
Like @BrokenBinar said in the comments: take into account how many requests per second the host can handle. You don't want to flood them with requests without them knowing. Then use something like sleep() to limit your requests to whatever number they can handle; a small throttling sketch follows.
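A minimal throttling sketch (the 10-requests-per-second budget and the $photoUrls array are assumptions):

$maxPerSecond = 10; // assumed budget; ask the host what they actually allow
foreach ($photoUrls as $n => $url) {
    // ... download $url here (copy(), cURL, etc.) ...
    if (($n + 1) % $maxPerSecond === 0) {
        sleep(1); // pause after each batch to stay near the allowed rate
    }
}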
Curl Multi
Anyway, use Curl. Somewhat of a duplicate answer but copied anyway:
$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);

$curl_arr = array();
$master = curl_multi_init();

// Create one easy handle per URL and add it to the multi handle.
for ($i = 0; $i < $node_count; $i++) {
    $url = $nodes[$i];
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

// Run all transfers in parallel until none are still active.
do {
    curl_multi_exec($master, $running);
} while ($running > 0);

// Collect the response bodies.
for ($i = 0; $i < $node_count; $i++) {
    $results[] = curl_multi_getcontent($curl_arr[$i]);
}
print_r($results);
From: PHP Parallel curl requests
Another solution:
Pthread
<?php
class WebRequest extends Stackable {
    public $request_url;
    public $response_body;

    public function __construct($request_url) {
        $this->request_url = $request_url;
    }

    public function run() {
        $this->response_body = file_get_contents($this->request_url);
    }
}

class WebWorker extends Worker {
    public function run() {}
}

$list = array(
    new WebRequest("http://google.com"),
    new WebRequest("http://www.php.net")
);

$max = 8;
$threads = array();
$start = microtime(true);

/* start some workers */
while (@$thread++ < $max) {
    $threads[$thread] = new WebWorker();
    $threads[$thread]->start();
}

/* stack the jobs onto workers */
foreach ($list as $job) {
    $threads[array_rand($threads)]->stack($job);
}

/* wait for completion */
foreach ($threads as $thread) {
    $thread->shutdown();
}

$time = microtime(true) - $start;

/* tell you all about it */
printf("Fetched %d responses in %.3f seconds\n", count($list), $time);

$length = 0;
foreach ($list as $listed) {
    $length += strlen($listed["response_body"]);
}
printf("Total of %d bytes\n", $length);
?>
Source: PHP testing between pthreads and curl
You should really use the search feature, ya know :)