I have a PHP script that generates a report with PHPExcel from data queried from a MySQL DB. Currently the processing is linear: it gets the data back from MySQL, reads in the Excel template, writes the data to the template, then outputs it. I have optimized the code to the point that the data is only iterated over once, and there is very little processing done on the PHP side. The query returns hundreds of rows in less than 0.001 seconds, so it is running fast enough. After some timing I have found my bottlenecks to be (surprise, surprise) reading the template and writing the output.
I would like to do this:
Spawn a thread/process to read the template
Spawn a thread/process to fetch the data
Return to the parent thread - the parent will wait until both are complete
Proceed on as normal
My main questions are: is this possible, and is it worth it? If yes to both, how would you tackle it?
Also, it is PHP 5 on CentOS
It is generally not a good idea to fork an Apache process; that can cause unpredictable results. Instead, using some kind of queuing mechanism is preferable. Gearman is an open source queuing mechanism you can use. I also have a blog post on the Zend Server Job Queue that talks about running tasks asynchronously: Do you queue? Introduction to the Zend Server Job Queue.
You could also use something like the Zend Framework queuing classes (Zend_Queue) to implement some of the asynchronous work.
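For illustration, here is a rough sketch of the Gearman route (the job name, payload, and server address are assumptions, not anything from your setup): the web request only queues the work, and a separate long-running worker does the slow PHPExcel part.
// Web request side: hand the slow report generation to Gearman and return.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('build_report', json_encode(array('report_id' => 42)));

// Worker side (a separate CLI script that keeps running):
function build_report_job(GearmanJob $job)
{
    $params = json_decode($job->workload(), true);
    // ... load the PHPExcel template, fill it with the queried data,
    //     and write the finished spreadsheet to disk ...
}

$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('build_report', 'build_report_job');
while ($worker->work());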
@Swisstack, I will also disagree with your assertion that PHP is not created for high performance. Very seldom are language features the cause of slow performance. Perhaps a raw language benchmark comparing $a++ across different languages would show that, but that type of testing is irrelevant. I've done consulting on PHP for several years and I have never seen a performance problem that was due to the language.
I would try to figure out if you can cache or store the template in some faster to read format. I don't know if that's possible, but the PHPExcel forum is pretty good and is watched by the developers.
You can't multithread, but you can fork (pcntl_fork, pcntl_wait). As I'm sure you know, you'll want to test the process spawn times carefully to make sure that this is even worth it for your situation.
$pid = pcntl_fork();
if ($pid == -1) {
    // fork failed
} elseif ($pid > 0) {
    // we're the parent! Wait for the child to finish
    pcntl_waitpid($pid, $status);
} else {
    // we're the child: do the work here, then exit
}
If both reading the template AND the DB query were slow, then I'd say there's a decent chance that worthwhile performance could be gained by running the tasks in parallel. But, as you said yourself, reading the template is slow and the DB query is fast. So even ignoring the additional overhead introduced by running the tasks in parallel, in the best case you stand to save 0.001 seconds (the time needed for the DB query).
Running tasks in parallel still takes at least as long as the slowest task, while running them in series takes the sum of all the tasks. In your case that sum is templateTime + queryTime (0.001).
Not worth it imo.
Usually the database is the turtle in the equation. You can do that part asynchronously without too much effort. See the newly added mysqli_poll() and friends.
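As a rough sketch of how that could look for the report script (connection details and the query are assumptions, and this needs PHP 5.3+ with mysqlnd): fire the query asynchronously, read the template while MySQL works, then poll and reap the result.
$mysqli = new mysqli('localhost', 'user', 'pass', 'reports'); // hypothetical credentials
$mysqli->query('SELECT * FROM report_data', MYSQLI_ASYNC);    // returns immediately

// ... read the Excel template here while the query runs ...

$links = array($mysqli); $errors = array($mysqli); $reject = array($mysqli);
if (mysqli_poll($links, $errors, $reject, 10) > 0) {   // wait up to 10 seconds
    $result = $mysqli->reap_async_query();
    while ($row = $result->fetch_assoc()) {
        // write $row into the template
    }
    $result->free();
}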
You can definitely spawn processes on CentOS with PHP (http://php.net/manual/en/function.pcntl-fork.php). Before doing that, though, I'd consider at least one thing... If the bottleneck appears to be reading the template and writing the output, it might be a purely I/O-bound issue, and therefore dealing with multiple processes might not help much... Personally, I'd try to see if it's possible to do some caching instead...
Read the template once, then do a clone for each workbook that you need to create from the data
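A rough sketch of that idea, assuming a hypothetical $reports array of data sets (the file paths and sheet layout are placeholders):
require_once 'PHPExcel.php';

// Hit the disk only once for the template...
$template = PHPExcel_IOFactory::load('/path/to/template.xlsx');

foreach ($reports as $name => $rows) {
    $workbook = clone $template;                 // deep copy, no second read
    $sheet    = $workbook->getActiveSheet();
    $sheet->fromArray($rows, null, 'A2');        // dump the data below the headings
    $writer = PHPExcel_IOFactory::createWriter($workbook, 'Excel2007');
    $writer->save("/tmp/report-{$name}.xlsx");
}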
I would like to run a PHP script as a cronjob every night. The PHP script will import an XML file with about 145,000 products. Each product contains a link to an image which will be downloaded and saved on the server as well. I can imagine that this may cause some overload. So my question is: would it be a better idea to split up the PHP file? And if so, what would be a better solution? More cronjobs, with several minutes' pause between each other? Running another PHP file using exec (I guess not, since I can't imagine that would make much of a difference), or something else...? Or should I just use one script to import all products at once?
Thanks in advance.
It depends a lot on how you've written it, in terms of whether it leaks open files or database connections. It also depends on which version of PHP you're using. In PHP 5.3 a lot was done to address garbage collection:
http://www.php.net/manual/en/features.gc.performance-considerations.php
If it's not important that the operation is transactional, i.e. all or nothing (for example, if it fails half way through), then I would be tempted to tackle this in chunks, where each run of the script processes the next x items, and x can be varied depending on how long each run takes. So what you'll need to do is keep repeating the script until there is nothing left to do.
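A minimal sketch of that chunked approach (the helper functions, file paths, and exit-code convention below are assumptions for illustration; check what your runner actually expects):
$chunkSize = 500;
$stateFile = '/tmp/import-offset';

$offset = is_file($stateFile) ? (int) file_get_contents($stateFile) : 0;

// loadProductsFromXml() and importProduct() are hypothetical helpers:
// parse the next slice of the XML, insert the row, download the image.
$products = loadProductsFromXml('/path/to/products.xml', $offset, $chunkSize);
foreach ($products as $product) {
    importProduct($product);
}

if (count($products) < $chunkSize) {
    @unlink($stateFile);   // finished the whole file
    exit(0);               // assumed convention: 0 = nothing left to do
}
file_put_contents($stateFile, $offset + $chunkSize);
exit(1);                   // assumed convention: non-zero = run me again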
To do this, I'd recommend using a tool called the Fat Controller:
http://fat-controller.sourceforge.net
It can keep repeating the script and then stop once everything is done. You can tell the Fat Controller that there's more to do, or that everything is done, via exit statuses from the PHP script. There are some use cases on the Fat Controller website, for example: http://fat-controller.sourceforge.net/use-cases.html#generating-newsletters
You can also use the Fat Controller to run processes in parallel to speed things up; just be careful you don't run too many in parallel and slow things down. If you're writing to a database, then ultimately you'll be limited by the hard disk, which, unless you have something fancy, means your optimum concurrency will probably be 1.
The final question is how to trigger this - you're probably best off triggering the Fat Controller from cron.
There's plenty of documentation and examples on the Fat Controller website, but if you need any specific guidance then I'd be happy to help.
To complement the previous answer, the best solution is to optimize your scripts:
Prefer JSON to XML; parsing JSON is vastly faster.
Use one, or only a few, concurrent connections to the database.
Modify multiple rows at a time (insert 10-30 rows in one query, select 100 rows at once, delete in batches; not so many that you overload memory, and not so few that the round trip isn't worthwhile). A minimal sketch follows this list.
Minimize the number of queries (this follows from the previous point).
Skip rows that are already up to date, using dates (timestamp, datetime) to tell.
You can also let the processor breathe with a usleep(30) call.
To run multiple PHP processes, use popen().
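Here is a minimal sketch of the multi-row insert point above, using PDO (table and column names are assumptions):
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

function insertBatch(PDO $pdo, array $rows)
{
    // One multi-row INSERT instead of one query per product.
    $placeholders = implode(', ', array_fill(0, count($rows), '(?, ?)'));
    $stmt = $pdo->prepare("INSERT INTO products (sku, name) VALUES $placeholders");
    $params = array();
    foreach ($rows as $row) {
        $params[] = $row['sku'];
        $params[] = $row['name'];
    }
    $stmt->execute($params);
}

$batch = array();
foreach ($products as $product) {     // $products: parsed from the import file
    $batch[] = $product;
    if (count($batch) >= 25) {        // 10-30 rows per query, as suggested
        insertBatch($pdo, $batch);
        $batch = array();
        usleep(30);                   // give the processor a moment to breathe
    }
}
if ($batch) {
    insertBatch($pdo, $batch);        // flush the remainder
}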
Brief overview of my use case: consider a database (most probably MongoDB) with a million entries. The value of each entry needs to be updated every day by calling an API. How should such a cronjob be designed? I know Facebook does something similar. The only thing I can think of is to have multiple jobs which divide the database entries into batches, with each job updating one batch. I am certain there are smarter solutions out there. I am also not sure what technology to use. Any advice is appreciated.
-Karan
Given the updated question context of "keeping the caches warm", a strategy of touching all of your database documents would likely diminish rather than improve performance unless that data will comfortably fit into available memory.
Caching in MongoDB relies on the operating system behaviour for file system cache, which typically frees cache by following a Least Recently Used (LRU) approach. This means that over time, the working data set in memory should naturally be the "warm" data.
If you force data to be read into memory, you could be loading documents that are rarely (or never) accessed by end users .. potentially at the expense of data that may actually be requested more frequently by the application users.
There is a use case for "prewarming" the cache .. for example when you restart a MongoDB server and want to load data or indexes into memory.
In MongoDB 2.2, you can use the new touch command for this purpose.
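For example, issued from PHP with the legacy Mongo driver it would look roughly like this (the database and collection names are assumptions):
$mongo = new MongoClient('mongodb://localhost:27017');
$db    = $mongo->selectDB('mydb');

// Ask MongoDB 2.2+ to pull the collection's documents and indexes into memory.
$result = $db->command(array(
    'touch' => 'people',
    'data'  => true,
    'index' => true,
));
var_dump($result['ok']);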
Other strategies for prewarming are essentially doing reverse optimization with an explain(). Instead of trying to minimize the number of index entries (nscanned) and documents (nscannedObjects), you would write a query that intentionally will maximize these entries.
With your API response time goal .. even if someone's initial call required their data to be fetched into memory, that should still be a reasonably quick indexed retrieval. A goal of 3 to 4 seconds response seems generous unless your application has a lot of processing overhead: the default "slow" query value in MongoDB is 100ms.
From a technical standpoint, you can write scripts for the MongoDB shell and run them via cron. If you schedule cron to run a command like:
./mongo server:27017/dbname --quiet my_commands.js
MongoDB will execute the contents of the my_commands.js script. Now, for an overly simple example just to illustrate the concept: if you wanted to find a person named sara and insert an attribute (yes, an unrealistic example), you could enter the following in your .js script file.
// look up the document, set the flag, and write it back
person = db.people.findOne( { name : "sara" } );
person.validated = true;
db.people.save( person );
Then every time the cron job runs, that record will be updated. Now add a loop and a call to your API, and you might have a solution. More information on these commands and examples can be found in the MongoDB docs.
However, from a design perspective, are you sure you need to update every single record each night? Is there a way to identify a more reasonable subset of records that need to be processed? Or could the API be called on the data as it's retrieved and served to whoever is going to consume it?
I am writing a Ruby on Rails application and one of the most important features of the website is live voting. We fully expect that we will get 10k voting requests in as little as 1 minute. Along with other requests, that means we could be getting a ton of requests.
My initial idea is to set up the server to use Apache + Phusion; however, for the voting specifically I'm thinking about writing a PHP script on the side and reading/writing the information in memcached. The data only needs to persist for about 15 minutes, so writing to the database 10,000 times in 1 minute seems pointless. We also need to record the IP of each user so they don't vote twice, which makes the memcached approach extra complicated.
If anyone has any suggestions or ideas to make this work as best as possible, please help.
If you're architecting an app for this kind of massive influx, you're going to need to strip down the essential components of it to the absolute minimum.
Using a full Rails stack for that kind of intensity isn't really practical, nor necessary. It would be much better to build a very thin Rack layer that handles the voting by making direct DB calls, skipping even an ORM, basically being a wrapper around an INSERT statement. This is something Sinatra and Sequel, which serves as an efficient query generator, might help with.
You should also be sure to tune your database properly, plus run many load tests against it to be sure it performs as expected, with a healthy margin for higher loading.
Making 10,000 DB calls in a minute isn't a big deal, each call will take only a fraction of a millisecond on a properly tuned stack. Memcached could offer higher performance especially if the results are not intended to be permanent. Memcached has an atomic increment operator which is exactly what you're looking for when simply tabulating votes. Redis is also a very capable temporary store.
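To make the memcached idea from the question concrete, here is a rough PHP sketch (the key names and the 15-minute TTL are assumptions): add() is atomic, so it doubles as the "has this IP already voted?" check, and increment() tallies the votes.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

function castVote(Memcached $mc, $contestId, $entryId, $ip)
{
    $ttl = 15 * 60;   // the data only needs to live ~15 minutes

    // add() fails if the key already exists, so a second vote from the
    // same IP is rejected without any explicit locking.
    if (!$mc->add("voted:$contestId:$ip", 1, $ttl)) {
        return false;
    }

    // increment() is atomic; create the counter lazily if it's missing.
    if ($mc->increment("votes:$contestId:$entryId") === false) {
        $mc->add("votes:$contestId:$entryId", 1, $ttl);
    }
    return true;
}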
Another idea is to scrap the DB altogether and write a persistent server process that speaks a simple JSON-based protocol. Eventmachine is great for throwing these things together if you're committed to Ruby, as is NodeJS if you're willing to build out a specialized tally server in JavaScript.
10,000 operations in a minute is easily achievable even on modest hardware using a specialized server process, without the overhead of a full DB stack.
You will just have to be sure that your scope is very well defined so you can test and heavily abuse your implementation prior to deploying it.
Since what you're describing is, at the very core, something equivalent to a hash lookup, the essential code is simply:
contest = @contests[contest_id]   # @contests: in-memory hash keyed by contest id
unless contest[:voted][ip]
  contest[:voted][ip] = true
  contest[:votes][entry_id] += 1
end
Running this several hundred thousand times in a second is entirely practical, so the only overhead would be wrapping a JSON layer around it.
I have a process that has to be run against certain things and it isn't suitable to be run on the user's end (15+ seconds to process), so I considered using a cron job, but again, this is also unsuitable because it will create a backlog. I have narrowed my options down to either running an endless process that monitors for MySQL changes, or configuring MySQL to trigger the script when it detects a change, but the latter is not something I want to get into unless it's my only option, which leaves me with the "endless" monitoring option.
The sort of thing I'm considering with PHP is:
while (true) {
    $db->query('SELECT * FROM database');
    while ($row = $db->fetch_assoc()) {
        // do the stuff here
    }
    sleep(5);
}
and then running it via the command line. Now, this is theoretically sound, but in practice it isn't doing as well as I had hoped, using more resources than I would expect (not out of my range, just not what I'm aiming for optimally). So my questions are as follows:
Is PHP the wrong language to do this in? PHP is what I work with, but I understand that there are times when it's the wrong choice, and I think maybe this is one of them. If it is, what language should I use?
Is there a better approach that I haven't considered and that isn't any of the ideas I have listed?
If PHP is the correct option, how can I optimise the code I posted? Is there a better method than sleeping for 5 seconds after each completed operation?
Thanks in advance! I'm open to any ideas as long as they're not too far out there; I'm running my own server with free rein, so there's no theoretical limit on what I can do.
I recommend moving the loop out into a shell script and then executing a new PHP process for every iteration. This way PHP will never use unbounded resources (even if there is a memory/connection leak somewhere) since the process is terminated on each iteration. Something like the following should be fine (Bash):
while true; do
php /path/to/your/script.php 2>&1 | logger ...(logger options)
sleep 5
done
I've found this approach to be far more robust for long-running scripts in PHP, probably because this is very like the way PHP operates when run as a CGI script.
You should always work with the language you're most familiar with. If this is PHP, then it's not a wrong choice.
Disconnect from the database before sleeping. This way your script won't keep a connection reserved, and it will work fine even after database restart.
Free the MySQL result after using it. Always check for error conditions in daemonized processes, and deal with them appropriately.
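Putting those points together, a sketch of the loop might look like this (the query and table are assumptions, and the shape of your $db wrapper may differ):
while (true) {
    $db = mysqli_connect('localhost', 'user', 'pass', 'mydb');
    if (!$db) {
        error_log('connect failed: ' . mysqli_connect_error());
        sleep(30);                         // back off and retry instead of dying
        continue;
    }

    $result = mysqli_query($db, 'SELECT * FROM jobs WHERE processed = 0');
    if ($result) {
        while ($row = mysqli_fetch_assoc($result)) {
            // do the stuff here
        }
        mysqli_free_result($result);       // free the result after using it
    } else {
        error_log('query failed: ' . mysqli_error($db));
    }

    mysqli_close($db);                     // no connection held open while idle
    sleep(5);
}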
PHP might be the wrong language as it's really designed for serving requests on an ad-hoc basis, rather than creating long-running daemons. (It was originally created as a preprocessor language, then later on came into general use as a web application language.)
Something like Python might work better for your needs; it's a little more naturally designed for "daemon-like" programs.
That said, it is possible to do what you want in PHP.
What kind of problems are you experiencing?
I don't know about the database class you have there in $db, but it could generate a memory leak.
Furthermore, I would suggest closing all your connections and unsetting all your variables, if necessary, at the end of the loop and reopening them at the beginning.
If it's only a 5-second sleep, maybe do that only on every 10th iteration or so; you can keep a counter for that.
With these points considered, there's nothing wrong with this approach.
For multiple running PHP scripts (10 to 100) to communicate, what is the least memory intensive solution?
Monitor flat files for changes
Keep running queries on a DB to check for new data
Other techniques I have heard of, but never tried:
Shared memory (APC, or core functions)
Message queues (Active MQ and company)
In general, a shared memory based solution is going to be the fastest and have the least overhead in most cases. If you can use it, do it.
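A bare-bones sketch of the "core functions" route using shmop (the key, segment size, and payload are arbitrary choices; real use should add a semaphore via sem_get()/sem_acquire() for locking):
$key = ftok(__FILE__, 'm');                 // derive a System V IPC key
$shm = shmop_open($key, 'c', 0644, 1024);   // create/attach a 1 KB segment

// Writer script:
$payload = str_pad(json_encode(array('done' => 120, 'total' => 500)), 1024, "\0");
shmop_write($shm, $payload, 0);

// Reader script (opens the same key on the same machine):
$raw    = rtrim(shmop_read($shm, 0, 1024), "\0");
$status = json_decode($raw, true);

shmop_close($shm);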
Message Queues I don't know much about, but when the choice is between databases and flat files, I would opt for a database because of concurrency issues.
With a file, you have to lock it to add a line, possibly causing other scripts to fail to write their messages.
In a database based solution, you can work with one record for each message. The record would contain a unique ID, the recipient, and the message. The recipient script can easily poll for new messages, and after reading, quickly and safely remove the record in question.
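A sketch of that record-per-message approach with PDO (the table and column names are assumptions):
// Assumed table:
//   CREATE TABLE messages (
//       id        INT AUTO_INCREMENT PRIMARY KEY,
//       recipient VARCHAR(64) NOT NULL,
//       body      TEXT NOT NULL
//   );
$pdo = new PDO('mysql:host=localhost;dbname=ipc', 'user', 'pass');

// Sender: one row per message.
function sendMessage(PDO $pdo, $recipient, $body)
{
    $stmt = $pdo->prepare('INSERT INTO messages (recipient, body) VALUES (?, ?)');
    $stmt->execute(array($recipient, $body));
}

// Recipient: poll for new messages, then delete each one after reading it.
function readMessages(PDO $pdo, $recipient)
{
    $stmt = $pdo->prepare('SELECT id, body FROM messages WHERE recipient = ?');
    $stmt->execute(array($recipient));

    $messages = array();
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $messages[] = $row['body'];
        $pdo->prepare('DELETE FROM messages WHERE id = ?')->execute(array($row['id']));
    }
    return $messages;
}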
This is hard to answer without knowing:
How much data will they send in each message (2 bytes or 4 megabytes)?
Will they run in the same machine? (This looks like a yes, else you wouldn't be considering shared memory)
What are the performance requirements (one message a minute or a zillion per second)?
What resource is most important to you?
And so on...
Using a DB is probably the easiest to set up in a PHP environment and, depending on how many queries per minute you need and what type they are, it might indeed be the sanest solution. Personally I'd try that first and then see if it's not enough.
But, again, hard to tell for sure without more information on the application.