Running 600+ threads with PHP pthreads - what about the overhead

I have a server with 2 physical CPUs, which together have 24 cores, and 10 GB of RAM.
The PHP program calculates a statistic, and I can run each section totally independently of the others. Once all calculations are finished, I only have to "merge" them.
Therefore I had the idea to perform each calculation phase in a separate thread created/controlled by "pthreads".
Each calculation takes around 0.10 seconds, but the sheer number of calculations makes the run take that long when they are executed serially.
My questions:
Is there a limit on how many threads can be created with "pthreads"?
What is the overhead of creating a new thread? I have to take this into account so that threading does not introduce a new delay.
I can imagine that for several seconds the load would be very high, but it drops again as soon as every calculation has finished. That is not a problem: it is "my" server and I do not have to worry about other users, as I would on a shared server.

While "waiting" for an answer :-) I started to rewrite the class.
I can summarize it like this:
There is no way to start 600 threads at once. I expected that, but I wanted to know where the limit is. My configuration "allowed" around 160 threads to be started.
When I started more than that, the PHP script stopped working without any further notice.
As Franz Gleichmann pointed out, the whole process took longer when a lot of threads were started. I found that starting 20 threads gave the best performance.
The achieved performance gain is between 20% and 50%, so I am satisfied.
I don't know whether it is a bug in the pthreads library, but I could not access any class members and had to move them into the function. Since the calculation lives in a single function, this did not bother me and I did not investigate further.
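A rough sketch of what the "pool of about 20 threads" approach can look like, assuming the pthreads v3 extension on a thread-safe (ZTS) CLI build of PHP 7; CalcTask and calculate_section() are illustrative names, not part of the original code. The point is that only the pool size, not the number of tasks, determines how many threads exist at once.

class CalcTask extends Threaded
{
    public $sectionId;
    public $result;

    public function __construct($sectionId)
    {
        $this->sectionId = $sectionId;
    }

    public function run()
    {
        // each section of the statistic is calculated independently (~0.1 s each)
        $this->result = calculate_section($this->sectionId); // hypothetical helper
    }
}

$pool = new Pool(20, Worker::class); // ~20 workers performed best here

$tasks = array();
for ($i = 0; $i < 600; $i++) {
    $tasks[] = $task = new CalcTask($i);
    $pool->submit($task); // tasks queue up; only 20 run at any one time
}

$pool->shutdown(); // blocks until every submitted task has finished

// the "merge" step: combine the per-section results
$results = array();
foreach ($tasks as $task) {
    $results[] = $task->result;
}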


Running multiple processes in parallel in php

Context:
I am building a robot to read the news block on the first page of Google
results. I need the results for 200 search queries (so in total I need to
read 200 pages).
To avoid being blocked by Google, I must wait some time before doing the
next search from the same IP. If I wait 30 seconds between searches,
reading the 200 pages will take (200 * 30 / 60) = 1h40m.
But the news in Google results changes very fast, so I need those 200
pages to be accessed almost simultaneously; reading all 200 pages
should take only a few minutes.
If the work is divided between 20 proxies (IPs), it will take
(200 / 20 * 30 / 60) = 5 minutes (20 proxies running simultaneously).
I was planning to use pthreads through the CLI.
Question / Doubt:
Is it possible to run 20 threads simultaneously? Is it advisable to run only a few threads?
What if I want to run 100 threads (using 100 proxies)?
What other options do I have?
Edit:
I found another option: using PHP's curl_multi, or one of the many libraries written on top of curl_multi for this purpose. But I think I'll stick with pthreads.
Is it possible to run 20 threads simultaneously?
Some hardware has more than 20 cores; in those cases, it is a no-brainer.
Where your hardware has fewer than 20 cores, it is still not a ridiculous number of threads, given that the nature of the threads means they will spend some time blocked waiting for I/O, and a whole lot more time purposefully sleeping so that you don't anger Google.
Ordinarily, when the threading model in use is 1:1, as it is in PHP, the sensible general rule is to schedule about as many threads as you have cores.
Obviously, the software that started before you (your entire operating system) has likely already scheduled many more threads than you have cores.
Even in the best case you still cannot execute more threads concurrently than you have cores available, which is the reason for the general rule. However, many of the operating system's threads don't actually need to run concurrently, so the authors of those services don't go by the same rule.
Similarly to the threads started by the operating system, you intend to keep your threads from executing concurrently on purpose (by making them sleep), so you can bend the rule too.
TL;DR yes, I think that's okay.
What if I want to run 100 threads ?
Ordinarily, this might be a bit silly.
But since you plan to force threads to sleep for a long time in between requests, it might be okay here.
You shouldn't normally expect more threads to equate to more throughput. In this case, however, it means you can use more outgoing addresses more easily and spend less time sleeping overall.
Your operating system has hard limits on the number of threads it will allow you to create; at 100 threads you might well be approaching those limits on normal hardware.
TL;DR in this case, I think that's okay.
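To make the numbers concrete, here is a minimal sketch of the one-thread-per-proxy layout under discussion, assuming the pthreads extension on a CLI build of PHP; fetch_via_proxy(), $proxies and $queries are placeholders for your own fetching code and lists, not real APIs. With 20 threads each sleeping 30 seconds between their 10 requests, the wall-clock time works out to roughly the 5 minutes computed in the question.

class ProxySearcher extends Thread
{
    private $proxy;
    private $urls;
    public  $results;

    public function __construct($proxy, array $urls)
    {
        $this->proxy   = $proxy;
        $this->urls    = $urls;          // pthreads stores this in a thread-safe container
        $this->results = new Threaded();
    }

    public function run()
    {
        foreach ($this->urls as $url) {
            $this->results[] = fetch_via_proxy($url, $this->proxy); // placeholder fetch
            sleep(30); // stay polite towards Google from this one address
        }
    }
}

// split 200 searches over 20 proxies, 10 each
$threads = array();
foreach ($proxies as $i => $proxy) {
    $threads[] = $t = new ProxySearcher($proxy, array_slice($queries, $i * 10, 10));
    $t->start();
}
foreach ($threads as $t) {
    $t->join(); // the threads spend almost all of their time sleeping, not computing
}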
What other options do I have?
If it weren't for the constraints of your operation (that you need to sleep between requests, and to route requests through multiple addresses using either specific interfaces or proxies), you could use non-blocking I/O quite easily.
Even given the parameters, you could still use non-blocking I/O, but it would make programming the task much more complex than it needs to be.
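For comparison, this is roughly what the non-blocking route (the curl_multi option from the question's edit) looks like in its simplest form; it fires a batch of requests concurrently, but the per-proxy pacing and sleeping still has to be layered on top, which is where the extra complexity comes from. $urls and $proxies are assumed to be your own lists.

$mh      = curl_multi_init();
$handles = array();
$pages   = array();

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_PROXY, $proxies[$i % count($proxies)]); // spread requests over the proxies
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

do {
    curl_multi_exec($mh, $running);   // drive all transfers
    curl_multi_select($mh, 1.0);      // wait for activity instead of busy-looping
} while ($running > 0);

foreach ($handles as $i => $ch) {
    $pages[$i] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);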
In my (possibly biased) opinion, you are better off using threads: the solution will be simpler, with less margin for error, and easier to understand when you come back to it in 6 months (when it breaks because Google changed their markup or whatever).
Alternative to using proxies
Using proxies may prove unreliable and slow. If this is to be core functionality for some application, then consider obtaining enough IP addresses that you can route these requests yourself using specific interfaces. cURL, context options, and sockets will all allow you to set the outbound address, and this is likely to be much more reliable and faster.
While speed is not necessarily a concern, reliability should be. It is reasonable for a machine to be bound to 20 addresses; it is less reasonable for it to be bound to 100, but if needs must.
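As a sketch of what that looks like (assuming the extra IP addresses are already configured on the machine's interfaces; the address below is just an example from the documentation range):

// with cURL: bind the request to a specific local address (or interface name)
$ch = curl_init('https://www.google.com/search?q=example');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_INTERFACE, '203.0.113.7');
$page = curl_exec($ch);
curl_close($ch);

// with streams: the "bindto" socket context option does the same job
$ctx  = stream_context_create(array('socket' => array('bindto' => '203.0.113.7:0')));
$page = file_get_contents('https://www.google.com/search?q=example', false, $ctx);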
Why don't you just make a single loop that walks through the proxies?
That way it's just one process at a time, you can filter out dead proxies, and you still get the desired update frequency.
You could do something like this:
$proxies = array('127.0.0.1', '192.168.1.1'); // define proxies
$dead    = array();                           // proxies that went dead (slow, not responding - up to you)
$works   = array('http://google.com/page1', 'http://google.com/page2'); // define what you want to do

$run      = true;
$last     = 0;
$looptime = (5 * 60); // 5 minute update interval
$workid   = 0;
$proxyid  = 0;

while ($run) {
    if ($workid < sizeof($works)) {
        // have something to do ...
        $work = $works[$workid];
        $workid++;
        $success = 0;

        // try proxies until one succeeds or we run out
        while (($success == 0) and ($proxyid < sizeof($proxies))) {
            if (!in_array($proxyid, $dead)) {
                $proxy   = $proxies[$proxyid];
                $success = launch_the_proxy($work, $proxy); // your fetch function
                if ($success == 0) {
                    $dead[] = $proxyid; // mark this proxy as dead
                }
            }
            $proxyid++;
        }
    } else {
        // restart the work sequence once there's no more work to do and the loop time is reached
        if (($last + $looptime) < time()) {
            $last = time();
            $workid = 0;
            $proxyid = 0;
        }
    }
    sleep(1);
}
Please note this is a simple example; you have to work out the details. You must also keep in mind that it requires at least as many proxies as pieces of work. (You can tweak this later as you wish, but that needs a more complex way to determine when a proxy can be used again.)

Optimizing mysql / PHP based website | 300 qps

Hey,
I currently have over 300 qps on my MySQL server. There are roughly 12,000 UIP a day and no cron jobs, on fairly heavy PHP websites. I know it's pretty hard to judge whether that is okay without seeing the website, but do you think that it is total overkill?
What is your experience? If I optimize the scripts, do you think I would be able to get substantially lower qps? I mean, if I only get down to 200 qps that won't help me much. Thanks.
currently have over 300+ qps on my mysql
Your website can run on a Via C3; good for you!
do you think that it is a total overkill?
That depends on whether it's:
1 page/s doing 300 queries: then yes, you have a problem.
30-60 pages/s doing 5-10 queries each: then you have no problem.
12000 UIP a day
We had a site with 50-60,000 UIP a day, and it ran on a Via C3 (your toaster is a datacenter compared to that crap server), but the torrent tracker used about 50% of the CPU, so only half of that tiny CPU was available to the website, which never seemed to use any significant fraction of it anyway.
What is your experience?
If you want to know whether you are going to kill your server, or whether your website is optimized, the following have close to zero information content:
UIP (unless you get Facebook-like numbers)
queries/s (unless you're above 10,000; I've seen a cheap dual core blast out 20,000 qps using Postgres)
But the following are extremely important (a small measurement sketch follows the list):
dynamic pages/second served
number of queries per page
time duration of each query (ALL OF THEM)
server architecture
vmstat, iostat outputs
database logs
webserver logs
database's own slow_query, lock, and IO logs and statistics
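If the application has a single database access point, a thin timing wrapper is one way to start collecting the queries-per-page and time-per-query numbers; the class and helper names here are invented for illustration, $timedDb is assumed to be the wrapped connection your pages use, and the database's own slow query log gives you the server-side view.

class TimedDb
{
    private $db;
    public  $log = array();

    public function __construct(mysqli $db)
    {
        $this->db = $db;
    }

    public function query($sql)
    {
        $t0     = microtime(true);
        $result = $this->db->query($sql);
        $this->log[] = array($sql, microtime(true) - $t0); // statement + duration in seconds
        return $result;
    }
}

// at the end of each request, log the per-page picture
register_shutdown_function(function () use ($timedDb) {
    $total = 0.0;
    foreach ($timedDb->log as $entry) {
        $total += $entry[1];
    }
    error_log(sprintf('%d queries, %.4f s of query time', count($timedDb->log), $total));
});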
You're not focusing on the right metric...
I think you are missing the point here. Whether 300+ qps is too much depends heavily on the website itself, on the number of users per second visiting it, on the background scripts running concurrently, and so on. You should test and/or compute an average query throughput for your server to understand whether 300+ qps is reasonable or not. And, by the way, it also depends on what these queries are asking for (a couple of fields, or large amounts of binary data?).
Certainly, if you optimize the scripts and/or reduce the number of queries, you can lower the load on the database, but without specific data we cannot properly answer your question. To lower a 300+ qps load to under 200 qps, you would have to reduce your total queries by at least a third on average.
Optimizing a script can do wonders. I've taken scripts from 3 minutes down to 0.5 seconds simply by optimizing how the calls were made to the server. That is an extreme case, of course. I would focus mainly on minimizing the number of queries by combining them where possible, and maybe getting creative with your queries so that each hit returns more information (see the sketch below).
And going from 300 to 200 qps is actually a huge improvement. That's a 33% drop in traffic to your server... that's significant.
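As an illustration of "combining them" (table and column names are invented, and $db stands in for a mysqli connection), the classic way a page racks up queries is a per-row lookup in a loop, which a single IN() query replaces:

// N+1 pattern: one query per ID
$rows = array();
foreach ($userIds as $id) {
    $rows[$id] = $db->query('SELECT id, name FROM users WHERE id = ' . (int) $id)->fetch_assoc();
}

// combined: one query for the whole set
$rows   = array();
$idList = implode(',', array_map('intval', $userIds));
$result = $db->query("SELECT id, name FROM users WHERE id IN ($idList)");
while ($row = $result->fetch_assoc()) {
    $rows[$row['id']] = $row;
}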
You should not focus on the script, focus on the server.
You don't say whether these 300+ queries are causing issues. If your server is not dying, there is no reason to lower the number. And if you have already done the optimization work, focus on the server: upgrade it or buy more servers.

Why is it so bad to run a PHP script continuously?

I have a map. On this map I want to show live data collected from several tables, some of which have astounding numbers of rows. Needless to say, fetching this information takes a long time. Also, pinging is involved. Depending on servers being offline or far away, collecting this data can take anywhere from 1 to 10 minutes.
I want the map to be snappy and responsive, so I've decided to add a new table to my database containing only the data the map needs. That means I need a background process to update the information in my new table continuously. Cron jobs are of course a possibility, but I want the refreshing of data to start as soon as the previous run has completed. And what if the number of offline IP addresses suddenly spikes and the loop takes longer to run than the interval of the cron job?
My own solution is to create an infinite loop in PHP that runs by the command line. This loop would refresh the data for the map into MySQL as well as record other useful data such as loop time and failed attempts at pings etc, then restart after a short pause (a few seconds).
However - I'm being repeatedly told by people that a PHP script running for ever is BAD. After a while it will hog gigabytes of RAM (and other terrible things)
Partly I'm writing this question to confirm whether this is in fact the case, but some tips and tricks on how I would go about writing a clean loop that doesn't leak memory (if that is possible) wouldn't go amiss. Opinions on the matter would also be appreciated.
I will mark as correct the reply that I feel sheds the most light on the issue.
The loop should live in one script, which activates/calls the actual worker script as a separate process, much like cron does.
That way, even if memory leaks and uncollected memory accumulates, it will (or at least should) be freed after each cycle, when the worker process exits.
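A minimal sketch of that pattern (the script names are placeholders): the supervisor loops forever, but the real work runs in a fresh PHP process each cycle, so whatever memory that process used is handed back to the operating system when it exits.

// supervisor.php - run this from the command line
while (true) {
    $start = microtime(true);

    // blocks until refresh_map_data.php has finished and exited
    passthru('php ' . escapeshellarg(__DIR__ . '/refresh_map_data.php'), $exitCode);

    if ($exitCode !== 0) {
        error_log('refresh failed with exit code ' . $exitCode);
    }
    error_log(sprintf('cycle took %.1f s', microtime(true) - $start));

    sleep(5); // short pause before the next refresh
}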
However - I'm being repeatedly told by people that a PHP script running for ever is BAD. After a while it will hog gigabytes of RAM (and other terrible things)
This used to be very true. Previous versions of PHP had horrible garbage collection, so long-running scripts could easily accidentally consume far more memory than they were actually using. PHP 5.3 introduced a new garbage collector that can understand and clean up circular references, the number one cause of "memory leaks." It's enabled by default. Check out that link for more info and pretty graphs.
As long as your code takes steps to allow variables to go out of scope at proper times and otherwise unset variables that will no longer be used, your script should not consume unnecessary amounts of memory just because it's PHP.
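In a long-running loop, that housekeeping boils down to something like this sketch (the helper functions are placeholders for the map-refresh work):

while (true) {
    $servers = load_server_list();   // placeholder helpers for the actual work
    $status  = ping_all($servers);
    store_results($status);

    unset($servers, $status);        // let the large structures go out of scope
    gc_collect_cycles();             // force collection of circular references (PHP 5.3+)

    sleep(5);
}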
I don't think it's bad; as with anything that you want to run continuously, you just have to be more careful.
There are libraries out there to help you with the task. Have a look at System_Daemon, which released RC 1 just over a month ago and allows you to "set options like max RAM usage".
Rather than running an infinite loop, I'd be tempted to go with the cron option you mention, in conjunction with a database table entry or flat file that stores a "currently active" status bit, to ensure that you don't have overlapping processes attempting to run at the same time (see the sketch after this list).
Whilst I realise that this would mean a minor delay before you perform the next iteration, this is probably a better idea anyway as:
It'll let the RDBMS perform any pending low-priority updates, etc. that may well have been on hold due to the amount of activity you've been carrying out.
Even if you neatly unset all the temporary variables you've been using, it's still possible that PHP will "leak" memory, although recent improvements (5.2 introduced a new memory management system and garbage collection was overhauled in 5.3) should hopefully mean that this is less of an issue.
In general, it'll also be easier to deal with other issues (if the DB connection temporarily goes down due to a config change and restart for example) if you use the cron approach, although in an ideal world you'd cater for such eventualities in your code anyway. (That said, the last time I checked, this was far from an ideal world.)
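The "currently active" guard mentioned above can be as small as a lock file, so overlapping cron runs simply exit instead of piling up; the path and the refresh_map_data() call are placeholders.

$lock = fopen('/tmp/map-refresh.lock', 'c');   // 'c' creates the file if it does not exist
if (!flock($lock, LOCK_EX | LOCK_NB)) {
    exit(0); // a previous run is still going; let it finish
}

refresh_map_data(); // the actual work

flock($lock, LOCK_UN);
fclose($lock);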
First, I fail to see why you need a daemon script in order to provide the functionality you describe.
Cron jobs are of course a possibility, but I want the refreshing of data to happen as soon as the previous interval has completed
Then neither a cron job nor a daemon is the way to solve the problem (unless the daemon becomes the data sink for the scripts). I'd spawn a dissociated process when the data is available, using a locking strategy to avoid concurrency issues.
Long-running PHP scripts are not intrinsically bad, but the reference-counting garbage collector does not deal with every possible scenario for cleaning up memory; more recent implementations have a more advanced collector (a circular reference checker) which should clean up a lot more.

PHP and CPU - Process of chat + notifications

My site runs a PHP process for each open window/tab; each request runs for at most 1 minute and returns notifications, chat messages, and people going online or offline. When the JavaScript gets the output, it calls the same PHP script again, and so on.
This is like Facebook chat.
But it seems to be taking too much CPU while it runs. Do you have any idea how Facebook handles this problem? What do they do so that their processes don't take too much CPU and bring their servers down?
My process has a "while(true)", with a "sleep(1)" at the end. Inside the cycle, it checks for notifications, checks whether any of the people currently online have gone offline or changed status, reads unread messages, etc.
Let me know if you need more info about how my process works.
Does calling other PHP scripts via "system()" (and waiting for their output) alleviate this?
I ask because that would make other processes check the notifications and flush when finished, while the main PHP script just collects the results.
Thank you.
I think your main problem here is the parallelism. Apache and PHP do not excel at tasks like this, where 100+ users each keep an HTTP request open.
If in your while(true) you spend 0.1 seconds on CPU-bound work (checking status changes or other useful things) and 1 second on the sleep, this results in a CPU load of 100% as soon as you have 10 users online in the chat. So in order to serve more users with THIS model of a chat, you would have to optimize the workload inside your while(true) cycle and/or raise the sleep interval from 1 second to 3 or higher.
I had the same problem in an HTTP-based chat system I wrote many years ago, where at some point too many parallel MySQL SELECTs were slowing down the chat and creating heavy load on the system.
What I did was implement a fast "ring buffer" for messages and status information in shared memory (SysV back in the day; today I would probably use APC or memcached). All operations read from and write to the buffer, and the buffer itself gets periodically "flushed" into the database for persistence (but a lot less often than once per second per user). If no persistence is needed, you can of course omit the backend.
I was able to increase the number of users I could serve by roughly 500% that way.
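Today, the same idea might look roughly like this with APCu (assuming the APCu extension; the key scheme and helper names are invented, and a real version would need locking, e.g. around apcu_entry(), since fetch-modify-store is not atomic):

function chat_push($conversationId, array $message, $max = 100)
{
    $key    = "chat:$conversationId";
    $ring   = apcu_fetch($key) ?: array();
    $ring[] = $message;                               // append the new message
    if (count($ring) > $max) {
        $ring = array_slice($ring, -$max);            // keep only the newest entries
    }
    apcu_store($key, $ring);
}

function chat_since($conversationId, $timestamp)
{
    $ring = apcu_fetch("chat:$conversationId") ?: array();
    return array_filter($ring, function ($m) use ($timestamp) {
        return $m['ts'] > $timestamp;                 // only messages newer than the client's last poll
    });
}

// a separate, much less frequent job would flush these buffers into MySQL for persistence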
BUT as soon as you have solved this issue you will be faced with another: available system memory (100+ Apache processes at ~5 MB each; fun) and process context-switching overhead. The more active processes you have, the more time your operating system will spend, AFAIK, on the overhead of assigning "fair enough" CPU slots.
You'll see it is very hard to scale efficiently with Apache and PHP alone for your use case. There are open source tools, client- and server-based, to help though. One I remember sits in front of Apache and queues messages internally, while keeping a very efficient multi-socket connection to the JavaScript clients, making real "push" events possible. Unfortunately I do not remember any names, so you'll have to research, or hope the Stack Overflow community brings in what my brain has already discarded ;)
Edit:
Hi Nuno,
the comment field has too few characters so I reply here.
Let's get back to the 10 users in parallel:
10 * 0.1 seconds of CPU time per cycle (assumed) is roughly 1 s of combined CPU time over a period of 1.1 seconds (1 second of sleep + 0.1 seconds of execution). That is 1 / 1.1, which I would boldly round to 100% CPU utilization even though it is "only" 90.9%.
If that same 10 * 0.1 s of CPU time is "stretched" over a period of not 1.1 seconds but 3.1 (3 seconds of sleep + 0.1 seconds of execution), the calculation is 1 / 3.1, roughly 32%.
And it is logical: if your checking cycle queries your backend a third as often, you have only a third of the load on your system.
Regarding the shared memory: the name might imply otherwise, but if you use good IDs for your cache areas, like one ID per conversation or user, you get private areas within the shared memory. Database tables also rely on you providing good IDs to separate private data from public information, so those should be around already :)
I would also not "split" any further. The fewer PHP processes you have to "juggle" in parallel, the easier it is for your systems and for you, unless it really makes sense because one type of notification takes a lot more querying resources than another and you want different refresh times, or something like that. But even this can be decided inside the while cycle: a user's "away" status could be checked every 30 seconds, while the messages they might have written get checked every 3. There is no reason to create more cycles; just use different counter variables, or the right divisor in a modulo operation (see the sketch below).
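The "right divisor in a modulo operation" idea is simply this (the check_* functions are placeholders for your own notification queries):

$tick = 0;
while (true) {
    if ($tick % 3 === 0) {
        check_new_messages();   // every 3 seconds
    }
    if ($tick % 30 === 0) {
        check_away_status();    // every 30 seconds
    }
    $tick++;
    sleep(1);
}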
The inventor of PHP said that he believes man is too limited to control parallel processes :)
Edit 2
OK, let's build a formula. We have these variables:
duration of execution (e)
duration of sleep (s)
duration of one cycle (c)
number of concurrent users (u)
CPU load (l)
c = e + s
l = u*e / c   # the combined CPU time generated by u concurrent users, relative to the length of one cycle
l = u*e / (e + s)
For 30 users, ASSUMING that you have 0.1 s of execution time and 1 second of sleep:
l = 30 * 0.1 / (0.1 + 1)
l = 2.73
l = 273% CPU utilization (i.e. you need 3 cores :P)
Exceeding the capability of your CPU means that cycles will run longer than you intend; the overall response time will increase (and the CPU runs hot).
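The same formula can also be turned around (a rearrangement, not from the original exchange) to give the sleep needed for a target load; for example, keeping those 30 users within one full core (l = 1) with e = 0.1 s:

s = u*e / l - e
s = 30 * 0.1 / 1 - 0.1 = 2.9 seconds of sleep

which matches the earlier suggestion of raising the sleep interval to about 3 seconds.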
PHP's sleep() and system() calls are blocking. What you really need to research is pcntl_fork(). Fortunately, I had these problems over a decade ago, and you can look at most of my code.
I needed a PHP application that could connect to multiple IRC servers, sit in an unlimited number of IRC chatrooms, and moderate, interact with, and receive commands from people. All this and more was done in a process-efficient way.
You can check out the entire project at http://sourceforge.net/projects/phpegg/ and the code you want is in source/connect.inc.

php memory how much is too much

I'm currently rewriting my site using my own framework (it's very simple and does exactly what I need; I have no need for something like Zend or CakePHP). I've done a lot of work making sure everything is cached properly: caching pages in files to avoid SQL queries and generally limiting the number of SQL queries.
Overall it looks like it's very speedy. The average time taken for the front page (taken over 100 times) is 0.046152 microseconds.
But one thing I'm not sure about is whether I've done enough to reduce PHP memory usage. The only time I've ever encountered problems with it is when uploading large files.
Using memory_get_peak_usage(TRUE), which I THINK returns the highest amount of memory used whilst the script has been running, the average (taken over 100 times) is 1572864 bytes.
Is that good?
I realise you don't know what it is I'm doing (it's rather simple: get the 10 latest articles, the comment count for each, the user controls, the popular tags in the sidebar, etc.). But would you be at all worried about a script using that sort of memory getting hit 50,000 times a day, or once every second at peak times?
I realise that this is a very open-ended question. Hopefully you can understand that it's a bit of a stab in the dark, and I'm really just looking for some reassurance that it's not going to die horribly come relaunch day.
EDIT: Just a mini experiment I did for myself. I downloaded and installed WordPress; a default installation with no extra add-ons, just one user and one post, used 10.5 megabytes of memory, or "11010048 bytes". Quite pleased with my 1.5 MB now.
Memory usage values can vary heavily and are subject to fluctuation, but as you already say in your update, a regular WordPress instance is much, much fatter than that. I have had great trouble getting the WordPress backend running with a memory_limit of sixteen megabytes, let alone when plug-ins come into play. So from that, I'd say a peak of 1.5 megabytes for normal tasks is quite okay.
Generation time obviously depends heavily on the hardware your site runs on. However, a generation time of 0.046152 seconds (I assume you mean seconds here) sounds perfectly fine to me under normal circumstances.
It is a subjective question. PHP itself has a fair amount of overhead, and when calling the function with TRUE that overhead is included. You'll see what I mean if you call the function in a simple Hello World script. Also keep in mind that results can differ greatly depending on whether PHP runs as an Apache module or as FastCGI.
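A quick way to see the difference the TRUE flag makes is the Hello World baseline mentioned above (the exact numbers will vary with the PHP build and SAPI):

// hello.php
echo "Hello World\n";
printf("emalloc peak: %d bytes\n", memory_get_peak_usage(false)); // memory handed out by PHP's own allocator
printf("real peak:    %d bytes\n", memory_get_peak_usage(true));  // memory reserved from the system, engine overhead included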
Unfortunately, no one can provide assurances. There will always be unforeseen variables that can bring down a site. Perform load testing, and use a code profiler to narrow down the location of any bottlenecks and see whether there are ways to make those code blocks more efficient.
Encyclopaedia Britannica thought they were prepared when they launched their ad-supported encyclopedia ten years ago. The developers didn't know they would be announcing it on Good Morning America the day of the launch. The whole thing came crashing down for days.
As long as your systems aren't swapping, your memory usage is reasonable. Any additional concern is just premature optimization.
