I use PHP to do a lot of data processing (realizing I'm probably pushing into territory where I should be using other languages and/or techniques).
I'm doing entity extraction with a PHP process that loads an array containing ngrams to look for into memory. That array uses 3GB of memory and takes about 20 seconds to load each time I launch a process. I generate it once locally on the machine and each process loads it from a .json file. Each process then tokenizes the text it's processing and does an array_intersect between these two arrays to extract entities.
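To make that concrete, the per-process flow is roughly this (a simplified sketch; the file path and the tokenizer are illustrative):

    // Load the ngram list generated earlier (~3 GB in memory, ~20 s to parse).
    $ngrams = json_decode(file_get_contents('/data/ngrams.json'), true);

    function extract_entities($text, $ngrams) {
        // Naive whitespace tokenization; the real tokenizer is more involved.
        $tokens = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        // Entities are the tokens that also appear in the ngram list.
        return array_intersect($tokens, $ngrams);
    }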
Is there any way to preload this into memory on the machine that is running all these processes and then share the resource across all the processes?
Since it's probably not possible with PHP: What type of languages/methods should I be researching to do this sort of entity extraction more efficiently?
If the array never gets modified after it's loaded, then you could use pcntl_fork() and fork off a bunch of copies of the script. With copy-on-write semantics, they'd all be reading from the exact same memory copy of the array.
However, as soon as the array gets modified, then you'll pay a huge penalty as the array gets copied into each forked child's memory space. This would be especially true if any of the scripts finish their run early - they'd shut down, that PHP process starts shutdown cleanup, and that'd count as a write on the array's memory space, causing the copying.
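A minimal sketch of that fork-and-share approach (assuming the pcntl extension in a CLI script; process_documents() is a hypothetical per-worker job function):

    // Load the big array once in the parent, before forking.
    $ngrams = json_decode(file_get_contents('/data/ngrams.json'), true);

    $workers = 4;
    for ($i = 0; $i < $workers; $i++) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("fork failed\n");
        }
        if ($pid === 0) {
            // Child: as long as we only read $ngrams, copy-on-write keeps it shared.
            process_documents($ngrams, $i);
            exit(0);
        }
    }
    // Parent: wait for all children to finish.
    while (pcntl_wait($status) > 0);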
In your case, the best way of sharing might be read only mmap access.
I don't know if this is possible in PHP. A lot of languages will allow you to mmap a file into memory - and your operating system will be smart enough to realize that read-only maps can be shared. Also, if you don't need all of it, the operating system can reclaim the memory, and load it again from disk as necessary. In fact, it may even allow you to map more memory than you physically have.
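If you want to experiment in PHP, the closest built-in analogue I'm aware of is System V shared memory via the shmop extension (not a true file mmap). A rough sketch, assuming shmop is compiled in:

    // Loader process: serialize the array once and write it into a shared segment.
    $key  = ftok(__FILE__, 'n');              // derive a SysV IPC key
    $blob = serialize($ngrams);               // $ngrams already loaded from the JSON file
    $shm  = shmop_open($key, 'c', 0644, strlen($blob));
    shmop_write($shm, $blob, 0);

    // Worker process: attach read-only and rebuild the array.
    $shm    = shmop_open($key, 'a', 0, 0);
    $ngrams = unserialize(shmop_read($shm, 0, shmop_size($shm)));

Note that unserialize() still builds a private copy of the array in every worker, so this saves the 20-second JSON parse but not the 3 GB per process; truly sharing a PHP array really does need the copy-on-write fork approach above.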
mmap is really elegant. But nevertheless, dealing with such mapped data in PHP will likely be a pain, and sloooow. In general PHP is slow. In benchmarks, it is common to see PHP come in at 40-50 times the runtime of a good C program. This is much worse than e.g. Java, where a good Java program is only about twice as slow as highly optimized C; there it may pay off to have the powerful development tools of Java as opposed to having to debug low-level C code. But PHP does not have any such key benefit: it is neither elegant to write, nor does it have a superior toolchain, nor is it fast...
We are working on an image-processing application. It involves applying filters (Gaussian blur, etc.). We want to make it a highly concurrent application.
This will be on multiple single core ec2 instances.
Since image processing is a CPU-intensive operation, we think Node.js would get blocked in its event loop, so we are considering PHP instead. We have not been able to find any benchmarks in this area. Any input on this would be a great help.
It's a CPU-bound task. Really well-optimized PHP or Node will probably perform similarly. I/O concurrency will not help a CPU-bound task on a single core. On many cores the I/O may come into play, but realistically most platforms, including PHP, have efficient strategies for concurrent I/O now. Also, you are likely to end up calling out to C or C++ code regardless.
If you really want (cost-effective) performance, drop the single-core constraint, put some big gaming- or mining-grade PCs in the office, find a nice way to distribute the tasks among the machine(s), and process multiple images concurrently on the GPUs. None of that is actually tied to a particular programming language.
PHP is not highly concurrent and each request will block until it's done. Node would be fine as long as it's mainly doing I/O, or waiting for another process to return, e.g. calling convert (ImageMagick), rather than doing any processing itself. The more CPU cores you have to run the actual conversion on, the better.
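For what it's worth, the same "let an external tool do the CPU work" pattern looks roughly like this in PHP (paths, sizes and the error handling are illustrative; assumes the ImageMagick convert binary is installed):

    // Offload the resize to ImageMagick's convert binary instead of doing it in PHP.
    $src = escapeshellarg('/tmp/input.jpg');
    $dst = escapeshellarg('/tmp/output.jpg');
    exec("convert $src -resize 800x600 $dst", $output, $exitCode);
    if ($exitCode !== 0) {
        error_log("convert failed: " . implode("\n", $output));
    }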
For image processing, I recommend using PHP rather than Node.js, because there are many good PHP packages that make working with images easy. Don't worry about the performance of PHP 7 or HHVM :)
I have a multi-process PHP (CLI) application that runs continuously. I am trying to optimize the memory usage because the amount of memory used by each process limits the number of forks that I can run at any given time (since I have a finite amount of memory available). I have tried several approaches. For example, following the advice given by preinheimer, I re-compiled PHP, disabling all extensions and then re-enabling only those needed for my application (mysql, curl, pcntl, posix, and json). This, however, did not reduce the memory usage. It actually increased slightly.
I am nearly ready to abandon the multi-process approach, but I am making a last ditch effort to see if anyone else has any better ideas on how to reduce memory usage. I will post my alternative approach, which involves significant refactoring of my application, below.
Many thanks in advance to anyone who can help me tackle this challenge!
Multi-process PHP applications (e.g. an application that forks itself using pcntl_fork()) are inherently inefficient in terms of memory because each child process loads an entire copy of the PHP executable into memory. This can easily equate to 10 MB of memory per process or more (depending on the application). Compiling extensions as shared libraries should, in theory, reduce the memory footprint, but I have had limited success with this (actually, my attempts at this made the memory usage worse for some unknown reason).
A better approach is to use multi-threading. In this approach, the application resides in a single process, but multiple actions can be performed concurrently* in separate threads (i.e. multi-tasking). Traditionally PHP has not been ideal for multi-threaded applications, but recently some new extensions have made multi-threading in PHP more feasible. See, for example, this answer to a question about multithreading in PHP (whose accepted answer is rather outdated).
For the above problem, I plan to refactor my application into a multi-threaded one using pthreads. This requires a significant amount of modification, but it will (hopefully) result in a much more efficient overall architecture for the application. I will update this answer as I proceed and offer some refactoring examples for anyone else who would like to do something similar. Others, feel free to provide feedback and also update this answer with code examples!
* Footnote about concurrency: Unless one has a multi-core machine, the actions will not actually be performed concurrently. Instead, they will be scheduled to run on the CPU in small time slices. From the user's perspective, they will appear to run concurrently.
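As a starting point, here is roughly what a minimal pthreads worker could look like (a sketch only, assuming the pthreads extension on a thread-safe (ZTS) PHP build; the class name and the process_job() function are illustrative):

    // One thread per job; the pthreads extension provides the Thread base class.
    class JobThread extends Thread {
        private $job;

        public function __construct($job) {
            $this->job = $job;
        }

        public function run() {
            // The actual work for one job happens inside this thread.
            process_job($this->job);
        }
    }

    $threads = array();
    foreach ($jobs as $job) {            // $jobs: hypothetical list of pending work
        $thread = new JobThread($job);
        $thread->start();
        $threads[] = $thread;            // keep a reference so the thread isn't destroyed early
    }
    foreach ($threads as $thread) {
        $thread->join();
    }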
Right now I'm running 50 individual PHP (CLI) workers (processes) per machine that are waiting to receive their workload (job). For example, the job of resizing an image. In the workload they receive the image (binary data) and the desired size. The worker does its work and returns the resized image. Then it waits for more jobs (it loops in a smart way). I'm presuming that I have the same executable, libraries and classes loaded and instantiated 50 times. Am I correct? Because this does not sound very effective.
What I'd like to have now is one process that handles all this work and is able to use all available CPU cores while having everything loaded only once (to be more efficient). I presume a new thread would be started for each job and, after it finishes, the thread would stop. More jobs would be accepted if there are fewer than 50 threads doing the work. If all 50 threads are busy, no additional jobs are accepted.
I am using a lot of libraries (for Memcached, Redis, MogileFS, ...) to have access to all the various components that the system uses and Python is pretty much the only language apart from PHP that has support for all of them.
Can Python do what I want, and will it be faster and more efficient than the current PHP solution?
Most probably, yes. But don't assume you have to do multithreading. Have a look at the multiprocessing module. It already includes an implementation of a Pool, which is what you could use. And it basically sidesteps the GIL problem (with multithreading, only one piece of "standard Python code" can run at any time - that's a very simplified explanation).
It will still fork a process per job, but in a different way than starting everything over from scratch. All the initialisations done and libraries loaded before entering the worker process will be inherited in a copy-on-write way. You won't do more initialisation than necessary, and you won't waste memory on the same library/class as long as you don't actually change it from its pre-pool state.
So yes - looking only at this part, Python will waste fewer resources and will use a "nicer" worker-pool model. Whether it will really be faster / less CPU-hungry is hard to tell without testing, or at least looking at the code. Try it yourself.
Added: If you're worried about memory usage, Python may also help you a bit, since it has a "proper" garbage collector, while in PHP GC is not a priority and not that good (and for good reason, too).
Linux has shared libraries, so those 50 php processes use mostly the same libraries.
You don't sound like you even have a problem at all.
"this does not sound very effective." is not a problem description, if anything those words are a problem on their own. Writing code needs a real reason, else you're just wasting time and/or money.
Python is a fine language and won't perform worse than PHP. Python's multiprocessing module will probably help a lot too. But there isn't much to gain if the PHP implementation is not completely insane. So why even bother spending time on it when everything works? That is usually the goal, not a reason to rewrite...
If you are on a sane operating system then shared libraries should only be loaded once and shared among all processes using them. Memory for data structures and connection handles will obviously be duplicated, but the overhead of stopping and starting the systems may be greater than keeping things up while idle. If you are using something like gearman it might make sense to let several workers stay up even if idle and then have a persistent monitoring process that will start new workers if all the current workers are busy up until a threshold such as the number of available CPUs. That process could then kill workers in a LIFO manner after they have been idle for some period of time.
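A rough sketch of that persistent-worker pattern with Gearman (assuming the pecl gearman extension; the job-server address, function name and do_resize() are illustrative):

    // Persistent worker: stays up while idle and processes jobs as they arrive.
    $worker = new GearmanWorker();
    $worker->addServer('127.0.0.1', 4730);
    $worker->addFunction('resize_image', function (GearmanJob $job) {
        $payload = $job->workload();      // e.g. serialized image data plus target size
        return do_resize($payload);       // hypothetical resize routine; result goes back to the client
    });

    while ($worker->work()) {
        // Loop forever; a separate monitor process can start or kill workers as needed.
    }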
We have a large management software product that produces big reports of all kinds, based on numerous loops, with database retrievals, lots of object creation, and so on.
On PHP 4 it could run happily with a memory limit of 64 MB. Now we have moved it to a new server, and with the same database and the same code, the same reports won't come up without a gigabyte of memory limit...
I know that PHP 5 has changed quite a lot of things under the hood, but is there a way to make it behave?
The question, in the end, is: what strategies do you apply when you need to put your scripts on a diet?
A big problem we ran into was circular references between objects stopping them from being freed when they go out of scope.
Depending on your architecture, you may be able to use __destruct() and manually unset any references. For our problem, I ended up restructuring the classes and removing the circular references.
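A stripped-down illustration of the problem and the manual fix (class names are made up):

    // Two objects referencing each other: before PHP 5.3's cycle collector,
    // a pair like this was never freed, even after both variables went out of scope.
    class Report  { public $section; }
    class Section { public $report; }

    $report  = new Report();
    $section = new Section();
    $report->section = $section;
    $section->report = $report;

    // Manual fix: break the cycle before dropping the last external references,
    // e.g. in a __destruct() or an explicit cleanup method.
    $report->section = null;
    unset($report, $section);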
When I need to optimize resources in a script, I always try to analyze, profile and debug my code. I use Xdebug and the Xdebug Profiler; there are other options such as APD and Benchmark Profiler.
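For reference, enabling the profiler is just a couple of php.ini settings (these are the Xdebug 2-era setting names; the output directory is up to you):

    ; php.ini: Xdebug 2 profiler settings
    xdebug.profiler_enable_trigger = 1      ; profile only when XDEBUG_PROFILE is passed
    xdebug.profiler_output_dir     = /tmp   ; cachegrind.out.* files are written here

The resulting cachegrind files can be opened with KCachegrind, WinCacheGrind or Webgrind.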
Additionally, I recommend these articles:
Make PHP apps fast, faster, fastest..
Profiling PHP Applications (PDF)
PHP & Performance (PDF)
Since moving to the new server, have you verified that your MySQL and PHP system variables are identical to the way they were on your old server?
PHP 5 introduced a lot of new functionality, but due to its backward-compatibility mantra, I don't believe that the differences between PHP 5 and PHP 4 should be causing this large an effect on the performance of an application whose code and database have not been altered.
Are you also running on the same version of Apache or IIS?
It sounds like a problem that is more likely related to your new system environment than to an upgrade from PHP4 to 5.
Bertrand,
If you are interested in refactoring the existing code then I would recommend that you first monitor your CPU and Memory usage while executing reports. Are you locking up your SQL server or are you locking up Apache (which happens if a lot of stress is being put onto the system by the PHP code)?
I worked on a project that initially bogged down MySQL so severely that we had to refactor the entire report-generation process. However, when we finished, the load was simply transferred to Apache (through the more complex PHP code). Our final solution was to refactor the database design to provide better performance for reporting functions and to use PHP to pick up the slack on what we couldn't do natively in MySQL.
Depending on the nature of the reports you might consider denormalizing the data that is being used for the reports. You might even consider constructing a second database that serves as a data warehouse and is designed around OLAP principles rather than OLTP principles. You can start at Wikipedia for a general explanation of OLAP and data warehousing.
However, before you start looking at serious refactoring, have you verified that your environments are sufficiently similar by looking at phpinfo(); for PHP and SHOW VARIABLES; in MySQL?
A gig!?!
Even 64 MB is big.
Ignoring the discrepancy between environments (which does sound very peculiar), it sounds like the code may need some refactoring.
Any chance you can refactor your code so that the result sets from database queries are not dumped into arrays? I would recommend that you construct an iterator for your result sets (you can then treat them as arrays for most purposes). There is a big difference between handling one record at a time and handling 10,000 records at a time.
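A minimal sketch of such an iterator (using mysqli here; adapt it to whatever database layer you use, and treat the query as illustrative):

    // Wraps a result set so calling code touches one row at a time
    // instead of a 10,000-element array.
    class ResultIterator implements Iterator {
        private $result;
        private $row;
        private $key = 0;

        public function __construct(mysqli_result $result) {
            $this->result = $result;
        }

        public function rewind()  { $this->result->data_seek(0); $this->row = $this->result->fetch_assoc(); $this->key = 0; }
        public function valid()   { return is_array($this->row); }
        public function current() { return $this->row; }
        public function key()     { return $this->key; }
        public function next()    { $this->row = $this->result->fetch_assoc(); $this->key++; }
    }

    foreach (new ResultIterator($mysqli->query('SELECT * FROM report_rows')) as $row) {
        // handle a single record here
    }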
Secondly, have a look at whether your code is creating multiple instances of the data. Can you pass the objects by reference (using '&')? We had to do a similar thing when using an early variant of the Horde framework: a 1 MB attachment would blow out to 50 MB because of numerous calls that passed the whole dataset as a copy rather than as a reference.
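An illustrative PHP 4-style sketch of the by-reference fix (function and variable names are made up):

    // Pass the large payload by reference so the whole attachment is not copied
    // on every call (note the '&' in the signature).
    function render_attachment(&$attachment) {
        // ... work with $attachment without duplicating it ...
    }

    render_attachment($attachment);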
To put it simply, I am a fairly new PHP coder and I was wondering if anyone could guide me towards the best ways to improve performance in code, as well as stopping those pesky memory leaks. My host is one of those that doesn't have APC or the like installed, so it would all have to be hand coded -_-
I don't think ordinary memory leaks (like forgetting to dispose of objects or strings) are common in PHP, but resource leaks in general are. I've had issues with:
Database connections -- you should really call pg_close/mysql_close/etc. when you're done with the connection. Though I think PHP's connection pooling mitigates this (but can have problems of its own).
Images -- if you use the gd2 extension to open or create images, you need to call imagedestroy() on them, because otherwise they'll keep occupying memory until the script ends. And images tend to be big in terms of data size.
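A typical GD lifecycle, for illustration (paths and sizes are made up):

    // Create, manipulate, output, then explicitly free both image resources.
    $img   = imagecreatefromjpeg('/tmp/photo.jpg');
    $thumb = imagecreatetruecolor(200, 150);
    imagecopyresampled($thumb, $img, 0, 0, 0, 0, 200, 150, imagesx($img), imagesy($img));
    imagejpeg($thumb, '/tmp/thumb.jpg');
    imagedestroy($img);
    imagedestroy($thumb);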
Note that if your scripts run as pure CGI (no HTTP server modules), then the resources will effectively be cleaned up when the script exits. However there may still be memory issues during the script's runtime, especially in the case of images where it's not uncommon to perform many manipulations in a single script execution.
In general, php scripts can't leak memory. The php runtime manages all memory for its scripts. The script itself may leak memory, but this will be reclaimed when the php process ends. Since php is mainly used for processing http-requests and these generally run for a very short time, this makes it a non-issue if you leak a bit of memory underway. So memory leaks should only really concern you if you use php for non-http tasks. Performance should be a bigger concern for you than memory usage. Use a tool such as xdebug to profile your code.