Limit of exec() command in PHP?

I am using the exec command in PHP to execute C++ code, where record_generate.cpp is the program that generates output (100 to millions of records) based on hard-coded parameters.
exec('./record_generater 2>&1', $output);
print_r($output);
When the number of output lines is limited to a few thousand it gives output, but when it reaches hundreds of thousands to millions it seems to crash. How can I avoid such problems?

The first thing you should do is to see if running the C++ program from a shell causes a similar problem.
If so, it's a problem with the C++ code itself and nothing to do with PHP exec.
If it works okay standalone, then it's probably going to be related to storing millions of records into the $output variable.
While a string in PHP can be pretty big (2G from memory), there's a limited total space available to scripts, specified by memory_limit in the php.ini file.
Even at 128M (8M prior to 5.2), this may not be enough to hold millions of lines.
You could try increasing that variable to something larger and see if it helps.
However, you will probably still be better off finding a different way to get the information from your C++ executable into your PHP code, such as writing it to a file or database and processing it in PHP a bit at a time, rather than trying to hold the lot in memory at once.
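Along the same lines, you could stream the output instead of buffering it. Here is a minimal sketch (reusing the binary name from the question; the per-line processing is just a placeholder) that reads the generator's output with popen() rather than letting exec() accumulate every line in $output:

<?php
// Sketch: stream the generator's output line by line instead of letting
// exec() accumulate every line in memory. The aggregation is a placeholder.
$handle = popen('./record_generater 2>&1', 'r');
if ($handle === false) {
    die("Could not start record_generater\n");
}

$count = 0;
while (($line = fgets($handle)) !== false) {
    // Process or aggregate each record here rather than storing it.
    $count++;
}
$exitStatus = pclose($handle);

echo "Processed $count records (exit status $exitStatus)\n";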
In any case, given that it's not really a good user experience to have to look through millions of rows anyway, it might be worthwhile examining what you really need from this data. For example, it may be possible to aggregate or partition it in some manner before outputting.
Any advice we give on that front will need substantially more information than we currently have.

Related

Optimize huge file CSV treatment

I know this question can be too broad, but I need to find a way to optimize the treatment of a CSV file which contains 10 000 rows.
Each row must be parsed, and for every row I need to call the Google API and do calculations; then I need to write a new CSV file with the updated information.
Right now, I am using PHP and the treatment takes around 1/2 hours.
Is there a way to optimize this? I thought about using NodeJS to parallelize the treatment of rows.
You can use curl_multi_select to parallelize the Google API requests. — Load the input into a queue, run queries in parallel, write the output and load more as each result finishes. Something like the TCP sliding window algorithm.
Alternatively, you can load all data into a (SQLite) database (10 000 rows is not much) and then run the calculations in parallel. The database will be easier to implement than creating the sliding window.
I don't think NodeJS would be much faster. Certainly not enough to be worth rewriting the existing code you already have.
You can debug the code by checking how long it takes to read the 10K rows and update them with some random extra columns or extra info. This will give you a sense of how long it takes to read and write a CSV with 10K rows. I believe this shouldn't take long.
The Google API calls might be the culprit. If you know Node.js it is a good option, but if that is too much of a pain, you can use PHP curl to send multiple requests at once without waiting for the response to each request. This might help speed up the process. You can refer to this site for more info: http://bytes.schibsted.com/php-perform-requests-in-parallel/
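A rough sketch of that curl_multi approach (the URLs are placeholders; batch size and error handling are left out) might look like this:

<?php
// Sketch: run a batch of HTTP requests in parallel with curl_multi.
function fetch_parallel(array $urls) {
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Run all requests until every handle has finished.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($active && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}

// Process the CSV in chunks of, say, 20 rows at a time, one batch per call.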
10,000 rows should be no problem, but when opening the file in Python 3.6, make sure you use readlines and read it all at once. Using the csv reader should also help with any separator issues and quote characters such as '"'. I've been reading 1.3 million rows and it's not an issue. Mine takes about 6-8 minutes to process, so yours should be of the order of a few seconds.
Are you using a machine with enough memory? If you are using a Raspberry Pi, a small virtual machine or a really old laptop, I could imagine that this would greatly hamper your processing time. Otherwise, you should have no issues at all with Python.

Good idea to run a PHP file for a few hours as cronjob?

I would like to run a PHP script as a cronjob every night. The PHP script will import an XML file with about 145,000 products. Each product contains a link to an image which will be downloaded and saved on the server as well. I can imagine that this may cause some overload. So my question is: is it a better idea to split the PHP file? And if so, what would be a better solution? More cronjobs, with several minutes' pause between each other? Run another PHP file using exec (guess not, because I can't imagine that would make much of a difference), or something else...? Or just use one script to import all products at once?
Thanks in advance.
It depends a lot on how you've written it, in terms of whether it leaks open files or database connections. It also depends on which version of PHP you're using. In PHP 5.3 a lot was done to address garbage collection:
http://www.php.net/manual/en/features.gc.performance-considerations.php
If it's not important that the operation is transactional, i.e. all or nothing (for example, if it fails half way through), then I would be tempted to tackle this in chunks, where each run of the script processes the next x items, and x can vary depending on how long it takes. So what you'll need to do then is keep repeating the script until nothing is left to do.
To do this, I'd recommend using a tool called the Fat Controller:
http://fat-controller.sourceforge.net
It can keep on repeating the script and then stop once everything is done. You can tell the Fat Controller that there's more to do, or that everything is done using exit statuses from the php script. There are some use cases on the Fat Controller website, for example: http://fat-controller.sourceforge.net/use-cases.html#generating-newsletters
You can also use the Fat Controller to run processes in parallel to speed things up, just be careful you don't run too many in parallel and slow things down. If you're writing to a database, then ultimately you'll be limited by the hard disc, which unless you have something fancy will mean your optimum concurrency will be 1.
The final question would be how to trigger this - and you're probably best off triggering the Fat Controller from CRON.
There's plenty of documentation and examples on the Fat Controller website, but if you need any specific guidance then I'd be happy to help.
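As a hypothetical sketch (the table names, column names and exit codes are assumptions; check the Fat Controller documentation for the convention it actually expects), each run of the PHP script could process the next batch and report via its exit status whether more work remains:

<?php
// Hypothetical chunked-import script meant to be run repeatedly by a scheduler.
// Table/column names and exit codes are placeholders.
$batchSize = 500;

$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');
$stmt = $pdo->query("SELECT id, image_url FROM import_queue LIMIT $batchSize");
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

if (count($rows) === 0) {
    exit(0); // nothing left: signal the scheduler to stop
}

foreach ($rows as $row) {
    // download the image, save it, create/update the product record...
    $pdo->prepare("DELETE FROM import_queue WHERE id = ?")->execute([$row['id']]);
}

exit(1); // placeholder for "more work remains"; use whichever status your scheduler treats as "run again"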
To complete the previous answer, the best solution is to optimize your scripts:
Prefer JSON to XML; parsing JSON is vastly faster.
Use one or only a few concurrent connections to the database.
Alter multiple rows at a time (insert 10-30 rows in one query, select 100 rows, delete multiple rows; not more, so you don't overload memory, and not fewer, so each transaction stays profitable) (see the sketch after this list).
Minimize the number of queries (following on from the previous point).
Skip rows that are definitely already up to date, using dates (timestamp, datetime).
You can also give the processor a breather with a usleep(30) call.
To run multiple PHP processes, use popen().
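For the multi-row insert point above, a sketch with made-up table and column names might look like this:

<?php
// Sketch: insert products in chunks of 20 rows per query instead of one
// query per product. Table and column names are made up for the example.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');
$products = [/* ... parsed from the XML feed ... */];

foreach (array_chunk($products, 20) as $chunk) {
    $placeholders = [];
    $values = [];
    foreach ($chunk as $p) {
        $placeholders[] = '(?, ?, ?)';
        $values[] = $p['sku'];
        $values[] = $p['name'];
        $values[] = $p['price'];
    }
    $sql = 'INSERT INTO products (sku, name, price) VALUES '
         . implode(', ', $placeholders);
    $pdo->prepare($sql)->execute($values);
}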

Best approach for running an "endless" process monitoring MySQL?

I have a process that has to be run against certain things, and it isn't suitable to be run at the user's end (it takes 15+ seconds to process), so I considered using a cron job, but again this is unsuitable because it will create a backlog. I have narrowed my options down to either running an endless process that monitors for MySQL changes, or configuring MySQL to trigger the script when it detects a change, but the latter is not something I want to get into unless it's my only option, which leaves me with the "endless" monitoring option.
The sort of thing I'm considering with PHP is:
while (true) {
    $db->query('SELECT * FROM database');
    while ($row = $db->fetch_assoc()) {
        // do the stuff here
    }
    sleep(5);
}
and then running it via the command line. Now this is theoretically sound but in practice it isn't doing as well as I hoped, using more resources than I would expect (but not out of my range, just not what I'm aiming for optimally). So my questions are as follows:
Is PHP the wrong language to do this in? PHP is what I work with, but I understand that there are times when it's the wrong choice and I think maybe this is. If it is, what language should I use?
Is there a better approach that I haven't considered and that isn't any of the ideas I have listed?
If PHP is the correct option, how can I optimise the code I posted, is there a method better than sleeping for 5 seconds after each completed operation?
Thanks in advance! I'm open to any ideas as long as they're not too far out there; I'm running my own server with free rein, so there's no theoretical limit on what I can do.
I recommend moving the loop out into a shell script and then executing a new PHP process for every iteration. This way PHP will never use unbounded resources (even if there is a memory/connection leak somewhere) since the process is terminated on each iteration. Something like the following should be fine (Bash):
while true; do
    php /path/to/your/script.php 2>&1 | logger ...(logger options)
    sleep 5
done
I've found this approach to be far more robust for long-running scripts in PHP, probably because it is very much like the way PHP operates when run as a CGI script.
You should always work with the language you're most familiar with. If this is PHP, then it's not a wrong choice.
Disconnect from the database before sleeping. This way your script won't keep a connection reserved, and it will work fine even after database restart.
Free the MySQL result after using it. Always check for error conditions in daemonized processes, and deal with them appropriately.
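A sketch of the loop with those two suggestions applied (connection details and the query are placeholders) could look like this:

<?php
// Sketch: reconnect each iteration, free the result, and close the
// connection before sleeping, so nothing is held while the script is idle.
while (true) {
    $db = new mysqli('localhost', 'user', 'pass', 'mydb');
    if ($db->connect_error) {
        error_log('connect failed: ' . $db->connect_error);
        sleep(5);
        continue;
    }

    $result = $db->query('SELECT * FROM jobs WHERE processed = 0');
    if ($result) {
        while ($row = $result->fetch_assoc()) {
            // do the stuff here
        }
        $result->free();
    }

    $db->close();   // don't hold a connection while sleeping
    sleep(5);
}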
PHP might be the wrong language as it's really designed for serving requests on an ad-hoc basis, rather than creating long-running daemons. (It was originally created as a preprocessor language, then later on came into general use as a web application language.)
Something like Python might work better for your needs; it's a little more naturally designed for "daemon-like" programs.
That said, it is possible to do what you want in PHP.
What kind of problems are you experiencing?
I don't know about the database class you have there in $db, but it could generate a memory leak.
Furthermore, I would suggest closing all your connections and unsetting all your variables if necessary at the end of the loop, and re-opening them at the beginning.
If it's only a 5-second sleep, maybe do that only on every 10th iteration or something; you can keep a counter for that.
With these points considered, there's nothing wrong with this approach.

Speed up forum conversion

I'm converting a forum from myBB to IPBoard (the conversion is done through a PHP script), however I have over 4 million posts that need to be converted, and it will take about 10 hours at the current rate. I basically have unlimited RAM and CPU, what I want to know is how can I speed this process up? Is there a way I can allocate a huge amount of memory to this one process?
Thanks for any help!
You're not going to get a script to run any faster. By giving it more memory, you might be able to have it do more posts at one time, though. Change memory_limit in your php.ini file to change how much memory it can use.
You might be able to tell the script to do one forum at a time. Then you could run several copies of the script at once. This will be limited by how it talks to the database table and whether the script has been written to allow this -- it might do daft things like lock the target table or do an insanely long read on the source table. In any case, you would be unlikely to get more than three or four running at once without everything slowing down, anyway.
It might be possible to improve the script, but that would be several days' hard work learning the insides of both forums' database formats. Have you asked on the forums for IPBoard? Maybe someone there has experience at what you're trying to do.
Not sure how the conversion is done, but if you are importing a SQL file, you could split it up into multiple files and import them at the same time. Hope that helps :)
If you are saying that you have the file(s) already converted, you should look into MySQL's LOAD DATA INFILE for importing, given you have access to the MySQL console. This will load data considerably faster than executing the SQL statements via the source command.
If you do not have them in files and you are converting on the fly, then I would suggest having the conversion script write the data to a file (set the time limit to 0 to allow it to run) and then use that LOAD DATA command to insert / update the data.
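As a sketch of that second approach (the file path, table and column names, and the convert_posts() generator are all placeholders; LOAD DATA LOCAL INFILE also has to be enabled on both client and server):

<?php
// Sketch: write converted posts to a CSV, then bulk-load it in one statement.
set_time_limit(0);

$fh = fopen('/tmp/converted_posts.csv', 'w');
foreach (convert_posts() as $post) {        // convert_posts() is hypothetical
    fputcsv($fh, [$post['id'], $post['author'], $post['body']]);
}
fclose($fh);

$pdo = new PDO('mysql:host=localhost;dbname=ipboard', 'user', 'pass',
               [PDO::MYSQL_ATTR_LOCAL_INFILE => true]);
$pdo->exec("LOAD DATA LOCAL INFILE '/tmp/converted_posts.csv'
            INTO TABLE posts
            FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
            (id, author, body)");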

Singular Value Decomposition (SVD) in PHP

I would like to implement Singular Value Decomposition (SVD) in PHP. I know that there are several external libraries which could do this for me. But I have two questions concerning PHP, though:
1) Do you think it's possible and/or reasonable to code the SVD in PHP?
2) If (1) is yes: Can you help me to code it in PHP?
I've already coded some parts of the SVD myself. Here's the code, in which I've added comments describing the course of action. Some parts of this code aren't completely correct.
It would be great if you could help me. Thank you very much in advance!
SVD-python
Is a very clear, parsimonious implementation of the SVD.
It's practically pseudocode and should be fairly easy to understand
and compare/draw on for your php implementation, even if you don't know much python.
That said, as others have mentioned, I wouldn't expect to be able to do very heavy-duty LSA with a PHP implementation on what sounds like a pretty limited web host.
Cheers
Edit:
The module above doesn't do anything all by itself, but there is an example included in the
opening comments. Assuming you downloaded the python module, and it was accessible (e.g. in the same folder), you
could implement a trivial example as follows:
#!/usr/bin/python
import svd
import math

a = [[22., 10.,  2.,   3.,  7.],
     [14.,  7., 10.,   0.,  8.],
     [-1., 13., -1., -11.,  3.],
     [-3., -2., 13.,  -2.,  4.],
     [ 9.,  8.,  1.,  -2.,  4.],
     [ 9.,  1., -7.,   5., -1.],
     [ 2., -6.,  6.,   5.,  1.],
     [ 4.,  5.,  0.,  -2.,  2.]]
u, w, vt = svd.svd(a)
print w
Here 'w' contains your list of singular values.
Of course this only gets you part of the way to latent semantic analysis and its relatives.
You usually want to reduce the number of singular values, then employ some appropriate distance
metric to measure the similarity between your documents, or words, or documents and words, etc.
The cosine of the angle between your resultant vectors is pretty popular.
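For reference, a minimal PHP sketch of that cosine step, for two reduced-dimension vectors of equal length:

<?php
// Sketch: cosine similarity between two numeric vectors of equal length.
function cosine_similarity(array $a, array $b) {
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $v) {
        $dot   += $v * $b[$i];
        $normA += $v * $v;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // avoid division by zero for all-zero vectors
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}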
Latent Semantic Mapping (pdf)
is by far the clearest, most concise and informative paper I've read on the remaining steps you
need to work out following the SVD.
Edit 2: Also note that if you're working with very large term-document matrices (I'm assuming this is what you are doing), it is almost certainly going to be far more efficient to perform the decomposition in an offline mode, and then perform only the comparisons live in response to requests. While svd-python is great for learning, SVDLIBC is more what you would want for such heavy computation.
Finally, as mentioned in the Bellegarda paper above, remember that you don't have to recompute the SVD every single time you get a new document or request. Depending on what you are trying to do, you could probably get away with performing the SVD once every week or so, in an offline mode on a local machine, and then uploading the results (size/bandwidth concerns notwithstanding).
Anyway, good luck!
Be careful when you say "I don't care what the time limits are". SVD is an O(N^3) operation (or O(MN^2) if it's a rectangular m*n matrix), which means you could very easily end up in a situation where your problem takes a very long time. If the 100*100 case takes one minute, the 1000*1000 case would take 10^3 minutes, or nearly 17 hours (and probably worse in practice, as you're likely to fall out of cache). With something like PHP, the prefactor -- the number multiplying the N^3 to get the required FLOP count -- can be very, very large.
Having said that, of course it's possible to code it in PHP -- the language has the required data structures and operations.
I know this is an old Q, but here's my 2-bits:
1) A true SVD is much slower than the calculus-inspired approximations used, eg, in the Netflix prize. See: http://www.sifter.org/~simon/journal/20061211.html
There's an implementation (in C) here:
http://www.timelydevelopment.com/demos/NetflixPrize.aspx
2) C would be faster but PHP can certainly do it.
PHP Architect author Cal Evans: "PHP is a web scripting language... [but] I’ve used PHP as a scripting language for writing the DOS equivalent of BATCH files or the Linux equivalent of shell scripts. I’ve found that most of what I need to do can be accomplished from within PHP. There is even a project to allow you to build desktop applications via PHP, the PHP-GTK project."
Regarding question 1: It definitely is possible. Whether it's reasonable depends on your scenario: How big are your matrices? How often do you intend to run the code? Is it run in a web site or from the command line?
If you do care about speed, I would suggest writing a simple extension that wraps calls to the GNU Scientific Library.
Yes, it's possible, but implementing SVD in PHP isn't the optimal approach. As you can see here, PHP is slower than C and also slower than C++, so it might be better to do it in one of those languages and call it as a function to get your results. You can find an implementation of the algorithm here, so you can guide yourself through it.
About the function calling can use:
The exec() Function
The system function is quite useful and powerful, but one of the biggest problems with it is that all resulting text from the program goes directly to the output stream. There will be situations where you might like to format the resulting text and display it in some different way, or not display it at all. In those cases exec() is the better fit: it runs the command, fills an array with the lines of output, and optionally gives you the command's exit status, without printing anything itself.
The system() Function
The system function in PHP takes a string argument with the command to execute as well as any arguments you wish passed to that command. This function executes the specified command, and dumps any resulting text to the output stream (either the HTTP output in a web server situation, or the console if you are running PHP as a command line tool). The return of this function is the last line of output from the program, if it emits text output.
The passthru() Function
One fascinating function that PHP provides similar to those we have seen so far is the passthru function. This function, like the others, executes the program you tell it to. However, it then proceeds to immediately send the raw output from this program to the output stream with which PHP is currently working (i.e. either HTTP in a web server scenario, or the shell in a command line version of PHP).
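A quick side-by-side of the three, using a harmless placeholder command:

<?php
// Demo of exec(), system() and passthru() with a placeholder command.
$cmd = 'ls -l';

// exec(): output is captured into an array, nothing is printed by default.
exec($cmd, $lines, $status);
echo "exec captured " . count($lines) . " lines, exit status $status\n";

// system(): output is echoed as it arrives; the return value is the last line.
$lastLine = system($cmd);
echo "\nlast line from system(): $lastLine\n";

// passthru(): raw output is passed straight through to the browser/console,
// which is useful for binary output such as images.
passthru($cmd);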
Yes, this is perfectly possible to implement in PHP.
I don't know what a reasonable time frame for execution would be, or how large a matrix it could handle.
I would probably have to implement the algorithm to get a rough idea.
Yes I can help you code it. But why do you need help? Doesn't the code you wrote work?
Just as an aside question. What version of PHP do you use?
