I have a LAMP server on which I run a PHP script that makes a SELECT query on a table containing about 1 million rows.
Here is my script (PHP 8.2 and MariaDB 10.5.18):
$db = new PDO("mysql:host=$dbhost;dbname=$dbname;", $dbuser, $dbpass);
$req = $db->prepare('SELECT * FROM '.$f_dataset);
$req->execute();
$fetch = $req->fetchAll(PDO::FETCH_ASSOC);
$req->closeCursor();
My problem is that each execution of this script seems to consume about 500 MB of RAM on my server, and this memory is not released when the script finishes. With only 2 GB of RAM, the server kills the Apache2 process after 3 executions, which forces me to restart Apache each time.
Is there a solution to this? A piece of code that frees the used memory?
I tried unset($fetch) and gc_collect_cycles(), but nothing works, and I haven't found anyone who had the same problem.
EDIT
After the more skeptical among you about my problem posted several responses asking for evidence as well as additional information, here is what else I can tell you:
I am currently developing a tool for testing trading strategies, where I set the parameters manually via an HTML form. The form is then processed by a PHP script that first performs calculations to reproduce technical indicators (using the Trader library for some of them, and my own reimplementation for others) from the parameters submitted by the form.
In a second step, after having reproduced the technical indicators and having stored their values in my database, the PHP script will simulate a buy or sell order according to the values of the stock market price I am interested in, and according to the values of the technical indicators calculated just before.
To do this, I have, for example, 2 tables in my database. The first one stores the information for 1-minute candles (opening price, closing price, max price, min price, volume ...), that is, 1 candle per row. The second table stores the value of a technical indicator corresponding to a candle, and thus to a row of my first table.
The reason why I need to make calculations, and therefore to get my 1 million candles, is that my table contains 1 million candles of 1 minute on which I want to test my strategy. I could do this with 500 candles as well as with 10 million candles.
My problem right now is only with the candle retrieval; there are not even any calculations yet. The script I shared above is very short, and there is absolutely nothing else in it except the definitions of my variables $dbname, $dbhost etc. So look no further, you have absolutely everything here.
When I run this script from my browser and watch my RAM load during execution, I see that an Apache process consumes up to 697 MB of RAM. So far, nothing abnormal: the table I'm retrieving candles from is a little over 100 MB. The real problem is that once the script has finished, the RAM load remains the same. If I run my script a second time, the RAM load is 1400 MB. This continues until I have used up all the RAM and my Apache server crashes.
So my question is simple: do you know a way to clear this RAM after my script has executed?
What you describe is improbable, and you don't say how you made these measurements. If your assertions are valid, then there are a couple of ways to solve the memory issue; however, this is an XY problem. There is no good reason to read a million rows into a web page script.
After several hours of research and discussion, it seems that this problem of unreleased memory has no solution. It is simply a current technical limitation of Apache in my case: a worker process does not give back the memory it has used unless it is restarted.
I have however found a workaround in the Apache configuration, by allowing at most one request per server process instead of the default 5.
This way, the process my script runs on gets killed at the end of the run and is replaced by a new one that starts automatically.
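As a sketch of one of the "couple of ways" to reduce memory use mentioned above: instead of loading every row with fetchAll(), the rows can be read one at a time. The function name streamRows() is hypothetical, and the snippet assumes the table name comes from trusted code. This lowers the peak memory of each request; it does not by itself make Apache return memory it has already claimed.

```php
<?php
declare(strict_types=1);

/**
 * Yield rows one at a time instead of building one huge array with fetchAll().
 * $table is assumed to come from trusted code, because identifiers cannot be
 * bound as query parameters.
 */
function streamRows(PDO $db, string $table): Generator
{
    $req = $db->prepare('SELECT * FROM ' . $table);
    $req->execute();
    while (($row = $req->fetch(PDO::FETCH_ASSOC)) !== false) {
        yield $row; // only the current row is held in PHP memory
    }
    $req->closeCursor();
}

// Hypothetical usage with the variables from the question:
// $db = new PDO("mysql:host=$dbhost;dbname=$dbname;", $dbuser, $dbpass);
// // With the MySQL driver, also ask PDO not to buffer the whole result set client-side:
// $db->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
// foreach (streamRows($db, $f_dataset) as $candle) {
//     // process one candle at a time
// }
```

With an unbuffered query the connection stays busy until all rows have been read, so other queries on the same connection have to wait until the loop is done.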
I've created a Joomla extension in which I copy records from table A to table B. My script works fine if table A contains little data.
If table A contains a large amount of data, the insert exceeds the execution time limit and shows this error: 'Fatal error: Maximum execution time of 30 seconds exceeded in
/mysite/libraries/joomla/database/database/mysqli.php on line 382'.
I can overcome this problem by changing the ini file, but this is a Joomla extension that people are going to use on their own sites, so I can't tell them to change their ini file; in fact, I don't want to.
Take a look at this:
http://davidwalsh.name/increase-php-script-execution-time-limit-ini_set
ini_set('max_execution_time', 300);
Use it this way, or:
set_time_limit(0);
Put the code below at the start of the page that runs the queries:
set_time_limit(0);
Technically, you can increase the maximum execution time using set_time_limit. Personally, I wouldn't mess with limits other people set on their servers, assuming they put them in for a reason (performance, security - especially in a shared hosting context, where software like Joomla! is often found). Also, set_time_limit won't work if PHP is run in safe mode.
So what you're left with is splitting the task into multiple steps. For example, if your table has 100000 records and you measure that you can process about 5000 records in a reasonable amount of time, then do the operation in 20 individual steps.
Execution time for each step should be a good deal less than 30 seconds on an average system. Note that the number of steps is dynamic: at runtime, you divide the number of records by a constant (figure out a useful value during testing) to get the number of steps.
You need to split your script into two parts. The first part finds out the number of steps required, displays them to the user, and runs one step after another by sending AJAX requests to the second script (like: "process records 5001 to 10000"). It marks each step as done (for the user to see) when the corresponding server response arrives (i.e. the request is complete).
The second part is entirely server-sided and accepts AJAX requests. This script does the actual work on the server. It must receive some kind of parameters (the "process records 5001 to 10000" request) to understand which step it's supposed to process. When it's done with its step, it returns a "success" (or possibly "failure") code to the client script, so that it can notify the user.
There are variations on this theme, for instance you can build a script which redirects the user to itself, but with different parameters, so it's aware where it left off and can pick up from there with the next step. In general, you'd want the solution that gives the user the most information and control possible.
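The step calculation described above can be sketched as follows. The function name planSteps() and the offset/limit layout are assumptions for illustration; the second script would receive one such pair per AJAX request and use it in a LIMIT/OFFSET query.

```php
<?php
declare(strict_types=1);

/**
 * Split $totalRecords into steps of at most $batchSize records.
 * Returns a list of ['offset' => int, 'limit' => int], one entry per step.
 */
function planSteps(int $totalRecords, int $batchSize): array
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('batchSize must be at least 1');
    }
    $steps = [];
    for ($offset = 0; $offset < $totalRecords; $offset += $batchSize) {
        $steps[] = [
            'offset' => $offset,
            // the last step may be shorter than the others
            'limit'  => min($batchSize, $totalRecords - $offset),
        ];
    }
    return $steps;
}

// The figures from the answer: 100000 records at 5000 records per step.
$steps = planSteps(100000, 5000);
echo count($steps), " steps\n"; // prints "20 steps"
```

The batch size is the constant to tune during testing; everything else follows from the record count at runtime.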
Mostly I find the answers to my questions on Google, but now I'm stuck.
I'm working on a scraper script, which first scrapes some usernames from a website, then gets every single detail of each user. There are two scrapers involved: the first goes through the main page, gets the first name, then gets the details from its profile page, then moves on to the next page...
The first site I'm scraping has a total of 64 names, displayed on one main page, while the second one has 4 pages with over 365 names displayed.
The first one works great, however the second one keeps giving me a 500 internal error. I've tried limiting the script to scrape only a few names, which works like a charm, so I'm more than sure that the script itself is OK!
The max_execution_time in my php.ini file is set to 1500, so I guess that's not the problem either, however there is something causing the error...
Not sure if adding a sleep command after every 10 names, for example, will solve my situation, but well, I'm trying that now!
So if any of you have any idea what would help solve this situation, I would appreciate your help!
thanks in advance,
z
support said I can raise the memory up to 4 gigabytes
Typical money gouging support answer. Save your cash & write better code because what you are doing could easily be run from the shared server of a free web hosting provider even with their draconian resource limits.
Get/update the list of users first as one job, then extract the details in smaller batches as another. Use a bulk insert (for example, a single multi-row INSERT statement) to reduce round trips to the database. It also runs much faster than looping through individual INSERTs.
Usernames and details are essentially a static list, so there is no rush to get all the data in real time. Just nibble away with a cron job fetching the details; eventually the script will catch up with new usernames being added to the incoming list, and you end up with a faster, leaner, more efficient system.
This is definitely a memory issue. One of your variables is growing past the memory limit you have defined in php.ini. If you do need to store a huge amount of data, I'd recommend writing your results to a file and/or DB at regular intervals (and then free up your vars) instead of storing them all in memory at run time.
get user details
dump to file
clear vars
repeat..
If you set your execution time to infinity and regularly dump the vars to file/db your php script should run fine for hours.
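The get / dump / clear / repeat cycle above could look roughly like this. It is a minimal sketch: scrapeInBatches() and the $getDetails callable are hypothetical stand-ins for the real scraper, and the one-JSON-object-per-line output format is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Process $usernames and append the results to $outFile every $flushEvery
 * items, so the in-memory buffer never holds more than $flushEvery entries.
 * $getDetails is a hypothetical callable standing in for the real scraper.
 * Returns the number of records written.
 */
function scrapeInBatches(array $usernames, callable $getDetails, string $outFile, int $flushEvery = 10): int
{
    $buffer = [];
    $written = 0;

    $flush = function () use (&$buffer, &$written, $outFile): void {
        if ($buffer === []) {
            return;
        }
        // dump to file
        file_put_contents($outFile, implode("\n", $buffer) . "\n", FILE_APPEND);
        $written += count($buffer);
        // clear vars
        $buffer = [];
    };

    foreach ($usernames as $name) {
        // get user details
        $buffer[] = json_encode($getDetails($name));
        if (count($buffer) >= $flushEvery) {
            $flush();
        }
    }
    $flush(); // write whatever is left after the last full batch
    return $written;
}
```

Because each batch is appended as soon as it is complete, a crash partway through loses at most one batch of work.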
Hey,
I currently have over 300+ qps on my mysql. There are roughly 12000 UIP a day / no cron on fairly heavy PHP websites. I know it's pretty hard to judge whether it is OK without seeing the website, but do you think that it is a total overkill?
What is your experience? If I optimize the scripts, do you think that I would be able to get a substantially lower qps? I mean, if I only get to 200 qps, that won't help me much. Thanks
currently have over 300+ qps on my mysql
Your website can run on a Via C3, good for you!
do you think that it is a total overkill?
That depends if it's
1 page/s doing 300 queries, yeah you got a problem.
30-60 pages/s doing 5-10 queries each, then you got no problem.
12000 UIP a day
We had a site with 50-60.000, and it ran on a Via C3 (your toaster is a datacenter compared to that crap server) but the torrent tracker used about 50% of the cpu, so only half of that tiny cpu was available to the website, which never seemed to use any significant fraction of it anyway.
What is your experience?
If you want to know whether you are going to kill your server, or whether your website is optimized, the following has close to zero information content:
UIP (unless you get facebook-like numbers)
queries/s (unless you're above 10.000) (I've seen a cheap dual core blast 20.000 qps using postgres)
But the following is extremely important :
dynamic pages/second served
number of queries per page
time duration of each query (ALL OF THEM)
server architecture
vmstat, iostat outputs
database logs
webserver logs
database's own slow_query, lock, and IO logs and statistics
You're not focusing on the right metric...
I think you are missing the point here. Whether 300+ qps is too much depends heavily on the website itself, on the number of users per second visiting it, on the background scripts that are running concurrently, and so on. You should be able to test and/or compute an average query throughput for your server, to understand whether 300+ qps is fair or not. And, by the way, it depends on what these queries are asking for (a couple of fields, or a large amount of binary data?).
Surely, if you optimize the scripts and/or reduce the number of queries, you can lower the load on the database, but without having specific data we cannot properly answer your question. To lower a 300+ qps load to under 200 qps, you should on average lower your total queries by at least 1/3rd.
Optimizing a script can do wonders. I've taken scripts that took 3 minutes before to .5 seconds after simply by optimizing how the calls were made to the server. That is an extreme situation, of course. I would focus mainly on minimizing the number of queries by combining them if possible. Maybe get creative with your queries to include more information in each hit.
And going from 300 to 200 qps is actually a huge improvement. That's a 33% drop in traffic to your server... that's significant.
You should not focus on the script, focus on the server.
You are not saying whether these 300+ queries are causing issues. If your server is not dead, there is no reason to lower the number. And if you have already done optimization, you should focus on the server. Upgrade it or buy more servers.
I have a map. On this map I want to show live data collected from several tables, some of which have astounding amounts of rows. Needless to say, fetching this information takes a long time. Also, pinging is involved. Depending on servers being offline or far away, the collection of this data could vary from 1 to 10 minutes.
I want the map to be snappy and responsive, so I've decided to add a new table to my database containing only the data the map needs. That means I need a background process to update the information in my new table continuously. Cron jobs are of course a possibility, but I want the refreshing of data to happen as soon as the previous interval has completed. And what if the number of offline IP addresses suddenly spikes and the loop takes longer to run than the interval of the Cron job?
My own solution is to create an infinite loop in PHP that runs by the command line. This loop would refresh the data for the map into MySQL as well as record other useful data such as loop time and failed attempts at pings etc, then restart after a short pause (a few seconds).
However - I'm being repeatedly told by people that a PHP script running for ever is BAD. After a while it will hog gigabytes of RAM (and other terrible things)
Partly I'm writing this question to confirm if this is in fact the case, but some tips and tricks on how I would go about writing a clean loop that doesn't leak memory (If that is possible) wouldn't go amiss. Opinions on the matter would also be appreciated.
The reply I feel sheds the most light on the issue I will mark as correct.
The loop should live in one script, which launches the actual script as a separate process on each cycle, much like cron does.
That way, even if memory leaks and uncollected memory accumulates, it will be freed after each cycle.
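A minimal sketch of that supervisor pattern, assuming the script is started from the command line. The names runCycle(), supervise() and refresh_map_data.php are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Run $workerScript in a fresh PHP process and return its exit code.
 * All memory used by the worker goes back to the OS when that process exits.
 */
function runCycle(string $workerScript): int
{
    $cmd = escapeshellarg(PHP_BINARY) . ' ' . escapeshellarg($workerScript);
    passthru($cmd, $exitCode);
    return $exitCode;
}

/**
 * Supervisor loop: run the worker, pause, repeat.
 * $maxCycles = null means "run until the process is stopped".
 * Returns the number of cycles that were run.
 */
function supervise(string $workerScript, int $pauseSeconds = 5, ?int $maxCycles = null): int
{
    $cycles = 0;
    while ($maxCycles === null || $cycles < $maxCycles) {
        $code = runCycle($workerScript);
        if ($code !== 0) {
            error_log("worker exited with code $code");
        }
        $cycles++;
        sleep($pauseSeconds); // short pause before the next refresh
    }
    return $cycles;
}

// Hypothetical usage from the command line:
// supervise(__DIR__ . '/refresh_map_data.php');
```

The supervisor itself holds almost no state, so it stays small no matter how long it runs; only the short-lived worker touches the large data sets.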
However - I'm being repeatedly told by people that a PHP script running for ever is BAD. After a while it will hog gigabytes of RAM (and other terrible things)
This used to be very true. Previous versions of PHP had horrible garbage collection, so long-running scripts could easily accidentally consume far more memory than they were actually using. PHP 5.3 introduced a new garbage collector that can understand and clean up circular references, the number one cause of "memory leaks." It's enabled by default. Check out that link for more info and pretty graphs.
As long as your code takes steps to allow variables to go out of scope at proper times and otherwise unset variables that will no longer be used, your script should not consume unnecessary amounts of memory just because it's PHP.
I don't think it's bad; as with anything that you want to run continuously, you just have to be more careful.
There are libraries out there to help you with the task. Have a look at System_Daemon, which released RC 1 just over a month ago and which allows you to "Set options like max RAM usage".
Rather than running an infinite loop I'd be tempted to go with the cron option you mention in conjunction with a database table entry or flat-file that you'd use to store a "currently active" status bit to ensure that you didn't have overlapping processes attempting to run at the same time.
Whilst I realise that this would mean a minor delay before you perform the next iteration, this is probably a better idea anyway as:
It'll let the RDBMS perform any pending low-priority updates, etc. that may well have been on hold due to the amount of activity that you've been carrying out.
Even if you neatly unset all the temporary variables you've been using, it's still possible that PHP will "leak" memory, although recent improvements (5.2 introduced a new memory management system and garbage collection was overhauled in 5.3) should hopefully mean that this is less of an issue.
In general, it'll also be easier to deal with other issues (if the DB connection temporarily goes down due to a config change and restart for example) if you use the cron approach, although in an ideal world you'd cater for such eventualities in your code anyway. (That said, the last time I checked, this was far from an ideal world.)
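As a sketch of the "currently active" guard for the cron approach: the answer suggests a status bit in a table or flat file; the variant below swaps in an advisory file lock via flock() instead, because such a lock is released automatically if the process dies. The function names and the lock file path are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Try to take an exclusive, non-blocking lock on $lockFile.
 * Returns the open handle on success (keep it open while working),
 * or null if another process already holds the lock.
 */
function acquireLock(string $lockFile)
{
    $handle = fopen($lockFile, 'c'); // create if missing, never truncate
    if ($handle === false) {
        throw new RuntimeException("cannot open $lockFile");
    }
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return null;
    }
    return $handle;
}

/** Release the lock taken by acquireLock(). */
function releaseLock($handle): void
{
    flock($handle, LOCK_UN);
    fclose($handle);
}

// Hypothetical cron entry point:
// $lock = acquireLock('/tmp/map_refresh.lock');
// if ($lock === null) {
//     exit(0); // the previous run is still active, skip this interval
// }
// try {
//     refreshMapData(); // hypothetical function doing the actual work
// } finally {
//     releaseLock($lock);
// }
```

With this guard in place, cron can fire more often than the job takes to run, and overlapping runs simply exit early.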
First I fail to see how you need a daemon script in order to provide the functionality you describe.
Cron jobs are of course a possibility, but I want the refreshing of data to happen as soon as the previous interval has completed
Neither a cron job nor a daemon is the way to solve the problem (unless the daemon becomes the data sink for the scripts). I'd spawn a dissociated process when the data is available, using a locking strategy to avoid concurrency.
Long-running PHP scripts are not intrinsically bad. The reference-counting garbage collector does not deal with all possible scenarios for cleaning up memory, but more recent implementations have a more advanced collector (a circular reference checker) which should clean up a lot more.
I have a personal web site that crawls and collects MP3s from my favorite music blogs for later listening...
The way it works is that a cron job runs a .php script once every minute that crawls the next blog in the DB. The results are put into the DB and then a second .php script crawls the collected links.
The scripts only crawl two levels down into the page so.. main page www.url.com and links on that page www.url.com/post1 www.url.com/post2
My problem is that as my collection of blogs grows, each one is only scanned once every 20 to 30 minutes, and when I add a new blog to the script there is a backlog in scanning the links, as only one is processed every minute.
Due to how PHP works it seems I cannot just allow the scripts to process more than one or a limited amount of links due to script execution times. Memory limits. Timeouts etc.
Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.
What is the best way I could speed this process up?
Is there a way I can have multiple scripts affecting the DB but write them so they do not overwrite each other but queue the results?
Is there some way to create threading in PHP so that a script can process links at its own pace?
Any ideas?
Thanks.
USE CURL MULTI!
curl_multi will let you process the pages in parallel.
http://us3.php.net/curl
Most of the time you are waiting on the websites; doing the DB insertions and HTML parsing is orders of magnitude faster.
You create a list of the blogs you want to scrape and send them out to curl_multi. Wait, and then serially process the results of all the calls. You can then do a second pass on the next level down.
http://www.developertutorials.com/blog/php/parallel-web-scraping-in-php-curl-multi-functions-375/
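A minimal sketch of the curl_multi approach described above, assuming PHP 8 with the curl extension and a list of unique URLs. The function name fetchAllParallel() is hypothetical; the curl_* calls are the standard PHP ones.

```php
<?php
declare(strict_types=1);

/**
 * Fetch several URLs in parallel with the curl_multi_* functions.
 * Returns an array mapping each URL to its body, or to null if the transfer failed.
 * URLs are assumed to be unique, because they are used as array keys.
 */
function fetchAllParallel(array $urls, int $timeoutSeconds = 30): array
{
    $multi = curl_multi_init();
    $handles = [];   // url => handle
    $urlById = [];   // object id of handle => url
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
        $urlById[spl_object_id($ch)] = $url;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect per-transfer result codes to find out which ones failed.
    $failed = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        if ($info['result'] !== CURLE_OK) {
            $failed[$urlById[spl_object_id($info['handle'])]] = true;
        }
    }

    // Serially gather the bodies once everything has finished.
    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = isset($failed[$url]) ? null : curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    return $results;
}
```

The second pass on the next level down would call the same function again with the links extracted from the first batch of results.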
A PHP sketch (PDO, MySQL/InnoDB) for running parallel scanners; scan_target() is the crawler itself and is assumed to be defined elsewhere:
function start_a_scan(PDO $db): void {
    // Start a MySQL transaction (row locks need InnoDB)
    $db->beginTransaction();
    // Get the first entry that has timed out and is not being scanned by someone
    // (and acquire an exclusive lock on the affected row)
    $row = $db->query(
        'SELECT id, url FROM scan_targets WHERE being_scanned = 0
         AND scanned_at < NOW() - INTERVAL 60 SECOND
         ORDER BY scanned_at ASC LIMIT 1 FOR UPDATE'
    )->fetch(PDO::FETCH_ASSOC);
    if ($row === false) {
        $db->rollBack(); // nothing is due for a scan
        return;
    }
    // Let everyone know we're scanning this one, so they'll keep out
    $db->prepare('UPDATE scan_targets SET being_scanned = 1 WHERE id = ?')
       ->execute([$row['id']]);
    // Commit the transaction
    $db->commit();
    // Scan
    scan_target($row['url']);
    // Update the entry state to allow it to be scanned again in the future
    $db->prepare('UPDATE scan_targets SET being_scanned = 0, scanned_at = NOW() WHERE id = ?')
       ->execute([$row['id']]);
}
You'd probably also need a 'cleaner' that periodically checks whether there are any aborted scans hanging around, and resets their state so they can be scanned again.
And then you can have several scan processes running in parallel! Yey!
cheers!
EDIT: I forgot that you need to make the first SELECT with FOR UPDATE.
This surely isn't the answer to your question, but if you're willing to learn Python I recommend you look at Scrapy, an open source web crawler/scraper framework which should fill your needs. Again, it's not PHP but Python. It is however very distributable etc... I use it myself.
Due to how PHP works it seems I cannot just allow the scripts to process more than one or a limited amount of links due to script execution times. Memory limits. Timeouts etc.
Memory limit is only a problem if your code leaks memory. You should fix that, rather than raising the memory limit. Script execution time is a security measure, which you can simply disable for your CLI scripts.
Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.
You can construct your application in such a way that instances don't overwrite each other. A typical way to do it would be to partition per site, e.g. start a separate script for each site you want to crawl.
CLI scripts are not limited by max execution times. Memory limits are not normally a problem unless you have large sets of data in memory at any one time. Timeouts should be handled gracefully by your application.
It should be possible to change your code so that you can run several instances at once - you would have to post the script for anyone to advise further though. As Peter says, you probably need to look at the design. Providing the code in a pastebin will help us to help you :)