I was just reviewing the answers to different questions to learn more. I saw an answer which says that it is bad practice in PHP to write
for($i=0;$i<count($array);$i++)
It says that calling the count function in the loop reduces the speed of the code. The discussion in the comments on this question was not clear. I want to know why it is not good practice. What should be the alternative way of doing this?
You should do this instead:
$count = count($array);
for($i=0;$i<$count;$i++)...
The reason is that if you put count($array) in the loop condition, the count function is called on every iteration, which slows the code down.
However, if you store the count in a variable first, the loop compares against a fixed value that doesn't have to be recomputed every time.
For every iteration, PHP is checking that part of the loop (the condition) to see if it should keep looping, and every time it checks it, it is calculating the length of the array.
An easy way to cache that value is...
for($i=0,$count=count($array);$i<$count;$i++) { ... }
It is probably not necessary in small loops, but could make a huge difference when iterating over thousands of items, and dependent on what function call is in the condition (and how it determines its return value).
This is also why you should use foreach() { ... } if you can: it iterates as if over a copy of the array, and you don't have to worry about caching the condition of the loop.
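To make the difference between the two loop forms visible, here is a minimal sketch. CountingList is a hypothetical helper (not from the answers above) that implements Countable only so it can record how often count() is invoked; typed properties assume PHP 7.4 or later.

```php
<?php
// Hypothetical helper: a Countable wrapper that records how often count() runs.
final class CountingList implements Countable
{
    public int $calls = 0;
    private array $items;

    public function __construct(array $items)
    {
        $this->items = $items;
    }

    public function count(): int
    {
        $this->calls++;
        return count($this->items);
    }
}

$list = new CountingList(['a', 'b', 'c', 'd', 'e']);

// count() sits in the condition, so it runs on every check of the condition.
for ($i = 0; $i < count($list); $i++) {
    // loop body
}
$uncachedCalls = $list->calls;

// count() sits in the initialiser, so it runs exactly once.
$list->calls = 0;
for ($i = 0, $n = count($list); $i < $n; $i++) {
    // loop body
}
$cachedCalls = $list->calls;

echo "uncached: $uncachedCalls, cached: $cachedCalls\n"; // uncached: 6, cached: 1
```

With five items the condition is checked six times (five passes plus the final failing check), so the uncached form calls count() six times and the cached form once.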
I heard of a database in a doctor's surgery that made exactly this mistake with a piece of software. It was tested with about 100 records, all worked fine. Within a few months, it was dealing with millions of records and was totally unusable, taking minutes to load results. The code was replaced as per the answers above, and it worked perfectly.
To think about it another way, a fairly powerful dedicated server that's not doing much else will take about 1 nanosecond to do count($array). If you had 100 for loops, each counting 1,000 rows then that's only 0.0001 of a second.
However, that's 100,000 calculations for EVERY page load. Scale that up to 1,000,000 users (and who doesn't want to have 1 million users?) each doing 10 page loads, and now you have 1,000,000,000,000 (1 trillion) calculations. That's going to put a lot of load on the server. That's 1,000 seconds (about 16.7 minutes) that your processor spends running that code.
Now increase the time it takes the machine to process the code, the number of items in the arrays, and the number of for loops in the code... you're talking of literally many trillions of processes and many hours of processing time that can be avoided by just storing the result in a variable first.
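The arithmetic in this answer can be checked with a few lines. This is only a sketch of the estimate: the 1 nanosecond per count() call is the answer's assumption, not a measurement, and the variable names are made up for illustration.

```php
<?php
// All inputs are the assumptions stated in the answer above.
$nsPerCount    = 1;        // assumed cost of one count() call, in nanoseconds
$loopsPerPage  = 100;
$rowsPerLoop   = 1000;
$users         = 1000000;
$pageLoadsEach = 10;

$callsPerPage   = $loopsPerPage * $rowsPerLoop;          // calls per page load
$secondsPerPage = $callsPerPage * $nsPerCount / 1e9;     // seconds per page load
$totalCalls     = $callsPerPage * $users * $pageLoadsEach;
$totalSeconds   = $totalCalls * $nsPerCount / 1e9;
$totalMinutes   = $totalSeconds / 60;

printf("calls per page load: %s\n", number_format($callsPerPage));
printf("seconds per page load: %.4f\n", $secondsPerPage);
printf("total calls: %s\n", number_format($totalCalls));
printf("total time: %s seconds (%.1f minutes)\n", number_format($totalSeconds), $totalMinutes);
```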
It is not good practice because as written, count($array) will be called each time through the loop. Assuming you won't be changing the size of the array within the loop (which itself would be a horrible idea), this function will always return the same value, and calling it repeatedly is redundant.
For short loops, the difference probably won't be noticeable, but it's still best to call the function once and use the computed value in the loop.
I already read the PHPBench documentation, but when testing, I still don't understand the difference between revolutions and iterations.
For example, with this simple file:
<?php
class StackOverflowBench
{
    public function benchNothing()
    {
    }
}
When I set 1000 revolutions, and only one iteration, here is my result:
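| subject | set | revs | its | mem_peak | best | mean | mode | worst | stdev | rstdev | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| benchNothing | 0 | 10000 | 1 | 2,032,328b | 10.052μs | 10.052μs | 10.052μs | 10.052μs | 0.000μs | 0.00% | 1.00x |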
The best, mean, mode and worst are always the same, which means they are based on the only iteration I made.
When I run it with 10 revolutions and still 1 iteration, I have this:
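| subject | set | revs | its | mem_peak | best | mean | mode | worst | stdev | rstdev | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| benchNothing | 0 | 10 | 1 | 2,032,328b | 10.200μs | 10.200μs | 10.200μs | 10.200μs | 0.000μs | 0.00% | 1.00x |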
which seems to mean the times calculated are not a sum of all the revolutions, but something like an average for each iteration.
If I wanted to measure the best and worst execution time for each call of the method, I'd try 1000 iterations of only 1 revolution each, but that takes way too much time. I launched it with 100 iterations of 1 revolution; here's the result:
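| subject | set | revs | its | mem_peak | best | mean | mode | worst | stdev | rstdev | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| benchNothing | 0 | 1 | 100 | 2,032,328b | 20.000μs | 25.920μs | 25.196μs | 79.000μs | 5.567μs | 21.48% | 1.00x |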
This time, the reported time is at least twice as long, and I'm wondering what I have misunderstood. I may be using this information badly (I know my last example is a wrong one).
Is it necessary to measure the best and worst of each revolution, like I want to do?
What is the purpose of iterations?
Revolution vs iteration
Let's take your example class:
class StackOverflowBench
{
    public function benchNothing()
    {
    }
}
If you have 100 revolutions and 3 iterations, this is the pseudo code that will be run:
// Iterations
for ($i = 0; $i < 3; $i++) {
    // Reset memory stats code here...
    // Start timer for iteration...
    // Create instance
    $obj = new StackOverflowBench();
    // Revolutions
    for ($j = 0; $j < 100; $j++) {
        $obj->benchNothing();
    }
    // Stop timer...
    // Call `memory_get_usage` to get memory stats
}
What does the report mean?
Almost all of the calculated stats (mem_peak, best, mean, mode, worst, stdev and rstdev) in the output are based on individual iterations and are documented here.
The diff stat is the weird one; it is documented here and mentioned elsewhere as:
the percentage difference from the lowest measurement
When you run a test, you can specify which column to report the difference on. So if you use diff_column on run time, and iteration #1 takes 10 seconds and #2 takes 20 seconds, the diff for #1 would be 1.00 (since it is the lowest) and #2 would be 2.00 since it took twice as long. (Actually, I'm not 100% sure that is the exact usage of that column.)
Measuring revolution vs iteration
Some code needs to be run thousands or millions of times in a task/request/action/etc. which is why revolutions exist. If I run a simple but critical block of code just once, a report might tell me it takes 0.000 seconds which isn't helpful. That's why some blocks of code need to have their revolution count kicked up to get a rough idea, based on possible real-world usage, how they perform under load. Array sorting algorithms are great examples of a tightly-coupled call that will happen a lot in a single request.
Other code might only do a single thing, such as making an API or database request, and for those blocks of code we need to know how many system resources they will take up as a whole. So if I make a database call that consumes 2MB of memory, and I'm expecting to have 1,000 concurrent users, those calls could take up 2GB of memory. (I'm simplifying but you should get the gist.)
If you look at my pseudo code above, you'll see that setting up each iteration is more expensive than each revolution. The revolution basically just invokes a method, but the iteration calculates memory and does instantiation-related work.
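Here is a runnable sketch of the same iteration/revolution structure. It is not PHPBench's implementation: runBenchmark is a hypothetical helper, memory tracking is left out, and dividing each iteration's elapsed time by the number of revolutions is an assumption that merely matches the results in the question. It uses hrtime(), so it needs PHP 7.3 or later.

```php
<?php
class StackOverflowBench
{
    public function benchNothing(): void
    {
    }
}

/**
 * Hypothetical helper: returns one measurement per iteration,
 * expressed in microseconds per revolution.
 */
function runBenchmark(int $iterations, int $revs): array
{
    $times = [];
    for ($i = 0; $i < $iterations; $i++) {
        $start = hrtime(true);              // start timer for this iteration (nanoseconds)
        $obj = new StackOverflowBench();    // per-iteration setup
        for ($j = 0; $j < $revs; $j++) {    // revolutions
            $obj->benchNothing();
        }
        $elapsedNs = hrtime(true) - $start; // stop timer
        $times[] = $elapsedNs / $revs / 1000;
    }
    return $times;
}

$times = runBenchmark(3, 100);
printf(
    "best %.3fμs, mean %.3fμs, worst %.3fμs\n",
    min($times),
    array_sum($times) / count($times),
    max($times)
);
```

Because each iteration yields a single number, best, mean and worst can only differ when there is more than one iteration, which is what the first two result tables in the question show.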
So, to your second-to-last question:
Is it necessary to measure the best and worst of each revolution, like I want to do?
Probably not, although there are tools out there that will tell you. You could for instance, find out how much memory was used before a method and after to determine if your code is sub-optimal, but you can also do that with PHPBench by making a 1 iteration, 1 revolution run and looking for methods with high memory.
I'd further say that if you have code that has great variance per revolution, it is almost 100% related to IO factors and not code, or it is related to the test dataset, and most probably size.
You should hopefully know all of your IO-related paths, so benchmarking the various problems related to those paths really isn't a factor of this tool.
For dataset-related problems, however, that is interesting and is a case where you'd want to know each run potentially. There, too, however, the measurements are there to know either how to fix/change your code, or to know that your code runs with a certain time complexity.
Short:
Is there a way to get the number of queries that were executed within a certain timespan (via PHP) in an efficient way?
Full:
I'm currently running an API for a frontend web application that will be used by a large number of users.
I use my own custom framework that uses models to do all the data magic and they execute mostly INSERTs and SELECTs. One function of a model can execute 5 to 10 queries on a request and another function can maybe execute 50 or more per request.
Currently, I don't have a way to check if I'm "killing" my server by executing (for example) 500 queries every second.
I also don't want to have surprises when the number of users increases to 200, 500, 1000, ... within the first week and maybe 10,000 by the end of the month.
I want to pull some sort of statistics, per hour, so that I have an idea of the average and can work on performance and efficiency before everything fails, for example by merging some queries into one "bigger" one.
Posts I've read suggested just keeping a counter within my code, but that would require more queries, just to have a number. The preferred way would be to add a selector within my hourly statistics script that returns the number of queries that have been executed for the x processed requests.
To conclude.
Are there any other options to keep track of this number?
Extra. Should I be worried and concerned about the amount of queries? They are all small ones, just for fast execution without bottlenecks or heavy calculations and I'm currently quite impressed by how blazingly fast everything is running!
Extra extra. It's on our own VPS server, so I have full access and I'm not limited to "basic" functions or commands or anything like that.
Short Answer: Use the slowlog.
Full Answer:
At the start and end of the time period, perform
SELECT VARIABLE_VALUE AS Questions
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME = 'Questions';
Then take the difference.
If the timing is not precise, also get ... WHERE VARIABLE_NAME = 'Uptime' in order to get the time (to the second).
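A minimal PHP sketch of "sample twice and take the difference". sampleStatus and queriesPerSecond are hypothetical helper names, and the sample values are made up. The sketch reads the counters with SHOW GLOBAL STATUS rather than information_schema.GLOBAL_STATUS, because as far as I know that table was removed in MySQL 8.0; sampleStatus assumes an already-open mysqli connection and is not called here.

```php
<?php
/**
 * Hypothetical helper: read Questions and Uptime in one round trip.
 * Assumes $db is an open mysqli connection.
 */
function sampleStatus(mysqli $db): array
{
    $sample = [];
    $result = $db->query(
        "SHOW GLOBAL STATUS WHERE Variable_name IN ('Questions', 'Uptime')"
    );
    while ($row = $result->fetch_assoc()) {
        $sample[$row['Variable_name']] = (int) $row['Value'];
    }
    return $sample;
}

/**
 * Hypothetical helper: average queries per second between two samples.
 */
function queriesPerSecond(array $start, array $end): float
{
    $seconds = $end['Uptime'] - $start['Uptime'];
    if ($seconds <= 0) {
        throw new InvalidArgumentException('Samples must be taken at different times.');
    }
    return ($end['Questions'] - $start['Questions']) / $seconds;
}

// Made-up sample values, one hour apart, for illustration only.
$start = ['Questions' => 120000, 'Uptime' => 3600];
$end   = ['Questions' => 150000, 'Uptime' => 7200];
$qps   = queriesPerSecond($start, $end);

printf("%.2f queries per second on average\n", $qps);
```

In an hourly statistics script you would store the previous sample and compare it with a fresh one on each run; the sampling query itself also counts towards Questions.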
But the problem... 500 very fast queries may not be as problematic as 5 very slow and complex queries. I suggest that elapsed time might be a better metric for deciding whether to kill someone.
And... Killing the process may lead to a puzzling situation wherein the naughty statement remains in "Killing" State for a long time. (See SHOW PROCESSLIST.) The reason why this may happen is that the statement needs to be undone to preserve the integrity of the data. An example is a single UPDATE statement that modifies all rows of a million-row table.
If you do a Kill in such a situation, it is probably best to let it finish.
In a different direction, if you have, say, a one-row UPDATE that does not use an index but needs a table scan, then the query will take a long time and possibly be more of a burden on the system than "500 queries". The 'cure' is likely to be adding an INDEX.
What to do about all this? Use the slowlog. Set long_query_time to some small value. The default is 10 (seconds); this is almost useless. Change it to 1 or even something smaller. Then keep an eye on the slowlog. I find it to be the best way to watch out for the system getting out of hand and to tell you what to work on fixing. More discussion: http://mysql.rjweb.org/doc.php/mysql_analysis#slow_queries_and_slowlog
Note that the best metric in the slowlog is neither the number of times a query is run, nor how long it runs, but the product of the two. This is the default for pt-query-digest. For mysqldumpslow, adding -s t gets the results sorted in that order.
I am running an algorithm in PHP which involves a lot of data. All the processing happens within a nested for loop. Strangely, the outer for loop stops working after 'X' iterations (where 'X' changes every time I run the script). It takes anywhere between 5 and 30 minutes for the script to crash, depending on 'X'. It does not throw any errors, and only does an incomplete printout of my var_dump (in the first iteration of the outer loop).
These are the precautions I took:
1. I have set the timeout limit in php.ini to be 3600sec (60mins).
2. I am printing out memory_get_usage() after every outer for loop iteration and I have verified that it is much lower than the max memory allocated to PHP.
3. I am unsetting arrays once they are used
4. I reuse variable names to limit memory within the forloop
5. I have minimal calls to my DB
I have been trying to solve this for a long time to no avail. So my question is: what can be the cause of this problem, and how do I go about debugging it? Thank you so much!
Extra: If I work with a much smaller test data size, everything works fine.
Obviously without code this is just a guess, but are you making sure to use a single connection to your database? If you are reconnecting every time you may get too many connections which could cause an error like this.
This sounds like an issue with utilisation of your server cores and a similar answer/workaround could be found here: Boost Apache2 up to 4 cores usage, running PHP
Try running your datasets in parallel.
I'm making a blocking algorithm, and I just realised that adding a timeout to such an algorithm is not so easy if it should remain precise.
Adding a timeout means that the blocking algorithm should abort after X ms, if not earlier. Now I seem to have two options:
1. Iterating time (has mistake, but is fast)
   - Check blocking condition
   - Iterate time_elapsed by 1 (which means 1e-6 sec with use of usleep)
   - Compare time_elapsed with timeout. (here is the problem I will talk about)
   - usleep(1)
2. Getting system time every iteration (slow, but precise)
I know how to do this, please do not post any answers about that.
Comparing timeout with time_elapsed
And here is what bothers me. The timeout will be in milliseconds (1e-3 s) while usleep sleeps in units of 1e-6 seconds. So my time_elapsed will be 1000 times more precise than timeout. I want to truncate the last three digits of time_elapsed (an operation equal to floor($time_elapsed/1000)) without dividing it. The division algorithm is too slow.
Summary
I want to make my variable 1000 times smaller without dividing it by 1000. I just want to get rid of the data. In binary I'd use the bit-shift operator, but I have no idea how to apply it to the decimal system.
Code sample:
Sometimes, when people on SO cannot answer the theoretical question, they really hunger for the code. Here it is:
floor($time_elapsed/1000);
I want to replace this code with something much faster. Please note that though the question itself is full of timeouts, the question title is only about truncating that data. Other users may find the solution useful for other purposes than timing.
Maybe this will help: PHP number format. It does cause rounding, though; if that is unacceptable then I don't think it's possible, because PHP is loosely typed and you can't define numbers with a particular level of precision.
Try this:
(int)($time_elapsed*0.001)
This should be a lot faster.
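For comparison, here is a small sketch of three ways to drop the last three decimal digits. The value of $timeElapsed is made up. intdiv() is not mentioned in the answers above; it exists since PHP 7 and stays in integer arithmetic, whereas the multiplication goes through a float, which may round differently for some inputs. Whether any of these is measurably faster than the others is not shown here and would need benchmarking on your own setup.

```php
<?php
$timeElapsed = 1234567; // microseconds, made-up value

// The original expression from the question.
$viaFloor = (int) floor($timeElapsed / 1000);

// The multiplication suggested in the answer; the cast truncates the float.
$viaMultiply = (int) ($timeElapsed * 0.001);

// Integer division, no floats involved (PHP 7+).
$viaIntdiv = intdiv($timeElapsed, 1000);

echo "$viaFloor $viaMultiply $viaIntdiv\n"; // 1234 1234 1234
```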
I have a seemingly harmless while loop that goes through the result set of a MySQL query and compares the id returned from MySQL to one in a very large multidimensional array:
//mysqli query here
while($row = fetch_assoc())
{
if(!in_array($row['id'], $multiDArray['dimensionOne']))
{
//do something
}
}
When the script first executes, it is running through the results at about 2-5k per second. Sometimes more, rarely less. The result set brings back 7 million rows, and the script peaks at 2.8GB of memory.
In terms of big data, this is not a lot.
The problem is, around the 600k mark, the loop starts to slow down, and by 800k, it is processing a few records a second.
In terms of server load and memory use, there are no issues.
This is behaviour I have noticed before in other scripts dealing with large data sets.
Is array seek time progressively slower as the internal pointer moves deeper?
That really depends on what happens inside the loop. I know you are convinced it's not a memory issue, but it looks like one. Programs usually get very slow when the system tries to get extra RAM by using swap. Using the hard drive is obviously very slow, and that's what you might be experiencing. It's very easy to check.
In one terminal run
vmstat 3 100
Run your script and observe vmstat. Look at IO and swap. If that is really not the case, then profile execution with Xdebug. It might be tricky because you do many iterations and this will also cause major IO.
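Separately from the swap hypothesis, note that in_array() scans the array linearly on every call, so each row costs time proportional to the size of the array. A common alternative, sketched below with made-up data standing in for $multiDArray['dimensionOne'] and the result set, is to flip the array once and test keys with isset(). This assumes the ids are integers or strings (array_flip() requires that) and that the extra memory for the flipped copy is acceptable; it is not something the question's code or the answer above confirms as the cause of the slowdown.

```php
<?php
// Made-up data for illustration only.
$knownIds = [10, 20, 30, 40]; // stands in for $multiDArray['dimensionOne']
$rowIds   = [20, 25, 40, 50]; // stands in for the ids coming back from the query

// Build the lookup once: values become keys, so membership is a hash lookup.
$lookup = array_flip($knownIds);

$missing = [];
foreach ($rowIds as $id) {
    if (!isset($lookup[$id])) {
        // do something with ids that are not in the array
        $missing[] = $id;
    }
}

print_r($missing);
```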