How to iterate through a "window" of data in a dataset? - php

I have a data set in mysql with 150 rows. I have a set of 2 for loops that run math calculations based on some user inputs and the dataset. The code does calculations for 30 row windows, and accumulates the results for each 30 row window in an array. What I mean is, I do a "cycle" of calculations on rows 0-29, then 1-30, then 2-31, etc... That would result in 120 "cycles".
Right now the for loop is set up like so (there are more fields; I just trimmed the code for the simplicity of this question).
$window = 30;
$query = "SELECT * FROM table";
$result = mysql_query($query);
while ($row = mysql_fetch_assoc($result)) {
    $data[] = array("Date" => $row['Date'], "ID" => $row['ID']);
}
for ($i = 0; $i < (count($data) - $window); $i++) {
    for ($j = 0; $j < $window; $j++) {
        //do calculations here with $data[]
        $results[$i][$j] = calculations;
    }
}
This works fine for the number of rows I have. However, I opened up the script to a larger dataset (1700 rows) with a different window (360 rows), which means far more iterations. It gave me an out-of-memory error, and some quick use of memory_get_peak_usage() showed that memory just kept climbing.
I'm starting to think that having the loops search through that data array is extremely laborious, especially when the "window" overlaps on a lot of the "cycles". Example: Cycle 0 goes through rows 0-29. Cycle 1 goes through rows 1-30. So, both of those cycles share a row of data that they need, but I'm telling PHP to look for the new data each time.
Is there a way to structure this better? I'm getting kind of lost thinking about running these concurrent cycles.

I think the array that is blowing memory will be the $results array. In your small sample it will be a two-dimensional array with 150x149 cells: array(150, 149). At 144 bytes per element that's 3,218,400 bytes, slightly over 3 MB plus remaining bucket space.
In your second, larger sample it will be array(1700, 1699). At 144 bytes per element that's 415,915,200 bytes, roughly 397 MB plus remaining bucket space, just to hold the results of your calculations.
I think you need to ask whether you really need to hold all this data. If you really do, you may have to come up with another way of storing it.
I don't see any point attempting the thousands of extra database calls, as this will only add overhead while you still have to maintain the huge list of results in an array.
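As a rough way to sanity-check the memory estimate above on your own PHP version (the per-element cost differs a lot between PHP 5 and PHP 7+; the stored value is just a stand-in):
$before  = memory_get_usage();
$results = array();
for ($i = 0; $i < 150; $i++) {
    for ($j = 0; $j < 149; $j++) {
        $results[$i][$j] = 1.0; // stand-in for one calculated value
    }
}
echo memory_get_usage() - $before; // bytes used by the 150 x 149 array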

The SQL Way
You can accomplish this by using LIMIT
$period = 30;
$cycle = 0; // offset; increment this for each cycle
$query = "SELECT * FROM table LIMIT $cycle,$period";
This will return only the rows you need for each cycle; you will need to loop and increment $cycle. The way you are doing it now is probably better, however.
This won't wrap around and grab data from the start of the table, though; you will have to add additional logic to handle that case.
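For illustration, a rough sketch of that loop using the same legacy mysql_* functions as the question (the 30-row window and the total of 150 rows come from the question; the stored value is just a placeholder):
$window    = 30;
$totalRows = 150;

// One query per cycle: fetch only the 30 rows the current window needs.
for ($cycle = 0; $cycle < $totalRows - $window; $cycle++) {
    $result = mysql_query("SELECT Date, ID FROM table LIMIT $cycle,$window");

    $windowRows = array();
    while ($row = mysql_fetch_assoc($result)) {
        $windowRows[] = $row;
    }

    // run this cycle's calculations on $windowRows and keep only the summary
    $results[$cycle] = count($windowRows); // placeholder calculation
    unset($windowRows); // free the window before the next cycle
}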

Related

Why is PHP using a lot of memory to store a query result

I use Laravel 8 to perform a query on a MySQL 8 table using the query builder directly to avoid Eloquent overhead, but I'm getting a lot of memory consumption anyway.
To show you an example, I perform the following query to select exactly 300 000 elements.
My code looks like this:
$before = memory_get_usage();
$q_coords = DB::table('coords')->selectRaw('alt, lat, lng, id')
->where('active', 1)->take(300000)->get();
$after = memory_get_usage();
echo ($after - $before);
It displays 169760384, which means something like 169 MB if I'm not mistaken.
That looks like a lot to me, because in my query I only asked for 2 floats and 2 big ints, which represents something like 4 x 8 bytes (32 bytes) per row.
And 32 x 300,000 records ≈ 9,600,000 bytes (almost 10 MB).
How is it even possible that it uses so much memory? I am very surprised.
EDIT
I also tried using PDO directly, same result.
$query = DB::connection()->getPdo()->query("select alt, lat, lng, id from coords WHERE active = 1 LIMIT 300000");
$q_coords = $query->fetchAll();
Because they are represented as PHP objects in memory, not just as their raw data.
However, there is a solution to limit the memory usage: chunk.
https://blackdeerdev.com/laravel-chunk-vs-cursor/
Chunk: it will "paginate" your query; this way you use less memory.
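For example, a minimal sketch of that chunked approach, reusing the coords query from the question (the batch size of 10,000 is arbitrary):
$total = 0;

// Process 10,000 rows at a time instead of holding all 300,000 in memory.
DB::table('coords')
    ->select('alt', 'lat', 'lng', 'id')
    ->where('active', 1)
    ->orderBy('id') // chunk() needs a deterministic order
    ->chunk(10000, function ($rows) use (&$total) {
        foreach ($rows as $row) {
            // work with $row->alt, $row->lat, $row->lng here
            $total++;
        }
    });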
In PHP each variable is handled with a specific data structure to allow dynamic typing, garbage collection and more.
You can see here a (pretty old but still OK) article: link
You can also see that arrays get more specific processing, because they need a bucket, for example to store array keys, which are treated as strings.
All of that means there is (according to the article) approximately 144 bytes of data used to store an element of an array.
Well, while I can't explain EXACTLY your result, I can still tell you that in your case you have something like this:
300 000 * 144 * 4 = 172 800 000
Which means 300,000 rows of 4 variables with 144 bytes per variable.
As you can see it's not that far from what you got, even if my maths don't take into account the improvements made in PHP 7 and other factors...
Since the Laravel query builder uses stdClass objects to represent its results, you will have a lot of overhead:
Each object stores both the values of the row and the names of each column, so your 32 bytes per row turn into a lot more.

A theoretical thought experiment

I recently came upon this theoretical problem:
There are two PHP scripts in an application;
The first script connects to the DB each day at 00:00 and inserts 1 million rows into an existing DB table;
The second script has a foreach loop iterating through the same DB table's rows. For each row it makes an API call which takes exactly 1 second to complete (request + response = 1 s), and then, independently of the content of the response, deletes one row from the DB table;
Hence, each day the DB table gains 1 million rows but only loses 1 row per second, i.e. 86,400 rows per day, and because of that it grows infinitely big;
What modification to the second script should be changed so that the DB table size does not grow infinitely big?
Does this problem sound familiar to anyone? If so, is there a 'canonical' solution to it? The first thing that crossed my mind was: if the row deletion does not depend on the API response, why not simply take the API call outside of the foreach loop? Unfortunately, I didn't have a chance to ask my question.
Any other ideas?
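For illustration, a minimal sketch of that idea (the helper names are hypothetical), with the slow API call taken out of the per-row loop so deletions are no longer throttled to one per second:
// Delete rows first, without waiting on the API for each one.
foreach ($rows as $row) {
    delete_row($row['id']); // hypothetical helper: a fast DB delete
}

// The slow (~1 s) API calls can then run afterwards, or be handed to a queue,
// so the delete rate is no longer capped at 86,400 rows per day.
foreach ($rows as $row) {
    call_api($row); // hypothetical helper: the 1-second request/response
}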

High performance PHP similarity checking on a large database

I have 30,000 rows in a database that need to be similarity checked (using similar_text or another such function).
In order to do this it will require doing 30,000^2 checks for each column.
I estimate I will be checking on average 4 columns.
This means I will have to do 3,600,000,000 checks.
What is the best (fastest, and most reliable) way to do this with PHP, bearing in mind request memory limits and time limits etc?
The server needs to still actively serve webpages at the same time as doing this.
PS. The server we are using is an 8 core Xeon 32 GB ram.
Edit:
The size of each column is normally less than 50 characters.
I guess you just need FULL TEXT search.
If that doesn't fit your needs, you have only one chance to solve this: cache the results.
That way you won't have to compare roughly 3 billion record pairs on every request.
Anyway, here is how you can do it:
$result = array();
$sql = "SELECT * FROM TABLE";
$res = mysql_query($sql);
while ($row = mysql_fetch_assoc($res)) {
    $result[] = $row; //> Append the current record
}
Now $result contains all the rows from your table.
At this point, you said you want to similar_text() all columns with each other.
To do that and cache the results, you need at least a table (as I said in the comment).
//> Start calculating the similarity
foreach ($result as $k => $v) {
    foreach ($result as $k2 => $v2) {
        //> At this point you have 2 rows, $v and $v2, containing your columns
        $similarity = 0;
        $similarity += levenshtein($v['column1'], $v2['column1']);
        $similarity += levenshtein($v['column2'], $v2['column2']);
        //> Whatever comparison you need here between columns
        //> Now you can finally store the result by inserting $similarity into a table
        mysql_query("INSERT DELAYED INTO similarity (value) VALUES ('$similarity')");
    }
}
Two things you should notice:
I used levenshtein() because it's much faster than similar_text() (note that its value is the opposite of similar_text(): the greater the value levenshtein() returns, the lower the affinity between the strings).
I used INSERT DELAYED to greatly lower the database cost.
oy... similar_text() is O(n^3)!
Do you really need a percentage similarity for each comparison, or can you just do a quick compare of the first/middle/last X bytes of the strings to narrow the field?
If you're just looking for dups, say, you can probably narrow down the number of comparisons you need to do, and that will be the most effective tack imho.
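As a rough illustration of that narrowing idea (the table layout is unknown, so $rows and the column name here are hypothetical), you could bucket rows by a short prefix and only run the expensive comparison within each bucket:
// Hypothetical sketch: group rows by the first 8 characters of a column,
// then only run levenshtein() inside each group.
$buckets = array();
foreach ($rows as $row) {
    $key = strtolower(substr($row['column1'], 0, 8));
    $buckets[$key][] = $row;
}

foreach ($buckets as $group) {
    $n = count($group);
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            $distance = levenshtein($group[$i]['column1'], $group[$j]['column1']);
            // flag likely duplicates, store $distance, etc.
        }
    }
}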

updating post views each time a post is viewed with php

I don't know if this is allowed, but I need clarification on a solution given to a question.
What is the best way to count page views in PHP/MySQL?
I have the exact same question; I just have no idea how the solution makes any sense. Here is the solution:
$sample_rate = 100;
if(mt_rand(1,$sample_rate) == 1) {
$query = mysql_query(" UPDATE posts
SET views = views + {$sample_rate}
WHERE id = '{$id}' ");
// execute query, etc
}
Any help?
Here mt_rand() generates a random number between 1 and 100, so the probability of that number being 1 is 1/100.
If it generates 1, we increase the value of views by 100.
So, the effective increase in the database per view
= (probability of increasing views) * (increase in the database)
= (1/100) * 100
= 1
So in the long run it will increase the stored count by 1 per view, on average.
This is a trade-off between accuracy and speed, since a MySQL query is far more expensive than a PHP random-number call.
For each user that views a page, a random number between 1 and 100 ($sample_rate) is generated. If the number equals 1, then the counter in the database is incremented by the sample rate (100).
This is simply a sampling technique used to save resources, and it is common on larger websites.
If you are running a smaller operation, you should simply update the database each time the page is viewed, as opposed to using a sampling method.
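For completeness, a minimal sketch of that simpler per-view update, in the same legacy mysql_* style as the question (how $id is obtained here is just an assumption):
// Increment the counter on every page view (fine for low-traffic sites).
$id = (int) $_GET['id']; // assumes the post id comes from the request
mysql_query("UPDATE posts SET views = views + 1 WHERE id = '{$id}'");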

Postgres - run a query in batches?

Is it possible to loop through a query so that if (for example) 500,000 rows are found, it'll return results for the first 10,000 and then rerun the query again?
So, what I want to do is run a query and build an array, like this:
$result = pg_query("SELECT * FROM myTable");
$i = 0;
while($row = pg_fetch_array($result) ) {
$myArray[$i]['id'] = $row['id'];
$myArray[$i]['name'] = $row['name'];
$i++;
}
But I know that there will be several hundred thousand rows, so I wanted to do it in batches of 10,000 or so... rows 0-9,999, then 10,000-19,999, etc... The reason is that I keep getting this error:
Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 3 bytes)
Which, incidentally, I don't understand how 3 bytes could exhaust 512M... So, if that's something that I can just change, that'd be great, although, still might be better to do this in batches?
Those last 3 bytes were the straw that broke the camel's back. Probably an allocation attempt in a long string of allocations leading to the failure.
Unfortunately libpq will try to fully cache result sets in memory before relinquishing control to the application. This is in addition to whatever memory you are using up in $myArray.
It has been suggested to use LIMIT ... OFFSET ... to reduce the memory envelope; this will work, but is inefficient as it could needlessly duplicate server-side sorting effort every time the query is reissued with a different offset (e.g. in order to answer LIMIT 10 OFFSET 10000, Postgres will still have to sort the entire result set, only to return rows 10000..10010.)
Instead, use DECLARE ... CURSOR to create a server-side cursor, followed by FETCH FORWARD x to fetch the next x rows. Repeat as many times as needed or until fewer than x rows are returned. Do not forget to CLOSE the cursor when you are done, even when/if an exception is raised.
Also, do not SELECT *; if you only need id and name, create your cursor FOR SELECT id, name (otherwise libpq will needlessly retrieve and cache columns you never use, increasing memory footprint and overall query time.)
Using cursors as illustrated above, libpq will hold at most x rows in memory at any one time. However, make sure you also clean up your $myArray in between FETCHes if possible or else you could still run out of memory on account of $myArray.
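A minimal sketch of that cursor approach with the pgsql extension (the connection string, cursor name and batch size of 10,000 are just examples):
$conn = pg_connect("dbname=mydb"); // assumed connection parameters

pg_query($conn, "BEGIN"); // cursors must live inside a transaction
pg_query($conn, "DECLARE my_cursor CURSOR FOR SELECT id, name FROM myTable");

do {
    $result  = pg_query($conn, "FETCH FORWARD 10000 FROM my_cursor");
    $fetched = pg_num_rows($result);

    while ($row = pg_fetch_assoc($result)) {
        // process $row['id'] and $row['name'] here, then let the row go
    }
    pg_free_result($result); // release this batch before fetching the next
} while ($fetched == 10000); // fewer than 10,000 rows means we reached the end

pg_query($conn, "CLOSE my_cursor");
pg_query($conn, "COMMIT");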
You can use LIMIT (x) and OFFSET (y)
The PostgreSQL server caches query results until you actually retrieve them, so adding them to the array in a loop like that will cause an exhaustion of memory no matter what. Either process the results one row at a time, or check the length of the array, process the results pulled so far, and then purge the array.
What the error means is that PHP is trying to allocate 3 bytes, but all the available portion of that 512MB is less than 3 bytes.
Even if you do it in batches, depending on the size of the resulting array you could still exhaust the available memory.
Perhaps you don't really need to get all the records?
