How do I paginate or batch a call to mssql_execute()? - php

When I make a call to an MSSQL database using ad-hoc SQL (for example, "SELECT foo FROM tablename"), I can give a batch size for that call. This is very useful when I expect a lot of data returned.
In my case, I have a table with over 200 million rows, and I'm getting them all. Yes, I have reasons for being a big slurpy data hog like this.
My DB guys said, "Hey, stop using ad-hoc SQL, here, use this nifty SP. It does the same thing."
So I'm using it with the mssql_execute() function call, but there's no way to specify a batch size when doing this as there is with mssql_query().
Not only do I have to do an ini_set('memory_limit', '64G'); to make this work, I also have to sweat it out while the SP call takes upwards of half an hour to run. Once it runs, I can loop on mssql_fetch_row(), no problem, but that initial call is a nail-biter!
And once I'm done, I have a process taking up 57G of memory (on a 96G box) that then takes a full hour at 80% CPU just to unwind and garbage collect. Yeah, I could kill the process, but that's a hack.
There has to be a better way!
With ad-hoc SQL, I call mssql_query() with a batch size of 10,000 rows and process them and then go back for more. I can then do something like echo "Yes, indeed, I'm on row $i right now..." and salve my paranoia that everything is running right.
So... what's the appropriate way to do this if I'm forced to use the SP that my DB guys want me to use?

Assuming the table has a primary key, I suggest you ask the DB guys to add 2 parameters to the stored procedure, one for the number of rows to be returned and another for the starting key value. Pass NULL for the initial batch and the last key value returned for each subsequent batch. This will provide efficient forward-only pagination. For example:
CREATE PROCEDURE dbo.usp_select_tablename
    @NumRows int
    , @StartKey int = NULL
AS
IF @StartKey IS NULL
BEGIN
    SELECT TOP(@NumRows) foo
    FROM tableName
    ORDER BY Key;
END
ELSE
BEGIN
    SELECT TOP(@NumRows) foo
    FROM tableName
    WHERE Key > @StartKey
    ORDER BY Key;
END;
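On the PHP side, the calling loop could then look something like the sketch below. It's only a sketch: it assumes the procedure is also changed to return the key column (here called Key) alongside foo, since the client needs a value to pass back, and it uses the classic mssql extension's mssql_init()/mssql_bind()/mssql_execute() calls. Memory stays flat at one batch, and you get your per-batch progress output back:
$link = mssql_connect('server', 'user', 'pass');
mssql_select_db('mydb', $link);

$batchSize = 10000;
$lastKey = null; // NULL asks the procedure for the first batch

do {
    $stmt = mssql_init('dbo.usp_select_tablename', $link);
    mssql_bind($stmt, '@NumRows', $batchSize, SQLINT4);
    mssql_bind($stmt, '@StartKey', $lastKey, SQLINT4, false, $lastKey === null);
    $result = mssql_execute($stmt);

    $rows = 0;
    while ($row = mssql_fetch_assoc($result)) {
        // ... process $row['foo'] ...
        $lastKey = $row['Key']; // remember the highest key seen so far
        $rows++;
    }
    mssql_free_statement($stmt);
    echo "Yes, indeed, I'm through key $lastKey right now...\n";
} while ($rows === $batchSize); // a short batch means we've reached the end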

Related

SELECT+UPDATE to avoid returning the same result

I have a cron task running every x seconds on n servers. It will "SELECT FROM table WHERE time_scheduled < CURRENT_TIME" and then perform a lengthy task on this result set.
My problem is now: how do I avoid having two separate servers perform the same task at the same time?
The idea is to update time_scheduled with a set interval after selecting it. But if two servers happen to run the query at the same time, that will be too late, no?
All ideas are welcome. It doesn't have to be a strict MySQL solution.
Thanks!
I am guessing you have a single MySQL instance, and connections from your n servers to run this processing job. You're implementing a job queue here.
The table you mention needs to use the InnoDB storage engine (or one of the other transaction-capable engines offered by Percona or MariaDB).
Do these items in your table need to be processed in batches? That is, are they somehow inter-related? Or is it possible for your server processes to handle them one-by-one? This is an important question, because you'll get better load balancing between your server processes if you can handle them individually or in small batches. Let's assume the small batches.
The idea is to prevent any server process from grabbing onto a row in your table if some other server process has that row. I've had to do this kind of thing a lot, and here is my suggestion; I know this works.
First, add an integer column to your table. Call it "working" or some such thing. Give it a default value of zero.
Second, assign a permanent id number to each server. The last part of the server's IP address (for example, if the server's IP address is 10.1.0.123, the id number is 123) is a good choice, because it's probably unique in your environment.
Then, when a server's grabbing work to do, use these two SQL queries.
UPDATE table
SET working = :this_server_id
WHERE working = 0
AND time_scheduled < CURRENT_TIME
ORDER BY time_scheduled
LIMIT 1;

SELECT table_id, whatever, whatever
FROM table
WHERE working = :this_server_id;
The first query will consistently grab a batch of rows to work on. If another server process comes in at the same time, it won't ever grab the same rows, because no process can grab rows unless working = 0. Notice that the LIMIT 1 will limit your batch size; you don't have to do this, but you can. I also threw in the ORDER BY to process first the rows that have been waiting the longest. That's probably a useful way to do things.
The second query retrieves the information you need to do the work. Don't forget to retrieve the primary key values (I called them table_id) for the rows you're working on.
Then, your server process does whatever it needs to do.
When it's done, it needs to throw the row back into the queue for a later time. To do that, the server process needs to set the time_scheduled to whatever it needs to be, then to set working = 0. So, for example, you could run this query for each row you're processing.
UPDATE table
SET time_scheduled = CURRENT_TIME + INTERVAL 5 MINUTE,
working = 0
WHERE table_id = ?table_id_from_previous_query
That's it.
Except for one thing. In the real world these queuing systems get fouled up sometimes. Server processes crash. Etc. Etc. See Murphy's Law. You need a monitoring query. That's easy in this system.
This query will give a per-server count of all jobs that are more than five minutes overdue, keyed by the server that's supposed to be working on them.
SELECT working, COUNT(*) AS stale_jobs
FROM table
WHERE time_scheduled < CURRENT_TIME - INTERVAL 5 MINUTE
GROUP BY working;
If this query comes up empty, all is well. If it comes up with lots of jobs with working set to zero, your servers aren't keeping up. If it comes up with jobs with working set to some server's id number, that server is taking a lunch break.
You can reset all the jobs assigned to the server that's gone to lunch with this query, if need be.
UPDATE table
SET working=0
WHERE working=?server_id_at_lunch
By the way, a compound index on (working, time_scheduled) will probably help this perform well.
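Putting it together, one pass of a worker process could look like the following sketch in PHP with PDO. The table name (jobs), payload and key columns, and connection details are stand-ins for your schema, and I've assumed time_scheduled is a DATETIME, hence NOW() rather than CURRENT_TIME:
$pdo = new PDO('mysql:host=localhost;dbname=queue', 'user', 'pass');
$serverId = 123; // e.g. the last octet of this server's IP address

// 1. Atomically claim the longest-waiting unclaimed job.
$claim = $pdo->prepare(
    'UPDATE jobs SET working = :id
      WHERE working = 0 AND time_scheduled < NOW()
      ORDER BY time_scheduled LIMIT 1');
$claim->execute([':id' => $serverId]);

// 2. Fetch whatever this server has claimed.
$fetch = $pdo->prepare('SELECT table_id, payload FROM jobs WHERE working = :id');
$fetch->execute([':id' => $serverId]);

foreach ($fetch->fetchAll(PDO::FETCH_ASSOC) as $job) {
    // ... do the lengthy task on $job['payload'] ...

    // 3. Reschedule the job and release the claim.
    $done = $pdo->prepare(
        'UPDATE jobs SET time_scheduled = NOW() + INTERVAL 5 MINUTE,
                         working = 0
          WHERE table_id = :tid');
    $done->execute([':tid' => $job['table_id']]);
}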

PDO PHP Postgres: slow fetching of data

I was playing with PDO on PostgreSQL 9.2.4 and was trying to fetch data from a table having millions of rows. My query returns about 100,000 rows.
I do not use any of PDOStatement's fetch functions; I simply use the result from the PDO object itself and loop through it.
But it's getting slower and slower over time. At the beginning it was fetching about 200 rows per second, but the closer it comes to the end, the slower it gets. Now, at row 30,000, it fetches only 1 row per second. Why is it getting slower?
I do this; it's pretty simple:
$dbh = new PDO("pgsql...");
$sql = "SELECT x, y FROM point WHERE name IS NOT NULL AND place IN ('area1', 'area2')";
$res = $dbh->query($sql);
$ins_sql = "INSERT INTO mypoints (x, y) VALUES ";
$ins_vals = [];
$ins_placeholders = [];
foreach ($res as $row) {
    $ins_placeholders[] = "(?,?)";
    $ins_vals = array_merge($ins_vals, [$row['x'], $row['y']]);
    printCounter();
}
// now build up one insert query using placeholders and values,
// to insert all of them in one shot into table mypoints
The function printCounter simply increments an int variable and prints it, so I can see how many rows it has already put in that array before I create my insert statement out of it. I use one-shot inserts to speed things up; that's better than doing 100,000 separate inserts.
But that foreach loop is getting slower over time. How can I increase the speed?
Is there a difference between fetch() and the simple loop method using the PDOStatement in foreach?
When I start this PHP script, it takes about 5-10 seconds for the query, so this has nothing to do with how the table is set up or whether I need indexes.
I have other tables returning 1 million rows, and I'm not sure what the best way to fetch them is. I can raise PHP's memory_limit if needed, so the most important thing for me is SPEED.
Appreciate any help.
It's not likely that the slowness is related to the database, because after the $dbh->query() call, the query is finished and the resulting rows are all in memory (they are not in PHP variables yet, but they're in memory accessible at the pgsql module level).
The more likely culprit is the array_merge operation. The array becomes larger at every loop iteration, and the operation recreates the entire array each time.
You may want to do instead:
$ins_vals[] = [$row['x'], $row['y']];
Although personally, when concerned with speed, I'd use an even simpler flat structure:
$ins_vals[] = $x;
$ins_vals[] = $y;
Another, unrelated point: you seem to be building a query with a huge number of placeholders, which is not how placeholders are normally used. To send large numbers of values to the server, the efficient way is to use COPY, possibly into a temporary table followed by server-side merge operations if it's not a plain insertion.
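For illustration, here is a sketch of the COPY route using the pgsql-specific PDO::pgsqlCopyFromArray() helper; the connection details are dummies, and since mypoints has exactly the columns x and y, the default tab-separated format lines up with the table definition:
$dbh = new PDO('pgsql:host=localhost;dbname=mydb', 'user', 'pass');

$lines = [];
$res = $dbh->query("SELECT x, y FROM point
                    WHERE name IS NOT NULL AND place IN ('area1', 'area2')");
while ($row = $res->fetch(PDO::FETCH_ASSOC)) {
    $lines[] = $row['x'] . "\t" . $row['y']; // one tab-separated line per row
}

// One COPY sends the whole batch to the server; no placeholders needed.
$dbh->pgsqlCopyFromArray('mypoints', $lines);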
I don't know why, but using the fetch() method instead, doing the $ins_vals filling like this:
$ins_vals[] = $x;
$ins_vals[] = $y;
and using beginTransaction and commit now makes my script unbelievably fast.
Now it takes only about 1 minute to add my 100,000 points.
I think both array_merge and that "ugly" looping through the PDOStatement slowed down my script.
And why the heck did someone downvote my question? Are you punishing me for my missing knowledge? Thanks.
OK, I wrote a class where I set the SQL and then add the values for each row with a method call. Whenever it reaches a specific limit, it starts a transaction, prepares the statement with as many placeholders as there are values, executes it with the array holding all the values, and commits.
This seems to be fast enough; at least it doesn't get slower anymore.
For some reason it's faster to add values in a flat structure, as Daniel suggested. That's enough for me.
Sometimes it's good to have a function do one step of the insertion, because when the function returns, all the memory used in the function is freed, so your memory usage stays low.
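A rough sketch of a class along those lines, with all names invented, might look like this:
class BatchInserter
{
    private $dbh;
    private $sqlPrefix; // e.g. "INSERT INTO mypoints (x, y) VALUES "
    private $cols;      // values per row, e.g. 2
    private $limit;     // rows per flush
    private $vals = [];
    private $rows = 0;

    public function __construct(PDO $dbh, $sqlPrefix, $cols, $limit = 1000)
    {
        $this->dbh = $dbh;
        $this->sqlPrefix = $sqlPrefix;
        $this->cols = $cols;
        $this->limit = $limit;
    }

    public function addRow(array $values)
    {
        foreach ($values as $v) {
            $this->vals[] = $v; // flat structure, as Daniel suggested
        }
        if (++$this->rows >= $this->limit) {
            $this->flush();
        }
    }

    public function flush()
    {
        if ($this->rows === 0) {
            return;
        }
        // Build "(?,?),(?,?),..." for however many rows are buffered.
        $rowPh = '(' . rtrim(str_repeat('?,', $this->cols), ',') . ')';
        $sql = $this->sqlPrefix . rtrim(str_repeat($rowPh . ',', $this->rows), ',');
        $this->dbh->beginTransaction();
        $this->dbh->prepare($sql)->execute($this->vals);
        $this->dbh->commit();
        $this->vals = [];
        $this->rows = 0;
    }
}
In the loop you'd call $inserter->addRow([$row['x'], $row['y']]) and, after the loop, one final $inserter->flush() for the leftover rows.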

Large mysql query in PHP

I have a large table of about 14 million rows. Each row contains a block of text. I also have another table with about 6,000 rows, where each row has a word and six numerical values for that word. I need to take each block of text from the first table, find the number of times each word from the second table appears in it, then calculate the mean of the six values for each block of text and store it.
I have a Debian machine with an i7 and 8 GB of memory, which should be able to handle it. At the moment I am using the PHP substr_count() function, but PHP just doesn't feel like the right solution for this problem. Other than working around time-out and memory-limit problems, does anyone have a better way of doing this? Is it possible to use just SQL? If not, what would be the best way to execute my PHP without overloading the server?
Do each record from the 'big' table one at a time. Load that single 'block' of text into your program (PHP or whatever), do the searching and calculation, then save the appropriate values wherever you need them.
Do each record as its own transaction, in isolation from the rest. If you are interrupted, use the saved values to determine where to start again.
Once you are done with the existing records, you only need to do this in the future when you enter or update a record, so it's much easier. You just need to take your big bite right now to get the data updated.
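A sketch of that one-record-at-a-time pass, with invented schema names (texts, words, scores and their columns) and reading the "mean" as the count-weighted mean of the six values, could look like this:
$read  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$write = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// The 6,000-word table fits in memory easily; load it once.
$words = $write->query('SELECT word, v1, v2, v3, v4, v5, v6 FROM words')
               ->fetchAll(PDO::FETCH_ASSOC);

// Stream the 14 million rows instead of buffering them in PHP. The
// second connection ($write) exists because MySQL won't run another
// query while an unbuffered result set is still open.
$read->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
$texts = $read->query('SELECT id, body FROM texts');
$save  = $write->prepare(
    'REPLACE INTO scores (text_id, m1, m2, m3, m4, m5, m6)
     VALUES (?, ?, ?, ?, ?, ?, ?)');

while ($row = $texts->fetch(PDO::FETCH_ASSOC)) {
    $sum = array_fill(0, 6, 0);
    $hits = 0;
    foreach ($words as $w) {
        $n = substr_count($row['body'], $w['word']);
        if ($n === 0) {
            continue;
        }
        $hits += $n;
        for ($i = 0; $i < 6; $i++) {
            $sum[$i] += $n * $w['v' . ($i + 1)];
        }
    }
    $means = $hits > 0
        ? array_map(function ($s) use ($hits) { return $s / $hits; }, $sum)
        : $sum; // no matches: store zeros
    $save->execute(array_merge([$row['id']], $means));
}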
What are you trying to do exactly? If you are trying to create something like a search engine with a weighting function, you should perhaps drop that and instead use the MySQL full-text search functions and indexes that are already there. If you still need this specific solution, you can of course do it completely in SQL. You can do it in one query, or with a trigger that runs each time a row is inserted or updated. You won't be able to get this done properly with PHP without jumping through a lot of hoops.
To give you a specific answer, we indeed would need more information about the queries, data structures and what you are trying to do.
Redesign it:
If size on disk is not important, just join the tables into one.
Put the 6,000-row table into memory (a MEMORY table) and back it up every hour:
INSERT IGNORE INTO back.table SELECT * FROM my.table;
Create your "own" index in the big table, e.g. add a "name index" column to the big table holding the id of the matching word row.
More info about the query is needed to find a full solution.

PHP's in_array vs. MySQL SELECT

I need to check if some integer value is already in my database (which is growing all the time). And it should be done several thousand times in one script. I'm considering two alternatives:
1. Read all those numbers from the MySQL database into a PHP array and, every time I need to check a number, use the in_array() function.
2. Every time I need to check a number, just execute something like SELECT number FROM table WHERE number='#' LIMIT 1.
On the one hand, searching an array stored in RAM should be faster than querying MySQL every time (as I have mentioned, these checks are performed about a thousand times during one script execution). On the other hand, the DB is growing, and that array may become quite big, which may slow things down.
Question is - which way is faster or better by some other aspects?
I have to agree that #2 is your best choice. When performing a query with LIMIT 1, MySQL stops the query as soon as it finds the first match. Make sure the columns you intend to search by are indexed.
It sounds like you are duplicating a unique constraint in code...
CREATE TABLE MyTable (
    SomeUniqueValue INT NOT NULL,
    CONSTRAINT MyUniqueKey UNIQUE (SomeUniqueValue)
);
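With the constraint in place you can even skip the lookup and just attempt the insert; a sketch using MySQL's INSERT IGNORE (assuming a PDO connection):
$stmt = $pdo->prepare('INSERT IGNORE INTO MyTable (SomeUniqueValue) VALUES (?)');
$stmt->execute([42]);
$isNew = $stmt->rowCount() === 1; // 0 means the value was already there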
How does the number of times you need to check compare with the number of values stored in the database? If it's 1:100, then you're probably better off searching the database each time; if it's (some amount) less, then preloading the list will be faster. What happened when you tested it?
However, even if the ratio is low enough for loading the full table to be faster, it will gobble up memory and could, as a result, make everything else run more slowly.
So I would recommend not loading it all into memory. But if you can, batch the checks up to minimise the number of round trips to the database.
C.
Querying the database is the best option. First, you said the database is growing, which means new values are being added to the table, whereas with in_array you would be reading old values. Second, you might exhaust the RAM allotted to PHP with a very large amount of data. Third, MySQL has its own query optimizer and other optimizations, which makes it a far better choice compared to PHP.
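For what it's worth, option #2 boils down to one prepared statement reused for every check; the table and column names below are placeholders:
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$stmt = $pdo->prepare('SELECT 1 FROM numbers WHERE number = ? LIMIT 1');

function numberExists(PDOStatement $stmt, $n)
{
    $stmt->execute([$n]);
    return $stmt->fetchColumn() !== false;
}

if (!numberExists($stmt, 42)) {
    // ... 42 is new; insert it, do work, etc. ...
}
This way the several thousand checks cost one parse plus thousands of cheap indexed lookups, and nothing large is held in PHP memory.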

MYSQL and the LIMIT clause

I was wondering if adding a LIMIT 1 to a query would speed up the processing?
For example...
I have a query that will most of the time return 1 result, but will occasionally return 10's, 100's or even 1000's of records. But I will only ever want the first record.
Would the LIMIT 1 speed things up or make no difference?
I know I could use GROUP BY to return 1 result but that would just add more computation.
It depends if you have an ORDER BY. An ORDER BY needs the entire result set anyway, so it can be ordered.
If you don't have any ORDER BY it should run faster.
In all cases it will run at least a bit faster, since the entire result set needn't be sent, of course.
Yep, it will!
But to be sure, it should only take a second to add LIMIT 1 to the end of your SQL statement, so why not give it a shot and see?
It all depends on the query itself. If you're doing an indexed lookup (WHERE indexedColumn = somevalue) or a sort on an indexed column (with no WHERE clause), then LIMIT 1 will really speed it up. If you have joins or multiple WHERE/ORDER clauses, then things get really complicated really quickly. But the major thing to take away: using LIMIT 1 will NEVER slow down a query. It will sometimes speed it up, but it will never slow it down.
Now, there is another issue when dealing with PHP. By default, PHP will buffer the entire result set before returning from the query (mysql_query or mysqli->query will only return after all the records are downloaded). So while the query time may be altered little by the LIMIT 1, the time and memory that PHP uses to buffer the result are significant. Imagine each row has 200 bytes of data and your query returns 10,000 rows. That means PHP has to allocate an additional 2 MB of memory (actually closer to 10 MB with the overhead of PHP's variable structures) that you'll never use. Allocating memory can be very expensive, so the general rule is to only ever allocate what you need (or think you will need). Downloading 10,000 rows when you only want one is just wasteful.
Combine these two effects, and you can see why, if you only want one row, you should ALWAYS use LIMIT 1.
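Both effects are easy to see in a small mysqli sketch; the table and connection details here are made up:
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

// If you only ever want one row, say so: the server stops early and
// PHP buffers a single row instead of thousands.
$res = $db->query("SELECT id, name FROM items WHERE category = 'x' LIMIT 1");
$row = $res->fetch_assoc();

// If you genuinely need a big result, MYSQLI_USE_RESULT streams rows
// as you fetch them instead of buffering the whole set up front.
$res = $db->query('SELECT id, name FROM items', MYSQLI_USE_RESULT);
while ($row = $res->fetch_assoc()) {
    // ... handle one row at a time ...
}
$res->free();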
Here is some relevant documentation on the subject from the MySQL site.
It seems it can speed things up in different ways, depending on the other parts of the query. I'm not sure whether it helps when you have no ORDER, GROUP or HAVING clauses, aside from letting the server stop immediately rather than hand back every single result row (which may be a big enough speed-up if you are getting back 100,000 records).
This is the fundamental purpose of the LIMIT clause, actually. :P If you know how many results you want, you should ALWAYS specify that number in a LIMIT, not only because it is faster (unless you are doing an ORDER BY), but also to be more efficient with memory in your PHP script.
Note: I'm assuming you're using MySQL with PHP since you added PHP to the tags. If you're just selecting from MySQL directly (outside of a scripting language) the advantage to using LIMIT when also using ORDER BY is purely to make the results easier to manage.
