Update Current Row in MySQL Loop - php

I have a MySQL table with over 16 million rows and there is no primary key. Whenever I try to add one, my connection crashes. I have tried adding one as an auto increment in PHPMyAdmin and in shell but the connection is always lost after about 10 minutes.
What I would like to do is loop through the table's rows in PHP, limiting the number of results per query, and add an auto-incremented ID number to each returned row. Since each query would touch fewer rows, the load on MySQL would be reduced and I wouldn't lose my connection.
I want to do something like
SELECT * FROM MYTABLE LIMIT 1000001, 2000000;
Then, in the loop, update the current row
UPDATE (current row) SET ID='$i++'
How do I do this?
Note: the original data was given to me as a txt file. I don't know if there are duplicates, but I cannot eliminate any rows. Also, no rows will be added; this table is going to be used only for querying purposes. When I added indexes, however, there were no problems.

I suspect you are trying to use phpMyAdmin to add the index. As handy as it is, it is a PHP script and is limited to the same resources as any PHP script on your server: typically 30-60 seconds of run time and a limited amount of RAM.
I suggest you work out the MySQL statement you need to add the index, then SSH into the server and run it from the command-line MySQL client.

If you don't have duplicate rows, then the following approach might shed some light:
Suppose you want to update the auto-incremented value for the first 10,000 rows.
UPDATE MYTABLE
INNER JOIN
    (SELECT
        *,
        @rn := @rn + 1 AS row_number
     FROM MYTABLE, (SELECT @rn := 0) var
     ORDER BY SOME_OF_YOUR_FIELD
     LIMIT 0, 10000) t
ON t.field1 = MYTABLE.field1 AND t.field2 = MYTABLE.field2 AND ... t.fieldN = MYTABLE.fieldN
SET MYTABLE.ID = t.row_number;
For the next 10,000 rows you just need to change two things:
(SELECT @rn := 10000) var
LIMIT 10000, 10000
Repeat.
Note: ORDER BY SOME_OF_YOUR_FIELD is important; otherwise you would get the rows in an unpredictable order. Since you need to repeat the process, it is better to create a function that takes the limit and offset as parameters and does this job.
Explanation:
The idea is to build a derived table (t) containing N rows and assign a unique row number to each of them. Then inner join your main table MYTABLE with this derived table t, matching on all the fields, and update the ID field of the corresponding row in MYTABLE with the incremented value (in this case row_number).
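Since the process has to be repeated many times, here is a minimal PHP sketch of a driver loop for the batched UPDATE above (the connection details, batch size, total row count, and the ORDER BY / JOIN columns are placeholder assumptions to adapt to your table):
<?php
// Hypothetical driver loop: run the batched UPDATE once per 10,000-row window.
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');
$batch  = 10000;
$total  = 16000000; // approximate number of rows in MYTABLE

for ($offset = 0; $offset < $total; $offset += $batch) {
    $sql = "
        UPDATE MYTABLE
        INNER JOIN (
            SELECT *, @rn := @rn + 1 AS row_number
            FROM MYTABLE, (SELECT @rn := $offset) var
            ORDER BY SOME_OF_YOUR_FIELD
            LIMIT $offset, $batch
        ) t ON t.field1 = MYTABLE.field1 AND t.field2 = MYTABLE.field2
        SET MYTABLE.ID = t.row_number";

    if (!$mysqli->query($sql)) {
        die('Batch starting at offset ' . $offset . ' failed: ' . $mysqli->error);
    }
}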
Another idea:
You could use multithreading in PHP to do this job:
Create N threads.
Assign each thread a non-overlapping region (1 to 10000, 10001 to 20000, etc.) like the above query.
Caution: the query gets slower at higher offsets.

Related

count() takes a lot of time when using a WHERE clause in MySQL

The table has approximately 100,000 records (tuples). Without the WHERE clause the query takes only a few milliseconds, whereas it takes 4-5 seconds with the WHERE clause.
SELECT COUNT(DISTINCT id) FROM tablename WHERE shippable = '1'
I also tried this one, but it takes more time than the previous one.
SELECT count(rowsss) FROM (SELECT count(*) as rowsss FROM tablename WHERE shippable = '1' GROUP BY id) as T
(The question included a screenshot of the EXPLAIN output for this query.)
If you need a filter you could use an index on shippable, e.g.:
create index shippable_ixd on tablename (shippable);
This way the scan of the table is limited to matching values and a full-table scan is avoided.
Since you also need the column id, you could alternatively try a composite index:
create index shippable_ixd on tablename (shippable, id);
The SQL optimizer should then retrieve the needed information directly from the index. The composite index (with id, which is redundant for the WHERE clause) is useful because the SQL engine can retrieve all the data the query needs just by scanning the index, avoiding access to the data in the table. This technique is used frequently for tuning database queries.
When you check a condition, both values should be of the same type; then the query will execute faster.
SELECT count(rowsss) FROM (SELECT count(*) as rowsss FROM tablename WHERE CAST(shippable AS CHAR) = '1' GROUP BY id) as T

MYSQL from the 20th row to the last row [duplicate]

I would like to construct a query that displays all the results in a table, but is offset by 5 from the start of the table. As far as I can tell, MySQL's LIMIT requires a limit as well as an offset. Is there any way to do this?
From the MySQL Manual on LIMIT:
To retrieve all rows from a certain offset up to the end of the result set, you can use some large number for the second parameter. This statement retrieves all rows from the 96th row to the last:
SELECT * FROM tbl LIMIT 95, 18446744073709551615;
As you mentioned, LIMIT is required, so you need to use the biggest limit possible, which is 18446744073709551615 (the maximum unsigned BIGINT):
SELECT * FROM somewhere LIMIT 18446744073709551610 OFFSET 5
As noted in other answers, MySQL suggests using 18446744073709551615 as the number of records in the limit, but consider this: What would you do if you got 18,446,744,073,709,551,615 records back? In fact, what would you do if you got 1,000,000,000 records?
Maybe you do want more than one billion records, but my point is that there is some limit on the number you want, and it is less than 18 quintillion. For the sake of stability, optimization, and possibly usability, I would suggest putting some meaningful limit on the query. This would also reduce confusion for anyone who has never seen that magical-looking number, and has the added benefit of communicating at least how many records you are willing to handle at once.
If you really must get all 18 quintillion records from your database, maybe what you really want is to grab them in increments of 100 million and loop 184 billion times.
Another approach would be to select an auto-incremented column and then filter it using HAVING.
SET @a := 0;
SELECT @a := @a + 1 AS counter, table.* FROM table
HAVING counter > 4
But I would probably stick with the high limit approach.
As others mentioned, per the MySQL manual you can use the maximum value of an unsigned BIGINT, that awful number 18446744073709551615. But to make it a little bit less messy you can use the tilde "~" bitwise operator:
LIMIT 95, ~0
It works as a bitwise negation: the result of ~0 is 18446744073709551615.
You can use a MySQL statement with LIMIT:
START TRANSACTION;
SET @my_offset = 5;
SET @rows = (SELECT COUNT(*) FROM my_table);
PREPARE statement FROM 'SELECT * FROM my_table LIMIT ? OFFSET ?';
EXECUTE statement USING @rows, @my_offset;
COMMIT;
Tested in MySQL 5.5.44. Thus, we can avoid the insertion of the number 18446744073709551615.
Note: the transaction makes sure that the variable @rows is in agreement with the table considered at the time the statement is executed.
I ran into a very similar issue when practicing LC#1321, in which I had to select all the dates but skip the first six.
I achieved this in MySQL with the help of ROW_NUMBER() window function and subquery. For example, the following query returns all the results with the first five rows skipped:
SELECT
    fieldname1,
    fieldname2
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER() row_num
    FROM
        mytable
) tmp
WHERE
    row_num > 5;
You may need to add some more logic in the subquery, especially in OVER(), to fit your needs. In addition, the RANK()/DENSE_RANK() window functions may be used instead of ROW_NUMBER() depending on your actual offset logic.
Reference:
MySQL 8.0 Reference Manual - ROW_NUMBER()
Just today I was reading about the best way to get huge amounts of data (more than a million rows) from a MySQL table. One way is, as suggested, using LIMIT x,y where x is the offset and y is the number of rows you want returned. However, as I found out, it isn't the most efficient way to do so. If you have an auto-increment column, you can just as easily use a SELECT statement with a WHERE clause saying from which record you'd like to start.
For example,
SELECT * FROM table_name WHERE id > x;
It seems that MySQL fetches all the results when you use LIMIT and then only shows you the records that fit within the offset and limit: not the best for performance.
Source: an answer to this question on the MySQL Forums. Just take note, the question is about six years old.
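A minimal PHP sketch of that keyset idea, assuming an auto-increment id column (the table name big_table, the connection details, and the chunk size are illustrative assumptions):
<?php
// Page through a large table by remembering the last id seen instead of
// using LIMIT with a growing offset.
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');
$lastId = 0;
$chunk  = 1000;

do {
    $result = $mysqli->query(
        "SELECT * FROM big_table WHERE id > $lastId ORDER BY id LIMIT $chunk"
    );
    $count = $result->num_rows;
    while ($row = $result->fetch_assoc()) {
        // ... process $row ...
        $lastId = (int) $row['id'];
    }
} while ($count === $chunk);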
I know that this is old, but I didn't see a similar response, so this is the solution I would use.
First, I would execute a count query on the table to see how many records exist. This query is fast and normally the execution time is negligible. Something like:
SELECT COUNT(*) FROM table_name;
Then I would build my query using the result I got from count as my limit (since that is the maximum number of rows the table could possibly return). Something like:
SELECT * FROM table_name LIMIT count_result OFFSET desired_offset;
Or possibly something like:
SELECT * FROM table_name LIMIT desired_offset, count_result;
Of course, if necessary, you could subtract desired_offset from count_result to get an actual, accurate value to supply as the limit. Passing the "18446744073709551610" value just doesn't make sense if I can actually determine an appropriate limit to provide.
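A short PHP sketch of that count-then-limit idea (my_table, the connection details, and the offset of 5 are placeholder assumptions):
<?php
// Use the exact row count to compute the limit, so no magic number is needed.
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');
$offset = 5;

$total = (int) $mysqli->query('SELECT COUNT(*) FROM my_table')->fetch_row()[0];
$limit = max($total - $offset, 0); // rows remaining after the offset

$result = $mysqli->query("SELECT * FROM my_table LIMIT $offset, $limit");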
WHERE .... AND id > <YOUROFFSET>
id can be any auto-incremented or unique numerical column you have...

Getting random results from large tables

I'm trying to get 4 random results from a table that holds approx 7 million records. Additionally, I also want to get 4 random records from the same table that are filtered by category.
Now, as you would imagine, doing a random sort on a table this large causes the queries to take a few seconds, which is not ideal.
One other method I thought of for the non-filtered result set would be to have PHP select some random numbers between 1 and 7,000,000 or so and then do an IN(...) in the query to grab only those rows - and yes, I know that this method has a caveat in that you may get fewer than 4 rows if a record with one of those ids no longer exists.
However, the above method obviously will not work with the category filtering, as PHP doesn't know which record numbers belong to which category and hence cannot pick the record numbers to select.
Are there any better ways I can do this? The only way I can think of would be to store the record IDs for each category in another table, select random results from that, and then fetch only those record IDs from the main table in a secondary query; but I'm sure there is a better way!?
You could of course use the RAND() function in a query with a LIMIT and a WHERE (for the category). That, however, as you pointed out, entails a scan of the table, which takes time, especially in your case due to the volume of data.
Your other alternative, again as you pointed out, to store id/category_id in another table might prove a bit faster, but again there has to be a LIMIT and a WHERE on that table, which will also contain the same number of records as the master table.
A different approach (if applicable) would be to have a table per category and store the IDs in it. If your categories are fixed or do not change that often, then you should be able to use that approach. In that case you effectively remove the WHERE clause, and getting a RAND() with a LIMIT on each category table would be faster, since each category table contains only a subset of the records from your main table.
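A sketch of what that per-category layout could look like in PHP (the category_42_ids naming scheme, main_table, and the connection details are assumptions for illustration):
<?php
// Pick 4 random ids from the small per-category id table, then fetch the
// full rows from the main table by primary key.
$mysqli   = new mysqli('localhost', 'user', 'pass', 'mydb');
$category = 42;

$ids = [];
$result = $mysqli->query("SELECT id FROM category_{$category}_ids ORDER BY RAND() LIMIT 4");
while ($row = $result->fetch_assoc()) {
    $ids[] = (int) $row['id'];
}

if ($ids) {
    $rows = $mysqli->query(
        'SELECT * FROM main_table WHERE id IN (' . implode(',', $ids) . ')'
    )->fetch_all(MYSQLI_ASSOC);
}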
Some other alternatives would be to use a key/value pair database just for that operation. MongoDb or Google AppEngine can help with that and are really fast.
You could also go towards the approach of a Master/Slave in your MySQL. The slave replicates content in real time but when you need to perform the expensive query you query the slave instead of the master, thus passing the load to a different machine.
Finally you could go with Sphinx which is a lot easier to install and maintain. You can then treat each of those category queries as a document search and let Sphinx randomize the results. This way you offset this expensive operation to a different layer and let MySQL continue with other operations.
Just some issues to consider.
Working off your random number approach
Get the max id in the database.
Create a temp table to store your matches.
Loop n times doing the following
Generate a random number between 1 and maxId
Get the first record with a record Id greater than the random number and insert it into your temp table
Your temp table now contains your random results; a PHP sketch of this loop follows below.
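A rough PHP sketch of those steps (my_table, the ID column, and a target of four rows are placeholder assumptions; add the category filter to the WHERE clauses if you need it):
<?php
// Probe random ids and keep the first row found at or above each probe.
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');
$n      = 4;

$maxId = (int) $mysqli->query('SELECT MAX(ID) FROM my_table')->fetch_row()[0];

$mysqli->query('CREATE TEMPORARY TABLE random_picks LIKE my_table');
for ($i = 0; $i < $n; $i++) {
    $r = rand(1, $maxId);
    // first record with an ID at or above the random probe; IGNORE skips
    // the occasional duplicate pick
    $mysqli->query(
        "INSERT IGNORE INTO random_picks
         SELECT * FROM my_table WHERE ID >= $r ORDER BY ID LIMIT 1"
    );
}
$rows = $mysqli->query('SELECT * FROM random_picks')->fetch_all(MYSQLI_ASSOC);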
Or you could dynamically generate SQL with a UNION to do the query in one step.
SELECT * FROM myTable WHERE ID > RAND() AND Category = zzz LIMIT 1
UNION
SELECT * FROM myTable WHERE ID > RAND() AND Category = zzz LIMIT 1
UNION
SELECT * FROM myTable WHERE ID > RAND() AND Category = zzz LIMIT 1
UNION
SELECT * FROM myTable WHERE ID > RAND() AND Category = zzz LIMIT 1
Note: my SQL may not be valid, as I'm not a MySQL guy, but the theory should be sound.
First you need to get the number of rows ... something like this:
select count(1) from tbl where category = ?
then select a random number
$offset = rand(0, $rowsNum - 1);
and select a row with offset
select * FROM tbl LIMIT $offset, 1
This way you avoid missing IDs. The only problem is that you need to run the second query several times. A UNION may help in this case.
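A small PHP sketch of that approach (tbl, the category value, and a target of four rows are placeholder assumptions; duplicate picks are possible):
<?php
// Count the matching rows once, then pick a random offset per row wanted.
$mysqli   = new mysqli('localhost', 'user', 'pass', 'mydb');
$category = 123;
$wanted   = 4;

$rowsNum = (int) $mysqli->query(
    "SELECT COUNT(1) FROM tbl WHERE category = $category"
)->fetch_row()[0];

$rows = [];
for ($i = 0; $i < $wanted && $rowsNum > 0; $i++) {
    $offset = rand(0, $rowsNum - 1); // valid offsets are 0 .. rowsNum - 1
    $result = $mysqli->query(
        "SELECT * FROM tbl WHERE category = $category LIMIT $offset, 1"
    );
    if ($row = $result->fetch_assoc()) {
        $rows[] = $row;
    }
}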
For MySQL you can use RAND():
SELECT column FROM table
ORDER BY RAND()
LIMIT 4

Does MySQL load the whole table into cache every time?

Let's say I have a table with, say, 1 million rows, where the first column is a primary key.
Then, if I run the following:
SELECT * FROM table WHERE id='tomato117' LIMIT 1
Does the whole table get put into the cache (thereby causing the query to slow down as more and more rows are added), or does the number of rows in the table not matter, since the query uses the primary key?
edit: (added limit 1)
If id is defined as the primary key, there is only one record with the value tomato117, so the LIMIT is not useful.
Using SELECT * will make MySQL read from disk, because it is unlikely that all columns are stored in the index (MySQL cannot serve the query from the index alone). In theory, this will affect performance.
However, your SQL matches the query cache conditions, so MySQL will store the result in the query cache for subsequent use.
If your query cache size is huge, MySQL will keep storing results in the query cache until the memory is full.
This comes with a cost: if there is an update on your table, query cache invalidation will be more expensive for MySQL.
http://www.mysqlperformanceblog.com/2007/03/23/beware-large-query_cache-sizes/
http://www.mysqlperformanceblog.com/2006/06/09/why-mysql-could-be-slow-with-large-tables/
Nothing of the sort.
It will only fetch the row you selected and perhaps a few other blocks. They will remain in cache until something pushes them out.
By cache, I refer to the InnoDB buffer pool, not the query cache, which should probably be off anyway.
SELECT * FROM table WHERE id = 'tomato117' LIMIT 1
When tomato117 is found, it stops searching; if you don't set LIMIT 1 it will search until the end of the table. tomato117 can be the second row, and it will still scan the remaining 1,000,000 rows for another tomato117.
http://forge.mysql.com/wiki/Top10SQLPerformanceTips
Showing rows 0 - 0 (1 total, Query took 0.0159 sec)
SELECT *
FROM `forum_posts`
WHERE pid = 643154
LIMIT 0 , 30
Showing rows 0 - 0 (1 total, Query took 0.0003 sec)
SELECT *
FROM `forum_posts`
WHERE pid = 643154
LIMIT 1
The table is about 1 GB, 600,000+ rows.
If you add the word EXPLAIN before the word SELECT, it will show you a table with a summary of how many rows it's reading instead of the normal results.
If your table has an index on the id column (including if it's set as the primary key), the engine will be able to jump straight to the exact row (or rows, for a non-unique index) and read only the minimal amount of data. If there's no index, it will need to read the whole table.
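For example, a quick way to look at that estimate from PHP (my_table and the connection details are assumptions):
<?php
// Run EXPLAIN on the query and print how many rows MySQL expects to examine
// and which index it plans to use.
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');

$plan = $mysqli->query(
    "EXPLAIN SELECT * FROM my_table WHERE id = 'tomato117' LIMIT 1"
)->fetch_assoc();

echo 'rows examined (estimate): ' . $plan['rows'] . "\n";
echo 'index used: ' . ($plan['key'] ?? 'none') . "\n";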

Returning random rows from mysql database without using rand()

I would like to be able to pull back 15 or so records from a database. I've seen that using WHERE id = rand() can cause performance issues as my database gets larger. All solutions I've seen are geared towards selecting a single random record. I would like to get multiples.
Does anyone know of an efficient way to do this for large databases?
edit:
Further Edit and Testing:
I made a fairly simple table on a new database using MyISAM. I gave it three fields: autokey (an unsigned auto-increment key), bigdata (a large blob), and somemore (a medium int). I then filled the table with random data and ran a series of queries using Navicat. Here are the results:
Query 1: select * from test order by rand() limit 15
Query 2:
select *
from test
join
    (select round(rand()*(select max(autokey) from test)) as val from test limit 15) as rnd
on rnd.val = test.autokey;
(I tried both select and select distinct and it made no discernible difference)
and:
Query 3 (I only ran this on the second test):
SELECT  *
FROM    (
        SELECT  @cnt := COUNT(*) + 1,
                @lim := 10
        FROM    test
        ) vars
STRAIGHT_JOIN
        (
        SELECT  r.*,
                @lim := @lim - 1
        FROM    test r
        WHERE   (@cnt := @cnt - 1)
                AND RAND(20090301) < @lim / @cnt
        ) i
ROWS:       QUERY 1:   QUERY 2:   QUERY 3:
2,060,922   2.977s     0.002s     N/A
3,043,406   5.334s     0.001s     1.260s
I would like to do more rows so I can see how query 3 scales, but at the moment, it seems as though the clear winner is query 2.
Before I wrap up this testing and declare an answer, and while I have all this data and the test environment set up, can anyone recommend any further testing?
Try:
select * from table order by rand() limit 15
Another (and possibly more efficient) way would be to join against a set of random values. This should work if there's some contiguous integer key in the table. Here is how I would do it in Postgres (my MySQL is a bit rusty):
select * from table join
    (select (random()*maxid)::integer as val from generate_series(1,15)) as rnd
on rnd.val = table.id;
where maxid is the highest id in the table. If id has an index, then this would mean only 15 index lookups, so it's very fast.
UPDATE:
Looks like there is no such thing as generate_series in MySQL. My fault. We don't actually need it:
select *
from table
join
    -- this just returns 15 random numbers.
    -- I need `table` here only to produce rows for rand()
    (select round(rand()*(select max(id) from table)) as val from table limit 15) as rnd
on rnd.val = table.id;
P.S. If I don't want duplicates returned, I can use (select distinct [...]) in the random generator expression.
Update: Check out the accepted answer in this question. It's pure MySQL and even deals with even distribution.
The problem with id = rand() or anything comparable in PHP is that you can't be sure whether that particular ID still exists. Therefore, you need to work with LIMIT, and that can become slow for large amounts of data.
As an alternative to that, you could try using a loop in PHP.
What the loop does is:
Create a random integer using rand(), between 0 and the number of records in the database
Query the database to check whether a record with that ID exists
If it exists, add the number to an array
If it doesn't, go back to step 1
End the loop when the array of random numbers contains the desired number of elements
This method could cause a lot of queries on a fragmented table, but they should be pretty fast to execute. It may be faster than the LIMIT/RAND() approach in certain situations; a PHP sketch of the loop follows below.
The LIMIT method, as outlined by @Luther, is certainly the simplest code-wise.
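A minimal sketch of that loop, assuming a numeric id column and a table called my_table (I use MAX(id) rather than the row count so that gaps near the top remain reachable, and I assume the table holds at least 15 rows):
<?php
// Draw random ids until 15 existing ones are collected, then fetch the rows.
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');
$wanted = 15;

$maxId = (int) $mysqli->query('SELECT MAX(id) FROM my_table')->fetch_row()[0];

$found = [];
while (count($found) < $wanted) {
    $candidate = rand(1, $maxId);
    $exists = $mysqli->query("SELECT id FROM my_table WHERE id = $candidate");
    if ($exists->num_rows > 0 && !in_array($candidate, $found, true)) {
        $found[] = $candidate; // the record exists, keep its id
    }
}

// Fetch the 15 rows in one go.
$rows = $mysqli->query(
    'SELECT * FROM my_table WHERE id IN (' . implode(',', $found) . ')'
)->fetch_all(MYSQLI_ASSOC);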
You could run a query for all the results (or however many you limit it to), then use mysqli_fetch_all followed by:
shuffle($a);
$a = array_slice($a, 0, 15);
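A self-contained sketch of that (my_table and the connection details are placeholders; this only makes sense when the full result set fits comfortably in memory):
<?php
// Fetch everything, shuffle in PHP, keep 15 rows.
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');

$a = $mysqli->query('SELECT * FROM my_table')->fetch_all(MYSQLI_ASSOC);

shuffle($a);
$a = array_slice($a, 0, 15);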
For a large dataset doing
select * from table order by rand() limit 15
can be quite time and memory consuming.
If your data records happen to be numbered, you can put an index on the numbering column and do a
select * from table where no >= rand() limit 15
Or even better, do the random number generation in your application and do
select * from table where no >= $rand and no <= $rand+15
If your data doesn't change too often, it might be worth adding such a numbering column to make the selection efficient.
Assuming MySQL supports nested queries and that operations on the primary key are fast, I'd try something like
select * from table where id in (select id from table order by rand() limit 15)
