Improving an UPDATE query on a large MySQL database - PHP

I'm trying to update my rather large database (nearly 3 million rows) with the following query:
$length = strlen($this);
$query = "UPDATE database
SET row_to_update='1'
WHERE row='{$this}'
AND row_length='{$length}'
LIMIT 1";
The script reads words ($this) from a file (quite a lot of them) and searches for a match. If one is found, it sets row_to_update to 1 (none is the default).
Each row_length already contains the length of the corresponding cell, which I thought might speed up the process significantly. Sadly, it didn't.
It manages only ~30k queries in 8h. That's slow, to say the least!
Is there any way I could improve this bit of inefficient code?

Try collecting a batch of the values you're looking for and use:
UPDATE table SET row_to_update='1' WHERE row IN ({$my_values});
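A minimal PHP sketch of that batching idea, assuming a PDO connection in $pdo and reusing the table/column names from the question (which are themselves placeholders):

<?php
// Sketch only: assumes $pdo is an existing PDO connection and that the
// table/column names match the original query.
$words = file('words.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

// One query per 1000 words instead of one query per word.
foreach (array_chunk($words, 1000) as $chunk) {
    $placeholders = implode(',', array_fill(0, count($chunk), '?'));
    $stmt = $pdo->prepare(
        "UPDATE `database` SET row_to_update = '1' WHERE `row` IN ($placeholders)"
    );
    $stmt->execute($chunk);
}
?>

As a side effect, the prepared statement also removes the string interpolation of the original query, which was an SQL injection risk.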
You can use EXPLAIN <your_query> (and EXPLAIN EXTENDED ...) to check whether it uses indexes, and then adjust the query or create indexes to speed it up. Experiment with a SELECT that has the same WHERE conditions that way.
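For example (a sketch reusing the names from the question; the index name is invented):

EXPLAIN SELECT * FROM `database`
WHERE `row` = 'someword' AND row_length = '8';

-- If the "key" column of the EXPLAIN output is NULL, the lookup scans the
-- whole table; an index on the searched column should help:
ALTER TABLE `database` ADD INDEX idx_row (`row`);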
Much more you can get using:
SET profiling = 1;
<your query goes here>
SHOW PROFILES;
SHOW PROFILE FOR QUERY 1;
Be careful with it if you're not on a dev environment.
Consider as well filling a temp table with the values you're interested in and using it this way:
UPDATE table SET row_to_update='1' WHERE row in (SELECT values FROM my_temp_table);
Once you get that working, you can improve it to:
UPDATE table INNER JOIN temp_table ON table.row = temp_table.row SET row_to_update = '1';
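A sketch of that temp-table setup (the file path and the column definition are assumptions):

-- Collect the words once, with an index so the join stays cheap:
CREATE TEMPORARY TABLE temp_table (
    `row` VARCHAR(255) NOT NULL,
    PRIMARY KEY (`row`)
);
LOAD DATA LOCAL INFILE '/path/to/words.txt' INTO TABLE temp_table;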
EXAMPLES:
As you asked for examples: let's say the example table represents your original one, with a lot of data in it. In this example I'll use just 4 rows:
mysql> select * from example;
+----+------+
| id | data |
+----+------+
| 1 | a |
| 2 | b |
| 3 | c |
| 4 | d |
+----+------+
4 rows in set (0.00 sec)
Let's say you're looking for the ids of rows that have data = 'a', 'b', or 'c'.
You can do this in 3 ways:
1) SELECT ... IN (list)
mysql> select id from example where data in ('a', 'b', 'c');
+----+
| id |
+----+
| 1 |
| 2 |
| 3 |
+----+
3 rows in set (0.00 sec)
2) SELECT ... IN (SELECT ... FROM temp_table)
mysql> select * from temp_table;
+----+------+
| id | data |
+----+------+
| 10 | foo~ |
| 11 | a |
| 12 | bar |
| 13 | baz |
| 14 | b |
| 15 | c |
+----+------+
6 rows in set (0.00 sec)
mysql> select id from example where data in (SELECT data from temp_table);
[..]
3 rows in set (0.00 sec)
3) SELECT ... INNER JOIN temp_table ...
mysql> select example.id from example inner join temp_table on example.data = temp_table.data;
[..]
3 rows in set (0.01 sec)
And when you're ready, use UPDATE with the same conditions to apply the changes you like.
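For instance, applying pattern 1 to the example table as an UPDATE might look like this (the found column is added here purely for the demonstration):

-- Flag matched rows instead of just selecting them:
ALTER TABLE example ADD COLUMN found TINYINT NOT NULL DEFAULT 0;
UPDATE example SET found = 1 WHERE data IN ('a', 'b', 'c');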

Related

Remove duplicated records in sphinx without doing setGroupBy?

Hey, I am new to Sphinx search.
In my query I retrieve course_ids. All the courses belong to a theme_id, but some of them can belong to more than 1 theme, so some of them are duplicated.
I set limits on my query to display results 1-20, then 21-40... so, 20 at a time.
But sometimes those 20 results contain duplicates. For example, if results 21-40 contain 3 duplicates, I want to remove them and fill the 3 empty spots with the next 3 results, so that the query returns 21-43 instead, then 44-64...
I tried setGroupBy(), and it worked, but I don't want the courses sorted by course_id; I want the order from setSortMode(). With that, though, the course_ids are duplicated again.
How can I remove the duplicated records and keep the sorting?
Any help would be appreciated. Thanks
setGroupBy has a third, optional argument to specify the final sort order.
So you can group by (for example) course_id but still do the final sorting by weight (or whatever), rather than the default '@group desc'.
$client->setSortMode( SPH_SORT_RELEVANCE );
$client->setGroupBy( 'course_id', SPH_GROUPBY_ATTR, "@weight desc" );
Still use setSortMode, which determines WHICH of the rows from each course is kept, i.e. showing the highest-ranked one first, which mimics the overall sorting by weight.
Looks like what you are looking for is exactly what REMOVE_REPEATS() does. I'm not sure it's available in the programming-language clients; you'll probably need to use SphinxQL instead, which is recommended anyway, as the clients are outdated and missing a lot of functionality.
Here's an example:
Without REMOVE_REPEATS():
MySQL [(none)]> select * from testrt;
+------+------+
| id | gid |
+------+------+
| 1 | 10 |
| 2 | 10 |
| 3 | 20 |
| 4 | 30 |
| 5 | 30 |
+------+------+
5 rows in set (0.04 sec)
With REMOVE_REPEATS() by gid:
MySQL [(none)]> select remove_repeats((select * from testrt), gid, 0,10);
+------+------+
| id | gid |
+------+------+
| 1 | 10 |
| 3 | 20 |
| 4 | 30 |
+------+------+
3 rows in set (0.06 sec)

MySQL returns no values when NULL values are present in the subquery

I have two tables, table1 and table2. Their contents are as follows:
mysql> select id from table1;
+------+
| id |
+------+
| 1 |
| 2 |
| 3 |
| 4 |
+------+
4 rows in set (0.00 sec)
mysql> select id from table2;
+------+
| id |
+------+
| 301 |
| 2 |
| NULL |
+------+
3 rows in set (0.00 sec)
When I run the query below in the MySQL console, it always returns an empty set:
select id
from table1
where id
not in (select id from table2);
Empty set (0.00 sec)
Is there a reason why IN and NOT IN misbehave when there are NULL values in the subquery?
I've solved it by using the query below:
select id
from table1
where id
not in (select id from table2 where id is not null);
+------+
| id |
+------+
| 1 |
| 3 |
| 4 |
+------+
3 rows in set (0.00 sec)
I just want to know why.
Thanks in advance :)
edit: This question tries to clear some of the air, but not enough.
That is how not in works. I recommend that you use not exists instead:
select id
from table1 t1
where not exists (select 1 from table2 t2 where t1.id = t2.id);
Why does not in work this way? It is because of the semantics of not in. Remember that NULL in SQL (usually) means an unknown value. Hence, if you have a list of "(1, 2)" you can say that "3" is not in the list. If you have "(1, 2, unknown)" you cannot say that. Instead, the result is NULL, which is treated as false.
NOT EXISTS does not behave this way, so I find it more convenient to use.
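A quick way to convince yourself of these semantics in the console:

SELECT 3 NOT IN (1, 2);        -- 1 (true)
SELECT 3 NOT IN (1, 2, NULL);  -- NULL, which the WHERE clause treats like false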

WHERE vs HAVING in generated queries

I know this title is overused, but it seems my kind of question has not been answered yet.
So, the problem is like this:
I have a table structure made of four tables (tables, rows, cols, values) that I use to recreate the behavior of the information_schema (in a way).
In PHP I am generating queries to retrieve the data, and the result still looks like a normal table:
SELECT
(SELECT value FROM `values` WHERE `col` = "3" and row = rows.id) as "col1",
(SELECT value FROM `values` WHERE `col` = "4" and row = rows.id) as "col2"
FROM rows WHERE `table` = (SELECT id FROM tables WHERE name = 'table1')
HAVING (col2 LIKE "%4%")
OR
SELECT * FROM
(SELECT
(SELECT value FROM `values` WHERE `col` = "3" and row = rows.id) as "col1",
(SELECT value FROM `values` WHERE `col` = "4" and row = rows.id) as "col2"
FROM rows WHERE `table` = (SELECT id FROM tables WHERE name = 'table1')) d
WHERE col2 LIKE "%4%"
Note that the part where I define the columns of the result is generated by a PHP script. Why I am doing this is less important; the point is that I want to extend this query-generating algorithm for broader use.
And so we get to the core problem: I have to decide whether to generate a WHERE or a HAVING part for the query. I know when to use each; the problem is that my algorithm doesn't, and I would have to make a few extra checks for this. The two queries above are equivalent: I can always put any query in a sub-query, give it an alias, and use WHERE on the new derived table. But I wonder whether I will run into performance problems, or whether this will come back to bite me in some unexpected way.
I know how they both work, and that WHERE is supposed to be faster, but that is why I came here to ask. Hopefully I have made myself understood; please excuse my English and the long, useless turns of phrase.
EDIT 1
I already know the difference between the two and all that it implies. My only dilemma is this: using custom columns from other tables, with variable numbers and sizes, while trying to achieve the same result as a normally created table, means I must use HAVING to filter the derived table's columns. At the same time, I have the option to wrap it all in a subquery and use WHERE normally, which will probably create a temporary table that is filtered afterwards. Will this affect performance on a large database? Unfortunately I cannot test this right now, as I cannot afford to fill the database with over a billion entries (something like this: 1 billion entries in the rows table and 5 billion in the values table, as every row has 5 columns, plus 5 rows in the cols table and 1 row in the tables table, roughly 6,000,000,006 entries in total).
Right now my database looks like this.
The `tables` table:
+----+--------+-----------+------+
| id | name | title | dets |
+----+--------+-----------+------+
| 1 | table1 | Table One | |
+----+--------+-----------+------+
The `cols` table:
+----+-------+------+
| id | table | name |
+----+-------+------+
| 3 | 1 | col1 |
| 4 | 1 | col2 |
+----+-------+------+
where `table` is a foreign key from table `tables`
The `rows` table:
+----+-------+-------+
| id | table | extra |
+----+-------+-------+
| 1 | 1 | |
| 2 | 1 | |
+----+-------+-------+
where `table` is a foreign key from table `tables`
The `values` table:
+----+-----+-----+----------+
| id | row | col | value |
+----+-----+-----+----------+
| 1 | 1 | 3 | 13 |
| 2 | 1 | 4 | 14 |
| 6 | 2 | 4 | 24 |
| 9 | 2 | 3 | asdfghjk |
+----+-----+-----+----------+
where `row` is a foreign key from table `rows`
where `col` is a foreign key from table `cols`
EDIT 2
The conditions are there just for demonstration purposes!
EDIT 3
For only two rows there already seems to be a difference between the two: the query using HAVING takes 0.0008s, and the one using WHERE takes 0.0014-0.0019s. I wonder whether this will affect performance for large numbers of rows and columns.
EDIT 4
The result of the two queries is identical, and that is:
+----------+------+
| col1 | col2 |
+----------+------+
| 13 | 14 |
| asdfghjk | 24 |
+----------+------+
HAVING is specifically for filtering after GROUP BY; WHERE is for ordinary conditional filtering. See also WHERE vs HAVING.
I believe the having clause would be faster in this case, as you're defining specific values, as opposed to reading through the values and looking for a match.
See: http://database-programmer.blogspot.com/2008/04/group-by-having-sum-avg-and-count.html
Basically, WHERE filters out rows before they are passed to an aggregate function, while HAVING filters the aggregate function's results.
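A small illustration of that distinction (the orders table and its columns are invented for the example):

SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE status = 'paid'        -- filters rows before SUM() is computed
GROUP BY customer_id
HAVING SUM(amount) > 100;    -- filters groups after aggregation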
You could do it like this:
WHERE col2 IN (14, 24)
Your condition WHERE col2 LIKE "%4%" is a bad idea: what about col2 = 34? It would also be selected.

delete duplicate rows that have blob text / mediumtext mysql

I have seen lots of posts on deleting duplicate rows using SQL commands, but I need to detect the duplicates on a mediumtext column.
I keep getting Error Code: 1170. BLOB/TEXT column used in key specification without a key length from solutions such as:
ALTER IGNORE TABLE foobar ADD UNIQUE (title, SID)
My table is simple: I need to check for duplicates in mytext; id is unique and AUTO_INCREMENT.
As a note, the table has about a million rows, and all attempts keep timing out. I would need a solution that performs actions in batches, such as WHERE id>0 AND id<100.
Also, I am using MySQL Workbench on Amazon's RDS.
From a table like this:
+----+-------+-------+--------+---------+
| id | fname | lname | mytext | morevar |
+----+-------+-------+--------+---------+
|  1 | joe   | min   | abc    |     123 |
|  2 | joe   | min   | abc    |     123 |
|  3 | mar   | kam   | def    |     789 |
|  4 | kel   | smi   | ghi    |     456 |
+----+-------+-------+--------+---------+
I would like to end up with a table like this:
+----+-------+-------+--------+---------+
| id | fname | lname | mytext | morevar |
+----+-------+-------+--------+---------+
|  1 | joe   | min   | abc    |     123 |
|  3 | mar   | kam   | def    |     789 |
|  4 | kel   | smi   | ghi    |     456 |
+----+-------+-------+--------+---------+
UPDATE: Forgot to mention this is on Amazon RDS using MySQL Workbench.
My table is very large and I keep getting Error Code: 1205. Lock wait timeout exceeded from this SQL command:
DELETE n1 FROM names n1, names n2 WHERE n1.id > n2.id AND n1.name = n2.name
Also, if anyone else is having issues with MySQL Workbench timing out, the fix is to go to Preferences -> SQL Editor and give a bigger value to this parameter:
DBMS connection read time out (in seconds)
OPTION #1: Delete all duplicate records, leaving one of each (e.g. the one with max(id)):
DELETE
FROM yourTable
WHERE id NOT IN
(
SELECT MAX(id)
FROM yourTable
GROUP BY mytext
)
You might prefer using min(id).
Depending on the engine used, this may not work and instead give you Error Code: 1093. You can't specify target table 'yourTable' for update in FROM clause. Why? Because deleting one record may change the result of the WHERE condition, i.e. max(id) changes value while the statement runs.
In this case, you could try using another subquery as a temporary table:
DELETE
FROM yourTable
WHERE id NOT IN
(
SELECT MAXID FROM
(
SELECT MAX(id) as MAXID
FROM yourTable
GROUP BY mytext
) as temp_table
)
OPTION #2: Use a temporary table like in this example or:
First, create a temp table with the max ids (note that SELECT ... INTO a table is not MySQL syntax, so use CREATE TABLE ... AS SELECT):
CREATE TEMPORARY TABLE tmpTable AS
SELECT MAX(id) AS MAXID
FROM yourTable
GROUP BY mytext;
Then execute the delete:
DELETE
FROM yourTable
WHERE id NOT IN
(
SELECT MAXID FROM tmpTable
);
How about this? It will delete all duplicate records from the table, keeping the row with the lowest id in each group (without the id comparison, every row would match itself and everything would be deleted):
DELETE t1 FROM foobar t1, foobar t2 WHERE t1.mytext = t2.mytext AND t1.id > t2.id
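Since the question mentions lock wait timeouts and asks for a batched approach, here is a sketch of the same self-join delete restricted to an id window; run it repeatedly, moving the window forward, so that each DELETE stays small (the window size is arbitrary):

DELETE t1
FROM foobar t1
INNER JOIN foobar t2
    ON t1.mytext = t2.mytext
   AND t1.id > t2.id
WHERE t1.id > 0 AND t1.id <= 100000;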

PHP - Doctrine ORM not able to handle bit(1) types correctly?

UPDATE: I have filed a bug in Doctrine about this: http://www.doctrine-project.org/jira/browse/DC-400
I have the following Doctrine schema:
---
TestTable:
columns:
bitty: bit(1)
I have created the database and table for this. I then have the following PHP code:
$obj1 = new TestTable();
$obj1['bitty'] = b'0';
$obj1->save();
$obj2 = new TestTable();
$obj2['bitty'] = 0;
$obj2->save();
Clearly my attempt is to save the bit value 0 in the bitty column.
However, after running this PHP code I get the following odd results:
mysql> select * from test_table;
+----+-------+
| id | bitty |
+----+-------+
| 1 | |
| 2 | |
+----+-------+
2 rows in set (0.00 sec)
mysql> select * from test_table where bitty = 1;
+----+-------+
| id | bitty |
+----+-------+
| 1 | |
| 2 | |
+----+-------+
2 rows in set (0.00 sec)
mysql> select * from test_table where bitty = 0;
Empty set (0.00 sec)
Those boxes are the 0x01 character, i.e. Doctrine has set the value to 1, not 0.
However, I can insert 0s into that table directly from MySQL:
mysql> insert into test_table values (4, b'0');
Query OK, 1 row affected (0.00 sec)
mysql> select * from test_table where bitty = 0;
+----+-------+
| id | bitty |
+----+-------+
| 4 | |
+----+-------+
1 row in set (0.00 sec)
What's going on? Is this a bug in Doctrine?
There is nothing in the Doctrine documentation that says bit is a legal type.
Doctrine does know the bit type - at least if you're using MySQL and generate Doctrine models from the existing tables.
I tried to read a few bit columns and dump the resulting objects. Basically the bit value returned is either \0 or \1, instead of 0 and 1 as I expected.
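Given that the raw values come back as \0 and \1, one possible workaround, untested and assuming Doctrine writes the byte through unchanged, is to assign the raw byte rather than the character '0':

<?php
// Hypothetical workaround, not confirmed Doctrine behaviour: store the raw
// byte 0x00 / 0x01 instead of the character '0' / '1'.
$obj = new TestTable();
$obj['bitty'] = chr(0);  // chr(0) === "\0"
$obj->save();
?>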
