Finding mySQL duplicates, then merging data - php

I have a MySQL database with a tad under 2 million rows. The database is non-interactive, so efficiency isn't key.
The (simplified) structure I have is:
`id` int(11) NOT NULL auto_increment
`category` varchar(64) NOT NULL
`productListing` varchar(256) NOT NULL
The problem I would like to solve: I want to find rows that are duplicates on the productListing field, merge their category values into a single row, and delete the duplicates.
So given the following data:
+----+-----------+---------------------------+
| id | category | productListing |
+----+-----------+---------------------------+
| 1 | Category1 | productGroup1 |
| 2 | Category2 | productGroup1 |
| 3 | Category3 | anotherGroup9 |
+----+-----------+---------------------------+
What I want to end up with is:
+----+----------------------+---------------------------+
| id | category | productListing |
+----+----------------------+---------------------------+
| 1 | Category1,Category2 | productGroup1 |
| 3 | Category3 | anotherGroup9 |
+----+----------------------+---------------------------+
What's the most efficient way to do this, either in a pure MySQL query or in PHP?

I think you're looking for GROUP_CONCAT:
SELECT GROUP_CONCAT(category), productListing
FROM YourTable
GROUP BY productListing
I would create a new table, inserting the updated values, delete the old one and rename the new table to the old one's name:
CREATE TABLE new_YourTable SELECT GROUP_CONCAT(...;
DROP TABLE YourTable;
RENAME TABLE new_YourTable TO YourTable;
-- don't forget to add triggers, indexes, foreign keys, etc. to new table

SELECT MIN(id), GROUP_CONCAT(category ORDER BY id SEPARATOR ','), productListing
FROM mytable
GROUP BY
productListing
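To see the two answers combined end to end, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made purely so the example runs without a server). The table name `products` is hypothetical. SQLite writes the separator as a second argument, `GROUP_CONCAT(category, ',')`, and does not guarantee concatenation order, whereas MySQL accepts `ORDER BY id SEPARATOR ','` inside the call.

```python
import sqlite3

# SQLite stands in for MySQL here; "products" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        category TEXT NOT NULL,
        productListing TEXT NOT NULL
    );
    INSERT INTO products (id, category, productListing) VALUES
        (1, 'Category1', 'productGroup1'),
        (2, 'Category2', 'productGroup1'),
        (3, 'Category3', 'anotherGroup9');

    -- Build the merged table: lowest id per listing, categories comma-joined.
    CREATE TABLE new_products AS
        SELECT MIN(id) AS id,
               GROUP_CONCAT(category, ',') AS category,
               productListing
        FROM products
        GROUP BY productListing;

    -- Swap the merged table in place of the original.
    DROP TABLE products;
    ALTER TABLE new_products RENAME TO products;
""")

merged = conn.execute(
    "SELECT id, category, productListing FROM products ORDER BY id"
).fetchall()
for row in merged:
    print(row)
```

As in the answer above, a table created from a SELECT does not carry over keys or indexes, so those would need to be re-added after the swap.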

Related

How can i update the Records included in another query using SUM and GROUP By in mysql

I am having a mysql table
content_votes_tmp
+------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+------------------+------+-----+---------+----------------+
| up | int(11) | NO | MUL | 0 | |
| down | int(11) | NO | | 0 | |
| ip | int(10) unsigned | NO | | NULL | |
| content | int(11) | NO | | NULL | |
| datetime | datetime | NO | | NULL | |
| is_updated | tinyint(2) | NO | | 0 | |
| record_num | int(11) | NO | PRI | NULL | auto_increment |
+------------+------------------+------+-----+---------+----------------+
Surfers can vote posts (i.e. content) up or down. Every time a vote is given, much like a rating, a record is inserted into the table along with other data such as the IP and the content id.
Now I am trying to create a cron job script in PHP which will compute SUM(up) and SUM(down) of the votes,
like this:
mysqli_query($con, "SELECT SUM(up) as up_count, SUM(down) as down_count, content FROM `content_votes_tmp` WHERE is_updated = 0 GROUP by content")
and then, using a while loop in PHP, I can update the main table for each content id.
But I would like the records that were part of the SUM to be marked as updated, i.e. SET is_updated = 1, so the same values won't get summed again and again.
How can I achieve this with a MySQL query while working on the same data set, given that new records are being inserted into the table every second/millisecond?
Another way I can think of is to fetch all the non-updated records, do the sum in PHP, and then update every record.
The simplest way would probably be a temporary table. Create one with the record_num values you want to select from;
CREATE TEMPORARY TABLE temp_table AS
SELECT record_num FROM `content_votes_tmp` WHERE is_updated = 0;
Then do your calculation using the temp table;
SELECT SUM(up) as up_count, SUM(down) as down_count, content
FROM `content_votes_tmp`
WHERE record_num IN (SELECT record_num FROM temp_table)
GROUP by content
Once you've received your result, you can set is_updated on the values you just calculated over;
UPDATE `content_votes_tmp`
SET is_updated = 1
WHERE record_num IN (SELECT record_num FROM temp_table)
If you want to reuse the connection to do the same thing again, you'll need to drop the temporary table before creating it again, but if you just want to do it a single time in a page, it will disappear automatically when the database is disconnected at the end of the page.
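The reason the temporary table helps is that it freezes the set of rows for one run, so a vote inserted halfway through is neither summed nor marked by mistake. A minimal sketch of that sequence follows, using Python's sqlite3 module as a stand-in for MySQL and PHP (an assumption so it runs anywhere); the sample votes are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content_votes_tmp (
        record_num INTEGER PRIMARY KEY AUTOINCREMENT,
        up INTEGER NOT NULL DEFAULT 0,
        down INTEGER NOT NULL DEFAULT 0,
        content INTEGER NOT NULL,
        is_updated INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO content_votes_tmp (up, down, content) VALUES
        (1, 0, 10), (1, 0, 10), (0, 1, 10), (1, 0, 20);
""")

# 1. Freeze the set of rows this run is responsible for.
conn.execute("""
    CREATE TEMPORARY TABLE temp_table AS
    SELECT record_num FROM content_votes_tmp WHERE is_updated = 0
""")

# A vote arriving now is outside the snapshot and waits for the next run.
conn.execute("INSERT INTO content_votes_tmp (up, down, content) VALUES (1, 0, 20)")

# 2. Sum only the snapshotted rows.
totals = conn.execute("""
    SELECT content, SUM(up) AS up_count, SUM(down) AS down_count
    FROM content_votes_tmp
    WHERE record_num IN (SELECT record_num FROM temp_table)
    GROUP BY content
    ORDER BY content
""").fetchall()

# 3. Mark exactly those rows as processed, then discard the snapshot.
conn.execute("""
    UPDATE content_votes_tmp SET is_updated = 1
    WHERE record_num IN (SELECT record_num FROM temp_table)
""")
conn.execute("DROP TABLE temp_table")

pending = conn.execute(
    "SELECT COUNT(*) FROM content_votes_tmp WHERE is_updated = 0"
).fetchone()[0]
print(totals, pending)
```

The late vote stays at is_updated = 0, so the next cron run picks it up.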

removing duplicate row from mysql where value equals something

I've been all the way to the end of the internet and I'm properly stuck. Whilst I can find partial answers, I'm unable to modify them to make them work.
I have a table named myfetcher like:
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| fid_id | int(11) | NO | PRI | NULL | auto_increment |
| linksetid | varchar(200) | NO | | NULL | |
| url | varchar(200) | NO | | NULL | |
+-------------+--------------+------+-----+---------+----------------+
The url field sometimes contains dupes, but rather than removing all duplicates in the table, I need to remove only those where the field linksetid is equal to X.
The SQL below removes all duplicates in the table, which is not what I want. I only want to remove the duplicates within a given value of the field linksetid. I know I'm doing something wrong, just not sure what it is.
DELETE FROM myfetcher USING myfetcher, myfetcher as vtable
WHERE (myfetcher.fid>vtable.fid)
AND (myfetcher.url=vtable.url)
AND (myfetcher.linksetid='$linkuniq')
This deletes only records with linksetid = X. The first EXISTS handles the case where the duplicates all have linksetid = X: only the one with the lowest fid remains. The second EXISTS handles the case where the same url also appears in a record with linksetid <> X: then all records with linksetid = X for that url are removed.
NOTE: the following query works in Oracle or MSSQL. For MySQL, use the workaround further down:
DELETE FROM myfetcher
where (myfetcher.linksetid='$linkuniq')
and
(
exists
(select t.fid from myfetcher t where
t.fid<myfetcher.fid
and
t.url=myfetcher.url
and
t.linksetid='$linkuniq')
or
exists
(select t.fid from myfetcher t where
t.url=myfetcher.url
and
t.linksetid<>'$linkuniq')
)
In MySQL you can't use a subquery that reads from the target table in an UPDATE or DELETE statement, so for MySQL you can use the following script. SqlFiddle demo:
create table to_delete_tmp as
select fid from myfetcher as tmain
where (tmain.linksetid='$linkuniq')
and
(
exists
(select t.fid from myfetcher t where
t.fid<tmain.fid
and
t.url=tmain.url
and
t.linksetid='$linkuniq')
or
exists
(select t.fid from myfetcher t where
t.url=tmain.url
and
t.linksetid<>'$linkuniq')
) ;
delete from myfetcher where myfetcher.fid in (select fid from to_delete_tmp);
drop table to_delete_tmp;
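The two-step workaround (collect the ids first, then delete by id) can be sketched as below. This is a simplified illustration under stated assumptions: it uses Python's sqlite3 module in place of MySQL, it implements only the first EXISTS branch (keep the lowest fid per url inside one linkset), and the sample rows and the `link_uniq` value are made up. It also binds the linkset value as a parameter rather than interpolating `$linkuniq` into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE myfetcher (
        fid INTEGER PRIMARY KEY AUTOINCREMENT,
        linksetid TEXT NOT NULL,
        url TEXT NOT NULL
    );
    INSERT INTO myfetcher (linksetid, url) VALUES
        ('setA', 'http://example.com/1'),
        ('setA', 'http://example.com/1'),
        ('setA', 'http://example.com/2'),
        ('setB', 'http://example.com/3'),
        ('setB', 'http://example.com/3');
""")

link_uniq = "setA"  # hypothetical value standing in for $linkuniq

# Step 1: collect the ids to delete, i.e. every row in the chosen set that
# has an earlier row with the same url in the same set.
conn.execute("CREATE TEMPORARY TABLE to_delete_tmp (fid INTEGER PRIMARY KEY)")
conn.execute("""
    INSERT INTO to_delete_tmp (fid)
    SELECT tmain.fid
    FROM myfetcher AS tmain
    WHERE tmain.linksetid = :ls
      AND EXISTS (
          SELECT 1 FROM myfetcher AS t
          WHERE t.fid < tmain.fid
            AND t.url = tmain.url
            AND t.linksetid = :ls
      )
""", {"ls": link_uniq})

# Step 2: delete by id, then drop the scratch table.
conn.execute("DELETE FROM myfetcher WHERE fid IN (SELECT fid FROM to_delete_tmp)")
conn.execute("DROP TABLE to_delete_tmp")

remaining = conn.execute(
    "SELECT fid, linksetid, url FROM myfetcher ORDER BY fid"
).fetchall()
for row in remaining:
    print(row)
```

The duplicates in setB are left alone because the filter on linksetid applies to both the outer row and the row it is compared against.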

MySQL Select from multiple tables

I'm new to MySQL. I am creating a checkout page in PHP. When the users select the items they want to buy and click "Add to Cart", a temporary table gets created which has the following fields (table name is temp):
+--------------+-----------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+-----------+------+-----+-------------------+----------------+
| Cart_Item_ID | int(11) | NO | PRI | NULL | auto_increment |
| Item_ID | int(11) | NO | | | |
| Added_On | timestamp | YES | | CURRENT_TIMESTAMP | |
+--------------+-----------+------+-----+-------------------+----------------+
I'm only inserting to the Item_ID field which contains the ID of each item they bought (I'm populating the forms with item IDs). What I want to do is look up the item's name and price that's stored in the Inventory table. Here's how that looks:
+--------------+----------------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+----------------+------+-----+-------------------+----------------+
| Inventory_ID | int(11) | NO | PRI | NULL | auto_increment |
| Item_Name | varchar(40) | NO | | | |
| Item_Price | float unsigned | NO | | 0 | |
| Added_On | timestamp | YES | | CURRENT_TIMESTAMP | |
+--------------+----------------+------+-----+-------------------+----------------+
So how would I pull out the Item_name and Item_Price fields from the Inventory table based on the Item_ID field from the temp table so I can display it on the page? I just don't understand how to formulate the query. I'd appreciate any help. Thank you.
It's called JOIN - read more here
SELECT Inventory.Item_Name, Inventory.Item_Price
FROM Inventory, temp WHERE Inventory.Inventory_ID = temp.Item_ID
As I understand it, Item_ID in the temp table references Inventory_ID in the Inventory table. Based on this assumption, you can use the following query:
SELECT Item_Name, Item_Price FROM Inventory, temp WHERE temp.Item_ID = Inventory.Inventory_ID
I guess this is what you want to do.
Thanks
As it stands, you can't (unless Inventory_ID = Item_ID).
What you need is a way of JOINing the two tables together. In this instance, if Inventory_ID = Item_ID then the following is possible:
SELECT Item_Name,
Item_Price
FROM Inventory
INNER JOIN temp ON (Inventory.Inventory_ID = temp.Item_ID)
If you want to filter for a particular item you can add the constraint:
WHERE temp.Item_ID = 27 -- for example
That will join all the rows in your Inventory table with matching rows in the temp table.
Jeff Atwood has a great (IMO) visual explanation of how JOINs work.
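For a concrete picture of what the join returns, here is a small runnable sketch using Python's sqlite3 module as a stand-in for MySQL (an assumption so it runs without a server). The cart table is named `cart` here in place of the question's `temp`, and the item names and prices are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Inventory (
        Inventory_ID INTEGER PRIMARY KEY,
        Item_Name TEXT NOT NULL,
        Item_Price REAL NOT NULL
    );
    -- "cart" plays the role of the question's temp table.
    CREATE TABLE cart (
        Cart_Item_ID INTEGER PRIMARY KEY AUTOINCREMENT,
        Item_ID INTEGER NOT NULL
    );
    INSERT INTO Inventory (Inventory_ID, Item_Name, Item_Price) VALUES
        (27, 'Widget', 9.5), (28, 'Gadget', 20.0), (29, 'Gizmo', 5.0);
    INSERT INTO cart (Item_ID) VALUES (27), (29), (27);
""")

# One output row per cart row, with name and price looked up in Inventory.
cart_rows = conn.execute("""
    SELECT c.Cart_Item_ID, i.Item_Name, i.Item_Price
    FROM cart AS c
    INNER JOIN Inventory AS i ON i.Inventory_ID = c.Item_ID
    ORDER BY c.Cart_Item_ID
""").fetchall()
for row in cart_rows:
    print(row)
```

An item added to the cart twice appears twice in the result, and inventory items that are not in the cart do not appear at all.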

How to avoid "Using temporary" in many-to-many queries?

This query is very simple. All I want to do is get all the articles in a given category, ordered by the last_updated field:
SELECT
`articles`.*
FROM
`articles`,
`articles_to_categories`
WHERE
`articles`.`id` = `articles_to_categories`.`article_id`
AND `articles_to_categories`.`category_id` = 1
ORDER BY `articles`.`last_updated` DESC
LIMIT 0, 20;
But it runs very slow. Here is what EXPLAIN said:
select_type table type possible_keys key key_len ref rows Extra
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SIMPLE articles_to_categories ref article_id,category_id article_id 5 const 5016 Using where; Using temporary; Using filesort
SIMPLE articles eq_ref PRIMARY PRIMARY 4 articles_to_categories.article_id 1
Is there a way to rewrite this query or add additional logic to my PHP scripts to avoid Using temporary; Using filesort and speed things up?
The table structure:
*articles*
id | title | content | last_updated
*articles_to_categories*
article_id | category_id
UPDATE
I have last_updated indexed. I guess my situation is explained in documentation:
In some cases, MySQL cannot use
indexes to resolve the ORDER BY,
although it still uses indexes to find
the rows that match the WHERE clause.
These cases include the following:
The key used to fetch the rows is not the same as the one used in the ORDER BY:
SELECT * FROM t1 WHERE key2=constant ORDER BY key1;
You are joining many tables, and the
columns in the ORDER BY are not all
from the first nonconstant table that
is used to retrieve rows. (This is the
first table in the EXPLAIN output that
does not have a const join type.)
but I still have no idea how to fix this.
Here's a simplified example I did for a similar performance related question sometime ago that takes advantage of innodb clustered primary key indexes (obviously only available with innodb !!)
http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html
http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/
You have 3 tables: category, product and product_category as follows:
drop table if exists product;
create table product
(
prod_id int unsigned not null auto_increment primary key,
name varchar(255) not null unique
)
engine = innodb;
drop table if exists category;
create table category
(
cat_id mediumint unsigned not null auto_increment primary key,
name varchar(255) not null unique
)
engine = innodb;
drop table if exists product_category;
create table product_category
(
cat_id mediumint unsigned not null,
prod_id int unsigned not null,
primary key (cat_id, prod_id) -- **note the clustered composite index** !!
)
engine = innodb;
The most important thing is the order of the product_category clustered composite primary key, as typical queries for this scenario always lead with cat_id = x or cat_id in (x,y,z...).
We have 500K categories, 1 million products and 125 million product categories.
select count(*) from category;
+----------+
| count(*) |
+----------+
| 500000 |
+----------+
select count(*) from product;
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
select count(*) from product_category;
+-----------+
| count(*) |
+-----------+
| 125611877 |
+-----------+
So let's see how this schema performs for a query similar to yours. All queries are run cold (after mysql restart) with empty buffers and no query caching.
select
p.*
from
product p
inner join product_category pc on
pc.cat_id = 4104 and pc.prod_id = p.prod_id
order by
p.prod_id desc -- sorry, no date field in this sample table; it won't make any difference though
limit 20;
+---------+----------------+
| prod_id | name |
+---------+----------------+
| 993561 | Product 993561 |
| 991215 | Product 991215 |
| 989222 | Product 989222 |
| 986589 | Product 986589 |
| 983593 | Product 983593 |
| 982507 | Product 982507 |
| 981505 | Product 981505 |
| 981320 | Product 981320 |
| 978576 | Product 978576 |
| 973428 | Product 973428 |
| 959384 | Product 959384 |
| 954829 | Product 954829 |
| 953369 | Product 953369 |
| 951891 | Product 951891 |
| 949413 | Product 949413 |
| 947855 | Product 947855 |
| 947080 | Product 947080 |
| 945115 | Product 945115 |
| 943833 | Product 943833 |
| 942309 | Product 942309 |
+---------+----------------+
20 rows in set (0.70 sec)
explain
select
p.*
from
product p
inner join product_category pc on
pc.cat_id = 4104 and pc.prod_id = p.prod_id
order by
p.prod_id desc -- sorry, no date field in this sample table; it won't make any difference though
limit 20;
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+----------------------------------------------+
| 1 | SIMPLE | pc | ref | PRIMARY | PRIMARY | 3 | const | 499 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | p | eq_ref | PRIMARY | PRIMARY | 4 | vl_db.pc.prod_id | 1 | |
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+----------------------------------------------+
2 rows in set (0.00 sec)
So that's 0.70 seconds cold - ouch.
Hope this helps :)
EDIT
Having just read your reply to my comment above it seems you have one of two choices to make:
create table articles_to_categories
(
article_id int unsigned not null,
category_id mediumint unsigned not null,
primary key(article_id, category_id), -- good for queries that lead with article_id = x
key (category_id)
)
engine=innodb;
or:
create table categories_to_articles
(
article_id int unsigned not null,
category_id mediumint unsigned not null,
primary key(category_id, article_id), -- good for queries that lead with category_id = x
key (article_id)
)
engine=innodb;
How you define your clustered PK depends on your typical queries.
You should be able to avoid the filesort by adding a key on articles.last_updated. MySQL normally needs a filesort for the ORDER BY operation, but it can skip the filesort when you order by an indexed column (with some limitations).
For much more info, see here: http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
I assume you have made the following in your db:
1) articles -> id is a primary key
2) articles_to_categories -> article_id is a foreign key of articles -> id
3) you can create index on category_id
ALTER TABLE articles ADD INDEX (last_updated);
ALTER TABLE articles_to_categories ADD INDEX (article_id);
should do it. The right plan is to find the first few records using the first index and do the JOIN using the second one. If it doesn't work, try STRAIGHT_JOIN or something to enforce proper index usage.

Recursive-ish query for tags?

I have a table of tags that can be linked to other tags and I want to "recursively" select the tags in order of arrangement, so that when a search is made we get the immediate (1-level) results and then carry on down to, say, 5 levels. That way we always have a list of tags even if there weren't enough exact matches on level 1.
I can manage this fine by making multiple queries until I get enough results, but surely there is a better, optimized way via a one-trip query?
Any tips will be appreciated.
Thanks!
Results:
tagId, tagWord, child, child tagId
'513', 'Slap', 'Hog Slapper', '1518'
'513', 'Slap', 'Corporal Punishment', '147'
'513', 'Slap', 'Impact Play', '1394'
Query:
SELECT t.tagId, t.tagWord as tag, tt.tagWord as child, tt.tagId as childId
FROM platform.tagWords t
INNER JOIN platform.tagsLinks l ON l.parentId = t.tagId
INNER JOIN platform.tagWords tt ON tt.tagId = l.tagId
WHERE t.tagWord = 'slap'
Table Layouts:
mysql> explain tagWords;
+---------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+---------------------+------+-----+---------+----------------+
| tagId | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| tagWord | varchar(45) | YES | UNI | NULL | |
+---------+---------------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)
mysql> explain tagsLinks;
+----------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+---------------------+------+-----+---------+-------+
| tagId | bigint(20) unsigned | NO | | NULL | |
| parentId | bigint(20) | YES | | NULL | |
+----------+---------------------+------+-----+---------+-------+
2 rows in set (0.00 sec)
AFAIK MySQL doesn't have any mechanism for querying data recursively.
Oracle has the CONNECT BY construct and SQL Server has CTEs (Common Table Expressions).
But for MySQL, read here and here.
Here are the options that I consider each time I find myself in a situation when I need to query hierarchical data.
Nested Sets
Path enumeration
Explicit joins (when the maximum level is known)
Vendor Extensions (SQL Server CTE, Oracle CONNECT BY etc)
Stored Procedures
Suck it up
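For readers on newer servers: MySQL 8.0 and later do support recursive common table expressions (`WITH RECURSIVE`), which postdates the answer above. The sketch below shows what a depth-limited, one-trip query could look like. It runs against SQLite through Python's sqlite3 module (an assumption so it is runnable without a server), and the tag data is invented; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tagWords (tagId INTEGER PRIMARY KEY, tagWord TEXT UNIQUE);
    CREATE TABLE tagsLinks (tagId INTEGER NOT NULL, parentId INTEGER);
    -- Invented sample hierarchy: root -> child-a, child-b; child-a -> grandchild.
    INSERT INTO tagWords (tagId, tagWord) VALUES
        (1, 'root'), (2, 'child-a'), (3, 'child-b'), (4, 'grandchild');
    INSERT INTO tagsLinks (tagId, parentId) VALUES (2, 1), (3, 1), (4, 2);
""")

rows = conn.execute("""
    WITH RECURSIVE descendants(tagId, depth) AS (
        -- Level 1: direct children of the searched tag.
        SELECT l.tagId, 1
        FROM tagWords AS t
        JOIN tagsLinks AS l ON l.parentId = t.tagId
        WHERE t.tagWord = :word
        UNION ALL
        -- Deeper levels: children of rows found so far, capped at max_depth.
        SELECT l.tagId, d.depth + 1
        FROM descendants AS d
        JOIN tagsLinks AS l ON l.parentId = d.tagId
        WHERE d.depth < :max_depth
    )
    SELECT w.tagWord, d.depth
    FROM descendants AS d
    JOIN tagWords AS w ON w.tagId = d.tagId
    ORDER BY d.depth, w.tagWord
""", {"word": "root", "max_depth": 5}).fetchall()
for row in rows:
    print(row)
```

Ordering by depth puts the immediate matches first, and the depth cap also keeps the recursion finite if the links ever form a cycle.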
