I've got a table of URLs and I don't want any duplicate URLs. How do I check to see if a given URL is already in the table using PHP/MySQL?
If you don't want to have duplicates you can do the following:
add uniqueness constraint
use "REPLACE" or "INSERT ... ON DUPLICATE KEY UPDATE" syntax
If multiple users can insert data into the DB, the check-then-insert method suggested by @Jeremy Ruten can lead to an error: after you perform the check, someone else can insert the same data into the table.
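For illustration, here is a minimal sketch of both ideas; the urls table and the url and last_visited columns are just assumed names, so adjust them to your schema:
-- Add a uniqueness constraint so the database itself rejects duplicates
ALTER TABLE urls ADD UNIQUE KEY unique_url (url);
-- Insert, and fall back to an update when the URL already exists
INSERT INTO urls (url, last_visited)
VALUES ('http://www.example.com/', NOW())
ON DUPLICATE KEY UPDATE last_visited = NOW();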
To answer your initial question, the easiest way to check whether there is a duplicate is to run an SQL query against what you're trying to add!
For example, were you to want to check for the url http://www.example.com/ in the table links, then your query would look something like
SELECT * FROM links WHERE url = 'http://www.example.com/';
Your PHP code would look something like
$conn = mysql_connect('localhost', 'username', 'password');
if (!$conn)
{
die('Could not connect to database');
}
if(!mysql_select_db('mydb', $conn))
{
die('Could not select database mydb');
}
$result = mysql_query("SELECT * FROM links WHERE url = 'http://www.example.com/'", $conn);
if (!$result)
{
die('There was a problem executing the query');
}
$number_of_rows = mysql_num_rows($result);
if ($number_of_rows > 0)
{
die('This URL already exists in the database');
}
I've written this out longhand here, with all the connecting to the database, etc. It's likely that you'll already have a connection to a database, so you should use that rather than starting a new connection (replace $conn in the mysql_query call and remove the mysql_connect and mysql_select_db code).
Of course, there are other ways of connecting to the database, like PDO, or using an ORM, or similar, so if you're already using those, this answer may not be relevant (and it's probably a bit beyond the scope to give answers related to this here!)
However, MySQL provides many ways to prevent this from happening in the first place.
Firstly, you can mark a field as "unique".
Let's say I have a table where I want to just store all the URLs that are linked to from my site, and the last time they were visited.
My definition might look something like this:-
CREATE TABLE links
(
url VARCHAR(255) NOT NULL,
last_visited TIMESTAMP
)
This would allow me to add the same URL over and over again, unless I wrote some PHP code similar to the above to stop this happening.
However, were my definition to change to
CREATE TABLE links
(
url VARCHAR(255) NOT NULL,
last_visited TIMESTAMP,
PRIMARY KEY (url)
)
Then this would make MySQL throw an error when I tried to insert the same value twice.
An example in PHP would be
$result = mysql_query("INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW()", $conn);
if (!$result)
{
die('Could not Insert Row 1');
}
$result2 = mysql_query("INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW())", $conn);
if (!$result2)
{
die('Could not Insert Row 2');
}
If you ran this, you'd find that on the first attempt, the script would die with the comment Could not Insert Row 2. However, on subsequent runs, it'd die with Could not Insert Row 1.
This is because MySQL knows that the url is the Primary Key of the table. A Primary Key is a unique identifier for that row. Most of the time, it's useful to set the unique identifier for a row to be a number. This is because MySQL is quicker at looking up numbers than it is at looking up text. Within MySQL, keys (and especially Primary Keys) are used to define relationships between two tables. For example, if we had a table for users, we could define it as
CREATE TABLE users (
username VARCHAR(255) NOT NULL,
password VARCHAR(40) NOT NULL,
PRIMARY KEY (username)
)
However, when we wanted to store information about a post the user had made, we'd have to store the username with that post to identify that the post belonged to that user.
I've already mentioned that MySQL is faster at looking up numbers than strings, so this would mean we'd be spending time looking up strings when we didn't have to.
To solve this, we can add an extra column, user_id, and make that the primary key (so when looking up the user record based on a post, we can find it quicker)
CREATE TABLE users (
user_id INT(10) NOT NULL AUTO_INCREMENT,
username VARCHAR(255) NOT NULL,
password VARCHAR(40) NOT NULL,
PRIMARY KEY (`user_id`)
)
You'll notice that I've also added something new here - AUTO_INCREMENT. This basically allows us to let that field look after itself. Each time a new row is inserted, it adds 1 to the previous number, and stores that, so we don't have to worry about numbering, and can just let it do this itself.
So, with the above table, we can do something like
INSERT INTO users (username, password) VALUES('Mez', 'd3571ce95af4dc281f142add33384abc5e574671');
and then
INSERT INTO users (username, password) VALUES('User', '988881adc9fc3655077dc2d4d757d480b5ea0e11');
When we select the records from the database, we get the following:-
mysql> SELECT * FROM users;
+---------+----------+------------------------------------------+
| user_id | username | password |
+---------+----------+------------------------------------------+
| 1 | Mez | d3571ce95af4dc281f142add33384abc5e574671 |
| 2 | User | 988881adc9fc3655077dc2d4d757d480b5ea0e11 |
+---------+----------+------------------------------------------+
2 rows in set (0.00 sec)
However, here - we have a problem - we can still add another user with the same username! Obviously, this is something we don't want to do!
mysql> SELECT * FROM users;
+---------+----------+------------------------------------------+
| user_id | username | password |
+---------+----------+------------------------------------------+
| 1 | Mez | d3571ce95af4dc281f142add33384abc5e574671 |
| 2 | User | 988881adc9fc3655077dc2d4d757d480b5ea0e11 |
| 3 | Mez | d3571ce95af4dc281f142add33384abc5e574671 |
+---------+----------+------------------------------------------+
3 rows in set (0.00 sec)
Let's change our table definition!
CREATE TABLE users (
user_id INT(10) NOT NULL AUTO_INCREMENT,
username VARCHAR(255) NOT NULL,
password VARCHAR(40) NOT NULL,
PRIMARY KEY (user_id),
UNIQUE KEY (username)
)
Let's see what happens when we now try to insert the same user twice.
mysql> INSERT INTO users (username, password) VALUES('Mez', 'd3571ce95af4dc281f142add33384abc5e574671');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO users (username, password) VALUES('Mez', 'd3571ce95af4dc281f142add33384abc5e574671');
ERROR 1062 (23000): Duplicate entry 'Mez' for key 'username'
Huzzah!! We now get an error when we try and insert the username for the second time. Using something like the above, we can detect this in PHP.
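As a rough sketch of that detection (same old mysql_* style as the examples above; 1062 is MySQL's ER_DUP_ENTRY error code):
$result = mysql_query("INSERT INTO users (username, password) VALUES ('Mez', 'd3571ce95af4dc281f142add33384abc5e574671')", $conn);
if (!$result)
{
    if (mysql_errno($conn) == 1062)
    {
        // Duplicate entry - the username is already taken
        echo 'That username already exists';
    }
    else
    {
        die('Insert failed: ' . mysql_error($conn));
    }
}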
Now, let's go back to our links table, but with a new definition.
CREATE TABLE links
(
link_id INT(10) NOT NULL AUTO_INCREMENT,
url VARCHAR(255) NOT NULL,
last_visited TIMESTAMP,
PRIMARY KEY (link_id),
UNIQUE KEY (url)
)
and let's insert "http://www.example.com" into the database.
INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW());
If we try and insert it again....
ERROR 1062 (23000): Duplicate entry 'http://www.example.com/' for key 'url'
But what happens if we want to update the time it was last visited?
Well, we could do something complex with PHP, like so:-
$result = mysql_query("SELECT * FROM links WHERE url = 'http://www.example.com/'", $conn);
if (!$result)
{
die('There was a problem executing the query');
}
$number_of_rows = mysql_num_rows($result);
if ($number_of_rows > 0)
{
$result = mysql_query("UPDATE links SET last_visited = NOW() WHERE url = 'http://www.example.com/'", $conn);
if (!$result)
{
die('There was a problem updating the links table');
}
}
Or, even grab the id of the row in the database and use that to update it.
$result = mysql_query("SELECT * FROM links WHERE url = 'http://www.example.com/'", $conn);
if (!$result)
{
die('There was a problem executing the query');
}
$number_of_rows = mysql_num_rows($result);
if ($number_of_rows > 0)
{
$row = mysql_fetch_assoc($result);
$result = mysql_query('UPDATE links SET last_visited = NOW() WHERE link_id = ' . intval($row['link_id']), $conn);
if (!$result)
{
die('There was a problem updating the links table');
}
}
But MySQL has a nice built-in feature called REPLACE INTO.
Let's see how it works.
mysql> SELECT * FROM links;
+---------+-------------------------+---------------------+
| link_id | url | last_visited |
+---------+-------------------------+---------------------+
| 1 | http://www.example.com/ | 2011-08-19 23:48:03 |
+---------+-------------------------+---------------------+
1 row in set (0.00 sec)
mysql> INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW());
ERROR 1062 (23000): Duplicate entry 'http://www.example.com/' for key 'url'
mysql> REPLACE INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW());
Query OK, 2 rows affected (0.00 sec)
mysql> SELECT * FROM links;
+---------+-------------------------+---------------------+
| link_id | url | last_visited |
+---------+-------------------------+---------------------+
| 2 | http://www.example.com/ | 2011-08-19 23:55:55 |
+---------+-------------------------+---------------------+
1 row in set (0.00 sec)
Notice that when using REPLACE INTO, it's updated the last_visited time, and not thrown an error!
This is because MySQL detects that you're attempting to replace a row. It knows which row you want, as you've set url to be unique. MySQL figures out the row to replace using the part you passed in that should be unique (in this case, the url) and updates the other values for that row. It's also updated the link_id, which is a bit unexpected! (In fact, I didn't realise this would happen until I just saw it happen!)
But what if you wanted to add a new URL? Well, REPLACE INTO will happily insert a new row if it can't find a matching unique row!
mysql> REPLACE INTO links (url, last_visited) VALUES ('http://www.stackoverflow.com/', NOW());
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM links;
+---------+-------------------------------+---------------------+
| link_id | url | last_visited |
+---------+-------------------------------+---------------------+
| 2 | http://www.example.com/ | 2011-08-20 00:00:07 |
| 3 | http://www.stackoverflow.com/ | 2011-08-20 00:01:22 |
+---------+-------------------------------+---------------------+
2 rows in set (0.00 sec)
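As an aside, if you would rather keep the original link_id stable, the INSERT ... ON DUPLICATE KEY UPDATE syntax mentioned elsewhere in this thread updates the existing row in place instead of deleting and re-inserting it:
INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW())
ON DUPLICATE KEY UPDATE last_visited = NOW();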
I hope this answers your question, and gives you a bit more information about how MySQL works!
Are you concerned purely about URLs that are the exact same string? If so, there is a lot of good advice in the other answers. Or do you also have to worry about canonicalization?
For example: http://google.com and http://go%4fgle.com are the exact same URL, but would be allowed as duplicates by any of the database-only techniques. If this is an issue you should preprocess the URLs to resolve any character escape sequences.
Depending on where the URLs are coming from, you will also have to worry about parameters and whether they are significant in your application.
First, prepare the database.
Domain names aren't case-sensitive, but you have to assume the rest of a URL is. (Not all web servers respect case in URLs, but most do, and you can't easily tell by looking.)
Assuming you need to store more than a domain name, use a case-sensitive collation.
If you decide to store the URL in two columns--one for the domain name and one for the resource locator--consider using a case-insensitive collation for the domain name, and a case-sensitive collation for the resource locator. If I were you, I'd test both ways (URL in one column vs. URL in two columns).
Put a UNIQUE constraint on the URL column. Or on the pair of columns, if you store the domain name and resource locator in separate columns, as UNIQUE (domain_name, resource_locator).
Use a CHECK() constraint to keep encoded URLs out of the database. This CHECK() constraint is essential to keep bad data from coming in through a bulk copy or through the SQL shell.
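As a rough sketch of that CHECK() idea (note: MySQL only enforces CHECK constraints from version 8.0.16 onwards; on older versions you would need a trigger instead), assuming the table and column are called links and url:
ALTER TABLE links
ADD CONSTRAINT url_not_encoded CHECK (INSTR(url, '%') = 0);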
Second, prepare the URL.
Domain names aren't case-sensitive. If you store the full URL in one column, lowercase the domain name on all URLs. But be aware that some languages have uppercase letters that have no lowercase equivalent.
Think about trimming trailing characters. For example, these two URLs from amazon.com point to the same product. You probably want to store the second version, not the first.
http://www.amazon.com/Systemantics-Systems-Work-Especially-They/dp/070450331X/ref=sr_1_1?ie=UTF8&qid=1313583998&sr=8-1
http://www.amazon.com/Systemantics-Systems-Work-Especially-They/dp/070450331X
Decode encoded URLs. (See php's urldecode() function. Note carefully its shortcomings, as described in that page's comments.) Personally, I'd rather handle these kinds of transformations in the database rather than in client code. That would involve revoking permissions on the tables and views, and allowing inserts and updates only through stored procedures; the stored procedures handle all the string operations that put the URL into a canonical form. But keep an eye on performance when you try that. CHECK() constraints (see above) are your safety net.
Third, if you're inserting only the URL, don't test for its existence first. Instead, try to insert and trap the error that you'll get if the value already exists. Testing and inserting hits the database twice for every new URL. Insert-and-trap just hits the database once. Note carefully that insert-and-trap isn't the same thing as insert-and-ignore-errors. Only one particular error means you violated the unique constraint; other errors mean there are other problems.
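A minimal sketch of insert-and-trap, shown here with PDO (assuming the connection is configured with PDO::ERRMODE_EXCEPTION; the $pdo handle and the urls/url names are placeholders). The key point is that only MySQL error 1062 means a duplicate key, so that is the only error you swallow:
try {
    $stmt = $pdo->prepare('INSERT INTO urls (url) VALUES (:url)');
    $stmt->execute(array(':url' => $url));
} catch (PDOException $e) {
    if ($e->errorInfo[1] == 1062) {
        // Duplicate key: the URL is already stored, which is fine
    } else {
        throw $e; // any other error is a real problem
    }
}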
On the other hand, if you're inserting the URL along with some other data in the same row, you need to decide ahead of time whether you'll handle duplicate urls by
deleting the old row and inserting a new one (See MySQL's REPLACE extension to SQL)
updating existing values (See ON DUPLICATE KEY UPDATE)
ignoring the issue
requiring the user to take further action
REPLACE eliminates the need to trap duplicate key errors, but it might have unfortunate side effects if there are foreign key references.
To guarantee uniqueness you need to add a unique constraint. Assuming your table name is "urls" and the column name is "url", you can add the unique constraint with this alter table command:
alter table urls add constraint unique_url unique (url);
The alter table will probably fail (who really knows with MySQL) if you've already got duplicate urls in your table.
The simple SQL solutions require a unique field; the logic solutions do not.
You should normalize your URLs to ensure there is no duplication. PHP functions such as strtolower() and urldecode() or rawurldecode() can help with this.
Assumptions: Your table name is 'websites', the column name for your url is 'url', and the arbitrary data to be associated with the url is in the column 'data'.
Logic Solutions
SELECT COUNT(*) AS UrlResults FROM websites WHERE url='http://www.domain.com'
Test the previous query with if statements in SQL or PHP to ensure that it is 0 before you continue with an INSERT statement.
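A rough PHP sketch of that check, using the same mysql_* style as the rest of this thread (and keep in mind the race condition mentioned in another answer: someone could insert the URL between your check and your INSERT):
$result = mysql_query("SELECT COUNT(*) AS UrlResults FROM websites WHERE url='http://www.domain.com'", $conn);
$row = mysql_fetch_assoc($result);
if ((int)$row['UrlResults'] == 0)
{
    mysql_query("INSERT INTO websites (url, data) VALUES ('http://www.domain.com', 'random data')", $conn);
}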
Simple SQL Statements
Scenario 1: Your db is a first come first serve table and you have no desire to have duplicate entries in the future.
ALTER TABLE websites ADD UNIQUE (url)
This will prevent any entries from being able to be entered in to the database if the url value already exists in that column.
Scenario 2: You want the most up to date information for each url and don't want to duplicate content. There are two solutions for this scenario. (These solutions also require 'url' to be unique so the solution in Scenario 1 will also need to be carried out.)
REPLACE INTO websites (url, data) VALUES ('http://www.domain.com', 'random data')
This will trigger a DELETE of the existing row (if one exists) followed by an INSERT in all cases, so be careful with ON DELETE declarations.
INSERT INTO websites (url, data) VALUES ('http://www.domain.com', 'random data')
ON DUPLICATE KEY UPDATE data='random data'
This will trigger an UPDATE action if a row exists and an INSERT if it does not.
In considering a solution to this problem, you need to first define what a "duplicate URL" means for your project. This will determine how to canonicalize the URLs before adding them to the database.
There are at least two definitions:
Two URLs are considered duplicates if they represent the same resource knowing nothing about the corresponding web service that generates the corresponding content. Some considerations include:
The scheme and domain name portion of the URLs are case-insensitive, so HTTP://WWW.STACKOVERFLOW.COM/ is the same as http://www.stackoverflow.com/.
If one URL specifies a port, but it is the conventional port for the scheme and they are otherwise equivalent, then they are the same ( http://www.stackoverflow.com/ and http://www.stackoverflow.com:80/).
If the parameters in the query string are simple rearrangements and the parameter names are all different, then they are the same; e.g. http://authority/?a=test&b=test and http://authority/?b=test&a=test. Note that http://authority/?a%5B%5D=test1&a%5B%5D=test2 is not the same, by this first definition of sameness, as http://authority/?a%5B%5D=test2&a%5B%5D=test1.
If the scheme is HTTP or HTTPS, then the hash portions of the URLs can be removed, as this portion of the URL is not sent to the web server.
A shortened IPv6 address can be expanded.
Append a trailing forward slash to the authority only if it is missing.
Unicode canonicalization changes the referenced resource; e.g. you can't conclude that http://google.com/?q=%C3%84 (%C3%84 represents 'Ä' in UTF-8) is the same as http://google.com/?q=A%CC%88 (%CC%88 represents U+0308, COMBINING DIAERESIS).
If the scheme is HTTP or HTTPS, 'www.' in one URL's authority can not simply be removed if the two URLs are otherwise equivalent, as the text of the domain name is sent as the value of the Host HTTP header, and some web servers use virtual hosts to send back different content based on this header. More generally, even if the domain names resolve to the same IP address, you can not conclude that the referenced resources are the same.
Apply basic URL canonicalization (e.g. lower case the scheme and domain name, supply the default port, stable sort query parameters by parameter name, remove the hash portion in the case of HTTP and HTTPS, ...), and take into account knowledge of the web service. Maybe you will assume that all web services are smart enough to canonicalize Unicode input (Wikipedia is, for example), so you can apply Unicode Normalization Form Canonical Composition (NFC). You would strip 'www.' from all Stack Overflow URLs. You could use PostRank's postrank-uri code, ported to PHP, to remove all sorts of pieces of the URLs that are unnecessary (e.g. &utm_source=...).
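To make the "basic URL canonicalization" steps concrete, here is a rough PHP sketch covering only the simple cases (it assumes an absolute HTTP/HTTPS URL and ignores the login, IPv6, Unicode, and 'www.' considerations discussed above; also note that parse_str() mangles some parameter names, e.g. dots become underscores):
function basic_canonicalize($url)
{
    $p = parse_url($url);

    $scheme = strtolower($p['scheme']);   // scheme is case-insensitive
    $host   = strtolower($p['host']);     // so is the domain name
    $port   = isset($p['port']) ? (int) $p['port'] : null;

    // Drop the port when it is the default one for the scheme
    $defaults = array('http' => 80, 'https' => 443);
    if ($port !== null && isset($defaults[$scheme]) && $port == $defaults[$scheme]) {
        $port = null;
    }

    $path = (isset($p['path']) && $p['path'] !== '') ? $p['path'] : '/';

    // Stable-sort the query parameters by name
    $query = '';
    if (isset($p['query'])) {
        parse_str($p['query'], $params);
        ksort($params);
        $query = '?' . http_build_query($params);
    }

    // The fragment is never sent to the server for HTTP(S), so it is dropped
    return $scheme . '://' . $host . ($port ? ':' . $port : '') . $path . $query;
}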
Definition 1 leads to a stable solution (i.e. there is no further canonicalization that can be performed and the canonicalization of a URL will not change). Definition 2, which I think is what a human considers the definition of URL canonicalization, leads to a canonicalization routine that can yield different results at different moments in time.
Whichever definition you choose, I suggest that you use separate columns for the scheme, login, host, port, and path portions. This will allow you to use indexes intelligently. The columns for the scheme and host can use a case-insensitive character collation, but the columns for the login and path need to use a binary (case-sensitive) collation. Also, if you use Definition 2, you need to preserve the original scheme, authority, and path portions, as certain canonicalization rules might be added or removed from time to time.
EDIT: Here are example table definitions:
CREATE TABLE `urls1` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`scheme` VARCHAR(20) NOT NULL,
`canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
`canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci', /* the "ci" stands for case-insensitive. Also, we want 'utf8mb4_unicode_ci'
rather than 'utf8mb4_general_ci' because 'utf8mb4_general_ci' treats accented characters as equivalent. */
`port` INT UNSIGNED,
`canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',
PRIMARY KEY (`id`),
INDEX (`canonical_host`(10), `scheme`)
) ENGINE = 'InnoDB';
CREATE TABLE `urls2` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`canonical_scheme` VARCHAR(20) NOT NULL,
`canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
`canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
`port` INT UNSIGNED,
`canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',
`orig_scheme` VARCHAR(20) NOT NULL,
`orig_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
`orig_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
`orig_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',
PRIMARY KEY (`id`),
INDEX (`canonical_host`(10), `canonical_scheme`),
INDEX (`orig_host`(10), `orig_scheme`)
) ENGINE = 'InnoDB';
Table `urls1` is for storing canonical URLs according to definition 1. Table `urls2` is for storing canonical URLs according to definition 2.
Unfortunately you will not be able to specify a UNIQUE constraint on the tuple (`scheme`/`canonical_scheme`, `canonical_login`, `canonical_host`, `port`, `canonical_path`) as MySQL limits the length of InnoDB keys to 767 bytes.
I don't know the syntax for MySQL, but all you need to do is wrap your INSERT in an IF statement that queries the table to see if a record with the given URL EXISTS; if it exists, don't insert a new record.
In MSSQL you can do this:
IF NOT EXISTS (SELECT 1 FROM YOURTABLE WHERE URL = 'URL')
INSERT INTO YOURTABLE (...) VALUES (...)
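For MySQL, a roughly equivalent pattern (same placeholder table and column names as above) is an INSERT ... SELECT guarded by NOT EXISTS:
INSERT INTO YOURTABLE (URL)
SELECT 'URL' FROM DUAL
WHERE NOT EXISTS (SELECT 1 FROM YOURTABLE WHERE URL = 'URL');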
If you want to insert urls into the table, but only those that don't exist already you can add a UNIQUE contraint on the column and in your INSERT query add IGNORE so that you don't get an error.
Example: INSERT IGNORE INTO urls SET url = 'url-to-insert'
First things first. If you haven't already created the table, or you created a table but do not have data in it, then you need to add a unique constraint or a unique index. More information about choosing between an index and a constraint follows at the end of the post. But they both accomplish the same thing: enforcing that the column only contains unique values.
To create a table with a unique index on this column, you can use:
CREATE TABLE MyURLTable(
ID INTEGER NOT NULL AUTO_INCREMENT
,URL VARCHAR(512)
,PRIMARY KEY(ID)
,UNIQUE INDEX IDX_URL(URL)
);
If you just want a unique constraint, and no index on that table, you can use
CREATE TABLE MyURLTable(
ID INTEGER NOT NULL AUTO_INCREMENT
,URL VARCHAR(512)
,PRIMARY KEY(ID)
,CONSTRAINT UNIQUE UNIQUE_URL(URL)
);
Now, if you already have a table, and there is no data in it, then you can add the index or constraint to the table with one of the following pieces of code.
ALTER TABLE MyURLTable
ADD UNIQUE INDEX IDX_URL(URL);
ALTER TABLE MyURLTable
ADD CONSTRAINT UNIQUE UNIQUE_URL(URL);
Now, you may already have a table with some data in it. In that case, you may already have some duplicate data in it. You can try creating the constraint or index shown above, and it will fail if you already have duplicate data. If you don't have duplicate data, great; if you do, you'll have to remove the duplicates. You can see a list of URLs with duplicates using the following query.
SELECT URL,COUNT(*),MIN(ID)
FROM MyURLTable
GROUP BY URL
HAVING COUNT(*) > 1;
To delete rows that are duplicates, and keep one, do the following:
DELETE RemoveRecords
FROM MyURLTable As RemoveRecords
LEFT JOIN
(
SELECT MIN(ID) AS ID
FROM MyURLTable
GROUP BY URL
HAVING COUNT(*) > 1
UNION
SELECT ID
FROM MyURLTable
GROUP BY URL
HAVING COUNT(*) = 1
) AS KeepRecords
ON RemoveRecords.ID = KeepRecords.ID
WHERE KeepRecords.ID IS NULL;
Now that you have deleted all the duplicate records, you can go ahead and create your index or constraint. Now, if you want to insert a value into your database, you should use something like this.
INSERT IGNORE INTO MyURLTable(URL)
VALUES('http://www.example.com');
That will attempt to do the insert, and if it finds a duplicate, nothing will happen. Now, let's say you have other columns; you can do something like this.
INSERT INTO MyURLTable(URL,Visits)
VALUES('http://www.example.com',1)
ON DUPLICATE KEY UPDATE Visits=Visits+1;
That will try to insert the value, and if it finds the URL, it will update the record by incrementing the visits counter. Of course, you can always do a plain old insert and handle the resulting error in your PHP code. Now, as for whether you should use constraints or indexes, that depends on a lot of factors. Indexes make for faster lookups, so your performance will be better as the table gets bigger, but storing the index takes up extra space. Indexes also usually make inserts and updates take longer, because the index has to be updated as well. However, since the value has to be looked up either way to enforce the uniqueness, in this case it may be quicker to just have the index anyway. As with anything performance related, the answer is to try both options and profile the results to see which works best for your situation.
If you just want a yes or no answer, this syntax should give you the best performance.
select if(exists (select url from urls where url = 'http://asdf.com'), 1, 0) from dual
If you just want to make sure there are no duplicates, then add a unique index to the url field. That way there is no need to explicitly check if the url exists; just insert as normal, and if it is already there, the insert will fail with a duplicate key error.
The answer depends on whether you want to know when an attempt is made to enter a record with a duplicate field. If you don't care then use the "INSERT... ON DUPLICATE KEY" syntax as this will make your attempt quietly succeed without creating a duplicate.
If on the other hand you want to know when such an event happens and prevent it, then you should use a unique key constraint which will cause the attempted insert/update to fail with a meaningful error.
$url = "http://www.scroogle.com";
$query = "SELECT `id` FROM `urls` WHERE `url` = '$url' ";
$resultdb = mysql_query($query) or die(mysql_error());
list($idtemp) = mysql_fetch_array($resultdb) ;
if(empty($idtemp)) // if $idtemp is empty the url doesn't exist and we go ahead and insert it into the db.
{
mysql_query("INSERT INTO urls (`url` ) VALUES('$url') ") or die (mysql_error());
}else{
//do something else if the url already exists in the DB
}
Make the column the primary key
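For example (assuming the table is called urls and the column url, as in other answers; this will fail if the table already has a primary key or already contains duplicate rows):
ALTER TABLE urls ADD PRIMARY KEY (url);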
You can locate (and remove) using a self-join. Your table has some URL and also some PK (We know that the PK is not the URL because otherwise you would not be allowed to have duplicates)
SELECT
*
FROM
yourTable a
JOIN
yourTable b -- Join the same table
ON b.[URL] = a.[URL] -- where the URL's match
AND a.[PK] <> b.[PK] -- but the PK's are different
This will return all rows which have duplicated URLs.
Say, though, that you wanted to only select the duplicates and exclude the original.... Well you would need to decide what constitutes the original. For the purpose of this answer let's assume that the lowest PK is the "original"
All you need to do is add the following clause to the above query:
WHERE
a.[PK] NOT IN (
SELECT
TOP 1 c.[PK] -- Only grabbing the original!
FROM
yourTable c
WHERE
c.[URL] = a.[URL] -- has the same URL
ORDER BY
c.[PK] ASC) -- sort it by whatever your criterion is for "original"
Now you have a set of all non-original duplicated rows. You could easily execute a DELETE or whatever you like from this result set.
Note that this approach may be inefficient, in part because MySQL doesn't always handle IN well, but I understand from the OP that this is sort of a "clean up" on the table, not an always-on check.
If you want to check at INSERT time whether or not a value already exists you can run something like this
SELECT
1
WHERE
EXISTS (SELECT * FROM yourTable WHERE [URL] = 'testValue')
If you get a result then you can conclude the value already exists in your DB at least once.
You could do this query:
SELECT url FROM urls WHERE url = 'http://asdf.com' LIMIT 1
Then check if mysql_num_rows() == 1 to see if it exists.
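Roughly, in the same mysql_* style used elsewhere in this thread:
$result = mysql_query("SELECT url FROM urls WHERE url = 'http://asdf.com' LIMIT 1", $conn);
if (mysql_num_rows($result) == 1)
{
    // the URL is already in the table
}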
Related
I have two tables (InnoDB) in a MySQL database that share a similar column, the account_no column. I want to keep both columns as integers and still keep both free from collisions when inserting data.
There are 13 instances of this same question on Stack Overflow and I have read them all, but in all of them the recommended solutions were:
1) using a GUID: this is good, but I am trying to keep the numbers short and easy for the users to remember.
2) using a sequence: I do not fully understand how to do this, but I am thinking it involves making a third table that has an auto_increment and getting my values for the two major tables from it.
3) using IDENTITY (1, 10) [1,11,21...] for the first table and IDENTITY (2, 10) [2,12,22...] for the second: this works fine but in the long term might not be such a good idea.
4) using the PHP function uniqid(,TRUE): not going to work, it's not completely collision free and the columns in my case have to be integers.
5) using the PHP function mt_rand(0,10): might work, but I still have to check for collisions before inserting data.
If there is no smarter way to achieve my goal I would stick with using the adjusted IDENTITY (1, 10) and (2, 10).
I know this question is a bit dumb seeing all the options I have available, but the most recent answer on a similar topic was in 2012; there might have been some improvements in MySQL since then that I do not know about yet.
Also, I am using PHP to insert the data. Thanks.
Basically, you are saying that you have two flavors of an entity. My first recommendation is to try to put them in a single table. There are three methods:
If most columns overlap, just put all the columns in a single table (accounts).
If one entity has more columns, put the common columns in one table and have a second table for the wider entity.
If only some columns overlap, put those in a single table and have a separate table for each subentity.
Let met assume the third situation for the moment.
You want to define something like:
create table accounts (
AccountId int auto_increment primary key,
. . . -- you can still have common columns here
);
create table subaccount_1 (
AccountId int primary key,
constraint foreign key (AccountId) references accounts(AccountId),
. . .
);
create table subaccount_2 (
AccountId int primary key,
constraint foreign key (AccountId) references accounts(AccountId),
. . .
);
Then, you want an insert trigger on each sub-account table. This trigger does the following on insert:
inserts a row into accounts
captures the new accountId
uses that for the insert into the subaccount table
You probably also want something on accounts that prevents inserts into that table, except through the subaccount tables.
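A minimal sketch of such a trigger for subaccount_1 (one possible way to do it, assuming the other columns of accounts have defaults; the trigger name is made up):
DELIMITER $$
CREATE TRIGGER subaccount_1_before_insert
BEFORE INSERT ON subaccount_1
FOR EACH ROW
BEGIN
    -- 1. insert a row into accounts (this generates the id)
    INSERT INTO accounts () VALUES ();
    -- 2. capture the new AccountId and 3. use it for this insert
    SET NEW.AccountId = LAST_INSERT_ID();
END$$
DELIMITER ;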
A big thank you to Gordon Linoff for his answer. I want to fully explain how I solved the problem using his answer, to help others understand better.
Original tables:
Table A (account_no, first_name, last_name)
Table B (account_no, likes, dislikes)
Problem: account_no needs to auto-increment across both tables, be unique across both tables, and remain a medium positive integer (see original question).
I had to make an extra table, Table_C, which holds all the inserted data at first, auto-increments it, and checks for collisions through the use of a primary key.
CREATE TABLE Table_C (
account_no int NOT NULL AUTO_INCREMENT,
first_name varchar(50),
last_name varchar(50),
likes varchar(50),
dislikes varchar(50),
which_table varchar(1),
PRIMARY KEY (account_no)
);
Then I changed my MySQL INSERT statements to insert into Table_C, and added an extra column, which_table, to say which table the data being inserted belongs to. On insert, Table_C performs the auto-increment and collision check, then re-inserts the data into the desired table through the use of a trigger, like so:
CREATE TRIGGER `sort_tables` AFTER INSERT ON `Table_C` FOR EACH ROW
BEGIN
IF new.which_table = 'A' THEN
INSERT INTO Table_A
VALUES (new.account_no, new.first_name, new.last_name);
ELSEIF new.which_table = 'B' THEN
INSERT INTO Table_B
VALUES (new.account_no, new.likes, new.dislikes);
END IF;
END
I am currently having problems with a primary key ID which is set to auto increment. It keeps incrementing ON DUPLICATE KEY.
For Example:
ID | field1 | field2
1 | user | value
5 | secondUser | value
86 | thirdUser | value
From the description above, you'll notice that I have 3 rows in that table, but due to the auto increment firing on each upsert, the ID is 86 for the third row.
Is there any way to avoid this?
Here's what my mySQL query looks like:
INSERT INTO table ( field1, field2 ) VALUES (:value1, :value2)
ON DUPLICATE KEY
UPDATE field1 = :value1, field2 = :value2
And here's what my table looks like;
CREATE TABLE IF NOT EXISTS `table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`field1` varchar(200) NOT NULL,
`field2` varchar(255) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `field1` (`field1`),
KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;
You could set the innodb_autoinc_lock_mode config option to "0" for "traditional" auto-increment lock mode, which guarantees that all INSERT statements will assign consecutive values for AUTO_INCREMENT columns.
That said, you shouldn't depend on the auto-increment IDs being consecutive in your application. Their purpose is to provide unique identifiers.
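Note that innodb_autoinc_lock_mode can only be set at server startup, so the change would go in my.cnf (or my.ini) followed by a restart; a minimal sketch:
[mysqld]
innodb_autoinc_lock_mode = 0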
This behavior is easily seen below with the default setting of innodb_autoinc_lock_mode = 1 (“consecutive” lock mode). Please also reference the fine manual page entitled AUTO_INCREMENT Handling in InnoDB. Changing this value to 0 (“traditional” lock mode) will lower concurrency and performance, as it uses a table-level AUTO-INC lock.
That said, the below is with the default setting = 1.
I am about to show you four examples of how easy it is to create gaps.
Example 1:
create table x
( id int auto_increment primary key,
someOtherUniqueKey varchar(50) not null,
touched int not null,
unique key(someOtherUniqueKey)
);
insert x(touched,someOtherUniqueKey) values (1,'dog') on duplicate key update touched=touched+1;
insert x(touched,someOtherUniqueKey) values (1,'dog') on duplicate key update touched=touched+1;
insert x(touched,someOtherUniqueKey) values (1,'cat') on duplicate key update touched=touched+1;
select * from x;
+----+--------------------+---------+
| id | someOtherUniqueKey | touched |
+----+--------------------+---------+
| 1 | dog | 2 |
| 3 | cat | 1 |
+----+--------------------+---------+
The gap (id=2 is skipped) is due to one of a handful of operations, quirks, and nervous twitches of the InnoDB engine. In its default high-performance concurrency mode, it performs range gap allocations for various queries sent to it. One had better have good reasons to change this setting, because doing so impacts performance. These are the sorts of things later versions of MySQL deliver to you, and that people turn off because of hyper-focusing on gaps in printout sheets (and bosses that ask "Why do we have gaps?").
In the case of an Insert on Duplicate Key Update (IODKU), the engine assumes 1 new row and allocates a slot for it. Remember concurrency: your peers are doing the same operations, perhaps hundreds of them concurrently. When the IODKU turns into an Update, there goes the use of that abandoned, never-inserted row with id=2, for your connection and everyone else.
Example 2:
The same happens during Insert ... Select From, as seen in This Answer of mine. In it I purposely use MyISAM due to reporting on counts, min, and max; otherwise the range gap quirk would allocate and not fill all, and the numbers would look weird, as that answer dealt with actual numbers. So the older engine (MyISAM) worked fine for tight non-gaps. Note that in that answer I was trying to do something fast and safe, and that table could be converted to InnoDB with ALTER TABLE after the fact. Had I done that example in InnoDB to begin with, there would have been plenty of gaps (in the default mode). The reason the Insert ... Select From would have created gaps in that answer, had I used InnoDB, is the uncertainty of the count, the mechanism that the engine chooses for safe (uncertain) range allocations. The InnoDB engine knows the operation naturally, knows it has to create a safe pool of AUTO_INCREMENT ids, has concurrency (other users to think about), and gaps flourish. It's a fact. Try example 2 with the InnoDB engine and see what you come up with for min, max, and count. Max won't equal count.
Examples 3 and 4:
There are various situations that cause INNODB Gaps documented on the Percona website as they stumble into more and document them. For instance, it occurs during failed inserts due to Foreign Key constraints seen in this 1452 Error image. Or a Primary Key error in this 1062 Error image.
Remember that the InnoDB gaps are there as a side effect of system performance and a safe engine. Is that something one really wants to turn off (performance, higher user satisfaction, higher concurrency, lack of table locks) for the sake of tighter id ranges? Ranges that get holes on deletes anyway. I would suggest not for my implementations; the default, with its performance, is just fine.
I am currently having problems with a primary key ID which is set to
auto increment. It keeps incrementing ON DUPLICATE KEY
One of us must be misunderstanding the problem, or you're misrepresenting it. ON DUPLICATE KEY UPDATE never creates a new row, so it cannot be incrementing. From the docs:
If you specify ON DUPLICATE KEY UPDATE, and a row is inserted that
would cause a duplicate value in a UNIQUE index or PRIMARY KEY, MySQL
performs an UPDATE of the old row.
Now it's probably the case that auto-increment occurs when you insert and no duplicate key is found. If I assume that this is what's happening, my question would be: why is that a problem?
If you absolutely want to control the value of your primary key, change your table structure to remove the auto-increment flag, but keep it a required, non-null field. It will force you to provide the keys yourself, but I would bet that this will become a bigger headache for you.
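For reference, removing the flag is just a column redefinition without AUTO_INCREMENT, sketched here against the table definition shown above:
ALTER TABLE `table` MODIFY `id` int(11) NOT NULL;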
I really am curious though: why do you need to plug all the holes in the ID values?
I answered it here:
To solve the auto-incrementing problem, use the following code before the INSERT ... ON DUPLICATE KEY UPDATE part and execute it all together:
SET @NEW_AI = (SELECT MAX(`the_id`)+1 FROM `table_blah`);
SET @ALTER_SQL = CONCAT('ALTER TABLE `table_blah` AUTO_INCREMENT =', @NEW_AI);
PREPARE NEWSQL FROM @ALTER_SQL;
EXECUTE NEWSQL;
Together, in one batch, it should be something like this:
SET @NEW_AI = (SELECT MAX(`the_id`)+1 FROM `table_blah`);
SET @ALTER_SQL = CONCAT('ALTER TABLE `table_blah` AUTO_INCREMENT =', @NEW_AI);
PREPARE NEWSQL FROM @ALTER_SQL;
EXECUTE NEWSQL;
INSERT INTO `table_blah` (`the_col`) VALUES("the_value")
ON DUPLICATE KEY UPDATE `the_col` = "the_value";
You can change your query from
INSERT INTO table ( f1, f2 ) VALUES (:v1, :v2) ON DUPLICATE KEY UPDATE f1 = :v1, f2 = :v2
to
insert ignore into table select (select max(id)+1 from table), :v1, :v2 ;
This will try to:
insert the new data with the last unused id (not auto-increment)
if a duplicate entry is found in the unique fields, ignore it
else insert the new data normally
(But this method does not support updating fields if a duplicate entry is found.)
I am looking for the best-practice way to insert data to multiple MySQL tables where some columns are foreign key dependencies. Here is an example:
Table: contacts
--------------------------------------------------------------------
| contact_id | first_name | last_name | prof_id | zip_code |
--------------------------------------------------------------------
The 'contacts' table has PRIMARY KEY (contact_id) which simply auto_increments, and FOREIGN KEY (prof_id) REFERENCES 'profession' table, FOREIGN KEY (zip_code) REFERENCES 'zip_code' table. Those tables would look something like this:
Table: profession
----------------------------
| prof_id | profession |
----------------------------
where 'prof_id' is an INT NOT NULL AUTO_INCREMENT PRIMARY KEY, and
Table: zip_code
--------------------------------
| zip_code | city | state |
--------------------------------
where 'zip_code' is an INT(5) NOT NULL PRIMARY KEY.
I have a new person I want to add a record for, let's say:
first_name = 'John', last_name = 'Doe', profession = 'Network Administrator', city = 'Sometown', state = 'NY', zip_code = 12345
Here's what I'm trying to do: I want to take that information and insert it into the appropriate tables and columns. To prevent duplicate values in a column (like profession for example) I'd first want to make sure there isn't already an entry for "Network Administrator", and if there was I'd like to just get its key value, if not insert it and then get its key value. The same goes for zip_code, city, state - if it exists just use that zip_code key, otherwise insert the new data and grab the associated key. Then, finally, I'd want to enter the new contact to 'contact' table using the supplied information, including the appropriate key values associated with profession and location from the other tables.
My question is, what is the best recommended way to do this? I know I can sit here and write single statements to check if the given profession exists, enter it if not, and then get the key; do the same for zip_code; then finally insert all of that into contacts. However, I know there must be a better way to accomplish this in fewer (perhaps one) statements, especially considering this could cause a problem if, say, the database went offline for a moment in the midst of all of this. Is there a way to use JOINs with this INSERT to essentially have everything cascade into the correct place? Should I look to handle this with a TRANSACTION series of statements?
I am in the learning stage with SQL but I feel that the books and resources I have used thus far have sort of jumped to using nested queries and JOINS under the assumption we have all of these tables of data already populated. I'm even open to suggestions on WHAT I should be Googling to better learn this, or any resources that can help fill this gap. Ideally, I'd love to see some functioning SQL code to do this though. If necessary, assume PHP as the language to interact with the database, but command-line sql is what I was aiming for. Thanks ahead of time, hopefully I made everything clear.
In short, you want to use transactions (more doc on this) so that your inserts are atomic. This is the only way to guarantee that all (or none) of your data will be inserted. Otherwise, you can get into the situation you describe where the database becomes unavailable after some insertions and others are unable to complete. A transaction tells the database that what you are doing is all-or-nothing and so it should roll back if something goes wrong.
When you are using synthetic primary keys, as you are, PHP and other languages provide mechanisms for getting the last inserted id. If you want to do it entirely in MySQL you can use the LAST_INSERT_ID() function. You will end up with code like this:
START TRANSACTION;
INSERT INTO foo (auto,text)
VALUES(NULL,'text'); # generate ID by inserting NULL
INSERT INTO foo2 (id,text)
VALUES(LAST_INSERT_ID(),'text'); # use ID in second table
COMMIT;
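Applied to the contacts example, a hedged sketch might look like the following. It assumes profession has a UNIQUE key on the profession column; the ON DUPLICATE KEY UPDATE ... LAST_INSERT_ID(prof_id) idiom makes LAST_INSERT_ID() return the existing key when the row is already there, instead of creating a duplicate:
START TRANSACTION;
-- insert the profession, or fetch the existing key if it is already there
INSERT INTO profession (profession)
VALUES ('Network Administrator')
ON DUPLICATE KEY UPDATE prof_id = LAST_INSERT_ID(prof_id);
SET @prof_id = LAST_INSERT_ID();
-- zip_code uses the natural key, so INSERT IGNORE is enough here
INSERT IGNORE INTO zip_code (zip_code, city, state)
VALUES (12345, 'Sometown', 'NY');
-- finally, the contact row referencing both keys
INSERT INTO contacts (first_name, last_name, prof_id, zip_code)
VALUES ('John', 'Doe', @prof_id, 12345);
COMMIT;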
So I am working on a fairly small script for the company I work at to help us manage our servers better. I don't use MySQL too often though, so I am a bit confused about what would be the best path to take.
I am doing something like this...
$sql = "CREATE TABLE IF NOT EXISTS Servers
(
MachineID int NOT NULL AUTO_INCREMENT,
PRIMARY KEY(MachineID),
FirstName varchar(32),
LastName varchar(32),
Datacenter TINYINT(1),
OperatingSystem TINYINT(1),
HostType TINYINT(1)
)";
$result = mysql_query($sql,$conn);
check ($result);
$sql = "CREATE TABLE IF NOT EXISTS Datacenter
(
DatacenterID int NOT NULL AUTO_INCREMENT,
PRIMARY KEY(DatacenterID),
Name varchar(32),
Location varchar(32)
)";
$result = mysql_query($sql,$conn);
check ($result);
Basically inside the Servers table, I will be storing the index of an entry in another table. My concern is that if we remove one of those entries, it will screw up the auto incremented indexes and potentially cause a lot of incorrect data.
To explain better, let's say for the first server I add, the Datacenter value is 3 (which is the DatacenterID), and we remove the Datacenter with id 1 (DatacenterID) at a later time.
Is there a good way to do this?
Auto increment only has an effect when inserting new rows into a table. So you insert three records into the database and they get assigned ids of 1, 2, and 3. Later you delete id 1, but the records at ids 2 and 3 are unchanged. This says nothing, though, about any server records that might still be trying to reference the now-missing datacenter record with id 1.
As Paul said, it is safe to remove an old row and add a new one later. The auto-increment index won't be affected by the deletion.
But I would suggest that instead of removing rows, you simply add a 'status' column and set it to 0 to indicate they are no longer in use, so that you keep every possible record in the DB.
Servers.Datacenter should be an INT, too, as you would store the DataCenterID in this field. Then, nothing will be mixed up when you remove some Datacenter from the second table.
If you remove a row in the way you suggest, nothing will happen. Auto-increment just assigns the next highest number in sequence whenever a new record is added, so it will not affect any previous records.
Hope that helps.
I asked this question a little earlier today but am not sure as to how clear I was.
I have a MySQL column filled with ordered numbers 1-56. These numbers were generated by my PHP script, not by auto_increment.
What I'd like to do is make this column auto_incrementing after the PHP script sets the proper numbers. The PHP script works hand in hand with a jQuery interface that allows me to reorder a list of items using jQuery's UI plugin.
Once I decide what order I'd like the entries in, I'd like for the column to be set to auto increment, such that if i were to insert a new entry, it would recognize the highest number already existing in the column and set its own id number to be one higher than what's already existing.
Does anyone have any suggestions on how to approach this scenario?
I'd suggest creating the table with your auto_increment already in place. You can specify a value for the auto_inc column and MySQL will use it, and the next insert that specifies a NULL or 0 value for the auto_inc column will still magically get $highest + 1 assigned to it.
example:
mysql> create table foobar (i int auto_increment primary key);
mysql> insert into foobar values (10),(25);
mysql> insert into foobar values (null);
mysql> select * from foobar;
# returns 10,25,26
You can switch it to MySQL's auto_increment implementation, but it'll take 3 queries to do it:
a) ALTER TABLE to add the auto_increment to the field in question
b) SELECT MAX(id) + 1 to find out what you need to set the ID to
c) ALTER TABLE table AUTO_INCREMENT =result from (b)
MySQL considers altering the AUTO_INCREMENT value a table-level action, so you can't do it in (a), and it doesn't allow you to do MAX(id) in (c), so 3 queries.
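Concretely, the three queries might look something like this (mytable and id are placeholders; step (a) assumes the column is already indexed, since AUTO_INCREMENT columns must be a key):
-- (a) add the auto_increment flag to the column
ALTER TABLE mytable MODIFY id INT NOT NULL AUTO_INCREMENT;
-- (b) find the next free value
SELECT MAX(id) + 1 FROM mytable;
-- (c) plug the result of (b) in here, e.g. 57 if the highest id was 56
ALTER TABLE mytable AUTO_INCREMENT = 57;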
You can change that with a query issued through PHP, using the mysql console interface, or (easiest) using phpMyAdmin.
ALTER TABLE table_name CHANGE old_column_name new_column_name column_definition;
ALTER TABLE table_name AUTO_INCREMENT = highest_current_index + 1
column_definiton:
old_column_definition AUTO_INCREMENT
More info:
http://dev.mysql.com/doc/refman/5.1/en/alter-table.html
http://dev.mysql.com/doc/refman/5.1/en/create-table.html
EDIT
Always use mysql_insert_id or the appropriate function of your abstraction layer to get the last created id, as LAST_INSERT_ID may lead to wrong results.
No, stop it. This isn't the point of auto_increment. If you aren't going to make them ordered by the id then don't make them auto_increment, just add a column onto the end of the table for ordering and enjoy the added flexibility it gives you. It seems like you're trying to pack two different sets of information into one column and it's really only going to bite you in the ass despite all the well-meaning people in this thread telling you how to go about shooting yourself in the foot.
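For instance, a minimal sketch of that extra column (sort_order and mytable are made-up names):
ALTER TABLE mytable ADD COLUMN sort_order INT NOT NULL DEFAULT 0;
-- seed it from the current 1-56 numbering, then reorder freely without touching the id
UPDATE mytable SET sort_order = id;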
In MySQL you can set a custom value for an auto_increment field. MySQL will then use the highest auto_increment column value for new rows, essentially MAX(id)+1. This means you can effectively reserve a range of IDs for custom use. For instance:
CREATE TABLE mytable (
id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
col1 VARCHAR(256)
);
ALTER TABLE mytable AUTO_INCREMENT = 5001;
In this schema all ids < 5001 are reserved for use by your system. So, your PHP script can auto-generate values:
for ($i=1; $i<=56; $i++)
mysql_query("INSERT INTO mytable SET id = $i, col1= 'whatevers'");
New entries will use the non-reserved range by not specifying id or setting it to null:
INSERT INTO mytable SET id = NULL, col1 = 'whatevers2';
-- The id of the new row will be 5001
Reserving a range like this is key - in case you need more than 56 special/system rows in the future.
ALTER TABLE <table name> MODIFY <column name> INT NOT NULL AUTO_INCREMENT;
More info:
AUTO_INCREMENT Handling in InnoDB
Server SQL Modes