Unique URL from primary key - PHP

I'm trying to create a URL similar to youtube's /v=xxx in look and in behavior. In short, users will upload files and be able to access them through that URL. This URL code needs to be some form of the database's primary key so the page can gather the data needed. I'm new to databases and this is more of a database problem than anything.
In my database I have an auto-increment primary key by which file data is accessed. I want to use that number to create the URL for files. I started looking into different hash functions, but I'm worried about collisions. I don't want the same URL for two different files.
I also considered using uniqid() as my primary key CHAR(13) and just using that directly. But with this I'm worried about efficiency. Also, looking around I can't seem to find much about it, so it's probably a strange idea. Not to mention I would need to test for collisions when ids are generated, which can be inefficient. Auto-increment is a lot easier.
Is there any good solution to this? Will either of my ideas work? How can I generate a unique URL from an auto incremented primary key and avoid collisions?
I'm leaning toward my second idea. It won't be greatly efficient, but the largest performance drawbacks occur when things are added to the database (testing for collisions), which, for the end user, only happens once. The other performance drawback will probably be in the actual lookup of chars instead of ints. But I'm mainly worried that it's bad practice.
EDIT:
A simple solution would be to just use the auto-incremented value directly. Call me picky, but that looks kind of ugly.

Generating a non-colliding short hash will indeed be a headache. Instead, the slug format Stack Overflow uses is very promising and is guaranteed to produce non-duplicate URLs.
For example, this very same question has
https://stackoverflow.com/questions/11991785/unique-url-from-primary-key
Here, the URL contains the unique primary key plus a title slug to make it more search-engine friendly.
However, as commented, there are a few previously asked questions that might clarify why what you are trying is better left alone:
How to generate a unique hash for a URL?
Create Tinyurl style hash
Creating short hashes greatly increases the chance of a collision, so it is better to use base64 or SHA-512 functions to create a secure hash.

You can simply make a hash of the time, and afterwards check that hash (or part of that hash) in your DB.
If you set an index on that field in your DB (and make sure the hash is long enough to avoid frequent collisions), it won't be a performance issue at all.
<?php
$hashChecked = false;
while( $hashChecked === false ){
    // 8-char hash (make sure that is long enough, with a very big margin)
    $hash = substr( sha1( time() . mt_rand(9999, 99999999) ), 0, 8 );
    $q = mysql_query("SELECT `hash` FROM `tableName` WHERE `hash` = '".$hash."'");
    $hashChecked = mysql_num_rows($q) > 0 ? false : true; // mysql_num_rows() needs the result resource
}
mysql_query("INSERT INTO `tableName` SET `hash` = '".$hash."'");

This is fairly straightforward if you're willing to use a random number to generate your short URL. For example, you can do this:
SELECT BASE64_ENCODE(CAST(RAND()*1000000 AS UNSIGNED INTEGER)) AS tag
This is capable of giving you one million different tags. To get more possible tags, increase the value by which the RAND() number is multiplied. These tag values will be hard to predict.
To make sure you don't get duplicates you need to dedupe the tag values. That's easy enough to do but will require logic in your program. Insert the tag values into a table which uses them as a primary key. If your insert fails, try again, reinvoking RAND().
If you get close to your maximum number of tags you'll start having lots of insert failures (tag collisions).
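The insert-and-retry dedupe described above can be sketched as follows (a minimal sketch using Python's sqlite3 in place of MySQL; the table name and tag range are made up for illustration):

```python
import base64
import random
import sqlite3

# In-memory table whose primary key enforces tag uniqueness.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tags (tag TEXT PRIMARY KEY)")

def new_tag():
    n = random.randrange(1_000_000)        # one million possible tags
    return base64.b64encode(str(n).encode()).decode()

def insert_unique_tag():
    while True:
        tag = new_tag()
        try:
            db.execute("INSERT INTO tags (tag) VALUES (?)", (tag,))
            return tag                     # insert succeeded: tag is unique
        except sqlite3.IntegrityError:
            continue                       # collision: re-invoke the RNG

print(insert_unique_tag())
```

The primary-key constraint does the dedupe work, so no separate SELECT is needed before the insert.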
BASE64_ENCODE comes from a stored function you need to install. You can find it here:
http://wi-fizzle.com/downloads/base64.sql
If you're using MySQL 5.6 or higher you can use the built-in TO_BASE64 function.

I wanted to do something similar (but with articles, not uploaded documents), and came up with something a bit different:
take a prime number [y] (much) larger than the max number [n] of documents there will ever be (e.g. 25000 will be large enough for the total number of documents, and 1000099 is a much larger prime number than 25001)
for the current document id [x]: (x*y) modulus (n+1)
this will generate a number between 1 and n that is never duplicated
although the URL may look like a traditional primary key, this has the slight advantage that each subsequent document gets an id which is totally unrelated to the previous one; some people also argue that not exposing the primary key has a very slight security advantage...
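The steps above can be demonstrated with small numbers (a sketch assuming n = 10 documents and the prime y = 7; any prime sharing no factor with n + 1 works):

```python
# Sketch of the (x * y) mod (n + 1) trick with small assumed values:
# n = 10 maximum documents, y = 7 a prime sharing no factor with n + 1.
def obfuscate(doc_id, prime=7, max_docs=10):
    """Map ids 1..max_docs to a scrambled id in the same range, no duplicates."""
    return (doc_id * prime) % (max_docs + 1)

scrambled = [obfuscate(x) for x in range(1, 11)]
print(scrambled)  # each value 1..10 appears exactly once, in scrambled order
assert sorted(scrambled) == list(range(1, 11))
```

Because the prime and n + 1 share no common factor, the multiplication is a bijection on 1..n, which is why no two documents can collide.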

Related

Why do sites use random alphanumeric ids rather than database ids to identify content?

Why do sites like YouTube, Imgur and most others use random characters as their content ids rather than just sequential numbers, like those created by auto-increment in MySQL?
To explain what I mean:
In the URL: https://www.youtube.com/watch?v=QMlXuT7gd1I
The QMlXuT7gd1I at the end indicates the specific video on that page, but I'm assuming that video also has a unique numeric id in the database. Why do they create and use this alphanumeric string rather than just use the video's database id?
I'm creating a site which identifies content in the URL like above, but I'm currently using just the DB id. I'm considering switching to random strings because all major sites do it, but I'd like to know why this is done before I implement it.
Thanks!
Some sites do that because of sharding.
When you have only one process (one server) writing, it is possible to generate auto-increment ids without duplicates, but when you have multiple servers (with multiple processes) writing content, like YouTube, it's no longer possible to use an auto-increment id. The cost of synchronization to avoid duplicates would be huge.
For example, if you read MongoDB's ObjectId documentation you can see this structure for the id:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
In the end, it's only 12 bytes. When you represent it in hexadecimal it looks like 24 characters, but that is only the displayed form.
Another advantage of this scheme is that the timestamp is included in the id, so you can extract the timestamp from the id.
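The four fields above can be packed into 12 bytes like this (a rough sketch; the machine identifier is simulated with random bytes rather than derived from the hostname):

```python
import os
import struct
import time

# Sketch of an ObjectId-style 12-byte id, following the layout listed above.
def make_object_id(counter):
    ts = struct.pack(">I", int(time.time()))           # 4-byte seconds since epoch
    machine = os.urandom(3)                            # 3-byte machine id (simulated)
    pid = struct.pack(">H", os.getpid() & 0xFFFF)      # 2-byte process id
    count = struct.pack(">I", counter & 0xFFFFFF)[1:]  # 3-byte counter
    return ts + machine + pid + count

oid = make_object_id(1)
print(len(oid), oid.hex())  # 12 raw bytes, 24 hex characters when displayed
```

Each server can generate ids independently because the machine and process fields keep the id spaces disjoint; only the counter needs to be local.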
First, this is not a random string; it is a base conversion that depends on the id. They go this way because an alphanumeric alphabet gives a bigger base.
Something like 99999999 could be 1NJCHR in base 36.
Take a look here, and play with the bases, and learn more about it.
You will see it is much shorter. That is the only reason I can imagine someone would go this way, and it makes sense if you have ids like 54389634589347534985348957863457438959734.
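The 99999999 example above corresponds to a base-36 conversion (digits 0-9 plus A-Z); a minimal sketch:

```python
import string

# Base-36 alphabet: 0-9 then A-Z.
ALPHABET = string.digits + string.ascii_uppercase

def to_base36(n):
    """Convert a non-negative integer to its base-36 string."""
    if n == 0:
        return "0"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = ALPHABET[r] + out
    return out

print(to_base36(99999999))  # → 1NJCHR
```

A bigger alphabet (base 62 with lowercase letters too, as YouTube-style ids use) shortens the string further.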
As self and Cameron commented/answered, there are chances (especially for YouTube) that additional security parameters like time and length are calculated into it in some way, so you are not able to guess an identifier.
In addition to Christian's answer above, using a base calculation, hashed value or other non-numeric identifier has the advantage of obscuring your db's size from competitors.
Even if you stayed with numeric and set your auto_increment to start at 50,000, increase by 50, etc., educated guesses can still be made at the db's size and growth. Non-numeric options don't eliminate this possibility, but they inhibit it to a certain extent.
There are major chances of malicious input from end users, and by not exposing sequential ids, users can't guess ids and thus can't guess how large the db is. The other answers explain the base calculation well, however.

Create unique random codes - PHP/MySQL

I have to create unique codes for each "company" in my database.
The only way I see to make this possible is to create a random number with rand() and then check whether the number already exists for this "company" in the DB; if it does, recreate it.
My question is: is there not a better, more efficient way to do this? If I am creating 10,000 codes and there are already 500,000 in the DB, it's going to get progressively slower and slower.
Any ideas or tips on perhaps a better way to do it?
EDIT:
Sorry perhaps I can explain better. The codes will not all be generated at the same time, they can be created once a day/month/year whenever.
Also, I need to be able to define the characters of the codes, for example alphanumeric or numbers only.
I recommend using a "Universally Unique Identifier" (http://en.wikipedia.org/wiki/Universally_unique_identifier) to generate your random codes for each company. This way you can avoid checking your database for duplicates:
Anyone can create a UUID and use it to identify something with
reasonable confidence that the same identifier will never be
unintentionally created by anyone to identify something else.
Information labeled with UUIDs can therefore be later combined into a
single database without needing to resolve identifier (ID) conflicts.
In PHP you can use function uniqid for this purpose: http://es1.php.net/manual/en/function.uniqid.php
MySQL's UUID Function should help. http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#function_uuid
INSERT INTO table (col1,col2)VALUES(UUID(), "someValue")
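In application code the same idea looks like this (a sketch using Python's standard uuid module; uuid4() generates a random UUID, comparable in purpose to PHP's uniqid() or MySQL's UUID()):

```python
import uuid

# Generate a UUID as the company code; the collision probability is so
# small that no duplicate check against the database is needed.
code = str(uuid.uuid4())
print(code)  # a 36-character string with four hyphens
```

The trade-off is length: 36 characters is far longer than a short custom code, so UUIDs fit best where the code is machine-facing rather than typed by users.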
If the codes are just integers, then use auto-increment, or get the current max value and start incrementing it.

Why do web sites tend to use random ids on database tables?

I wonder why many web sites choose to use random ids instead of incrementing from 1 on their database tables. I've searched without finding any good reasons; are there any?
Also, which is the best method to use? It seems quite inefficient to check whether an id already exists before inserting the data (it takes a second query).
Thanks for your help!
Under the hood, it is likely that they are using incremental ids in the database to identify rows, but the value that gets exposed to end users via the URL parameters is often made into a random string to make the sequence of available objects harder to guess.
It is really a matter of security through obscurity. It hinders automated scripts from proceeding through incremental values and attempting attacks via the URL, and it hinders automated scraping of site content.
If YouTube, for example, used incremental ids instead of values like v=HSsdaX4s, you could download every video by simply starting at v=1 and incrementing that value millions of times.
Sequential ids do not scale well (they become a synchronization bottle-neck in distributed systems).
Also, you don't need to check if a newly generated random id already exists, you can just assume that it does not (because there are so many of them).
Are you sure the ids are random, or are they encoded? Either way, it is for security.

URL Shortener algorithm

I recently bought myself a domain for personal URL shortening.
And I created a function to generate alphanumeric strings of 4 characters as reference.
BUT
How do I check whether they are already used or not? I can't check the database for every URL to see if it exists, or is this just the way it works and I have to do it?
If so, what if I have 13,000,000 URLs generated (out of 14,776,336 possible)? Do I need to keep generating strings until I find one that is not in the DB yet?
This just doesn't look the right way to do it, anyone who can give me some advise?
One memory-efficient and faster way I can think of is the following. This problem can be solved without using a database at all. The idea is that instead of storing used URLs in a database, you can store them in memory. Since storing the URLs themselves would take a lot of memory, we use a bit set (an array of bits), needing only one bit per URL.
For each random string you generate, create a hash code for it that lies between 0 and some maximum number K.
Create a bit set (basically a bit array). Whenever you use some URL, set the corresponding hash-code bit in the bit set to 1.
Whenever you generate a new URL, check whether its hash-code bit is set. If yes, discard that URL and generate a new one. Repeat the process until you get an unused one.
This way you avoid the DB forever, your lookups are extremely fast, and it takes the least amount of memory.
I borrowed the idea from this place
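The bit-set scheme above can be sketched as follows (one caveat: two different codes can hash to the same bit, so an occasional unused code gets discarded as "used", like a one-hash Bloom filter):

```python
# Sketch of the bit-set approach: one bit per possible code slot.
K = 14_776_336                    # 62^4 possible 4-character codes
used = bytearray(K // 8 + 1)      # one bit per slot, about 1.8 MB

def bit_index(code):
    return hash(code) % K

def mark_used(code):
    i = bit_index(code)
    used[i // 8] |= 1 << (i % 8)

def is_used(code):
    i = bit_index(code)
    return bool(used[i // 8] & (1 << (i % 8)))

assert not is_used("aB3x")        # nothing marked yet
mark_used("aB3x")
assert is_used("aB3x")
```

Note the structure lives in process memory only; it must be rebuilt from the stored URLs on restart, so in practice it complements rather than fully replaces the database.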
A compromise solution is to generate a random id, and if it is already in the database, find the first empty id that is bigger than it. (Wrapping around if you can't find any empty space in the range above.)
If you don't want the ids to be unguessable (you probably don't if you only use 4 characters), this approach works fine and is quick.
One algorithm is to try a few times to find a free URL of N characters; if one still is not found, increase N. Start with N = 4.

MySQL key dilemma

I have a table that caches data to MySQL (shared hosting, so no memcached).
The concept is this:
I have a page that loads (static) data and then caches it:
If a page does not exist in the cache, it executes 12 queries (menu, page content, SEO, product list, etc.), then renders the HTML and saves it to the cache table. Otherwise, it serves the cached HTML directly.
The cache table is like this:
=cache=
url varchar(255) - primary key
page mediumtext
Now I think I'm doing the right thing, based on what I have (shared host, no caching like memcached, etc.) but my question is this:
The URL column is a varchar index, but numeric IDs (like INT) are faster. Is there a way to convert a URL like /contact-us/ or /product-category/product-name/ to a unique integer? Or is there any other way to optimize this?
I would create some form of hash which would allow a shorter key. In many cases something simple like a hash of the request path may be viable; alternatively, something even simpler like CRC32('/your/path/here') may be suitable as a primary key in your situation. In this example the following columns would exist:
urlCRC INT(11) UNSIGNED NOT NULL (PRIMARY KEY)
url VARCHAR(255) NOT NULL
page MEDIUMTEXT
You could then take this a step further and add a BEFORE INSERT trigger which would calculate the value for urlCRC, i.e. containing:
NEW.urlCRC = CRC32(NEW.url)
You could then create a stored procedure which takes as argument inURL (string), and internally it would do
SELECT * FROM cacheTable WHERE urlCRC = CRC32(inURL);
If the number of rows returned is 0, then you can trigger logic to cache it.
This may of course be overkill, but would provide you a numeric key to work on which, assuming there are no conflicts, would suffice. By storing the url as VARCHAR(255) also then if a conflict does occur, you can easily regenerate new hashes using a different algorithm.
Just to be clear, I just use CRC32() as an example off the top of my head, chances are there are more suitable algorithms. The main point to take away is that a numeric key is more efficient to search so if you can convert your strings into unique numerics it would be more efficient when retrieving data.
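The CRC32 idea maps directly onto standard library functions; a sketch in Python (zlib.crc32 computes the same standard CRC-32 checksum family as MySQL's CRC32()):

```python
import zlib

# Derive a numeric cache key from the request path, as suggested above.
# CRC32 is only 32 bits, so collisions are possible; keeping the url
# column alongside urlCRC lets you detect and resolve them.
def url_crc(path):
    return zlib.crc32(path.encode("utf-8"))  # unsigned 32-bit integer

key = url_crc("/product-category/product-name/")
print(key)
assert 0 <= key < 2 ** 32
```

The application would compute this value before querying, so the lookup hits the integer index instead of comparing varchar keys.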
Changing your url column to a fixed-size string would make indexing slightly faster, if there wasn't another dynamically-sized column (TEXT) in the table. Converting it to an integer would be possible, depending on your URL structure - you could also use some kind of hash function. But why don't you make your life easier?
You could save your cache results directly to disk and create a mod_rewrite rule (put it in your .htaccess file) that matches if the file exists, and otherwise invokes the PHP script. This would have 2 advantages:
If the cache is hot, PHP will not run. This saves time and memory.
If the file is requested often and it is small enough (or you have lots of RAM), it will be held in the RAM. This is much faster than MySQL.
Select all cached URLs matching the hash, then search for the exact URL among the hash collisions:
select page from (select * from cache where HASHEDURL = STOREDHASH) where url = 'someurl'
