URL Shortener algorithm

URL Shortener algorithm - php

I recently bought myself a domain for personal URL shortening, and I wrote a function that generates 4-character alphanumeric strings to use as references.
But how do I check whether a string is already in use? Surely I can't query the database for every new URL, or is that just how it works and I have to do it?
If so, what happens when 13.000.000 of the 14.776.336 possible strings are already taken? Do I have to keep generating strings until I find one that is not in the DB yet?
This doesn't look like the right way to do it. Can anyone give me some advice?

One memory-efficient and fast approach is the following. The uniqueness check can be done without touching the database: instead of looking up used URLs in the database, you keep track of them in memory. Storing the strings themselves would take a lot of memory, so use a bit set (an array of bits) and spend only one bit per URL.
For each random string you generate, compute a hash code that lies between 0 and some maximum K.
Create a bit set of K+1 bits. Whenever you use a URL, set the bit at its hash code to 1.
Whenever you generate a new URL, check whether its hash code bit is set. If it is, discard that URL and generate a new one. Repeat until you get an unused one.
This way the uniqueness check never hits the DB, lookups are extremely fast, and memory use is minimal.
I borrowed the idea from this place
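A minimal sketch of the bit set idea in PHP. It assumes a 62-character alphabet and reads each 4-character code as a base-62 number, so the "hash code" is collision-free; the names shortCodeToIndex and UsedCodeBitSet are made up for illustration. The bit set lives only in the memory of the running process, so it would have to be rebuilt or persisted separately.
```php
<?php
// Assumed alphabet: digits, lowercase, uppercase (62 symbols).
const SHORT_ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// Read a code such as "a9Zk" as a base-62 number: one distinct integer per code.
function shortCodeToIndex(string $code): int
{
    $index = 0;
    foreach (str_split($code) as $char) {
        $digit = strpos(SHORT_ALPHABET, $char);
        if ($digit === false) {
            throw new InvalidArgumentException("Invalid character: $char");
        }
        $index = $index * 62 + $digit;
    }
    return $index;
}

// A bit array backed by a binary string: one bit per possible code.
final class UsedCodeBitSet
{
    private string $bytes;

    public function __construct(int $bitCount)
    {
        $this->bytes = str_repeat("\0", intdiv($bitCount + 7, 8));
    }

    public function isSet(int $index): bool
    {
        return (ord($this->bytes[$index >> 3]) & (1 << ($index & 7))) !== 0;
    }

    public function set(int $index): void
    {
        $byte = $index >> 3;
        $this->bytes[$byte] = chr(ord($this->bytes[$byte]) | (1 << ($index & 7)));
    }
}

$used = new UsedCodeBitSet(62 ** 4); // 14,776,336 bits, roughly 1.8 MB
$used->set(shortCodeToIndex('a9Zk'));
var_dump($used->isSet(shortCodeToIndex('a9Zk'))); // bool(true)
var_dump($used->isSet(shortCodeToIndex('a9Zl'))); // bool(false)
```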

A compromise solution is to generate a random id, and if it is already in the database, find the first empty id that is bigger than it. (Wrapping around if you can't find any empty space in the range above.)
If you don't need the ids to be unguessable (you probably don't if you only use 4 characters), this approach works fine and is quick.
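The probing step can be sketched like this, assuming ids are integers in the range 0 to capacity-1. The $taken array is a hypothetical stand-in for the database lookup, and nextFreeId is an illustrative name.
```php
<?php
// $taken maps used ids to true; in a real system this would be a database query.
function nextFreeId(int $start, int $capacity, array $taken): ?int
{
    for ($offset = 0; $offset < $capacity; $offset++) {
        $candidate = ($start + $offset) % $capacity; // wrap around at the end of the range
        if (!isset($taken[$candidate])) {
            return $candidate;
        }
    }
    return null; // every id is in use
}

$taken = [7 => true, 8 => true, 9 => true];
var_dump(nextFreeId(7, 10, $taken)); // int(0): 7, 8 and 9 are taken, so the search wraps
```
In practice the starting point would be random, for example random_int(0, $capacity - 1).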

One algorithm is to try a few times to find a free URL of N characters and, if none is found, to increase N. Start with N=4.
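A rough sketch of that retry-then-grow loop. The $isTaken callback is hypothetical (it would normally query the database), and the alphabet and the number of tries per length are assumptions.
```php
<?php
// Build a random code of the given length from an assumed 62-character alphabet.
function randomCode(int $length): string
{
    $alphabet = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $code = '';
    for ($i = 0; $i < $length; $i++) {
        $code .= $alphabet[random_int(0, strlen($alphabet) - 1)];
    }
    return $code;
}

// Try a few random codes of length N; if all are taken, move on to N+1.
function allocateCode(callable $isTaken, int $startLength = 4, int $triesPerLength = 5): string
{
    for ($length = $startLength; ; $length++) {
        for ($try = 0; $try < $triesPerLength; $try++) {
            $code = randomCode($length);
            if (!$isTaken($code)) {
                return $code;
            }
        }
    }
}

// Example: pretend every code shorter than 6 characters is already used.
echo strlen(allocateCode(fn (string $code): bool => strlen($code) < 6)), "\n"; // 6
```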

Related

Using VARCHAR in MySQL for everything! (on small or micro sites)

I tried searching for this as I felt it would be a commonly asked beginner's question, but I could only find things that nearly answered it.
We have a small PHP app that is at most used by 5 people (total, ever) and maybe 2 simultaneously, so scalability isn't a concern.
However, I still like to do things in a best practice manner, otherwise bad habits form into permanent bad habits and spill into code you write that faces more than just 5 people.
Given this context, my question is: is there any strong reason to use anything other than VARCHAR(250+) in MySQL for a small PHP app that is constantly evolving/changing? If I picked INT but that later needed to include characters, it would be annoying to have to go back and change it when I could have just future-proofed it and made it a VARCHAR to begin with. In other words, choosing anything other than VARCHAR with a large character count seems pointlessly limiting for a small app. Is this correct?
Thanks for reading and possibly answering!

If you have the numbers 1 through 12 in VARCHAR, and you need them in numerical order, you get 1,10,11,12,2,3,4,5,6,7,8,9. Is that OK? Well, you could fix it in SQL by saying ORDER BY col+0. Do you like that kludge?
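The same ordering effect can be reproduced in PHP, which may make it easier to see what is going on. This is only an illustration of string versus numeric comparison, not of MySQL itself.
```php
<?php
$values = array_map('strval', range(1, 12)); // "1" .. "12" stored as strings

sort($values, SORT_STRING);                  // compare character by character
echo implode(',', $values), "\n";            // 1,10,11,12,2,3,4,5,6,7,8,9

sort($values, SORT_NUMERIC);                 // compare as numbers
echo implode(',', $values), "\n";            // 1,2,3,4,5,6,7,8,9,10,11,12
```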

One of the major drawbacks will be that you will have to add consistency checks in your code. For a small, private database, no problem. But for larger projects...
Using the proper types will do a lot of checks automatically. E.g., are there any wrong characters in the value; is the date valid...
As a bonus, it is easy to add extra constraints when using the right types: is the age less than 110; is the start date before the end date; does the value refer to an existing value in another table?
I prefer to make the types as specific as possible. Although server errors can be nasty and hard to debug, it is way better than having a database that is not consistent.

It is probably not a great idea to make a habit of it, as it will become inefficient with any real amount of data. If you use the text type, the storage space used for the same amount of data will differ depending on your storage engine.
If you do as you suggested, don't forget that all values that would normally be of a numeric type will come back to PHP as strings. For example, if you store the value "123" as a varchar or text type and retrieve it as $someVar, you may want to do:
$someVar = intval($someVar);
PHP does convert numeric strings automatically in arithmetic, but without the conversion the value is still a string, which matters for strict comparisons (===) and other type-sensitive code.
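A short illustration of how PHP treats a numeric string such as the one a VARCHAR column would return:
```php
<?php
$someVar = "123";            // what a VARCHAR column hands back

var_dump($someVar + 1);      // int(124): numeric strings are converted in arithmetic
var_dump($someVar === 123);  // bool(false): it is still a string under strict comparison

$someVar = intval($someVar);
var_dump($someVar === 123);  // bool(true)
```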

As you may already know, VARCHAR columns are variable-length strings, so a value only takes up as much space as it needs.
VARCHAR is stored inline with the table, which makes it fast when the size is reasonable.
If your app needs performance, you can go with CHAR, which is a little faster than VARCHAR.

Why do sites use random alphanumeric ids rather than database ids to identify content?

Why do sites like YouTube, Imgur and most others use random characters as their content ids rather than just sequential numbers, like those created by auto-increment in MySQL?
To explain what I mean:
In the URL: https://www.youtube.com/watch?v=QMlXuT7gd1I
The QMlXuT7gd1I at the end indicates the specific video on that page, but I'm assuming that video also has a unique numeric id in the database. Why do they create and use this alphanumeric string rather than just use the video's database id?
I'm creating a site which identifies content in the URL like above, but I'm currently using just the DB id. I'm considering switching to random strings because all major sites do it, but I'd like to know why this is done before I implement it.
Thanks!

Some sites do that because of sharding.
When you have only one process (one server) writing, it is possible to generate auto-increment ids without duplicates. But when you have multiple servers (with multiple processes) writing content, like YouTube, auto-increment ids are no longer practical: the cost of the synchronization needed to avoid duplicates would be huge.
For example, if you read MongoDB's ObjectId documentation you can see this structure for the id:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
In the end, it is only 12 bytes. When you represent it in hexadecimal it looks like 24 characters, but that is only how it is displayed.
Another advantage of this system is that the timestamp is included in the id, so you can decode the id to get the timestamp.
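As a sketch of that last point: the first 4 bytes (8 hex characters) of an ObjectId are the seconds since the Unix epoch, so they can be read back with plain string functions. The helper name objectIdTimestamp is made up, and the id below is just a sample value.
```php
<?php
// Read the creation time out of a 24-character hexadecimal ObjectId string.
function objectIdTimestamp(string $hexId): int
{
    if (!preg_match('/^[0-9a-f]{24}$/i', $hexId)) {
        throw new InvalidArgumentException('Expected 24 hex characters');
    }
    return (int) hexdec(substr($hexId, 0, 8)); // first 4 bytes = Unix timestamp
}

$seconds = objectIdTimestamp('507f1f77bcf86cd799439011');
echo $seconds, "\n";                        // 1350508407
echo gmdate('Y-m-d H:i:s', $seconds), "\n"; // 2012-10-17 21:13:27
```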

First, this is not a random string; it is a base conversion that depends on the id. They do it this way because an alphanumeric alphabet has a bigger base.
Something like 99999999 could be 1NJCHR
Take a look here, and play with the bases, and learn more about it.
You will see it is much shorter. That is the only reason I can imagine someone would go this way, and it makes sense if you have ids like 54389634589347534985348957863457438959734
As self and Cameron commented/answered, there is a good chance (especially for YouTube) that additional security parameters such as time and length are factored into it in some way, so that you cannot guess an identifier.
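The 99999999 example can be checked with PHP's built-in base_convert, here using base 36 (digits plus letters). Note that base_convert loses precision on very large numbers, so an id as long as the one above would need an arbitrary-precision approach instead.
```php
<?php
$id = 99999999;

// Decimal id -> base 36 (0-9, a-z), upper-cased for display.
$short = strtoupper(base_convert((string) $id, 10, 36));
echo $short, "\n"; // 1NJCHR

// And back again: base 36 -> decimal.
echo (int) base_convert($short, 36, 10), "\n"; // 99999999
```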

In addition to Christian's answer above, using a base calculation, hashed value or other non-numeric identifier has the advantage of obscuring your db's size from competitors.
Even if you stayed with numeric and set your auto_increment to start at 50,000, increase by 50, etc., educated guesses can still be made at the db's size and growth. Non-numeric options don't eliminate this possibility, but they inhibit it to a certain extent.

There is a real risk of malicious input from end users. If you don't expose sequential ids, users can't guess other ids and therefore can't guess how large the db is. The other answers on base calculation explain the rest well.

Unique url from primary key

I'm trying to create a URL similar to youtube's /v=xxx in look and in behavior. In short, users will upload files and be able to access them through that URL. This URL code needs to be some form of the database's primary key so the page can gather the data needed. I'm new to databases and this is more of a database problem than anything.
In my database I have an auto-increment primary key by which file data is accessed. I want to use that number to create the URL for files. I started looking into different hash functions, but I'm worried about collisions. I don't want the same URL for two different files.
I also considered using uniqid() as my primary key CHAR(13), and just use that directly. But with this I'm worried about efficiency. Also looking around I can't seem to find much about it so it's probably a strange idea. Not to mention I would need to test for collisions when ids are generated which can be inefficient. Auto increment is a lot easier.
Is there any good solution to this? Will either of my ideas work? How can I generate a unique URL from an auto incremented primary key and avoid collisions?
I'm leaning toward my second idea. It won't be very efficient, but the largest performance cost comes when things are added to the database (testing for collisions), which for the end user only happens once. The other performance cost will probably be in looking up chars instead of ints. But I'm mainly worried that it's bad practice.
EDIT:
A simple solution would be to just use the auto-incremented value directly. Call me picky, but that looks kind of ugly.

Generating non-colliding short hashes will indeed be a headache. Instead, the slug format used by Stack Overflow is very promising and is guaranteed to produce non-duplicate URLs.
For example, this very same question has
https://stackoverflow.com/questions/11991785/unique-url-from-primary-key
Here, it has the unique primary key and also the title, which makes it more SEO friendly.
However, as commented, there are a few previously asked questions that might clarify why what you are trying is better left alone:
How to generate a unique hash for a URL?
Create Tinyurl style hash
Creating short hashes increases the chance of a collision a lot, so it is better to use a long digest such as sha512 (keep in mind that base64 is an encoding, not a hash function, so it only changes how the value is written).

You can simply make a hash of the time, and afterwards check for that hash (or part of that hash) in your DB.
If you set an index on that field in your DB (and make sure the hash is long enough to avoid frequent collisions), it won't be an issue at all time-wise.
<?php
// Assumes $pdo is an open PDO connection (the old mysql_* functions were removed in PHP 7)
// and that `hash` has an index, ideally a UNIQUE one.
$check = $pdo->prepare('SELECT 1 FROM `tableName` WHERE `hash` = ?');
do {
    // varchar 8 (make sure that is enough with a very big margin)
    $hash = substr(sha1(time() . mt_rand(9999, 99999999)), 0, 8);
    $check->execute([$hash]);
    $taken = $check->fetchColumn() !== false; // a row came back, so the hash is in use
} while ($taken);
$pdo->prepare('INSERT INTO `tableName` (`hash`) VALUES (?)')->execute([$hash]);

This is fairly straightforward if you're willing to use a random number to generate your short URL. For example, you can do this:
SELECT BASE64_ENCODE(CAST(RAND()*1000000 AS UNSIGNED INTEGER)) AS tag
This is capable of giving you one million different tags. To get more possible tags, increase the value by which the RAND() number is multiplied. These tag values will be hard to predict.
To make sure you don't get duplicates you need to dedupe the tag values. That's easy enough to do but will require logic in your program. Insert the tag values into a table which uses them as a primary key. If your insert fails, try again, reinvoking RAND().
If you get close to your maximum number of tags you'll start having lots of insert failures (tag collisions).
BASE64_ENCODE comes from a stored function you need to install. You can find it here:
http://wi-fizzle.com/downloads/base64.sql
If you're using MySQL 5.6 or higher you can use the built-in TO_BASE64 function.

I wanted to do something similar (but with articles, not uploaded documents), and came up with something a bit different:
take a prime number [y] (much) larger than the max number [n] of documents there will ever be (e.g. 25000 will be large enough for the total number of documents, and 1000099 is a much larger prime number than 25001)
for the current document id [x]: (x*y) modulus (n+1)
this will generate a number between 1 and n that is never duplicated
although the url may look like a traditional primary key, it does have the slight advantage that each subsequent document gets an id that is not simply the previous one plus one (consecutive ids still differ by a fixed step modulo n+1, so the pattern can be worked out); some people also argue that not exposing the primary key has a very slight security advantage...
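The steps above can be sketched as follows, using the numbers from the example. The function names are illustrative. The mapping is collision-free as long as y and n+1 share no common factor, which the sketch checks explicitly rather than relying on y being prime.
```php
<?php
// Greatest common divisor (Euclid's algorithm).
function gcdInt(int $a, int $b): int
{
    while ($b !== 0) {
        [$a, $b] = [$b, $a % $b];
    }
    return $a;
}

// Map document id x (1..n) to (x * y) mod (n + 1), which is again in 1..n.
function scrambleId(int $x, int $y, int $n): int
{
    if ($x < 1 || $x > $n) {
        throw new InvalidArgumentException('x must be between 1 and n');
    }
    if (gcdInt($y, $n + 1) !== 1) {
        throw new InvalidArgumentException('y and n + 1 must be coprime');
    }
    return ($x * $y) % ($n + 1);
}

$n = 25000;   // maximum number of documents (from the example)
$y = 1000099; // multiplier (from the example)

echo scrambleId(1, $y, $n), "\n"; // 59
echo scrambleId(2, $y, $n), "\n"; // 118
```
As the output shows, consecutive ids differ by y mod (n+1), which is 59 with these numbers, so this is light obfuscation rather than anything a determined visitor could not reverse.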

Why do web sites tend to use random id:s on database tables?

I wonder why many web sites choose to use random ids instead of incrementing from 1 on their database tables. I've searched without finding any good reasons. Are there any?
Also, which is the best method to use? It seems quite inefficient to check if an id already exists before inserting the data, (takes a second query).
Thanks for your help!

Under the hood, it is likely that they are using incremental ids in the database to identify rows, but the value that gets exposed to end users via the URL parameters is often made into a random string to make the sequence of available objects harder to guess.
It is really a matter of security through obscurity. It hinders automated scripts from proceeding through incremental values and attempting attacks via the URL, and it hinders automated scraping of site content.
If YouTube, for example, used incremental ids instead of values like v=HSsdaX4s, you could download every video by simply starting at v=1 and incrementing that value millions of times.

Sequential ids do not scale well (they become a synchronization bottleneck in distributed systems).
Also, you don't need to check whether a newly generated random id already exists. If the id space is large enough, you can just assume that it does not (because there are so many possible values).
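A minimal sketch of such an id, assuming 128 random bits are considered "large enough"; randomId is an illustrative name. A unique index on the column is still a cheap safety net in case the assumption ever fails.
```php
<?php
// 16 random bytes from the system CSPRNG, written as 32 hex characters.
function randomId(int $bytes = 16): string
{
    return bin2hex(random_bytes($bytes));
}

echo randomId(), "\n";
```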

Are you sure that the ids are random? Or are they encoded? Either way, it is for security.

Codes for scratch cards

I'll try to be simple, clear and direct. My problem is the following: I have a project where I need to generate codes for scratch cards. The scratch cards are printed like the ones you use for charging your mobile phone.
The system is that people buy the cards, get the codes on the cards, then call a TOIP server (Asterisk) and enter the code to access a service. Each caller gets three attempts to enter the right code.
I thought of writing a PHP program to generate these codes, so I surely need to go through a PRNG (Pseudo Random Number Generator). My constraints are:
-As people are entering the code over the phone, the code shouldn't be too long, but long enough to ensure security.
-I need the system to be fast enough when the comparison is made between the code entered and the one stored in the database (needed for statistics purposes).
So my questions are:
-Is it right to use a PRNG?
-If yes, do you know one strong enough to generate good random numbers?
-What standards are used by the industry?
-How can I make the comparison algorithm fast enough if the comparison is made against millions of codes?
Thanks for your time and answers.

Yes, a PRNG will work fine with a little tweaking.
http://en.wikipedia.org/wiki/Random_password_generator
You can refer to the password generator code at the link above. Make sure the first digit is not 0, and use only digits, not letters.
Once a number is generated, check whether it already exists in the DB before you insert it.
Normally the industry uses 16-digit codes. You can also generate 20-digit numbers to make the whole process faster.
To make matching faster, index the field in the database. Most probably it will be a CHAR(16) or CHAR(20).
Note: as there is no need for VARCHAR here, CHAR is the best option.
Keep the MySQL table engine as MyISAM for fast comparison.
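A sketch of the generation step, assuming the codes must be hard to predict. It uses random_int, PHP's cryptographically secure generator, rather than rand() or mt_rand(); generateCardCode is an illustrative name. The uniqueness check against the database is left out here.
```php
<?php
// Build a code of the given number of digits; the first digit is never 0.
function generateCardCode(int $digits = 16): string
{
    $code = (string) random_int(1, 9);
    for ($i = 1; $i < $digits; $i++) {
        $code .= random_int(0, 9);
    }
    return $code;
}

echo generateCardCode(), "\n";   // 16 digits
echo generateCardCode(20), "\n"; // 20 digits
```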
