md5 hash for urls in unique Index - php

I asked a slightly different version of this question before, but did not get the answer I was looking for.
My question is: do I need to store md5($url) in a unique index in MySQL? I have seen this done in some code, though I don't remember where. This is a large database with more than 5 million URLs, and rows are looked up by URL.
Any ideas?

I don't think you should hash your URLs. The only plausible reason would be to save space (if most of the URLs are longer than the 32 characters of an MD5 hex digest), at the expense of a risk of collisions.
What you should do is normalize the URLs.
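
As a rough illustration of what normalizing could mean here, the sketch below lowercases the scheme and host, drops default ports, and strips the fragment. The helper name normalize_url and the exact set of rules are assumptions; the answer does not say which rules to apply.

```php
<?php
// Hypothetical helper: the normalization rules chosen here are assumptions.
function normalize_url(string $url): ?string
{
    $parts = parse_url(trim($url));
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // not an absolute URL we can normalize
    }

    $scheme = strtolower($parts['scheme']);
    $host   = strtolower($parts['host']);

    // Keep the port only when it is not the default for the scheme.
    $defaultPorts = ['http' => 80, 'https' => 443];
    $port = '';
    if (isset($parts['port']) && ($defaultPorts[$scheme] ?? null) !== $parts['port']) {
        $port = ':' . $parts['port'];
    }

    $path  = $parts['path'] ?? '/';
    $query = isset($parts['query']) ? '?' . $parts['query'] : '';

    // The fragment is dropped: it is never sent to the server.
    return $scheme . '://' . $host . $port . $path . $query;
}

echo normalize_url('HTTP://Example.COM:80/a?b=1#frag'), PHP_EOL;
```

With URLs stored in one canonical form like this, a unique index on the URL column (or on a hash of it, if you decide you need one) sees each page only once.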

Some sites store hashes of URLs in the database because the hashes appear in their own URLs, for example to redirect users to an external URL. If that is not your case, I can't see any reason to do this.

Are you saying that the URL is called like this:
www.yourdomain.com?id=89ce9250e9f469c9d1816e1cc0fb47a1
and then the id (89ce9250e9f469c9d1816e1cc0fb47a1, which is an md5() of the real URL query string) is looked up in the database to resolve the actual URL, which could be:
www.yourdomain.com?user=23&location=5&eventtype=23&year=2010
Is this the kind of usage you're referring to?
jim

Related

How to properly store images (with numeric names) in server's filesystem

I have a lot of records in the database, and each record will have an image. I'm pretty confused about how to store the images.
I want the access route to be something like /img/record-id.jpg (i.e. /img/15178.jpg).
Alright, storing all images inside /img/ isn't a good thing, because there will be many.
In this question it is suggested to reverse the name of the image, so the example above would be stored under /img/78/51/15178.jpg. The suggestion doesn't give further information about other scenarios (and to me they are not obvious). What happens if the id is a low number like 5, 15, 128 or 1517? (This is asked in the last comment on that answer.)
Leaving that aside, let's remember I want the path to be /img/15178.jpg. I'd redirect the request using Apache, but for that I'd have to type at least 3 or more rules for different id numbers:
^/img/(\d)(\.jpg)$ /img/$1$2
^/img/(\d\d)(\.jpg)$ /img/$1/$1$2
^/img/(\d\d)(\d\d)(\.jpg)$ /img/$1/$2/$1$2$3
And so on?
This doesn't seem to be a nice solution, although it would work just fine.
Another option I can think of: take the MD5 of the image, store it in its respective record, redirect the request to a PHP script and let it take care of the output.
The script would look up the MD5 for the id in the database, build the actual path out of the hash and output the image. This solution is neater, but it involves the database and PHP just to output an image, which sounds like a bit too much.
I really don't know what to do here. Mind giving me some advice?
You have already written the perfect answer! Professionals do it exactly as you (or the guy in the linked question) describe: build a deep directory structure that fits your needs. I have done this with 16 million pictures, and it worked perfectly.
I did it like this:
/firstCharacter/secondCharacter/...
Files with short names, like 5.jpg, will be in /5/5.jpg
EDIT: to keep performance high, I'm totally against any further PHP processing, like salting, md5, etc. Keep it straight and simple.
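
A minimal sketch of that layout, assuming one directory level per leading character of the id and a fixed maximum depth. The function name image_path and the $baseDir and $depth parameters are illustrative; the answer does not say how deep the tree went.

```php
<?php
// Hypothetical helper; $baseDir and $depth are illustrative parameters.
// Assumes non-negative integer ids.
function image_path(int $id, string $baseDir = '/img', int $depth = 2): string
{
    $name = (string) $id;
    // One directory level per leading character, up to $depth levels.
    $dirs = str_split(substr($name, 0, $depth));
    return $baseDir . '/' . implode('/', $dirs) . '/' . $name . '.jpg';
}

echo image_path(5), PHP_EOL;     // /img/5/5.jpg
echo image_path(15178), PHP_EOL; // /img/1/5/15178.jpg
```

Because the path depends only on the id, the same mapping can be expressed as a handful of rewrite rules, so no PHP has to run when an image is served.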

PHP Obfuscate - Different Values

I'm trying to obfuscate an ID (e.g. 1) so that it produces a different value each time.
So if I obfuscate 1, it may give me different values such as ADHU6767asD or hiuy76FY, and when I un-obfuscate either of them, I get 1 back.
Any idea on how to do this?
Thanks!
** EDIT :
When I access a page of my PHP application (page.php?id=1), where the id 1 loads specific information from the database, I want to obfuscate this id into an alphanumeric string.
I also don't want the obfuscated string to always have the same value (e.g. 1 is ALWAYS ABC543).
I'm also not interested in storing the obfuscated value in a database.
There's a nice example by Ray Morgan of creating a tamper-proof user-id obfuscation scheme that does not require database storage of the id's encoded form :
http://raymorgan.net/web-development/how-to-obfuscate-integer-ids/
Another approach would be to use symmetric (bi-directional) encryption (concat'ing id with a salt) with AES...
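
A rough sketch of the AES idea, assuming PHP's openssl extension is available. Instead of concatenating a salt, it uses a fresh random IV on every call, which is what makes the same id produce a different string each time. The function names, the SHA-256 key derivation and the hex output format are all assumptions made for illustration.

```php
<?php
// Sketch only: key derivation and the hex output format are assumptions.
function obfuscate_id(int $id, string $secret): string
{
    $cipher = 'aes-256-cbc';
    $key = hash('sha256', $secret, true); // 32-byte key derived from the secret
    // A fresh random IV makes the output differ on every call.
    $iv = random_bytes(openssl_cipher_iv_length($cipher));
    $ciphertext = openssl_encrypt((string) $id, $cipher, $key, OPENSSL_RAW_DATA, $iv);
    // Hex keeps the result alphanumeric; the IV travels with the ciphertext.
    return bin2hex($iv . $ciphertext);
}

function deobfuscate_id(string $token, string $secret): ?int
{
    $cipher = 'aes-256-cbc';
    $key = hash('sha256', $secret, true);
    $ivLength = openssl_cipher_iv_length($cipher);
    $isHex = ctype_xdigit($token) && strlen($token) % 2 === 0;
    $raw = $isHex ? hex2bin($token) : false;
    if ($raw === false || strlen($raw) <= $ivLength) {
        return null; // malformed token
    }
    $plain = openssl_decrypt(
        substr($raw, $ivLength), $cipher, $key, OPENSSL_RAW_DATA, substr($raw, 0, $ivLength)
    );
    return ($plain !== false && ctype_digit($plain)) ? (int) $plain : null;
}

$token = obfuscate_id(1, 'my secret');
echo $token, ' -> ', deobfuscate_id($token, 'my secret'), PHP_EOL;
```

Nothing needs to be stored in the database, but the tokens are long (64 hex characters here), and CBC on its own does not authenticate the token. If tampering matters, an HMAC over the token would be a reasonable addition.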
The obfuscation protocol could use filename and/or function name as a salt for the obfuscation that will happen. That way you would only see the same IDs in the same file or function and you can have a repeatable process. Otherwise you would have to have many different ways for keeping track of how you obfuscated each file or function.

Random URL like megaupload

I need to sell pictures. I need to create a megaupload-like system that generates random URLs like this: "http://download.server.com/7fdfug87g89f7g98fd7g/image.jpg", associated with the session and IP address.
I'm using PHP, Apache or Nginx.
How can I achieve this?
Any Ideas?
Use mod_rewrite in the .htaccess file to redirect requests matching patterns you define to a PHP file, 'index.php' perhaps.
This way you can pass the requested string as a URL parameter to the page. Then, in the script, you can use the parameter to find and return the related image.
It's called 'URL rewriting', and it is how sites with meaningful URLs work, such as Stack Overflow.
As for uniqueness: rather than bare hash codes, you'll probably need to keep a DB table that maps the codes to files. The codes can then be totally random, of any length you wish, and never collide, because during assignment you create a new random one if the one you just created collides with one already in the DB. You can also add the plain IP and session info to the DB record. This also removes the need for any heavy hash calculations.
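
The generate-and-retry idea can be sketched as follows. The array $codes stands in for the DB table, and the function name, code length and record fields are assumptions for illustration only.

```php
<?php
// Sketch: $existing stands in for a lookup against the database table.
function new_download_code(array $existing, int $bytes = 10): string
{
    do {
        $code = bin2hex(random_bytes($bytes)); // 2 hex chars per byte
    } while (isset($existing[$code]));         // regenerate on collision
    return $code;
}

$codes = []; // code => record with file, IP and session info
$code = new_download_code($codes);
$codes[$code] = ['file' => 'image.jpg', 'ip' => '203.0.113.7', 'session' => 'abc123'];

echo 'http://download.server.com/' . $code . '/image.jpg', PHP_EOL;
```

With a real database, a unique index on the code column plus a retry on a duplicate-key error would probably be safer than check-then-insert, since two requests could otherwise pick the same code at the same moment.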
Something like md5 would be within reason.
$my_seed = "something random here";
$path = md5($my_seed . $_SESSION['something'] . $_SERVER['REMOTE_ADDR']);
echo "http://download.server.com/" . $path . "/" . $file;
That should give you a pretty unique path to put files in that would be rare to collide. You should still check if the previous hash'd path exists though.
Use one of the popular hash functions out there, such as MD5. PHP has md5() built in.
I usually use a hashing function like sha1 or md5 to generate a pseudorandom string of hexadecimal digits based on the current time plus some other bit of data unique to whatever the URL is about.

URL shortening: using inode as short name?

The site I am working on wants to generate its own shortened URLs rather than rely on a third party like tinyurl or bit.ly.
Obviously I could keep a running count of new URLs as they are added to the site and use that to generate the short URLs. But I am trying to avoid that if possible, since it seems like a lot of work just to make this one thing work.
As the things that need short URLs are all real physical files on the webserver my current solution is to use their inode numbers as those are already generated for me ready to use and guaranteed to be unique.
function short_name($file) {
    // @ suppresses the warning if the file cannot be read
    $ino = @fileinode($file);
    $s = base_convert($ino, 10, 36);
    return $s;
}
This seems to work. Question is, what can I do to make the short URL even shorter?
On the system where this is being used, the inodes for newly added files are in a range that makes the function above return a string 7 characters long.
Can I safely throw away some (half?) of the bits of the inode? And if so, should it be the high bits or the low bits?
I thought of using the crc32 of the filename, but that actually makes my short names longer than using the inode.
Would something like this have any risk of collisions? I've been able to get down to single digits by picking the right value of "$referencefile".
function short_name($file, $referencefile) {
    $ino = @fileinode($file);
    // arbitrarily selected pre-existing file,
    // as all newer files will have higher inodes
    $ino = $ino - @fileinode($referencefile);
    $s = base_convert($ino, 10, 36);
    return $s;
}
I'm not sure this is a good idea: if you have to change servers, or change the disk or reformat it, the inode numbers of your files will most probably change, and all your short URLs will be broken.
The same goes if, for any reason, you need to move your files to another partition of your disk.
Another idea might be to calculate a crc/md5/whatever of the file's name, as you suggested, and use some algorithm to "shorten" it.
Here are a few articles about that:
Create short IDs with PHP - Like Youtube or TinyURL
Using Php and MySQL to create a short url service!
Building a URL Shortener
Rather clever use of the filesystem there. If you are guaranteed that inode ids are unique, it's a quick way of generating unique numbers. I wonder whether this could work consistently over NFS, because different machines will obviously have different inode numbers. You'd then just serialize the link info in the file you create there.
To shorten the urls a bit, you might take case sensitivity into account, and do one of the safe encodings (you'll get about base62 out of it - 10 [0-9] + 26 (a-z) + 26 (A-Z), or less if you remove some of the 'conflict' letters like I vs l vs 1... there are plenty of examples/libraries out there).
You'll also want to 'home' your ids with an offset, like you said. You will also need to figure out how to keep temp file/log file, etc creation from eating up your keyspace.
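
As a rough sketch of the base62 idea (PHP's base_convert() stops at base 36), here is a hand-rolled encoder and decoder. The function names and the digit order in the alphabet are assumptions; any fixed ordering of the 62 characters works as long as both directions agree.

```php
<?php
// Digit order is an arbitrary choice: 0-9, then a-z, then A-Z.
const BASE62_ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

function base62_encode(int $n): string
{
    if ($n < 0) {
        throw new InvalidArgumentException('Only non-negative integers are supported');
    }
    if ($n === 0) {
        return '0';
    }
    $out = '';
    while ($n > 0) {
        $out = BASE62_ALPHABET[$n % 62] . $out; // prepend the next digit
        $n = intdiv($n, 62);
    }
    return $out;
}

function base62_decode(string $s): int
{
    $n = 0;
    foreach (str_split($s) as $ch) {
        $pos = strpos(BASE62_ALPHABET, $ch);
        if ($pos === false) {
            throw new InvalidArgumentException('Invalid base62 digit: ' . $ch);
        }
        $n = $n * 62 + $pos;
    }
    return $n;
}

echo base62_encode(3843), PHP_EOL; // ZZ, the largest two-digit value
```

Removing look-alike characters such as I, l and 1 only means shortening the alphabet and using its length in place of 62.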
Check out Lessn by Shaun Inman. I haven't played with it yet, but it's a self-hosted, roll-your-own URL shortener.

Using SEO-friendly links

I'm developing a PHP website, and currently my links are in a facebook-ish style, like so
me.com/profile.php?id=123
I'm thinking of moving to something more friendly to crawling search engines
(like here at stackoverflow), something like:
me.com/john-adams
But how can I differentiate between two users with the same name? Or, more to the point, how does stackoverflow tell the difference between two questions with the same title?
I was thinking of doing something like
me.com/john-adams-123
and parsing the url.
Any other recommendations?
Stackoverflow does something similar to your me.com/john-adams-123 option, except more like me.com/123/john-adams where the john-adams part actually has no programmatic meaning. The way you're proposing is slightly better because the semantic-content-free numeric ID is farther to the right in the URL.
What I would do is store a unique slug (these SEO-friendly URL components are generally called slugs) in the user table and do the number append thing when necessary to get a unique one.
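
A small sketch of the "append a number when necessary" approach. Here $taken stands in for a lookup against the unique slug column, and the function names and the ASCII-only slug rule are assumptions for illustration.

```php
<?php
// Assumption: ASCII-only slugs; anything else collapses into a hyphen.
function slugify(string $name): string
{
    $slug = strtolower(trim($name));
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    return trim($slug, '-');
}

// $taken stands in for a lookup against the unique slug column.
function unique_slug(string $name, array $taken): string
{
    $base = slugify($name);
    $slug = $base;
    // Append -2, -3, ... until the slug is free.
    for ($i = 2; in_array($slug, $taken, true); $i++) {
        $slug = $base . '-' . $i;
    }
    return $slug;
}

echo unique_slug('John Adams', ['john-adams']), PHP_EOL; // john-adams-2
```

The first John Adams keeps the clean me.com/john-adams URL, and only later namesakes get a suffix.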
In stack overflow's case, it's
http://stackoverflow.com/questions/975240/using-seo-friendly-links
http://stackoverflow.com/questions <- Constant prefix
/975240 <- Unique question id
using-seo-friendly-links <- Any text at all, defaults to title of question.
Facebook, on the other hand, has decided to just make everyone pick a unique ID. Then they are going to use that as a profile page. Something like http://facebook.com/p/username/. They are solving the problem of uniqueness between users, by just requiring it to be some string that the user picks that is unique among all existing users.
SO 'cheats' :-).
The link for your question ends in "using-seo-friendly-links", but the same link with different text after the question id also works.
The part after the number is the SEO friendly bit, but SO doesn't really care what's there. I think it defaults to the question title.
So in your case you could construct a link like:
me.com/123/john-adams
A second John Adams would have a different id and therefore a unique URL, like:
me.com/111/john-adams
I would say that your proposed solution is better than stackoverflow's, as it preserves content hierarchy:
me.com/john-adams-123
Usage of the unique ID before the username is simply nonsensical.
I would, however, recommend enforcement of content type:
me.com/john-adams-123.html
This will allow for consistent urls while serving a variety of content types.
Additionally, you could use sexatrigesimal (base 36) for the unique id, to further reduce the amount of unnecessary cruft in your URL, especially for large numbers, but this is often overkill :D
me.com/john-adams-123.html -> me.com/john-adams-3F.html
me.com/john-adams-1234567890.html -> me.com/john-adams-KF12OI.html
Finally, be sure to utilize 301 redirects on non-conforming accessible URIs to redirect to the "correct" seo-friendly schema to prevent duplicate content penalties.
I'd go with your style of me.com/john-adams-123, because I think the leftmost part of the URI has more importance in SEO ranking.
Actually, if you are willing to use this on several controllers (not just user profile), you may want to do it more like me.com/john-adams-profile-123 with a rewriting rule redirecting /.+-profile-(\d+) to profile.php?uid=$1 and still be able to use, say, me.com/john-adams-articles-123 for this user's articles...
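
On the PHP side, the same pattern could be matched roughly like this if you route requests through a single script instead of (or in addition to) the rewrite rule. The function name parse_friendly_path and the fixed list of controllers are assumptions for illustration.

```php
<?php
// Hypothetical router helper: extracts the controller and numeric id from
// paths shaped like /john-adams-profile-123.
function parse_friendly_path(string $path): ?array
{
    if (!preg_match('#^/.+-(profile|articles)-(\d+)$#', $path, $m)) {
        return null; // not one of the friendly URL shapes
    }
    return ['controller' => $m[1], 'id' => (int) $m[2]];
}

print_r(parse_friendly_path('/john-adams-profile-123'));
```

Only the trailing id is used for the lookup, so the name part can change (for example after a rename) without breaking old links.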
To avoid problems with links that contain special characters, you can use this plugin for Zend Framework.
https://github.com/btlagutoli/CharConvert
$filter2 = new Zag_Filter_CharConvert(array(
    'onlyAlnum' => true,
    'replaceWhiteSpace' => '-'
));
echo $filter2->filter('éééé ááááá ? 90 :'); // eeee-aaaaa-90
This can help you deal with strings in other languages.
