URL shortening: using inode as short name? - php

The site I am working on wants to generate its own shortened URLs rather than rely on a third party like tinyurl or bit.ly.
Obviously I could keep a running count of new URLs as they are added to the site and use that to generate the short URLs. But I am trying to avoid that if possible, since it seems like a lot of work just to make this one thing work.
As the things that need short URLs are all real physical files on the webserver my current solution is to use their inode numbers as those are already generated for me ready to use and guaranteed to be unique.
function short_name($file) {
    // fileinode() returns the inode number, or false on failure
    $ino = fileinode($file);
    $s = base_convert($ino, 10, 36);
    return $s;
}
This seems to work. Question is, what can I do to make the short URL even shorter?
On the system where this is being used, the inodes for newly added files are in a range that makes the function above return a string 7 characters long.
Can I safely throw away some (half?) of the bits of the inode? And if so, should it be the high bits or the low bits?
I thought of using the crc32 of the filename, but that actually makes my short names longer than using the inode.
Would something like this have any risk of collisions? I've been able to get down to single digits by picking the right value of "$referencefile".
function short_name($file, $referencefile) {
    $ino = fileinode($file);
    // $referencefile: arbitrarily selected pre-existing file,
    // as all newer files will have higher inodes
    $ino = $ino - fileinode($referencefile);
    $s = base_convert($ino, 10, 36);
    return $s;
}

Not sure this is a good idea: if you have to change servers, or change or reformat the disk, the inode numbers of your files will most probably change, and all your short URLs will be broken or lost!
Same thing if, for any reason, you need to move your files to another partition of your disk, btw.
Another idea might be to calculate some crc/md5/whatever of the file's name, like you suggested, and use some algorithm to "shorten" it.
Here are a couple of articles about that:
Create short IDs with PHP - Like Youtube or TinyURL
Using Php and MySQL to create a short url service!
Building a URL Shortener
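The hash-and-shorten idea could look roughly like the sketch below. The function name and the choice of keeping 6 hex digits (24 bits) of the MD5 are assumptions for illustration, not something taken from the linked articles. Truncating a hash means two file names can end up with the same short name, so any real use would need to store the mapping and check for collisions.

```php
<?php
// Hypothetical helper: derive a short name from the file's name, not its inode.
// Keeps only the first $hexDigits hex digits of the MD5, so collisions are possible.
function short_name_from_hash(string $filename, int $hexDigits = 6): string
{
    $prefix = substr(md5($filename), 0, $hexDigits);
    // Re-encode the truncated hex value in base 36 to save a character or two.
    return base_convert($prefix, 16, 36);
}
```

With 6 hex digits the result is at most 5 characters long, and it stays the same when the file moves to another disk or server as long as its name does not change.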

Rather clever use of the filesystem there. If you are guaranteed that inode ids are unique, it's a quick way of generating the unique numbers. I wonder if this could work consistently over NFS, because obviously different machines will have different inode numbers. You'd then just serialize the link info in the file you create there.
To shorten the urls a bit, you might take case sensitivity into account, and do one of the safe encodings (you'll get about base62 out of it - 10 [0-9] + 26 (a-z) + 26 (A-Z), or less if you remove some of the 'conflict' letters like I vs l vs 1... there are plenty of examples/libraries out there).
You'll also want to 'home' your ids with an offset, like you said. You will also need to figure out how to keep temp file/log file, etc creation from eating up your keyspace.
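A minimal sketch of the base62 idea, assuming a hypothetical helper name and the plain alphabet 0-9, a-z, A-Z (no look-alike characters removed). PHP's base_convert() stops at base 36, so the digits are produced by hand here.

```php
<?php
// Hypothetical helper: encode a non-negative integer in base 62 (0-9, a-z, A-Z).
function base62_encode(int $n): string
{
    $alphabet = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    if ($n === 0) {
        return '0';
    }
    $out = '';
    while ($n > 0) {
        // Prepend the least significant base-62 digit on each pass.
        $out = $alphabet[$n % 62] . $out;
        $n = intdiv($n, 62);
    }
    return $out;
}
```

On a 64-bit PHP build, an inode such as 3000000000 needs 7 characters in base 36 but only 6 in base 62, so the gain from case sensitivity alone is about one character. Subtracting an offset first, as the question does, saves more.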

Check out Lessn by Shaun Inman. I haven't played with it yet, but it's a self-hosted, roll-your-own URL shortening solution.

Related

How to properly store images (with numeric names) in server's filesystem

I have a lot of records in the database, and each record will have an image, I'm pretty confused about how to store the images.
I want the access route to be something like /img/record-id.jpg (i.e. /img/15178.jpg).
Alright, storing all images inside /img/ isn't a good thing, because there will be many.
In this question it is suggested to reverse the name of the image, so the example above would be stored under /img/78/51/15178.jpg. The suggestion won't give further info (and for me it's not obvious) about other scenarios. What will happen (this is asked in the last comment for the answer) if the id is a low number like 5, 15, 128, 1517?
Leaving that aside, let's remember I want the path to be /img/15178.jpg. I'd redirect the request using Apache, but for that I'd have to type at least 3 or more rules for different id numbers:
^/img/(\d)(\.jpg)$ /img/$1$2
^/img/(\d\d)(\.jpg)$ /img/$1/$1$2
^/img/(\d\d)(\d\d)(\.jpg)$ /img/$1/$2/$1$2$3
And so on?
This doesn't seem to be a nice solution, although it would work just fine.
Another option I can think of: take the MD5 of the image, store it in its respective record, redirect the request to a PHP script and let it take care of the output.
The script would look up the MD5 for the id in the database, build the actual route out of the hash and output the image. This solution is neater, but it involves the database and PHP just to output an image, which sounds like a little too much.
I really don't know what to do here. Mind giving me some advice?
You have already written the perfect answer! Professionals do it exactly like you (or the guy in the linked question) say: build a deep directory structure that fits your needs. I have done this with 16 million pictures, and it worked perfectly.
I did it like this:
/firstCharacter/secondCharacter/...
Files with short names, like 5.jpg, will be in /5/5.jpg
EDIT: to keep performance high, I'm totally against any further PHP processing, like salting, md5, etc. Keep it straight and simple.
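A rough sketch of that layout, assuming a hypothetical helper name, a depth of two levels and a /img root (none of which are stated in the answer). Ids shorter than the depth simply get fewer directory levels, which matches the 5.jpg example.

```php
<?php
// Hypothetical helper: one directory level per leading character of the id.
function image_path(string $id, int $depth = 2, string $root = '/img'): string
{
    // Ids shorter than $depth use fewer directory levels.
    $dirs = str_split(substr($id, 0, $depth));
    return $root . '/' . implode('/', $dirs) . '/' . $id . '.jpg';
}
```

For example, image_path('15178') gives /img/1/5/15178.jpg and image_path('5') gives /img/5/5.jpg. The string work only has to run wherever the path is built, not on every image request.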

Saving user images - reaching maximum folder limit

I'm using ext3 and, according to Wikipedia, the maximum number of subdirectories allowed is around 32000. Currently, each user is given their own directory to upload images to on the filesystem. This makes images simple to retrieve and easy to access. The folder structure is like this:
../images/<user id>/<image>
../images/<another user id>/<image>
I don't want to commit to a design that is doomed to fail with scalability, specifically when 32k users have uploaded images. While this may never be reached, I still think it is bad practice.
Does anyone have an idea to avoid this problem? I would prefer not to use the database if possible for reasons of unnecessary queries and speed.
You could have a multi-level hierarchy, where each level is guaranteed to never exceed the maximum.
For example, if your user ids are defined with the regular expression [A-Za-z0-9_]+, you have 64 possible choices for any given character (I'm adding a space to account for spaces at the end when ids are shorter). Taking two characters together you have 64*64 = 4096 total possibilities. You cannot do three characters as that takes you over your limit. Then with this info you can create the directories by splitting the ids in groups of two letters. Example: user ids "miguel" and "miguel12345" would go to:
/images/mi/gu/el/<image>
/images/mi/gu/el/12/34/5/<image>
Note how the last component can be one char long if the length of the id is odd. This is fine, since the space is accounted as a possible char, you will still be within the max sub-directory limit.
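The splitting described above can be sketched with str_split(). The function name and the /images root are assumptions for illustration.

```php
<?php
// Hypothetical helper: split a user id into two-character directory names.
function user_image_dir(string $userId, string $root = '/images'): string
{
    // str_split() leaves a shorter final chunk when the length is odd.
    $parts = str_split($userId, 2);
    return $root . '/' . implode('/', $parts);
}
```

user_image_dir('miguel') gives /images/mi/gu/el and user_image_dir('miguel12345') gives /images/mi/gu/el/12/34/5, the same paths as in the example.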
Good luck!
Create a subdirectory for when the previous one gets full
/images/<a>/<user id 1>/<image>
/images/<a>/<user id 2>/<image>
...
/images/<a>/<user id 32000>/<image>
/images/<b>/<user id 32001>/<image>
...
If I'm getting this right and this is some sort of web app, you could use an abstraction layer to imitate that folder structure and save the files in one directory. Save each file's real name in the database, and save the uploaded file under some unique name. Then list a user's files from the database.

Random URL like megaupload

I need to sell pictures. I need to create a megaupload-like system that creates random URLs, like this: "http://download.server.com/7fdfug87g89f7g98fd7g/image.jpg", associated with the session and IP address.
I'm using PHP, Apache or Nginx.
How can I achieve this?
Any Ideas?
Use mod_rewrite in the .htaccess file to redirect requests matching some patterns you define to a php file, 'index.php' perhaps.
This way you can pass the requested string as a URL parameter to the page. And then in the script you can use the parameter to find and return the related image.
It's called 'URL rewriting', and it is how sites with meaningful URLs work, just like the URLs of stackoverflow.
For the uniqueness; rather than bare hash codes, you'll probably need to keep a DB to map the codes with files. So they may be totally random codes in any length you wish, and never collide, as during the assignment you will create a new random one if the one you just created collides with another one already in the DB. And you can add clear IP and session info to the DB record. This also removes the need of some heavy calculations for hashing algorithms.
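A minimal sketch of the generate-and-retry step. The function name is hypothetical, and the $taken array stands in for a lookup against the DB table of codes already issued, which is not shown here.

```php
<?php
// Hypothetical sketch: $taken (code => true) stands in for the DB lookup.
function new_download_code(array $taken, int $bytes = 10): string
{
    do {
        // random_bytes() is cryptographically secure; bin2hex() doubles the length.
        $code = bin2hex(random_bytes($bytes));
    } while (isset($taken[$code]));
    return $code;
}
```

With the default of 10 bytes the code is 20 hex characters long, the same length as the example URL in the question. In a real setup the new code would be inserted into the DB together with the file, session and IP information.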
Something like md5 would be within reason.
$my_seed = "something random here";
$path = md5($my_seed . $_SESSION['something'] . $_SERVER['REMOTE_ADDR']);
echo "http://download.server.com/" . $path . "/" . $file;
That should give you a pretty unique path to put files in that would be rare to collide. You should still check if the previous hash'd path exists though.
Use some of the popular hash functions out there, such as MD5. There should be a PHP module capable of that.
I usually use a hashing function like sha1 or md5 to generate a pseudorandom string of hexadecimal digits based on the current time plus some other bit of data unique to whatever the URL is about.

how to store and search mp3 by its content

I want to store multiple mp3 files and search them by giving some part of song, to detect which song it is.
I am thinking of storing all binary content in mysql and when I want to search for a specific song by content I will take some middle portion of song and actually match it with the binary data in MySQL.
My questions are:
Is this a reasonable way to find songs by their content?
Is it right to store the songs' content in the database or should I use the filesystem?
This is not going to work. MP3 is a "lossy" format. That means that it constantly alters subtle nuances of the music when encoding, thus producing totally different byte-wise data on almost every encoding for the same song.
Also, even in an uncompressed format like WAV, two identical records at different volumes will produce different byte data. So, it is impossible to compare music by comparing the byte values of the file's contents.
A binary comparison will work only for two exact identical copies of the same MP3 file. It won't even work anymore when you re-encode the same MP3 file with identical settings.
Comparing music is not a trivial matter, several approaches exist but to my knowledge none that can be used in PHP.
If you're lucky, there exists a web service that allows some kind of matching. Expect it to be commercial in some way, though - I doubt we are at the stage where this kind of thing can be used free of charge.
Is it a right way to find songs by content of song.
Only if you can be sure that the part you get as the search criterion will actually be an excerpt from that particular MP3 file... and that is very, very unlikely. If the part can be from a different source (i.e. a different recording of the same song, or just a differently compressed MP3), you'll have to use audio fingerprinting, which is vastly more complicated.
Is it right to store songs content in database or file store normally will work?
If you do simple binary matching, there is no point in using a database. If you have a more complex indexing technique (such as audio fingerprints) then using a database can make sense.
As others have pointed out - comparing MP3s by looking at the binary content of files is not going to work.
I wrote something like this in Java whilst at university for my final year project. I'd be more than happy to send you the source code. It dealt in relative similarities - "song X is more similar to song Y than it is to song Z", rather than matches, but it might be a step in the right direction.
And please, whatever you do, don't try and do this in PHP. The algorithm I used needed me to compute (if I remember correctly - I worked on this around 3 years ago) 30 30x30 matrices for each MP3 it analysed. Each song took around 30 seconds to process to a set of matrices on my clunky old machine (I'm sure my new PC could get the job done significantly quicker). Once I had those matrices for n songs a second step computed differences between each pair of songs, and a third step reduced those differences down to m-dimensional space. Each of these 3 steps takes a fair amount of horsepower, and PHP definitely isn't the right horse for the job.
What PHP might work for is a frontend - I ended up with a queryable web-app written in Ruby on Rails, where I had a simple backend which stored the co-ordinates of each song in m-dimensional space (I happened to choose m = 6) - given a particular song, or fragment, X, you could then compute songs within a certain "distance" of X.
NB. I should probably point out that all the code I wrote was basically just a wrapper around libraries others had written - which were by some smart people at a university in Austria - those libraries took two songs and generated the matrices - all I did was compute distances and map distances of lots of songs into m-dimensional space. Wish I was smart enough to have done the first bit too!
I don't fully understand what you're trying to do, but if you're going to index an MP3 collection, it's probably a better idea to store a hash (of sufficient length) rather than the actual file.
The problem is that the bytes don't give you any insight to the CONTENT of the file, i.e. the music in it. Even if you cut the metadata from the bytes to compare (to get rid of noise like changes in spelling/capitalisation of metadata), you only know something about the unique file itself. So you could compare two identical files (i.e. exact duplicates) for equality, but you couldn't compare any two random files for similarity.
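For the exact-duplicate case only, storing a hash per file could look like the sketch below. The function name is an assumption, the paths are assumed to be readable files, and SHA-256 is just one reasonable choice of hash. It says nothing about how similar two different recordings are.

```php
<?php
// Hypothetical helper: group file paths by the SHA-256 of their bytes.
// Files in the same group are byte-for-byte duplicates (barring hash collisions).
function group_by_content_hash(array $paths): array
{
    $groups = [];
    foreach ($paths as $path) {
        // hash_file() reads the file in chunks, so large MP3s are not loaded whole.
        $groups[hash_file('sha256', $path)][] = $path;
    }
    return $groups;
}
```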
To search songs, you may probably want to index their tags and focus on a nice, easy to use UI so users can look for them in flexible ways.
As said above, same song will show different content bytes depending on the encoding.
However, one idea in your direction, and I'm not sure how feasible it is, would be to index patterns in a song that may uniquely identify it. For example, what do all Johnny Cash songs have in common? Volume, tone, a combination of them? And when you get a portion of content, you could extract that same pattern from it and match. That would be an interesting concept.

Best way to get a random word for a captcha script in PHP

I am working on a new captcha script and it is almost complete, except that I would like to use a list of words for the captcha image text. For example, let's say I have a list of 300 five-letter words.
What would be the best way for performance on a high traffic site to deal with this list for it?
Read the words from a text file on every load
Store in an array
other?
Using a fixed list of words could make your Captcha weak since it restricts the number of variations to just n! / (n - k)! options. With n = 300 words and k=2 different words per captcha it would be just 89700 options no matter how long the words are.
If you would use a sequence of four random letters (a-z) you would get more options (exactly n^k = 26^4 = 456976).
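A sketch of the random-letters alternative, assuming a hypothetical function name. It uses random_int(), which draws from a cryptographically secure source, so each of the 26^4 strings is equally likely.

```php
<?php
// Hypothetical helper: build a captcha string of random lowercase letters.
function random_letters(int $length = 4): string
{
    $s = '';
    for ($i = 0; $i < $length; $i++) {
        // Pick one letter uniformly from a-z.
        $s .= chr(random_int(ord('a'), ord('z')));
    }
    return $s;
}
```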
If you just want 300 words to choose from, I'd just put them all in an array in straight PHP code and pull one out randomly. That would give the best performance.
Best option for performance
It would be best to keep the list of words in memory (APC or Memcache; search Google or Stack Overflow for either) to get the best performance, because disk I/O is what will make your site slow most of the time. For this you need a box with enough memory (>= 128MB) on which you can install software (APC/Memcache). If you want good performance on a high-traffic site, you should be willing to pay for it.
If you are on a shared hosting provider (but then you won't get best performance), then it would be best to put the words in an array in the same file, because every require statement will fetch the file from disc.
return random word
Like lucky said you can fetch a random number, by a simple rand function call
return $words[rand(0, count($words) - 1)];
Where $words is the array with all the words.
VPS hosting
vpslink
Slicehost
These are some cheap VPS hosting I found using google, but I think you should do some more research finding the best VPS hosting for your high performance site.
Instead of 300 words, you could simply generate a random number and display that. No need for a list, or loading a list, or managing the list, ....
Just how many logons per second do you need to handle? This doesn't seem like the right place to spend time in optimization. Just about any way you find the random word should be fine, especially if your word list is only 300 words.
I'd start with a simple text file, one word per line, and just do something simple like
$words = file("wordlist.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
return $words[rand(0, count($words) - 1)];
and only if it really proved to be a bottleneck would I change it to do a random fseek() or some other "high performance" trick.
