I'm using ext3, and according to Wikipedia the maximum number of subdirectories allowed in a single directory is around 32,000. Currently each user is given their own directory to upload images to on the filesystem, which keeps retrieval and access simple. The folder structure looks like this:
../images/<user id>/<image>
../images/<another user id>/<image>
I don't want to commit to a design that is doomed to fail to scale, specifically once 32k users have uploaded images. While that point may never be reached, I still think it is bad practice.
Does anyone have an idea for avoiding this problem? I would prefer not to involve the database if possible, to avoid unnecessary queries and keep things fast.
You could have a multi-level hierarchy, where each level is guaranteed to never exceed the maximum.
For example, if your user ids are defined by the regular expression [A-Za-z0-9_]+, you have 64 possible choices for any given character: the 63 characters in that class, plus a space to pad ids whose length is odd. Taking two characters together gives you 64*64 = 4096 possibilities. You cannot do three characters, as 64^3 = 262,144 takes you over the limit. With this info you can create the directories by splitting the ids into groups of two letters. For example, user ids "miguel" and "miguel12345" would go to:
/images/mi/gu/el/<image>
/images/mi/gu/el/12/34/5/<image>
Note how the last component can be one character long when the length of the id is odd. This is fine: since the space is counted as a possible character, you are still within the maximum sub-directory limit.
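A minimal PHP sketch of that splitting (the function name is mine, just for illustration):
function user_image_dir($base, $userId) {
    // "miguel12345" -> ["mi", "gu", "el", "12", "34", "5"]
    $parts = str_split($userId, 2);
    return rtrim($base, '/').'/'.implode('/', $parts);
}

echo user_image_dir('/images', 'miguel'), "\n";      // /images/mi/gu/el
echo user_image_dir('/images', 'miguel12345'), "\n"; // /images/mi/gu/el/12/34/5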
Good luck!
Create a new subdirectory when the previous one gets full:
/images/<a>/<user id 1>/<image>
/images/<a>/<user id 2>/<image>
...
/images/<a>/<user id 32000>/<image>
/images/<b>/<user id 32001>/<image>
...
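If your user ids are sequential integers, a sketch of picking the bucket directly (names and the 30,000-per-bucket figure are mine; it just stays safely under the limit):
function bucket_dir($userId, $perBucket = 30000) {
    // users 1..30000 -> b0, users 30001..60000 -> b1, ...
    return 'b'.intdiv($userId - 1, $perBucket);
}

echo '/images/'.bucket_dir(32001).'/32001/photo.jpg'; // /images/b1/32001/photo.jpg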
If I'm getting this right, and this is some sort of web app, you could use an abstraction layer to imitate that folder structure while saving the files in one directory. Save the file's real name in the database, save the uploaded file under some unique name, then list a user's files from the database.
Related
I'm new to PHP, and I'm trying to upload a file to a file server and the file's information to a MySQL database. I have done the file server and database parts, but I need to retrieve the info of a specific file from my file server folder when I click that file, and I'm trying to work out that logic. Please help me if there is a solid solution for this. (Correct me if I'm wrong: my idea was to upload the file path to the database along with the info. Will this give me a solution? The filename can be duplicated, though.)
I figured I would write a short (for me this is short) "answer" just so I could summarize my points.
Some "best practices" when creating a file storage system. File storage is a broad category, so your mileage may vary for some of these. Take them just as suggestions of what I found works well.
Filenames
Don't store the file with the name given to it by an end user. Users can and will use all kinds of crappy characters that will make your life miserable. Some can be as bad as ' single quotes, which on Linux basically make it impossible to read, or even delete, the file (directly). Some things can seem simple, like a space, but depending on where you use it and the OS on your server you could wind up with one%20two.txt or one+two.txt or one two.txt, which may or may not create all kinds of issues in your links.
The best thing to do is create a hash, something like sha1. This can be as simple as sha1({user_id}{original_name}); the user id makes collisions with other users' filenames less likely.
I prefer doing hash_file('sha1', $path); that way if someone uploads the same file more than once you can catch it (the contents are the same, so the hash is the same). But if you expect large files you may want to do some benchmarking to see how it performs. I mostly handle small files, so it works fine for that.
Regardless of what you do, I would prefix the name with a timestamp: time().'-'.$filename. This is useful information to have, because it's the absolute time the file was created.
Note that with the timestamp the same file can still be saved twice (the full name is different), but the duplication is easy to spot, and it can be verified in the database.
As for the name the user gave the file: just store that in the database record. This way you can show them the name they expect, but use a name you know is always safe for links.
$filename = 'some crapy^ fileane.jpg';  // deliberately messy user-supplied name
$ext = strrchr($filename, '.');         // everything from the last dot: ".jpg"
echo "\nExt: {$ext}\n";
$hash = sha1($filename);                // hash the (unsafe) original name
echo "Hash: {$hash}\n";
$time = time();
echo "Timestamp: {$time}\n";
$hashname = $time.'-'.$hash.$ext;       // {timestamp}-{hash}.{ext}
echo "Hashname: $hashname\n";
Output:
Ext: .jpg
Hash: bb9d2c2c7c73bb8248537a701870e35742b41c02
Timestamp: 1511853063
Hashname: 1511853063-bb9d2c2c7c73bb8248537a701870e35742b41c02.jpg
Paths
Never store the full path to the file. All you need in the database is the hash from creating the hashed name. The "root" path to the folder the file is stored in should be handled in PHP. This has several benefits:
Prevents directory traversal. Because you're not passing any part of the path around, you don't have to worry as much about someone slipping a \..\.. in there and going places they shouldn't. A crude example would be someone overwriting a .htpasswd file by uploading a file with a traversal sequence in its name.
Gives more uniform-looking links: uniform size, uniform set of characters.
https://en.wikipedia.org/wiki/Directory_traversal_attack
Maintenance. Paths change, servers change, demands on your system change. If you need to relocate those files but you stored the absolute full path in the DB, you're stuck gluing everything together with symlinks or updating all your records.
There are some exceptions to this. If you want to store files in a monthly folder, or by username, you could save that part of the path in a separate field. But even in that case you could build it dynamically from data saved in the record. I have found it's best to save as little path info as possible, and then make a config value or a constant you can use in all the places you need the path to the file.
Also, the path and the link are very different things, so by saving only the name you can link to a file from whatever PHP page you want without having to subtract data from the path. I've always found it easier to add to the filename than to subtract from a path.
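For example, something along these lines (the constant name is mine):
// keep the storage root in ONE place; records only ever hold the hashname
define('FILE_STORAGE_ROOT', '/var/app/storage/uploads');

function storage_path($hashname) {
    // basename() strips any path segments, so only a bare name reaches the disk
    return FILE_STORAGE_ROOT.'/'.basename($hashname);
}

echo storage_path('1511853063-bb9d2c2c7c73bb8248537a701870e35742b41c02.jpg');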
Database (just some suggestions, use may vary )
As always with data, ask yourself: who, what, where, when.
id - int, primary key, auto increment
user_id - int, foreign key; who uploaded it
hash - char(40), sha1, unique; what: the hash of the contents
hashname - varchar, {timestamp}-{hash}.{ext}; where: the file's name on the hard drive
filename - varchar; the original name given by the user, so we can show them the name they expect (if that matters)
status - enum[public, private, deleted, pending, etc.]; the status of the file. Depending on your use case you may have to review files, or maybe some are private so only the user can see them, maybe some are public, etc.
status_date - timestamp|datetime; the time the status was changed
create_date - timestamp|datetime; the time the file was created. A timestamp is preferred as it makes some things easier, and it should then be the same timestamp used in the hashname.
type - varchar; the mime type. Can be useful for setting the mime type when downloading, etc.
If you expect different users to upload the same file and you use hash_file(), you can make the hash field a combined unique index on user_id and hash; that way it only conflicts if the same user uploads the same file. You could also do it based on the timestamp and hash, depending on your needs.
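A rough sketch of that table in MySQL syntax (sizes and names are suggestions, not gospel):
$PDO->exec("
    CREATE TABLE attachments (
        id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
        user_id     INT UNSIGNED NOT NULL,
        hash        CHAR(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
        hashname    VARCHAR(100) NOT NULL,
        filename    VARCHAR(255) NOT NULL,
        status      ENUM('public','private','deleted','pending') NOT NULL DEFAULT 'pending',
        status_date DATETIME NULL,
        create_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
        type        VARCHAR(100) NOT NULL,
        PRIMARY KEY (id),
        UNIQUE KEY user_hash (user_id, hash) -- same user + same file = conflict
    )
");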
That's the basic stuff I could think of. This isn't absolute, just some fields I thought would be useful.
It's useful to have the hash by itself. If you store it on its own you can use CHAR(40) for sha1 (takes up less space in the DB than VARCHAR) and set the collation to utf8_bin, which is binary, making searches on it exact and case sensitive. There is little chance of a hash collision anyway, but the binary collation adds a bit more rigor (hex sha1 is all lowercase, but this matters if you ever switch to a mixed-case encoding).
You can always build the hashname on the fly if you store the extension and the timestamp separately. If you find yourself recreating it time and time again, you may just want to store it in the DB to simplify the work in PHP.
I like putting just the hash in the link, no extension, no anything, so my links look like this:
http://www.example.com/download/ad87109bfff0765f4dd8cf4943b04d16a4070fea
Real simple, real generic, safe in URLs, always the same size, etc.
The hashname for this "file" would look like this:
1511848005-ad87109bfff0765f4dd8cf4943b04d16a4070fea.jpg
If you do have conflicts between the same file and different users (which I mentioned above), you can always add the timestamp part into the link, or the user_id, or both. If you use the user_id it might be useful to left-pad it with zeros: for example, some users may have ID 1 and some ID 234, so you could left-pad to 4 places, making them 0001 and 0234. Then append that to the hash, which is almost unnoticeable:
1511848005-ad87109bfff0765f4dd8cf4943b04d16a4070fea0234.jpg
The important thing here is that because a sha1 hex digest is always 40 characters and the padded id is always 4, we can separate the two accurately and easily, and you can still look the file up uniquely. There are a lot of different options, but so much depends on your needs.
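A quick sketch of pulling the two apart again:
$link   = 'ad87109bfff0765f4dd8cf4943b04d16a4070fea0234';
$hash   = substr($link, 0, 40);     // the sha1 hex is always 40 chars
$userId = (int) substr($link, 40);  // "0234" -> 234

// building it is just the reverse
$rebuilt = $hash.str_pad((string) $userId, 4, '0', STR_PAD_LEFT);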
Access
Take downloading. You should always output the file with PHP; don't give users direct access to the file. The best way is to store the files outside the webroot (above the public_html or www folder). Then in PHP you can set the headers to the correct type and basically read out the file. This works for pretty much everything except video. I don't handle videos, so that's a topic outside my experience. But I find it best to think of all file data as text; it's the headers that make that text into an image, or an Excel file, or a PDF.
The big advantage of not giving direct access to the file is that if you have a membership site, or don't want your content accessible without a login, you can easily check in PHP whether the visitor is logged in before giving them the content. And, as the file is outside the webroot, they can't access it any other way.
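A hedged sketch of such a download endpoint (the login check and lookup helpers are placeholders for whatever your app provides):
// assumes FILE_STORAGE_ROOT from earlier points above the webroot
if (!user_is_logged_in()) {             // placeholder for your auth check
    http_response_code(403);
    exit;
}

$row = fetch_attachment($_GET['file']); // placeholder: the PDO lookup shown further down
if (!$row) {
    http_response_code(404);
    exit;
}

$path = FILE_STORAGE_ROOT.'/'.basename($row['hashname']);
header('Content-Type: '.$row['type']);
header('Content-Length: '.filesize($path));
header('Content-Disposition: attachment; filename="'.rawurlencode($row['filename']).'"');
readfile($path); // the headers are what turn these bytes into "a file"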
The most important thing is to pick something consistent, that is still flexible enough to handle all your needs.
I'm sure I can come up with more, but if you have any suggestions feel free to comment.
BASIC PROCESS FLOW
User submits form (enctype="multipart/form-data")
https://www.w3schools.com/tags/att_form_enctype.asp
The server receives the post from the form in the superglobals $_POST and $_FILES.
http://php.net/manual/en/reserved.variables.files.php
$_FILES = [
    'fieldname' => [
        'name' => "MyFile.txt",             // comes from the browser, so treat as tainted
        'type' => "text/plain",             // also supplied by the browser, so treat as tainted
        'tmp_name' => "/tmp/php/php1h4j1o", // could be anywhere on your system, depending on your config settings, but the user has no control, so this isn't tainted
        'error' => 0,                       // UPLOAD_ERR_OK (= 0)
        'size' => 123                       // the size in bytes
    ]
];
Check for errors: if(!$_FILES['fieldname']['error'])
Sanitize the display name: $filename = htmlentities($_FILES['fieldname']['name'], ENT_NOQUOTES, "UTF-8");
Save the file, create the DB record (PSEUDO-CODE)
Like this:
$path = __DIR__.'/uploads/'; // for example
$time = time();
$hash = hash_file('sha1', $_FILES['fieldname']['tmp_name']);
$type = $_FILES['fieldname']['type'];
$hashname = $time.'-'.$hash.strrchr($_FILES['fieldname']['name'], '.');
$status = 'pending';
if(!move_uploaded_file($_FILES['fieldname']['tmp_name'], $path.$hashname)){
    //failed
    //do something for errors
    die();
}
//store record in db
http://php.net/manual/en/function.move-uploaded-file.php
Create the link (varies based on routing). The simple way is to build your link like http://www.example.com/download?file={$hash}, but it's uglier than http://www.example.com/download/{$hash}.
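If you go with the prettier form, a sketch of pulling the hash back out of the path (no framework assumed):
// /download/ad87109b... -> "ad87109b..."
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$hash = basename($path);

if (!preg_match('/^[0-9a-f]{40}$/', $hash)) { // sha1 hex only, nothing else
    http_response_code(404);
    exit;
}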
The user clicks the link and goes to the download page.
Get the input and look up the record:
$hash = $_GET['file'];
$stmt = $PDO->prepare("SELECT * FROM attachments WHERE hash = :hash LIMIT 1");
$stmt->execute([":hash" => $hash]);
$row = $stmt->fetch(PDO::FETCH_ASSOC);
print_r($row);
http://php.net/manual/en/intro.pdo.php
Etc....
Cheers!
I was wondering how I should name my image uploads using PHP & MySQL. Should I use the auto-increment number as the name of the image (for example, 1.gif), or should I use some random numbers or something? I was thinking auto-increment was better. But what would be best?
Since no one has officially offered this yet: I'd advise simply naming the file after the database's unique id, nothing more, and storing the extension in the database (unless you are forcing all images to be .jpg or something, in which case you don't need to).
It is always going to be a safe file name (an integer)
It will always be unique
No need to store the file name in the db or worry about scrubbing it.
It will be as small as possible.
Why I would not use the user's username/id, as suggested by others:
There's no benefit, and no reason to expose a user's id in the file name if you don't need to.
No need to scrub it for allowed characters, which may even end up with multiple users with the same "file safe" user name.
User names may change, so it doesn't always make sense, and you don't want to have to rename files if you want them to match.
Why I would not use the original file name in any form:
There's no benefit.
You have to scrub it for allowed characters.
There will be duplicates.
Unless you are interested in vanity file names, I can't think of any reason not to just use the auto-increment id. If your DB ids are unique, your file names will be too.
If later on you do want "pretty" file names, you can use .htaccess to rewrite the requests, and/or output your images through a PHP script, which also has the benefit of checking permissions and whatnot if you need it.
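A minimal sketch of that flow (table and field names are made up): insert the row first, then name the file after the new id.
$ext = strtolower(pathinfo($_FILES['image']['name'], PATHINFO_EXTENSION));

$stmt = $PDO->prepare("INSERT INTO images (user_id, ext) VALUES (:uid, :ext)");
$stmt->execute([':uid' => $userId, ':ext' => $ext]);

$id = $PDO->lastInsertId(); // the id IS the file name
move_uploaded_file($_FILES['image']['tmp_name'], "/path/to/images/{$id}.{$ext}");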
What about
md5(microtime())
?
It is pretty unique (though not guaranteed; two uploads hashed in the same microsecond would collide).
I like to use a combination of an auto incrementing id and filename.
So if I upload the image my_photo.jpg and it gets stored with an id of 5, I would save it as 5_my_photo.jpg
This way, the original filename and extension are preserved and I can deliver it back to the user without the id prefix if I want to.
One good way to name the images is to append the name to an auto-increment value padded on the left with zeros, such as "00000027MyPic.jpg".
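Something like this, as one possible shape (the 8-digit padding is arbitrary):
$id = 27;                                  // auto-increment value from the DB
$name = sprintf('%08d', $id).'MyPic.jpg';  // "00000027MyPic.jpg"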
If you are worried about the image name being unique, store it as time().$extension.
I also prefer to put the user's username as a prefix, but that's just me; there's no reason you have to do that.
I am facing a problem while developing my web app; here is the description:
This webapp (still in alpha) is based on user-generated content (usually short articles, although their length can become quite large, about a quarter of a screen). Every user submits at least 10 of these articles, so the number should grow pretty fast. By nature, about 10% of the articles will be duplicates, so I need an algorithm to detect them.
I have come up with the following steps:
On submission, fetch the length of the text and store it in a separate table (article_id, length). The problem: the articles are encoded using PHP's special_entities() function, and users post content with slight modifications (someone will miss a comma or an accent, or even skip some words).
Then retrieve all the entries from the database with length in the range new_post_length ± 5% (should I use another threshold, keeping in mind the human factor in article submission?).
Fetch the first 3 keywords and compare them against the articles fetched in step 2.
With a final array of the most probable matches, compare the new entry using PHP's levenshtein() function.
This process must be executed on article submission, not via cron, and I suspect it will put a heavy load on the server.
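Roughly what I'm picturing, as an untested sketch (table and column names are placeholders):
$len  = mb_strlen($newText);
$stmt = $pdo->prepare(
    "SELECT article_id, body FROM articles WHERE length BETWEEN :lo AND :hi"
);
$stmt->execute([':lo' => (int) ($len * 0.95), ':hi' => (int) ($len * 1.05)]);

$best = null;
foreach ($stmt as $row) {
    // levenshtein() caps input at 255 chars (in older PHP), so compare prefixes
    $d = levenshtein(substr($newText, 0, 255), substr($row['body'], 0, 255));
    if ($best === null || $d < $best['dist']) {
        $best = ['id' => $row['article_id'], 'dist' => $d];
    }
}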
Could you provide any idea please?
Thank you!
Mike
Text similarity / plagiarism / duplicate detection is a big topic. There are many algorithms and solutions.
Levenshtein will not work in your case: you can only use it on small texts (due to its complexity it would kill your CPU).
Some projects use "adaptive local alignment of keywords" (you will find info on that on Google).
Also, you can check this (Check the 3 links in the answer, very instructive):
Cosine similarity vs Hamming distance
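For a flavor of the cosine approach, a toy bag-of-words version in PHP (illustrative only, not the adaptive alignment mentioned above):
function cosine($a, $b) {
    // word-frequency vectors for each text
    $va = array_count_values(str_word_count(strtolower($a), 1));
    $vb = array_count_values(str_word_count(strtolower($b), 1));
    $dot = 0;
    foreach ($va as $word => $n) {
        $dot += $n * ($vb[$word] ?? 0);
    }
    $na = sqrt(array_sum(array_map(function ($n) { return $n * $n; }, $va)));
    $nb = sqrt(array_sum(array_map(function ($n) { return $n * $n; }, $vb)));
    return ($na && $nb) ? $dot / ($na * $nb) : 0.0;
}

echo cosine('the quick brown fox', 'the quick brown dog'); // 0.75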
Hope this will help.
I'd like to point out that git, the version control system, has excellent algorithms for detecting duplicate or near-duplicate content. When you make a commit, it will show you the files modified (regardless of rename), and what percentage changed.
It's open source, and largely written in small, focused C programs. Perhaps there is something you could use.
You could design your app to reduce the load by not having to check text strings and keywords against all other posts in the same category. What if you had the users submit the third-party content they are referencing as URLs? See Tumblr's implementation: basically there is a free-form text field so each user can comment and create their own narrative portion of the post content, but then there are formatted fields depending on the type of reference the user is adding (video, image, link, quote, etc.). An improvement on Tumblr would be letting the user add as many or as few types of formatted content as they want in any given post.
Then you are only checking against known types like a URL or embedded video code. Combine that with rexem's suggestion to force users to classify posts by category or genre of some kind, and you'll have a much smaller scope to search for duplicates.
Also, if you give each user some way of posting to their own "stream", it doesn't matter if many people duplicate the same content. Give people some way to vote items up from the individual streams to a main "front page" stream so the community can regulate when they see duplicate items. Instead of an up/down vote like Digg or Reddit, you could add a way for people to merge/append posts to related posts (letting them sort and manage the content as an activity on your app rather than making it an issue of behind-the-scenes processing).
The site I am working on wants to generate its own shortened URLs rather than rely on a third party like tinyurl or bit.ly.
Obviously I could keep a running count of new URLs as they are added to the site and use that to generate the short URLs, but I am trying to avoid that if possible, since it seems like a lot of work just to make this one thing work.
As the things that need short URLs are all real physical files on the webserver, my current solution is to use their inode numbers, since those are already generated for me, ready to use, and guaranteed to be unique.
function short_name($file) {
    $ino = @fileinode($file);        // inode number of the file
    $s = base_convert($ino, 10, 36); // re-encode in base 36 to shorten it
    return $s;
}
This seems to work. Question is, what can I do to make the short URL even shorter?
On the system where this is being used, the inodes for newly added files are in a range that makes the function above return a string 7 characters long.
Can I safely throw away some (half?) of the bits of the inode? And if so, should it be the high bits or the low bits?
I thought of using the crc32 of the filename, but that actually makes my short names longer than using the inode.
Would something like this have any risk of collisions? I've been able to get down to single digits by picking the right value for $referencefile.
function short_name($file, $referencefile) {
    $ino = @fileinode($file);
    // subtract the inode of an arbitrarily selected pre-existing file,
    // as all newer files will have higher inodes
    $ino = $ino - @fileinode($referencefile);
    $s = base_convert($ino, 10, 36);
    return $s;
}
I'm not sure this is a good idea: if you have to change servers, or change or reformat the disk, the inode numbers of your files will most probably change... and all your short URLs will be broken/lost!
The same applies if, for any reason, you need to move your files to another partition of the disk, by the way.
Another idea might be to calculate some CRC/MD5/whatever hash of the file's name, like you suggested, and use some algorithm to "shorten" it.
Here are a couple of articles about that:
Create short IDs with PHP - Like Youtube or TinyURL
Using Php and MySQL to create a short url service!
Building a URL Shortener
Rather clever use of the filesystem there. If you are guaranteed that inode ids are unique, it's a quick way of generating unique numbers. I wonder if this could work consistently over NFS, because different machines will obviously report different inode numbers. You'd then just serialize the link info in the file you create there.
To shorten the URLs a bit, you might take case sensitivity into account and use one of the safe encodings. You'll get about base62 out of it (10 digits [0-9] + 26 lowercase [a-z] + 26 uppercase [A-Z]), or less if you remove some of the 'conflict' characters like I vs l vs 1; there are plenty of examples/libraries out there.
You'll also want to 'home' your ids with an offset, like you said. You will also need to figure out how to keep temp files, log files, etc. from eating up your keyspace.
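A sketch of such an encoder; base_convert() tops out at base 36, hence the manual loop:
function base62($n) {
    $chars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $out = '';
    do {
        $out = $chars[$n % 62].$out; // peel off the least significant digit
        $n = intdiv($n, 62);
    } while ($n > 0);
    return $out;
}

echo base62(1234567); // "5ban"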
Check out Lessn by Sean Inman. I haven't played with it yet, but it's a self-hosted, roll-your-own URL shortener solution.
I want to store multiple MP3 files and search them by a given part of a song, to detect which song it is.
I am thinking of storing all the binary content in MySQL; when I want to search for a specific song by content, I will take some middle portion of the song and match it against the binary data in MySQL.
My questions are:
Is this a reasonable way to find songs by their content?
Is it right to store the songs' content in the database or should I use the filesystem?
This is not going to work. MP3 is a "lossy" format, which means the encoder constantly alters subtle nuances of the music, producing totally different byte-wise data on almost every encoding of the same song.
Also, even in an uncompressed format like WAV, two identical recordings at different volumes will produce different byte data. So it is impossible to compare music by comparing the byte values of the files' contents.
A binary comparison will work only for two exactly identical copies of the same MP3 file. It won't even work once you re-encode the same MP3 file with identical settings.
Comparing music is not a trivial matter, several approaches exist but to my knowledge none that can be used in PHP.
If you're lucky, there exists a web service that allows some kind of matching. Expect it to be commercial in some way, though - I doubt we are at the stage where this kind of thing can be used free of charge.
Is this a reasonable way to find songs by their content?
Only if you can be sure that the part you get as a search criterion will actually be an excerpt from that particular MP3 file... and that is very, very unlikely. If the part can come from a different source (i.e. a different recording of the same song, or just a differently compressed MP3), you'll have to use audio fingerprinting, which is vastly more complicated.
Is it right to store the songs' content in the database, or should I use the filesystem?
If you do simple binary matching, there is no point in using a database. If you have a more complex indexing technique (such as audio fingerprints) then using a database can make sense.
As others have pointed out - comparing MP3s by looking at the binary content of files is not going to work.
I wrote something like this in Java whilst at university, for my final-year project. I'd be more than happy to send you the source code. It dealt in relative similarities ("song X is more similar to song Y than it is to song Z") rather than exact matches, but it might be a step in the right direction.
And please, whatever you do, don't try to do this in PHP. The algorithm I used needed me to compute (if I remember correctly; I worked on this around 3 years ago) 30 30x30 matrices for each MP3 it analysed. Each song took around 30 seconds to process into a set of matrices on my clunky old machine (I'm sure my new PC could get the job done significantly quicker). Once I had those matrices for n songs, a second step computed the differences between each pair of songs, and a third step reduced those differences down to m-dimensional space. Each of these 3 steps takes a fair amount of horsepower, and PHP definitely isn't the right horse for the job.
What PHP might work for is a frontend. I ended up with a queryable web app written in Ruby on Rails, with a simple backend that stored the coordinates of each song in m-dimensional space (I happened to choose m = 6); given a particular song or fragment X, you could then compute the songs within a certain "distance" of X.
NB: I should probably point out that all the code I wrote was basically just a wrapper around libraries written by some smart people at a university in Austria. Those libraries took two songs and generated the matrices; all I did was compute distances and map the distances of lots of songs into m-dimensional space. Wish I was smart enough to have done the first bit too!
I don't fully understand what you're trying to do, but if you're going to index an MP3 collection, it's probably a better idea to store a hash (of sufficient length) rather than the actual file.
The problem is that the bytes don't give you any insight into the CONTENT of the file, i.e. the music in it. Even if you cut the metadata from the bytes before comparing (to get rid of noise like changes in spelling/capitalisation of metadata), you only know something about the unique file itself. So you could compare two identical files (i.e. exact duplicates) for equality, but you couldn't compare any two random files for similarity.
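A sketch of the exact-duplicate case, which is the most a content hash can give you here:
// identical bytes -> identical hash; any re-encode breaks the match
$hash = sha1_file('/music/track.mp3');
// store $hash in the index; a later upload matches only if byte-for-byte equal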
To search songs, you probably want to index their tags and focus on a nice, easy-to-use UI so users can look for them in flexible ways.
As said above, the same song will produce different content bytes depending on the encoding.
However, one idea pointing in your direction, though I'm not sure how feasible it is, would be to index some patterns that may uniquely identify a song. For example: what do all Johnny Cash songs have in common? Volume, tone, some combination of them? When you get a portion of content you could extract that same pattern from it and match. That would be an interesting concept.