I cache rendered pages to a MySQL table (shared hosting, so no memcached).
The concept is this:
I have a page that loads (static) data and then caches it: if the page does not exist in the cache, it runs 12 queries (menu, page content, SEO, product list, etc.), renders the HTML, and saves the result to the cache table.
The cache table is like this:
CREATE TABLE cache (
  url  VARCHAR(255) NOT NULL PRIMARY KEY,
  page MEDIUMTEXT
);
Now I think I'm doing the right thing, based on what I have (shared host, no caching like memcached, etc.) but my question is this:
The URL is a varchar primary key, but numeric IDs (like int) are faster to look up, so is there a way to convert a URL like /contact-us/ or /product-category/product-name/ to a unique integer? Or is there any other way to optimize this?
I would create some form of hash that allows a shorter key. In many cases something simple like a hash of the request path may be viable; alternatively, something even simpler like CRC32('/your/path/here') may be suitable as a primary key in your situation. In this example the table would look like this:
CREATE TABLE cacheTable (
  urlCRC INT(11) UNSIGNED NOT NULL PRIMARY KEY,
  url    VARCHAR(255) NOT NULL,
  page   MEDIUMTEXT
);
You could then take this a step further and add a BEFORE INSERT trigger which calculates the value for urlCRC, i.e. containing
SET NEW.urlCRC = CRC32(NEW.url);
You could then create a stored procedure which takes as argument inURL (string), and internally it would do
SELECT * FROM cacheTable WHERE urlCRC = CRC32(inURL);
If the number of rows returned is 0, then you can trigger logic to cache it.
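Putting the two together, a minimal sketch (assuming the cacheTable layout above and MySQL syntax; the procedure name getCachedPage is just an example) might look like this:
DELIMITER //
CREATE TRIGGER cacheTable_before_insert
BEFORE INSERT ON cacheTable
FOR EACH ROW
BEGIN
  -- The trigger overwrites urlCRC with the hash of the URL, so the application
  -- never has to compute it (depending on SQL mode you may still need to pass a
  -- placeholder value such as 0 for urlCRC in the INSERT).
  SET NEW.urlCRC = CRC32(NEW.url);
END//

CREATE PROCEDURE getCachedPage(IN inURL VARCHAR(255))
BEGIN
  -- Look up the cached page via the numeric key.
  SELECT * FROM cacheTable WHERE urlCRC = CRC32(inURL);
END//
DELIMITER ;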
This may of course be overkill, but it would give you a numeric key to work with which, assuming there are no collisions, would suffice. By also storing the url as VARCHAR(255), if a collision does occur you can easily regenerate the hashes using a different algorithm.
Just to be clear, I use CRC32() only as an example off the top of my head; chances are there are more suitable algorithms. The main point to take away is that a numeric key is more efficient to search, so if you can convert your strings into unique numbers, retrieval will be more efficient.
Changing your url column to a fixed-size string would make indexing slightly faster, if there wasn't another dynamically-sized column (TEXT) in the table. Converting it to an integer would be possible, depending on your URL structure - you could also use some kind of hash function. But why don't you make your life easier?
You could save your cache results directly to disk and create a mod_rewrite rule (put it into your .htaccess file) that matches if the file exists and otherwise invokes the PHP script. This would have two advantages:
If the cache is hot, PHP will not run. This saves time and memory.
If the file is requested often and it is small enough (or you have lots of RAM), it will be held in the RAM. This is much faster than MySQL.
Select all cached URLs with a hash, then search for the exact URL among the hash collisions:
SELECT page FROM (SELECT * FROM cache WHERE hashedUrl = storedHash) AS matches WHERE url = 'someurl';
Consider an application which accepts arbitrary-length text input from users, similar to Twitter 'tweets' but up to 1 MiB in size. Due to the distributed nature of the application the same text input may be delivered multiple times to any particular node. In order to prevent the same text from appearing twice in the index (based on Apache Solr), I am using an MD5 hash of the text as a unique key.
Unfortunately, Solr does not support an SQL-like "INSERT IGNORE", as such all duplicate documents replace the content of the original document. Since the user of the application can add additional fields, this replacement is problematic. In order to prevent it, I have two choices:
Before each insert, query the index for documents with the MD5 hashed unique key. If I get a result, then I know that the document already exists in the index. I found this approach to be too slow, probably because we are indexing a few hundred documents per minute.
Store the MD5 hash in an additional store, such as a flat file, MySQL, or elsewhere. This approach is the basis of this question.
What forms of data storage can handle a few hundred inserts per minute, and quickly let me know if the value exists? I am testing with both MySQL (on a different spindle than the Solr index) and with flat files using grep -w someHash hashes.txt and echo someHash >> hashes.txt. Both approaches seem to slow down as the index grows, but it will take a few days or weeks until I see if either approach is feasible.
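For concreteness, the MySQL approach amounts to something like this (a sketch; the table and column names are illustrative, and the hash shown is just an example value):
CREATE TABLE seen_hashes (
  md5_hash CHAR(32) NOT NULL PRIMARY KEY
);

-- Existence check: returns 1 if the document was already indexed, 0 otherwise.
SELECT EXISTS(SELECT 1 FROM seen_hashes WHERE md5_hash = 'd41d8cd98f00b204e9800998ecf8427e');

-- Alternatively, INSERT IGNORE combines the check and the insert in one round trip:
-- an affected-row count of 0 means the hash (and thus the document) was already there.
INSERT IGNORE INTO seen_hashes (md5_hash) VALUES ('d41d8cd98f00b204e9800998ecf8427e');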
What other methods of storing and checking the existence of a hash are possible? What fundamental issues might I run into with the MySQL and flat files approach? What would Knuth do?
On the Solr side, you can look into Deduplication and UpdateXmlMessages#Optional_attributes, which may serve the purpose.
I'm trying to optimize my PHP and MySQL, but my understanding of SQL databases is shoddy at best. I'm creating a website (mostly for learning purposes) which allows users to make different kinds of posts (image/video/text/link).
Here is the basics of what I'm storing
Auto - int (key index)
User ID - varchar
Post id - varchar
Post Type - varchar (YouTube, vimeo, image, text, link)
File Name - varchar (original image name or link title)
Source - varchar (external link or name of file + ext)
Title - varchar (post title picked by user)
Message - text (user's actual post)
Date - int (unix timestamp)
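In rough SQL terms (a sketch; the table name and the exact types and sizes are just what I have in mind, not final):
CREATE TABLE posts (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- Auto
  user_id   VARCHAR(32) NOT NULL,                              -- User ID
  post_id   VARCHAR(32) NOT NULL,                              -- Post id
  post_type VARCHAR(16) NOT NULL,                              -- youtube, vimeo, image, text, link
  file_name VARCHAR(255),                                      -- original image name or link title
  source    VARCHAR(255),                                      -- external link or name of file + ext
  title     VARCHAR(255),                                      -- post title picked by user
  message   TEXT,                                              -- user's actual post
  created   INT UNSIGNED NOT NULL                              -- unix timestamp
);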
I have other data relevant to the post stored in other tables, which I grab with the post ID (like user information), but I'm really doubting whether this is the way I should be storing information. I do use PDO, but I'm afraid this format might just be extremely slow.
Would there be any sense in storing the post information in another format? I don't want excessively large tables, so from a performance standpoint should I store some information as a blob/binary/xml/json?
I can't seem to find any good resources on PHP/MySQL optimization. Most information I come across tends to be 5-10 years old, content you have to pay for, too low-level, or just straight documentation which can't hold my attention for more than half an hour.
Databases are made to store 'data', and are fast to retrieve the data. Do not switch to anything else, stick with a database.
Try not to store pictures and videos in a database. Store them on disk, and keep a reference to them in a database table.
Finally, catch up on database normalization, it will help you in getting your database in optimal condition.
What you have seems okay, but you have missed the important bit about indexes and keys.
Firstly, I am assuming that your primary key will be field 1. Okay, no problems there, but make sure that you also stick an index on userID, PostID, Date and probably a composite on UserID, Date.
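For example (a sketch; I'm assuming a table called posts with columns user_id, post_id and a Unix-timestamp column called created, as sketched in the question):
ALTER TABLE posts
  ADD INDEX idx_user (user_id),
  ADD INDEX idx_post (post_id),
  ADD INDEX idx_created (created),
  ADD INDEX idx_user_created (user_id, created);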
Secondly, are you planning on having search functions on these? In that case you may need to enable full text searches.
Don't muck around trying to store data in JSON or other such things. Store it plain and simple. The last thing you want to be doing is extracting a field from the database just to see what is inside. If your database can't work it out, it is bad design.
On that note, there isn't anything wrong with large tables. As long as they are indexed nicely, a small table or large table will make very little difference in terms of accessing it (short of huge badly written SQL joins), so worry about simplicity to be able to get the data back from it.
Edit: A primary key is a lovely way to identify a row by a unique column of some sort. So, if you want to delete a row, in your example, you might specify DELETE FROM yourTable WHERE ID=6, and you know this will only delete one row, as only one row can have ID=6.
On the other hand, an index is different from a key, in that it is like a cheat-sheet for the database to know where certain information is inside the table. For example, if you have an index on the UserID column, when you pass a UserID in a query, the database won't have to look through the entire table; it looks at the index and knows the location of all the rows for that user.
A composite index takes this one step further. If you know you will constantly query data by both UserID and ContentType, you can add a composite index (meaning an index on BOTH fields in one index), which allows the database to return only the data you specify in a query using both those columns without having to sift through the entire table, nor even sift through all of a user's posts to find the right content type.
Now, indexes take up some extra space on the server, so keep that in mind, but if your tables grow to be larger (which is perfectly fine) the improved efficiency is staggering.
At this time, stick with an RDBMS. Once you are comfortable with PHP and MySQL, there will be more to learn later, like NoSQL, MongoDB, etc., but for your current purpose this is quite right and will not slow you down. Your table schema seems right, with a few modifications.
User ID and post ID should be integers, and since this table stores posts, the post ID should be the auto-incremented primary key.
Another thing is that you are using two fields, filename and source. Note that filename should be the name of the uploaded file, but if by source you mean the complete path to the file, then the DB is not the place to store the complete path. Generate the path in a PHP function every time you need to access it instead of storing it in the DB; otherwise, if you ever need to change the path, it will be a lot of overhead.
You also asked about BLOBs. Note that it is better to store files in the file system, not in the DB; fields like BLOB are for when you want to store a file in a DB table, which I don't recommend here.
I'm trying to create a URL similar to youtube's /v=xxx in look and in behavior. In short, users will upload files and be able to access them through that URL. This URL code needs to be some form of the database's primary key so the page can gather the data needed. I'm new to databases and this is more of a database problem than anything.
In my database I have an auto-increment primary key by which file data is accessed. I want to use that number to create the URL for files. I started looking into different hash functions, but I'm worried about collisions. I don't want the same URL for two different files.
I also considered using uniqid() as my primary key CHAR(13), and just use that directly. But with this I'm worried about efficiency. Also looking around I can't seem to find much about it so it's probably a strange idea. Not to mention I would need to test for collisions when ids are generated which can be inefficient. Auto increment is a lot easier.
Is there any good solution to this? Will either of my ideas work? How can I generate a unique URL from an auto incremented primary key and avoid collisions?
I'm leaning toward my second idea. It won't be greatly efficient, but the largest performance drawbacks come when things need to be added to the database (testing for collisions), which for the end user only happens once. The other performance drawback will probably be in the actual lookup of chars instead of ints. But I'm mainly worried that it's bad practice.
EDIT:
A simple solution would to be just to use the auto incremented value directly. Call me picky, but that looks kind of ugly.
Generating non-colliding short hashes will indeed be a headache. So instead, the slug format Stack Overflow uses is very promising and is guaranteed to produce non-duplicate URLs.
For example, this very same question has
https://stackoverflow.com/questions/11991785/unique-url-from-primary-key
Here, the URL has the unique primary key and also a title to make it more SE friendly.
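If you store a slug next to the primary key, building such a URL needs no hashing and cannot collide (a sketch; the files table and slug column are illustrative names):
SELECT CONCAT('/files/', id, '/', slug) AS url
FROM files
WHERE id = 11991785;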
However, as commented, there are a few previously asked questions that might clear up why what you are trying is better left out:
How to generate a unique hash for a URL?
Create Tinyurl style hash
Creating short hashes increases the chance of a collision a lot, so it is better to use base64 or sha512 functions to create a secure hash.
You can simply make a hash of the time, and afterwards check that hash (or part of that hash) against your DB.
If you set an index on that field in your DB (and make sure the hash is long enough to avoid a lot of collisions), it won't be an issue time-wise at all.
<?php
$hashChecked = false;
while ($hashChecked === false) {
    // 8 chars, stored in a VARCHAR(8) column (make sure that is enough, with a very big margin)
    $hash = substr(sha1(time() . mt_rand(9999, 99999999)), 0, 8);
    $q = mysql_query("SELECT `hash` FROM `tableName` WHERE `hash` = '" . $hash . "'");
    // Keep generating until no existing row has this hash
    $hashChecked = mysql_num_rows($q) > 0 ? false : true;
}
mysql_query("INSERT INTO `tableName` SET `hash` = '" . $hash . "'");
This is fairly straightforward if you're willing to use a random number to generate your short URL. For example, you can do this:
SELECT BASE64_ENCODE(CAST(RAND()*1000000 AS UNSIGNED INTEGER)) AS tag
This is capable of giving you one million different tags. To get more possible tags, increase the value by which the RAND() number is multiplied. These tag values will be hard to predict.
To make sure you don't get duplicates you need to dedupe the tag values. That's easy enough to do but will require logic in your program. Insert the tag values into a table which uses them as a primary key. If your insert fails, try again, reinvoking RAND().
If you get close to your maximum number of tags you'll start having lots of insert failures (tag collisions).
BASE64_ENCODE comes from a stored function you need to install. You can find it here:
http://wi-fizzle.com/downloads/base64.sql
If you're using MySQL 5.6 or higher you can use the built-in TO_BASE64 function.
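Putting it together, the dedupe step could look roughly like this (a sketch; the table and column names are illustrative, and it assumes MySQL 5.6+ for TO_BASE64):
CREATE TABLE short_urls (
  tag     VARCHAR(16) NOT NULL PRIMARY KEY,
  file_id INT UNSIGNED NOT NULL
);

-- Retry with a fresh RAND() value whenever this fails with a duplicate-key error.
-- (42 stands in for the id of the uploaded file.)
INSERT INTO short_urls (tag, file_id)
VALUES (TO_BASE64(CAST(RAND() * 1000000 AS UNSIGNED)), 42);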
I wanted to do something similar (but with articles, not uploaded documents), and came up with something a bit different:
take a prime number [y] (much) larger than the max number [n] of documents there will ever be (e.g. 25000 will be large enough for the total number of documents, and 1000099 is a much larger prime number than 25001)
for the current document id [x]: (x*y) modulus (n+1)
this will generate a number between 1 and n that is never duplicated
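With the example numbers above, the mapping is just this (a sketch; docs and id are illustrative names, and the trick relies on the prime not dividing n+1, which is automatic when it is larger than n+1):
-- y = 1000099 (prime), n = 25000, so the modulus is n + 1 = 25001.
-- Distinct ids in 1..25000 always map to distinct tags in 1..25000.
SELECT id, (id * 1000099) % 25001 AS url_tag
FROM docs;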
Although the URL may look like a traditional primary key, it does have the slight advantage that each subsequent document gets an id which is totally unrelated to the previous one; some people also argue that not exposing the primary key has a very slight security advantage...
Building a system that has the potential to get hammered pretty hard with hits and traffic.
It's a typical Apache/PHP/MySql setup.
I have built plenty of systems before, but never had a scenario where I really had to make decisions regarding potential scalability of this size. I have dozens of questions regarding building a system of this magnitude, but for this particular question, I am trying to decide what to use as the data type.
Here is the 100ft view:
We have a table which (among other things) has a description field. We have decided to limit it to 255 characters. It will be searchable (ie: show me all entries with description that contains ...). Problem: this table is likely to have millions upon millions of entries at some point (or so we think).
I have not yet figured out the strategy for the search (the MySQL LIKE operator is likely to be slow and/or a hog, I am guessing, for such a large number of records), but that's for another SO question. For this question, I am wondering what the pros and cons are of creating this field as TINYTEXT, VARCHAR, or CHAR.
I am not a database expert, so any and all commentary is helpful. Thanks -
Use a CHAR.
BLOBs and TEXTs are stored outside the row, so there will be an access penalty when reading them.
VARCHARs are variable length, which saves storage space but could introduce a small access penalty (since the rows aren't all fixed length).
If you create your index properly, however, either VARCHAR or CHAR can be stored entirely in the index, which will make access a lot faster.
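For example (a sketch; table and column names are illustrative):
ALTER TABLE entries ADD INDEX idx_description (description);

-- Only the indexed column is referenced, so this can be answered from the index
-- alone (a covering index) without touching the rows.
SELECT description FROM entries WHERE description LIKE 'search-prefix%';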
See: varchar(255) v tinyblob v tinytext
And: http://213.136.52.31/mysql/540
And: http://forums.mysql.com/read.php?10,254231,254231#msg-254231
And: http://forums.mysql.com/read.php?20,223006,223683#msg-223683
Incidentally, in my experience the MySQL REGEXP operator is a lot faster than LIKE for simple queries (i.e., SELECT id WHERE some_column REGEXP 'search.*'), and obviously more versatile.
I believe with VARCHAR you've got a variable length stored in the actual database at the low level, which means it could take less disk space, while with the TEXT field it's fixed length even if a row doesn't use all of it. The fixed-length string should be faster to query.
Edit: I just looked it up; TEXT types are stored as variable length as well. The best thing to do would be to benchmark it with something like mysqlslap.
In regards to your other un-asked question, you'd probably want to build some sort of search index that ties every useful word in the description field individually to a description; then you can index that and search it instead. It will be way, way faster than using %LIKE%.
In your situation all three types are bad if you'll use LIKE (a LIKE '%string%' won't use any index created on that column, regardless of its type). Everything else is just noise.
I am not aware of any major difference between TINYTEXT and VARCHAR up to 255 chars, and CHAR is just not meant for variable length strings.
So my suggestion: pick VARCHAR or TINYTEXT (I'd personally go for VARCHAR) and index the content of that column using a full text search engine like Lucene, Sphinx or any other that does the job for you. Just forget about LIKE (even if that means you need to custom build the full text search index engine yourself for whatever reasons you might have, i.e. you need support for a set of features that no engine out there can satisfy).
If you want to search among millions of rows, store all these texts in a different table (which will decrease row size of your big table) and use VARCHAR if your text data is short, or TEXT if you require greater length.
Instead of searching with LIKE use a specialized solution like Lucene, Sphinx or Solr. I don't remember which, but at least one of them can be easily configured for real-time or near real-time indexing.
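The split could look like this (a sketch; table and column names are illustrative):
-- The main table keeps only the small, fixed-size columns.
CREATE TABLE items (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
  -- ... other columns ...
);

-- The description lives in its own table, keyed by the same id.
CREATE TABLE item_descriptions (
  item_id     INT UNSIGNED NOT NULL PRIMARY KEY,
  description VARCHAR(255) NOT NULL,
  FOREIGN KEY (item_id) REFERENCES items (id)
);

-- Join only when the description is actually needed.
SELECT i.id, d.description
FROM items i
JOIN item_descriptions d ON d.item_id = i.id
WHERE i.id = 42;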
EDIT
My suggestion of storing the text in a different table reduces the IO required for the main table, but when data is inserted it requires maintaining an additional index, and it adds join overhead to selects, so it is only worthwhile if you read just a few descriptions at a time and the other data in the table is used more often.
Anyone know of an API (php preferable but I'd be interested in any language) for creating wiki-like data storage?
How about any resources on rolling your own plaintext wiki? How do other plaintext wikis handle the format of the text file?
I understand I can use Markdown or Textile for the formatting. But what I'm most interested in is how to approach the plaintext storage of multi-user edits.
I'm writing a web application that is primarily database driven. I want at least one text field of this database to be in a wiki-like format. Specifically, this text can be edited by multiple users with the ability to roll back to any version. Think the wiki/bio section of Last.FM (almost the entire site is strictly structured by a database except for this one section per artist).
So far, my approach of taking apart MediaWiki and wedging it into a database seems like overkill. I'm thinking it would be much easier to roll my own plaintext wiki, and store this file in the database's appropriate text field.
So, basically this is a "how do I version text information in my DB".
Well, the simplest way is simply copying the data.
Simply, create a "version" table that holds "old versions" of the data, and link it back to your main table.
create table docs (
  id integer not null primary key,
  version integer not null,
  create_date date,
  change_date date,
  create_user_id integer not null references users(id),
  change_user_id integer references users(id),
  text_data text
);

create table versions (
  id integer not null primary key,
  doc_id integer not null references docs(id),
  version integer,
  change_date date,
  change_user integer not null references users(id),
  text_data text
);
Whenever you update your original document, you copy the old text value into this table, copy the user and change date, and bump the version.
select version, change_date, change_user_id, text_data
  into l_version, l_change_date, l_change_user, l_text_data
  from docs where id = l_doc_id;

insert into versions values (newid, l_doc_id, l_version,
  l_change_date, l_change_user, l_text_data);

update docs set version = version + 1, change_date = now(),
  change_user_id = cur_user, text_data = l_new_text where id = l_doc_id;
You could even do this in a trigger if your DB supports those.
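For example, in MySQL the copy could be done by a BEFORE UPDATE trigger (a sketch; it assumes the two tables above and that versions.id is auto-generated), so the application only has to update text_data:
DELIMITER //
CREATE TRIGGER docs_keep_old_version
BEFORE UPDATE ON docs
FOR EACH ROW
BEGIN
  -- Archive the row as it was before this update
  -- (fall back to the creator if the row has never been changed yet).
  INSERT INTO versions (doc_id, version, change_date, change_user, text_data)
  VALUES (OLD.id, OLD.version, OLD.change_date,
          COALESCE(OLD.change_user_id, OLD.create_user_id), OLD.text_data);
  -- Bump the version and stamp the change on the row being written.
  SET NEW.version = OLD.version + 1;
  SET NEW.change_date = CURDATE();
END//
DELIMITER ;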
The fault with this method is that it's a full copy of the data (so if you have a large document, the versions stay large). You can mitigate that by using something like diff(1) and patch(1).
For example:
diff version2.txt version1.txt > difffile
Then you can store that difffile as "version 1".
In order to recover version 1 from version 2, you grab the version 2 data, run patch on it using the diff file data, and that gives you v1.
If you want to go from v3 to v1, you need to do this twice (once to get v2, and then again to get v1).
This lowers your storage burden, but increases your processing (obviously), so you'll have to judge how you want to do this.
Will's huge answer is right on, but can be summed up, I think: you need to store the versions, and then you need to store the metadata (the who, what, and when of the data).
But your question was about resources on Wiki-like versioning. I have none (well, one: Will's answer above). However, about the storage of Wikis, I have one. Check out the comparison matrix from DokuWiki. I know. You're thinking "what do I care what brand of DB different Wikis use?" Because DokuWiki uses plain text files. You can open them and they are indeed plain. So that's one approach, and they've got some interesting arguments as to why DBMS are not the best way to go. They don't even hold much metadata: most of the stuff is done through the flat files themselves.
The point of the DokuWiki for you is that maybe it's a relatively simple problem (depending on how well you want to solve it :)
Here's a list of all 12 wikis on WikiMatrix that are written in PHP and do their storage using text files. Perhaps one of them will have a storage method you can adapt into the database:
http://www.wikimatrix.org/search.php?sid=1760
It sounds like you are essentially just looking for version control. If that is the case, you may want to look into a diff algorithm.
Here is the Wikipedia Diff page.
I did a quick php diff google search, but nothing really stood out as a decent example, since I only have basic PHP knowledge.