MySQL TINYTEXT vs VARCHAR vs CHAR - php

Building a system that has the potential to get hammered pretty hard with hits and traffic.
It's a typical Apache/PHP/MySql setup.
I have built plenty of systems before, but never had a scenario where I really had to make decisions regarding potential scalability of this size. I have dozens of questions regarding building a system of this magnitude, but for this particular question, I am trying to decide what to use as the data type.
Here is the 100ft view:
We have a table which (among other things) has a description field. We have decided to limit it to 255 characters. It will be searchable (ie: show me all entries with description that contains ...). Problem: this table is likely to have millions upon millions of entries at some point (or so we think).
I have not yet figured out the strategy for the search (the MySQL LIKE operator is likely to be slow and/or a resource hog, I'm guessing, for such a large number of records), but that's for another SO question. For this question, I am wondering what the pros and cons are of creating this field as TINYTEXT, VARCHAR, or CHAR.
I am not a database expert, so any and all commentary is helpful. Thanks -

Use a CHAR.
BLOBs and TEXTs can be stored outside the row, so there can be an access penalty for reading them.
VARCHARs are variable length, which saves storage space but can introduce a small access penalty (since the rows aren't all fixed length).
If you create your index properly, however, either VARCHAR or CHAR can be stored entirely in the index, which will make access a lot faster.
See: varchar(255) v tinyblob v tinytext
And: http://213.136.52.31/mysql/540
And: http://forums.mysql.com/read.php?10,254231,254231#msg-254231
And: http://forums.mysql.com/read.php?20,223006,223683#msg-223683
Incidentally, in my experience the MySQL REGEXP operator is a lot faster than LIKE for simple queries (e.g., SELECT id FROM some_table WHERE some_column REGEXP 'search.*'), and obviously more versatile.
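A minimal sketch of that covering-index idea (table, column, and index names are made up for illustration; depending on character set and MySQL version, a long VARCHAR may need a prefix index instead):

```sql
-- Hypothetical table: with an index on the column, queries that touch
-- only the indexed column can be answered from the index alone.
CREATE TABLE items (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    description VARCHAR(255) NOT NULL,
    INDEX idx_description (description)
);

-- Can be satisfied entirely from idx_description (a "covering" index):
SELECT description FROM items WHERE description = 'some exact value';
```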

I believe with VARCHAR you've got a variable-length value stored in the actual database at the low level, which means it can take less disk space; with the text field it's fixed length even if a row doesn't use all of it. The fixed-length string should be faster to query.
Edit: I just looked it up; text types are stored as variable length as well. The best thing to do would be to benchmark it with something like mysqlslap.
In regards to your other un-asked question, you'd probably want to build some sort of search index that ties every useful word in the description field individually to a description; then you can index that and search it instead. It will be far faster than using %like%.
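A minimal sketch of such a word-to-description index (all names here are hypothetical):

```sql
-- Inverted index: one row per (word, description) pair.
CREATE TABLE description_words (
    word           VARCHAR(64)  NOT NULL,
    description_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (word, description_id)
);

-- An indexed equality lookup replaces the unindexable LIKE '%word%':
SELECT description_id FROM description_words WHERE word = 'noise';
```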

In your situation all three types are bad if you'll use LIKE (a LIKE '%string%' won't use any index created on that column, regardless of its type). Everything else is just noise.
I am not aware of any major difference between TINYTEXT and VARCHAR up to 255 chars, and CHAR is just not meant for variable length strings.
So my suggestion: pick VARCHAR or TINYTEXT (I'd personally go for VARCHAR) and index the content of that column using a full-text search engine like Lucene, Sphinx, or any other that does the job for you. Just forget about LIKE (even if that means you need to custom-build the full-text search index engine yourself for whatever reasons you might have, e.g. you need support for a set of features that no engine out there can satisfy).
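If staying inside MySQL is an option, the built-in FULLTEXT index gives a similar effect for simple cases. A sketch with made-up names (note: FULLTEXT required the MyISAM engine before MySQL 5.6; InnoDB supports it from 5.6 on):

```sql
CREATE TABLE entries (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    description VARCHAR(255) NOT NULL,
    FULLTEXT INDEX ft_description (description)
);

-- Uses the full-text index rather than scanning every row:
SELECT id
FROM entries
WHERE MATCH(description) AGAINST ('search terms' IN NATURAL LANGUAGE MODE);
```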

If you want to search among millions of rows, store all these texts in a different table (which will decrease row size of your big table) and use VARCHAR if your text data is short, or TEXT if you require greater length.
Instead of searching with LIKE use a specialized solution like Lucene, Sphinx or Solr. I don't remember which, but at least one of them can be easily configured for real-time or near real-time indexing.
EDIT
My proposal of storing the text in a different table reduces the IO required for the main table, but when data is inserted it requires maintaining an additional index, and it adds join overhead to selects, so it is only worthwhile if you typically read just a few descriptions at once and the other data in the table is used more often.


Using VARCHAR in MySQL for everything! (on small or micro sites)

I tried searching for this as I felt it would be a commonly asked beginner's question, but I could only find things that nearly answered it.
We have a small PHP app that is at most used by 5 people (total, ever) and maybe 2 simultaneously, so scalability isn't a concern.
However, I still like to do things in a best practice manner, otherwise bad habits form into permanent bad habits and spill into code you write that faces more than just 5 people.
Given this context, my question is: is there any strong reason to use anything other than VARCHAR(250+) in MySQL for a small PHP app that is constantly evolving/changing? If I picked INT but that later needed to include characters, it would be annoying to have to go back and change it when I could have just future-proofed it and made it a VARCHAR to begin with. In other words, choosing anything other than VARCHAR with a large character count seems pointlessly limiting for a small app. Is this correct?
Thanks for reading and possibly answering!
If you have the numbers 1 through 12 in VARCHAR, and you need them in numerical order, you get 1,10,11,12,2,3,4,5,6,7,8,9. Is that OK? Well, you could fix it in SQL by saying ORDER BY col+0. Do you like that kludge?
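The kludge in question, assuming a hypothetical table t with a VARCHAR column col holding those numbers:

```sql
-- Lexicographic order: 1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9
SELECT col FROM t ORDER BY col;

-- Adding 0 forces a numeric cast: 1, 2, 3, ..., 10, 11, 12
SELECT col FROM t ORDER BY col + 0;
```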
One of the major drawbacks will be that you will have to add consistency checks in your code. For a small, private database, no problem. But for larger projects...
Using the proper types will do a lot of checks automatically: are there any invalid characters in the value? Is the date valid?
As a bonus, it is easy to add extra constraints when using the right types: is the age less than 110? Is the start date before the end date? Does a foreign key reference an existing value in another table?
I prefer to make the types as specific as possible. Although server errors can be nasty and hard to debug, it is way better than having a database that is not consistent.
It's probably not a great idea to make a habit of it, as with any real amount of data it will become inefficient. If you use the TEXT type, the amount of storage space used for the same amount of data will differ depending on your storage engine.
If you do as you suggest, don't forget that all values that would normally be of a numeric type will need to be converted to a numeric type in PHP. For example, if you store the value "123" as a varchar or text type and retrieve it as $someVar, you will have to do:
$someVar = intval($someVar);
in PHP before arithmetic operations can be performed, otherwise PHP will assume that 123 is a string.
As you may already know, VARCHAR columns are variable-length strings; we get the advantage of dynamic storage allocation when using VARCHAR.
VARCHAR is stored inline with the table, which makes access faster when the size is reasonable.
If your app needs raw performance you can go with CHAR, which is a little faster than VARCHAR.

Why do sites use random alphanumeric ids rather than database ids to identify content?

Why do sites like YouTube, Imgur and most others use random characters as their content ids rather than just sequential numbers, like those created by auto-increment in MySQL?
To explain what I mean:
In the URL: https://www.youtube.com/watch?v=QMlXuT7gd1I
The QMlXuT7gd1I at the end indicates the specific video on that page, but I'm assuming that video also has a unique numeric id in the database. Why do they create and use this alphanumeric string rather than just use the video's database id?
I'm creating a site which identifies content in the URL like above, but I'm currently using just the DB id. I'm considering switching to random strings because all major sites do it, but I'd like to know why this is done before I implement it.
Thanks!
Some sites do that because of sharding.
When you have only one process (one server) writing, it is possible to generate auto-increment ids without duplicates, but when you have multiple servers (with multiple processes) writing content, as YouTube does, it's no longer practical to use an auto-increment id. The cost of the synchronization needed to avoid duplication would be huge.
For example, if you read MongoDB's ObjectId documentation you can see this structure for the id:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
In the end, it's only 12 bytes. When you represent it in hexadecimal it looks like 24 bytes, but that is only the displayed form.
Another advantage of this scheme is that the timestamp is embedded in the id, so you can decode the id to recover the timestamp.
First, this is not a random string; it is a base conversion that depends on the id. They go this way because an alphanumeric alphabet gives a bigger base.
Something like 99999999 could be 1NJCHR.
Take a look here, and play with the bases, and learn more about it.
You will see it is much shorter. That is the only reason I can imagine someone would go this way, and it makes sense if you have ids like 54389634589347534985348957863457438959734.
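MySQL can perform this kind of base conversion itself with the built-in CONV() function (it handles bases 2 through 36):

```sql
SELECT CONV(99999999, 10, 36);   -- returns '1NJCHR'
SELECT CONV('1NJCHR', 36, 10);   -- returns '99999999'
```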
As self and Cameron commented/answered, there is a chance (especially for YouTube) that additional security parameters like time and length are calculated into it in some way, so that you are not able to guess an identifier.
In addition to Christian's answer above, using a base calculation, hashed value or other non-numeric identifier has the advantage of obscuring your db's size from competitors.
Even if you stayed with numeric and set your auto_increment to start at 50,000, increase by 50, etc., educated guesses can still be made at the db's size and growth. Non-numeric options don't eliminate this possibility, but they inhibit it to a certain extent.
There is also a real chance of malicious input from end users, and by not using sequential ids, users can't guess ids and thus can't guess how large the db is. However, the other answers on base conversion explain it well.

Mysql: Store array of data in a single column

and thanks in advance for your help.
Well, this is my situation. I have a web system that makes some noise-related calculations based on a sample, created by a sonometer. Originally, the database only stored the results of these calculations. But now, I have been asked to also store the samplings themselves. Each sample is only a list of 300 or 600 numbers with 1 decimal each.
So, the simplest approach I have come up with is to add a column in the table that stores all the calculations for a given sample. This column should contain the list of numbers.
My question then: What is the best way to store this list of numbers in a single column?
Things to consider:
it would be nice if the list could be read by both PHP and javascript with no further complications.
The list is only useful if retrieved in its totality, which is why I'd rather not normalize it. Also, the calculations made on that list are kind of complex and already coded in PHP and JavaScript, so I won't be doing any SQL queries on elements of a given list.
Also, if there are better approaches than storing it, I would love to know about them
Thanks a lot and have a good day/evening :)
First off, you really don't want to do that. A column in a RDBMS is meant to be atomic, in that it contains one and only one piece of information. Trying to store more than one piece of data in a column is a violation of first normal form.
If you absolutely must do it, then you need to convert the data into a form that can be stored as a single item of data, typically a string. You could use PHP's serialize() mechanism, XML parsing (if the data happens to be a document tree), json_encode(), etc.
But how do you query such data effectively? The answer is you can't.
Also, if someone else takes over your project at a later date you're really going to annoy them, because serialized data in a database is horrid to work with. I know because I've inherited such projects.
Did I mention you really don't want to do that? You need to rethink your design so that it can more easily be stored in terms of atomic rows. Use another table for this data, for example, and use foreign keys to relate it to the master record. They're called relational databases for a reason.
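A minimal sketch of that normalized design, with made-up table and column names (the parent calculations table and DECIMAL precision are assumptions based on the question's "300 or 600 numbers with 1 decimal each"):

```sql
-- One row per sample value, keyed to its parent calculation record.
CREATE TABLE samples (
    calculation_id INT UNSIGNED      NOT NULL,
    position       SMALLINT UNSIGNED NOT NULL,   -- 0..299 or 0..599
    value          DECIMAL(6,1)      NOT NULL,   -- one decimal place
    PRIMARY KEY (calculation_id, position),
    FOREIGN KEY (calculation_id) REFERENCES calculations (id)
);

-- Fetch a complete sample list in order:
SELECT value FROM samples WHERE calculation_id = 42 ORDER BY position;
```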
UPDATE: I've been asked about data storage requirements, as in whether a single row would be cheaper in terms of storage. The answer is: in typical cases no, it's not, and in the cases where the answer is yes, the price you pay isn't worth paying.
If you use a 2-column dependent table (one column for the foreign key of the record the sample belongs to, one for a single sample) then each row will require at worst 16 bytes (8 bytes for a BIGINT key column, 8 bytes for a double-precision floating point number). For 100 records that's 1600 bytes (ignoring db overhead).
For a serialized string, you store in the best case 1 byte per character in the string. You can't know how long the string is going to be, but if we assume 100 samples with all the stored data by some contrived coincidence all falling between 10000.00 and 99999.99 with there only ever being 2 digits after the decimal point, then you're looking at 8 bytes per sample. In this case, all you've saved is the overhead of the foreign keys, so the amount of storage required comes out at 800 bytes.
That of course is based on a lot of assumptions, such as the character encoding always being 1 byte per character, the strings that make up the samples never being longer than 8 characters, etc.
But of course there's also the overhead of whatever mechanism you use to serialize the data. The absolute simplest method, CSV, means adding a comma between every sample. That adds n-1 bytes to the stored string. So the above example would now be 899 bytes, and that's with the simplest encoding scheme. JSON, XML, even PHP serializations all add more overhead characters than this, and you'll soon have strings that are a lot longer than 1600 bytes. And all this is with the assumption of 1 byte character encoding.
If you need to index the samples, the data requirements will grow even more disproportionately against strings, because a string index is a lot more expensive in terms of storage than a floating point column index would be.
And of course if your samples start adding more digits, the data storage goes up further. 39281.3392810 will not be storable in 8 bytes as a string, even in the best case.
And if the data is serialized, the database can't manipulate it. You can't sort the samples or do any kind of mathematical operation on them; the database doesn't even know they're numbers!
To be honest though, storage is ridiculously cheap these days; you can buy multi-TB drives for tiny sums. Is storage really that critical? Unless you have hundreds of millions of records, I doubt it is.
You might want to check out a book called SQL Antipatterns
I would recommend creating a separate table with three columns for the samples: one for the id of the record, the second for the id of the sample, and the third for the value. Of course, if your main table doesn't have a unique id column already, you would have to create one and use it as a foreign key.
The reason for my suggestion is simplicity and data integrity. Another argument is that this structure is memory efficient, as you will avoid varchar (which would then also require parsing and adds the overhead of additional computation).
UPDATE As GordonM and Darin elaborated below, the memory argument is not necessarily valid (see below for further explanation), but there are also other reasons against a serialized approach.
Finally, this doesn't involve any complex PHP or JavaScript and is quite straightforward to code.

Why does vBulletin use ENUMs?

Today, I've had some debate with my colleague about choosing data types in our projects.
We're web developers; we code the back-end in PHP, and for the database we use MySQL.
So, I looked around the internet a bit, and people don't recommend the ENUM data type for various reasons (I've also read here on SO that it is not recommended). For ENUM('yes','no'), for example, you should use tinyint(1) instead.
If ENUMs are bad and should be avoided, why does vBulletin for example, uses them?
Why use them at all when you can use VARCHAR, TEXT and so on and enforce use of 1 of 2 possible values in PHP.
Thank you for your answers.
Enums aren't ideal, but they are waaaay better than your alternative suggestion of using a VARCHAR and enforcing one of a few possible values!
Enums store their data as a numeric value. This is ideal for storing a field with a limited set of possible values such as 'yes' or 'no', because it uses the minimum amount of space, and gives the quickest possible access, especially for searches.
Where enums fall over is if you later need to add additional values to the list. Let's say you need to have 'maybe' as well as 'yes' or 'no'. Because it's stored in an enum, this change requires a database change. This is a bad thing for several reasons - for example, if you have a large data set, it can take a significant amount of time to rebuild the table.
The solution to this is to use a related table which stores a list of possible values, and your original field would now simply contain an ID reference to your new table, and queries would make a join to the lookup table to get the string value. This is called "normalisation" and is considered good database practice. It's a classic relational database scenario to have a large number of these lookup tables.
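A sketch of that lookup-table pattern (table and column names are invented for illustration):

```sql
CREATE TABLE statuses (
    id   TINYINT UNSIGNED NOT NULL PRIMARY KEY,
    name VARCHAR(16) NOT NULL
);
INSERT INTO statuses VALUES (1, 'yes'), (2, 'no'), (3, 'maybe');

CREATE TABLE posts (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    status_id TINYINT UNSIGNED NOT NULL,
    FOREIGN KEY (status_id) REFERENCES statuses (id)
);

-- Join to the lookup table to recover the string value:
SELECT p.id, s.name
FROM posts p
JOIN statuses s ON s.id = p.status_id;
```

Adding a new value ('maybe' above) is now just an INSERT into the lookup table, not an ALTER TABLE on the large table.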
Obviously, if you are fairly sure that the field will never store anything other than 'yes' or 'no', then it can be overkill to have a whole extra table for it, and an enum may be appropriate.
Some database products do not even provide an enum data type, so if you're using these DBs, you are forced to use the lookup table solution (or just a simple numeric field, and map the values in your application).
What is never appropriate in this situation is to use an actual string value in the table. This is considered extremely poor practice.
VARCHARs take up much more disk space than the numeric values used by an enum. They are also slower to read and slower to look up in a query. In addition, they remove the enforcement of fixed values provided by enum. This means that a bug in your program could result in invalid values going into the data, as could an inadvertent update using phpMyAdmin or a similar tool.
I hope that helps.

MySQL Table Optimization

I'm looking to optimize a few tables in a database because currently under high load the wait times are far too long...
Ignore the naming schema (it's terrible), but here's an example of one of the mailing list tables with around 1,000,000 records in it. At the moment I don't think I can really normalize it anymore without completely re-doing it all.
Now... How much impact will the following have:
- Changing fields like the active field to use a boolean as opposed to a string of Yes/No
- Combining some of the fields, such as Address1, 2, 3, 4, to use a single 'TEXT' field
- Reducing characters available, e.g. making it a VARCHAR(200) instead of
- Setting values to NULL rather than leaving them blank
One other thing I'm interested in: a couple of tables, including this one, use InnoDB as opposed to the standard MyISAM. Is this recommended?
The front-end is coded in PHP, so I'll be looking through that code as well. At the moment I'm just looking at the DB level, but any suggestions or help will be more than welcomed!
Thanks in advance!
None of the changes you propose for the table are likely to have any measurable impact on performance.
Reducing the max length of the VARCHAR columns won't matter if the row format is dynamic, and given the number and length of the VARCHAR columns, dynamic row format would be most appropriate.
What you really need to tune is the SQL that runs against the table.
Likely, adding, replacing and/or removing indexes is going to be the low hanging fruit.
Without the actual SQL, no one can make any reliable tuning recommendations.
For this query:
SELECT email FROM mytable WHERE mailinglistId = X
I'd make sure I had an index on (mailinglistId, email) e.g.
CREATE INDEX mytable_ix2 ON mytable (mailinglistId, email);
However, beware of adding indexes that aren't needed, because maintenance of indexes isn't free, indexes use resources (memory and i/o).
That's about the only tuning you're going to be able to do on that table, without some coding changes.
To really tune the database, you need to identify the performance bottleneck. (Is it the design of the application SQL: obtaining table locks? Concurrent inserts from multiple sessions blocking? Or does the instance need to be tuned: increasing the size of the buffer cache, the InnoDB buffer pool, or the key cache? SHOW INNODB STATUS may give you some clues.)
