UUID Storage as Binary in MySQL - php

I would like to create a column (not a PK) whose value serves as a unique identifier. It is not used for encryption or security purposes - strictly to identify a record. Each time a new record is inserted, I want to generate and store this unique identifier. Not sure if this is relevant, but I have 1 million records now and anticipate ~3 million in 2 years. The app is a web app written in PHP.
I initially just assumed I'd call UUID() and store it directly as some sort of char data type, but I wanted to do some research and learn a more efficient/optimized approach. I found a lot of great articles here on SO, but I'm having a hard time with them because many are somewhat older or disagree on the approach, which has ultimately left me very confused. I wanted to ask if someone wiser/more experienced could lend me a hand.
I saw folks link to this article in various posts and suggest implementing things this way:
https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/
but I'm having a hard time fully understanding what to do after reading that article. Ordered UUID? What should I store it as? I think maybe that particular page is a tad over my head. I wanted to ask if someone could help clarify some of this for me. Specifically:
What data type should my column be for storing binary data (that represents my UUID)?
What function should I use to convert my UUID to and from some binary value?
Any more advice or tips someone could share?
Thanks so much!

If you call MySQL's UUID(), you get a variant that is roughly chronological. So, if you tend to reference "recent" records and ignore "old" records, then rearranging the bits in the UUID can provide better "locality of reference" (that is, better performance).
Version 4 UUIDs do not provide that.
You can turn the UUID from the bulky 36-character string into a more compact, 16-byte BINARY(16) (your Q1) using the code in my UUID blog (your Q2). That document discusses various other aspects of your question (Q3).
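For reference, here is a minimal PHP sketch of the to/from-binary conversion. It is an illustration of the idea, not the exact code from the blog, and it assumes the 36-character string produced by MySQL's UUID(); the rearrangement is only meaningful for version-1 UUIDs, whose time fields can be reordered.

// Hypothetical helpers: textual UUID <-> 16 raw bytes for a BINARY(16) column.
function uuid_to_bin(string $uuid, bool $reorder = true): string {
    $hex = str_replace('-', '', $uuid);              // 32 hex characters
    if ($reorder) {
        // Put time-high and time-mid in front of time-low so version-1 UUIDs
        // become roughly chronological (the trick the Percona article describes).
        $hex = substr($hex, 12, 4) . substr($hex, 8, 4) . substr($hex, 0, 8) . substr($hex, 16);
    }
    return hex2bin($hex);                            // 16 bytes, ready for BINARY(16)
}

function bin_to_uuid(string $bin, bool $reorder = true): string {
    $hex = bin2hex($bin);
    if ($reorder) {
        // Undo the rearrangement above.
        $hex = substr($hex, 8, 8) . substr($hex, 4, 4) . substr($hex, 0, 4) . substr($hex, 16);
    }
    return sprintf('%s-%s-%s-%s-%s',
        substr($hex, 0, 8), substr($hex, 8, 4), substr($hex, 12, 4),
        substr($hex, 16, 4), substr($hex, 20));
}

You would bind the return value of uuid_to_bin() as a parameter when inserting into the BINARY(16) column. If upgrading is an option, MySQL 8.0 ships UUID_TO_BIN() and BIN_TO_UUID() with a swap flag that do the same rearrangement server-side.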
The Percona link you provided gives some benchmarks 'proving' the benefit.
3M uuids taking 16 bytes each = 48MB. It is bulky, but not likely to cause serious problems. Still, I recommend avoiding uuids whenever practical.

I used UUID v4 on a recent project. The code to generate UUID v4 can be sourced here: PHP function to generate v4 UUID
The main difference is that we compressed it to a 22-byte, case-sensitive format. This approach is also used by ElasticSearch.
The resulting values are stored simply as char(22).
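For illustration, something like the following produces that kind of 22-character, case-sensitive value. This is a sketch assuming the compact form is URL-safe Base64 of the 16 raw UUID bytes with the trailing '==' padding stripped; I don't know the exact implementation used in that project.

// Pack a v4 UUID into 22 URL-safe, case-sensitive characters (and back).
$uuid  = '550e8400-e29b-41d4-a716-446655440000';               // any UUID string
$bytes = hex2bin(str_replace('-', '', $uuid));                  // 16 raw bytes
$short = rtrim(strtr(base64_encode($bytes), '+/', '-_'), '=');  // 22 characters
$restored = bin2hex(base64_decode(strtr($short, '-_', '+/')));  // back to 32 hex chars

Note that MySQL's default collations are case-insensitive, so the CHAR(22) column needs a case-sensitive (for example a _bin) collation for these values to stay distinct.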

Related

How can I parse multiple values in a single field of a mySql table?

I am working with a real-estate database in MySQL. It is used with a WordPress site.
Each property that is listed has related agents in a joined table. There may be more than 1 agent to a property.
However, multiple agents are listed in a single field (so the join is 1-to-1).
Here is an example
a:2:{i:0;s:4:"1995";i:1;s:3:"260";}
So it appears that this is some sort of array implementation. I don't have much experience with this, as it is a bad way of storing data in a relational DB anyway.
In this example, there are 2 agents.
I know that:
- the first "2" probably refers to the number of agents.
- "1995" and "260" are the record-ids of the agent's record.
How would I craft a SQL statement to extract these two record IDs (and find the related data in the main record) from PHP?
If this is a cumbersome solution, I'd want to create a new SQL view to make access easier. Is this possible?
I can't find any PHP functions that interpret this field, if any exist.
Table redesign is not an option.
Edit:
As I was reading through the answers to the similar question that was pointed out, I came to the conclusion that this is most probably impossible to do, and frankly ridiculous.
The solution entailed using SUBSTRING_INDEX repeatedly, and how many times could vary with every row. Even if the number of values were fixed, it may not be possible to accomplish in MySQL. I will have to work with the limitations that this creates.
The data format you mention is PHP's serialization format. Before JSON took over the world, Rasmus Lerdorf and the PHP creators came up with their own way of doing it.
WordPress uses lots of this stuff.
PHP's unserialize() function converts that string representation back into the corresponding PHP value (an array, in this case).
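A minimal sketch of that approach in PHP, using the field value from the question; the table and column names here are invented for illustration:

$pdo = new PDO('mysql:host=localhost;dbname=realestate', 'user', 'pass'); // your connection

// Serialized value as stored in the joined table.
$raw = 'a:2:{i:0;s:4:"1995";i:1;s:3:"260";}';
$agentIds = unserialize($raw);                     // ["1995", "260"]

// Look up the related agent records with one IN (...) query.
$placeholders = implode(',', array_fill(0, count($agentIds), '?'));
$stmt = $pdo->prepare("SELECT * FROM agents WHERE agent_id IN ($placeholders)");
$stmt->execute(array_map('intval', $agentIds));
$agents = $stmt->fetchAll(PDO::FETCH_ASSOC);

The unserializing has to happen in PHP; as noted below, doing it purely in SQL (or in a view) is not practical.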
If you must, you can probably use MySQL string functions like LOCATE() and SUBSTRING() to extract certain values from those serialized strings. But that will be a huge pain in the neck, and very specific to the kind of object that's serialized.

Using VARCHAR in MySQL for everything! (on small or micro sites)

I tried searching for this as I felt it would be a commonly asked beginner's question, but I could only find things that nearly answered it.
We have a small PHP app that is at most used by 5 people (total, ever) and maybe 2 simultaneously, so scalability isn't a concern.
However, I still like to do things in a best practice manner, otherwise bad habits form into permanent bad habits and spill into code you write that faces more than just 5 people.
Given this context, my question is: is there any strong reason to use anything other than VARCHAR(250+) in MySQL for a small PHP app that is constantly evolving/changing? If I picked INT but that later needed to include characters, it would be annoying to have to go back and change it when I could have just future-proofed it and made it a VARCHAR to begin with. In other words, choosing anything other than VARCHAR with a large character count seems pointlessly limiting for a small app. Is this correct?
Thanks for reading and possibly answering!
If you have the numbers 1 through 12 in VARCHAR, and you need them in numerical order, you get 1,10,11,12,2,3,4,5,6,7,8,9. Is that OK? Well, you could fix it in SQL by saying ORDER BY col+0. Do you like that kludge?
One of the major drawbacks will be that you will have to add consistency checks in your code. For a small, private database, no problem. But for larger projects...
Using the proper types will do a lot of checks automatically: are there invalid characters in the value, is the date valid, and so on.
As a bonus, it is easy to add extra constraints when using the right types: is the age less than 110, is the start date before the end date, does the key reference an existing value in another table?
I prefer to make the types as specific as possible. Although server errors can be nasty and hard to debug, it is way better than having a database that is not consistent.
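As a rough illustration of the kind of checking specific types and constraints give you for free (the table and column names are invented; note that MySQL only enforces CHECK constraints from 8.0.16 onward, and bad values are rejected only in strict SQL mode):

// $pdo is an existing PDO connection to MySQL.
$pdo->exec("
    CREATE TABLE person (
        id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        age        TINYINT UNSIGNED NOT NULL,   -- 'abc' or -5 is rejected in strict mode
        start_date DATE NOT NULL,               -- '2023-02-30' is rejected in strict mode
        end_date   DATE NOT NULL,
        CHECK (age < 110),                      -- enforced in MySQL 8.0.16+
        CHECK (start_date <= end_date)
    )
");

With VARCHAR(250) everywhere, every one of those checks would have to live in your PHP code instead.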
Probably not a great idea to make a habit of it, as it will become inefficient with any real amount of data. If you use the TEXT type, the amount of storage space used for the same data will differ depending on your storage engine.
If you do as you suggested, don't forget that all values that would normally be of a numeric type will need to be converted to a numeric type in PHP. For example, if you store the value "123" as a VARCHAR or TEXT type and retrieve it as $someVar, you will have to do:
$someVar = intval($someVar);
in PHP before using it as a number; otherwise PHP will treat "123" as a string (its loose type juggling may hide this in simple arithmetic, but strict comparisons will not).
As you may already know, VARCHAR columns hold variable-length strings, so you get the advantage of dynamic storage allocation when using VARCHAR.
VARCHAR is stored inline with the table, which makes access faster when the size is reasonable.
If your app needs performance, you can go with CHAR, which is a little faster than VARCHAR.

Handling data with 1000~ variables, preferably using SQL

Basically, I have tons of files with some data. Each differs, some lack certain variables (null), etc.; classic stuff.
The part where it gets somewhat interesting is that, since each file can have up to 1000 variables, with at least ~800 values that are not null, I thought: "Hey, I need 1000 columns." Another thing to mention: they are integers, bools, text, everything. They differ by size and type. Each variable is under 100 bytes in every file, although they vary.
I found this question Work around SQL Server maximum columns limit 1024 and 8kb record size
I'm unfamiliar with the capacities of SQL servers and with table design, but the thing is: the people who answered that question say the design should be reconsidered, and I can't do that. I can, however, convert what I already have, as long as I still keep those 1000 variables.
I'm willing to use any SQL server, but I don't know what suits my requirements best. If doing something else entirely is better, please say so.
What I need to do with this data is look at it, compare it, and search within it. I don't need the ability to modify it. I thought of just keeping the files as plain text and reading from them, but that takes "seconds" of PHP runtime to view data from a "few" of these files, and that is too much, not even considering the fact that I need to check about 1000 or more of these files to do any search.
So the question is: what is the fastest way of having 1000+ entities with 1000 variables each and searching/comparing any variable I wish within them? And if it's SQL, which SQL server works best for this sort of thing?
Sounds like you need a different kind of database for what you're doing. Consider a document database, such as MongoDB, or one of the other not-only-SQL database flavors that allows for manipulation of data in different ways than a traditional table structure.
I just saw the note mentioning that you're only reading as well. I've had good luck with Solr on a similar dataset.
You want to use an EAV (entity-attribute-value) model. This is pretty common.
You are asking for the best; I can give an answer (how I solved it), but I can't say whether it is the 'best' way in your environment. I had the problem of collecting inventory data from many thousands of PCs (no, not the NSA - kidding).
My solution was:
One row per PC (one per file, in your case?)
Table File:
one row per file, PK FILE_ID
Table File_data:
one row per attribute in a file; PK (FILE_ID, ATTR_ID); further columns ATTR_NAME, ATTR_VALUE, (ATTR_TYPE)
The table File_data got somewhat big (>1e6 rows), but the DB handled that fast. A rough SQL sketch of this layout appears at the end of this answer.
HTH
EDIT:
My answer above was rather short; I want to add some information about my (still working) solution:
The table 'per info source' has more fields than just the PK FILE_ID, e.g. ISOURCE and ITYPE, where ISOURCE and ITYPE describe where the data came from (I had many sources) and what basic information type it is / was. This helps to give queries some structure. I did not need to include data from 'switches' or 'monitors' when searching for USB devices (edit: today, probably yes).
The attributes table had more fields, too. I mention two of them here: ISOURCE and ITYPE; yes, the same names as above, but with a slightly different meaning and the same idea behind them.
What you would have to put into these fields definitely depends on your data.
I am sure that if you take a closer look at what information you have to collect, you will find some 'key values' for it.
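To make the above concrete, here is a rough sketch of that layout in MySQL, queried from PHP. The column types and index names are assumptions for illustration, not part of the original solution.

// $pdo is an existing PDO connection.
$pdo->exec("
    CREATE TABLE file (
        file_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name    VARCHAR(255) NOT NULL
    )
");
$pdo->exec("
    CREATE TABLE file_data (
        file_id    INT UNSIGNED NOT NULL,
        attr_id    INT UNSIGNED NOT NULL,
        attr_name  VARCHAR(64)  NOT NULL,
        attr_value VARCHAR(255) NULL,
        attr_type  VARCHAR(16)  NULL,
        PRIMARY KEY (file_id, attr_id),
        KEY idx_attr (attr_name, attr_value),
        FOREIGN KEY (file_id) REFERENCES file (file_id)
    )
");

// 'Which files have attribute X with value Y?' becomes an indexed lookup:
$stmt = $pdo->prepare("SELECT DISTINCT file_id FROM file_data WHERE attr_name = ? AND attr_value = ?");
$stmt->execute(['ram_mb', '8192']);
$fileIds = $stmt->fetchAll(PDO::FETCH_COLUMN);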
For storage, XML is probably the best way to go. There is really good support for XML in SQL Server.
For queries, if they are direct SQL queries, 1000+ rows isn't a lot and XML will be plenty fast. If you're moving towards a million+ rows, you're probably going to want to take the data that is most selective out of the XML and index that separately.
Link: http://technet.microsoft.com/en-us/library/hh403385.aspx

Mysql: Store array of data in a single column

And thanks in advance for your help.
Well, this is my situation: I have a web system that makes some noise-related calculations based on a sample created by a sonometer. Originally, the database only stored the results of those calculations, but now I have been asked to also store the samples themselves. Each sample is just a list of 300 or 600 numbers with 1 decimal place each.
So, the simplest approach I have come up with is to add a column in the table that stores all the calculations for a given sample. This column should contain the list of numbers.
My question then: What is the best way to store this list of numbers in a single column?
Things to consider:
it would be nice if the list could be read by both PHP and JavaScript with no further complications.
The list is only useful if retrieved in its entirety, which is why I'd rather not normalize it. Also, the calculations made on that list are fairly complex and already coded in PHP and JavaScript, so I won't be doing any SQL queries on elements of a given list.
Also, if there are better approaches than storing it, I would love to know about them
Thanks a lot and have a good day/evening :)
First off, you really don't want to do that. A column in a RDBMS is meant to be atomic, in that it contains one and only one piece of information. Trying to store more than one piece of data in a column is a violation of first normal form.
If you absolutely must do it, then you need to convert the data into a form that can be stored as a single item of data, typically a string. You could use PHP's serialize() mechanism, XML parsing (if the data happens to be a document tree), json_encode(), etc.
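For completeness, here is a sketch of the json_encode() route just described, since JSON is the one format both PHP and JavaScript read natively. The table and column names are invented for illustration.

// $pdo is an existing PDO connection; $calculationId identifies the row of calculation results.
$samples = [52.3, 48.7, 61.0];   // in reality 300 or 600 one-decimal readings

// Write the whole list into a single TEXT (or JSON) column.
$stmt = $pdo->prepare("UPDATE calculations SET samples_json = ? WHERE id = ?");
$stmt->execute([json_encode($samples), $calculationId]);

// Read it back as a PHP array...
$stmt = $pdo->prepare("SELECT samples_json FROM calculations WHERE id = ?");
$stmt->execute([$calculationId]);
$samples = json_decode($stmt->fetchColumn(), true);

// ...while JavaScript can feed the very same string to JSON.parse().

All the caveats below still apply: the database only ever sees one opaque string.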
But how do you query such data effectively? The answer is you can't.
Also, if someone else takes over your project at a later date you're really going to annoy them, because serialized data in a database is horrid to work with. I know because I've inherited such projects.
Did I mention you really don't want to do that? You need to rethink your design so that it can more easily be stored in terms of atomic rows. Use another table for this data, for example, and use foreign keys to relate it to the master record. They're called relational databases for a reason.
UPDATE: I've been asked about data storage requirements, as in whether a single row would be cheaper in terms of storage. The answer is, in typical cases no it's not, and in cases where the answer is yes the price you pay for it isn't worth paying.
If you use a 2-column dependent table (1 column for the foreign key of the record the sample belongs to, one for a single sample), then each row will at worst require 16 bytes (8 bytes for a BIGINT key column, 8 bytes for a double-precision floating point number). For 100 samples that's 1600 bytes (ignoring db overhead).
For a serialized string, you store, in the best case, 1 byte per character. You can't know how long the string is going to be, but if we assume 100 samples, with all the stored values by some contrived coincidence falling between 10000.00 and 99999.99 and only ever having 2 digits after the decimal point, then you're looking at 8 bytes per sample. In that case, all you've saved is the overhead of the foreign keys, so the amount of storage required comes out at 800 bytes.
That of course is based on a lot of assumptions, such as the character encoding always being 1 byte per character, the strings that make up the samples never being longer than 8 characters, etc.
But of course there's also the overhead of whatever mechanism you use to serialize the data. The absolute simplest method, CSV, means adding a comma between every sample. That adds n-1 bytes to the stored string. So the above example would now be 899 bytes, and that's with the simplest encoding scheme. JSON, XML, even PHP serializations all add more overhead characters than this, and you'll soon have strings that are a lot longer than 1600 bytes. And all this is with the assumption of 1 byte character encoding.
If you need to index the samples, the data requirements will grow even more disproportionately against strings, because a string index is a lot more expensive in terms of storage than a floating point column index would be.
And of course if your samples start adding more digits, the data storage goes up further. 39281.3392810 will not be storable in 8 bytes as a string, even in the best case.
And if the data is serialized, the database can't manipulate it. You can't sort the samples or do any kind of mathematical operation on them; the database doesn't even know they're numbers!
To be honest though, storage is ridiculously cheap these days, you can buy multiple TB drives for tiny sums. Is storage really that critical? Unless you have hundreds of millions of records then I doubt it is.
You might want to check out a book called SQL Antipatterns.
I would recommend creating a separate table with three columns for the samples: one for the ID of the record, a second for the ID (position) of the sample, and a third for the value. Of course, if your main table doesn't already have a unique id column, you would have to create one and use it as the foreign key.
The reasons for my suggestion are simplicity and data integrity. Another argument is that this structure is memory efficient, as you avoid a VARCHAR (which would also require parsing and the cost of additional computation).
UPDATE: As GordonM and Darin elaborated below, the memory argument is not necessarily valid, but there are also other reasons against a serialized approach.
Finally, this doesn't involve any complex PHP or JavaScript and is quite straightforward to code.
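Here is a sketch of the suggested structure and how the existing PHP/JavaScript code would read it back; all names and the DECIMAL(6,1) type are illustrative assumptions.

// $pdo is an existing PDO connection.
$pdo->exec("
    CREATE TABLE calculation_samples (
        calculation_id INT UNSIGNED NOT NULL,      -- FK to the main calculations table
        position       SMALLINT UNSIGNED NOT NULL, -- 0..299 or 0..599, preserves the order
        value          DECIMAL(6,1) NOT NULL,      -- one decimal place per reading
        PRIMARY KEY (calculation_id, position),
        FOREIGN KEY (calculation_id) REFERENCES calculations (id)
    )
");

// Retrieve the full list in order, ready to json_encode() for the JavaScript side.
$stmt = $pdo->prepare("SELECT value FROM calculation_samples WHERE calculation_id = ? ORDER BY position");
$stmt->execute([$calculationId]);
$samples = array_map('floatval', $stmt->fetchAll(PDO::FETCH_COLUMN));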

Proposed solution: Generate unique IDs in a distributed environment

I've been browsing the net trying to find a solution that will allow us to generate unique IDs in a regionally distributed environment.
I looked at the following options (among others):
SNOWFLAKE (by Twitter)
It seems like a great solution, but I just don't like the added complexity of having to manage another piece of software just to create IDs;
It lacks documentation at this stage, so I don't think it will be a good investment;
The nodes need to be able to communicate to one another using Zookeeper (what about latency / communication failure?)
UUID
Just look at it: 550e8400-e29b-41d4-a716-446655440000;
It's a 128-bit ID;
There have been some known collisions (depending on the version, I guess); see this post.
AUTOINCREMENT IN RELATIONAL DATABASE LIKE MYSQL
This seems safe, but unfortunately, we are not using relational databases (scalability preferences);
We could deploy a MySQL server for this like what Flickr does, but again, this introduces another point of failure / bottleneck. Also added complexity.
AUTOINCREMENT IN A NON-RELATIONAL DATABASE LIKE COUCHBASE
This could work since we are using Couchbase as our database server, but:
This will not work when we have more than one cluster in different regions (latency issues, network failures): at some point, IDs will collide depending on the amount of traffic;
MY PROPOSED SOLUTION (this is what I need help with)
Let's say that we have clusters consisting of 10 Couchbase nodes and 10 application nodes in 5 different regions (Africa, Europe, Asia, America and Oceania). This is to ensure that content is served from a location closest to the user (to boost speed) and to ensure redundancy in case of disasters, etc.
Now, the task is to generate IDs that won't collide when the replication (and balancing) occurs, and I think this can be achieved in 3 steps:
Step 1
All regions will be assigned integer IDs (unique identifiers):
1 - Africa;
2 - America;
3 - Asia;
4 - Europe;
5 - Oceania.
Step 2
Assign an ID to every application node that is added to the cluster, keeping in mind that there may be up to 99,999 servers in one cluster (even though I doubt it; this is just a safety precaution). This will look something like this (fake IPs):
00001 - 192.187.22.14
00002 - 164.254.58.22
00003 - 142.77.22.45
and so forth.
Please note that all of these are in the same cluster, which means you can have a node 00001 in each region.
Step 3
For every record inserted into the database, an incremented ID will be used to identify it, and this is how it will work:
Couchbase offers an increment feature that we can use to create IDs internally within the cluster. To ensure redundancy, 3 replicas will be created within the cluster. Since these are in the same place, I think it is safe to assume that unless the whole cluster is down, one of the nodes responsible for this will be available; otherwise the number of replicas can be increased.
Bringing it all together
Say a user is signing up from Europe:
The application node serving the request will grab the region code (4 in this case), get its own ID (say 00005) and then get an incremented ID (1) from Couchbase (from the same cluster).
We end up with 3 components: 4, 00005, 1. Now, to create an ID from this, we can just join these components into 4.00005.1. To make it even better (I'm not too sure about this), we can concatenate the components (not add them up) to end up with: 4000051.
In code, this will look something like this:
$id = '4'.'00005'.'1';
NB: Not $id = 4+00005+1;.
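A small sketch of how those three components could be assembled in PHP; the helper name and the padding width are just illustrative, and the counter value is whatever Couchbase's atomic increment operation hands back:

// Region code + zero-padded application node ID + cluster-local counter.
function buildId(int $regionCode, int $nodeId, int $counter): string {
    return $regionCode
         . str_pad((string) $nodeId, 5, '0', STR_PAD_LEFT)
         . $counter;
}

$id = buildId(4, 5, 1);   // "4000051": region 4 (Europe), node 00005, counter 1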
Pros
IDs look better than UUIDs;
They seem unique enough. Even if a node in another region generated the same incremented ID and has the same node ID as the one above, we always have the region code to set them apart;
They can still be stored as integers (probably Big Unsigned integers);
It's all part of the architecture, no added complexities.
Cons
No sorting (or is there)?
This is where I need your input (most)
I know that every solution has flaws, and possibly more than what we see on the surface. Can you spot any issues with this whole approach?
Thank you in advance for your help :-)
EDIT
As @DaveRandom suggested, we can add a 4th step:
Step 4
We can just generate a random number and append it to the ID to prevent predictability. Effectively, you end up with something like this:
4000051357 instead of just 4000051.
I think this looks pretty solid. Each region maintains consistency, and if you use XDCR there are no collisions. INCR is atomic within a cluster, so you will have no issues there. You don't actually need the machine code as part of it: if all the app servers within a region are connected to the same cluster, there is no real need to embed the 00001 part. If that is useful to you for other reasons (some sort of analytics) then by all means, but it isn't necessary.
So it can simply be '4' . '1' (using your example).
Can you give me an example of what kind of "sorting" you need?
First: one downside of adding entropy (and I am not sure why you would need it) is that you cannot iterate over the ID collection as easily.
For example: if your IDs run from 1-100, which you will know from a simple GET on the counter key, you could assign tasks by group: this task takes 1-10, the next 11-20, and so on, and workers can execute in parallel. If you add entropy, you will need to use a Map/Reduce view to pull the collections down, so you lose the benefit of the key-value pattern.
Second: since you are concerned with readability, it can be valuable to add a document/object type identifier as well, and this can be used in Map/Reduce views (or you can use a JSON key to identify that).
Ex: 'u:' . '4' . '1'
If you are referring to ID's externally, you might want to obscure in other ways. If you need an example, let me know and I can append my answer with something you could do.
#scalabl3
You are concerned about IDs for two reasons:
Potential for collisions in a complex network infrastructure
Appearance
Starting with the second issue, Appearance. While a UUID certainly isn't a great beauty when it comes to an identifier, there are diminishing returns as you introduce a truly unique number across a complex data center (or data centers) as you mention. I'm not convinced that there is a dramatic change in perception of an application when a long number versus a UUID is used for example in a URL to a web application. Ideally, neither would be shown, and the ID would only ever be sent via Ajax requests, etc. While a nice clean memorable URL is preferable, it's never stopped me from shopping at Amazon (where they have absolutely hideous URLs). :)
Even with your proposal, the identifiers, while shorter than a UUID in character count, are no more memorable than a UUID. So the appearance argument would likely remain debatable.
Talking about the first point... yes, there are a few cases where UUIDs have been known to generate conflicts. While that shouldn't happen in a properly configured and consistently managed architecture, I can see how it might happen (but I'm personally a lot less concerned about it).
So, if you're talking about alternatives, I've become a fan of the simplicity of the MongoDB ObjectId and its techniques for avoiding duplication when generating an ID. The full documentation is here. The quick relevant pieces are similar to your potential design in several ways:
ObjectId is a 12-byte BSON type, constructed using:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
The timestamp can often be useful for sorting. The machine identifier is similar to your application server having a unique ID. The process id is just additional entropy, and finally, to prevent conflicts, there is a counter that is auto-incremented whenever the timestamp matches that of the last ObjectId generated (so that ObjectIds can be created rapidly). ObjectIds can be generated on the client or on the database. Further, ObjectIds take up fewer bytes than a UUID (but only 4 fewer). Of course, you could choose not to use the timestamp and drop another 4 bytes.
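To make the layout concrete, here is a rough PHP sketch of building a 12-byte ID with the same structure. It is purely illustrative, not MongoDB driver code, and the 3-byte machine identifier is simply derived from a hash of the hostname:

// 4-byte timestamp + 3-byte machine id + 2-byte process id + 3-byte counter = 12 bytes.
function objectIdLike(int &$counter): string {
    $time    = pack('N', time());                              // seconds since the Unix epoch
    $machine = substr(md5(gethostname(), true), 0, 3);         // 3 bytes hashed from the hostname
    $pid     = pack('n', getmypid() & 0xFFFF);                 // low 16 bits of the process id
    $count   = substr(pack('N', $counter++ & 0xFFFFFF), 1);    // low 3 bytes of the counter
    return bin2hex($time . $machine . $pid . $count);          // 24 hex characters
}

$counter = random_int(0, 0xFFFFFF);   // the counter starts at a random value
echo objectIdLike($counter);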
For clarification, I'm not suggesting you use MongoDB, but be inspired by the technique they use for ID generation.
So, I think your solution is decent (and maybe you want to be inspired by MongoDB's implementation of a unique ID) and doable. As to whether you need to do it, I think that's a question only you can answer.
