Previously I was using the class found here to convert a userID into a random-looking string.
From his blog:
Running:
alphaID(9007199254740989);
will return 'PpQXn7COf' and:
alphaID('PpQXn7COf', true);
will return '9007199254740989'
So the idea was that users could go to www.mysite.com/user/PpQXn7COf and I would convert that back to a normal integer so that in MySQL I could do:
"Select * from Users where userID=".alphaID('PpQXn7COf', true)
Now I've just started working with Cassandra and I'm looking for a replacement.
I want URLs like www.mysite.com/user/PpQXn7COf, not like www.mysite.com/user/username1.
The "PpQXn7COf" identifier must be as short as possible.
In the Twissandra example explained here: http://www.rackspace.com/cloud/blog/2010/05/12/cassandra-by-example/
They create a long UUID (I guess it is so long because then it's almost 100 percent certain to be unique).
In MySQL I just had a userID column with auto-increment, so when I used the alphaID() function I always got a very short random-looking string.
Does anyone have an idea how to solve this as cleanly as possible?
Edit:
It is used for a social media site, so it must be persistent.
That's also why I don't want to use usernames/real names in URLs: users can't remain undetected on Google if they need to.
I just got a simple idea, however I don't know how scalable it is:
<?php
// createUUID() makes a roughly 14-char string of A-Z a-z 0-9 based on micro/milli/nanoseconds
do {
    $uuid = createUUID();
} while (get_count($uuid) > 0); // retry until the uuid is not already taken

// insert username, pass, uuid etc. into Cassandra; $result holds the outcome
if ($result == "1") {
    header('Location: http://www.mysite.com/usercenter');
} else {
    echo "error";
}
?>
When this gets to the size of, let's say, Twitter or Facebook:
Will it execute in acceptable time?
Will it still generate unique IDs fast enough, so that if 10,000 users/second are registering it doesn't clog up?
Auto-increments are not suitable for a robust distributed system: you can only safely hand out the next ID if every node in your system is available to agree that it is unique.
You can, of course, invent your own unique-ID generator, but you must then ensure that it will generate unique IDs anywhere in your infrastructure.
For example, each node can just have a file which it (with suitable locking etc) just increments, but you will also need to ensure that they don't clash - for instance, by having the server ID included in the generation algorithm.
This may be operationally nontrivial - your ops engineers will need to ensure that all the servers in the infrastructure are configured correctly with their own ID generators set up so that they don't generate the same ID. However, it's possible.
UUIDs are the reasonable alternative, because they will definitely be unique.
A UUID is 128 bits; if we store 6 bits per character (i.e. base64) then that takes 22 characters, which is quite a long URI. If you want it shorter, you will need to generate unique IDs a different way.
Plus it all depends on "how unique" you actually need your IDs to be. If your IDs can safely be reused after a few months, you can probably do it in < 60 bits (depending also on the number of servers in your infrastructure, and how frequently you need to generate them).
We use
Server ID
Time (granularity = 2 seconds), but wraps after a few months
A per-server counter (which wraps frequently, but not within 2 seconds)
Then we stick all the bits together. This generates an ID that is under 64 bits long but is guaranteed to be unique for the length of time it needs to be (which in our case is only a couple of months).
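As a rough illustration only (the field widths here are invented for the sketch, not the ones we actually use), packing the three parts can be as simple as shifting and OR-ing them into one integer:

// Illustrative packing: 10-bit server id, 22-bit time slot with 2-second
// granularity (wraps after roughly 97 days), 20-bit per-server counter.
function makeId(int $serverId, int $counter): int
{
    $timeSlot = intdiv(time(), 2) & ((1 << 22) - 1);   // wraps after a few months
    return (($serverId & 0x3FF) << 42)
         | ($timeSlot << 20)
         | ($counter & ((1 << 20) - 1));
}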
Our algorithm will malfunction and generate a duplicate ID if:
The system clock on one of our nodes goes backwards by the same amount of time in which the counter wraps.
Our operations engineers make a mistake and assign the same server ID to two servers.
Eventually, after about 9 months.
Related
Further to my question here, I'll be using the random_compat polyfill (which uses /dev/urandom) to generate random numbers in the 1 to 10,000,000 range.
I do realise that, with everything correct in how I code my project, the above tools should produce good (as in random/secure) data. However, I'd like to add extra sources of randomness into the mix - just in case 6 months down the line I read there is a patch available for my specific OS version to fix a major bug in /dev/urandom (or any other issue).
So, I was thinking I can get numbers from random.org and fourmilab.ch/hotbits
An alternative source would be some logs from a web site I operate - timed to the microsecond, if I ignore the date/time part and just take the microseconds - this has in effect been generated by when humans decide to click on a link. I know this may be classed as haphazard rather than random, but would it be good for my use?
Edit re timestamp logs - I will use PHP microtime(), which will create a log like:
0.**832742**00 1438282477
0.**57241**000 1438282483
0.**437752**00 1438282538
0.**622097**00 1438282572
I will just use the bolded portion.
So let's say I take two sources of extra random numbers, A and B, and the output of /dev/urandom, call that U and set ranges as follows:
A and B are 1 - 500,000
U is 1 - 9,000,000
Final random number is A+B+U
I will be needing several million final numbers between 1 and 10,000,000
The pool of A and B numbers will only contain a few thousand values, but I think that by using prime-sized pools I can stretch that into millions of A & B combinations, like so:
// this pool will be integers from two sources and contain a larger prime number
// of members instead of the 7 & 11 here - the combined sequence repeats at 7 * 11 = 77
$numbers = array("One","Two","Three","Four","Five","Six","Seven");
$colors  = array("Silver","Gray","Black","Red","Maroon","Yellow","Olive","Lime","Green","Aqua","Orange");

$ni = 0;
$ci = 0;
for ($i = 0; $i < $num_numbers_required; $i++)
{
    // combine one member from each pool (string placeholders here; the real
    // pools hold integers, which would be summed rather than concatenated)
    $offset = $numbers[$ni] . $colors[$ci];

    if ($ni == 6)   // reset at prime num 7
        $ni = 0;
    else
        $ni++;

    if ($ci == 10)  // reset at prime num 11
        $ci = 0;
    else
        $ci++;
}
Does this plan make sense - is there any possibility I can actually make my end result less secure by doing all this? And what of my idea to use timestamp data?
Thanks in advance.
I would suggest reading RFC4086, section 5. Basically it talks about how to "mix" different entropy sources without compromising security or introducing bias.
In short, you need a "mixing function". You can do this with xor, where you simply set the result to the xor of the inputs: result = A xor B.
The problem with xor is that if the numbers are correlated in any way, it can introduce strong bias into the result. For example, if bits 1-4 of A and B are the current timestamp, then the result's first 4 bits will always be 0.
Instead, you can use a stronger mixing function based on a cryptographic hash function. So instead of A xor B you can do HMAC-SHA256(A, B). This is slower, but also prevents any correlation from biasing the result.
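For instance, a minimal PHP sketch of that kind of mixing (the variable names are illustrative; $a and $b stand for the extra sources, and the 'J' unpack format assumes 64-bit PHP):

$u = random_bytes(32);                           // CSPRNG output: the "U" source
$extra = $a . '|' . $b;                          // the extra sources A and B
$digest = hash_hmac('sha256', $extra, $u, true); // strong mixing function

// Reduce the digest to the 1..10,000,000 range. Plain modulo keeps a tiny
// bias; random_int(1, 10000000) avoids that if /dev/urandom alone is enough.
$number = ((unpack('J', substr($digest, 0, 8))[1] & PHP_INT_MAX) % 10000000) + 1;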
This is the strategy that I used in RandomLib. I did this because not every system has every method of generation. So I pull as many methods as I can, and mix them strongly. That way the result is never weaker than the strongest method.
HOWEVER, I would ask why. If /dev/urandom is available, you're not going to get better than it. The reason is simple: even if you call random.org for more entropy, your call is encrypted using random keys generated from /dev/urandom. Meaning that if an attacker can compromise /dev/urandom, your server is toast and you will be spinning your wheels trying to make it better.
Instead, simply use /dev/urandom and keep your OS updated...
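If you do go that route, the stated range needs only one call; random_int() comes with PHP 7 or with the random_compat polyfill already mentioned:

$number = random_int(1, 10000000);   // uniform in 1..10,000,000, backed by the OS CSPRNG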
Why do sites like YouTube, Imgur and most others use random characters as their content ids rather than just sequential numbers, like those created by auto-increment in MySQL?
To explain what I mean:
In the URL: https://www.youtube.com/watch?v=QMlXuT7gd1I
The QMlXuT7gd1I at the end indicates the specific video on that page, but I'm assuming that video also has a unique numeric id in the database. Why do they create and use this alphanumeric string rather than just use the video's database id?
I'm creating a site which identifies content in the URL like above, but I'm currently using just the DB id. I'm considering switching to random strings because all major sites do it, but I'd like to know why this is done before I implement it.
Thanks!
Some sites do that because of sharding.
When you have only one process (one server) writing, it is possible to generate an auto-increment id without duplicates, but when you have multiple servers (with multiple processes) writing content, like YouTube, it's not possible to use an auto-increment id anymore: the cost of synchronization to avoid duplication would be huge.
For example, if you read MongoDB's ObjectId documentation you can see this structure for the id:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
In the end, it's only 12 bytes. It looks like 24 characters when represented in hexadecimal, but that is only the displayed form.
Another advantage of this scheme is that the timestamp is included in the id, so you can decode the id to recover the timestamp.
First, this is not a random string; it is a base conversion that depends on the id. They go this way because an alphanumeric alphabet gives a bigger base.
Something like 99999999 could be 1NJCHR
Take a look here, and play with the bases, and learn more about it.
You will see it is much shorter. That is the only reason I can imagine someone would go this way, and it makes sense if you have ids like 54389634589347534985348957863457438959734.
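To see it concretely, PHP's built-in base_convert() shows the shrinkage (base 36 here; sites like YouTube effectively use an even larger alphabet):

echo base_convert('99999999', 10, 36);   // "1njchr" - 6 characters instead of 8
echo base_convert('1njchr', 36, 10);     // "99999999" - and it converts straight back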
As self and Cameron commented/answered, there are chances (especially for YouTube) that additional security parameters like time and length are calculated into it in some way, so you are not able to guess an identifier.
In addition to Christian's answer above, using a base calculation, hashed value or other non-numeric identifier has the advantage of obscuring your db's size from competitors.
Even if you stayed with numeric and set your auto_increment to start at 50,000, increase by 50, etc., educated guesses can still be made at the db's size and growth. Non-numeric options don't eliminate this possibility, but they inhibit it to a certain extent.
There are also major chances of malicious input from end users, and by not exposing raw ids, users can't guess other ids and thus can't guess how large the db is. The other answers explain the base-calculation part well.
I've got a requirement to encrypt Personally Identifiable Information (PII) data in an application DB. The application uses smart searches that rely on sounds-like, name-root and partial-word matching to find a name and address quickly.
If we encrypt those fields (the PII data encrypted at the application tier), the searches will be impacted by the volume of records, because we can't rely on SQL in the normal way: the search engine (in the application) would have to read all values, decrypt them and do the searching itself.
Is there any easy way of solving this so we can always encrypt the PII data and also give our user base the fast search functionality?
We are using a PHP Web/App Tier (Zend Server and a SQL Server DB). The application does not currently use technology like Lucene etc.
Thanks
Cheers
Encrypting the data also makes it look a great deal like a randomized bit string. This precludes any operation that shortcuts searching via an index.
For some encrypted data, e.g. a Social Security number, you can store a hash of the number in a separate column, then index this hash field and search for the hash. This has limited utility obviously, and is of no value in searches such as name LIKE 'ROB%'.
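A minimal sketch of that idea, assuming a hypothetical people table with ssn_hash and ssn_encrypted columns and an application-held key:

// Store a keyed hash next to the encrypted value, index the hash column,
// and search by recomputing the same hash for the value being looked up.
$ssnHash = hash_hmac('sha256', $ssn, $indexKey);

$stmt = $db->prepare('SELECT person_id, ssn_encrypted FROM people WHERE ssn_hash = ?');
$stmt->execute([$ssnHash]);
$matches = $stmt->fetchAll(PDO::FETCH_ASSOC);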
"If your database is secured properly" may sound nice, but that is very difficult to guarantee if the bad guys can break in and steal your servers or backups. And if encryption truly is a requirement (not just a negotiable, marketing-driven item), you are forced to comply.
You may be able to negotiate storing partial data unencrypted, e.g., the first 3 characters of the last name or suchlike, so that you can still have useful (if not perfect) indexing.
ADDED
I should have added that you might be allowed to hash part of a name field and search on that hash (assuming you are not allowed to store the partial name unencrypted). You lose some usefulness again, but it may still be better than no index at all.
For this hashing to be useful, it cannot be seeded -- i.e., all records must hash based on the same seed (or no seed), or you will be stuck performing a table scan.
You could also create a covering index, still encrypted of course; a table scan over it could be considerably quicker due to the reduced I/O and memory required.
I'll try to write about this simply because often the crypto community can be tough to understand (I resisted the urge to insert a pun here).
A specific solution I have used which works nicely for names is to create index tables for things you wish to index and search quickly like last names, and then encrypt these index column(s) only.
For example, you could create a table where the key column contains one entry for every possible combination of characters A-Z in a 3-letter string (and include spaces for all but the first character). Like this:
A__
AA_
AAA
AAB
AAC
AAD
..
..
..
ZZY
ZZZ
Then, when you add a person to your database, you add their ID to a second column in the index table, which is just a list of person IDs.
Example: In your patients table, you would have an entry for smith like this:
231 Smith John A 1/1/2016 .... etc
and this entry would be encrypted, perhaps all columns but the ID 231. You would then add this person to the index table:
SMH [342, 2342, 562, 12]
SMI [123, 175, 11, 231]
Now you encrypt this second column (the list of IDs). So when you search for a last name, you can type in 'smi' and quickly retrieve all of the last names that start with this letter combination. If you don't have the key, you will just see ciphertext. You can actually create two columns in such a table, one for first name and one for last name.
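A rough PHP sketch of maintaining such an index, assuming a hypothetical name_index table with (prefix, id_list, iv) columns and an application-held key; this shows the shape of the idea, not a hardened implementation:

function addToNameIndex(PDO $db, string $key, int $personId, string $lastName): void
{
    $prefix = strtoupper(substr($lastName, 0, 3));             // e.g. "SMI"

    // fetch and decrypt the existing ID list for this prefix, if any
    $stmt = $db->prepare('SELECT id_list, iv FROM name_index WHERE prefix = ?');
    $stmt->execute([$prefix]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    $ids = [];
    if ($row) {
        $plain = openssl_decrypt($row['id_list'], 'aes-256-cbc', $key, 0, base64_decode($row['iv']));
        $ids   = ($plain !== false) ? json_decode($plain, true) : [];
    }
    $ids[] = $personId;

    // re-encrypt the updated list and write it back
    $iv     = random_bytes(16);
    $cipher = openssl_encrypt(json_encode($ids), 'aes-256-cbc', $key, 0, $iv);

    if ($row) {
        $db->prepare('UPDATE name_index SET id_list = ?, iv = ? WHERE prefix = ?')
           ->execute([$cipher, base64_encode($iv), $prefix]);
    } else {
        $db->prepare('INSERT INTO name_index (prefix, id_list, iv) VALUES (?, ?, ?)')
           ->execute([$prefix, $cipher, base64_encode($iv)]);
    }
}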
This method is just as quick as a plaintext index and uses some of the same underlying principles. You can do the same thing with a soundex ('sounds like') by constructing a table with all possible soundex patterns as your left column, and person (patient?) Id's as another column. By creating multiple such indices you can develop a nice way to hone in on the name you are looking for.
You can also extend to more characters if you like, but obviously this lengthens your table by more than an order of magnitude for each letter. It does have the advantage of making your index more specific (not always what you want). Truthfully, any type of histogram where you can bin the persons using their names will work; I have seen this done with date of birth as well - anything you need to search on.
A table like this suffers from some vulnerabilities, particularly that, because the list of entries for certain buckets may be very short, it would be possible for an attacker to determine which names have no entries in the system. However, using a sort of random 'salt' in your index list can help with this. Other problems include the need to constantly update all of your indices every time values get updated.
But even so, this method creates a nicely encrypted system that goes beyond data-at-rest. Data-at-rest only protects you from attackers who cannot gain authorization to your systems, but this system provides a layer of protection against DBA's and other personnel who may need to work in the database but do not need (or want) to see the personal data contained within. They will just see ciphertext. So, an additional key is needed by the users or systems that actually need/want to access this information. Ashley Madison would have been wise to employ such a tactic.
Hope this helps.
Sometimes, "encrypt the data" really means "encrypt the data at rest". Which is to say that you can use Transparent Data Encryption to protect your database files, backups, and the like but the data is plainly viewable through querying. Find out if this would be sufficient to meet whatever regulations you're trying to satisfy and that will make your job a whole lot easier.
I've been browsing the net trying to find a solution that will allow us to generate unique IDs in a regionally distributed environment.
I looked at the following options (among others):
SNOWFLAKE (by Twitter)
It seems like a great solution, but I just don't like the added complexity of having to manage another piece of software just to create IDs;
It lacks documentation at this stage, so I don't think it will be a good investment;
The nodes need to be able to communicate to one another using Zookeeper (what about latency / communication failure?)
UUID
Just look at it: 550e8400-e29b-41d4-a716-446655440000;
It's a 128-bit ID;
There have been some known collisions (depending on the version, I guess); see this post.
AUTOINCREMENT IN RELATIONAL DATABASE LIKE MYSQL
This seems safe, but unfortunately, we are not using relational databases (scalability preferences);
We could deploy a MySQL server for this like what Flickr does, but again, this introduces another point of failure / bottleneck. Also added complexity.
AUTOINCREMENT IN A NON-RELATIONAL DATABASE LIKE COUCHBASE
This could work since we are using Couchbase as our database server, but;
This will not work when we have more than one cluster in different regions (latency issues, network failures): at some point, IDs will collide depending on the amount of traffic;
MY PROPOSED SOLUTION (this is what I need help with)
Lets say that we have clusters consisting of 10 Couchbase Nodes and 10 Application nodes in 5 different regions (Africa, Europe, Asia, America and Oceania). This is to ensure that content is served from a location closest to the user (to boost speed) and to ensure redundancy in case of disasters etc.
Now, the task is to generate IDs that wont collide when the replication (and balancing) occurs and I think this can be achieved in 3 steps:
Step 1
All regions will be assigned integer IDs (unique identifiers):
1 - Africa;
2 - America;
3 - Asia;
4 - Europe;
5 - Oceania.
Step 2
Assign an ID to every Application node that is added to the cluster, keeping in mind that there may be up to 99,999 servers in one cluster (which I doubt; it's just a safety precaution). This will look something like this (fake IPs):
00001 - 192.187.22.14
00002 - 164.254.58.22
00003 - 142.77.22.45
and so forth.
Please note that all of these are in the same cluster, which means you can have a node 00001 in each region.
Step 3
For every record inserted into the database, an incremented ID will be used to identify it, and this is how it will work:
Couchbase offers an increment feature that we can use to create IDs internally within the cluster. To ensure redundancy, 3 replicas will be created within the cluster. Since these are in the same place, I think it is safe to assume that unless the whole cluster is down, one of the nodes responsible for this will be available; otherwise the number of replicas can be increased.
Bringing it all together
Say a user is signing up from Europe:
The application node serving the request will grab the region code (4 in this case), get its own ID (say 00005) and then get an incremented ID (1) from Couchbase (from the same cluster).
We end up with 3 components: 4, 00005, 1. Now, to create an ID from this, we can just join these components into 4.00005.1. To make it even better (I'm not too sure about this), we can concatenate (not add up) the components to end up with 4000051.
In code, this will look something like this:
$id = '4'.'00005'.'1';
NB: Not $id = 4+00005+1;.
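A small sketch of that composition as a reusable function (the names are illustrative):

// region code + zero-padded node id + per-cluster counter, concatenated
function composeId(int $regionCode, int $nodeId, int $counter): string
{
    return $regionCode . str_pad((string) $nodeId, 5, '0', STR_PAD_LEFT) . $counter;
}

echo composeId(4, 5, 1);   // "4000051"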
Pros
IDs look better than UUIDs;
They seem unique enough. Even if a node in another region generated the same incremented ID and has the same node ID as the one above, we always have the region code to set them apart;
They can still be stored as integers (probably Big Unsigned integers);
It's all part of the architecture, no added complexities.
Cons
No sorting (or is there)?
This is where I need your input (most)
I know that every solution has flaws, and possibly more than what we see on the surface. Can you spot any issues with this whole approach?
Thank you in advance for your help :-)
EDIT
As @DaveRandom suggested, we can add a 4th step:
Step 4
We can just generate a random number and append it to the ID to prevent predictability. Effectively, you end up with something like this:
4000051357 instead of just 4000051.
I think this looks pretty solid. Each region maintains consistency, and if you use XDCR there are no collisions. INCR is atomic within a cluster, so you will have no issues there. You don't actually need the machine-code part of it. If all the app servers within a region are connected to the same cluster, it's unnecessary to include the 00001 part of it. If that is useful for you for other reasons (some sort of analytics), then by all means keep it, but it isn't necessary.
So it can simply be '4' . '1' (using your example).
Can you give me an example of what kind of "sorting" you need?
First: one downside of adding entropy (and I am not sure why you would need it) is that you cannot iterate over the ID collection as easily.
For example: if your IDs run from 1-100, which you will know from a simple GET on the counter key, you could assign tasks by group (this task takes 1-10, the next 11-20, and so on) and workers can execute in parallel. If you add entropy, you will need to use a Map/Reduce view to pull the collections down, so you lose the benefit of a key-value pattern.
Second: Since you are concerned with readability, it can be valuable to add a document/object type identifier as well, and this can be used in Map/Reduce Views (or you can use a json key to identify that).
Ex: 'u:' . '4' . '1'
If you are referring to ID's externally, you might want to obscure in other ways. If you need an example, let me know and I can append my answer with something you could do.
@scalabl3
You are concerned about IDs for two reasons:
Potential for collisions in a complex network infrastructure
Appearance
Starting with the second issue, Appearance. While a UUID certainly isn't a great beauty when it comes to an identifier, there are diminishing returns as you introduce a truly unique number across a complex data center (or data centers) as you mention. I'm not convinced that there is a dramatic change in perception of an application when a long number versus a UUID is used for example in a URL to a web application. Ideally, neither would be shown, and the ID would only ever be sent via Ajax requests, etc. While a nice clean memorable URL is preferable, it's never stopped me from shopping at Amazon (where they have absolutely hideous URLs). :)
Even with your proposal, the identifiers, while shorter in character count than a UUID, are no more memorable than a UUID. So, the appearance would likely remain debatable.
Talking about the first point: yes, there are a few cases where UUIDs have been known to generate conflicts. While that shouldn't happen in a properly configured and consistently maintained architecture, I can see how it might happen (but I'm personally a lot less concerned about it).
So, if you're talking about alternatives, I've become a fan of the simplicity of the MongoDB ObjectId and its techniques for avoiding duplication when generating an ID. The full documentation is here. The quick relevant pieces are similar to your potential design in several ways:
ObjectId is a 12-byte BSON type, constructed using:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
The timestamp can often be useful for sorting. The machine identifier is similar to your application server having a unique ID. The process id is just additional entropy, and finally, to prevent conflicts, there is a counter that is auto-incremented whenever the timestamp is the same as the last time an ObjectId was generated (so that ObjectIds can be created rapidly). ObjectIds can be generated on the client or on the database. Further, ObjectIds do take up fewer bytes than a UUID (but only 4 fewer). Of course, you could skip the timestamp and drop another 4 bytes.
For clarification, I'm not suggesting you use MongoDB, but be inspired by the technique they use for ID generation.
So, I think your solution is decent (and maybe you want to be inspired by MongoDB's implementation of a unique ID) and doable. As to whether you need to do it, I think that's a question only you can answer.
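To make the inspiration concrete, a loose ObjectId-style generator can be sketched in PHP like this (this is not MongoDB's driver code, just the same 12-byte layout):

// 4-byte timestamp + 3-byte machine id + 2-byte process id + 3-byte counter
function objectIdLike(int $machineId, int &$counter): string
{
    $bytes  = pack('N', time());                             // seconds since the epoch
    $bytes .= substr(pack('N', $machineId), 1);              // low 3 bytes of machine id
    $bytes .= pack('n', getmypid() & 0xFFFF);                // 2-byte process id
    $bytes .= substr(pack('N', $counter++ & 0xFFFFFF), 1);   // low 3 bytes of counter
    return bin2hex($bytes);                                  // 12 bytes, 24 hex characters
}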
I recently bought myself a domain for personal URL shortening.
And I created a function to generate alphanumeric strings of 4 characters as reference.
BUT
But how do I check whether they are already used? Surely I can't query the database for every generated URL to see if it exists, or is that just the way it works and I have to do it?
If so, what if I already have 13,000,000 URLs generated (out of 14,776,336 possible)? Do I need to keep generating strings until I find one that is not in the DB yet?
This just doesn't seem like the right way to do it. Can anyone give me some advice?
One memory-efficient and faster way I can think of is the following. The problem can be solved without using the database at all: instead of storing used URLs in the database, you keep track of them in memory. And since storing them directly could use a lot of memory, we will use a bit set (an array of bits), needing only one bit per URL.
For each random string you generate, compute a hash code for it that lies between 0 and some maximum number K.
Create a bit set (basically a bit array). Whenever you use a URL, set the bit corresponding to its hash code to 1.
Whenever you generate a new URL, check whether its hash-code bit is set. If yes, discard that URL and generate a new one. Repeat the process until you get an unused one.
This way you avoid the DB entirely, your lookups are extremely fast, and it takes a minimal amount of memory (a sketch follows below).
I borrowed the idea from this place
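A sketch of the bit-set approach in PHP. Note that with only 62^4 possible 4-character codes you can map each code directly to its own bit (a perfect hash), so there are no false positives; generateRandomCode() stands in for the poster's existing generator:

const ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// map a 4-char code to a unique integer in 0 .. 62^4 - 1
function codeToIndex(string $code): int
{
    $index = 0;
    foreach (str_split($code) as $ch) {
        $index = $index * 62 + strpos(ALPHABET, $ch);
    }
    return $index;
}

function isUsed(string $bits, int $i): bool
{
    return (bool) ((ord($bits[intdiv($i, 8)]) >> ($i % 8)) & 1);
}

function markUsed(string &$bits, int $i): void
{
    $bits[intdiv($i, 8)] = chr(ord($bits[intdiv($i, 8)]) | (1 << ($i % 8)));
}

$bits = str_repeat("\0", intdiv(62 ** 4, 8) + 1);   // ~1.8 MB covers every 4-char code

do {
    $code = generateRandomCode();                   // assumed existing 4-char generator
} while (isUsed($bits, codeToIndex($code)));
markUsed($bits, codeToIndex($code));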
A compromise solution is to generate a random id, and if it is already in the database, find the first empty id that is bigger than it. (Wrapping around if you can't find any empty space in the range above.)
If you don't want the ids to be unguessable (you probably don't if you only use 4 characters), this approach works fine and is quick.
One algorithm is to try a few times to find a free URL of N characters; if one is still not found, increase N. Start with N = 4.
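A minimal sketch of that strategy, where generateRandomCode() and codeExists() are assumed stand-ins for the poster's generator and a DB lookup:

function allocateCode(int $startLength = 4): string
{
    for ($length = $startLength; ; $length++) {
        // try a handful of codes at the current length before growing it
        for ($attempt = 0; $attempt < 5; $attempt++) {
            $code = generateRandomCode($length);
            if (!codeExists($code)) {
                return $code;
            }
        }
    }
}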