Evening all, I've recently been reading the following blog post about sharding at Pinterest, and I think there's some great stuff in there: https://engineering.pinterest.com/blog/sharding-pinterest-how-we-scaled-our-mysql-fleet
What I'm unsure about, though, is how best to decide where a brand new user should be inserted.
So for those that don't know or haven't bothered to read the above article: Pinterest have a number of shards, each hosting a number of databases. They generate 64-bit IDs for objects by bit-shifting together a shard ID, the type of object (user, pin, etc.) that determines the table, and the object's local auto-increment ID. They try to put pins etc. on the same database as the 'board' they belong to. But for a brand new object, what would be the best way of determining the shard it lives on?
For users that sign in via Facebook they use a modulus, e.g.
shard = md5("1.2.3.4") % 4096 // 4096 is the number of shards
But if I had a simple email/password registration form, do you think a similar approach on the email address would work for picking an initial shard? I assume it would have to be the email in this case, since otherwise there would be no way of knowing which database to validate the login credentials against. Also, I know that post is from 2015, so not too old, and computing power moves quickly, but would there be a better option than MD5 here? I know the chance of a collision is minor, especially as we're just hashing an email address, but would it be worth using a different algorithm? Basically I'm interested in the best way to determine a shard here, and how to get back to it later (hence why I think it has to be the email address).
Hope this all makes sense!
(P.S. I didn't tag this with the Pinterest tag as it looks like that's just for API dev, but if someone thinks it might get better 'eyes' on the question then feel free to add it.)
When using MD5 to determine the shard, there is no risk from collisions: if two values collide, they just end up in the same shard. The MD5 is not the key within that shard, which is where the collision risk would otherwise matter.
The main issue with this sharding method is that the number of shards is fixed, so performance might eventually become an issue (re-distributing a running environment is not easy, so with this design you are still dependent on faster machines if there is more growth than expected).
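To make that concrete, here's a minimal PHP sketch of deriving a shard from an email address, assuming a fixed count of 4096 shards (the function name and the normalisation step are my own illustration, not from the Pinterest post):

// Map an email address to a shard number in [0, 4095].
function shardForEmail(string $email): int
{
    $numShards = 4096; // fixed at design time, as noted above
    $hash = md5(strtolower(trim($email))); // normalise before hashing
    // hexdec() of the first 8 hex chars keeps the value within integer
    // range; any stable slice of the digest would work equally well.
    return hexdec(substr($hash, 0, 8)) % $numShards;
}

Because the mapping is deterministic, login can simply recompute shardForEmail() on the submitted address to find the right database to validate against.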
I'm trying to implement data anonymization in MySQL and PHP.
At the moment I'm separating the data by encrypting the foreign key/ID using the user's password and saving it in the 'user' account table. But I quickly realised that when a user is first created and I insert their first data into the other tables, I can match the two together by row count.
What I thought of doing is randomly swapping the user account details each time a new account is created - but this feels very inefficient.
I can't find anything related online, not even a basic explanation of how one would properly achieve user data separation so that it is completely anonymised. Can anyone explain what goes into achieving data anonymisation in an RDBMS architecture?
Thanks a lot in advance!
EDIT:
To be clearer, let's imagine I have two tables: one holding the user email and an encrypted unique foreign key (account-table), the other holding user preferences/info (this table will always hold one row per user).
Now let's say I add a new user to account-table, and data to the user-preferences/info table. In reality, I can still tell from counting the table rows which user owns this info.
I can't encrypt all of this data, because some of it might be shown publicly (anonymously). And even so, making the rows unrelated to each other would still make it harder for anyone who got hold of this encrypted data to match it to any user.
I'm looking for complete anonymity and privacy, not just through encryption but through separation of user data. I want data to be completely private to the user - possibly without duplicating any of it in multiple places.
Would the random swap be the best scenario in this case? (Copy a randomly picked user, and swap/overwrite the new data into their original row.)
You need to look at differential privacy. The idea here is to preserve the original data in one record, but add carefully randomised data that looks very similar to it.
For example, imagine you were storing users' year of birth. If you add a single user record and an unrelated, separate single birth-year record, it's very likely (as you say) that you will be able to reverse the relationship and reassociate the two. However, you could add multiple records with randomised values clustered around the real value (but not exactly centred on it, as that's statistically reversible too). So user1, born in 1970, gets records for 1968, 1969, 1970 and 1971, while user2, born in 1980, could have values of 1979, 1980, 1981 and 1982. You then can't tell which record is exactly correct, but on average the values are reasonably accurate. Note that this even works for a single record.
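As a toy illustration only (my own sketch, not a calibrated differentially private mechanism - real schemes add carefully tuned noise, e.g. from a Laplace distribution), generating decoy birth-year records in PHP might look like:

// Return the real year mixed with $count decoys drawn near it.
// $spread controls how far decoys may wander from the truth.
function decoyYears(int $realYear, int $count = 3, int $spread = 2): array
{
    $years = [$realYear];
    for ($i = 0; $i < $count; $i++) {
        $years[] = $realYear + random_int(-$spread, $spread);
    }
    shuffle($years); // don't let insertion order betray the real value
    return $years;
}

// decoyYears(1970) might return e.g. [1969, 1971, 1970, 1970];
// store all of them, and only the average remains meaningful.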
But there is a further concern here – exactly how anonymous do you want records to be? The degree of anonymity you need may depend on the nature of the data you're processing. This simple example only looks at a single field - one that may indeed not allow reidentification when used alone, but might provide sufficient information when combined with other fields, even if they use a similar approach.
As you may gather, this is a difficult and subtle thing to design effectively – the algorithm for figuring out how much noise you need to add is something that has won mathematics medals!
Another approach is to keep the real data without being able to read it, using homomorphic encryption: you can still do things like searching, but without actually being able to see the underlying data.
Since you're in PHP, you might find CipherSweet provides a useful toolkit.
Showing user data on a page with this query:
$query = "SELECT * FROM COLLECTIONS WHERE uid = {$_GET['user_id']}";
But the problem is that a user can see other users' data by changing that uid.
How do I solve this problem?
Take your website offline. NOW. Somebody is going to either wipe the data, steal the data, or inject malware that's served to all of your customers.
Breathe. You've bought yourself some time, assuming it hasn't already been breached.
A small subset of the security measures you NEED to take
These mitigate, in order of "has biggest immediate benefits" to "is probably most important", one problem each. (Apart from number 3, which mitigates anywhere from 4 to 32241 problems of equal or greater magnitude to number 1.)
Look through every instance of every database request, and make sure that you are never using double quotes or the . operator when building your query string. Rebuild all of your database handling code to use some sort of parametrised SQL query system (see the sketch after this list).
Use an authentication library, or at the very least a crypto library.
Ask about your setup on Security Stack Exchange using an account that is in no way traceable to your website. Not even to your company, if your website is associated with your company.
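As a minimal sketch of point 1 using PDO prepared statements (the table and column names come from the question; the connection details are placeholders):

// Placeholder DSN/credentials - adjust for your environment.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'dbuser', 'dbpass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

// The user-supplied value is bound as data, never spliced into the SQL,
// so "1 OR 1" arrives as a harmless literal instead of executable code.
$stmt = $pdo->prepare('SELECT * FROM COLLECTIONS WHERE uid = ?');
$stmt->execute([$_GET['user_id']]);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

Note that this only stops injection; it does not fix the authorisation hole. For that, take the uid from the logged-in session rather than the query string, or verify that the requester is allowed to view it.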
Why?
Yes, I know, that website is probably important and needs to stay up so people can use it. But try this:
www.badwebsite.com/your/page/here?uid=1 OR 1
All of the data is visible! You are accepting code from the user and running it in your database. Now what if I decided to delete all of your database tables?
That's just covering the first point I made. Please trust that there are bigger problems for your users if you haven't done step 2, the least of which is hundreds of their accounts on other websites (e.g. Gmail, Example Bank) becoming known to cyber criminals.
Take a look at this comic strip:
"There's also a unicode-handling bug in the URL request library, and we're storing the passwords unsalted ... so if we salt them with emoji, we can close three issues at once!"
This is played for laughs, but the problem described in this comic strip is probably less bad than the one you are facing. Please, for the sake of whoever has entrusted you with their data, turn it off for a few days whilst you try to make it something resembling secure.
You might want to bring in a technical consultant; if your developers are not experienced in creating intrusion-proof software then they're probably not up to the task of making insecure software secure (which is orders of magnitude harder, especially if you're new to that sort of thing).
I've recently started learning Redis and am currently building an app using it as the sole datastore, and I'd like to check with other Redis users whether some of my conclusions are correct, as well as ask a few questions. I'm using phpredis, if that's relevant, but I guess the questions should apply to any language as it's more of a pattern thing.
As an example, consider a CRUD interface to save websites (name and domain) with the following requirements:
Check for existing names/domains when saving/validating a new site (duplicate check)
Listing all websites with sorting and pagination
I have initially chosen the following "schema" to save this information:
A key "prefix:website_ids" in which I use INCR to generate new website id's
A set "prefix:wslist" in which I add the website id generated above
A hash for each website "prefix:ws:ID" with the fields name and website
The saving/validation issue
With the above information alone I was unable (as far as I know) to check for duplicate names or domains when adding a new website. To solve this issue I've done the following:
Two sets with keys "prefix:wsnames" and "prefix:wsdomains", to which I also SADD the website name and domain respectively.
This way, when adding a new website I can check if the submitted name or domain already exist in either of these sets with SISMEMBER and fail the validation if needed.
Now, if I'm saving data with 50 fields instead of just 2 and want to prevent duplicates, I'd have to create a similar set for each of the fields I want to validate.
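In phpredis terms, the check-and-save looks something like this (a sketch of the approach just described):

// Fail validation if the name or domain is already taken.
if ($redis->sIsMember('prefix:wsnames', $name) ||
    $redis->sIsMember('prefix:wsdomains', $domain)) {
    return false; // duplicate
}

// Otherwise reserve both values and create the hash.
$id = $redis->incr('prefix:website_ids');
$redis->sAdd('prefix:wslist', $id);
$redis->sAdd('prefix:wsnames', $name);
$redis->sAdd('prefix:wsdomains', $domain);
$redis->hMSet('prefix:ws:' . $id, ['name' => $name, 'domain' => $domain]);

(I'm aware this check-then-set isn't atomic; a MULTI block or Lua script would close that race.)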
QUESTION 1: Is the above a common pattern to solve this problem or is there any other/better way people use to solve this type of issue?
The listing/sorting issue
To list websites and sort by name or domain (ascending or descending) as well as limiting results for pagination I use something like:
SORT prefix:wslist BY prefix:ws:*->name ALPHA ASC LIMIT 0 10
This gives me 10 website IDs ordered by name. Now, to fetch the actual data, I came up with the following options (examples in PHP):
Option 1:
$wslist = $redis->sort('prefix:wslist', [
    'by'    => 'prefix:ws:*->name',
    'alpha' => true,
    'sort'  => 'asc',
    'limit' => [0, 10],
]); // the SORT command from above
$websites = array();
foreach ($wslist as $ws) {
    $websites[$ws] = $redis->hGetAll('prefix:ws:' . $ws);
}
The above gives me a usable array with website IDs as keys and an array of fields as values. Unfortunately it has the problem that I'm doing multiple requests to Redis inside a loop, and common sense (at least coming from RDBMSs) tells me that's not optimal.
The better way would seem to be to use Redis pipelining/MULTI and send all requests in a single go:
Option 2:
$wslist = $redis->sort('prefix:wslist', [
    'by'    => 'prefix:ws:*->name',
    'alpha' => true,
    'sort'  => 'asc',
    'limit' => [0, 10],
]); // the SORT command from above
$redis->multi();
foreach ($wslist as $ws) {
    $redis->hGetAll('prefix:ws:' . $ws);
}
$websites = $redis->exec();
The problem with this approach is that now I don't get each website's respective ID unless I then loop the $websites array again to associate each one. Another option is to maybe also save a field "id" with the respective website id inside the hash itself along with name and domain.
QUESTIONS 2/3: What's the best way to get these results in a usable array without having to loop multiple times? Is it correct or good practice to also save the id number as a field inside the hash just so I can also get it with the results?
Disclaimer: I understand that the coding and schema-building paradigms when using a key-value datastore like Redis are different from RDBMSs and document stores, so notions of "the best way to do X" are likely to differ depending on the data and application at hand.
I also understand that Redis might not even be the most suitable datastore to use in mostly CRUD type apps but I'd still like to get any insights from more experienced developers since CRUD interfaces are very common on most apps.
Answer 1
Your proposal looks pretty common. I'm not sure why you need an auto-incrementing ID though. I imagine the domain name has to be unique, or the website name has to be unique, or at the very least the combination of the two has to be unique. If this is the case it sounds like you already have a perfectly good key, so why invent an integer key when you don't need it?
Having a SET for domains and a SET for website names is a perfect solution for quickly checking whether a specific domain or website name already exists. Though if one of those (domain or website name) is your key, you might not even need these SETs, since you could just check whether the key prefix:ws:domain-or-ws-name-here exists.
Also, using a HASH for each website so you can store your 50 fields of details for the website inside is perfect. That is what hashes are for.
Answer 2
First, let me point out that if your website names and domains are stored in SORTED SETs instead of SETs, they will already be alphabetised (assuming they are all given the same score). If you are trying to support other sort options this might not help much, but I wanted to point it out.
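For example, a quick phpredis sketch (score 0 for every member yields lexicographic ordering):

// With equal scores, sorted sets order members lexically.
$redis->zAdd('prefix:wsnames', 0, 'alpha site');
$redis->zAdd('prefix:wsnames', 0, 'beta site');
$page     = $redis->zRange('prefix:wsnames', 0, 9);    // first 10, ascending
$pageDesc = $redis->zRevRange('prefix:wsnames', 0, 9); // descending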
Your Option 1 and Option 2 are actually both relatively reasonable. Redis is lightning fast, so Option 1 isn't as unreasonable as it seems at first. Option 2 is clearly more optimal from Redis's perspective, since all the commands will be buffered and executed at once. Though, as you noted, it will require additional processing in PHP afterwards if you want the array to be indexed by the ID.
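That post-processing can actually be a single call rather than a loop, since exec() returns the replies in the same order the commands were queued (a small sketch):

// $wslist holds the ids from SORT; exec() preserves command order.
$websites = array_combine($wslist, $redis->exec());
// $websites is now [id => ['name' => ..., 'domain' => ...], ...]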
There is a third option: Lua scripting. You can have Redis execute a Lua script that returns both the IDs and hash values all in one shot. But not being super familiar with PHP anymore, or with how Redis's multi-bulk replies map to PHP's arrays, I'm not 100% sure what the Lua script would look like. You'll need to look for examples or do some trial and error. It should be a pretty simple script, though.
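Untested, but a rough sketch of what that might look like with phpredis (assuming nested Lua tables map to nested PHP arrays):

$script = <<<'LUA'
local ids = redis.call('SORT', KEYS[1],
    'BY', 'prefix:ws:*->name', 'LIMIT', 0, 10, 'ALPHA')
local out = {}
for i, id in ipairs(ids) do
    -- pair each id with its hash contents in a single reply
    out[i] = {id, redis.call('HGETALL', 'prefix:ws:' .. id)}
end
return out
LUA;

$results = $redis->eval($script, ['prefix:wslist'], 1);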
Conclusion
I think redis sounds like a decent solution for your problem. Just keep in mind the dataset needs to always be small enough to keep in memory. If that's not really a concern (unless your fields are huge, you should be able to fit thousands of websites into only a few MB) or if you don't mind having to upgrade your RAM to grow your DB, then Redis is perfectly suitable.
Be familiar with the various persistence options and configurations for redis and what they mean for availability and reliability. Also, make sure you have a backup solution in place. I would recommend having both a secondary redis instance that slaves off of your main instance, and a recurring process that backs up your redis database file at least daily.
This is related to preventing webform resubmission; however, this time the context is a web-based RPG. After the player defeats a monster, it drops an item, so I want to prevent the user from hitting the back button, or refreshing, to 'dupe' the item drop.
As item drops are frequent, using the DB to store a unique 'drop-transaction-id' per drop seems infeasible to me. I am entertaining the idea below:
For each combat, create a unique value based on the current date-time and the user's ID, and store it in both the DB and the session. Given a user ID, the value can be fetched back.
If the value from the session exists in the DB, then the combat is valid, and the user is allowed to access all pages relevant to that combat. If it does not exist in the DB, then a new combat state is started.
When combat is over, the unique value is cleared from the DB.
Values in the DB which are more than 30 minutes old are purged.
Any opinions, improvements, or pitfalls of this method are welcome.
This question is very subjective; there are things you can and cannot do depending on the data/framework already in place around it.
The solution you've provided should work, but it depends on the unique combat/loot/user data you have available.
I take it this is what you think is best? It's what I think is best :)
Get the userID, along with a unique piece of data from that fight. Something like combat start time, combat end time, etc
Store it in a Database, or what ever storage system you have
Once you collect the loot, delete that record
That way, if that userID and that unique fight data still exist, they haven't yet collected their loot.
And you are right: tracking each individual piece of loot is too much; you're better off temporarily storing the data.
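A minimal PHP sketch of that flow, assuming a hypothetical combat_tokens table (all names here are illustrative):

// On combat start: issue a token tied to the user and fight.
$token = bin2hex(random_bytes(16));
$stmt = $pdo->prepare(
    'INSERT INTO combat_tokens (user_id, token, created_at) VALUES (?, ?, NOW())'
);
$stmt->execute([$userId, $token]);
$_SESSION['combat_token'] = $token;

// On loot collection: delete the row and check the affected count.
// A refresh or back-button replay finds no row, so no second drop.
$stmt = $pdo->prepare('DELETE FROM combat_tokens WHERE user_id = ? AND token = ?');
$stmt->execute([$userId, $_SESSION['combat_token']]);
if ($stmt->rowCount() === 1) {
    // first (and only) successful claim: grant the item here
}

// Housekeeping, per the 30-minute rule:
// DELETE FROM combat_tokens WHERE created_at < NOW() - INTERVAL 30 MINUTE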
Seems like a reasonable approach. I assume you're storing the fact that the player is in combat somewhere anyway. Otherwise, they can just close their browser if they want to avoid a fight?
The combat ending and the loot dropping should be treated as an atomic operation. If there is no fight, there can't be any loot drop.
That depends on your game design: Do you go more in the direction of roguelikes where only turns count, and therefore long pauses in between moves are definitely possible (like consulting other people via chatroom, note: in NetHack that is not considered cheating)? Can users only save their games on certain points or at any place? That makes a huge difference in the design, e.g. making way for exploits similar to the one Thorarin mentions.
If your game goes the traditional roguelike route of a single save, turn-based play and permadeath, then it would be possible to save the number of the current turn for any given character along with all game-related information (inventory, maps, enemies and their state), and then check against it on every player action, thereby preventing a turn from being played twice.
Alternatively, you could bundle everything up in client-side JavaScript, so that even if they did resubmit the form it would generate an entirely new combat/treasure encounter.
I am using PHP, AS3 and MySQL.
I have a website - a Flash (AS3) website. The Flash site stores the members' information in a MySQL database through PHP. In the "members" table I have "id" as the primary key and "username" as a unique field.
Now my situation is this: Flash wants to display a member's profile. My questions:
Should Flash pass the member's "id" or "username" to PHP to run the MySQL query?
Is there any difference between passing the "id" and the "username"?
Which one is more secure?
Which one you recommend?
I would like to optimize my website in terms of security and performance.
1) Neither is inarguably the right choice.
2) The ID is probably shorter and marginally faster to look up. The ID gives away slightly more information about your system: if you know that a site uses serial IDs at all, and you know what one of them is, that's pretty much as good as knowing all of them, whereas knowing one username does not tell you the usernames of any other users. On the other hand, the username is more revealing of the user's psychology and may constitute a password hint.
3) Both have extremely marginal downsides, as described in item 2.
4) I'd use the ID.
The primary key is always the safest method for identifying database rows. For instance, you may later change your mind and allow duplicate usernames.
Depending on how your ActionScript is communicating with PHP, it will likely also require sending fewer bytes if you send an integer ID in your request rather than a username.
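If it helps, here is a sketch of the lookup on the PHP side, binding the ID as an integer in a prepared statement (PDO here; how Flash delivers the value is up to you - POST is assumed below):

$stmt = $pdo->prepare('SELECT * FROM members WHERE id = ?');
$stmt->execute([(int) $_POST['id']]); // cast: a valid id is always an integer
$member = $stmt->fetch(PDO::FETCH_ASSOC);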
Arguments for passing id number:
People never change their id. People do change their names. For a casual games site with disposable accounts, that might not be a problem, but for long-term registered users it can be. I've had to handle a demand by an upset woman that her ex-husband's surname be purged from her user name. A process for doing this had to be rapidly established!
Shorter
Easier to index and partition.
Arguments for passing user name:
Slightly harder (but not impossible) to guess a legal, existing account - e.g. to peruse random people's records, if that's your thing.
You should probably get intimately familiar with PHP sessions, perhaps using a framework that already has this in place, because it's non-trivial and you don't want to mess it up. The session management software will then handle all of this for you, including login screens, "I forgot my password", etc.
Then you can focus your attention on what your site is really primarily there for.
Sounds like fun (actionscript + php + mysql) - good luck!