Limitations to Bitwise Systems - php

I have just taken some online tutorials on how bits work, but I still have a couple of questions. I have searched the internet but didn't find what I was looking for; I may have been searching with the wrong keywords.
Let's say I wish to build an option or permission system using bitwise operators (I think that is the correct terminology). I have the following hesitations:
1) Is it possible to end up with collisions when using & etc.?
2) If there are collision opportunities, what sort of steps should I take when designing my permissions? Do the permission numbers change if I have a considerably large set of permissions, e.g. over 500?
Hopefully I got my question across correctly; if not, please let me know and I will try to rephrase.
EDIT::
Similar Question here which i believe has been answered
User role permissions for different modules using bitwise operators

What kind of collisions?
For 500 different permissions, you'd have to store 500 bits. There are no computers in existence that can directly handle a 500+ bit value. With bit systems, you're basically restricted to what the underlying CPU can provide, e.g. 8, 16, 32, or 64-bit sized values. Anything beyond that has to be split across multiple memory chunks.
e.g.
permission_flags & (1 << 475)
is going to fail on every platform in existence, because once you shift past bit #63, you're past what the CPU can directly support.
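If you genuinely need more than 64 flags, the usual workaround is the chunking mentioned above: keep an array of native integers and map each permission number to a chunk index plus a bit offset. A minimal sketch, assuming a 64-bit PHP 7+ build (the helper names are made up for illustration):

<?php
// Sketch only: split large permission numbers into (chunk, offset) pairs.
const BITS_PER_CHUNK = 62; // stay safely below the 63-bit ceiling on 64-bit builds

function grantPermission(array &$chunks, int $permission): void {
    $chunk  = intdiv($permission, BITS_PER_CHUNK);
    $offset = $permission % BITS_PER_CHUNK;
    $chunks[$chunk] = ($chunks[$chunk] ?? 0) | (1 << $offset);
}

function hasPermission(array $chunks, int $permission): bool {
    $chunk  = intdiv($permission, BITS_PER_CHUNK);
    $offset = $permission % BITS_PER_CHUNK;
    return (($chunks[$chunk] ?? 0) & (1 << $offset)) !== 0;
}

$flags = [];
grantPermission($flags, 475);          // works even though 475 > 63
var_dump(hasPermission($flags, 475));  // bool(true)
var_dump(hasPermission($flags, 474));  // bool(false)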

Related

Maximum of 63 bits in PHP?

I'm building a system where I need to assign users access to a specific (individual) number of assets, potentially numbering in the tens of thousands.
I thought to do it with bitwise comparisons, so for example storing the value 3 if a user has access to assets 1 and 2, the value 7 for access to 1, 2 and 3, and so on.
The access is not necessarily sequential, so a user could easily have access to assets 10, 12 and 24324.
I quickly ran into a problem where the server wouldn't pick up assets beyond the 63rd bit, so obviously I've either misunderstood something, or bits are a poor way to store this kind of info.
My code, running on a 64-bit Linux system, is this (just for testing purposes obviously, to discover just such limitations as this):
<?php
if (isset($_GET['bitwise'])) {
    $bitwise = $_GET['bitwise'];
    echo "<br/>bitwise input: ";
    echo $bitwise;
    $bitcount = 0;
    // Walk each power of two up to the input and test it with a bitwise AND.
    for ($i = 1; $i <= $bitwise; $i *= 2) {
        if (($i & $bitwise) > 0) {
            $bitcount++;
            echo "<br/>{$bitcount}: " . $i . " is in " . $bitwise;
        }
    }
}
?>
And I input test values via the querystring. However, no matter what value I input, the maximum count I can get to is 63.
So, my question is: Is this simply because I'm using bitwise comparisons for something they're not ideal for (my theory), or is my implementation of it just wrong?
My next go-to solution would be to store the "bits" in arrays, so if someone has access to assets 1, 2 and 3 I'll store their list as [1,2,3]. It's unlikely that someone has access to more than, say, a hundred specific assets. Is this a reasonable way to do it? I realize this puts the question somewhat into discussion-worthy territory, but hopefully it's still specific enough.
The paramount concern is, of course, performance if the server has to serve a large number of clients at the same time.
(please excuse wrong terminology where applicable, hopefully my meaning is clear).
This is standard behavior: on 64-bit compiled PHP, integers have a maximum length of 64 bits. While it warms my secret grey-beard heart, if you have more than 64 different roles, a bitwise solution is the wrong one for access control.
Two other things worth mentioning.
First, doing this for performance reasons is probably a premature optimization for a web application. ACL lookups aren't going to be the bottleneck in your system for a long time, if at all. Also, it's not clear if bitwise operators offer that much PHP performance benefit, given the language's dynamically typed nature.
Second, the reason you're limited to 63 bits is that PHP (appears to?) use two's complement for its implementation of signed integers. The final bit is reserved for the sign, i.e. whether the number is positive or negative. I asked this question about the bitwise NOT a while back, which is why this question caught my eye.
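For what it's worth, a sketch of the non-bitwise direction (nothing here is from the asker's code; the names are illustrative): store the accessible asset IDs as a set keyed by ID and test membership directly, which sidesteps the 63-bit ceiling entirely.

<?php
// Sketch only: a set of asset IDs instead of a bit field.
$accessibleAssets = array_fill_keys([10, 12, 24324], true);

function canAccess(array $accessibleAssets, int $assetId): bool {
    // Hash lookup, so the check stays cheap even with thousands of assets.
    return isset($accessibleAssets[$assetId]);
}

var_dump(canAccess($accessibleAssets, 24324)); // bool(true)
var_dump(canAccess($accessibleAssets, 63));    // bool(false)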

Practical method of renaming files for high volume shared media server PHP / mySQL

OK, I am in the midst of developing a shared system/service of sorts, where people will be able to upload their own media to the server(s). I am using PHP and MySQL for the majority of the build, and am currently using a single-server environment. However, I need this to be scalable, as I do intend to move the media to a cluster of servers in the next 6 months, leaving the site/service on its own server. Anyway, that's a moot point.
My goal, or hope rather, is to come up with an extremely low-risk naming convention that has little possibility of ever running into a collision with another file when renaming the file upon upload. I have read many concepts to date and find that a UUID (GUID) is the best candidate for my overall needs, as the number of possibilities is so high that I don't think I could ever reach that many shared images.
My problem is coming up with a function that generates a UUID, preferably v3 or v5 (I understand they are similar, but v5 apparently doesn't comply 100% with the UUID standard). Knowing little about UUIDs and the constraints that make them unique and/or valid when trying to regex over them later, I can't seem to come up with a viable solution. Nor do I know which I should really go with: v3, v5, or v4 for that matter. So I am looking for advice as well as help on a function that will return the desired UUID version.
Save your breath, I haven't tried anything yet as I don't know where to begin. I intend to save these files across many folders to offset the load caused by large directory listings, so I am also reducing my risk of collision there. I am also storing these names in a DB with their associated folders and other information tied to each image. Another problem I foresee is that when I randomly generate a UUID for a file to be renamed, I don't want to query the DB multiple times in the event of a collision, so I may actually want to return, say, 5 UUIDs per function call, check which of them (if any) match in my query, and use the first one that has no match.
Anyway, I know this is a lot of reading and there's no code with it; hopefully you don't end up downvoting this because there's too much reading, or assume it is a poor question/discussion. I would seriously like to know how to tackle this from the beginning so I can scale up as needed with as little hassle as possible.
If you are going to store a reference to each file in the database anyway... why don't you use the MySQL auto_increment ID to name your files? If you scale the DB to a cluster, the ID is still unique (being a PK, it must be unique!), so why waste precious CPU time on UUID generation and the like? That is not what UUIDs are made for.
I'd go for the easiest way (and I've seen this in many other systems, too):
1. Upload the file.
2. When the upload succeeded, insert the DB reference (with the path determined by step 3); fetch the auto-incremented $ID.
3. Rename the file to ${YEAR}/${MONTH}/${DAY}/${ID} (adjust if you need a more granular path when too many files are uploaded per day).
4. If the rename failed, delete the DB reference and show an error message.
5. Update the DB reference with the actual path in the file system.
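A minimal sketch of that flow, assuming a PDO connection, an uploads table with id and path columns, and an upload field named media (all assumptions, not part of the answer):

<?php
// Sketch only: reserve an auto_increment id, then rename the upload to it.
$pdo = new PDO('mysql:host=localhost;dbname=media', 'user', 'pass');

$pdo->beginTransaction();
$pdo->exec("INSERT INTO uploads (path) VALUES ('')");   // reserve a row
$id = (int) $pdo->lastInsertId();                        // the unique file name

$dir = 'uploads/' . date('Y/m/d');                       // ${YEAR}/${MONTH}/${DAY}
if (!is_dir($dir)) {
    mkdir($dir, 0775, true);
}
$target = "$dir/$id";

if (!move_uploaded_file($_FILES['media']['tmp_name'], $target)) {
    $pdo->rollBack();                                    // step 4: drop the DB reference
    exit('upload failed');
}

$stmt = $pdo->prepare('UPDATE uploads SET path = ? WHERE id = ?');
$stmt->execute([$target, $id]);                          // step 5: store the real path
$pdo->commit();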
My goal, or hope rather, is to come up with an extremely low-risk naming convention that has little possibility of ever running into a collision with another file when renaming the file upon upload. I have read many concepts to date and find that a UUID (GUID) is the best candidate for my overall needs, as the number of possibilities is so high that I don't think I could ever reach that many shared images.
You could build a number (which you would then implement as a UUID) made up of:
Date (YYYYMMDD)
Server (NNN)
Counter of images uploaded on that server that day
This number will never generate any collisions since it always increments, and it can scale up to one thousand servers. Say you get at most one million images per day on each server; that's around 43 bits of information. Add another 32 bits of randomness so that a UUID can't be guessed (in fewer than 2^31 attempts on average). You still have some fifty-odd bits left to allow for further scaling.
Or you could store some digits in BCD to make them human-readable:
20120917-0172-4123-8456-7890d0b931b9
could be image 1234567890, with random part d0b931b9, uploaded to server 0172 on September 17th, 2012.
The scheme might even double as a "directory spreading" scheme: once an image has a UUID which maps to, say, 20120917-125-00001827-d0b931b9, that means server 125, and you can store it in a directory structure called d0/b9/31/b9/20120917-125-00001827.jpg.
The naming convention ensures uniqueness, and the random bits ensure that the directory structure stays "flat" (filling evenly, with no directory much fuller than the others), optimizing retrieval time.
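A rough sketch of how such an identifier and its directory spread could be generated in PHP; the server id and the counter source are placeholders (in practice the counter would come from the database or another shared source):

<?php
// Illustrative only: date + server + counter + 32 random bits, then a
// directory path derived from the random part.
$serverId = 172;                                   // placeholder
function nextCounter(): int { return 1234567890; } // placeholder for a DB-backed counter

$random = bin2hex(random_bytes(4));                // 32 bits, e.g. "d0b931b9"
$name   = sprintf('%s-%04d-%010d-%s', date('Ymd'), $serverId, nextCounter(), $random);

// Spread files across directories using the random part so no directory
// fills much faster than the others.
$dir  = implode('/', str_split($random, 2));
$path = "$dir/$name.jpg"; // e.g. d0/b9/31/b9/20120917-0172-1234567890-d0b931b9.jpg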

Fast large-scale key-value store for a PHP program

I'm working on a full text index system for a project of mine. As one part of the process of indexing pages it splits the data into a very, very large number of very small pieces.
I have gotten the size of the pieces as low as a constant 20-30 bytes, and it could be less; it is basically two 8-byte integers and a float that make up the actual data.
Because of the scale I'm looking for and the number of pieces this creates, I'm looking for an alternative to MySQL, which has shown significant issues at value-set sizes well below my goal.
My current thinking is that a key-value store would be the best option for this and I have adjusted my code accordingly.
I have tried a number of them, but for some reason they all seem to scale even worse than MySQL.
I'm looking to store on the order of hundreds of millions, billions, or more key-value pairs, so I need something that won't suffer a large performance degradation with size.
I have tried memcachedb, membase, and mongo and while they were all easy enough to set up, none of them scaled that well for me.
membase had the most issues due to the number of keys required and the limited memory available. Write speed is very important here, as this is a close-to-even workload: I write a thing once, then read it back a few times and store it for eventual update.
I don't need much performance on deletes, and I would prefer something that can cluster well, as I'm hoping to eventually scale this across machines, but it needs to work on a single machine for now.
I'm also hoping to make this project easy to deploy, so an easy setup would be much better. The project is written in PHP, so it needs to be easily accessible from PHP.
I don't need rows or other higher-level abstractions; they are mostly useless in this case, and I have already adapted the code from some of my other tests to work against a key-value store. That seems likely to be the fastest option, as I only have two things that would be retrieved from a row keyed off a third, so there is little additional work in using a key-value store. Does anyone know any easy-to-use projects that can scale like this?
I am using this store to hold individual sets of three numbers (the sizes are based on how they were stored in MySQL; that may not hold in other stores): two eight-byte integers, one for the ID of the document and one for the ID of the word, and a float representing the proportion of the document made up by that word (the number of times the word appeared divided by the number of words in the document). The index for this data is the word ID and the range the document ID falls into; every time I need to retrieve this data, it will be all of the results for a given word ID. I currently turn the word ID, the range, and a counter for that word/range combination each into binary representations and concatenate them to form the key, along with a 2-digit number saying which value I am storing for that key: the document ID or the float value.
Performance measurement was somewhat subjective: I watched the output of the processes putting data into or pulling data out of the storage to see how fast they were processing documents, rapidly refreshed the statistics counters that track more accurate figures on how fast the system is working, and compared the differences between each storage method.
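For what it's worth, the key layout described above (word ID + range + counter + a field selector) could be built with pack() so the keys are fixed-width binary strings; the field widths and the makeKey() helper are assumptions made for illustration:

<?php
// Sketch only: fixed-width binary keys for the word/range/counter layout.
function makeKey(int $wordId, int $range, int $counter, int $field): string {
    // J = unsigned 64-bit big-endian, N = unsigned 32-bit big-endian, C = unsigned byte.
    // Big-endian keeps keys that share a word id adjacent when sorted.
    return pack('JJNC', $wordId, $range, $counter, $field);
}

$docIdKey = makeKey(42, 7, 0, 1); // field 1: the document id
$propKey  = makeKey(42, 7, 0, 2); // field 2: the proportion (float)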
You would need to provide some more data about what you really want to do...
Depending on how you define "fast" and "large scale", you have several options:
memcache
redis
voldemort
riak
and so on... the list gets pretty big.
Edit 1:
Per the comments on this post, I would say take a look at Cassandra or Voldemort. Cassandra isn't a simple KV store per se, since you can store much more complex objects than just K -> V.
If you want to try Cassandra with PHP, take a look at phpcassa. But Redis is also a good option if you set up a replica.
Here are a few products and ideas that weren't mentioned above:
OrientDB - this is a graph/document database, but you can use it to store very small "documents". It is extremely fast, highly scalable, and optimized to handle vast numbers of records.
Berkeley DB - Berkeley DB is a key-value store used at the heart of a number of graph and document databases - supposedly has a SQLite-compatible API that works with PHP.
shmop - Shared memory operations might be one possible approach, if you're willing to do some dirty work. If your records are small and have a fixed size, this might work for you: use a fixed record size and pad with zeroes.
handlersocket - this has been in development for a long time, and I don't know how reliable it is. It basically lets you use MySQL at a "lower level", almost like a key-value store. Because you're bypassing the query parser etc., it's much faster than MySQL in general.
If you have a fixed record-size, few writes and lots of reads, you may even consider reading/writing to/from a flat file. Likely nowhere near as fast as reading/writing to shared memory, but it may be worth considering. I suggest you weigh all the pros/cons specifically for your project's requirements, not only for products, but for any approach you can think of. Your requirements aren't exactly "mainstream", and the solution may not be as obvious as picking the right product.
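As one concrete illustration (not a recommendation from either answer), here is roughly how the per-word lookups could sit in Redis through the phpredis extension, with one hash per word ID; the key layout is an assumption:

<?php
// Sketch only: word id -> (doc id => proportion) stored as Redis hashes.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// One hash per word id; each field is a document id, each value the proportion.
// This keeps all results for a given word retrievable in a single call.
$redis->hSet('word:42', '100123', '0.0137');
$redis->hSet('word:42', '100456', '0.0042');

$allDocsForWord = $redis->hGetAll('word:42'); // [docId => proportion, ...]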

How many variables is too many when storing in $_SESSION?

I'm looking for an idea of best practices here. I have a web-based application that has a number of hooks into other systems. Let's say 5, and each of these 5 systems has a number of flags determining different settings the user has selected in said systems; let's say 5 settings per system (so 5*5).
I am storing the status of these settings in the user's session variables and was wondering whether that is a sufficient way of doing it.
I'm learning PHP as I go along, so I'm not sure about any pitfalls this could run me into!
There's no size limit on the session (apart from the obvious memory and disk quota limits). Just keep it sane and don't put your entire database in it.
You have to realize that the session usually times out after 20 minutes, after which the data is eventually garbage collected. 25 values in the session isn't too much, but be sure you store them somewhere a bit more persistent if you can't afford to lose that data.
It's fine for a prototype, but you'll probably want to consider persisting these settings in MySQL/Postgres/Mongo/etc.
In terms of the number of settings that PHP can support, it depends on how many users you have and how much memory your production environment has.
I've seen PHP $_SESSIONs that contain arrays of hundreds of objects, so your little 5x5 matrix is tiny by comparison.
(The hundreds-of-objects examples that I've seen are, however, a bit excessive, so just to clarify: I'm not condoning going that far! ;-))
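A minimal sketch of that approach, with the session acting as a cache in front of a more permanent store; loadSettings() and saveSettings() are stand-ins for whatever persistence layer you use:

<?php
// Sketch only: keep the 5x5 settings in $_SESSION, but persist changes elsewhere.
function loadSettings(int $userId): array { return []; }   // placeholder
function saveSettings(int $userId, array $s): void {}       // placeholder

session_start();
$userId = 1;

if (!isset($_SESSION['settings'])) {
    $_SESSION['settings'] = loadSettings($userId);          // e.g. from MySQL
}

$_SESSION['settings']['system3']['flag2'] = true;            // update in the session
saveSettings($userId, $_SESSION['settings']);                // and persist, don't rely on the session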

How to maintain chat data?

I have a curious question...
I wanted to know how to maintain chat data in a database.
I have been using a PHP/MySQL application that stores users' chat data in a database.
Now my question is: if the chat data grows to, say, some millions of records, how do I store it? Does MySQL support that, or does it have any limitations or impose any burden?
Take the example of Gmail chat. I can chat without limit and can also retrieve all my previous chat data. How is that possible?
Can anyone answer this typical question of mine?
Chat history isn't really that heavyweight. If I assume around 100 bytes per message, 6 messages per minute, and 5 hours of chatting per day (that is a very talkative chatter, though), every day of the year as a worst case, that's 100 × 6 × 60 × 5 ≈ 180 KB per day, or roughly 60-65 MB per user per year (!).
That means that with 1 million talkative chatters (very improbable) you would need around 60 TB of data storage.
Since this is a worst-case calculation, I would start off with a maximum of 1 TB of storage, set up the database, and see how things go. It is highly improbable for a very young service to grow that fast.
Also, I would personally not recommend using a Windows system for something like this unless you know very well what you're doing. MySQL on a Debian distribution will store billions of records, and will probably do so faster due to fewer OS-level limitations (see the MySQL documentation for details; there should be a section about the limitations on Windows).
MySQL will happily store millions, even billions of records, but some of the numeric types won't be enough: see this for the maxima of the numeric types. As you can see, it would be better to use BIGINT UNSIGNED for e.g. auto-increment fields.
Performance may become a problem for large tables, but that can mostly be solved with indexes (meaning "I've seen performance decrease somewhere around the 100 GB mark in a similar situation").
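As a sketch of the BIGINT UNSIGNED point (the table and column names are assumptions, not a prescribed schema):

<?php
// Sketch only: a chat message table using BIGINT UNSIGNED for the keys.
$pdo = new PDO('mysql:host=localhost;dbname=chat', 'user', 'pass');
$pdo->exec("
    CREATE TABLE IF NOT EXISTS messages (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        sender_id   BIGINT UNSIGNED NOT NULL,
        receiver_id BIGINT UNSIGNED NOT NULL,
        body        TEXT NOT NULL,
        created_at  DATETIME NOT NULL,
        KEY idx_conversation (sender_id, receiver_id, created_at)
    ) ENGINE=InnoDB
");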
Google has vast amounts of custom storage designed for its own requirements. What I suggest is that you determine your requirements more concretely and then decide on the platform you need.
