I am currently designing an online voting system for my old high school (a mock award ceremony, really). Due to a restrictive school board, I can guarantee that MySQL will not be an option for storing votes. I am also assuming that if votes are stored in local files, data will be overwritten when the same file is written by several requests at the same time (which is quite likely).
Does anyone have any suggestions as to how I might go about this? Preferably a PHP-based solution, given the school board's restrictions. Please note the data will only need to be accessible for a few hours on a continuously running web server, so if the data is RAM-like (for lack of a better term) that would be fine.
While I am tempted to reject the premise of the question like some commenters have, here's an answer (I'm shamelessly trying to earn 200 reputation to try to help get a new site launched):
Write a recordVote function that stores each vote in its own file in a directory using a unique id in the file name (PHP doesn't have one guaranteed to yield truly unique GUIDs on all platforms, so use something like https://gist.github.com/dahnielson/508447).
When the polls close, run a tallyVote routine to compile the count of votes by reading all files in the directory.
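The two routines above can be sketched as follows. This is a minimal illustration in Python rather than PHP, and the names `record_vote`, `tally_votes`, and the `.vote` suffix are made up for the example; the same one-file-per-vote idea carries over to PHP directly. It assumes a random version 4 UUID is unique enough for a few hours of voting, and it opens each file in exclusive-create mode so that even an unlikely name collision cannot overwrite an earlier vote.

```python
import os
import uuid
from collections import Counter


def record_vote(votes_dir, candidate):
    """Store one vote in its own uniquely named file and return the path."""
    os.makedirs(votes_dir, exist_ok=True)
    path = os.path.join(votes_dir, uuid.uuid4().hex + ".vote")
    # Mode "x" fails if the file already exists, so a name collision
    # raises an error instead of silently overwriting a vote.
    with open(path, "x", encoding="utf-8") as handle:
        handle.write(candidate)
    return path


def tally_votes(votes_dir):
    """Count votes by reading every vote file in the directory."""
    counts = Counter()
    for name in os.listdir(votes_dir):
        if not name.endswith(".vote"):
            continue
        with open(os.path.join(votes_dir, name), encoding="utf-8") as handle:
            counts[handle.read().strip()] += 1
    return counts
```

Because no two requests ever write to the same file, there is nothing to lock while the polls are open.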
I want to build a detailed logger for my application. Because it can get very complex and has to save a lot of different things, I wonder where it is best to save the logs: in a database (and if so, which kind of database is better for this kind of operation) or in a file (and if so, in what format: text, CSV, JSON, XML). My first thought was of course a file, because I see a lot of problems with a database, but I also want to be able to display those logs, and that is easier with a database.
I am building a log for HIPAA compliance and here is my rough implementation (not finished yet).
File VS. DB
I use a database table to store the last 3 months of data. Every night a cron will run and push the older data (data past 3 months) off into compressed files. I haven't written this script yet but it should not be difficult. That way the last 3 months can be searched, filtered, etc. But the database won't be overwhelmed with log entries.
Database Preference
I am using MSSQL because I don't have a choice. I usually prefer MySQL though, as it has better paging optimization. If you are doing more than a very minimal amount of searching and filtering, or if you are concerned about performance, you may want to consider Apache Solr as a middleman. I'm not a DB expert, so I can't give you much more than that.
Table Structure
My table has 5 columns: Date, Operation (create, update, delete), Object (patient, appointment, doctor), ObjectID, and Diff (a serialized array of before and after values; changed values only, with no empty or unchanged values, to save space).
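As a rough illustration of that table, here is a sketch in Python using SQLite as a stand-in for MSSQL. The table name `audit_log`, the column types, and the helpers `changed_values` and `log_change` are assumptions made for the example, and the Diff column is serialized as JSON rather than as a PHP array.

```python
import json
import sqlite3
from datetime import datetime, timezone


def create_log_table(conn):
    """Create the five-column log table (names and types are illustrative)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS audit_log (
               date TEXT NOT NULL,
               operation TEXT NOT NULL
                   CHECK (operation IN ('create', 'update', 'delete')),
               object TEXT NOT NULL,
               object_id INTEGER NOT NULL,
               diff TEXT NOT NULL
           )"""
    )


def changed_values(before, after):
    """Return only the fields whose values differ, as {field: [old, new]}."""
    diff = {}
    for key in set(before) | set(after):
        old, new = before.get(key), after.get(key)
        if old != new:
            diff[key] = [old, new]
    return diff


def log_change(conn, operation, obj, object_id, before, after):
    """Insert one log row holding just the changed values."""
    diff = changed_values(before, after)
    conn.execute(
        "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), operation, obj, object_id,
         json.dumps(diff, sort_keys=True)),
    )
    return diff
```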
Summary
The most important question to consider is: do you need people to be able to access and filter/search the data regularly? If yes, consider a database for the recent history or the most important data.
If no, a file is probably a better option.
My hybrid solution is also worth considering. I'll be pushing the files off to an Amazon file server so they don't take up my web server's space.
You can build a detailed, complex logger using an existing library such as log4php. It is already fully tested, likely performs better than a custom design, and will save development time. I have personally used several PHP and .NET libraries for complex logging needs in financial and medical projects.
If you need to do this in PHP, I would suggest using this:
https://logging.apache.org/log4php/
I think the right answer is actually: Neither.
Neither a file nor a DB gives you proper search and filtering, and you need that when looking at logs. I deal with logs all day long (see http://sematext.com/logsene to see why), and I'd tackle this as follows:
log to file
use a lightweight log shipper (e.g. Logagent or Filebeat)
index logs into either your own Elasticsearch cluster (if you don't mind managing and learning) or one of the Cloud log management services (if you don't want to deal with Elasticsearch management, scaling, etc. -- Logsene, Loggly, Logentries...)
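For the first step, here is a minimal sketch of logging to a file in a form that is easy to ship and index later. It is written in Python rather than PHP, and the one-JSON-object-per-line layout, the field names, and the helper `build_file_logger` are assumptions for the example, not something the shippers above require.

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_file_logger(path, name="app"):
    """Return a logger that appends JSON lines to the given file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger
```

Structured lines like these keep the application itself simple (it only appends to a file) while leaving search and filtering to the indexing side.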
I am working on a project that synchronizes online and offline features because the Internet connection is unstable. I have come up with a possible solution: create two similar databases, one online and one offline, and sync the two. Is this a good method, or are there better options?
I have researched the subject online but haven't come across anything substantive. One useful link I found was on database replication. But I want the offline version to detect Internet presence and sync accordingly.
Can you please help me find solutions or clues to solve my problem?
I'd suggest you have an online store for syncing and a local database (IndexedDB in the browser, SQLite in a program, or something similar). Log all your changes in the local database, and keep a record of which data was entered after the last sync.
When you have a connection, you sync all new data with the online store at set intervals (like once every 5 minutes, or as a constant stream if you have the bandwidth/CPU capacity).
When the user logs in from a "fresh" location, the online database pushes all data to the client, which fills the local database with it and then resumes normal syncing.
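The local side of that scheme could look roughly like this. It is a sketch in Python with SQLite; the `changes` table, the `synced` flag, and the `is_online` and `push` callbacks are all hypothetical names standing in for whatever connectivity check and upload call the real application uses.

```python
import sqlite3


def open_local_db(path=":memory:"):
    """Open the local store and make sure the change log table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS changes (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               payload TEXT NOT NULL,
               synced INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def record_change(conn, payload):
    """Log a change locally; it starts out marked as not yet synced."""
    conn.execute("INSERT INTO changes (payload) VALUES (?)", (payload,))
    conn.commit()


def sync_pending(conn, is_online, push):
    """Push unsynced rows in order if online; return how many were sent."""
    if not is_online():
        return 0
    rows = conn.execute(
        "SELECT id, payload FROM changes WHERE synced = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, payload in rows:
        push(payload)  # if this raises, the row stays unsynced for next time
        conn.execute("UPDATE changes SET synced = 1 WHERE id = ?", (row_id,))
        conn.commit()
        sent += 1
    return sent
```

Calling `sync_pending` from a timer gives the "set intervals" behaviour; a row is only marked as synced after its push succeeds, so a dropped connection leaves it queued.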
Plan A: Primary-Primary replication (formerly called Master-Master). You do need to be careful with PRIMARY KEYs and UNIQUE keys. While the "other" machine is offline, you could write conflicting values to a table. Later, when they try to sync up, replication will freeze, requiring manual intervention. (Not a pretty sight.)
Plan B: Write changes to some storage other than the db. This suffers the same drawbacks as Plan A, plus there is a bunch of coding on your part to implement it.
Plan C: Galera cluster with 3 nodes. When all 3 nodes are up, all can take writes. If one node goes down, or network problems make it seem offline to the other two, it will automatically become read-only. After things get fixed, the sync is done automatically.
Plan D: Only write to a reliable Primary; let the other be a readonly Replica. (But this violates your requirement about an "unstable Internet".)
None of these perfectly fits the requirements. Plan A seems to be the only one that has a chance. Let's look at that.
If you have any UNIQUE key in any table and you might insert new rows into it, the problem exists. Even something as innocuous as a 'normalization table' wherein you insert a name and get back an id for use in other tables has the problem. You might do that on both servers with the same name and get different ids. Now you have a mess that is virtually impossible to fix.
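The normalization-table problem can be reproduced in a few lines. This sketch uses two independent in-memory SQLite databases in Python to stand in for the two disconnected primaries; the table `names` and the helper functions are invented for the illustration.

```python
import sqlite3


def new_server():
    """Create an isolated database standing in for one offline primary."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE names ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT UNIQUE)"
    )
    return conn


def get_or_create_id(conn, name):
    """Insert the name if it is new, then return its id."""
    conn.execute("INSERT OR IGNORE INTO names (name) VALUES (?)", (name,))
    row = conn.execute("SELECT id FROM names WHERE name = ?", (name,))
    return row.fetchone()[0]


def conflicting_names(server_a, server_b):
    """List names that exist on both servers but with different ids."""
    a = dict(server_a.execute("SELECT name, id FROM names"))
    b = dict(server_b.execute("SELECT name, id FROM names"))
    return sorted(n for n in a.keys() & b.keys() if a[n] != b[n])
```

If one server sees "alice" then "bob" while the other sees "bob" then "alice", each name ends up with a different id on each side, and every row in other tables that stored those ids now disagrees too.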
Not sure if it's outside the scope of the project, but you can try these:
https://pouchdb.com/
https://couchdb.apache.org/
" PouchDB is an open-source JavaScript database inspired by Apache CouchDB that is designed to run well within the browser.
PouchDB was created to help web developers build applications that work as well offline as they do online.
It enables applications to store data locally while offline, then synchronize it with CouchDB and compatible servers when the application is back online, keeping the user's data in sync no matter where they next login. "
I've just finished a basic PHP file, that lets indie game developers / application developers store user data, handle user logins, self-deleting variables etc. It all revolves around storage.
I've made systems like this before, but always hit the max_user_connections issue - which I personally can't currently change, as I use a friend's hosting - and free hosting providers often limit max_user_connections anyway. This time, I've made the system fully text-file based (each file holding JSON structures).
The system works fine currently, as it's being tested by only me and another 4/5 users per second. The PHP script basically opens a text file (based upon query arguments), uses json_decode to convert the contents into the relevant PHP structures, then alters and writes back to the file. Again, this works fine at the moment, as there are few users using the system - but I believe if two users attempted to alter a single file at the same time, the person who writes to it last will overwrite the data that the previous user wrote to it.
Using SQL databases always seemed to handle queries quite slowly - even basic queries. Should I try to implement some form of server-side caching system, or possibly file write stacking system? Or should I just attempt to bump up the max_user_connections, and make it fully SQL based?
Are there limits to the number of users that can READ text files per second?
I know game / application / web developers must create optimized PHP storage solutions all the time, but what are the best practices in dealing with traffic?
It seems most hosting companies set the max_user_connections to a fairly low number to begin with - is there any way to alter this within the PHP file?
Here's the current PHP file, if you wish to view it:
https://www.dropbox.com/s/rr5ua4175w3rhw0/storage.php
And here's a forum topic showing the queries:
http://gmc.yoyogames.com/index.php?showtopic=623357
I did plan to release the PHP file, so developers could host it on their own site, but I would like to make it work as well as possible, before doing this.
Many thanks for any help provided.
Dan.
I strongly suggest you not re-invent the wheel. There are many options available for persistent storage. If you don't want to use SQL consider trying out any of the popular "NoSQL" options like MongoDB, Redis, CouchDB, etc. Many smart people have spent many hours solving the problems you are mentioning already, and they are hard at work improving and supporting their software.
Scaling a MySQL database service is outside the scope of this answer, but if you want to throttle up what your database service can handle you need to move out of a shared hosting environment in any case.
"but I believe if two users attempted to alter a single file at the same time, the person who writes to it last will overwrite the data that the previous user wrote to it."
- that is for sure. Depending on the platform, it may even throw an error if the 2nd tries to save while the first has it open.
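One common way to avoid that lost update is to hold an exclusive lock for the whole read-modify-write cycle. The sketch below is in Python and uses `fcntl.flock`, which is Unix-only; PHP exposes the same primitive as `flock()`. The function name `update_json_file` and the `mutate` callback are assumptions made for the example.

```python
import fcntl  # Unix-only advisory locking
import json
import os


def update_json_file(path, mutate):
    """Read, modify, and rewrite a JSON file while holding an exclusive lock."""
    # O_CREAT creates the file if needed without truncating existing data.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+", encoding="utf-8") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until other writers finish
        try:
            raw = handle.read()
            data = json.loads(raw) if raw else {}
            mutate(data)
            # Rewrite the file in place while still holding the lock.
            handle.seek(0)
            handle.truncate()
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
    return data
```

The lock is advisory: it only protects the file if every writer goes through the same locking code.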
"Are there limits to the number of users that can READ text files per second?"
- no, but it is pointless to open a file repeatedly just to read it. That file should be cached, for example in a content delivery network.
"I know game / application / web developers must create optimized PHP storage solutions all the time, but what are the best practices in dealing with traffic?"
- usually a database will do a better job than files, starting from the fact that the most frequent selects are kept in RAM, while the most frequently read .txt files are not. As @oliakaoil suggested, read about the differences between the databases and see what you need.
OK, I am in the midst of developing a shared system/service of sorts, where people will be able to upload their own media to the server(s). I am using PHP and MySQL for the majority of the build, and am currently using a single-server environment. However, I need this to be scalable, as I intend to move the media to a cluster of servers in the next 6 months, leaving the site/service on its own server. Anyway, that's a moot point.
My goal, or rather my hope, is to come up with an extremely low-risk naming convention that has very little chance of ever colliding with another file when renaming a file upon upload. I have read about many concepts to date and find that UUID (GUID) is the best candidate for my overall needs, as it has such a high number of possibilities that I don't think I could ever reach that many shared images.
My problem is coming up with a function that generates a UUID, preferably v3 or v5 (I understand they are the same, but v5 currently doesn't comply 100% with the UUID standard). Knowing little about UUIDs and the constraints that make them unique and/or valid when trying to regex over them later if needed, I can't seem to come up with a viable solution. Nor do I know which I should really go with: v3, v5, or v4 for that matter. So I am looking for advice, as well as help with a function that will return the desired UUID version.
Save your breath: I haven't tried anything yet, as I don't currently know where to begin. With that said, I intend to save these files across many folders to offset the load caused by large directory listings, so I am also reducing my risk of collision there. I am also storing these names in a DB with their associated folders and other information tied to each image. So another problem I see is that when I randomly generate a UUID for a file to be renamed, I don't want to query the DB multiple times in the event of a collision. I may actually want to return maybe 5 UUIDs per function call, see which, if any, have a match in my query, and use the first one that doesn't.
Anyway, I know this is a lot of reading and that there's no code with it. Hopefully you don't end up downvoting this because there's too much reading, or assume this is a poor question/discussion, as I would seriously like to know how to tackle this from the beginning so I can scale up as needed with as little hassle as possible.
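For reference, the UUID versions mentioned in the question differ in how they are produced: v3 and v5 are both name-based and deterministic (v3 hashes a namespace and name with MD5, v5 with SHA-1), while v4 is random. The sketch below uses Python's standard `uuid` module purely as an illustration; the helper names and the validation regex are assumptions for the example, and a PHP version would need a library or hand-written equivalent.

```python
import re
import uuid

# Canonical textual form: 8-4-4-4-12 hex digits, version 1-5, RFC 4122 variant.
UUID_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"
)


def random_file_id():
    """Version 4: random, so every call yields a different value."""
    return str(uuid.uuid4())


def name_based_file_id(namespace, name):
    """Version 5: SHA-1 of namespace and name; same input, same UUID."""
    return str(uuid.uuid5(namespace, name))


def is_valid_uuid(text):
    """Check the canonical textual form with the regex above."""
    return UUID_PATTERN.match(text.lower()) is not None
```

Since name-based UUIDs repeat for repeated input, a random v4 value is the more natural fit when every upload needs a fresh name.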
If you are going to store a reference to each file in the database anyway .. why don't you use the MySQL auto_increment id to name your files? If you scale the DB to a cluster, the ID is still unique (being a PK, it must be unique!), so why waste precious CPU time with the UUID generation and stuff? this is not what UUIDs are made for.
I'd go for the easiest way (I've seen it in many other systems):
1. upload file
2. when the upload succeeded, insert DB reference (with the path determined by 3.); fetch auto_incremented $ID
3. rename file to ${YEAR}/{$MONTH}/${DAY}/{$ID} (adjust if you need a more granular path, when too many files are uploaded per day)
4. when the rename failed, delete DB reference and show error message
5. update DB reference with the actual path in the file system
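Here is a sketch of that flow in Python, with SQLite's auto-incrementing key standing in for MySQL's auto_increment. The `files` table, the function names, and the choice to insert the row with an empty path and fill it in after the rename are assumptions made for the example.

```python
import os
import sqlite3
from datetime import date


def create_files_table(conn):
    """Create the reference table; the id column auto-increments."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, path TEXT)"
    )


def store_upload(conn, tmp_path, media_root, today=None):
    """Move an uploaded file to YEAR/MONTH/DAY/ID and record its path."""
    today = today or date.today()
    # Insert the DB reference first to obtain the auto-incremented id.
    file_id = conn.execute("INSERT INTO files (path) VALUES (NULL)").lastrowid
    rel_path = os.path.join(
        f"{today.year:04d}", f"{today.month:02d}", f"{today.day:02d}",
        str(file_id),
    )
    final_path = os.path.join(media_root, rel_path)
    try:
        os.makedirs(os.path.dirname(final_path), exist_ok=True)
        os.rename(tmp_path, final_path)
    except OSError:
        # The rename failed: remove the DB reference and report the error.
        conn.execute("DELETE FROM files WHERE id = ?", (file_id,))
        conn.commit()
        raise
    # Record the actual path now that the file is in place.
    conn.execute("UPDATE files SET path = ? WHERE id = ?", (rel_path, file_id))
    conn.commit()
    return file_id, rel_path
```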
"My goal, or hope rather is to come up with an extremely low risk naming convention that runs little possibility ever of running into a collision with another file when renaming the file upon upload. I have read to date many concepts and find that UUID (GUID) is the best candidate for my over all needs as it has a number so high of possibilities that I dont think I could ever reach that many shared images ever."
You could build a number (which you would then implement as UUID) made up of:
Date (YYYYMMDD)
Server (NNN)
Counter of images uploaded on that server that day
This number will never generate any collisions since it always increments, and it can scale up to one thousand servers. Say that you get at most one million images per day on each server: that's around 43 bits of information. Add another 32 bits of randomness so that a UUID can't be guessed (in fewer than 2^31 attempts on average). You have some fifty-odd bits left to allow for further scaling.
Or you could store some digits in BCD to make them human-readable:
20120917-0172-4123-8456-7890d0b931b9
could be image 1234567890, random d0b931b9, uploaded on server 0172 on September 17th, 2012.
The scheme might even double as a "directory spreading" scheme: once an image has a UUID which maps to, say, 20120917-125-00001827-d0b931b9, that means server 125, and you can store it in a directory structure called d0/b9/31/b9/20120917-125-00001827.jpg.
The naming convention ensures uniqueness, and the random bits ensure that the directory structure is "flat" (filling evenly, with no directory much fuller than the others), optimizing retrieval time.
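The composition and the directory spreading can be sketched as below, in Python. The function names, the zero-padded 8-digit counter, and the use of `secrets.token_hex` for the 32 random bits are assumptions for the example; the identifier layout follows the 20120917-125-00001827-d0b931b9 form used above.

```python
import secrets


def make_image_id(day, server, counter, random_hex=None):
    """Compose an id of the form YYYYMMDD-server-counter-random."""
    if random_hex is None:
        random_hex = secrets.token_hex(4)  # 4 bytes = 32 bits of randomness
    return f"{day}-{server}-{counter:08d}-{random_hex}"


def spread_path(image_id, extension="jpg"):
    """Derive the storage path from the random part of the id."""
    stem, random_hex = image_id.rsplit("-", 1)
    # Split the random hex into two-character directory names.
    parts = [random_hex[i:i + 2] for i in range(0, len(random_hex), 2)]
    return "/".join(parts + [f"{stem}.{extension}"])
```

For the example id, `spread_path("20120917-125-00001827-d0b931b9")` gives `d0/b9/31/b9/20120917-125-00001827.jpg`.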
This is something I am really curious about, and I do not really understand how it is possible.
So let's say I am the owner of Facebook (ahah) and I have millions of people visiting my website every day, thousands and thousands of images, videos, logs etc.
How do I store all this data?
Do I have more databases in different servers around the world and then I connect to them from a single location?
Do I use an internal API system that requests info from other servers where the data is stored?
For example I know that Facebook has a lot of data centers around the world and hundreds of servers..
How do they connect to these servers? Are the profiles stored in different locations and when I connect to my profile, I will then be using that specific server? Or is there one main server that has the support of other hundreds of servers around the world?
Is there a way to use PHP in a way that I will connect to different servers and to different mySQL (???) databases to store and retrieve data whenever I want?
Sorry if this looks like a silly question, but since I might one day work on a successful website, I really want to know what I would have to do and what the logic behind it is.
Thank you very much.
I'll try to answer your (big) question, but not from Facebook's point of view, since their architecture is pretty much known.
First thing you have to know is that you would have to distribute the workload of your web application. Question is how, so in order to determine what's going to be slow, you have to divide your app in segments.
First up is the HTTP server, or the one that accepts all the requests. By going to "www.your-facebook.com", you're contacting a service on an IP. Naturally, you would probably have more than one IP but let's say you have a single entry point.
Now what happens? You have an HTTP server software, let's say Apache and it handles incoming connections. Since Apache creates a thread per connected user, it requires certain amount of memory for that operation. Eventually, it will run out of memory and then shit hits the fan, stuff stops working, your site is unavailable.
Therefore, you have to somehow scale this part of your application that connects your PHP code / MySQL db to people who want to interact with it.
Let's assume you successfully scaled your Apache and you have a cluster of computers which can accept new computers in order to scale-out. You solved your first problem.
Next part is the actual layer that does the work. Accepts input from the user and saves it somewhere (MySQL) and that's the biggest problem you'll have - why?
Due to the database.
Databases store their data on media such as hard drives. Hard drives, whether SSD or mechanical, are limited by how fast they can write or retrieve data. By comparison, if I'm not mistaken, RAM operates at transfer rates of around 6GB/sec, not to mention that its seek time is much, much lower than an HDD's.
Therefore, if you have an X amount of users asking for a piece of information and you can only deliver it at a certain rate - your app crashes, or it becomes unresponsive and the layer handling database queries becomes slow since the hardware cannot match the speed at which you need the data.
What are the options here? There are many, I won't mention all of them
Split Reads and Writes. Set your database layer in such a way that you have dedicated machines that write the data and completely different ones that read it. You have to use replication and replication has its own quirks - it never works without breaking.
Optimize handling of your data set by sharding your data. Great for read / write performance, screwed up when you need to query multiple shards and merge the data.
Get better hardware, especially storage (such as FusionIO)
Pay for better storage engine (such as TokuDB)
Alleviate load on the database by using caching. The data that your users request probably doesn't change so often that you have to query the db every single time (say you're viewing someone's profile, what's the chance they'll change it every second?). That's why Facebook uses Memcached extensively - a system that stores small pieces of data in RAM, it's easily scalable and what not. Most important, it's damn quick!
Use different solutions next to MySQL. MySQL (and some other databases) aren't good for every type of data storage or retrieval. Someone mentioned NoSQL before. NoSQL solutions are quick, but still immature. They don't do as much as relational databases do. They use methods of delaying disk write (they keep cached copy of data they need to write in RAM) so that they can achieve fast insert rates. That's why it's not unusual to lose data when using NoSQL.
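To make the caching option concrete, here is a minimal cache-aside sketch in Python. An in-process dictionary with a time-to-live stands in for Memcached, and the class `TtlCache` and its `load` callback are hypothetical names for the example; a real deployment would talk to a shared cache server instead.

```python
import time


class TtlCache:
    """Tiny in-process cache with per-entry expiry (stand-in for Memcached)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, load):
        """Return the cached value, calling load(key) only on a miss."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = load(key)  # e.g. the database query for a profile
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop an entry, for example after the underlying row changes."""
        self._entries.pop(key, None)
```

With a 60-second lifetime, a profile viewed a thousand times a minute costs one database query per minute instead of a thousand.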
Topic about MySQL vs "insert database or whatever here" is broad, I don't want to go into that but remember - every single one of data stores out there saves data on the hard drive eventually. The difference (physical of course) is how they optimize their flushing to the disk itself.
I also didn't mention various reports you can run by gathering the data (how many men between 19 and 21 have clicked an advert X between 01:15 and 13:37 CET and such) which is what Facebook is actually gathering (scary stuff!).
Third up - the language gluing the data store (MySQL) and output (HTTP server). PHP.
As you can see, most of the work here is already done by Apache and MySQL. Optimization on the PHP level is small; even Facebook got small results (they claim 50%, but that's UP TO 50%). I tried HipHop extensively, and it is not as fast as it claims to be. Naturally, the Facebook guys mentioned that already, so it's no wonder. The advantage they get is because they replaced Apache with their own server built into HipHop. Some people claim "language X is better than language Y" and they're right, but that's not always the case. Each language has its own advantages and disadvantages.
For example, PHP is widespread but it's slow for certain operations (implementing a Trie with over 1 billion entries, for example). It's great for things like echoing some HTML after parsing the output from the db. It's quick to insert and retrieve data from the database, and that's about 90% of PHP usage - talk to the db, display the data, end.
Therefore, no matter what language you use (say we used C++ instead of PHP), your bottleneck will be the data storage / retrieval layer.
On the other hand, why is using C++ NOT handy? Because there are more people who know how to use PHP than ones who use C++. It's also MUCH slower to develop web apps in C++. Sure, they will execute faster, but who will notice the difference between 1 millisecond and 1 microsecond?
This post is more like an informative blog post, I know it's not filled with resources to back up my claims but anyone who did any work with larger data sets or websites will know that the P.I.T.A. is always the data storage component. Some things that I said probably won't fit with everyone, but in a NUTSHELL this is how you'd go about optimizing your site.
Unfortunately, your question doesn't have a simple answer. For the MySQL portion of it, you would need to investigate database scale-out. You can start looking at it here: http://www.mysql.com/why-mysql/scaleout/mixi.html. There are a number of different ways to set up Apache/PHP web sites across a server farm. One of them involves setting up round robin DNS. This is adding a DNS record with a number of different IP addresses. Your DNS then hands out a different IP address each time the record is requested so that the load is balanced across a number of servers. You can also set up clustering with MySQL, Apache and Heartbeat, but that is more of a high-availability solution than a scaling solution.
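The rotation that round robin DNS performs can be illustrated with a few lines of Python. This only models the hand-out order; it is not a DNS server, and the class name and the example addresses (taken from the 192.0.2.0/24 documentation range) are assumptions.

```python
from itertools import cycle


class RoundRobinResolver:
    """Hand out a fixed list of addresses in rotating order."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one address is required")
        self._cycle = cycle(list(addresses))

    def resolve(self):
        """Return the next address in the rotation."""
        return next(self._cycle)
```

Over many lookups each address is handed out equally often, which balances new connections but takes no account of how loaded each server actually is.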
When you have a website with so many users you'll already have enough experience to know the answer of the question, you'll also have a lot of money to pay people to find the optimal architecture of your system.
I'm not saying that what I describe below is the Holy Grail, but it is certainly an option:
You will have a big, fragmented database with lots of backups and you'll have a few name servers which will know the location of servers and some rules about the data stored on each server. When data is searched the query will be sent to a name server which will find the server(s) where the answer can be found for the particular query. I've also upvoted N.B.'s answer, I think he is mostly right.
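One simple form such a lookup could take is a directory that maps ranges of user ids to servers. This Python sketch is only an assumption about what the "rules" might look like; the class `ShardDirectory`, the range-based rule format, and the server names are invented for the illustration.

```python
import bisect


class ShardDirectory:
    """Map a user id to the server holding it, using range rules."""

    def __init__(self, rules):
        # rules: iterable of (first_user_id, server_name); each server
        # holds ids from its first id up to the next rule's first id.
        ordered = sorted(rules)
        self._starts = [start for start, _ in ordered]
        self._servers = [server for _, server in ordered]

    def locate(self, user_id):
        """Return the name of the server responsible for this user id."""
        index = bisect.bisect_right(self._starts, user_id) - 1
        if index < 0:
            raise KeyError(f"no server holds user {user_id}")
        return self._servers[index]
```

The application asks the directory first and then sends the query to the server it names, so adding capacity means adding a rule rather than changing application code.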
For lots of users, you should have a server with lots of memory and speed. Configure php.ini to allow more memory usage. A server with lots of users should have 4-12GB available. Also, save resources by closing the desktop environment. If you have this many users, you might want to consider a CDN and also make a database request queue.