I am doing a small website project. On one page there is a section where the client posts new updates; at any given time there will be a maximum of 5 to 6 posts in this section. I was going to create a MySQL database for the content, but I wonder if there is any way I could keep all the entries as XML files and use PHP to parse them. Is that possible?
Which one is the better option, MySQL or XML?
XML is a horrid piece of crap in my opinion. It's bloated and rather unpleasant to work with. However, it is a viable option as long as your number of entries and the amount of traffic stays small.
You can use SimpleXML to parse the XML, but performance is going to degrade as file size increases. MySQL, however, will handle quite a lot of data before performance becomes a concern, provided the schema is set up properly.
If you do use XML, you could always use a half-way solution: parse the file once, then store a serialized array of the result.
Though really, if you're going to store it in a file of some sort, I would suggest, in order: SQLite, serialized array, JSON, XML. (Depending on your situation that order may change.)
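To illustrate the half-way approach (parse the XML once, then keep a serialized array), here is a minimal sketch. The function name load_posts, the file paths and the <post>/<title>/<body> layout are assumptions for the example, not something from the question.

```php
<?php
// Hypothetical helper: load posts from an XML file, caching the parsed
// result as a serialized array so the XML is only re-parsed when it changes.
function load_posts($xmlPath, $cachePath)
{
    // Reuse the cache while it is at least as new as the XML source.
    if (is_file($cachePath) && filemtime($cachePath) >= filemtime($xmlPath)) {
        $cached = unserialize(file_get_contents($cachePath));
        if (is_array($cached)) {
            return $cached;
        }
    }

    $xml = simplexml_load_file($xmlPath);
    if ($xml === false) {
        return array(); // unreadable or malformed XML
    }

    // Assumed layout: <posts><post><title>..</title><body>..</body></post></posts>
    $posts = array();
    foreach ($xml->post as $post) {
        $posts[] = array(
            'title' => (string) $post->title,
            'body'  => (string) $post->body,
        );
    }

    file_put_contents($cachePath, serialize($posts), LOCK_EX);
    return $posts;
}
```

With only 5 or 6 posts the cache will not make a measurable difference; it mostly shows how little code the serialized-array option needs.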
If you abstract away the low-level details enough, you should be able to make adapters that can be used interchangeably, thus allowing you to easily switch out storage backends. (On a large project that would likely be infeasible, but it sounds like your data storage/retrieval will remain fairly simple.)
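A rough sketch of what such an adapter could look like. The names PostStore, JsonFilePostStore and latest_titles are made up for illustration; an XML, SQLite or MySQL adapter would implement the same interface.

```php
<?php
// Hypothetical storage contract; the rest of the site only talks to this.
interface PostStore
{
    public function all();
    public function add(array $post);
}

// File-backed adapter that keeps all posts as one JSON array in one file.
class JsonFilePostStore implements PostStore
{
    private $path;

    public function __construct($path)
    {
        $this->path = $path;
    }

    public function all()
    {
        if (!is_file($this->path)) {
            return array();
        }
        $posts = json_decode(file_get_contents($this->path), true);
        return is_array($posts) ? $posts : array();
    }

    public function add(array $post)
    {
        $posts = $this->all();
        $posts[] = $post;
        file_put_contents($this->path, json_encode($posts), LOCK_EX);
    }
}

// Calling code depends on the interface, not on the storage format.
function latest_titles(PostStore $store, $limit)
{
    $titles = array();
    foreach (array_slice(array_reverse($store->all()), 0, $limit) as $post) {
        $titles[] = $post['title'];
    }
    return $titles;
}
```

Switching backends then means writing one more class that implements PostStore; latest_titles and the page templates stay untouched.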
I need to implement a kind of fast caching mechanism for a PHP application. It works something like this: multiple node servers request data from a central server (via a JSON service). The node servers should cache the responses on the file system in some fast, efficient way. And that is the question: what would be the best solution for the storage part? I have a few options in mind: XML (I've heard it can be inefficient with many records), storing an array definition with the content in a PHP file, or just dumping an array of records to a file. Which of these would be most efficient for that scenario? Or maybe something else? I should note that it must be implemented in plain PHP >= 5.2, without any additional libraries or SQL.
Given the information you have provided, I would suggest simply dumping the JSON string to a file. This means there are no external libs or SQL engines required.
You could use XML if you want something that is "human readable" too; however, XML isn't as quick, and you would of course have to spend additional time generating the XML before you could store the data cache.
Reading is then just a case of getting the string from the file and running it through json_decode. If you only require parts of the data and not the entire lot, you could improve read performance by splitting the JSON object into blocks and writing them to individual files. This trades off some of the write speed (not too much) but makes reads faster.
Write speed could be made even better by writing to a partition configured with the ext2 filesystem.
However, unless you are working with large data sets and multiple cache files, there is no real reason to go to that extent of optimisation. Writing the JSON to a file as a string and reading it back should be more than good enough for you.
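A minimal sketch of the "dump the JSON string to a file" idea. The helper names cache_write and cache_read and the $maxAge parameter are assumptions for the example; only functions available in plain PHP 5.2+ are used.

```php
<?php
// Hypothetical helpers for caching a raw JSON response on disk.
function cache_write($file, $json)
{
    // Write to a temp file first, then rename it into place, so readers
    // never see a half-written cache file.
    $tmp = $file . '.' . uniqid('', true) . '.tmp';
    if (file_put_contents($tmp, $json) === false) {
        return false;
    }
    return rename($tmp, $file);
}

function cache_read($file, $maxAge)
{
    if (!is_file($file) || (time() - filemtime($file)) > $maxAge) {
        return null; // missing or stale, caller should refetch
    }
    // This sketch only accepts JSON objects/arrays, decoded to PHP arrays.
    $data = json_decode(file_get_contents($file), true);
    return is_array($data) ? $data : null;
}
```

A node would call cache_read first and only hit the central server (then cache_write the response) when it gets null back.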
You shouldn't generate XML files to cache content for only one application. Generating and parsing the XML adds overhead, and the result takes many more bytes.
Generating PHP-files is effective but there are some issues with it:
- possible parsing errors
- Could cache data twice (Filesystem-Cache of OS + PHP-Opcode-Cache)
I would prefer writing cache files as simple serialized PHP data, because it has low parsing overhead and is very efficient. You can also speed it up by using a binary serializer like igbinary or msgpack.
Btw: if you cache data from a remote service on different web nodes, I would recommend using a caching server like memcached ;)
For a very large site such as a Social Network (say Facebook), which method would you recommend for user accounts storage?
1) A single XML file for each type of feature, in the user's directory: basicinfo.xml, comments.xml, photos.xml, ...
2) MySQL, although I'm not sure how to organize this. Maybe separate tables for each feature? E.g. a table for comments, where the columns are id, from, message, time?
I know XML is not designed for storage, and PHP (this is the language I use) must read the entire XML file and store in memory before it is used.
But, here are the reasons why I prefer XML (but I may be wrong, please tell me if you disagree with any):
1) If I have user accounts' paths organized in this way
User ID 2342:
/users/00/00/00/00/00/00/00/23/42/
I think it's faster to find the Comments of a user by file path than seeking in a large database.
Also, if each feature is split in tables, each user profile will seek more than once, to display comments, photos, basic info, etc.
2) I heard MySQL is globally locked when writing to it. Is this true? If yes, I would rather lock a single file than everything.
3) Is MySQL "shared" between the cluster? I mean, if 1 disk gets full, will it "continue" on another? Or do I, as the programmer, have to manage it myself and create new databases on another disk? (note, I use Linux)
I realize the situation is about the same with XML files, but they are easier to split between disks, because the structure is split by account ID, not by feature as it would be in a database.
4) Note that I don't store each comment in comments.xml. I just note its attributes in an XML tag, and the messages are in separate text files named commentid.txt. Since each XML file should not be very large, there should not be problems with memory or time.
As for the problem of parsing the entire XML file, maybe I should think about using XMLReader/XMLWriter instead of SimpleXML/DOM? Or will that decrease performance a lot?
Thank you!
Facebook uses MySQL.
That being said, here's the long version:
I always say that XML is a data transfer technology, not a data storage technology, but not everyone agrees. XML is not designed to be used as a relational datastore. XML was first introduced in order to provide a standard way of transmitting data from system to system without giving access to the originating systems.
Since you are talking about a large application, I would strongly urge you to use MySQL (or another RDBMS). As your dataset grows, the XML will get slower and slower unless you always keep a fresh copy in memory and only read the XML files upon service reboot.
Using an XML database is reportedly more efficient in terms of conversion costs when you're constantly sending XML into and retrieving XML out of a database. The rationale is, when XML is the only transport syntax used to get things in and out of the DB, why squeeze everything through a layer of SQL abstraction and all those relational tables, foreign keys, and the like? It basically takes a parsing layer out of the application and brings it into the data engine - where it's probably going to work faster and more efficiently than the SQL alternative. Probably.
Depends heavily on the nature of your site. On the one hand the XML approach gives you a free pass on things like “SELECT * FROM $table where $table.id=$id” type queries. On the other hand...
For a very large site, in the worst-case scenario the data files end up pretty big too. If it is any kind of community site, this may easily happen for any account: go to any forum with a good number of old-guard members in its community and you'll find a couple of posters that have, say, 10K posts... This means you will wish for SQL-style result sets, which are implemented using a memory-efficient model rather than a speed-efficient one. To the end user, a 1s versus 1.1s response time is not that much of a deal; but to you, 1K simultaneous requests versus 1.5K or better definitely is.
Then there is the aspect that if you are mostly reading data, XML may be fine, if somewhat crude, for large data sets and DOM-based implementations. But if you are writing a lot, things become much, much worse. Caching of data is still possible, but giving ACID-like guarantees on these file transactions requires you to pretty much write your own database software.
And then there are storage requirements and the like, which mean you may need a distributed approach for storing your data. These kinds of setups are relatively well understood in the database world, and they bring a lot of interesting problems to the table (what do you do if a single disk fails? how do you know on which disk to find the data? how do you implement efficient caching?) that essentially amount to, again, writing your own mini-database software from scratch.
So for a very large site I think the hard technical requirements of performance at not too great a cost in terms of memory and also a certain reliability and not needing to reinvent 21 wheels at the same time means that your approach would not work that well. I think it is better suited to smallish read-only sites where you can afford to experiment with and pursue alternative routes, where you can easily make changes and roll them out across the entire site.
IME: An in-house application using a single XML file for persistence didn't stand up to use by a single user...
1) What you're suggesting is an XML file system with a manager application... There are XML databases, and there's been increasing support for storing XML within an RDBMS. You're looking at re-inventing the wheel...
That's besides the normalization that would come out of storing the data in an RDBMS, which would enforce referential integrity that XML will never do...
2) "Global locking" is without any contextual scope. No database I know of locks globally when writing; most support degrees of locking (table/row/etc, varies between vendors) for sake of retaining concurrency when directed to - not by default.
3) Without a database, data or actual users--being concerned about clustering is definitely premature optimization.
4) If the system crashes without having written the referential integrity to some sort of persistence that will survive the application being turned off, the data will be useless.
Dropping my lurker status to finally ask a question...
I need to know how I can improve on the performance of a PHP script that draws its data from XML files.
Some background:
I've already mapped the bottleneck to CPU - but want to optimize the script's performance before taking a hit on processor costs. Specifically, the most CPU-consuming part of the script is the XML loading.
The reason I'm using XML to store object data is that the data needs to be accessible via a browser Flash interface, and we want to provide fast user access in that area. The project is still in its early stages though, so if best practice would be to abandon XML altogether, that would be a good answer too.
Lots of data: Currently planning for roughly 100k objects, albeit usually small ones - and they must ALL be taken up into the script, with perhaps a few rare exceptions. The data set will only grow with time.
Frequent runs: Ideally, we'd run the script ~50k times an hour; realistically, we'd settle for ~1k/h runs. This coupled with data size makes performance optimization completely imperative.
I've already taken the optimization step of making several runs on the same data rather than loading it for each run, but it's still taking too long. The runs should generally use "fresh" data with the modifications done by users.
Just to clarify: is the data you're loading coming from XML files for processing in its current state and is it being modified before being sent to the Flash application?
It looks like you'd be better off using a database to store your data and pushing out XML as needed rather than reading it in XML first; if building the XML files gets slow you could cache files as they're generated in order to avoid redundant generation of the same file.
If the XML stays relatively static, you could cache it as a PHP array, something like this:
<xml><foo>bar</foo></xml>
is cached in a file as
<?php return array('foo' => 'bar');
It should be faster for PHP to just include the arrayified version of the XML.
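A small sketch of how that cache file could be produced, using var_export to write the array and include to read it back. The function name xml_to_php_cache is an assumption, and the conversion only handles one level of child elements, as in the <foo>bar</foo> example above.

```php
<?php
// Hypothetical helper: convert a flat XML document into a PHP file that
// returns an array, so later requests can simply include that file.
function xml_to_php_cache($xmlString, $cacheFile)
{
    $xml = simplexml_load_string($xmlString);
    if ($xml === false) {
        return false; // malformed XML, nothing written
    }

    $data = array();
    foreach ($xml->children() as $name => $child) {
        $data[$name] = (string) $child; // only one level of elements
    }

    // var_export produces valid PHP source for the array.
    $code = '<?php return ' . var_export($data, true) . ';';
    return file_put_contents($cacheFile, $code, LOCK_EX) !== false;
}
```

Reading it back is then just `$data = include $cacheFile;`. The cache file has to be regenerated whenever the XML changes.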
~1k/hour, 3600 seconds per hour, more than 3 runs a second (let alone the 50k/hour)...
There are many questions. Some of them are:
Does your PHP script need to read/process all records of the data source for each single run? If not, what kind of subset does it need (approximate size, criteria, ...)?
Same question for the flash application + who's sending the data? The php script? "Direct" request for the complete, static xml file?
What operations are performed on the data source?
Do you need some kind of concurrency mechanism?
...
And just because you want to deliver XML data to the Flash clients, it doesn't necessarily mean that you have to store XML data on the server. If, e.g., the clients only need a tiny subset of the available records, it's probably a lot faster not to store the data as XML but as something more suited to speed and "searchability", and then create the XML output of the subset on the fly, maybe assisted by some caching depending on what data the clients request and how and how much the data changes.
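As a rough illustration of creating the XML for a subset on the fly: the sketch below takes records from wherever they are really stored (a plain array stands in for that here) and emits XML only for the requested IDs. The function name, the record fields and the <objects>/<object> element names are assumptions.

```php
<?php
// Hypothetical: $records stands in for whatever the real data store returns.
function records_to_xml(array $records, array $wantedIds)
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $root = $doc->appendChild($doc->createElement('objects'));

    foreach ($records as $record) {
        if (!in_array($record['id'], $wantedIds, true)) {
            continue; // the client did not ask for this one
        }
        $node = $root->appendChild($doc->createElement('object'));
        $node->setAttribute('id', (string) $record['id']);
        // Text nodes are escaped automatically when the document is saved.
        $node->appendChild($doc->createTextNode($record['name']));
    }

    return $doc->saveXML();
}
```

The output string can be sent straight to the Flash client, and cached per subset if the same requests come in repeatedly.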
edit: Let's assume that you really, really need the whole dataset and need a continuous simulation. Then you might want to consider a continuous process that keeps the complete "world model" in memory and operates on this model on each run (world tick). This way at least you wouldn't have to load the data on each tick. But such a process is usually written in something other than PHP.
If I have 50,000-100,000 product skus with accompanying information, including specifications and descriptions, that needs to be updated on a regular basis (at least once a day), is XML the best way to go as a data interchange format? The application is written in PHP, and I'm thinking SimpleXML to PHP's native MySQL calls (as opposed to using application hooks to dump data into the appropriate location in the DB). The server will be Linux-based, and I will have full root access. I know this is a rather generic question, which is why I made it community wiki -- I'm looking for an overall approach that is considered best practice. If it matters the application is Magento.
You have to define the parameters of "best" for your given scenario.
XML is verbose, which means two things:
- You can supply a lot of detail about the data, including metadata
- File size is going to be big
The other advantage you gain with XML is more advanced parsing/selection "out-of-the-box" with tools like XPath.
But there are many other formats you could choose, each with its own advantages and disadvantages:
- Database Dump
- Data Interchange Format
- CSV
- JSON
- Serialized PHP
- And several others.
My point is, that you need to figure out what's important to your system (speed? character-set support? human-readability?) and choose a format that's going to be compatible for both sides.
The only real down side to XML is that it is very verbose. XML files are generally very large compared to other formats. The upside is that it is relatively easy to read (for people) and parse (for software). With only 100K records (without knowing the size of each record) I think I would go with XML.
JSON takes a lot less space than XML, although XML compresses very well. XML also has the advantage of a lot of mature libraries and tools.
If you exchange data with 3rd party sources you might want to validate their XML with a Schema. JSON does not have as established an equivalent.
Personally I end up using XML most of the time. If space is an issue I apply gzip compression to the XML data.
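For what it's worth, compressing the XML takes very little code. This is a minimal sketch with made-up helper names; it needs PHP's zlib extension, and gzdecode() requires PHP 5.4 or newer.

```php
<?php
// Compress an XML string before storing or sending it, and restore it later.
function pack_xml($xml)
{
    return gzencode($xml, 9); // 9 = strongest (and slowest) compression level
}

function unpack_xml($packed)
{
    return gzdecode($packed);
}
```

Repetitive markup such as product feeds tends to shrink a lot, since the same tag names occur over and over.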
I currently use XML as an import format on an e-commerce project. It currently has over 10,000 products, attributes and descriptions and will iterate over the data pretty quickly. I don't have any other choice in this matter, though.
Using SOAP would be a viable alternative to just receiving the raw XML (although I think this would add to the performance cost, as SOAP uses XML as its messaging format anyway). However, you can get your data as a native PHP type, such as an array, which you could pass directly to your DAL for insertion into the database, sidestepping the need to construct a SimpleXML object.
So I'm going to be working on a homemade blog system in PHP and I was wondering which way of storing data is the fastest. I could go in the MySQL direction, or I could go with my own little way of doing it, which is storing all of the information (encoded in JSON) in files.
Which way would be the fastest, MySQL or JSON files?
For a small, single user 'database', a file system would likely be quicker - as the size and complexity grows, a database server like MySQL or SQL Server is hard to beat.
I would definitely choose a DB option (as you need to be able to search and index stuff). But that does not mean you need a fully realized separate DB service.
MySQL is definitely the more scalable solution.
But the downside is you need to set up and maintain a separate service.
On the other hand, there are DBs that are file-based and still give you access with standard SQL; SQLite (SQLite.org) comes to mind. You get the advantages of SQL but you do not need to maintain a separate service. The disadvantage is that they are not as scalable.
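To give an idea of how little setup the SQLite route needs, here is a sketch using PDO. It assumes the pdo_sqlite extension is available; the posts table layout and the function names are made up for the example.

```php
<?php
// Hypothetical blog storage on SQLite through PDO (needs pdo_sqlite).
function open_blog_db($path)
{
    $db = new PDO('sqlite:' . $path);
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec(
        'CREATE TABLE IF NOT EXISTS posts (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            body TEXT NOT NULL,
            created_at INTEGER NOT NULL
        )'
    );
    return $db;
}

function add_post(PDO $db, $title, $body)
{
    // Prepared statement, so titles and bodies need no manual escaping.
    $stmt = $db->prepare(
        'INSERT INTO posts (title, body, created_at) VALUES (?, ?, ?)'
    );
    $stmt->execute(array($title, $body, time()));
    return (int) $db->lastInsertId();
}

function recent_posts(PDO $db, $limit)
{
    $stmt = $db->prepare(
        'SELECT id, title, body FROM posts ORDER BY id DESC LIMIT ?'
    );
    $stmt->bindValue(1, (int) $limit, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

The whole database is a single file at $path, and because the code goes through PDO, moving to MySQL later is mostly a matter of changing the connection string and the CREATE TABLE statement.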
I would choose a MySQL database - simply because it's easier to manage.
JSON is not really a format for storage; it's for sending data to JavaScript. If you want to store data in files, look into XML or serialized PHP (which I suspect is what you are after, rather than JSON).
Forgive me if this doesn't answer your question very directly, but since it is a homecooked blog system is it really worth spending time thinking about what storage backend right now is faster?
You're not going to be looking at 10,000 concurrent users from day 1, and it doesn't sound like it will need to scale to any meaningful degree in the foreseeable future.
Why not just stick with MySQL as a sensible choice rather than a fast one? If you really want some sense that you designed for speed maybe bolt sqlite on instead.
Since you are thinking you may not have the need for a complex relational structure, this might be a fun opportunity to try something more down the middle.
Check out CouchDB, it is a document-based, schema free database (yet still indexable). The database is made of documents that contain named fields (think key-value pairs).
Have fun....
Though I don't know for certain, it seems to me that a MySQL database would be a lot faster, especially as the amount of data gets larger and larger.
Also, using MySQL with PHP is super easy, especially if you use an abstraction class like ezSQL. ezSQL makes working with a database really simple and I think you'd be creating more unnecessary work for yourself by going the home-brewed JSON direction.
I've done both. I like files for very simple problems and databases for complicated problems.
For file solutions, note these problems as the number of files increases:
1) Much more disk space is used than you might expect, because even tiny files use up a whole block. Blocks are fairly large on filesystems which support large drives.
2) Most filesystems get very slow when the number of files in a directory gets very large. My solution to this (assuming the names of the files are reasonably spread out across the alphabet) is to create a directory consisting of the first two letters of the filename. Thus, the file "animal.txt" would be found at an/animal.txt. This works surprisingly well. If your filenames are not reasonably well distributed across the alphabet, use some sort of hashing function to create the directories. Sounds a little crazy, but this can work very, very well, and I've used it for very fast solutions with tens of thousands of files.
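The directory trick from point 2 could look something like this. The function name sharded_path and the choice of MD5 for the hashed variant are assumptions; any reasonably uniform hash would do.

```php
<?php
// Hypothetical helper: map a file name to a two-character subdirectory.
function sharded_path($baseDir, $filename, $useHash = false)
{
    // Either the first two letters of the name, or the first two hex
    // characters of its MD5 hash when names are unevenly distributed.
    $prefix = $useHash
        ? substr(md5($filename), 0, 2)
        : strtolower(substr($filename, 0, 2));

    return rtrim($baseDir, '/') . '/' . $prefix . '/' . $filename;
}
```

For example, sharded_path('data', 'animal.txt') gives data/an/animal.txt. The subdirectory still has to be created (mkdir with the recursive flag) before the first write.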
But the file solutions really only fit sometimes. Unless you have a great reason to go with files, use a database.
This is really cool. It's a PHP class that controls a flat-file database with queries http://www.fsql.org/index.php
For blogs, I recommend caching the pages because blogs usually only have static content. This way, the queries only get run once while caching. You can update the cached pages when a new blog post is added.
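A bare-bones sketch of that kind of page cache. The function names are made up, and $render stands for whatever callable builds the page HTML (running the queries, filling the template).

```php
<?php
// Hypothetical page cache: $render is any callable that returns the HTML.
function cached_page($cacheFile, $render)
{
    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile); // cache hit, no queries run
    }
    $html = call_user_func($render);
    file_put_contents($cacheFile, $html, LOCK_EX);
    return $html;
}

// Call this when a new post is added so the next request rebuilds the page.
function invalidate_page($cacheFile)
{
    if (is_file($cacheFile)) {
        unlink($cacheFile);
    }
}
```

With this in place the storage backend is only touched on the first request after each new post, which makes the MySQL-versus-files speed question mostly moot for a blog.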