PHP custom data service cache - php

I need to implement a fast caching mechanism for a PHP application. It works like this: multiple node servers request data from a central server (via a JSON service). Each node server should cache the responses on the file system in a fast, efficient way. And that is the question: what is the best solution for the storage part? I have some candidates: XML (I've heard it can be inefficient with many records), storing an array definition with its content in a PHP file, or just dumping an array of records to a file. Which of these would be most efficient for that scenario? Or maybe something else? I should note that it must be implemented in plain PHP >= 5.2 without any additional libraries or SQL.

Given the information you have provided, I would suggest simply dumping the JSON string to a file. This means no external libs or SQL engines are required.
You could use XML if you want something that is "human readable" too; however, XML isn't as quick, and you would have to spend additional time generating the XML before you could store the cached data.
Reading is then just a case of getting the string from the file and running it through json_decode. If you only need parts of the data rather than the whole lot, you could improve read performance by splitting the JSON object into blocks and writing each block to its own file. This trades off some write speed (not too much) for better read speed.
Write speed could be made even better by writing to a partition configured with the ext2 filesystem, which does no journaling.
However, unless you are working with large data sets and multiple cache files, there is no real reason to go to that extent of optimisation; writing the JSON to a file as a string and reading it back should be more than good enough for you.
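A minimal sketch of the write-then-json_decode approach described above. The function names, the file path and the 300-second lifetime are assumptions made up for illustration; only json_encode, json_decode and the standard file functions come from PHP itself.

```php
<?php
// Hypothetical helpers: the names and the cache path are illustrative only.

// Store a decoded service response as a JSON string.
function cache_write($file, $data)
{
    // LOCK_EX keeps two concurrent writers from interleaving their output.
    return file_put_contents($file, json_encode($data), LOCK_EX) !== false;
}

// Return the cached data, or null if the file is missing, too old or unparsable.
function cache_read($file, $maxAge)
{
    if (!is_file($file) || time() - filemtime($file) > $maxAge) {
        return null;
    }
    $json = file_get_contents($file);
    if ($json === false) {
        return null;
    }
    return json_decode($json, true); // true = decode objects as associative arrays
}

$file = sys_get_temp_dir() . '/node_cache_example.json';
cache_write($file, array('id' => 7, 'name' => 'example'));
$data = cache_read($file, 300); // accept entries up to 300 seconds old
```

On a null result the node would ask the central server again and call cache_write with the fresh response. Splitting the data into blocks, as suggested above, would just mean one such file per block.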

You shouldn't generate XML files to cache content for only one application. Generating and parsing the XML is overhead, and it takes many more bytes.
Generating PHP files is effective, but there are some issues with it:
- possible parsing errors
- the data could be cached twice (the OS filesystem cache plus the PHP opcode cache)
I would prefer writing cache files as simple serialized PHP data because the format is cheap to parse and very effective. You can also speed it up by using a binary serializer like igbinary or msgpack.
Btw: if you cache data from a remote service on several web nodes, I would recommend using a caching server like memcached ;)
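A rough sketch of the serialized-file variant, using only built-in functions. The names cache_store and cache_load are made up here. Writing to a temporary file and renaming it is an extra precaution not mentioned above, so that a reader never sees a half-written file.

```php
<?php
// Hypothetical helpers: names are illustrative only.

// Serialize a value into a temporary file, then move it into place.
function cache_store($file, $value)
{
    $tmp = $file . '.' . uniqid('', true) . '.tmp';
    if (file_put_contents($tmp, serialize($value)) === false) {
        return false;
    }
    // rename() within one filesystem replaces the target in a single step on POSIX systems.
    return rename($tmp, $file);
}

// Returns the cached value, or false if the file is missing or corrupt.
// Note: a cached boolean false cannot be told apart from a miss.
function cache_load($file)
{
    $raw = @file_get_contents($file);
    if ($raw === false) {
        return false;
    }
    return unserialize($raw);
}
```

Only unserialize files your own code has written; unserialize on untrusted input is unsafe.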

Related

Very small, persistent data structure

I'd like to make a very, very small, but persistent data structure that I can reference quickly server-side, and I'm not sure how.
Basically, what I want is an array that holds little structures that hold 3-10 strings in them. The array would be of size somewhere from 50-5,000 (expandable).
I was considering using a database, but that seems like overkill in this case. I was considering using a file that held JSON, but that just doesn't seem right (I think my server would have to load the file, parse the file, then return every time the cgi is called).
I'd like to be able to have PHP get something out of this persistent data structure in constant, fast time every time it's called.
I'm currently using just vanilla Apache and PHP.
APC can store that data even without a file: see apc_fetch and apc_store. The only problem is that the data is restricted to one server, so as soon as you have clusters or multiple servers, they don't share the data. (http://www.php.net/manual/de/ref.apc.php)
If multiple servers are involved, memcached or redis are worth a look. Redis has built-in list and hash types.
Edit:
Check whether json_encode/json_decode are as fast as serialize/unserialize for your scenario, or even faster; jsonlib can be really fast. JSON drops some PHP-specific data (object class names etc.), which you probably don't need.
Edit2: If the server crashes, the plain APC solution will lose all data. That is why you should also write it to a file if needed. APC lives inside the Apache process, so it will be faster than memcached or redis.
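A sketch of combining the two, assuming the APC extension may or may not be loaded. The function names and the 'lookup:' key prefix are invented for illustration. The file is the durable copy and APC is only a faster front for it.

```php
<?php
// Hypothetical helpers: names and the key prefix are illustrative only.

// Persist the table to disk and, if APC is available, keep a copy in shared memory.
function save_table($file, $table)
{
    $ok = file_put_contents($file, serialize($table), LOCK_EX) !== false;
    if ($ok && function_exists('apc_store')) {
        apc_store('lookup:' . $file, $table);
    }
    return $ok;
}

// Try APC first; fall back to the file (e.g. after a restart) and re-prime APC.
function load_table($file)
{
    $key = 'lookup:' . $file;
    $haveApc = function_exists('apc_fetch');
    if ($haveApc) {
        $hit = apc_fetch($key, $found);
        if ($found) {
            return $hit;
        }
    }
    $table = is_file($file) ? unserialize(file_get_contents($file)) : array();
    if (!is_array($table)) {
        $table = array(); // unreadable file: start empty
    }
    if ($haveApc) {
        apc_store($key, $table);
    }
    return $table;
}
```

Without APC the code still works, it just reads the file on every request.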

How can I create a dynamic webpage using PHP and XML?

I am doing a small website project. On one page there is a section where the client posts new updates; at any given time there will be a maximum of 5 to 6 posts in this section. I was trying to create a MySQL database for the content, but I wonder if there is any way I could keep all the entries as XML files and use PHP to parse them. Is it possible?
Which one is the better option MySQL or XML?
XML is a horrid piece of crap in my opinion. It's bloated and rather unpleasant to work with. However, it is a viable option as long as your number of entries and the amount of traffic stays small.
You can use SimpleXML to parse the XML, but performance is going to degrade as file size increases. MySQL, however, will handle quite a lot of data before performance becomes a concern, provided the schema is properly set up.
If you do use XML, you could always use a half-way XML solution. Like parse the file once, then store a serialized array of it.
Though really, if you're going to store it in a file of some sort, I would suggest, in order: SQLite, serialized array, JSON, XML. (Depending on your situation that order may change.)
If you abstract away the low level details enough, you should be able to make adapters that can be used interchangeably, thus allowing you to easily switch out storage backends. (On a large project, that would likely be unfeasible, but it sounds like your data storage/retrieval will remain fairly simple.)
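To make the adapter idea concrete, here is a small sketch. The interface, the class names and add_post are all hypothetical; the limit of 6 comes from the question. The calling code only ever sees the interface, so swapping the backend is a one-line change.

```php
<?php
// Hypothetical interface: every backend loads and saves the whole list of posts.
interface PostStore
{
    public function load();
    public function save(array $posts);
}

class JsonPostStore implements PostStore
{
    private $file;
    public function __construct($file) { $this->file = $file; }
    public function load()
    {
        if (!is_file($this->file)) return array();
        $posts = json_decode(file_get_contents($this->file), true);
        return is_array($posts) ? $posts : array();
    }
    public function save(array $posts)
    {
        return file_put_contents($this->file, json_encode($posts), LOCK_EX) !== false;
    }
}

class SerializedPostStore implements PostStore
{
    private $file;
    public function __construct($file) { $this->file = $file; }
    public function load()
    {
        if (!is_file($this->file)) return array();
        $posts = unserialize(file_get_contents($this->file));
        return is_array($posts) ? $posts : array();
    }
    public function save(array $posts)
    {
        return file_put_contents($this->file, serialize($posts), LOCK_EX) !== false;
    }
}

// Application code depends only on PostStore, not on the file format.
function add_post(PostStore $store, $title, $body, $max = 6)
{
    $posts = $store->load();
    array_unshift($posts, array('title' => $title, 'body' => $body)); // newest first
    $posts = array_slice($posts, 0, $max);                            // keep the latest $max
    $store->save($posts);
    return $posts;
}
```

An SQLite or XML backend would be one more class implementing the same two methods.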

XML vs MySQL for Large Sites

For a very large site such as a Social Network (say Facebook), which method would you recommend for user accounts storage?
1) Single XML files for each type of features, on the user's directory: basicinfo.xml, comments.xml, photos.xml, ...
2) MySQL, although I'm not sure how to organize it. Maybe separate tables for each feature? E.g. a table for Comments, where the columns are id, from, message, time?
I know XML is not designed for storage, and PHP (this is the language I use) must read the entire XML file and store in memory before it is used.
But, here are the reasons why I prefer XML (but I may be wrong, please tell me if you disagree with any):
1) If I have user accounts' paths organized in this way
User ID 2342:
/users/00/00/00/00/00/00/00/23/42/
I think it's faster to find the Comments of a user by file path than seeking in a large database.
Also, if each feature is split in tables, each user profile will seek more than once, to display comments, photos, basic info, etc.
2) I heard MySQL is globally locked when writing to it. Is this true? If so, I'd rather lock a single file than everything.
3) Is MySQL "shared" between the cluster? I mean, if 1 disk gets full, will it "continue" on another? Or do I, as the programmer, have to manage it myself and create new databases on another disk? (note, I use Linux)
It is ok that it is about the same by using XML files, but it is easier to split between disks, because structure is split by account IDs, not by feature as it would be in a database.
4) Note that I don't store each comment in comments.xml. I just record their attributes in each XML tag, and the messages are in separate text files named commentid.txt. Since each XML file should not be very large, there should not be problems with memory/time.
As for the problem of parsing the entire XML, maybe I should think about using XMLReader/XMLWriter instead of SimpleXML/DOM? Or will that decrease performance a lot?
Thank you!
Facebook uses MySQL.
That being said. Here's the long version:
I always say that XML is a data transfer technology, not a data storage technology, but not everyone agrees. XML is not designed to be used as a relational datastore. XML was first introduced to provide a standard way of transmitting data from system to system without giving access to the originating systems.
Since you are talking about a large application, I would strongly urge you to use MySQL (or another RDBMS): as your dataset grows, the XML will get slower and slower unless you always keep a fresh copy in memory and only read the XML files on service reboot.
Using an XML database is reportedly more efficient in terms of conversion costs when you're constantly sending XML into and retrieving XML out of a database. The rationale is, when XML is the only transport syntax used to get things in and out of the DB, why squeeze everything through a layer of SQL abstraction and all those relational tables, foreign keys, and the like? It basically takes a parsing layer out of the application and brings it into the data engine - where it's probably going to work faster and more efficiently than the SQL alternative. Probably.
Depends heavily on the nature of your site. On the one hand the XML approach gives you a free pass on things like “SELECT * FROM $table where $table.id=$id” type queries. On the other hand...
For a very large site, in the worst-case scenario the data files end up pretty big too. On any kind of community site this can easily happen for any account: go to any forum with a fair number of old-guard members and you'll find a couple of posters with, say, 10K posts. This means you will wish for SQL-style result sets, which are implemented using a memory-efficient model rather than a speed-efficient one. To the end user, a 1s versus 1.1s response time is not much of a deal; but to you, handling 1K simultaneous requests versus 1.5K or better definitely is.
Then there is the aspect that if you are mostly reading data XML may be fine if somewhat crude for large data sets and DOM based implementations. But if you are writing a lot, things become much much worse. Caching of data is still possible, but giving ACID like guarantees on these file transactions requires you to pretty much write your own database software.
And then there is storage requirements and such like which mean you may need a distributed approach for storing your data. These kind of setups are relatively well understood in the database world, and they bring a lot of interesting problems with them to the table (like what do you do if a single disk fails?, how do you know on what disk to find the data and how do you implement efficient caching?) that essentially amount to again writing your own mini-database software from scratch.
So for a very large site I think the hard technical requirements of performance at not too great a cost in terms of memory and also a certain reliability and not needing to reinvent 21 wheels at the same time means that your approach would not work that well. I think it is better suited to smallish read-only sites where you can afford to experiment with and pursue alternative routes, where you can easily make changes and roll them out across the entire site.
IME: An in-house application using a single XML file for persistence didn't stand up to use by a single user...
1) What you're suggesting is an XML file system with a manager application... There are XML databases, and there's been increasing support for storing XML within RDBMSs. You're looking at re-inventing the wheel...
That's besides the normalization that would come out of storing the data in a RDBMS, which would enforce referential integrity that XML will never do...
2) "Global locking" is without any contextual scope. No database I know of locks globally when writing; most support degrees of locking (table/row/etc, varies between vendors) for sake of retaining concurrency when directed to - not by default.
3) Without a database, data or actual users--being concerned about clustering is definitely premature optimization.
4) If the system crashes without having written the referential integrity to some sort of persistence that will survive the application being turned off, the data will be useless.

Quick way to do data lookup in PHP

I have a data table with 600,000 records that is around 25 megabytes large. It is indexed by a 4 byte key.
Is there a way to find a row in such dataset quickly with PHP without resorting to MySQL?
The website in question is mostly static with minor PHP code and no database dependencies and therefore fast. I would like to add this data without having to use MySQL if possible.
In C++ I would memory map the file and do a binary search in it. Is there a way to do something similar in PHP?
PHP (at least 5.3) should already be optimized to use mmap if it's available and it is likely advantageous. Therefore, you can use the same strategy you say you would use with C++:
- Open a stream with fopen
- Move around for your binary search with fseek and fread
EDIT: actually, it seems to use mmap only in some other circumstances like file_get_contents. It shouldn't matter, but you can also try file_get_contents.
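A sketch of that fopen/fseek/fread binary search. The record layout is an assumption, since the question only says the key is 4 bytes: here every record is 16 bytes (a 4-byte big-endian unsigned key followed by 12 bytes of payload) and the file is sorted by key. The constant and function names are made up for illustration.

```php
<?php
// Assumed layout: fixed-width records sorted by key, key stored as 4-byte big-endian.
define('KEY_SIZE', 4);
define('RECORD_SIZE', 16); // KEY_SIZE + 12 bytes of payload (assumption)

// Returns the payload for $key, or null if the key is not in the file.
function find_record($path, $key)
{
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        return null;
    }
    clearstatcache();
    $lo = 0;
    $hi = (int) (filesize($path) / RECORD_SIZE) - 1;
    $found = null;
    while ($lo <= $hi) {
        $mid = (int) (($lo + $hi) / 2);
        fseek($fh, $mid * RECORD_SIZE);
        $record = fread($fh, RECORD_SIZE);
        if ($record === false || strlen($record) < RECORD_SIZE) {
            break; // truncated file
        }
        $parts = unpack('Nkey', $record); // 'N' = unsigned 32-bit, big-endian
        if ($parts['key'] == $key) {
            $found = substr($record, KEY_SIZE);
            break;
        }
        if ($parts['key'] < $key) {
            $lo = $mid + 1;
        } else {
            $hi = $mid - 1;
        }
    }
    fclose($fh);
    return $found;
}
```

With 600,000 records the loop needs at most 20 seek-and-read steps per lookup. On a 32-bit PHP build, keys above 2^31 would come back from unpack as negative numbers, so this assumes a 64-bit build or smaller keys.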
I would suggest memcachedb or something similar. If you are going to handle this entirely in PHP the script will have to read the entire file/datastruct for each request. It's not possible to do this in reasonable time dynamically.
In C++, would you stop and restart the application each time a user wanted to view the file in a different way, loading and unloading the file each time? Probably not, but that is how PHP differs from a long-running application and from application programming languages.
PHP has tools to help you deal with the environment teardown/buildup. These tools are the database and/or keyed caching utilities like memcache. Use the right tool for the right job.

How to improve on PHP's XML loading time?

Dropping my lurker status to finally ask a question...
I need to know how I can improve on the performance of a PHP script that draws its data from XML files.
Some background:
I've already mapped the bottleneck to CPU - but want to optimize the script's performance before taking a hit on processor costs. Specifically, the most CPU-consuming part of the script is the XML loading.
The reason I'm using XML to store object data is that the data needs to be accessible via a browser Flash interface, and we want to provide fast user access in that area. The project is still in its early stages though, so if best practice would be to abandon XML altogether, that would be a good answer too.
Lots of data: Currently plotting for roughly 100k objects, albeit usually small ones - and they must ALL be taken up into the script, with perhaps a few rare exceptions. The data set will only grow with time.
Frequent runs: Ideally, we'd run the script ~50k times an hour; realistically, we'd settle for ~1k/h runs. This coupled with data size makes performance optimization completely imperative.
Already taken an optimization step of making several runs on the same data rather than loading it for each run, but it's still taking too long. The runs should generally use "fresh" data with the modifications done by users.
Just to clarify: is the data you're loading coming from XML files for processing in its current state and is it being modified before being sent to the Flash application?
It looks like you'd be better off using a database to store your data and pushing out XML as needed rather than reading it in XML first; if building the XML files gets slow you could cache files as they're generated in order to avoid redundant generation of the same file.
If the XML stays relatively static, you could cache it as a PHP array, something like this:
<xml><foo>bar</foo></xml>
is cached in a file as
<?php return array('foo' => 'bar');
It should be faster for PHP to just include the arrayified version of the XML.
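One way to produce such a file is var_export, which prints a value as valid PHP source. This is a sketch with made-up function names; the array is assumed to come from whatever parses the XML once.

```php
<?php
// Hypothetical helpers: names are illustrative only.

// Write $data as a PHP file that returns the array when included.
function write_php_cache($file, array $data)
{
    $code = '<?php return ' . var_export($data, true) . ';';
    return file_put_contents($file, $code, LOCK_EX) !== false;
}

// Load the cached array, or null if no cache file exists yet.
function read_php_cache($file)
{
    if (!is_file($file)) {
        return null;
    }
    return include $file; // include evaluates to the file's return value
}
```

For the example above, write_php_cache($file, array('foo' => 'bar')) followed by read_php_cache($file) gives back array('foo' => 'bar'). If an opcode cache is active, a rewritten file may not be picked up immediately, depending on its revalidation settings.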
~1k/hour at 3600 seconds per hour is roughly one run every 3.6 seconds (and 50k/hour would be about 14 runs a second)...
There are many questions. Some of them are:
Does your PHP script need to read/process all records of the data source for each single run? If not, what kind of subset does it need (approximate size, criteria, ...)?
Same question for the flash application + who's sending the data? The php script? "Direct" request for the complete, static xml file?
What operations are performed on the data source?
Do you need some kind of concurrency mechanism?
...
And just because you want to deliver XML data to the Flash clients, it doesn't necessarily mean that you have to store XML data on the server. If, e.g., the clients only need a tiny subset of the available records, it is probably a lot faster not to store the data as XML but as something more suited to speed and "searchability", and then create the XML output of the subset on the fly, maybe assisted by some caching, depending on what data the clients request and how often and how much the data changes.
edit: Let's assume that you really,really need the whole dataset and need a continuous simulation. Then you might want to consider a continuous process that keeps the complete "world model" in memory and operates on this model on each run (world tick). This way at least you wouldn't have to load the data on each tick. But such a process is usually written in something else than php.
