Dropping my lurker status to finally ask a question...
I need to know how I can improve the performance of a PHP script that draws its data from XML files.
Some background:
I've already mapped the bottleneck to CPU - but want to optimize the script's performance before taking a hit on processor costs. Specifically, the most CPU-consuming part of the script is the XML loading.
The reason I'm using XML to store object data is that the data needs to be accessible via a browser-based Flash interface, and we want to provide fast user access in that area. The project is still in its early stages though, so if best practice would be to abandon XML altogether, that would be a good answer too.
Lots of data: currently planning for roughly 100k objects, albeit usually small ones - and they must ALL be loaded into the script, with perhaps a few rare exceptions. The data set will only grow with time.
Frequent runs: Ideally, we'd run the script ~50k times an hour; realistically, we'd settle for ~1k/h runs. This coupled with data size makes performance optimization completely imperative.
I've already taken one optimization step: making several runs on the same loaded data rather than reloading it for each run, but it's still taking too long. The runs should generally use "fresh" data that includes the modifications made by users.
Just to clarify: is the data you're loading coming from XML files for processing in its current state and is it being modified before being sent to the Flash application?
It looks like you'd be better off using a database to store your data and pushing out XML as needed rather than reading it in XML first; if building the XML files gets slow you could cache files as they're generated in order to avoid redundant generation of the same file.
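A rough sketch of what that could look like - the PDO connection details, table and column names, and the cache lifetime here are all made up for illustration:
<?php
// Hypothetical sketch: build the XML from a database table and cache the result.
$cacheFile = '/tmp/objects.xml';
if (!file_exists($cacheFile) || filemtime($cacheFile) < time() - 300) {
    $pdo  = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'pass');
    $doc  = new DOMDocument('1.0', 'UTF-8');
    $root = $doc->appendChild($doc->createElement('objects'));
    foreach ($pdo->query('SELECT id, name, value FROM objects') as $row) {
        $node = $root->appendChild($doc->createElement('object'));
        $node->setAttribute('id', $row['id']);
        $node->appendChild($doc->createElement('name', $row['name']));
        $node->appendChild($doc->createElement('value', $row['value']));
    }
    $doc->save($cacheFile);   // regenerate at most every five minutes
}
readfile($cacheFile);         // serve the cached copy to the Flash client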
If the XML stays relatively static, you could cache it as a PHP array, something like this:
<xml><foo>bar</foo></xml>
is cached in a file as
<?php return array('foo' => 'bar');
It should be faster for PHP to just include the arrayified version of the XML.
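A minimal sketch of that caching step, assuming a data.xml file and the quick json_encode/json_decode trick for turning SimpleXML into an array:
<?php
// Hypothetical sketch: keep an includable PHP-array cache next to the XML.
$xmlFile   = 'data.xml';
$cacheFile = 'data.cache.php';
if (!file_exists($cacheFile) || filemtime($cacheFile) < filemtime($xmlFile)) {
    $xml  = simplexml_load_file($xmlFile);
    $data = json_decode(json_encode($xml), true);   // quick-and-dirty XML -> array
    file_put_contents($cacheFile, '<?php return ' . var_export($data, true) . ';');
}
$data = include $cacheFile;   // much cheaper than re-parsing the XML on every run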
~1k/hour over the 3,600 seconds in an hour is a run roughly every 3.6 seconds - and the 50k/hour target would be around 14 runs a second...
There are many questions. Some of them are:
Does your PHP script need to read/process all records of the data source on every single run? If not, what kind of subset does it need (approximate size, selection criteria, ...)?
Same question for the Flash application - and who sends it the data? The PHP script? A "direct" request for the complete, static XML file?
What operations are performed on the data source?
Do you need some kind of concurrency mechanism?
...
And just because you want to deliver XML data to the Flash clients doesn't necessarily mean that you have to store XML data on the server. If, e.g., the clients only need a tiny subset of the available records, it's probably a lot faster not to store the data as XML but in something more suited to speed and "searchability", and then create the XML output for the subset on the fly, perhaps assisted by some caching depending on what data the clients request and how much/how often the data changes.
edit: Let's assume that you really, really need the whole dataset and need a continuous simulation. Then you might want to consider a continuous process that keeps the complete "world model" in memory and operates on that model on each run (world tick). That way you at least wouldn't have to load the data on each tick. But such a process is usually written in something other than PHP.
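Very roughly, and only to illustrate the idea (the file names are made up, and as said above you'd probably use something other than PHP for a real world-tick loop):
<?php
// Rough illustration: a long-running CLI worker that loads the data set once
// and then only applies fresh user modifications on each tick.
$world = include 'data.cache.php';   // the expensive load happens exactly once
while (true) {
    // Pretend the web frontend appends user edits to changes.json.
    $changes = json_decode((string) @file_get_contents('changes.json'), true) ?: array();
    foreach ($changes as $id => $fields) {
        $existing   = isset($world[$id]) ? $world[$id] : array();
        $world[$id] = array_merge($existing, $fields);
    }
    // ... run the simulation tick against $world here ...
    sleep(1);   // roughly one tick per second
}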
Related
The application I am working on needs to obtain a dataset of around 10 MB at most twice an hour. We use that dataset to display paginated results on the site; a simple search by one of the object properties should also be possible.
Currently we are thinking about two different ways to implement this:
1.) Store the raw JSON dataset in the database or in a file on the file system, read it and loop over it to display results whenever we need to.
2.) Store the JSON dataset in a relational MySQL table, query the results and loop over them whenever we need to display them.
Replacing/refreshing the results has to be done multiple times per hour, as I said.
Both ways have cons. I am trying to choose the one that is less evil overall. Reading 10 MB into memory is not a lot; on the other hand, rewriting a table a few times an hour could produce conflicts, in my opinion.
My concern regarding 1.) is how safe the app will be if we read 10 MB into memory all the time. What will happen if multiple users do this at the same point in time - is this something to worry about, or is PHP able to handle it in the background?
What do you think it will be best for this use case?
Thanks!
When php runs on a web server (as it usually does) the server starts new php processes on demand when they're needed to handle concurrent requests. A powerful web server may allow fifty or so php processes. If each of them is handling this large data set, you'll need to have enough RAM for fifty copies. And, you'll need to load that data somehow for each new request. Reading 10mb from a file is not an overwhelming burden unless you have some sort of parsing to do. But it is a burden.
As it starts to handle each request, php offers a clean context to the programming environment. php is not good at maintaining in-RAM context from one request to the next. You may be able to figure out how to do it, but it's a dodgy solution. If you're running on a server that's shared with other web applications -- especially applications you don't trust -- you should not attempt to do this; the other applications will have access to your in-RAM data.
You can control the concurrent processes with Apache or nginx configuration settings, and restrict it to five or ten copies of php. But if you have a lot of incoming requests, those requests get serialized and they will slow down.
Will this application need to scale up? Will you eventually need a pool of web servers to handle all your requests? If so, the in-RAM solution looks worse.
Does your JSON data look like a big array of objects? Do most of the objects in that array have the same elements as each other? If so, that maps naturally onto a SQL table. You can make a table in which the columns correspond to the elements of your objects. Then you can use SQL to avoid touching every row -- every element of the array -- every time you display or update data.
(The same sort of logic applies to Mongo, Redis, and other ways of storing your data.)
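As a hedged sketch of the SQL route - the items table, its columns and the refresh cadence are assumptions, not something from your question:
<?php
// Hypothetical sketch: load the JSON objects into MySQL a few times per hour,
// then page/search with SQL instead of decoding 10 MB on every request.
$pdo   = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'pass');
$items = json_decode(file_get_contents('dataset.json'), true);
$pdo->beginTransaction();
$pdo->exec('DELETE FROM items');   // full refresh inside one transaction
$stmt = $pdo->prepare('INSERT INTO items (id, name, price) VALUES (?, ?, ?)');
foreach ($items as $item) {
    $stmt->execute(array($item['id'], $item['name'], $item['price']));
}
$pdo->commit();
// Display: only touch the rows you actually show.
$page = $pdo->query('SELECT id, name, price FROM items ORDER BY name LIMIT 20 OFFSET 0')->fetchAll();
Doing the refresh inside a transaction (or loading into a staging table and renaming it) also addresses the earlier worry about conflicts while the table is being rewritten.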
I've got a read-heavy website backed by a MySQL database. I also have some small "auxiliary" information (it fits in an array of 30-40 elements as of now), hierarchically organized, which gets updated slowly, about 4-5 times per year. It's not strictly a configuration file, since this information is about the subject of the website rather than its functioning, but it still behaves like one. Until now I've just used a static PHP file containing an array of this info, but now I need a way to update it via a backend CMS from my admin panel.
I thought of a simple CMS that allows the admin to create/edit/delete entries (a rare, periodic job) and then writes a static JSON file to be used by the page-building scripts instead of pulling this information from the DB.
The question is: given the heavy-read nature of the website, is it better to read a rarely updated JSON file on the server when building pages or just retrieve raw info from the database for every request?
I just used a static PHP
This sounds like a contradiction to me. Either static, or PHP.
given the heavy-read nature of the website, is it better to read a rarely updated JSON file on the server when building pages or just retrieve raw info from the database for every request?
Cache was invented for a reason :) Same with your case - it all depends on how often the data changes vs. how often it is read. If the data changes once a day and stays static for 100k reads during the day, then not caching it or not serving it from a flat file would simply be stupid. If the data changes once a day and you average 20 reads per day, then perhaps returning the data from code on each request would be less stupid, but on the other hand, 19 of those 20 requests could be served from cache anyway, so... If you can, serve from a flat file.
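A minimal sketch of the flat-file route, assuming the admin CMS rewrites an aux.json file whenever an entry changes (the file name and $auxData are placeholders):
<?php
// Page-building side: just read the rarely-changing file.
$aux = json_decode(file_get_contents(__DIR__ . '/aux.json'), true);
// ... build the page using $aux ...

// Admin-panel side, after a create/edit/delete ($auxData is the updated array):
file_put_contents(__DIR__ . '/aux.json', json_encode($auxData), LOCK_EX);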
Caching is your best option; Redis and Memcached are common, excellent choices. Between flat file and database, it's hard to say without knowing the SQL schema you're using (as in, how many columns, what the datatype definitions are, how many foreign keys and indexes there are, etc.).
SQL is about relational data; if you have non-relational data, you don't really have a reason to use SQL. Many people are switching to NoSQL databases for this kind of thing, since modifying SQL schemas after the fact is a huge pain.
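For completeness, a hedged sketch of the cache-aside pattern with PHP's Memcached extension - the key name, TTL and the aux_info table are arbitrary examples:
<?php
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);
$aux = $cache->get('aux_info');
if ($aux === false) {   // cache miss: fall back to the database
    $pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'pass');
    $aux = $pdo->query('SELECT k, v FROM aux_info')->fetchAll(PDO::FETCH_KEY_PAIR);
    $cache->set('aux_info', $aux, 3600);   // keep it for an hour
}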
Brief overview of my use case: consider a database (most probably MongoDB) with a million entries. The value for each entry needs to be updated every day by calling an API. How should such a cron job be designed? I know Facebook does something similar. The only thing I can think of is to have multiple jobs which divide the database entries into batches, with each job updating a batch. I am certain there are smarter solutions out there. I am also not sure what technology to use. Any advice is appreciated.
-Karan
Given the updated question context of "keeping the caches warm", a strategy of touching all of your database documents would likely diminish rather than improve performance unless that data will comfortably fit into available memory.
Caching in MongoDB relies on the operating system behaviour for file system cache, which typically frees cache by following a Least Recently Used (LRU) approach. This means that over time, the working data set in memory should naturally be the "warm" data.
If you force data to be read into memory, you could be loading documents that are rarely (or never) accessed by end users .. potentially at the expense of data that may actually be requested more frequently by the application users.
There is a use case for "prewarming" the cache .. for example when you restart a MongoDB server and want to load data or indexes into memory.
In MongoDB 2.2, you can use the new touch command for this purpose.
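If you wanted to issue that from PHP with the legacy Mongo driver, it would look roughly like this - the database and collection names are placeholders:
<?php
// Hedged sketch: ask mongod to pull a collection's data and indexes into memory.
$mongo  = new MongoClient('mongodb://localhost:27017');
$result = $mongo->selectDB('mydb')->command(array(
    'touch' => 'mycollection',
    'data'  => true,
    'index' => true,
));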
Other strategies for prewarming are essentially doing reverse optimization with an explain(). Instead of trying to minimize the number of index entries (nscanned) and documents (nscannedObjects), you would write a query that intentionally will maximize these entries.
With your API response time goal .. even if someone's initial call required their data to be fetched into memory, that should still be a reasonably quick indexed retrieval. A goal of 3 to 4 seconds response seems generous unless your application has a lot of processing overhead: the default "slow" query value in MongoDB is 100ms.
From a technical standpoint, you can write scripts for the MongoDB shell and execute them via cron. If you schedule cron to run a command like:
./mongo server:27017/dbname --quiet my_commands.js
MongoDB will execute the contents of the my_commands.js script. Now, for an overly simple example just to illustrate the concept: if you wanted to find a person named sara and add an attribute (yes, an unrealistic example), you could put the following in your .js script file.
// look up the document, change a field, and write it back
person = db.people.findOne( { name : "sara" } );
person.validated = "true";
db.people.save( person );
Then every time the cron job runs, that record will be updated. Now add a loop and a call to your API, and you might have a solution. More information on these commands and examples can be found in the MongoDB docs.
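A hedged PHP sketch of that "loop plus API call" idea, run from cron via the legacy Mongo driver - the API URL, database and field names are all made up:
<?php
// Hypothetical cron-run CLI script: fetch a fresh value from an external API
// for each document and write it back.
$mongo      = new MongoClient('mongodb://localhost:27017');
$collection = $mongo->selectDB('mydb')->selectCollection('entries');
foreach ($collection->find(array(), array('apiKey' => 1)) as $doc) {
    $response = file_get_contents('https://api.example.com/value?key=' . urlencode($doc['apiKey']));
    $collection->update(
        array('_id' => $doc['_id']),
        array('$set' => array('value' => json_decode($response, true)))
    );
}
For a million documents you would, as you suggested, split the work into batches (for example by _id range) and run several such jobs in parallel rather than one giant loop.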
However, from a design perspective, are you sure you need to update every single record each night? Is there a way to identify a more reasonable subset of records that need to be processed? Or possibly can the api be called on the data as it's retrieved and served to whomever is going to consume it?
I am going to be writing some software in PHP to parse log files, aggregate the data, and then display it in graphs (like bar graphs, not vertices and edges).
Yeah, it's basically business intelligence software which my company has an entire team for but apparently they don't do a great job (10 minutes to load a page just doesn't do it).
Here is what I have to do:
The log files are data files which store the raw data from a stats server we have set up in our office (we send asynchronous calls to the stats server, kind of like Google Analytics). It stores the data in CSV format.
Write a script to parse the files and aggregate the data into a database (or I was thinking about Redis).
There will be millions and millions of records to aggregate, so displaying the stats must be fast.
I know about OLAP for the DB, but if I go with Redis, do you think it would scale to large volumes of data? To parse the files, do you think a PHP script would suffice, or should I go with something faster like C/C++?
Basically I would like to get some interesting ideas about different ways to accomplish this task. It must be fast and it must scale.
Any ideas?
It sounds like at the scales you're talking about, you need to separate the data aggregation and display. That is, you should have some process working to receive the log files when they're generated, parse them and insert the data into the database; that will be a long, complicated task. Then when a user wants to display a graph of the data, they can make a request to the PHP server, which will pull the data from the database and construct the display they want. In this way, your parsing is separated from your display request (although it's still serially dependent, your parsing can begin when the logfiles become available, and therefore, the lag involved in parsing them is hidden at display time).
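A rough sketch of the ingestion side - the CSV layout, file path, table name and the unique key on (day, page) are assumptions for illustration:
<?php
// Hypothetical parser: aggregate hit counts per page per day into MySQL.
$pdo  = new PDO('mysql:host=localhost;dbname=stats;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT INTO daily_hits (day, page, hits) VALUES (?, ?, 1)
     ON DUPLICATE KEY UPDATE hits = hits + 1'
);
$fh = fopen('/var/log/stats/today.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    list($timestamp, $page) = $row;   // assumed two-column CSV layout
    $stmt->execute(array(date('Y-m-d', (int) $timestamp), $page));
}
fclose($fh);
The display side then only ever queries the small, pre-aggregated daily_hits table.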
I run a system which needs to update various XML files from data stored in a DB. The script runs via a server-side PHP file which is monitored by a daemon, so that it executes, finishes to free resources, and is then restarted.
I have some benchmarking within the script, and when I have to update 100 XML files, it takes about 15 seconds to complete. A typical XML file that gets created is around 6 KB - I am creating the XML using PHP's DOM and writing it using dom->save. The DB is fully normalised and the correct indexes are in place; the three queries I need to perform to get the data for updating the XML only take around 0.05 seconds. Therefore the bottleneck seems to be the actual creation of the XML via DOM and the writing of the file itself.
Does anyone have any ideas how I could really speed up the process? I have considered using a CRC check to see whether the XML needs to be rewritten, but this would still require me to read the XML file I would be updating, which I don't do at the moment - so surely that's just as bad as saving a new file over the top of the old one? Also, I don't think it's possible to edit only certain parts of the XML, as the structure isn't uniform; the order of the nodes can change depending on what data is non-null after being updated.
Really appreciate your thoughts on this!
Fifteen seconds to write a few XML files? That sounds way too much. Can you do some more profiling and find out which function exactly is the bottleneck?
Have you considered writing plain text XML (fwrite("<item>value</item>")) instead of building it by DOM? Sounds justifiable in this case.
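A rough illustration of that approach - $items, the file name and the element names are placeholders:
<?php
$items = array(array('id' => 1, 'value' => 'foo'), array('id' => 2, 'value' => 'bar'));
$fh = fopen('output.xml', 'w');
fwrite($fh, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<items>\n");
foreach ($items as $item) {
    fwrite($fh, sprintf(
        "  <item id=\"%d\">%s</item>\n",
        $item['id'],
        htmlspecialchars($item['value'], ENT_QUOTES)   // escape &, <, >, quotes
    ));
}
fwrite($fh, "</items>\n");
fclose($fh);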
Otherwise, for caching, there's always filemtime(), which you could use to quickly get the "last modified" time of your XML file and see whether the DB entry is newer than that. In a system like the one you describe, there should be no need to compare the contents.
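For example, a hedged sketch assuming the table tracks an updated_at column (the names are placeholders):
<?php
// Skip regenerating an XML file unless the DB row changed after the file was written.
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'pass');
$xmlFile = 'xml/42.xml';
$stmt = $pdo->prepare('SELECT UNIX_TIMESTAMP(updated_at) FROM objects WHERE id = ?');
$stmt->execute(array(42));
$dbModified = (int) $stmt->fetchColumn();
if (!file_exists($xmlFile) || filemtime($xmlFile) < $dbModified) {
    // ... rebuild the DOM and $dom->save($xmlFile) as before ...
}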