The application I am working on needs to obtain a dataset of around 10 MB at most twice an hour. We use that dataset to display paginated results on the site; a simple search by one of the object properties should also be possible.
Currently we are thinking about two different ways to implement this:
1.) Store the JSON dataset in the database or in a file in the file system, read it, and loop over it to display results whenever we need to.
2.) Store the JSON dataset in a relational MySQL table, query the results, and loop over them whenever we need to display them.
Replacing/refreshing the results has to be done multiple times per hour, as I said.
Both ways have cons, and I am trying to choose whichever is the lesser evil overall. Reading 10 MB into memory is not a lot, but on the other hand rewriting a table a few times an hour could produce conflicts, in my opinion.
My concern regarding 1.) is how safe the app will be if we read 10 MB into memory all the time. What will happen if multiple users do this at the same point in time? Is this something to worry about, or is PHP able to handle it in the background?
What do you think it will be best for this use case?
Thanks!
When PHP runs on a web server (as it usually does) the server starts new PHP processes on demand when they're needed to handle concurrent requests. A powerful web server may allow fifty or so PHP processes. If each of them is handling this large data set, you'll need to have enough RAM for fifty copies. And, you'll need to load that data somehow for each new request. Reading 10 MB from a file is not an overwhelming burden unless you have some sort of parsing to do. But it is a burden.
As it starts to handle each request, PHP offers a clean context to the programming environment. PHP is not good at maintaining in-RAM context from one request to the next. You may be able to figure out how to do it, but it's a dodgy solution. If you're running on a server that's shared with other web applications -- especially applications you don't trust -- you should not attempt to do this; the other applications will have access to your in-RAM data.
You can control the concurrent processes with Apache or nginx configuration settings, and restrict it to five or ten copies of PHP. But if you have a lot of incoming requests, those requests get serialized and they will slow down.
Will this application need to scale up? Will you eventually need a pool of web servers to handle all your requests? If so, the in-RAM solution looks worse.
Does your JSON data look like a big array of objects? Do most of the objects in that array have the same elements as each other? If so, it maps naturally onto a SQL table: you can make a table in which the columns correspond to the elements of your objects. Then you can use SQL to avoid touching every row -- every element of the array -- every time you display or update data.
(The same sort of logic applies to Mongo, Redis, and other ways of storing your data.)
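To make that concrete for the original question, here is a minimal sketch, assuming the JSON is an array of objects with hypothetical id/name/price elements and using PDO. The refresh runs in a transaction so readers never see a half-rewritten table, and display stays a paginated, indexed query:
<?php
// Sketch only: "items", "id", "name" and "price" are invented names --
// adjust them to the real elements of your JSON objects.
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// One column per object element, plus an index on the searchable property.
$pdo->exec("CREATE TABLE IF NOT EXISTS items (
    id    INT PRIMARY KEY,
    name  VARCHAR(255) NOT NULL,
    price DECIMAL(10,2) NOT NULL,
    INDEX idx_name (name)
) ENGINE=InnoDB");

// Refresh (a couple of times per hour): replace the whole set in a transaction.
$items = json_decode(file_get_contents('dataset.json'), true);
$pdo->beginTransaction();
$pdo->exec('DELETE FROM items');
$insert = $pdo->prepare('INSERT INTO items (id, name, price) VALUES (?, ?, ?)');
foreach ($items as $item) {
    $insert->execute(array($item['id'], $item['name'], $item['price']));
}
$pdo->commit();

// Display: paginate and search without loading all 10 MB into PHP.
$page    = 1;
$perPage = 20;
$select = $pdo->prepare(
    'SELECT id, name, price FROM items WHERE name LIKE ? ORDER BY name LIMIT ? OFFSET ?'
);
$select->bindValue(1, '%widget%');
$select->bindValue(2, $perPage, PDO::PARAM_INT);
$select->bindValue(3, ($page - 1) * $perPage, PDO::PARAM_INT);
$select->execute();
$rows = $select->fetchAll(PDO::FETCH_ASSOC);
Whether this beats re-reading a 10 MB file depends entirely on your traffic, so treat it as a starting point rather than a verdict.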
I'm currently designing and developing a web application that has the potential to grow very large at a fast rate. I will give some general information and move on to my question(s). I would say I am a mid-level web programmer.
Here are some specifications:
MySQL - Database Backend
PHP - Used in front/backend. Also used for SOAP Client
HTML, CSS, JS, jQuery - Front end widgets (highcharts, datatables, jquery-ui, etc.)
I can't get into too many fine details as it is a company project, but the main objective is to construct a dashboard that thousands of users will be accessing from various devices.
The data for this project is projected to grow by 50,000 items per year ( ~1000 items per week ).
1 item = 1 row in database
An item will also record daily history, starting on the day it was inserted.
1 day of history per item = 1 record
365 records per item per year
365 × 50,000 = ~18,250,000 [first year]
multiply ~18,250,000 records by x for each year after.
(My formula is a bit off since items will be added periodically throughout the year.)
All items and history are accessed through a SOAP Client that connects to an API service, then writes the record to the database.
The majority of this data will be read and remain static (read only), but some item data may be updated or changed. The data will also be updated each day, which means writing another batch of history records.
Questions:
1) Is MySQL a good solution to handle these data requirements? ~100 million records at some point.
2) I am limited to synchronous calls with my PHP Soap Client (as far as I know). This is becoming time consuming as more items are being extracted. Is there a better option for writing a SOAP Client so that I can send asynchronous requests without waiting for a response?
3) Are there any other requirements I should be thinking about?
The difficulty involved in scaling is almost always a function of users times data. If you have a lot of users, but not much data, it's not hard to scale. A typical example is a popular blog. Likewise, if you have a lot of data but not very many users, you're also going to be fine. This represents things like accounting systems or data-warehouse situations.
The first step towards any solution is to rough in a schema and test it at scale. You will have no idea how your application is going to perform until you run it through the paces. No two applications ever have exactly the same problems. Most of the time you'll need to adjust your schema, de-normalize some data, or cache things more aggressively, but these are just techniques and there's no standard cookbook for scaling.
In your specific case you won't have many problems if the rate of INSERT activity is low and your indexes aren't too complicated. What you'll probably end up doing is splitting out those hundreds of millions of rows into several identical tables each with a much smaller set of records in them.
If you're having trouble getting your queries to execute quickly, consider the standard approach: index, optimize, then denormalize, then cache.
Where PHP can't cut it, consider using something like Python, Ruby, Java/Scala or even NodeJS to help facilitate your database calls. If you're writing a SOAP interface, you have many options.
1) Is MySQL a good solution to handle these data requirements? ~100 million records at some point.
Absolutely. Make sure you've got everything indexed properly, and if you hit a storage or queries-per-second limit, you've got plenty of options that apply to most/all DBMSs: you can get beefier hardware, start sharding data across servers, set up clustering, etc.
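As a rough illustration of "indexed properly" for the daily history described above (the table and column names here are invented), a composite primary key keeps per-item lookups cheap even at a hundred million rows:
<?php
// Sketch only: item_history, item_id, history_date and value are made-up names.
$pdo = new PDO('mysql:host=localhost;dbname=dashboard;charset=utf8', 'user', 'pass');

$pdo->exec("CREATE TABLE IF NOT EXISTS item_history (
    item_id      INT UNSIGNED  NOT NULL,
    history_date DATE          NOT NULL,
    value        DECIMAL(12,4) NOT NULL,
    PRIMARY KEY (item_id, history_date)  -- one row per item per day
) ENGINE=InnoDB");

// A typical dashboard query then reads only the rows it needs via the index.
$stmt = $pdo->prepare(
    'SELECT history_date, value
       FROM item_history
      WHERE item_id = ? AND history_date >= ?
      ORDER BY history_date'
);
$stmt->execute(array(42, '2014-01-01'));
$history = $stmt->fetchAll(PDO::FETCH_ASSOC);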
2) I am limited to synchronous calls with my PHP Soap Client (as far as I know). This is becoming time consuming as more items are being extracted. Is there a better option for writing a SOAP Client so that I can send asynchronous requests without waiting for a response?
PHP 5+ allows you to execute multiple requests in parallel with cURL. Refer to the curl_multi_* family of functions for this, such as curl_multi_exec(). As far as I know, this requires you to handle the SOAP/XML processing yourself, separately from the requests.
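A rough sketch of that pattern follows. The endpoint URLs and the envelope are placeholders; you would build and parse the actual SOAP XML yourself, since curl_multi_* only parallelizes the HTTP transport:
<?php
// Placeholder endpoints and envelope -- adapt to your real SOAP service.
$urls = array(
    'https://api.example.com/soap?item=1',
    'https://api.example.com/soap?item=2',
);
$envelope = '<soapenv:Envelope>...</soapenv:Envelope>';

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $envelope);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml; charset=utf-8'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Drive all transfers concurrently until they are done.
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // avoid busy-waiting
} while ($running > 0);

// Collect responses and clean up; parse the SOAP/XML responses yourself.
$responses = array();
foreach ($handles as $i => $ch) {
    $responses[$i] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);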
3) Are there any other requirements I should be thinking about?
Probably. But, you're usually on the right track if you start with a properly indexed, normalized database, for which you've thought about your objects at least mostly correctly. Start denormalizing if/when you find instances wherein denormalization solves an existing or obvious near-future efficiency problem. But, don't optimize for things that could become problems if the moons of Saturn align. Only optimize for problems that users will notice somewhat regularly.
When talking about a large-scale app, not all of the effort and credit should go to the database alone. It is the core part, since the data is the main thing in any web application, but alongside it your application also depends on code optimization, which includes your back-end and front-end scripts, your images, and above all your server. Many factors affect the application.
Okay, so I have some weirder questions about Memcached. The basic idea of my caching technique is to save data requested by my PHP script in a Memcached server. The main issue my team and I faced is that saving large amounts of data can sometimes exceed the 1 MB limit on item size in Memcached.
To further explain the approach, imagine the following:
We have lots of data to configure a certain object, and that data contains a lot of text, numbers, etc. We need to save almost 200 of those objects, so the first approach we went with was to cache all 200-ish objects as one big item in Memcached. That item may surpass the 1 MB limit, so we figured we could go with a new approach.
The new approach we went with is to break down the data configuring the object into smaller building blocks; since we don't use all the data on the same page, we then use only the building blocks we need to get exactly the amount of data used on that particular page.
The question is as follows:
Does GET speed change as the item gets bigger? Or would the limit on the number of requests the Memcached server can handle in parallel get in the way of the second approach, since we would then use a multi-get to fetch the multiple building blocks configuring the object?
I know this is a weird question, but it's vital to the new approach we're going with, since it will determine the size of the building blocks we use and whether or not we add more data to them when we need to.
Edit 1:
Bear in mind that we can use multi-get with the second approach, so we don't have to connect to Memcached and wait for a response for each bit of data we're getting. Parallel requests will be used to get the multiple keys.
Without getting into "what the heck are you storing in memcache, and why not use another solution (like a DB with a memory table storage engine)?"...
I'd say the cost of the multiple requests is indeed a concern--especially with memcached running on remote nodes/hosts. A single request for a large object is most likely overall faster--you still need the same amount of data transferred, but will not have the additional separate request overhead vs. the 200 pieces.
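For reference, here's what the two shapes look like with the pecl Memcached client (the key names are invented); the second approach at least keeps the 200 pieces down to a single multi-get rather than 200 separate requests:
<?php
// Key names are invented for illustration.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

// Approach 1: one big item (subject to the 1 MB default item size limit).
$all = $mc->get('config:objects:all');

// Approach 2: many small building blocks, fetched with a single multi-get
// instead of 200 individual get() calls.
$keys = array();
for ($i = 1; $i <= 200; $i++) {
    $keys[] = 'config:object:' . $i;
}
$blocks = $mc->getMulti($keys);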
BTW... if you're using APC and you don't have many of these huge items, you can use it instead of memcache to do local, user-level memory caching -- the max size is easily tweakable via the PHP config settings. You won't get the benefit of distributed access/sharing across hosts, but it's fast and simple.
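A minimal sketch of that APC fallback (the key name and the rebuild function are hypothetical; the relevant size knob is apc.shm_size in php.ini):
<?php
// Hypothetical key and rebuild function; apc.shm_size bounds the whole cache.
$key  = 'config:objects:all';
$data = apc_fetch($key, $success);
if (!$success) {
    $data = build_config_objects(); // placeholder: rebuild from your real source
    apc_store($key, $data, 300);    // keep it for 5 minutes
}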
Brief overview of my use case: consider a database (most probably MongoDB) with a million entries. The value of each entry needs to be updated every day by calling an API. How would you design such a cron job? I know Facebook does something similar. The only thing I can think of is to have multiple jobs that divide the database entries into batches, with each job updating one batch. I am certain there are smarter solutions out there. I am also not sure what technology to use. Any advice is appreciated.
-Karan
Given the updated question context of "keeping the caches warm", a strategy of touching all of your database documents would likely diminish rather than improve performance unless that data will comfortably fit into available memory.
Caching in MongoDB relies on the operating system behaviour for file system cache, which typically frees cache by following a Least Recently Used (LRU) approach. This means that over time, the working data set in memory should naturally be the "warm" data.
If you force data to be read into memory, you could be loading documents that are rarely (or never) accessed by end users .. potentially at the expense of data that may actually be requested more frequently by the application users.
There is a use case for "prewarming" the cache .. for example when you restart a MongoDB server and want to load data or indexes into memory.
In MongoDB 2.2, you can use the new touch command for this purpose.
Other strategies for prewarming are essentially doing reverse optimization with an explain(). Instead of trying to minimize the number of index entries (nscanned) and documents (nscannedObjects), you would write a query that intentionally will maximize these entries.
With your API response time goal .. even if someone's initial call required their data to be fetched into memory, that should still be a reasonably quick indexed retrieval. A goal of 3 to 4 seconds response seems generous unless your application has a lot of processing overhead: the default "slow" query value in MongoDB is 100ms.
From a technical standpoint, you can write scripts for the MongoDB shell and execute them via cron. If you schedule cron to run a command like:
./mongo server:27017/dbname --quiet my_commands.js
MongoDB will execute the contents of the my_commands.js script. Now for an overly simple example, just to illustrate the concept: if you wanted to find a person named Sara and add an attribute (yes, an unrealistic example), you could put the following in your .js script file.
// look up the document to update (same collection as the save below)
person = db.people.findOne( { name : "sara" } );
// add the new attribute and write the document back
person.validated = "true";
db.people.save( person );
Then every time the cron job runs, that record will be updated. Now add a loop and a call to your API, and you might have a solution. More information on these commands and examples can be found in the MongoDB docs.
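If you would rather run the whole job as a PHP CLI script under cron instead of a mongo shell script, a rough equivalent using the old mongo extension (MongoClient) might look like this; the API URL and the "validated" field are invented for illustration:
<?php
// Cron-driven CLI sketch using the legacy "mongo" extension.
$mongo  = new MongoClient('mongodb://server:27017');
$people = $mongo->dbname->people;

$cursor = $people->find(array(), array('name' => true));
foreach ($cursor as $doc) {
    // Call the external API for this record (placeholder URL).
    $response = file_get_contents(
        'https://api.example.com/validate?name=' . urlencode($doc['name'])
    );
    // Write the result back to just this document.
    $people->update(
        array('_id' => $doc['_id']),
        array('$set' => array('validated' => json_decode($response, true)))
    );
}
At a million records you would obviously want to batch this and parallelize the API calls, but the shape of the job stays the same.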
However, from a design perspective, are you sure you need to update every single record each night? Is there a way to identify a more reasonable subset of records that need to be processed? Or possibly can the api be called on the data as it's retrieved and served to whomever is going to consume it?
I am working on an application using a memcache pool (5 servers) and some processing nodes. I have two possible approaches, and I was wondering if you have any comments on how they compare performance-wise (speed primarily):
1. I extract a big chunk of data from memcache once per request, iterate over it, and discard the bits I don't need for the particular request.
2. I extract small bits from memcache and only the ones I need, i.e. I extract the value of a and, based on the value of a, extract the value of either b or c. I use this combination to find the next key I want to extract.
The difference between the two is that the number of memcache lookups (against a pool of servers) is lower in 1, but the size of the response is larger. Has anyone seen benchmarking reports on this?
Unfortunately I can't use a better key based directly on the request, as I don't have enough memcache capacity to support all possible combinations of values, so I have to construct some of it at run time.
Thanks
You would have to benchmark this for your own setup. The parts that would matter would be the time spent on:
requesting a large amount of data from memcache + retrieving it + extracting the data from the response
sending several requests to memcache + retrieving the data
Basically, the first thing you have to measure is how large the overhead of interacting with your cache pool is. And there is the small matter of how this whole thing will react when load increases. What might be fast now can turn out to be a terrible decision later, when the users start pouring in.
This kinda depends on your definition of "large chunk". Are we talking megabytes here, or an array with 100 keys? You also have to consider that PHP still needs to process that information.
There are two things you can do at this point:
Take a hard look at how you are storing the information. Maybe you can cut it down to two small requests: one to retrieve the specific data for the conditions, and another to get the conditional information.
Set up your own benchmark for your server; some random article on the web will not be relevant to your system architecture.
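A bare-bones version of that benchmark, assuming the pecl Memcached client and invented key names, could be as simple as:
<?php
// Invented keys -- replace with the ones a real request would touch.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

// Approach 1: one large chunk, then discard what this request doesn't need.
$start = microtime(true);
$big   = $mc->get('request:blob');
$t1    = microtime(true) - $start;

// Approach 2: small, dependent lookups (the value of a decides the next key).
$start = microtime(true);
$a    = $mc->get('key:a');
$next = ($a === 'x') ? 'key:b' : 'key:c';
$b    = $mc->get($next);
$t2   = microtime(true) - $start;

printf("large chunk: %.4fs, targeted lookups: %.4fs\n", $t1, $t2);
Run it under realistic load against your actual pool, not just once from the command line.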
I know this is not the answer you wanted to hear, but that's my two cents .. here ya go.
Dropping my lurker status to finally ask a question...
I need to know how I can improve on the performance of a PHP script that draws its data from XML files.
Some background:
I've already mapped the bottleneck to CPU - but want to optimize the script's performance before taking a hit on processor costs. Specifically, the most CPU-consuming part of the script is the XML loading.
The reason I'm using XML to store object data is that the data needs to be accessible via a browser Flash interface, and we want to provide fast user access in that area. The project is still in its early stages though, so if best practice would be to abandon XML altogether, that would be a good answer too.
Lots of data: Currently plotting for roughly 100k objects, albeit usually small ones - and they must ALL be taken up into the script, with perhaps a few rare exceptions. The data set will only grow with time.
Frequent runs: Ideally, we'd run the script ~50k times an hour; realistically, we'd settle for ~1k/h runs. This coupled with data size makes performance optimization completely imperative.
I've already taken the optimization step of making several runs on the same loaded data rather than loading it for each run, but it's still taking too long. The runs should generally use "fresh" data that includes the modifications made by users.
Just to clarify: is the data you're loading from the XML files being processed in its current state, and is it modified before being sent to the Flash application?
It looks like you'd be better off using a database to store your data and pushing out XML as needed, rather than reading it in as XML first; if building the XML files gets slow, you could cache the files as they're generated in order to avoid redundant generation of the same file.
If the XML stays relatively static, you could cache it as a PHP array, something like this:
<xml><foo>bar</foo></xml>
is cached in a file as
<?php return array('foo' => 'bar');
It should be faster for PHP to just include the arrayified version of the XML.
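One hedged way to build and use that cached file (the file names and the SimpleXML-to-array shortcut are just one option):
<?php
// Rebuild the cached PHP version of the XML; file names are placeholders.
function cache_xml_as_array($xmlFile, $cacheFile)
{
    $xml  = simplexml_load_file($xmlFile);
    $data = json_decode(json_encode($xml), true); // crude SimpleXML -> array (drops some attributes)
    file_put_contents($cacheFile, '<?php return ' . var_export($data, true) . ';', LOCK_EX);
    return $data;
}

// On each run, include the cache if it is newer than the XML, else rebuild it.
$xmlFile   = 'objects.xml';
$cacheFile = 'objects.cache.php';
if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($xmlFile)) {
    $data = include $cacheFile;
} else {
    $data = cache_xml_as_array($xmlFile, $cacheFile);
}
This still won't help if the XML itself changes on every run, which is why moving the canonical data into a database, as suggested above, is probably the better long-term fix.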
~1k/hour over 3600 seconds per hour is a run every ~3.6 seconds (let alone the 50k/hour, which would be roughly 14 runs a second)...
There are many questions. Some of them are:
Does your PHP script need to read/process all records of the data source on every single run? If not, what kind of subset does it need (approximate size, criteria, ...)?
Same question for the Flash application, plus: who's sending the data? The PHP script? A "direct" request for the complete, static XML file?
What operations are performed on the data source?
Do you need some kind of concurrency mechanism?
...
And just because you want to deliver XML data to the Flash clients, it doesn't necessarily mean that you have to store XML data on the server. If, for example, the clients only need a tiny subset of the available records, it's probably a lot faster not to store the data as XML but as something more suited to speed and searchability, and then create the XML output for that subset on the fly, maybe assisted by some caching depending on what data the clients request and how much/how often the data changes.
edit: Let's assume that you really, really need the whole dataset and need a continuous simulation. Then you might want to consider a continuous process that keeps the complete "world model" in memory and operates on this model on each run (world tick). That way, at least, you wouldn't have to load the data on each tick. But such a process is usually written in something other than PHP.