PHP Loop Performance Optimization - php

I am writing a PHP function that will need to loop over an array of pointers and, for each item, pull in that data (be it from a MySQL database or a flat file). Does anyone have any ideas on optimizing this, as there could potentially be thousands and thousands of iterations?
My first idea was to have a static array of cached data to work on; any modifications would just change that cached array, and at the end I could flush it to disk. However, in a loop of over 1000 items, this would be useless if I only keep around 30 in the array. Each item isn't too big, but 1000+ of them in memory is way too much, hence the need for disk storage.
The data is just gzipped, serialized objects. Currently I am using a database to store the data, but I am thinking maybe flat files would be quicker (I don't care about concurrency issues, and I don't need to parse it, just unzip and unserialize). I already have a custom iterator that pulls in 5 items at a time (to cut down on DB connections) and stores them in this cache. But again, using a cache of 30 when I need to iterate over thousands is fairly useless.
Basically I just need a way to iterate over these many items quickly.

Well, you haven't given a whole lot to go on. You don't describe your data, and you don't describe what your data is doing or when you need one object as opposed to another, and how those objects get released temporarily, and under what circumstances you need it back, and...
So anything anybody says here is going to be a complete shot in the dark.
...so along those lines, here's a shot in the dark.
If you are only comfortable holding x items in memory at any one time, set aside space for x items. Then, every time you access an object, make a note of the time (this might not mean clock time so much as the order in which you access them). Keep each item in a list (perhaps implemented not as an actual list but as a heap-like structure) so that the most recently used items appear sooner in the list. When you need to bring a new one into memory, you replace the one that was used the longest time ago and then move the new item to the front of the list. You may need to keep another index of the items so that you know exactly where they are in the list when you need them. What you do then is look up where the item is located, link its parent and child pointers as appropriate, then move it to the front of the list. There are probably other ways to optimize lookup time, too.
This is called the LRU algorithm. It's a page replacement scheme for virtual memory. What it does is delay your bottleneck (the disk I/O) until it's probably impossible to avoid. It is worth noting that this algorithm does not guarantee optimal replacement, but it performs pretty well nonetheless.
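As a rough sketch, here is a minimal LRU cache in PHP, leaning on the fact that PHP arrays preserve insertion order. The loader callback is a stand-in for your own fetch step (e.g. a DB read followed by gzuncompress + unserialize):

```php
<?php
// Minimal LRU cache sketch. The $loadItem callback is a placeholder for
// whatever actually fetches an item (DB query, flat file read, etc.).
class LruCache
{
    private $capacity;
    private $items = []; // key => value, ordered oldest-used first

    public function __construct(int $capacity)
    {
        $this->capacity = $capacity;
    }

    public function get($key, callable $loadItem)
    {
        if (array_key_exists($key, $this->items)) {
            // Hit: move the entry to the "most recently used" end.
            $value = $this->items[$key];
            unset($this->items[$key]);
            $this->items[$key] = $value;
            return $value;
        }
        // Miss: evict the least recently used entry if we're full.
        if (count($this->items) >= $this->capacity) {
            unset($this->items[array_key_first($this->items)]);
        }
        return $this->items[$key] = $loadItem($key);
    }
}
```

With a capacity of 30 this behaves like the cache described above: iterating over thousands of items only hits the disk on misses, and frequently revisited items stay in memory.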
Beyond that, I would recommend parallelizing your code to a large degree (if possible) so that when one item needs to hit the hard disk to load or to dump, you can keep that processor busy doing real work.
< edit >
Based on your comment, you are working on a neural network. In the case of your initial feeding of the data (before the correction stage), or when you are actively using it to classify, I don't see how the algorithm is a bad idea, unless there is just no possible way to fit the most commonly used nodes in memory.
In the correction stage (perhaps back-prop?), it should be apparent what nodes you MUST keep in memory... because you've already visited them!
If your network is large, you aren't going to get away with no disk I/O. The trick is to find a way to minimize it.
< /edit >

Clearly, keeping it in memory is faster than anything else. How big is each item? Even if they are 1K each, ten thousand of them is only 10 MB.

You can always break out of a loop once you get the data you need, so that it doesn't keep looping. If you are storing flat files, your server's HDD will suffer from holding thousands or millions of files of different sizes. But if you are talking about storing the whole actual file in a DB, then it is much better to store the file in a folder and just save its path in the DB. And try putting the pulled items into an XML document, so that it is much easier to access and can carry many attributes for the details of the pulled item, e.g. (Name, date uploaded, etc.).

You could use memcached to store objects the first time they are read, then use the cached version on subsequent calls. Memcached uses RAM to store objects, so as long as you have enough memory, you will get a great acceleration. There is a PHP API for memcached.
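A sketch of that read-through pattern, assuming the pecl Memcached extension and a hypothetical loadObjectFromDb() loader. Note one caveat: a legitimately stored false is indistinguishable from a miss here; check Memcached::getResultCode() against Memcached::RES_NOTFOUND if you need to tell them apart.

```php
<?php
// Read-through cache sketch. loadObjectFromDb() is a made-up name for
// your own DB loader; $mc can be a Memcached instance.
function getObject($mc, string $key, callable $loadObjectFromDb)
{
    $value = $mc->get($key);
    if ($value === false) {              // treat false as a cache miss
        $value = $loadObjectFromDb($key);
        $mc->set($key, $value, 3600);    // cache for one hour
    }
    return $value;
}

// Typical wiring (requires a running memcached server):
// $mc = new Memcached();
// $mc->addServer('127.0.0.1', 11211);
// $obj = getObject($mc, "item:$id", 'loadObjectFromDb');
```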

Related

Reading a file or searching in a database?

I am creating a web-based app for Android and I came to the point of the account system. Previously I stored all data for a person inside a text file, located at users/<name>.txt. Now, thinking about doing it in a database (like you probably should), wouldn't that take longer to load, since it has to look for the row where the name is equal to the input?
So, my question is: is it faster to read data from a text file, which is easy to open because it knows its location, or would it be faster to get the information from a database, even though it would have to first scan line by line until it reaches the one with the correct name?
I don't care about the safety; I know the first option is not safe at all. It doesn't really matter in this case.
Thanks,
Merijn
In any question about performance, the first answer is usually: Try it out and see.
In your case, you are reading a file line-by-line to find a particular name. If you have only a few names, then the file is probably faster. With more lines, you could be reading for a while.
A database can optimize this using an index. Do note that the index will not have much effect until you have a fair amount of data (tens of thousands of bytes). The reason is that the database reads the records in units called data pages. So, it doesn't read one record at a time, it reads a page's worth of records. If you have hundreds of thousands of names, a database will be faster.
Perhaps the main performance advantage of a database is that after the first time you read the data, it will reside in the page cache. Subsequent access will use the cache and just read it from memory -- automatically, I might add, with no effort on your part.
The real advantage to a database is that it then gives you the flexibility to easily add more data, to log interactions, and to store other types of data the might be relevant to your application. On the narrow question of just searching for a particular name, if you have at most a few dozen, the file is probably fast enough. The database is more useful for a large volume of data and because it gives you additional capabilities.
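To illustrate the indexed lookup discussed above, here is a small sketch using PDO with an in-memory SQLite database (the table and column names are made up; the same SQL works against MySQL):

```php
<?php
// Indexed name lookup sketch: the index lets the database jump straight
// to the matching row instead of scanning line by line.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE users (name TEXT, data TEXT)');
$db->exec('CREATE INDEX idx_users_name ON users (name)'); // the index does the work

$insert = $db->prepare('INSERT INTO users (name, data) VALUES (?, ?)');
$insert->execute(['merijn', 'profile data here']);

$select = $db->prepare('SELECT data FROM users WHERE name = ?');
$select->execute(['merijn']);
$row = $select->fetch(PDO::FETCH_ASSOC);
```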
A bit of Googling came up with this question: https://dba.stackexchange.com/questions/23124/whats-better-faster-mysql-or-filesystem
I think the answer suits this one as well.
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database. Another is understanding the relational model, so that data doesn't need to be repeated over and over. Another is understanding types. If you have a txt file, you'll need to parse numbers, dates, etc.
So - the file system might work for you in some cases, but certainly not all.
That's where database indexes come in.
You may wish to take a look at How does database indexing work? :)
It is quite a simple decision - use a database.
Not because it's faster or slower, but because it has mechanisms to prevent data loss or corruption.
A failed write to the text file can happen, and you would lose a user's profile info.
With a database engine, it's much more difficult to lose data like that.
EDIT:
Also, a big question - is this about the server side or the app side?
Because, for the app side, realistically you won't have more than 100 users per smartphone... More likely you will have 1-5 users who share the phone and thus need their own profiles, and in the majority of cases you will have a single user.

How many lines are too many when reading from a file?

I intend to make a dynamic list in PHP, for which I have a plain text file with one element of the list per line. Every line has a string that needs to be parsed into several smaller chunks before rendering the final HTML document.
Last time I did something similar, I used the file() function to load my file into an array, but in this case I have a 12KB file with more than 50 lines, which will most certainly grow bigger over time. Should I load the entries from the file into a SQL database to avoid performance issues?
Yes, put the information into a database. Not for performance reasons (in terms of sequential reading), because a 12KB file will be read very quickly, but for the part about parsing into separate chunks. Make those chunks into columns of your DB table. It will make the whole programming process go faster, with greater flexibility.
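A sketch of that import step, splitting each line into chunks and storing them as columns. The delimiter, column names, and fallback sample lines are all assumptions, and SQLite stands in for a MySQL connection:

```php
<?php
// One-time import sketch: parse each line of the list file into chunks
// and store the chunks as columns (field names here are made up).
$db = new PDO('sqlite::memory:'); // stand-in for your MySQL connection
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE items (title TEXT, url TEXT, label TEXT)');
$insert = $db->prepare('INSERT INTO items (title, url, label) VALUES (?, ?, ?)');

// Read the real file if present, otherwise fall back to sample data.
$lines = is_file('list.txt')
    ? file('list.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
    : ['First item|/first|news', 'Second item|/second|blog'];

foreach ($lines as $line) {
    $insert->execute(explode('|', $line, 3)); // assumes pipe-delimited lines
}
```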
Breaking stuff up into a properly formatted database is -almost- always a good idea and will be a performance saver.
However, 50 lines is pretty minor (even a few hundred lines is pretty minor). A bit of quick math, 12KB / 50 lines tells me each line is only about 240 characters long on average.
I doubt that amount of processing (or even several times that much) will be a significant enough performance hit to cause dread unless this is a super high performance site.
While 50 lines doesn't seem like much, it would be a good idea to use the database now rather than making the change later. One thing you have to remember is that using a database won't straight away eliminate performance issues, but it will help you make better use of resources. In fact, you can write a similarly optimized process using files too, and it would work just about the same except for the I/O difference.
I reread the question and realize that you might mean you would load the file into the database every time. I don't see how this can help unless you are using the database as a form of cache to avoid repeated hits to the file. Ultimately, reading from a file or a database only differs in how the script uses I/O, disk caches, etc. The processing you do on the list might make more of a difference here.

Memcache get optimization with php

Okay, so I have some weird-er questions about Memcached. The whole basic idea of my caching technique is to save data to be requested by my PHP script in a Memcached server. The main issue my team and I faced is that saving large amounts of data can sometimes exceed the 1MB limit on item size in Memcached.
To further explain the approach imagine the following:
We have lots of data to configure a certain object, and that data contains a lot of text, numbers, etc. And we need to save almost 200 of those objects, so the first approach we went with was to cache the entire 200-ish objects as one big item in Memcached. That item may surpass the 1MB limit, so we figured we could go with a new approach.
The new approach we went with was to break down the data configuring the object into smaller building blocks; then (since we don't use all the data on the same page) we would use the smaller building blocks to fetch exactly the amount of data we would use on that particular page.
The question is as follows:
Does the GET speed change when you get bigger data? Or would the limit on the number of requests handled in parallel by the Memcached server get in the way of the second approach, since we would then use a multi-GET to fetch the multiple building blocks configuring the object?
I know this is a weird question but it's vital to the new approach that we're going with since it would determine the size of the building blocks that we will use and whether or not we will add data to it if we need to.
Edit 1:
Bear in mind that we can use the multi-GET function with the second approach, so we don't have to connect to Memcached and wait for a response for each bit of data we're getting. So parallel requests will be used to get the multiple keys.
Without getting into "what the heck are you storing in memcache, and why not use another solution (like a DB with a memory table storage engine)"...
I'd say the cost of the multiple requests is indeed a concern--especially with memcached running on remote nodes/hosts. A single request for a large object is most likely faster overall--you still need the same amount of data transferred, but you avoid the separate per-request overhead of the 200 pieces.
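One way to reconcile the 1MB limit with the round-trip cost is to split an oversized value into sub-limit chunks and reassemble it with a single multi-get. A sketch with made-up helper names (and $mc as a Memcached-like object):

```php
<?php
// Chunked storage sketch: split a big blob into pieces under memcached's
// 1MB item limit, then fetch all pieces in one multi-get round trip.
const CHUNK_SIZE = 1000000; // stay under the 1MB limit

function storeChunked($mc, string $key, string $blob): void
{
    $chunks = str_split($blob, CHUNK_SIZE);
    $mc->set($key . ':count', count($chunks));
    foreach ($chunks as $i => $chunk) {
        $mc->set("$key:$i", $chunk);
    }
}

function fetchChunked($mc, string $key): string
{
    $count = (int) $mc->get($key . ':count');
    $keys = [];
    for ($i = 0; $i < $count; $i++) {
        $keys[] = "$key:$i";
    }
    $chunks = $mc->getMulti($keys); // one round trip for all chunks
    $blob = '';
    foreach ($keys as $k) {         // reassemble in key order
        $blob .= $chunks[$k];
    }
    return $blob;
}
```

The reassembly loop iterates over $keys rather than the returned array because getMulti() does not promise to return results in request order.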
BTW... If you're using APC and you don't have many of these huge items, you can use it instead of memcache to do local user-level memory caching--the max size is easily tweakable via the PHP config settings. You won't get the benefit of distributed access/sharing across hosts, but it's fast and simple.

Retrieving Records : Array VS DB

Concern about my page loading speed, I know there are a lot of factors that affect page loading time.
Is retrieving records (categories) from an array instead of the DB faster?
Thanks
It is faster to keep it all in PHP until you have an absurd number of records and you use up RAM.
BUT, both of these things are super fast. Selecting a handful of records from a single table that has an index should take less than a msec. Are you sure that you know the source of your web page's slowness?
I would be a little bit cautious about having your data in your code. It will make your system less maintainable. How will users change categories?
This gets back to deciding whether you want your site static or dynamic.
Yes, of course retrieving data from an array is much faster than retrieving data from a database, but arrays and databases usually have totally different use cases: data in an array is static (you type the values in code or in a separate file and can't modify them), while data in a database is dynamic.
Yes, it's probably faster to have an array of your categories directly in your PHP script, especially if you need all the categories on every page load. This makes it possible for APC to cache the array (if you have APC running), and also lessen the traffic to/from the database.
But is this where your bottleneck is? It seems to me as the categories should have been cached in the query cache and therefore be easily retrieved. If this is not your biggest bottleneck, chances are you won't see any decrease in loading times. Make sure to profile your application to find the large bottlenecks or you will waste your time on getting only small performance gains.
If you store categories in a database, you have to connect to the database, prepare a SQL statement, send it to the server, fetch the result set, and (probably) store the results in an array. (But you'll probably already have a connection to the database anyway, and hardware and software is designed to do this kind of work quickly.)
Storing and retrieving categories from a database trades speed for maintenance. Your data is always up to date; it might take a little longer to get it.
You can also store categories as constants or as literals in an array assignment. It would be smart to generate the constants or the array literals from data stored in the database, but you might not have to do that for every page load. If "categories" doesn't change much, you might be able to get away with generating the code once or twice a day, plus whenever someone adds a category. It depends on your application.
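A sketch of that code-generation idea using var_export(): regenerate the array file whenever categories change, and every page load pays only for an include. The file and table names are assumptions:

```php
<?php
// Regenerate a PHP file containing the categories as an array literal.
// Run this once or twice a day, or whenever a category is added.
function regenerateCategoryCache(PDO $db, string $file): void
{
    $categories = $db->query('SELECT id, name FROM categories')
                     ->fetchAll(PDO::FETCH_KEY_PAIR); // id => name
    $code = '<?php return ' . var_export($categories, true) . ';';
    file_put_contents($file, $code, LOCK_EX);
}

// On each page load (cheap, and cacheable by APC/opcache):
// $categories = include 'categories.php';
```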
Storing and retrieving categories from an array trades maintenance for speed. Your data loads a little faster; it might be incomplete.
The unsatisfying answer is that you're not going to be able to tell how different storage and page generation strategies affect page loading speed until you test them. And even testing isn't that easy, because the effect of changing server and database parameters can be, umm, surprising.
(You can also generate static pages from the database using php. I suggest you test some static pages to give you an idea of "best case" performance.)

Caching variables in PHP

Long story short, I am looking for the best way to quickly and efficiently store mostly boolean variables, like:
Has current user viewed this page? (Boolean)
Has current user voted for this page? (Boolean again)
How many times today has this user got points for voting? (Integer)
These variables are going to be stored for only ONE day; that is, at midnight each day they will be removed.
I can think of five ways to accomplish this, but I don't know how to properly speedtest them, so I could certainly use some help with this.
1. Single File - Single Variable
The first idea is to store a variable in a file like this: <?php $___XYZ = true;, then include it and return $___XYZ. The problem is that there are most likely going to be hundreds of these variables, and this can potentially take a lot of space (since, correct me if I am wrong, each file takes a minimum of ~4KB, depending on the partition format). Big pluses are ease of access, ease of working with it, and ease of clearing the whole thing at the beginning of a day (just delete the whole folder with its contents). Any problems with speed of access?
2. Single File - Many Variables
I could store groups of variables in one file, in such fashion:
0:1
1:1
14:0
154:0
Then use fgets to find and read the variable, but what about writing mid-file? Can fwrite be used effectively? I am not really confident this way is much better than 1., but what do you think?
3. APC
Use apc_store and friends to store, modify, and access the data. I have three concerns here: I read somewhere that enabling APC can seriously slow down your site; there are sometimes strange problems with caching; and I am curious how to effectively remove only the "daily" cache while leaving anything else I might have cached. Also, how well does it cope with hundreds of thousands of variables stored in it?
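One way to handle the "remove only the daily cache" concern is to put the current date into the key and give each entry a TTL that runs out at midnight, so yesterday's entries expire on their own and nothing else is touched. A sketch using the APCu functions (the original apc_store/apc_fetch work the same way; the key scheme is made up):

```php
<?php
// "Daily" cache sketch for APC/APCu: date-stamped keys plus a TTL that
// ends at midnight, so old entries expire without a manual sweep.
function dailyKey(string $name): string
{
    return 'daily:' . date('Y-m-d') . ':' . $name;
}

function dailySet(string $name, $value): bool
{
    $ttl = strtotime('tomorrow') - time(); // seconds left until midnight
    return apcu_store(dailyKey($name), $value, $ttl);
}

function dailyGet(string $name)
{
    return apcu_fetch(dailyKey($name));
}

// dailySet('voted:15', true);
// $hasVoted = dailyGet('voted:15');
```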
4. MySQL Database
I could create a table with two columns (name and value), but... I have this feeling it will be painfully slow compared to any of the above options.
To sum it up - which of these ways to store variables in PHP is the best? Or maybe there is something even better?
For profiling, you can use Xdebug, which stores profiling information in the defined folder, and use Webgrind to view the profiling data.
My settings in php.ini for xdebug:
zend_extension=C:/WEB/PHP-ts/php_xdebug-2.1.0-5.3-vc6.dll
xdebug.collect_params=4
xdebug.show_local_vars=on
xdebug.scream=1
xdebug.collect_vars=on
xdebug.dump_globals=on
xdebug.profiler_enable=1
xdebug.profiler_output_dir=C:/WEB/_profiler/
xdebug.profiler_output_name=cachegrind.%s.out
xdebug.collect_return=1
xdebug.collect_assignments=1
xdebug.show_mem_delta=1
And I found a blog post with a cache performance comparison (but it's from 2006!):
Cache Type                          Cache Gets/sec
Array Cache                                 365000
APC Cache                                    98000
File Cache                                   27000
Memcached Cache (TCP/IP)                     12200
MySQL Query Cache (TCP/IP)                    9900
MySQL Query Cache (Unix Socket)              13500
Selecting from table (TCP/IP)                 5100
Selecting from table (Unix Socket)            7400
What about memcached? It's really fast, and when you're just storing bools it all fits in memory no problem. It is definitely the fastest option of them all. At midnight, you can easily read out all the stats gathered during the day and clear the cache.
Use memcached: you can store variables in memory and set their expiry time, so they can be rotated daily.
Memcached is way faster than any other method you listed; if you're new to it, try this class I wrote.
My favorite way of doing this is a variation of #2, where I make an array out of the number pairs. Then it is easy to serialize the array and save it to a file.
For your application this has a disadvantage if multiple visitors/processes need access to the array at the same time. Perhaps there is a way around that by using a separate file for each user.
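A sketch of that per-user variation: one serialized array per user, so concurrent visitors never touch each other's files. The directory layout and key names are made up:

```php
<?php
// Per-user flag storage sketch: each user's booleans/counters live in
// their own serialized-array file, sidestepping concurrent access.
function flagFile(string $dir, string $userId): string
{
    return rtrim($dir, '/') . '/' . $userId . '.ser';
}

function loadFlags(string $dir, string $userId): array
{
    $file = flagFile($dir, $userId);
    return is_file($file) ? unserialize(file_get_contents($file)) : [];
}

function saveFlags(string $dir, string $userId, array $flags): void
{
    file_put_contents(flagFile($dir, $userId), serialize($flags), LOCK_EX);
}

// $flags = loadFlags('flags', '42');
// $flags['viewed:home'] = true;
// saveFlags('flags', '42', $flags);
```

Clearing everything at midnight is then just deleting the directory's contents, as suggested for option 1.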
I would opt for MySQL and/or an in-memory cache like APC or memcached.
Honestly, a properly indexed database table will probably be plenty fast for most operations. And you can easily clear all the records via a DELETE statement comparing timestamps.
It will definitely be faster than a home-brew solution on the filesystem. And assuming you're already using MySQL for the rest of your site, you don't need to worry about an extra storage layer.
EDIT: I'd also like to point out that memory is volatile. Should your server lose power, your data will disappear if it's not persisted somewhere (like a database).
