How many lines are too many when reading from a file? - php

I intend to make a dynamic list in PHP, for which I have a plain text file with one element of the list on every line. Every line has a string that needs to be parsed into several smaller chunks before rendering the final HTML document.
Last time I did something similar, I used the file() function to load my file into an array, but in this case I have a 12KB file with more than 50 lines, which will most certainly grow bigger over time. Should I load the entries from the file into a SQL database to avoid performance issues?

Yes, put the information into a database. Not for performance reasons (in terms of sequential reading), because a 12KB file will be read very quickly, but for the part about parsing into separate chunks. Make those chunks into columns of your DB table. It will make the whole programming process go faster, with greater flexibility.
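As a rough sketch of what that could look like (the table, columns, delimiter, and connection details below are all made up for illustration):
<?php
// Hypothetical sketch: split each pipe-delimited line into chunks and
// store the chunks as columns. Table/column names are invented.
$pdo = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO list_items (title, description, url) VALUES (?, ?, ?)');

foreach (file('list.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $chunks = explode('|', $line);               // parse the line into its parts
    $stmt->execute(array_slice($chunks, 0, 3));  // one row per line of the file
}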

Breaking stuff up into a properly formatted database is almost always a good idea and will be a performance saver.
However, 50 lines is pretty minor (even a few hundred lines is pretty minor). A bit of quick math: 12KB / 50 lines tells me each line is only about 240 characters long on average.
I doubt that amount of processing (or even several times that much) will be a significant enough performance hit to cause dread unless this is a super high performance site.

While 50 lines doesn't seem like much, it would be a good idea to use the database now rather than making the change later. One thing you have to remember is that using a database won't straight away eliminate performance issues, but it will help you make better use of resources. In fact, you could write a similarly optimized process using files too, and it would work just about the same except for the I/O difference.
I reread the question and realize that you might mean you would load the file into the database every time. I don't see how this can help unless you are using the database as a form of cache to avoid repeated hits to the file. Ultimately, reading from a file or a database only differs in how the script uses I/O, disk caches, etc. The processing you do on the list might make more of a difference here.

Related

List or single line array/parameters: Does one or the other perform better?

A little bit of a generic question but it has been playing on my mind for a while.
Whilst learning PHP coding, to help me create a WordPress theme from scratch, I have noticed that some arrays/parameters are kept to a single line whilst others are listed underneath one another. Personally, I prefer listing the arrays underneath one another, as I feel this helps with readability and generally just looks tidier, especially if the array is long.
Does anyone know if listing arrays/parameters has any performance 'ill effects', such as slowing down the page load speed? As far as I can see, it is just a coder's preference. Is this a correct assumption?
Code formatting has no effect on performance.
Even if you claim that a larger file takes longer to read, if you are using at least PHP 5.5 then PHP will use an opcode cache: it caches how it parsed your files for subsequent requests, so whatever formatting you have in your file is irrelevant after the first parse.
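To make that concrete, here is a hedged illustration (a WordPress-style argument array, purely as an example): both versions produce exactly the same array and the same opcodes; only the source formatting differs.
<?php
// Same data, two formatting styles; no performance difference.
$args_single = array('post_type' => 'post', 'posts_per_page' => 5, 'orderby' => 'date');

$args_multi = array(
    'post_type'      => 'post',
    'posts_per_page' => 5,
    'orderby'        => 'date',
);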

Should a very long function/series of functions be in one php file, or broken up into smaller ones?

At the moment I am writing a series of functions for fetching Dota 2 matches from the Steam API. When someone fetches their games, I have to (for my use) take a history of all of their games (let's say 3 API calls), then all the details from each of those games (so if there are 200 games, another 200 API calls). This takes a long time, and so far I'm programming all of the above to be in one PHP file, "FetchMatchHistory.php", which is run by the user clicking a button on the web page.
Another thing that is making me feel it should be in one file is that I imagine it is probably good practice to put all of the information (in this case, match history, match details, IDs, etc.) into the database all at once, so that there don't have to be null values in the database.
My question is whether a function that takes a very long time should be in just one PHP file ("should" meaning: is generally considered good practice), or whether I should break the separate functions down into smaller files. This is very context dependent, I know, so please forgive me.
Is it common to have API calls spanning several PHP files if that is what you are making? Is there a security/reliability issue with having only one file doing all the leg-work (so to speak)?
Good practice is to have a number of related functions grouped together in a PHP file that describes them, both to organize them better and for caching reasons, since some parts get updated more slowly than others.
But speaking of performance, I doubt you'll get the improvements you seek just by moving code between files.
Personally, I used to have the habit of putting everything in one file, which consistently meant:
making my files fat
hard to update
hard to read
hard to find the thing I want (Ctrl+F meltdown)
wasting bandwidth uploading parts that did not need to be updated
virtually disabling caching on the server
I don't know if any of the above is of any use for your app, but breaking code out into relevant files/places made my life easier.
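As a hedged sketch of what that kind of grouping could look like for the match-fetching case (all file and function names below are invented for illustration, not part of the Steam API):
<?php
// Hypothetical layout: related functions grouped by topic, so a page
// pulls in only what it needs. File and function names are assumptions.
require_once __DIR__ . '/lib/steam_api.php';    // wrappers around the Steam API calls
require_once __DIR__ . '/lib/match_store.php';  // functions that write matches to the DB

$steamId = '76561197960287930';                 // example ID only
$history = fetch_match_history($steamId);       // assumed to live in steam_api.php
foreach ($history as $match) {
    save_match_details($match);                 // assumed to live in match_store.php
}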
UPDATE:
About the database practice: you're going to query only the parts you want to update.
I don't understand why you would split that logic across files; that's not going to give you performance. What will give you performance is updating only the relevant parts and having tables with relevant content. Having multiple tables makes a lot more sense, since you can use them as pointers to the large data contained in other tables, reducing the waste of data you get from having just one table.
Also, don't forget that a single table has limitations; I personally try to have as few columns as possible. Keep adding more and one day you can't add any because of the row-size limit. There is a maximum number of columns in general, but that limit rarely gets hit by a developer; it's the increased per-row content itself that eats up the row-size limit.
Whether to split server-side code into multiple files or keep it in a single one is an organizational issue more than a security/reliability one...
I don't think it's more secure to keep your code in separate source files.
It's entirely a matter of how you prefer to organize and maintain your code base.
Usually, I separate it when I can find some kind of "categories" in my code.
Obviously, if you write OO code, the most common choice is to keep each class in a single file...

PHP Loop Performance Optimization

I am writing a PHP function that will need to loop over an array of pointers and, for each item, pull in that data (be it from a MySQL database or flat file). Does anyone have any ideas for optimizing this, as there could potentially be thousands and thousands of iterations?
My first idea was to have a static array of cached data that I work on; any modifications would just change that cached array, and at the end I could flush it to disk. However, in a loop of over 1000 items, this would be useless if I only keep around 30 in the array. Each item isn't too big, but 1000+ of them in memory is way too much, hence the need for disk storage.
The data is just gzipped serialized objects. Currently I am using a database to store the data but I am thinking maybe flat files would be quicker (I don't care about concurrency issues and I don't need to parse it, just unzip and unserialize). I already have a custom iterator that will pull in 5 items at a time (to cut down on DB connections) and store them in this cache. But again, using a cache of 30 when I need to iterate over thousands is fairly useless.
Basically I just need a way to iterate over these many items quickly.
Well, you haven't given a whole lot to go on. You don't describe your data, and you don't describe what your data is doing or when you need one object as opposed to another, and how those objects get released temporarily, and under what circumstances you need it back, and...
So anything anybody says here is going to be a complete shot in the dark.
...so along those lines, here's a shot in the dark.
If you are only comfortable holding x items in memory at any one time, set aside space for x items. Then, every time you access the object, make a note of the time (this might not mean clock time so much as it may mean the order in which you access them). Keep each item in a list (it may not be implemented in a list, but rather as a heap-like structure) so that the most recently used items appear sooner in the list. When you need to put a new one into memory, you replace the one that was used the longest time ago and then you move that item to the front of the list. You may need to keep another index of the items so that you know where exactly they are in the list when you need them. What you do then is look up where the item is located, link its parent and child pointers as appropriate, then move it to the front of the list. There are probably other ways to optimize lookup time, too.
This is called the LRU algorithm. It's a page replacement scheme for virtual memory. What it does is delay your bottleneck (the disk I/O) until it's probably impossible to avoid. It is worth noting that this algorithm does not guarantee optimal replacement, but it performs pretty well nonetheless.
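A minimal LRU sketch in PHP, just to illustrate the idea (the capacity and the $loadItem callback are placeholders for however you actually fetch an item from disk or the DB):
<?php
// Minimal least-recently-used cache sketch; not the poster's actual code.
class LruCache
{
    private $items = array();   // key => value, least recently used first
    private $capacity;

    public function __construct($capacity)
    {
        $this->capacity = $capacity;
    }

    public function get($key, $loadItem)
    {
        if (array_key_exists($key, $this->items)) {
            $value = $this->items[$key];
            unset($this->items[$key]);          // re-appended below to mark as most recent
        } else {
            $value = $loadItem($key);           // cache miss: hit the disk/DB
            if (count($this->items) >= $this->capacity) {
                array_shift($this->items);      // evict the least recently used item
            }
        }
        $this->items[$key] = $value;            // most recent item now sits at the end
        return $value;
    }
}

// Usage: $cache = new LruCache(30);
//        $obj = $cache->get($id, function ($id) { /* unzip + unserialize from storage */ });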
Beyond that, I would recommend parallelizing your code to a large degree (if possible) so that when one item needs to hit the hard disk to load or to dump, you can keep that processor busy doing real work.
Edit: Based on your comment, you are working on a neural network. In the case of your initial feeding of the data (before the correction stage), or when you are actively using it to classify, I don't see how the algorithm is a bad idea, unless there is just no possible way to fit the most commonly used nodes in memory.
In the correction stage (perhaps back-prop?), it should be apparent what nodes you MUST keep in memory... because you've already visited them!
If your network is large, you aren't going to get away with no disk I/O. The trick is to find a way to minimize it.
Clearly, keeping it in memory is faster than anything else. How big is each item? Even if they are 1KB each, ten thousand of them is only 10MB.
You can always break out of the loop once you get the data you need, so that it does not keep looping. If you are storing flat files, your server's HDD will suffer from holding thousands or millions of files of different sizes. But if you are talking about whole files stored in a DB, then it is much better to store them in a folder and just save each file's path in the DB. Also try putting the pulled items into XML, so that they are much easier to access and can carry many attributes for the details of the item pulled, e.g. name, date uploaded, etc.
You could use memcached to store objects the first time they are read, then use the cached version on subsequent calls. Memcached uses RAM to store objects, so as long as you have enough memory you will get a great acceleration. There is a PHP API for memcached.
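A rough read-through sketch using the Memcached extension (the key scheme, TTL, and loadFromDatabase() are placeholders, not part of the poster's code):
<?php
// Check the cache first; fall back to the real fetch on a miss.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

function getItem(Memcached $memcached, $id)
{
    $key  = 'item_' . $id;
    $item = $memcached->get($key);

    if ($item === false && $memcached->getResultCode() === Memcached::RES_NOTFOUND) {
        $item = loadFromDatabase($id);       // placeholder for the existing DB/file read
        $memcached->set($key, $item, 3600);  // cache for an hour
    }
    return $item;
}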

Which is better performance in PHP?

I generally include one functions file in the header of my site. This site is pretty high traffic, and I just like to make every little thing the best that I can, so my question is:
Is it better to include multiple smaller function files with just the code that's needed for each page, or does it really make no difference to just load it all as one big file? My current functions file has all the functions for my whole site; it's about 4,000 lines long and is loaded on every single page load sitewide. Is that bad?
It's difficult to say. 4,000 lines isn't that large in the realms of file parsing. In terms of code management, that's starting to get on the unwieldy side, but you're not likely to see much of a measurable performance difference by breaking it up into 2, 5 or 10 files, and having pages include only the few they need (it's better coding practice, but that's a separate issue). Your differential in number-of-lines read vs. number-of-files that the parser needs to open doesn't seem large enough to warrant anything significant. My initial reaction is that this is probably not an issue you need to worry about.
On the opposite side of the coin, I worked on an enterprise-level project where some operations had an include() tree that often extended into the hundreds of files. Profiling these operations indicated that the time taken by the include() calls alone made up 2-3 seconds of a 10 second load operation (this was PHP4).
If you can install extensions on your server, you should take a look at APC.
It is free, by the way ;-) but you must be an admin of your server to install it, so it's generally not available on shared hosting...
It is what is called an "opcode cache".
Basically, when a PHP script is called, two things happen:
the script is "compiled" into opcodes
the opcodes are executed
APC keeps the opcodes in RAM, so the file doesn't have to be re-compiled each time it is called -- and that's a great thing for both CPU load and performance.
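If you're not sure whether it's already installed, a quick check from PHP (just a sketch):
<?php
// Rough check for whether the APC extension is loaded and enabled.
if (extension_loaded('apc') && ini_get('apc.enabled')) {
    echo 'APC opcode cache is active';
} else {
    echo 'APC is not available here';
}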
To answer the question a bit more:
4,000 lines is not that much, speaking of performance; open a couple of files from any big application/framework and you'll quickly get to a couple of thousand lines
a really important thing to take into account is maintainability: what will be easier to work with for you and your team?
loading many small files might imply many system calls, which are slow; but those would probably be cached by the OS... so probably not that relevant
If you are doing even one database query, that one (including the network round-trip between the PHP server and the DB server) will probably take more time than the parsing of a couple of thousand lines ;-)
I think it would be better if you could split the functions file up into components that are appropriate for each page, and include those components in the appropriate pages. Just my 2 cents!
P.S.: I'm a PHP amateur and I'm trying my hand at making a PHP site; I'm not using any functions. So can you enlighten me on what functions you would need for a site?
In my experience having a large include file which gets included everywhere can actually kill performance. I worked on a browser game where we had all game rules as dynamically generated PHP (among others) and the file weighed in at around 500 KiB. It definitely affected performance and we considered generating a PHP extension instead.
However, as usual, I'd say you should do what you're doing now until it is a performance problem and then optimize as needed.
If you load a 4000 line file and use maybe 1 function that is 10 lines, then yes I would say it is inefficient. Even if you used lots of functions of a combined 1000 lines, it is still inefficient.
My suggestion would be to group related functions together and store them in separate files. That way if a page only deals with, for example, database functions you can load just your database functions file/library.
Another reason for splitting the functions up is maintainability. If you need to change a function, you need to find it in your monolithic include file. You may also have functions that are very, very similar but not even realise it. Sorting functions by what they do allows you to compare them and get rid of things you don't need, or merge two functions into one more general-purpose function.
Most of the time disk I/O is what will kill your server, so I think the fewer files you fetch from disk the better. Furthermore, if it is possible to install APC, the file will be stored compiled in memory, which is a big win.
Generally it is better, file management wise, to break stuff down into smaller files because you only need to load the files that you actually use. But, at 4,000 lines, it probably won't make too much of a difference.
I'd suggest a solution similar to this:
function inc_lib($name)
{
    // pull in a single function library from the lib directory
    // (trailing slash added here so $name lands inside that directory)
    include("/path/to/lib/" . $name . ".lib.php");
}
function inc_class($name)
{
    // pull in a single class file from the lib directory
    include("/path/to/lib/" . $name . ".class.php");
}
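Usage would then look something like this (the names are just examples of whatever scheme you pick):
inc_lib('database');   // includes /path/to/lib/database.lib.php
inc_class('User');     // includes /path/to/lib/User.class.php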

Which one is less costly in terms of resources?

I'm on an optimization crusade for one of my sites, trying to cut out as many MySQL queries as I can.
I'm implementing partial caching, which writes .txt files for various modules of the site and updates them on demand. I've come across one module that cannot remain static for all users, so the .txt file that's written to disk will need to be altered on the fly via PHP.
Which is done via:
flush();
ob_start();                    // start capturing output
include('file.txt');           // pull the cached file into the buffer
$contents = ob_get_clean();    // grab the buffered contents and discard the buffer
Then I modify the html in the $contents variable, and echo it out for different users.
Alternatively, I can leave it as it is, which runs a MySQL query against a small table that has category names (about 13 of them).
Which one is less expensive? Running a query every single time, or using the method I posted above to inject HTML code on the fly into a static .txt file?
Reading the file (save in very weird setups) will be minutely faster than querying the DB (no network interaction, etc.), but the difference will hardly be measurable -- just try it and see if you can measure it!
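A crude way to actually measure it, assuming $pdo is an existing PDO connection and that 'file.txt' and the query below stand in for the real ones:
<?php
// Time 1,000 file reads against 1,000 queries; numbers are only indicative.
$start = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    $contents = file_get_contents('file.txt');
}
$fileTime = microtime(true) - $start;

$start = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    $rows = $pdo->query('SELECT name FROM categories')->fetchAll();
}
$dbTime = microtime(true) - $start;

printf("file: %.4fs  db: %.4fs\n", $fileTime, $dbTime);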
Optimize your queries first! Then use memcached or a similar caching system for data that is accessed frequently, and then you can add file caching. We use all three combined and it runs very smoothly. Small optimized queries aren't so bad. If your DB is on the local server, the network is not an issue. And don't forget to use the MySQL query cache (I guess you do use MySQL).
Where is your performance bottleneck?
If you don't know the bottleneck, you can't make any sensible assessment about optimisations.
Collect some metrics, and optimise accordingly.
Try both and choose the one that is either a clear winner or, failing that, more maintainable. This depends on where the DB is, how much load it's getting, and whether you'll need to run more than one application instance (then they'd need to share this file over the network and it's not local anymore).
Here are the patterns that work for me when I'm refactoring PHP/MySQL site code.
The number of queries per page is absolutely critical - one complex query with joins is fastest as long as indexes are proper. A single page can almost always be generated with five or fewer queries in my experience, plus good use of classes and arrays of classes. Often one query for the session and one query for the app.
After indexes the biggest thing to work on is the caching configuration parameters.
Never have queries in loops.
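For example (table and column names are hypothetical, and $pdo/$userIds are assumed to exist), replace the per-row query with a single set-based one:
<?php
// Anti-pattern: one query per iteration.
foreach ($userIds as $id) {
    $user = $pdo->query('SELECT * FROM users WHERE id = ' . (int) $id)->fetch();
}

// Better: one query that returns the whole set at once.
$in    = implode(',', array_map('intval', $userIds));
$users = $pdo->query("SELECT * FROM users WHERE id IN ($in)")->fetchAll();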
Moving database queries to files has never been a useful strategy, especially since it often ends up screwing up your query integrity.
Alex and the others are right about testing. If your pages are noticeably slow, then they are slow for a reason (or reasons); don't even start changing anything until you know what the reasons are and can measure the consequences of your changes. Refactoring by guessing is always a losing strategy, especially when (as in your case) you're adding complexity.
