Caching includes in PHP for iterated reuse


Is there a way to cache a PHP include effectively for reuse, without APC, et al?
Simple (albeit stupid) example:
// rand.php
return rand(0, 999);
// index.php
$file = 'rand.php';
while($i++ < 1000){
echo include($file);
}
Again, while ridiculous, this pair of scripts dumps 1000 random numbers. However, for every iteration, PHP has to hit the filesystem (Correct? There is no inherent caching functionality I've missed, is there?)
Basically, how can I prevent the previous scenario from resulting in 1000 hits to the filesystem?
The only consideration I've come to so far is a goofy one, and it may not prove effective at all (haven't tested, wrote it here, error prone, but you get the idea):
// rand.php
return rand(0, 999);
// index.php
$file = 'rand.php';
$cache = array();
while($i++ < 1000){
if(isset($cache[$file])){
echo eval('?>' . $cache[$file] . '<?php;');
}else{
$cache[$file] = file_get_contents($file);
echo include($file);
}
}
A more realistic and less silly example:
When including files for view generation, given a view file is used a number of times in a given request (a widget or something) is there a realistic way to capture and re-evaluate the view script without a filesystem hit?

This would only make any sense if the include file was accessed across a network.
There is no inherent caching functionality I've missed, is there?
All operating systems are very highly optimized to reduce the amount of physical I/O and to speed up file operations. On a properly configured system, in most cases, the system will rarely go to disk to fetch PHP code. Sit down with a spreadsheet and have a think about how long it would take to process PHP code if every file had to be fetched from disk - it'd be ridiculous, e.g. suppose your script is in /var/www/htdocs/index.php and includes /usr/local/php/resource.inc.php - that's 8 seek operations just to locate the files - at 8 ms each, that's 64 ms to find the files! Run some timings on your test case - you'll see that it's running much, much faster than that.

As with Sabeen Malik's answer you could capture the output of the include with output buffering, then concat all of them together, then save that to a file and include the one file each time.
This one collective include could be kept for an hour by checking the file's mod time and then rewriting and re including the includes only once an hour.

I think better design would be something like this:
// rand.php
function get_rand() {
return rand(0, 999);
}
// index.php
$file = 'rand.php';
include($file);
while($i++ < 1000){
echo get_rand();
}
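
For the view/widget case from the question, the same idea can be sketched as follows. This is only a minimal sketch, assuming the view file can be rewritten to return a closure; load_callable() and the throwaway view file are hypothetical names for illustration, not part of the original code. The file is included once per request and the returned closure is reused for every later render.

```php
<?php
// Hypothetical helper: include a PHP file that returns a callable,
// and remember the result so the file is only read once per request.
function load_callable(string $file): callable
{
    static $cache = [];
    if (!isset($cache[$file])) {
        $cache[$file] = include $file; // the filesystem is touched only here
    }
    return $cache[$file];
}

// Demo: write a throwaway "view" file that returns a closure.
$view = tempnam(sys_get_temp_dir(), 'view');
file_put_contents(
    $view,
    '<?php return function (array $vars) { return "<li>" . $vars["name"] . "</li>"; };'
);

$html = '';
foreach (['a', 'b', 'c'] as $name) {
    $render = load_callable($view); // cached after the first iteration
    $html .= $render(['name' => $name]);
}
unlink($view);
echo $html, "\n";
```

The trade-off is that the view must be written as a function body rather than as plain procedural code, which is the same design change the answer above suggests.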

Another option:
while($i++ < 1000) echo rand(0, 999);

Related

Apache/PHP using 100% CPU while trying to free cache space

I created a script for use with my website that is supposed to erase the oldest entry in cache when a new item needs to be cached. My website is very large with 500,000 photos on it and the cache space is set to 2 GB.
These functions are what cause the trouble:
function cache_tofile($fullf, $c)
{
error_reporting(0);
if(strpos($fullf, "/") === FALSE)
{
$fullf = "./".$fullf;
}
$lp = strrpos($fullf, "/");
$fp = substr($fullf, $lp + 1);
$dp = substr($fullf, 0, $lp);
$sz = strlen($c);
cache_space_make($sz);
mkdir($dp, 0755, true);
cache_space_make($sz);
if(!file_exists($fullf))
{
$h = @fopen($fullf, "w");
if(flock($h, LOCK_EX))
{
ftruncate($h, 0);
rewind($h);
$tmo = 1000;
$cc = 1;
$i = fputs($h, $c);
while($i < strlen($c) || $tmo-- > 1)
{
$c = substr($c, $i);
$i = fwrite($h, $c);
}
flock($h, LOCK_UN);
fclose($h);
}
}
error_reporting(7);
}
function cache_space_make($sz)
{
$ct = 0;
$cf = cachefolder();
clearstatcache();
$fi = shell_exec("df -i ".$cf." | tail -1 | awk -F\" \" '{print \$4}'");
if($fi < 1)
{
return;
}
if(($old = disk_free_space($cf)) === false)
{
return;
}
while($old < $sz)
{
$ct++;
if($ct > 10000)
{
error_log("Deleted over 10,000 files. Is disk screwed up?");
break;
}
$fi = shell_exec("rm \$(find ".$cf."cache -type f -printf '%T+ %p\n' | sort | head -1 | awk -F\" \" '{print \$2}');");
clearstatcache();
$old = disk_free_space($cf);
}
}
cachefolder() is a function that returns the correct folder name with a / appended to it.
When the functions are executed, the CPU usage for apache is between 95% and 100% and other services on the server are extremely slow to access during that time. I also noticed in whm that cache disk usage is at 100% and refuses to drop until I clear the cache. I was expecting more like maybe 90ish%.
What I am trying to do with the cache_tofile function is attempt to free disk space in order to create a folder then free disk space to make the cache file. The cache_space_make function takes one parameter representing the amount of disk space to free up.
In that function I use system calls to try to find the oldest file in the directory tree of the entire cache and I was unable to find native php functions to do so.
The cache file format is as follows:
/cacherootfolder/requestedurl
For example, if one requests http://www.example.com/abc/def then from both functions, the folder that is supposed to be created is abc and the file is then def so the entire file in the system will be:
/cacherootfolder/abc/def
If one requests http://www.example.com/111/222 then the folder 111 is created and the file 222 will be created
/cacherootfolder/111/222
Each file in both cases contain the same content as what the user requests based on the url. (example: /cacherootfolder/111/222 contains the same content as what one would see when viewing source from http://www.example.com/111/222)
The intent of the caching system is to deliver all web pages at optimal speed.
My question then is how do I prevent the system from trying to lockup when the cache is full. Is there better code I can use than what I provided?

I would start by replacing the || in your code by &&, which was most likely the intention.
Currently, the loop will always run at least 1000 times - I very much hope the intention was to stop trying after 1000 times.
Also, drop the ftruncate and rewind.
From the PHP Manual on fopen (emphasis mine):
'w' Open for writing only; place the file pointer at the beginning of the file and truncate the
file to zero length. If the file does not exist, attempt to create it.
So your truncate is redundant, as is your rewind.
Next, review your shell_exec's.
The one outside the loop doesn't seem too much of a bottleneck to me, but the one inside the loop...
Let's say you have 1'000'000 files in that cache folder.
find will happily list all of them for you, no matter how long it takes.
Then you sort that list.
And then you flush 999'999 entries of that list down the toilet, and only keep the first one.
Then you do some stuff with awk that I don't really care about, and then you delete the file.
On the next iteration, you'll only have to go through 999'999 files, of which you discard only 999'998.
See where I'm going?
I consider calling shell scripts out of pure convenience bad practice anyway, but if you do it, do it as efficiently as possible, at least!
Do one shell_exec without head -1, store the resulting list in a variable, and iterate over it.
Although it might be better to abandon shell_exec altogether and instead program the corresponding routines in PHP (one could argue that find and rm are machine code, and therefore faster than code written in PHP to do the same task, but there sure is a lot of overhead for all that IO redirection).
Please do all that, and then see how badly it still performs.
If the results are still unacceptable, I suggest you put in some code to measure the time certain parts of those functions require (tip: microtime(true)) or use a profiler, like XDebug, to see where exactly most of your time is spent.
Also, why did you turn off error reporting for that block? Looks more than suspicious to me.
And as a little bonus, you can get rid of $cc since you're not using it anywhere.
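
A PHP-only version of the "list once, then iterate" idea might look like the sketch below. It is a rough illustration under stated assumptions, not a drop-in replacement: cache_files_oldest_first() and cache_free_bytes() are hypothetical names, and the sketch counts the bytes it has freed instead of polling disk_free_space() as the original loop does.

```php
<?php
// Hypothetical helper: walk the cache tree ONCE and return all file paths,
// oldest modification time first.
function cache_files_oldest_first(string $dir): array
{
    $files = [];
    $it = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($it as $info) {
        if ($info->isFile()) {
            $files[$info->getPathname()] = $info->getMTime();
        }
    }
    asort($files); // sort by mtime, ascending
    return array_keys($files);
}

// Hypothetical helper: delete the oldest files until at least $bytes have
// been freed or $maxDeletes files are gone. Returns the number of bytes freed.
function cache_free_bytes(string $dir, int $bytes, int $maxDeletes = 10000): int
{
    $freed = 0;
    $deleted = 0;
    foreach (cache_files_oldest_first($dir) as $path) {
        if ($freed >= $bytes || $deleted >= $maxDeletes) {
            break;
        }
        $size = filesize($path);
        if ($size !== false && unlink($path)) {
            $freed += $size;
            $deleted++;
        }
    }
    return $freed;
}
```

With a very large cache the single walk is still expensive, so it is probably worth running it from a scheduled job rather than inside a page request.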

creating only new files in PHP without cpu intensive code

In my cache system, when a new page is requested, I want a check to be made to see if a file exists. If it doesn't, a copy is stored on the server. If it does exist, it must not be overwritten.
The problem I have is that I may be using functions designed to be slow.
This is part of my current implementation to save files:
if (!file_exists($filename)) {
$h = fopen($filename, "wb");
if ($h) {
fwrite($h, $c);
fclose($h);
}
}
This is part of my implementation to load files:
if (($m=@filemtime($file)) !== false){
if ($m >= filemtime("sitemodification.file")){
$outp=file_get_contents($file);
header("Content-length:".strlen($outp),true);echo $outp;flush();exit();
}
}
What I want to do is replace this with a better set of functions meant for performance and yet still achieve the same functionality. All caching files including sitemodification.file reside on a ramdisk. I added a flush before exit in hopes that content will be outputted faster.
I can't use direct memory addressing at this time because the file sizes to be stored are all different.
Is there a set of functions I can use that can execute the code I provided faster by at least a few milliseconds, especially the loading files code?
I'm trying to keep my time to first byte low.

First, prefer is_file to file_exists and use file_put_contents:
if ( !is_file($filename) ) {
file_put_contents($filename,$c);
}
Then, use the proper function for this kind of work, readfile:
if ( ($m = @filemtime($file)) !== false && $m >= filemtime('sitemodification.file') ) {
header('Content-length:'.filesize($file));
readfile($file);
exit(); // stop here, as the original code does, so the page is not generated again
}
You should see a little improvement, but keep in mind that file accesses are slow and you touch the filesystem three times before sending any content.
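
For the "must not be overwritten" requirement, the existence check and the write can be folded into one call with fopen()'s 'x' mode, which refuses to open a file that already exists. The following is a minimal sketch; cache_write_once() is a hypothetical name, and whether it is measurably faster on a ramdisk is an assumption that would need timing.

```php
<?php
// Hypothetical helper: write $content to $filename only if the file does not
// exist yet. Returns true if this call created and fully wrote the file.
function cache_write_once(string $filename, string $content): bool
{
    // Mode 'x' makes fopen() fail (false plus an E_WARNING, suppressed here)
    // when the file already exists, so no separate existence check is needed.
    $h = @fopen($filename, 'xb');
    if ($h === false) {
        return false;
    }
    $ok = fwrite($h, $content) === strlen($content);
    fclose($h);
    return $ok;
}
```

This also closes the small window between checking for the file and creating it, where two requests could otherwise both decide to write.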

Download a large XML file from an external source in the background, with the ability to resume download if incomplete

Some background information
The files I would like to download are kept on the external server for a week, and a new XML file (10-50 MB) is created there every hour with a different name. I would like the large file to be downloaded to my server chunk by chunk in the background each time my website is loaded, perhaps 0.5 MB each time, and then resume the download the next time someone else loads the website. This would require my site to have at least 100 page loads each hour to stay updated, so perhaps a bit more of the file each time if possible. I have researched SimpleXML, XMLReader, and SAX parsing, but whatever I do, it seems to take too long to parse the file directly; therefore I would like a different approach, namely downloading it as described above.
If I download a 30 MB XML file, I can parse it locally with XMLReader in only 3 seconds (250k iterations), but when I try to do the same from the external server, limiting it to 50k iterations, it takes 15 seconds to read that small part, so it seems it would not be possible to parse it directly from that server.
Possible solutions
I think it's best to use cURL. But then again, perhaps fopen(), fsockopen(), copy() or file_get_contents() are the way to go. I'm looking for advice on what functions to use to make this happen, or different solutions on how I can parse a 50mb external XML file into a mySQL database.
I suspect a cron job every hour would be the best solution, but I am not sure how well that would be supported by web hosting companies, and I have no clue how to do something like that. But if that's the best solution, and the majority thinks so, I will have to do my research in that area too.
If a Java applet/JavaScript running in the background would be a better solution, please point me in the right direction when it comes to functions/methods/libraries there as well.
Summary
What's the best solution to downloading parts of a file in the
background, and resume the download each time my website is loaded
until its completed?
If the above solution would be moronic to even try, what
language/software would you use to achieve the same thing(download a large file every hour)?
Thanks in advance for all answers, and sorry for the long story/question.
Edit: I ended up using this solution to get the files with cron job scheduling a php script. It checks my folder for what files I already have, generates a list of the possible downloads for the last four days, then downloads the next XMLfile in line.
<?php
$date = new DateTime();
$current_time = $date->getTimestamp();
$four_days_ago = $current_time-345600;
echo 'Downloading: '."\n";
for ($i=$four_days_ago; $i<=$current_time; ) {
$date->setTimestamp($i);
if($date->format('H') !== '00') {
$temp_filename = $date->format('Y_m_d_H') ."_full.xml";
if(!glob($temp_filename)) {
$temp_url = 'http://www.external-site-example.com/'.$date->format('Y/m/d/H') .".xml";
echo $temp_filename.' --- '.$temp_url.'<br>'."\n";
break; // with a break here, this loop will only return the next file you should download
}
}
$i += 3600;
}
set_time_limit(300);
$Start = getTime();
$objInputStream = fopen($temp_url, "rb");
$objTempStream = fopen($temp_filename, "w+b");
stream_copy_to_stream($objInputStream, $objTempStream, (1024*200000));
$End = getTime();
echo '<br>It took '.number_format(($End - $Start),2).' secs to download "'.$temp_filename.'".';
function getTime() {
$a = explode (' ',microtime());
return(double) $a[0] + $a[1];
}
?>
edit2: I just wanted to inform you that there is a way to do what I asked, only it wouldn't work in my case. With the amount of data I need, the website would have to have 400+ visitors an hour for it to work properly. But with smaller amounts of data there are some options: http://www.google.no/search?q=poormanscron

You need to have a scheduled, offline task (e.g., cronjob). The solution you are pursuing is just plain wrong.
The simplest thing that could possibly work is a php script you run every hour (scheduled via cron, most likely) that downloads the file and processes it.
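
If the hourly script should also be able to continue an incomplete file, one possible approach is an HTTP Range request. The sketch below assumes the external server supports Range requests (it answers 206 Partial Content) and that allow_url_fopen is enabled; range_header() and resume_download() are hypothetical names. If the server ignores the Range header, the file is simply downloaded again from the start.

```php
<?php
// Hypothetical helper: the request header that asks the server to continue
// after the bytes we already have on disk.
function range_header(int $bytesOnDisk): string
{
    return 'Range: bytes=' . $bytesOnDisk . '-';
}

// Hypothetical helper: download $url to $dest, appending to a partial file
// when the server honours the Range request.
function resume_download(string $url, string $dest): bool
{
    $offset = is_file($dest) ? filesize($dest) : 0;
    $ctx = stream_context_create(['http' => ['header' => range_header($offset)]]);
    $in = @fopen($url, 'rb', false, $ctx);
    if ($in === false) {
        return false; // unreachable, or the server rejected the request
    }
    // Find the final HTTP status line (there may be several after redirects).
    $status = 0;
    foreach (stream_get_meta_data($in)['wrapper_data'] ?? [] as $line) {
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    // 206 means "here is the rest": append. Anything else: start over.
    $out = fopen($dest, $status === 206 ? 'ab' : 'wb');
    if ($out === false) {
        fclose($in);
        return false;
    }
    $copied = stream_copy_to_stream($in, $out);
    fclose($in);
    fclose($out);
    return $copied !== false;
}
```

A cron-driven script could call resume_download($temp_url, $temp_filename) for the next missing file and call it again on the following run if the transfer was cut short.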

You could try fopen:
<?php
$handle = fopen("http://www.example.com/test.xml", "rb");
$contents = stream_get_contents($handle);
fclose($handle);
?>

How can I optimize this simple PHP script?

This first script gets called several times for each user via an AJAX request. It calls another script on a different server to get the last line of a text file. It works fine, but I think there is a lot of room for improvement but I am not a very good PHP coder, so I am hoping with the help of the community I can optimize this for speed and efficiency:
AJAX POST Request made to this script
<?php session_start();
$fileName = $_POST['textFile'];
$result = file_get_contents($_SESSION['serverURL']."fileReader.php?textFile=$fileName");
echo $result;
?>
It makes a GET request to this external script which reads a text file
<?php
$fileName = $_GET['textFile'];
if (file_exists('text/'.$fileName.'.txt')) {
$lines = file('text/'.$fileName.'.txt');
echo $lines[sizeof($lines)-1];
}
else{
echo 0;
}
?>
I would appreciate any help. I think there is more improvement that can be made in the first script. It makes an expensive function call (file_get_contents), well at least I think its expensive!

This script should limit the locations and file types that it's going to return.
Think of somebody trying this:
http://www.yoursite.com/yourscript.php?textFile=../../../etc/passwd (or something similar)
Try to find out where the delays occur: does the HTTP request take long, or is the file so large that reading it takes long?
If the request is slow, try caching results locally.
If the file is huge, then you could set up a cron job that extracts the last line of the file at regular intervals (or at every change), and save that to a file that your other script can access directly.

readfile is your friend here
it reads a file on disk and streams it to the client.
script 1:
<?php
session_start();
// added basic argument filtering
$fileName = preg_replace('/[^A-Za-z0-9_]/', '', $_POST['textFile']);
$fileName = $_SESSION['serverURL'].'text/'.$fileName.'.txt';
if (file_exists($fileName)) {
// script 2 could be pasted here
//for the entire file
//readfile($fileName);
//for just the last line
$lines = file($fileName);
echo $lines[count($lines)-1];
exit(0);
}
echo 0;
?>
This script could further be improved by adding caching to it. But that is more complicated.
The very basic caching could be.
script 2:
<?php
$lastModifiedTimeStamp = filemtime($fileName);
if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])) {
$browserCachedCopyTimestamp = strtotime(preg_replace('/;.*$/', '', $_SERVER['HTTP_IF_MODIFIED_SINCE']));
if ($browserCachedCopyTimestamp >= $lastModifiedTimeStamp) {
header("HTTP/1.0 304 Not Modified");
exit(0);
}
}
header('Content-Length: '.filesize($fileName));
header('Expires: '.gmdate('D, d M Y H:i:s \G\M\T', time() + 604800)); // (3600 * 24 * 7)
header('Last-Modified: '.gmdate('D, d M Y H:i:s \G\M\T', $lastModifiedTimeStamp));
?>

First things first: Do you really need to optimize that? Is that the slowest part in your use case? Have you used xdebug to verify that? If you've done that, read on:
You cannot really optimize the first script usefully: If you need a http-request, you need a http-request. Skipping the http request could be a performance gain, though, if it is possible (i.e. if the first script can access the same files the second script would operate on).
As for the second script: Reading the whole file into memory does look like some overhead, but that is negligible if the files are small. The code looks very readable; I would leave it as is in that case.
If your files are big, however, you might want to use fopen() and its friends fseek() and fread()
# Do not forget to sanitize the file name here!
# An attacker could demand the last line of your password
# file or similar! ($fileName = '../../passwords.txt')
$filePointer = fopen($fileName, 'r');
$i = 1;
$chunkSize = 200;
# Read 200 byte chunks from the file and check if the chunk
# contains a newline
do {
fseek($filePointer, -($i * $chunkSize), SEEK_END);
$line = fread($filePointer, $i++ * $chunkSize);
} while (($pos = strrpos($line, "\n")) === false);
return substr($line, $pos + 1);
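
The snippet above can run into trouble when the file is shorter than one chunk or ends with a newline. A slightly more defensive variant of the same read-from-the-end idea is sketched below; last_line() is a hypothetical name, and treating trailing line breaks as "not part of the last line" is an assumption about the file format.

```php
<?php
// Hypothetical helper: return the last non-empty line of $path by reading
// fixed-size chunks backwards from the end of the file.
function last_line(string $path, int $chunkSize = 200): string
{
    $fp = @fopen($path, 'rb');
    if ($fp === false) {
        return '';
    }
    fseek($fp, 0, SEEK_END);
    $pos = ftell($fp);
    $buffer = '';
    while ($pos > 0) {
        $read = min($chunkSize, $pos); // never seek before the file start
        $pos -= $read;
        fseek($fp, $pos);
        $buffer = fread($fp, $read) . $buffer;
        // Ignore trailing line breaks so a final "\n" does not count as a line.
        $trimmed = rtrim($buffer, "\r\n");
        $nl = strrpos($trimmed, "\n");
        if ($nl !== false) {
            fclose($fp);
            return substr($trimmed, $nl + 1);
        }
    }
    fclose($fp);
    return rtrim($buffer, "\r\n"); // the whole file is a single line (or empty)
}
```

The file name still has to be sanitized before it reaches this function, as noted in the comments above.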

If the files are unchanging, you should cache the last line.
If the files are changing and you control the way they are produced, it might or might not be an improvement to reverse the order lines are written, depending on how often a line is read over its lifetime.
Edit:
Your server could figure out what it wants to write to its log, put it in memcache, and then write it to the log. The request for the last line could be fulfilled from memcache instead of a file read.

The most probable source of delay is that cross-server HTTP request. If the files are small, the cost of fopen/fread/fclose is nothing compared to the whole HTTP request.
(Not long ago I used HTTP to retrieve images to dynamically generate image-based menus. Replacing the HTTP request with a local file read reduced the delay from seconds to tenths of a second.)
I assume that the obvious solution of accessing the file server's filesystem directly is out of the question. If it isn't, then it's the best and simplest option.
Otherwise, you could use caching. Instead of getting the whole file, you just issue a HEAD request and compare the timestamp to a local copy.
Also, if you are AJAX-updating a lot of clients based on the same files, you might consider using Comet (Meteor, for example). It's used for things like chats, where a single change has to be broadcast to several clients.

Is the same file tokenized every time I include it?

This question is about the PHP parsing engine.
When I include a file multiple times in a single runtime, does PHP tokenize it every time or does it keep a cache and just run the compiled code on subsequent inclusions?
EDIT: More details: I am not using an external caching mechanism and I am dealing with the same file being included multiple times during the same request.
EDIT 2: The file I'm trying to include contains procedural code. I want it to be executed every time I include() it, I am just curious if PHP internally keeps track of the tokenized version of the file for speed reasons.

You should use a PHP bytecode cache such as APC. That will accomplish what you want, to re-use a compiled version of a PHP page on subsequent requests. Otherwise, PHP reads the file, tokenizes and compiles it on every request.

By default the file is parsed every time it is (really) included, even within the same php instance.
But there are opcode caches like e.g. apc
<?php
$i = 'include_test.php';
file_put_contents($i, '<?php $x = 1;');
include $i;
echo $x, ' ';
file_put_contents($i, '<?php $x = 2;');
include $i;
echo $x, ' ';
This prints: 1 2
(OK, weak proof. PHP could check whether the file's mtime has changed, and that is what APC does, I think. But without a cache PHP really doesn't.)

Look at include_once().
It will not include the file again if it has already been included.
Also if you are using objects. Look at __autoload()

I just wrote a basic test, much like VolkerK's. Here's what I tested:
<?php
file_put_contents('include.php','<?php echo $i . "<br />"; ?>');
for($i = 0; $i<10; $i++){
include('include.php');
if($i == 5){
file_put_contents('include.php','<?php echo $i+$i; echo "<br />"; ?>');
}
}
?>
This generated the following:
0
1
2
3
4
5
12
14
16
18
So, unless it caches based on mtime of the file, it seems it parses every include. You would likely want to use include_once() instead of standard include(). Hope that helps!
