How can I avoid reloading an XML document multiple times?

How can I avoid reloading an XML document multiple times? - php

tl;dr: I want to load an XML file once and reuse it over and over again.
I have a bit of javascript that makes an ajax request to a PHP page that collects and parses some XML and returns it for display (like, say there are 4,000 nodes and the PHP paginates the results into chunks of 100 you would have 40 "pages" of data). If someone clicks on one of those other pages (besides the one that initially loads) then another request is made, the PHP loads that big XML file, grabs that subset of indexes (like records 200-299) and returns them for display. My question is, is there a way to load that XML file only once and just reuse it over and over?
The process on each ajax request is:
- load the xml file (simplexml_load_file())
- parse out the bits needed (with xpath)
- use LimitIterator to grab the specific set of indexes I need
- return that set
When what I'd like it to be when someone requests a different paginated result is:
- use LimitIterator on the data I loaded in the previous request (reparse if needed)
- return that set
It seems (it is, right?) that hitting the XML file every time is a huge waste. How would I go about grabbing it and persisting it so that different pagination requests don't have to reload the file every time?

Just have your server do the reading and parsing of the paginated file based on the user input and feedback. Meaning it can be cached on the server much quicker than it would take the client to download and cache the entire XML document. Use PHP, Perl, ASP or what have you to paginate the data prior to displaying it to the user.

I believe the closest thing you are going to get is Memcached.
Although, I wouldn't worry about it, especially if it is a local file. include like operations are fairly cheap.

To the question "hitting the XML file every time is a huge waste" then answer is yes, if you have to parse that big XML file everytime. As I understand, you want to save the chunk the user is interested in so that you don't have to do that everytime. How about a very simple file cache? No extension required, fast, simple to use and maintain. Something like that:
function echo_results($start)
{
// IMPORTANT: make sure that $start is a valid number
$cache_file = '/path/to/cache/' . $start . '.xml';
$source = '/path/to/source.xml';
$mtime = filemtime($cache_file);
if (file_exists($cache_file)
&& filemtime($cache_file) < $mtime)
{
readfile($cache_file);
return;
}
$xml = get_the_results_chunk($start);
file_put_contents($cache_file, $xml);
echo $xml;
}
As an added bonus, you use the source file's last modification time so that you automatically ignore cached chunks that are older than their source.
You can even save it compressed and serve it as-is if the client supports gzip compression (IOW, 99% of browsers out there) or decompress it on-the-fly otherwise.

Could you load it into $_SESSION data? or would that blow out memory due to the size of the chunk?

Related

simplexml cache file

I am loading an XML file which is pretty heavy and contains a lot of data, and I am only using a small amount of it. This is causing my site to take ages to load.
http://www.footballadvisor.net/
Can anyone advice me on a resolution to this, is there a way in simplexml that I can cache the file for a period of time? My site is in PHP.
Thanks in advance
Richard

You wouldn't do it directly with simplexml. What you'd need to use is the other file functions (fopen or file_get_contents), save the file or extract the bits you need and save it. If enough time has passed since the last time it was checked (or if enough time passed that new data would be available) you could delete the cached data and check again.

Can you get a specific xml value without loading the full file?

I recently wrote a PHP plugin to interface with my phpBB installation which will take my users' Steam IDs, convert them into the community ids that Steam uses on their website, grab the xml file for that community id, get the value of avatarFull (which contains the link to the full avatar), download it via curl, resize it, and set it as the user's new avatar.
In effect it is syncing my forum's avatars with Steam's avatars (Steam is a gaming community/platform and I run a gaming clan). My issue is that whenever I am reading the value from the xml file it takes around a second for each user as it loads the entire xml file before searching for the variable and this causes the entire script to take a very long time to complete.
Ideally I want to have my script run several times a day to check each avatarFull value from Steam and check to see if it has changed (and download the file if it has), but it currently takes just too long for me to tie up everything to wait on it.
Is there any way to have the server serve up just the xml value that I am looking for without loading the entire thing?
Here is how I am calling the value currently:
$xml = #simplexml_load_file("http://steamcommunity.com/profiles/".$steamid."?xml=1");
$avatarlink = $xml->avatarFull;
And here is an example xml file: XML file

The file isn't big. Parsing it doesn't take much time. Your second is wasted mostly for network communication.
Since there is no way around this, you must implement a cache. Schedule a script that will run on your server every hour or so, looking for changes. This script will take a lot of time - at least a second for every user; several seconds if the picture has to be downloaded.
When it has the latest picture, it will store it in some predefined location on your server. The scripts that serve your webpage will use this location instead of communicating with Steam. That way they will work instantly, and the pictures will be at most 1 hour out-of-date.
Added: Here's an idea to complement this: Have your visitors perform AJAX requests to Steam and check if the picture has changed via JavaScript. Do this only for pictures that they're actually viewing. If it has, then you can immediately replace the outdated picture in their browser. Also you can notify your server who can then download the updated picture immediately. Perhaps you won't even need to schedule anything yourself.

You have to read the whole stream to get to the data you need, but it doesn't have to be kept in memory.
If I were doing this with Java, I'd use a SAX parser instead of a DOM parser. I could handle the few values I was interested in and not keep a large DOM in memory. See if there's something equivalent for you with PHP.

SimpleXml is a DOM parser. It will load and parse the entire document into memory before you can work with it. If you do not want that, use XMLReader which will allow you to process the XML while you are reading it from a stream, e.g. you could exit processing once the avatar was fetched.
But like other people already pointed out elsewhere on this page, with a file as small as shown, this is likely rather a network latency issue than an XML issue.
Also see Best XML Parser for PHP

that file looks small enough. It shouldn't take that long to parse. It probably takes that long because of some sort of network problem and the slowness of parsing.
If the network is your issue then no amount of trickery will help you :(.
If isn't the network then you could try a regex match on the input. That will probably be marginally faster.
Try this expression:
/<avatarFull><![CDATA[(.*?)]]><\/avatarFull>/
and read the link from the first group match.
You could try the SAX way of parsing (http://php.net/manual/en/book.xml.php) but as i said since the file is small i doubt it will really make a difference.

You can take advantage of caching the results of simplexml_load_file() somewhere like memcached or filesystem. Here is typical workflow:
check if XML file was processed during last N seconds
return processing results on success
on failure get results from simplexml
process them
resize images
store results in cache

Validation against 10K XSD - performance problem

I have an XSD scheme which has 10K lines. It takes 5 seconds to validate my XML with 500 lines. I get dynamically XML via POST from external server, on every click of the user on my homepage. The validation takes 5+ seconds, which is very much for every click of the user.
PHP Example:
$doc = new DOMDocument();
$doc->load('file.xml'); //100 to 500 lines
$doc->schemaValidate('schema.xsd'); //schema.xsd 10 000 lines
Do you have any idea how I can validate the XML against the XSD faster?

Some things to check:
Is the schema a local file, or are you fetching it over the network (e.g. via http: or file: to a mounted volume)?
Can you cache your schema? Many schema validation engines let you load the schema and cache it, and then do multiple validations against an internal representation.
What does your schema look like? 5 seconds for a 10K schema seems pretty slow.
What XML schema validator are you using?

You could create a subset of the XSD, which contains only the parts you need for your site. Validate against the full schema only after the final submit.

Use a different XML library and/or do your remote operation in the background and have the web read the latest cache.

Querying large XML file (600mb+) in PHP or JavaScript?

I have a large XML file (600mb+) and am developing a PHP application which needs to query this file.
My initial approach was to extract all the data from the file and insert it into a MySQL database - then query it that way. The only issue with this was that it was still slow, plus the XML data gets updated regularly - meaning I need to download, parse and insert data from the XML file into the database everytime the XML file is updated.
Is it actually possible to query a 600mb file? (for example, searching for records where TITLE="something here"?) Is it possible to get it to do this in a reasonable amount of time?
Ideally would like to do this in PHP, though I could also use JavaScript too.
Any help and suggestions appreciated :)

Constructing an XML DOM for a 600+ Mb document is definitely a way to fail. What you need is SAX-based API. SAX, though, does not usually allow XPath to be used, but you can emulate it with imperative code.
As for the file being updated, is it possible to retrieve only differences anyhow? That would massively speed up subsequent processing.

Generating ZIP files with PHP + Apache on-the-fly in high speed?

To quote some famous words:
“Programmers… often take refuge in an understandable, but disastrous, inclination towards complexity and ingenuity in their work. Forbidden to design anything larger than a program, they respond by making that program intricate enough to challenge their professional skill.”
While solving some mundane problem at work I came up with this idea, which I'm not quite sure how to solve. I know I won't be implementing this, but I'm very curious as to what the best solution is. :)
Suppose you have this big collection with JPG files and a few odd SWF files. With "big" I mean "a couple thousand". Every JPG file is around 200KB, and the SWFs can be up to a few MB in size. Every day there's a few new JPG files. The total size of all the stuff is thus around 1 GB, and is slowly but steadily increasing. Files are VERY rarely changed or deleted.
The users can view each of the files individually on the webpage. However there is also the wish to allow them to download a whole bunch of them at once. The files have some metadata attached to them (date, category, etc.) that the user can filter the collection by.
The ultimate implementation would then be to allow the user to specify some filter criteria and then download the corresponding files as a single ZIP file.
Since the amount of criteria is big enough, I cannot pre-generate all the possible ZIP files and must do it on-the-fly. Another problem is that the download can be quite large and for users with slow connections it's quite likely that it will take an hour or more. Support for "resume" is therefore a must-have.
On the bright side however the ZIP doesn't need to compress anything - the files are mostly JPEGs anyway. Thus the whole process shouldn't be more CPU-intensive than a simple file download.
The problems then that I have identified are thus:
PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
Will passing large amounts of file data through PHP not be a performance hit in itself?
How would you implement this? Is PHP up to the task at all?
Added:
By now two people have suggested to store the requested ZIP files in a temporary folder and serving them from there as usual files. While this is indeed an obvious solution, there are several practical considerations which make this infeasible.
The ZIP files will usually be pretty large, ranging from a few tens of megabytes to hundreads of megabytes. It's also completely normal for a user to request "everything", meaning that the ZIP file will be over a gigabyte in size. Also there are many possible filter combinations and many of them are likely to be selected by the users.
As a result, the ZIP files will be pretty slow to generate (due to sheer volume of data and disk speed), and will contain the whole collection many times over. I don't see how this solution would work without some mega-expensive SCSI RAID array.

This may be what you need:
http://pablotron.org/software/zipstream-php/
This lib allows you to build a dynamic streaming zip file without swapping to disk.

Use e.g. the PhpConcept Library Zip library.
Resuming must be supported by your webserver except the case where you don't make the zipfiles accessible directly. If you have a php script as mediator then pay attention to sending the right headers to support resuming.
The script creating the files shouldn't timeout ever just make sure the users can't select thousands of files at once. And keep something in place to remove "old zipfiles" and watch out that some malicious user doesn't use up your diskspace by requesting many different filecollections.

You're going to have to store the generated zip file, if you want them to be able to resume downloads.
Basically you generate the zip file and chuck it in a /tmp directory with a repeatable filename (hash of the search filters maybe). Then you send the correct headers to the user and echo file_get_contents to the user.
To support resuming you need to check out the $_SERVER['HTTP_RANGE'] value, it's format is detailed here and once your parsed that you'll need to run something like this.
$size = filesize($zip_file);
if(isset($_SERVER['HTTP_RANGE'])) {
//parse http_range
$range = explode( '-', $seek_range);
$new_length = $range[1] - $range[0]
header("HTTP/1.1 206 Partial Content");
header("Content-Length: $new_length");
header("Content-Range: bytes {$range[0]}-$range[1]");
echo file_get_contents($zip_file, FILE_BINARY, null, $range[0], $new_length);
} else {
header("Content-Range: bytes 0-$size");
header("Content-Length: ".$size);
echo file_get_contents($zip_file);
}
This is very sketchy code, you'll probably need to play around with the headers and the contents to the HTTP_RANGE variable a bit. You can use fopen and fwrite rather than file_get contents if you wish and just fseek to the right place.
Now to your questions
PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
You can remove it if you want to, however if something goes pear shaped and your code get stuck in an infinite loop at can lead to interesting problems should that infinite loop be logging and error somewhere and you don't notice, until a rather grumpy sys-admin wonders why their server ran out of hard disk space ;)
With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
Cache the file to the hard disk, means you wont have this problem.
Will passing large amounts of file data through PHP not be a performance hit in itself?
Yes it wont be as fast as a regular download from the webserver. But it shouldn't be too slow.

i have a download page, and made a zip class that is very similar to your ideas.
my downloads are very big files, that can't be zipped properly with the zip classes out there.
and i had similar ideas as you.
the approach to give up the compression is very good, with that you not even need fewer cpu resources, you save memory because you don't have to touch the input files and can pass it throught, you can also calculate everything like the zip headers and the end filesize very easy, and you can jump to every position and generate from this point to realize resume.
I go even further, i generate one checksum from all the input file crc's, and use it as an e-tag for the generated file to support caching, and as part of the filename.
If you have already download the generated zip file the browser gets it from the local cache instead of the server.
You can also adjust the download rate (for example 300KB/s).
One can make zip comments.
You can choose which files can be added and what not (for example thumbs.db).
But theres one problem that you can't overcome with the zip format completely.
Thats the generation of the crc values.
Even if you use hash-file to overcome the memory problem, or use hash-update to incrementally generate the crc, it will use to much cpu resources.
Not much for one person, but not recommend for professional use.
I solved this with an extra crc value table that i generate with an extra script.
I add this crc values per parameter to the zip class.
With this, the class is ultra fast.
Like a regular download script, as you mentioned.
My zip class is work in progress, you can have a look at it here: http://www.ranma.tv/zip-class.txt
I hope i can help someone with that :)
But i will discontinue this approach, i will reprogram my class to a tar class.
With tar i don't need to generate crc values from the files, tar only need some checksums for the headers, thats all.
And i don't need an extra mysql table any more.
I think it makes the class easier to use, if you don't have to create an extra crc table for it.
It's not so hard, because tars file structure is easier as the zip structure.
PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
If your script is safe and it closes on user abort, then you can remove it completely.
But it would be safer, if you just renew the timeout on every file that you pass throught :)
With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
Yes that would work.
I had generated a checksum from the input file crc's.
I used this as an e-tag and as part of the zip filename.
If something changed, the user can't resume the generated zip,
because the e-tag and filename changed together with the content.
Will passing large amounts of file data through PHP not be a performance hit in itself?
No, if you only pass throught it will not use much more then a regular download.
Maybe 0.01% i don't know, its not much :)
I assume because php don't do much with the data :)

You can use ZipStream or PHPZip, which will send zipped files on the fly to the browser, divided in chunks, instead of loading the entire content in PHP and then sending the zip file.
Both libraries are nice and useful pieces of code. A few details:
ZipStream "works" only with memory, but cannot be easily ported to PHP 4 if necessary (uses hash_file())
PHPZip writes temporary files on disk (consumes as much disk space as the biggest file to add in the zip), but can be easily adapted for PHP 4 if necessary.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.