Can you get a specific xml value without loading the full file? - php

I recently wrote a PHP plugin to interface with my phpBB installation which will take my users' Steam IDs, convert them into the community ids that Steam uses on their website, grab the xml file for that community id, get the value of avatarFull (which contains the link to the full avatar), download it via curl, resize it, and set it as the user's new avatar.
In effect it is syncing my forum's avatars with Steam's avatars (Steam is a gaming community/platform and I run a gaming clan). My issue is that whenever I am reading the value from the xml file it takes around a second for each user as it loads the entire xml file before searching for the variable and this causes the entire script to take a very long time to complete.
Ideally I want to have my script run several times a day to check each avatarFull value from Steam and check to see if it has changed (and download the file if it has), but it currently takes just too long for me to tie up everything to wait on it.
Is there any way to have the server serve up just the xml value that I am looking for without loading the entire thing?
Here is how I am calling the value currently:
$xml = @simplexml_load_file("http://steamcommunity.com/profiles/".$steamid."?xml=1");
$avatarlink = $xml->avatarFull;
And here is an example xml file: XML file

The file isn't big. Parsing it doesn't take much time. Your second is wasted mostly for network communication.
Since there is no way around this, you must implement a cache. Schedule a script that will run on your server every hour or so, looking for changes. This script will take a lot of time - at least a second for every user; several seconds if the picture has to be downloaded.
When it has the latest picture, it will store it in some predefined location on your server. The scripts that serve your webpage will use this location instead of communicating with Steam. That way they will work instantly, and the pictures will be at most 1 hour out-of-date.
Added: Here's an idea to complement this: Have your visitors perform AJAX requests to Steam and check if the picture has changed via JavaScript. Do this only for pictures that they're actually viewing. If it has, then you can immediately replace the outdated picture in their browser. Also you can notify your server who can then download the updated picture immediately. Perhaps you won't even need to schedule anything yourself.
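A minimal sketch of the scheduled check described in this answer, assuming the last-seen avatarFull value for each user is kept in a small JSON file. The function name syncAvatarIfChanged, the state-file layout, and the $download callback are all hypothetical; in the real script the callback would do the curl download and the resize.

```php
<?php
// Hypothetical helper: remembers the last avatar URL seen for each Steam ID
// and invokes $download only when the URL differs from the stored one.
function syncAvatarIfChanged(string $steamId, string $avatarUrl, string $stateFile, callable $download): bool
{
    $state = [];
    if (is_file($stateFile)) {
        $decoded = json_decode((string) file_get_contents($stateFile), true);
        if (is_array($decoded)) {
            $state = $decoded;
        }
    }

    if (($state[$steamId] ?? null) === $avatarUrl) {
        return false; // unchanged since the last run, nothing to download
    }

    $download($avatarUrl);         // assumed to fetch and resize the image
    $state[$steamId] = $avatarUrl; // record the new URL for the next run
    file_put_contents($stateFile, json_encode($state), LOCK_EX);
    return true;
}
```

The network round trip to Steam still happens once per user on every run; what this saves is the image download and resize for users whose avatar has not changed.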

You have to read the whole stream to get to the data you need, but it doesn't have to be kept in memory.
If I were doing this with Java, I'd use a SAX parser instead of a DOM parser. I could handle the few values I was interested in and not keep a large DOM in memory. See if there's something equivalent for you with PHP.

SimpleXml is a DOM parser. It will load and parse the entire document into memory before you can work with it. If you do not want that, use XMLReader which will allow you to process the XML while you are reading it from a stream, e.g. you could exit processing once the avatar was fetched.
But like other people already pointed out elsewhere on this page, with a file as small as shown, this is likely rather a network latency issue than an XML issue.
Also see Best XML Parser for PHP
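The XMLReader approach from this answer could look roughly like the following. It is only a sketch: readAvatarFull is a made-up name, and the $uri argument is assumed to be either the Steam profile URL or a local file path.

```php
<?php
// Hypothetical helper: streams the document at $uri and returns the text of
// the first <avatarFull> element, or null if it is missing or unreadable.
function readAvatarFull(string $uri): ?string
{
    $reader = new XMLReader();
    if (!@$reader->open($uri)) {
        return null;
    }

    $avatar = null;
    while (@$reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'avatarFull') {
            $avatar = $reader->readString(); // text content, CDATA included
            break; // stop parsing; the rest of the document is not needed
        }
    }
    $reader->close();
    return $avatar;
}
```

Stopping early saves parsing work and memory, but the HTTP request itself still has to be made, so the latency noted above remains.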

That file looks small enough; it shouldn't take that long to parse. It probably takes that long because of some sort of network problem, not because parsing is slow.
If the network is your issue then no amount of trickery will help you :(.
If it isn't the network, then you could try a regex match on the input. That will probably be marginally faster.
Try this expression:
/<avatarFull><!\[CDATA\[(.*?)\]\]><\/avatarFull>/
and read the link from the first group match.
You could try the SAX way of parsing (http://php.net/manual/en/book.xml.php), but as I said, since the file is small I doubt it will really make a difference.
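As a sketch of the regex route, with the square brackets of the CDATA marker escaped so they are matched literally rather than read as a character class. The function name extractAvatarFull is made up for illustration, and the pattern assumes the value is wrapped in CDATA as in the example file.

```php
<?php
// Hypothetical helper: pulls the avatarFull link out of the raw XML text.
// The brackets of the CDATA marker are escaped so they match literally.
function extractAvatarFull(string $xml): ?string
{
    $pattern = '/<avatarFull><!\[CDATA\[(.*?)\]\]><\/avatarFull>/s';
    return preg_match($pattern, $xml, $matches) === 1 ? $matches[1] : null;
}
```

The raw text could come from file_get_contents() on the profile URL; the first capture group holds the link.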

You can take advantage of caching the results of simplexml_load_file() somewhere like memcached or filesystem. Here is typical workflow:
check if XML file was processed during last N seconds
return processing results on success
on failure get results from simplexml
process them
resize images
store results in cache
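The workflow above can be sketched with a plain filesystem cache. Everything named here (cachedResult, the $producer callback, the .cache file naming) is an assumption for illustration; memcached would follow the same check-then-store pattern.

```php
<?php
// Hypothetical helper: returns the cached value for $key if it was stored less
// than $ttl seconds ago; otherwise calls $producer and stores its result.
function cachedResult(string $key, int $ttl, callable $producer, string $cacheDir)
{
    $file = $cacheDir . '/' . sha1($key) . '.cache';

    // Cache hit: the entry exists and is younger than $ttl seconds.
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return unserialize(file_get_contents($file));
    }

    // Cache miss: do the slow work (fetch, parse, resize) and store the result.
    $result = $producer();
    file_put_contents($file, serialize($result), LOCK_EX);
    return $result;
}
```

In the avatar script, $producer would wrap the simplexml_load_file() call and return only the values worth keeping, such as the avatarFull link as a string, since SimpleXMLElement objects cannot be serialized.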

Related

Image Scraping hogging server resources/connections?

As part of my web app, I built a system that periodically pulls an RSS feed and scrapes its content. I also look for any image tags present in the feed item, and attempt to pull it to query its size and such to determine which "picture" to use.
Here is a rough sketch of that part of the code:
Is there an <image> node? If so, that is the image. Exit.
Parse the content of the description node through simplehtmldom and look for any and all img tags
Iterate through all img tags:
getimagesize();
If the image size is greater than one I found earlier, use this picture.
Exit.
At step 3, the script can take awhile, especially for feeds that have lots of images for me to check. I assume that each call to getimagesize() takes a certain amount of time and it adds up quickly. I'm not too worried about it taking a long time (although if it could be reduced, that would be best), but the fact that while this script is running, it effectively leaves all other concurrent users hanging until the script has finished.
I'd like to avoid this, but am not too proficient at server admin - perhaps someone could give me some guiding pointers?
Thanks!
Run it on a separate server if you need the performance boost. getimagesize() can really slow things down. I'd recommend running the scraping script on its own server and hosting everything else on your current server.

Simple HTML Dom save file with Cron Job once a day, then access that saved file

I am using SimpleHTMLDom to get some info off of a RSS feed. This data is only updated once a day around 7am. I would like to use the feature $html->save('result.htm'); Then have my page just load the result.htm file instead of running the parse each time I look at the page.
I guess I am wondering, would this be a good idea? Would it really speed the page load time up that much? Would using a cache be similar or maybe better?
(this question almost addresses this)
Yes, it would be a good idea, and you couldn't get any faster (unless you load the page into webserver memory and serve it from there).
Just extend the cronjob you already have to process the data with SimpleHTMLDom and save the HTML it produces at 7am. Then keep serving that file until the next morning.
Just make sure you create a tmp-file first (result.tmp.html) the next morning and only do the move/rename once the cronjob finishes.
I am not sure I told you anything you didn't know already...
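The tmp-file-then-rename step might be sketched like this. The function name publishAtomically and the ".tmp" suffix are assumptions; the single-step replacement relies on both paths being on the same filesystem.

```php
<?php
// Hypothetical helper: writes $html to a temporary file next to $target and
// renames it into place only after the write has fully succeeded.
function publishAtomically(string $html, string $target): bool
{
    $tmp = $target . '.tmp';
    if (file_put_contents($tmp, $html) === false) {
        return false; // keep serving the previous file
    }
    // rename() within one filesystem replaces the target in a single step,
    // so readers see either the old file or the new one, never a partial one.
    return rename($tmp, $target);
}
```

The cron job would call this with the markup produced by SimpleHTMLDom, and the page would keep reading result.htm as before.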

php script to download files from a repeating node of an xml feed

I need to parse a huge xml feed containing games data and download all the games through the url node of each repeating item to my server.
I have no problem parsing the xml feed, I need to know the best way to download the files from their remote site to my server. Also bear in mind that the feed contains several thousands of items.
I solved it using PHP's file_get_contents()/file_put_contents(), which are very easy to use and work perfectly. The only problem I have now is that if I try to grab 50 or more games at once, the server sometimes returns an error, and I have to request fewer, say 25 games, to get them without a problem. Is there any way I can ask the server for a large number, say 500-1000 games, without it throwing an error?

Viewing large text file in a browser

I need to write a text file viewer (not the directory tree, but the actual file contents) for use in a browser. It will be used to view large files. I want to give the user the ability to actually ummm, browse the file, ie prev page & next page buttons, while each page will show only a portion of the file.
Two questions:
Is there any way to pass the file descriptor through POST (or something) so that on each page I can keep reading from an already open file, and not start all over again (again, huge files)?
Is there a way to read the file backwards? Will be very useful for browsing back in a file.
Any other implementation ideas are very welcome. Thanks
Keeping the file open between requests is not a good idea - you don't have to "start all over again" - just maintain an offset and use fseek() to jump to that offset. That way, you can also implement the "backwards jumping".
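A small sketch of the offset idea, assuming a page is simply a fixed number of bytes. The function name readPage and the default page size are made up; pages cut this way can split a line or a multibyte character, so a real viewer would probably adjust the boundaries to the nearest newline.

```php
<?php
// Hypothetical helper: returns page number $page (0-based) of $path, where a
// page is simply $pageSize bytes. Only that slice of the file is read.
function readPage(string $path, int $page, int $pageSize = 65536): string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return '';
    }
    fseek($handle, $page * $pageSize); // jump straight to the page's offset
    $data = fread($handle, $pageSize);
    fclose($handle);
    return $data === false ? '' : $data;
}
```

The "prev" and "next" buttons then only need to pass the page number in the request; going backwards is just a smaller offset.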
Cut your huge files into smaller files once, and then serve the small files to the user.
You should consider pagination. If you're concerned about the user being frustrated by needing to click "next" too often, you could make each chunk reasonably large (so a normal reader pages every 20min).
Another option is chunked transfer encoding (see the Wikipedia entry). This would allow your server to respond quickly and give the user something to read while it streams the rest of the file over the network (rather than the server needing to read in the file and send it all at once). This could dramatically improve the perceived performance compared to serving the files normally, but it still consumes a lot of bandwidth for your server.
You might be able to simulate a large document with Javascript and AJAX, but only send pieces at a time for better performance.
Consider sending a few pages worth of your document and attaching listeners to the scroll event of your browser. Over time or as the user scrolls down you AJAX more chunks. This creates a few annoying UX edge cases, like:
Scroll bar indicates a much smaller document than there actually is
You might be able to avoid this by filling in the bottom of your document with many page breaks, but it'll be difficult to make the length perfect.
Scrolling past the point of currently-available content will show a blank page.
You could detect this using JavaScript and display a "loading" icon to let the user know what's going on.
Built-in "find" feature doesn't work
Hard to avoid this without the user downloading the entire document, but you could provide your own search feature for them to use instead (not as good but perhaps adequate).
Really though, you're probably best off with pagination with medium-sized pages. It's a very well understood design pattern that's relatively easy (compared to other options at least) to implement and make fast.
Hope that helps!

Upload progress using pure PHP/AJAX?

I'm sure this has been asked before, but as I can't seem to find a good answer, here I am, asking... again. :)
Is there any way, using only a mixture of HTML, JavaScript/AJAX, and PHP, to report the actual progress of a file upload?
In reply to anyone suggesting SWFUpload or similar:
I know all about it. Been down that road. I'm looking for a 100% pure solution (and yes, I know I probably won't get it).
Monitoring your file uploads with PHP/Javascript requires the PECL extension:
uploadprogress
A good example of the code needed to display the progress to your users is:
Uber Uploader
If I'm not mistaken, it uses jQuery to communicate with PHP.
You could also write it yourself; it's not that complex.
Add a hidden element as the first element of upload form, named UPLOAD_IDENTIFIER.
Poll a PHP script that calls uploadprogress_get_info( UPLOAD_IDENTIFIER )
It returns an array containing the following:
time_start - The time that the upload began (unix timestamp),
time_last - The time that the progress info was last updated,
speed_average - Average speed in bytes per second,
speed_last - Last measured speed in bytes per second,
bytes_uploaded - Number of bytes uploaded so far,
bytes_total - The value of the Content-Length header sent by the browser,
files_uploaded - Number of files uploaded so far,
est_sec - Estimated number of seconds remaining.
Let PHP return the info to Javascript and you should have plenty of information.
Depending on the audience, you will likely not use all the info available.
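A rough sketch of the polling script, assuming the PECL uploadprogress extension is installed and that the JavaScript side passes the UPLOAD_IDENTIFIER value as an "id" query parameter. The helper name uploadPercent, the "id" parameter, and the JSON response shape are all made up for illustration.

```php
<?php
// Hypothetical helper: turns the array returned by uploadprogress_get_info()
// into a whole-number percentage for the progress bar.
function uploadPercent(?array $info): int
{
    if (empty($info['bytes_total'])) {
        return 0; // upload not started yet or identifier unknown
    }
    $percent = (int) floor(100 * $info['bytes_uploaded'] / $info['bytes_total']);
    return max(0, min(100, $percent));
}

// Polling endpoint sketch: only runs when the PECL extension is loaded.
if (function_exists('uploadprogress_get_info') && isset($_GET['id'])) {
    header('Content-Type: application/json');
    echo json_encode(['percent' => uploadPercent(uploadprogress_get_info($_GET['id']))]);
}
```

The page would request this script every second or so while the form is submitting and update the progress bar from the returned percentage.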
If you have APC installed (and by this point, you really should; it'll be standard in PHP6), it has an option to enable upload tracking.
There's some documentation, and Rasmus has written a code sample that uses YUI.
If you're able to add PECL packages into your PHP, there is the uploadprogress package.
The simplest way would be to just use swfupload, though.
Is there any way, using only a mixture of HTML, JavaScript/AJAX, and PHP, to report the actual progress of a file upload?
I don't know of any way to monitor plain HTML (multipart/form-data) file uploads in webserver-loaded PHP.
You need to have access to the progress of the multipart/form-data parser as the data comes in, but this looks impossible because the ways of accessing the HTTP request body from PHP ($HTTP_RAW_POST_DATA and php://input) are documented as being “not available with enctype="multipart/form-data"”.
You could do a script-assisted file upload in Firefox using an upload field's FileList to grab the contents of a file to submit in a segmented or non-multipart way. Still a bunch of work to parse though.
(You could even run a PHP script as a standalone server on another port just for receiving file uploads, using your own HTTP-handling code. But that's a huge amount of work for relatively little gain.)
I'd recommend you give FancyUpload a try. It's a really cool solution for a progress bar, and it's not necessarily tied to PHP. Also check out the other tools at digitarald.de
cheers
IMHO, this is the problem that Web browsers should solve. We have progress meter for downloads, so why not for uploads as well?
Take a look at this for example:
http://www.fireuploader.com/
