I have access to a traffic data server from which I get XML files with the information I need (for example: Point A to Point B: travel time 20 min, distance 18 miles, etc.).
I download the XML file (which is archived), extract it, process it, and store it in a DB. I only download the XML file on request, and only if 5 minutes have passed since the last download. The XML on the traffic server gets updated anywhere from every 30 seconds to every 5 minutes. During the 5-minute window, any user requesting the webpage gets the data from the DB (no update), which limits the number of requests made to the traffic server.
My problem with the current approach is that when I fetch a new XML file the whole process takes some time (3-7 seconds), which makes the user wait too long before seeing anything. When no XML download is needed and all the data is served straight from the DB, the process is very fast.
The archived XML is about 100-200 KB, while the extracted file is about 2 MB. The XML file contains traffic data for 3 or 4 states, while I only need the data for one state. That is why I currently use the DB approach.
Is this approach a good one? I was wondering if I should just extract the data directly from the downloaded XML file on every request and somehow limit how often the XML file gets downloaded from the traffic server. Or can anyone point me to a better way?
Sample of the XML file
This is how it looks on my website
You need to download the XML each time it changes.
But only if you'll actually have active users in the period right after the download.
As you can't foresee the future, you don't know whether you'll get a user request within the next 7 seconds.
You can, however, possibly find out with a HEAD request whether the XML file has been updated.
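For example, a rough sketch of such a check using PHP's cURL functions (the URL and local file name here are placeholders, not your actual setup):

    $url = 'http://example.com/traffic.xml.gz'; // hypothetical traffic-server URL
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request, no body
    curl_setopt($ch, CURLOPT_FILETIME, true);       // ask cURL to parse Last-Modified
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    $remoteTime = curl_getinfo($ch, CURLINFO_FILETIME); // -1 if the server doesn't say
    curl_close($ch);

    $localTime = file_exists('traffic.xml.gz') ? filemtime('traffic.xml.gz') : 0;
    if ($remoteTime > $localTime) {
        // the remote copy is newer: download and reprocess it
        file_put_contents('traffic.xml.gz', file_get_contents($url));
    }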
So you could create a service that downloads the XML from the remote system each time it changes. If the data is in fact not needed that often, you can configure the service to check and/or download less frequently.
The rest of your system can stay independent of it, as long as you tune the download service's configuration based on a statistical analysis of your users' behavior.
If you need this to be even more real-time, you'd have to configure the new service based on changing data from the other system, and then you'd start interchanging data bidirectionally between the two systems, which is more complicated and can lead to more side effects. From the numbers you give, though, that level of sophistication probably isn't needed anyway, so I wouldn't worry about it.
Related
I'd like to have a location-based data caching system (on the server) for supplying data to a mobile application. That is, if a user requests data for a location (which is common to all users in the same area), I'll fetch the values from the DB and show them. But if a second user requests the same page within the next 5 minutes from the same location, I don't want to query the millions of records in the DB; I'd rather just take the data from a file cache if it's there. Is anything like that available in PHP?
I am not aware of any such thing in PHP, but it's not too hard to make your own caching engine with PHP. You need to make a cache directory, and based on the requests you get, you check whether a file corresponding to that request exists in the cache directory.
E.g. your main parameters are lat and long.
Suppose you get a request with lat = 123 and long = 234 (taking some random values): you check whether a file named 123_234.data is present in your cache folder. If it is, instead of querying the database you read the file and send its contents as the output; otherwise you read from the database and, before sending the response, write that response to cache/123_234.data. This way you can serve later requests without querying the database again.
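A minimal sketch of that flow (the cache directory and the queryDatabase() helper are assumptions for illustration):

    $lat  = $_GET['lat'];
    $long = $_GET['long'];
    $cacheFile = __DIR__ . "/cache/{$lat}_{$long}.data";

    if (file_exists($cacheFile)) {
        echo file_get_contents($cacheFile);       // serve from the file cache
    } else {
        $response = queryDatabase($lat, $long);   // hypothetical DB lookup
        file_put_contents($cacheFile, $response); // write-through to the cache
        echo $response;
    }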
Challenges:
Time: The cache will expire at some point or another. So while checking if the file exists, you also need to check its last-modified timestamp to ensure the cache hasn't expired. It depends on your application's requirements whether the cache expires in a minute, 10 minutes, hours, days, or months.
Naming: Making intelligent cache file names is going to be challenging here, as even for a distance of 100 m the lat/long combination will differ. One option is to choose file names by reducing the precision. E.g. a real lat/long combination has the form 28.631541,76.945281; you might name the cache file 28.63154_76.94528.data (reducing the precision to 5 places after the decimal). It again depends on whether you want to cache just a single point on the globe or a geographical region, and if a region, what its radius is.
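For instance, one way to build such a name (using the sample coordinates above):

    $lat  = 28.631541;
    $long = 76.945281;
    // round both coordinates to 5 decimal places for the cache key
    $name = sprintf('%.5f_%.5f.data', $lat, $long); // "28.63154_76.94528.data"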
I don't know why someone downvoted the question; I believe it is a very good and intelligent question. There goes my upvote :)
If all you are concerned about is the queries, one approach might be a DB table that stores query results as JSON or serialized PHP objects, along with whatever fields you need to match locations.
A cron job, running at whatever interval suits best, would clear out expired results.
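A hypothetical cleanup script for such a setup (the table and column names are made up for illustration):

    // run from cron, e.g. */10 * * * * php /path/to/purge_cache.php
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $pdo->exec('DELETE FROM query_cache WHERE created_at < NOW() - INTERVAL 10 MINUTE');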
I've written some PHP code, based on a sample from GitHub, to access the DISQUS API, get a JSON response, and write it to a flat *.txt file as a cache. I also have similar code from GitHub to open that cache file and output the results to the page. Right now I'm having trouble finding a way to logically persist and/or evict that cache file. I would like the code that accesses the cache file to read from it if it was modified/created in the last 60 seconds; if it's read more than 60 seconds after its creation/modification date, it should be re-created with new data. So my underlying goal is to persist that cache for 1 minute max. How can I accomplish this with PHP?
FYI, I'm an ASP.NET developer and can easily do this with the native C# caching tools but I'm not familiar with the PHP approaches.
Edit: Per the comment below, here are more details about my expected uses, so as to perhaps get more guidance on a good solution:
Average size of data to be cached: a few kb of HTML (5-10kb)
Total size of data to be cached: a few kb of HTML (5-10kb)
Rate at which I expect requests against the cache to occur: I'm not really sure! The point here is to not hit the 1000 request limit per hour for the DISQUS API, so I'm using a text file to cache every minute.
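Roughly, the behavior I'm after looks like this (fetchDisqusJson() is a stand-in for my existing API code, and the file name is arbitrary):

    $cacheFile = 'disqus_cache.txt';
    $maxAge = 60; // seconds: persist the cache for 1 minute max

    if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        $json = file_get_contents($cacheFile);  // cache is fresh enough: read it
    } else {
        $json = fetchDisqusJson();              // stand-in for the API call
        file_put_contents($cacheFile, $json);   // re-create the cache file
    }
    echo $json;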
I am thinking about converting a Visual Basic application (that takes pipe-delimited files and imports them into a Microsoft SQL database) into a PHP page. One of these files is on average about 9 megabytes in size. (I couldn't be accurate about the number of lines involved, but I'd say it's about 20 thousand.)
One of the advantages is that any changes made to the page would be automatically 'deployed' to the intended users (currently, when I make changes to the Visual Basic app, which was originally created by someone else, I have to put the latest version on all the PCs of the people who use it).
The problem is that these imports can take around two minutes to complete. Two minutes is a long time in website-time, so I would like to provide feedback to the user to indicate that the page hasn't failed or timed out and is definitely doing something.
The best idea I can think of so far is to use Ajax to do it incrementally: say, import 1,000 records at a time, feed back, import the next 1,000, feed back, and so on.
Are there better ways of doing this sort of thing that wouldn't require me to learn new programming languages or download apps or libraries?
You don't have to make the Visual Basic -> PHP switch. You can stick with VB syntax in ASP or ASP.NET applications. With an ASP-based solution you can reuse plenty of the existing code, so it won't mean learning a new language or starting from scratch.
As for how to present a long-running process to the user, you're looking for "asynchronous handlers": the basic premise is that the user visits a web page (A) which starts the process page (B).
(A) initiates (B), reports starting to the user and sets the page to reload in n seconds.
(B) does all the heavy lifting - just like your existing VB app. Progress is stored in some shared space (a flat file, a database, a memory cache, etc)
Upon reload, (A) reports current progress of (B) by read-only accessing the shared space (B) is keeping progress in.
Scope of (A):
Look for a running (B) process: report its status if found, or initiate a fresh (B) process. Since (B) appears to depend on the existence of files (from your description), you might grant (A) the ability to determine whether there's any point in calling (B) at all (i.e. if files exist, call (B); else report "nothing to do"), or you may wish to keep the scopes entirely separate and always call (B).
Report progress of (B).
Should take very little time to execute; you may want to include an HTTP refresh header so the user automatically gets updates.
Scope of (B):
Same as existing VB script – look for files, load… yada yada yada.
Should take similar time to execute as existing VB script (2 minutes)
Potential Improvements:
(A) could use an AJAX interface, so instead of a page reload (HTTP refresh), an AJAX call is made every n seconds and just the status box is updated. Some sort of animated icon (a swirling wheel) will give the user the impression something is going on between refreshes.
It sounds like (B) could benefit from a multi-threaded approach (loading multiple files at once), depending on whether the files are related. As pointed out by Ponies, there may be a better strategy for such a load, but that's a different topic altogether :)
Some sort of semaphore/flag approach may be required if page (A) can be hit simultaneously by multiple users while (B) takes a few seconds to start up and report status.
Both (A) and (B) can be developed in PHP or ASP technology.
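A minimal sketch of (A)'s status-reporting side in PHP (the progress file's name and format are assumptions; (B) would write to it as it works):

    // status.php — page (A)
    header('Refresh: 5'); // plain HTTP refresh; an AJAX poll would work too

    if (file_exists('progress.txt')) {
        // (B) writes e.g. "12000/20000 rows imported" to progress.txt
        echo 'Import progress: ' . htmlspecialchars(file_get_contents('progress.txt'));
    } else {
        echo 'No import running.';
    }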
How are you importing the data into the database? Ideally, you should be using SQL Server's BULK INSERT, which would likely speed things up. But it's still a matter of uploading the file for parsing...
I don't think it's worth the effort to report the status of insertions - most sites only display an animated gif or similar (like the hourglass) to indicate that the system is processing, with no real details.
I have a cron job that for the time being runs once every 20 minutes, but ultimately will run once a minute. This cron job processes potentially hundreds of functions, each of which grabs an XML file remotely, processes it, and performs its tasks. The problem is that, due to the speed of the remote sites, this script can sometimes take a while to run.
Is there a safe way to do this without (a) the script timing out, (b) overloading the server, or (c) overlapping and not completing its task for that minute before it runs again (would that error out?)?
Unfortunately, caching isn't an option, as the data changes in near real-time and comes from a variety of sources.
I think a slight design change would benefit this process quite a bit. Given that a remote server could time out, or a connection could be slow, you'll definitely run into concurrency issues if one slow job is still writing files when another one starts up.
I would break it into two separate scripts. Have one script that is only used for fetching the latest XML data, and another for processing it. The fetch script can take its sweet time if it needs to, while the processing script continually looks for the newest file available in order to process it.
This way they can operate independently, and the processing script can always work with the latest data, regardless of how long either script takes to run.
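A sketch of the fetch-only script, with a lock file so overlapping cron runs don't collide (the feed list, paths, and the temp-file-then-rename step are assumptions):

    // fetch.php — run from cron
    $lock = fopen('/tmp/fetch.lock', 'w');
    if (!flock($lock, LOCK_EX | LOCK_NB)) {
        exit; // a previous fetch is still running: bail out
    }

    foreach (file('feeds.txt', FILE_IGNORE_NEW_LINES) as $url) {
        $xml = @file_get_contents($url);
        if ($xml !== false) {
            // write to a temp file, then rename, so the processing
            // script never sees a half-written file
            $tmp = tempnam(sys_get_temp_dir(), 'xml');
            file_put_contents($tmp, $xml);
            rename($tmp, 'data/' . md5($url) . '.xml');
        }
    }
    flock($lock, LOCK_UN);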
Have a stack that you keep all the jobs on, and a handful of threads whose job it is to:
Pop a job off the stack.
Check if you need to refresh the XML file (check ETags, Expires headers, etc.).
Grab the XML if need be (this is the bit that could take time, hence spreading the load over threads). This should time out if it takes too long and flag the failure to someone, as you might have a site that's down, a dodgy RSS generator, or whatever.
Then process it.
This way you'll be able to grab lots of data each time.
It could be that you don't need to grab the file at all (it would help if you could store the last ETag for each file, etc.).
One tip: don't expect any of the feeds to be in a valid format. I suggest you have a look at Mark Pilgrim's RSS RegExp reader, which does a damn fine job of reading most RSS feeds.
Addition: I would say hitting the same sites every minute is not really playing nice with those servers, and it creates a lot of work for your own server. Do you really need to hit them that often?
You should make sure to read the <ttl> tag of the feeds you are grabbing, to ensure you are not unnecessarily grabbing feeds before they change. <ttl> holds the update period, so if a feed has <ttl>60</ttl>, it should only be updated every 60 minutes.
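For example, reading <ttl> with SimpleXML (the URL is a placeholder):

    $feed = simplexml_load_file('http://example.com/feed.xml');
    // <ttl> lives under <channel>; fall back to 60 minutes if it's absent
    $ttl = isset($feed->channel->ttl) ? (int) $feed->channel->ttl : 60;
    echo "Fetch this feed at most once every {$ttl} minutes.\n";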
A client would like me to add their Twitter stream to their website homepage, using a custom solution built in PHP.
The Twitter API obviously has a limited number of calls you can make to it per hour, so I can't automatically ping Twitter every time someone refreshes my client's homepage.
The client's website is purely HTML at the moment and so there is no database available. My solution must therefore only require PHP and the local file system (e.g. saving a local XML file with some data in it).
So, given this limited criteria, what's the best way for me to access the Twitter API - via PHP - without hitting my API call limit within a few minutes?
Once you can pull down a timeline and display it, it will be quite easy to add some file-based caching to it:
Check the age of the cache.
Is it more than 5 minutes old?
    Fetch the latest information.
    Regenerate the HTML for output.
    Save the finished HTML to disk.
Display the cached, pre-prepared HTML.
PEAR's Cache_Lite will do all you need on the caching layer.
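A rough sketch with Cache_Lite (the cache dir, lifetime, and the timeline-building function are assumptions):

    require_once 'Cache/Lite.php';

    $cache = new Cache_Lite(array(
        'cacheDir' => '/tmp/',
        'lifeTime' => 300, // 5 minutes
    ));

    if ($html = $cache->get('twitter_timeline')) {
        echo $html;                  // still fresh: serve the cached HTML
    } else {
        $html = buildTimelineHtml(); // hypothetical: call Twitter, render HTML
        $cache->save($html, 'twitter_timeline');
        echo $html;
    }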
a cron job (not likely - if there's not even a database, then there are probably no cron jobs either)
write the microtime() to a file; on a page view, compare the current timestamp to the saved one, and if the difference is greater than N minutes, pull the new tweet feed and write the current timestamp to the file (a sketch follows this list)
if the front page is a static HTML file not calling any PHP, include an image <img src="scheduler.php"/> that returns a 1px transparent GIF (at least that's how it was done when I was young) and silently does your Twitter-pulling
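A sketch of the timestamp-file option (the file name, N, and the fetch function are placeholders):

    $stampFile = 'last_fetch.txt';
    $maxAge = 300; // N = 5 minutes, in seconds

    $last = file_exists($stampFile) ? (float) file_get_contents($stampFile) : 0;
    if (microtime(true) - $last > $maxAge) {
        updateTweetFeed(); // hypothetical fetch-and-save of the new feed
        file_put_contents($stampFile, microtime(true));
    }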
Or do you mean the local-local filesystem, as in "my/the customer's computer, not the server"-local?
In that case:
get some server with a cron job or scheduler and PHP
write a script that reads and saves the feed to a file
write the file to the customer's server using FTP
display the feed using JavaScript (yes, Ajax also works with static files as data sources); jQuery or some such lib is great for this
or: create the tweet-displaying HTML file locally and upload it (but be careful ... because you may overwrite updates on the server)
IMO, for small sites you often just don't need a fully grown SQL database anyway. Filesystems are great. A combination of scandir, preg_match, and carefully chosen file names is often good enough.
And you can actually do a lot of front-end processing (like displaying XML) using beautiful JavaScript.
Since we don't know your server config, I suggest you set up a cron job (assuming you're on a Linux box). If you have something like cPanel on a shared hosting environment, it shouldn't be much of an issue. You need to write a script that is called by cron and that gets the latest tweets and writes them to a file (XML?). You can schedule cron to run every 30 minutes, or whatever you want.
You may want to use TweetPHP by Tim Davies: http://lab.lostpixel.net/classes/twitter/ - this class has lots of features, including the one you want: showing your client's timeline.
The page shows good examples on how to use it.
You can then put the output of this in a file or database. If you want site visitors to trigger an update of the database or the file only every 5 minutes or so, you can set a session variable holding a timestamp and only allow another update if the timestamp is at least 5 minutes old.
Hope this helps
My suggestion: create a small, simple object to hold the cache date and an array of tweets. Every time someone visits the page, it performs the following logic (a code sketch follows the steps):
A) Does the file exist?
Yes: read it into a variable.
No: proceed to step D).
B) Unserialize the variable (the PHP pair serialize()/unserialize() will do just fine).
C) Compare the age of the stored cache with the current time (a Unix timestamp will do).
If they're more than 5 minutes apart:
D) Get the newest tweets from Twitter, update the object, serialize it, and write it to the cache again. Keep the newest tweets for printing, too.
If not: just read the tweets from the cache.
E) Print the tweets.
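A sketch of those steps (fetchTweets() stands in for the actual Twitter API call):

    $cacheFile = 'tweets.cache';
    $maxAge = 300; // 5 minutes

    $cache = file_exists($cacheFile)                  // step A
        ? unserialize(file_get_contents($cacheFile))  // step B
        : null;

    if ($cache === null || time() - $cache['cached_at'] > $maxAge) { // step C
        $cache = array(                               // step D
            'cached_at' => time(),
            'tweets'    => fetchTweets(),             // hypothetical API call
        );
        file_put_contents($cacheFile, serialize($cache));
    }

    foreach ($cache['tweets'] as $tweet) {            // step E
        echo $tweet, "\n";
    }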
The simplest and easiest way to serialize the object is the serialize()/unserialize() pair. If you're not willing to put in the effort to make the object, you could just use a 2D array; serialize() will work just fine. Have a look at http://php.net/serialize
Considering you have no cPanel access, this is the best solution, since you won't have access to PEAR packages, cron, or other simpler options.
array(
'lastrequest' => 123,
'tweets' => array ()
)
Now, in your code, put a check to see whether the lastrequest timestamp in the datastore is more than X seconds old. If it is, it's time to update your data.
Serialize and store the array in a file - pretty simple.