Periodically populate view data with query - php

I'm writing a view that will have daily/weekly/monthly report data. I'm thinking it makes sense to only run a query periodically to update the data rather than hit the database whenever someone loads the page. Can this be done completely in PHP and MySQL? What are some robust ways to handle this?

Using a templating engine like Smarty that supports caching, you can set a long cache time for those pages. You then need to code your PHP to test whether your date constraints have changed and whether the data is already cached; if the constraints have changed or nothing is cached, run the query. Otherwise, Smarty just loads the cached page and your code never queries the database.
$smarty = new Smarty();
$smarty->setCaching(Smarty::CACHING_LIFETIME_CURRENT); // caching must be enabled for isCached() to work
$smarty->setCacheLifetime(86400);                      // e.g. keep the cached page for a day

if (!$smarty->isCached('yourtemplate.tpl')) {
    // Run your query and populate template variables, e.g.
    // $smarty->assign('report', $reportData);
}
$smarty->display('yourtemplate.tpl');
Further documentation on Smarty caching

Yes, but not very well. You want to look into cron jobs; most web hosts provide a way to set them up. A cron job is simply a scheduled way to run a script, any script: PHP, JavaScript, a whole page, etc.
Search Google for cron jobs and you should find what you're looking for.
If your web host doesn't provide cron jobs and you don't know how Unix commands work, there are sites that will host a cron job for you.
Check out
http://www.cronjobs.org/
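As a rough sketch (the schedule, paths, credentials and query here are all made up), a crontab entry such as */15 * * * * /usr/bin/php /path/to/build_report.php would run a PHP report builder every 15 minutes, and that script could look something like this:
<?php
// build_report.php (hypothetical): run the heavy report query once
// and write the result somewhere the page can read it cheaply.
$conn = new mysqli('localhost', 'dbuser', 'dbpass', 'reports'); // assumed credentials

$result = $conn->query('SELECT COUNT(*) AS total_orders FROM orders'); // example query
$data = $result->fetch_assoc();
$data['generated_at'] = date('c');

file_put_contents('/path/to/report_cache.json', json_encode($data));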

I'm thinking it makes sense to only run a query periodically to update the data rather than hit the database whenever someone loads the page
Personally I'd go with both. e.g.
SELECT customer, COUNT(orders.id), SUM(order_lines.value)
FROM orders
JOIN order_lines ON orders.id = order_lines.order_id
WHERE orders.placed > #last_time_data_snapshotted
  AND orders.customer = #some_user
GROUP BY customer
UNION
SELECT customer, SUM(rollup.orders), SUM(rollup.order_value)
FROM rollup
WHERE rollup.last_order_date < #last_time_data_snapshotted
  AND rollup.customer = #some_user
GROUP BY customer
rather than hit the database whenever someone loads the page
Actually, depending on the pattern of usage this may make a lot of sense. But that doesn't necessarily preclude the method above: just set a threshold for how much new data you let accumulate before you push the aggregated data into the pre-consolidated (rollup) table, and test that threshold on each request.
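A minimal sketch of that threshold test, reusing the orders/rollup names from the query above and assuming a mysqli connection in $conn plus a hypothetical snapshot_log table that records when data was last consolidated:
// Tolerate up to this many un-consolidated orders before rolling up (arbitrary number)
$threshold = 1000;

// When did we last snapshot? (assumes at least one row exists in snapshot_log)
$since = $conn->query('SELECT MAX(taken_at) AS t FROM snapshot_log')->fetch_assoc()['t'];

$pending = $conn->query(
    "SELECT COUNT(*) AS c FROM orders WHERE placed > '" . $conn->real_escape_string($since) . "'"
)->fetch_assoc()['c'];

if ($pending > $threshold) {
    // Push the aggregated rows into the pre-consolidated table...
    $conn->query(
        "INSERT INTO rollup (customer, orders, order_value, last_order_date)
         SELECT o.customer, COUNT(DISTINCT o.id), SUM(l.value), MAX(o.placed)
         FROM orders o JOIN order_lines l ON o.id = l.order_id
         WHERE o.placed > '" . $conn->real_escape_string($since) . "'
         GROUP BY o.customer"
    );
    // ...and record the new snapshot time
    $conn->query('INSERT INTO snapshot_log (taken_at) VALUES (NOW())');
}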

I'd personally go for storing the cached data in a file, then just read that file if it has been updated within a certain timeframe; if not, do your update (e.g. get the info from the database and write it to the file).
Some example code:
$cacheTime = 900; // cache lifetime in seconds (15 minutes)
$useCache = false;
$cacheFile = './cache/twitter.cachefile';

// Use the cache only if the file exists, is non-empty and is still fresh
if (file_exists($cacheFile)) {
    $cacheContents = file_get_contents($cacheFile);
    if ((time() - filemtime($cacheFile)) < $cacheTime && filesize($cacheFile) > 0) {
        $useCache = true;
    }
}

if (!$useCache) {
    // Get all your update data, setting $cacheContents to the output
    // (using an output buffer here would be a good idea), then rewrite the cache file.
    $fh = fopen($cacheFile, 'w+');
    fwrite($fh, $cacheContents);
    fclose($fh);
}

echo $cacheContents;

Related

What is the best way to continuously check a MySQL table for updates?

For some reasons (which I think are not the point of my question, but if it helps, ask me and I can describe why), I need to check MySQL tables continuously for new records. If any new records come in, I want to perform some related actions that are not important right now.
The question is: how should I continuously check the database so that I use the fewest resources while getting results as close to real time as possible?
For now, I have this:
$new_record_come = false;
while (!$new_record_come) {
    $sql = "SELECT id FROM Notifications WHERE insert_date > (NOW() - INTERVAL 5 SECOND)";
    $result = $conn->query($sql);

    if ($result && $result->num_rows > 0) {
        // doing some related actions...
        $new_record_come = true;
    } else {
        sleep(5); // 5 second delay before polling again
    }
}
But I am worried that if I get thousands of users, this will bring the server down, even if it is an expensive one!
Do you have any advice for better performance, or for changing the approach completely, or even changing the type of query, or any other suggestion?
Polling a database is costly, so you're right to be wary of that solution.
If you need to scale this application up to handle thousands of concurrent users, you probably should consider additional technology that complements the RDBMS.
For this, I'd suggest using a message queue. After an app inserts a new notification to the database, the app will also post an item to a topic on the message queue. Typically the primary key (id) is the item you post.
Meanwhile, other apps are listening to the topic. They don't need to do polling. The way message queues work is that the client just waits until there's a new item in the queue. The wait will return the item.
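As one possible sketch (not a prescribed setup): using Redis as a lightweight queue via the phpredis extension, with a local Redis server, a list called notifications, and $conn as the app's mysqli connection all assumed:
// Producer side: right after the app INSERTs (and commits) the notification row
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->lPush('notifications', (string) $conn->insert_id); // post the new row's id

// Consumer side: a long-running CLI worker, no polling of MySQL needed
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
while (true) {
    // brPop blocks until an item arrives and returns [listName, value]
    [$list, $id] = $redis->brPop(['notifications'], 0);
    // The producer posted after committing, so the record is safe to read and act on
    $row = $conn->query('SELECT * FROM Notifications WHERE id = ' . (int) $id)->fetch_assoc();
    // ... do the related actions with $row ...
}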
A comment suggested using a trigger to invoke a PHP script. This won't work, because triggers execute while the transaction that spawned them is not yet committed. So if the trigger runs a PHP script, that script probably needs to read the record from the database; but an uncommitted record is not visible to any other database session, so the PHP script can never read the data it was notified about.
Another angle (much simpler than message queue I think):
I once implemented this on a website by letting the clients poll AND compare against the latest id they received.
For example: You have a table with primary key, and want to watch if new items are added.
But you don't want to set up a database connection and query the table if there is nothing new in it.
Let's say the primary key is named 'postid'.
I had a file containing the latest postid.
I updated it with each new entry in tblposts, so it always contains the latest postid.
The polling scripts on the clientside simply retrieved that file (do not use PHP, just let Apache serve it, much faster: name it lastpostid.txt or something).
Client compares to its internal latest postid. If it is bigger, the client requests the ones after the last one. This step DOES include a query.
Advantage is that you only query the database when something new is in, and you can also tell the PHP script what your latest postid was, so PHP can only fetch the later ones.
(Not sure if this will work in your situation because it assumes an increasing number means 'newer'.)
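A rough sketch of that pattern, with the file path and column names assumed rather than taken from the original site:
// Write path: after inserting a row into tblposts, record the newest id
// in a plain text file that Apache can serve directly.
$newId = $conn->insert_id;
file_put_contents('/var/www/html/lastpostid.txt', (string) $newId, LOCK_EX);

// Read path: only hit when the client saw a bigger id in lastpostid.txt.
// (get_result() requires the mysqlnd driver.)
$since = (int) ($_GET['since'] ?? 0);
$stmt = $conn->prepare('SELECT postid, title FROM tblposts WHERE postid > ? ORDER BY postid');
$stmt->bind_param('i', $since);
$stmt->execute();
$newPosts = $stmt->get_result()->fetch_all(MYSQLI_ASSOC);

header('Content-Type: application/json');
echo json_encode($newPosts);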
This might not be possible with your current system design, but instead of using triggers or a heartbeat to poll the database continuously, how about going to where the updates happen and executing the other code from there? That way you avoid polling the database continuously, and the code fires ONLY IF somebody initiates a request.

Run a SQL query (count) every 30 sec, and then save the output to some file

I am developing a website with a database where people can insert data (votes). I want to keep a counter in the header like "x votes have been cast". But it is possible that there will be a lot of traffic on the website soon. Right now I can do it with the query
SELECT COUNT(*) FROM `tblvotes`
and then display the number in the header, but then every time a user changes page, the query is re-run. So I am thinking maybe it is better to run the query once every 30 seconds (much less load on the MySQL server), but then I need to save its output somewhere (this shouldn't be so hard; I could write it to a text file). But how can I have my website automatically run the query every 30 seconds and put the number in the file? I have no SSH access to the server, so I can't crontab it.
If there is something you don't understand, feel free to ask!
Simplest approach: write the result into a local text file and, on every request, check whether that file is older than 30 seconds; if it is, update it. To update, you should lock the file. While the file is being updated, other requests that also hit the 30-second threshold should just read the currently existing file, to avoid race conditions.
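A sketch of that idea (the file name is an assumption, and $conn is a mysqli connection):
$countFile = __DIR__ . '/votecount.txt'; // hypothetical cache file
$maxAge = 30;                            // seconds

if (!file_exists($countFile) || (time() - filemtime($countFile)) > $maxAge) {
    $fh = fopen($countFile, 'c+');
    // Only the request that wins the non-blocking lock refreshes the count;
    // everyone else falls through and reads the existing value.
    if (flock($fh, LOCK_EX | LOCK_NB)) {
        $count = $conn->query('SELECT COUNT(*) AS c FROM tblvotes')->fetch_assoc()['c'];
        ftruncate($fh, 0);
        rewind($fh);
        fwrite($fh, (string) $count);
        fflush($fh);
        flock($fh, LOCK_UN);
    }
    fclose($fh);
}

echo trim((string) file_get_contents($countFile)) . ' votes have been cast';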
Hope that helps,
Stefan
Crontabs can only run every minute, at their fastest.
I think there is a better solution to this. You should make an aggregate table in which the statistical information is stored.
With a trigger on the votes_table, you can do 'something' every time the table receives an INSERT statement.
The aggregate table will then store the most accurate information, which you can then query to display the count.
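For instance (the aggregate table and trigger names here are assumptions, not something from the question), a one-time setup could look like this, after which the header only needs a cheap primary-key lookup instead of a full COUNT(*):
// One-time setup, assuming an aggregate table created beforehand as:
//   CREATE TABLE vote_stats (id TINYINT PRIMARY KEY, total INT NOT NULL);
//   INSERT INTO vote_stats VALUES (1, (SELECT COUNT(*) FROM tblvotes));
$conn->query("
    CREATE TRIGGER trg_votes_after_insert
    AFTER INSERT ON tblvotes
    FOR EACH ROW
        UPDATE vote_stats SET total = total + 1 WHERE id = 1
");

// In the page header
$total = $conn->query('SELECT total FROM vote_stats WHERE id = 1')->fetch_assoc()['total'];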
A better solution would be to use a caching mechanism (e.g. APC) instead of files, if your server allows it.
If you can, you may want to look into using memcached. It allows you to set an expiry time for any data you add to it.
When you first run the query, store the result under a key derived from the md5 of the query text. On subsequent requests, look for the data in memcached; if it has expired, re-run the SQL query and write the result back to memcached.
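A minimal sketch of that, assuming the Memcached PECL extension and a memcached daemon on localhost:
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$sql = 'SELECT COUNT(*) AS c FROM tblvotes';
$key = 'q_' . md5($sql); // key derived from the query text

$count = $memcached->get($key);
if ($count === false) { // cache miss or expired entry
    $count = $conn->query($sql)->fetch_assoc()['c'];
    $memcached->set($key, $count, 30); // let it expire after 30 seconds
}

echo $count . ' votes have been cast';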
Okay, so the first part of your question is basically about caching the result of the total votes to be included in the header of your page. It's a very good idea; here is one way to implement it...
If you cannot enable a crontab (even without SSH access you might be able to set this up using your hosting's control panel), you might be able to get away with using an external third-party cron job service. (Google has many results for this.)
Every time your cron job runs, it can create/update a file that simply contains some PHP arrays:
$fileOutput  = "<"."?php\n\n";
$fileOutput .= '$'.$arrayName.' = ';
$fileOutput .= var_export($yourData, true);
$fileOutput .= ";\n\n?".">";

$handle = fopen(_path_to_cache_file, 'w+');
fwrite($handle, $fileOutput);
fclose($handle);
That will give you a PHP file that you can simply include() into your header markup and then you'll have access to the $yourData variable under the name $arrayName.
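For example, if the cron script above used an array name of voteStats and wrote to cache/vote_stats.php (both names made up here):
// In the page header
include __DIR__ . '/cache/vote_stats.php'; // defines $voteStats, written by the cron job
echo $voteStats['total'] . ' votes have been cast';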

PHP retrieving database row counts and executing a file

I have a simple question. I'd like to write a PHP function that checks the database rows and, if the row count was affected by the last query that ran, executes an internal PHP file. The catch is that I want it to check the row count and the timestamp at the same time, so that if the timestamp is different and the row count is different, it executes the PHP file.
The file in question is a SQL database backup, so I need it to execute only if there was a change in the database and if the timestamp is older than 43200 seconds (half a day). This would back up the database if there was activity on the site (one activity would back up once, two activities would back up twice, and anything more than that would be ignored), and if not, it would not do anything. I hope I'm explaining it right.
A cron job is out of the question, since this depends on database changes, not just on time.
The code I'm using is like this (without checking the database rows) and is only accessed when a customer accesses the shopping cart checkout or account page:
<?php
$dbbackuplog = '/path/to/backuptime.log';
if (file_exists($dbbackuplog)) {
    $lastRun = file_get_contents($dbbackuplog);
    if (time() - $lastRun >= 43200) {
        //Its been more than 12 hours so run the backup
        $cron = file_get_contents('/file.php');
        //update backuptime.log with current time
        file_put_contents($dbbackuplog, time());
    }
}
?>
I appreciate any input or suggestions.
First of all, you cannot run anything with file_get_contents. That function simply reads the bare contents of the file you ask for and under no circumstances will it run any code. If you want to run the code, you want include or require instead.
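A minimal correction of the snippet above, keeping the questioner's placeholder paths, just to illustrate the difference:
<?php
$dbbackuplog = '/path/to/backuptime.log';
if (file_exists($dbbackuplog)) {
    $lastRun = (int) file_get_contents($dbbackuplog);
    if (time() - $lastRun >= 43200) {
        require '/file.php';                     // actually executes the backup script
        file_put_contents($dbbackuplog, time()); // record when this backup ran
    }
}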
Second, your idea of not just triggering but fully executing backups while a customer is performing an action is, well, I'm not going to pull any punches, terrible. There's a reason why people use cron for backups (more than one reason, actually), and you should follow that example. That's not to say that you cannot let dynamic factors affect the behavior of the cron script, but rather that the act of taking a backup should always be performed behind the scenes.

What's the ideal way to implement this PHP/XML function in a website?

I have this code written up to snatch some weather data and make it available on my website:
if( ! $xml = simplexml_load_file('http://www.weather.gov/data/current_obs/KBED.xml') )
{
echo 'unable to load XML file';
}
else
{
$temp = $xml->temp_f.' Degrees';
$wind = $xml->wind_mph;
$wind_dir = $xml->wind_dir;
$gust = $xml->wind_gust_mph;
$time = $xml->observation_time;
$pres = $xml->pressure_in;
$weath = $xml->weather;
}
And then I just echo them out inside the tags I want them in. My site is low traffic, but I'm wondering what the "best" way is to do something like this if traffic were to spike way up. Should I write the variables I want into a database every hour (when the XML is refreshed) with a cron job, to save hitting the weather server on every request, or is that bad practice? I understand this is a bit subjective, but I have no one else to ask for "best ways". Thanks!!
I would suggest the following:
When you first get the content of the XML, parse it and serialise it to a file, with a timestamp attached in some way (perhaps as part of the serialised data structure).
Every time the page loads, grab that serialised data and check the timestamp. If it's past a certain age, go and fetch the XML again and re-cache it, making sure to update the timestamp. If not, just use the cached data.
That should work: it means you only have to fetch the XML occasionally, and because the cache is only refreshed on request, you don't waste fetches on regular updates while no-one is visiting.
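A sketch of that approach under assumed names (the cache path and one-hour lifetime are made up):
$cacheFile = __DIR__ . '/cache/weather_kbed.ser'; // hypothetical path
$maxAge = 3600;                                   // refresh at most once an hour

$weather = null;
if (file_exists($cacheFile)) {
    $cached = unserialize(file_get_contents($cacheFile));
    if (is_array($cached) && (time() - $cached['fetched_at']) < $maxAge) {
        $weather = $cached['data'];
    }
}

if ($weather === null) {
    // Cache missing or stale: fetch the XML again and re-cache it with a fresh timestamp
    $xml = simplexml_load_file('http://www.weather.gov/data/current_obs/KBED.xml');
    $weather = [
        'temp' => (string) $xml->temp_f . ' Degrees',
        'wind' => (string) $xml->wind_mph,
        'time' => (string) $xml->observation_time,
    ];
    file_put_contents($cacheFile, serialize(['fetched_at' => time(), 'data' => $weather]), LOCK_EX);
}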
Set up a cron job to periodically fetch the XML document, parse it and store the variables in a database.
When a page is requested, fetch the variables from the database and render your page.
It is a good idea to store the timestamp of the last update in the database as well, so that you can tell when the data is stale (because the weather website is down or so).
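One way that cron-plus-database setup could look, with an assumed single-row weather_cache table and $conn as a mysqli connection:
// Cron script (e.g. hourly): parse the feed and store the fields plus a timestamp
$xml = simplexml_load_file('http://www.weather.gov/data/current_obs/KBED.xml');
if ($xml !== false) {
    $stmt = $conn->prepare(
        'REPLACE INTO weather_cache (id, temp_f, wind_mph, observation_time, fetched_at)
         VALUES (1, ?, ?, ?, NOW())'
    );
    $temp = (float) $xml->temp_f;
    $wind = (float) $xml->wind_mph;
    $obs = (string) $xml->observation_time;
    $stmt->bind_param('dds', $temp, $wind, $obs);
    $stmt->execute();
}

// Page request: read the stored row and flag it if it has gone stale
$row = $conn->query(
    'SELECT *, fetched_at < NOW() - INTERVAL 2 HOUR AS stale FROM weather_cache WHERE id = 1'
)->fetch_assoc();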
This setup looks very reasonable to me.
You could cache the output of the external site, and let it renew itself say every 5-10 seconds. That would kill the impact of a lot of 'pings' from your site. It really depends on how important timing accuracy is to your customer/client.
In a high-traffic situation I would have a separate script that runs as a daemon or cron job and fetches the weather at a specified interval, overwriting the public website page when done. That way you don't have to worry about caching, as it's done by a background task; your visitors are merely accessing a static page from the web server. That also avoids, or at least minimises, the need to bring a database into the equation, and is fairly lightweight.
On the downside, it does create a second point of failure and could be pretty useless if the information needs to be accurate to the time of page access.
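A sketch of that background task (file names assumed), using an atomic rename so visitors never see a half-written include:
// weather_refresh.php, run as a cron job or looping daemon
$xml = simplexml_load_file('http://www.weather.gov/data/current_obs/KBED.xml');
if ($xml === false) {
    exit(1); // feed is down: leave the last good page in place
}

$html = sprintf(
    '<div class="weather">%s Degrees, wind %s mph %s (as of %s)</div>',
    htmlspecialchars((string) $xml->temp_f),
    htmlspecialchars((string) $xml->wind_mph),
    htmlspecialchars((string) $xml->wind_dir),
    htmlspecialchars((string) $xml->observation_time)
);

// Write to a temp file first, then rename: rename() is atomic on the same filesystem
$target = '/var/www/html/includes/weather.html';
file_put_contents($target . '.tmp', $html);
rename($target . '.tmp', $target);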

crawling scraping and threading? with php

I have a personal web site that crawls and collects MP3s from my favorite music blogs for later listening...
The way it works is a cron job runs a .php script once every minute that crawls the next blog in the DB. The results are put into the DB, and then a second .php script crawls the collected links.
The scripts only crawl two levels down into each site, so: the main page www.url.com, and links on that page such as www.url.com/post1 and www.url.com/post2.
My problem is that as I start to get a larger collection of blogs, they are only scanned once every 20 to 30 minutes, and when I add a new blog to the script there is a backlog in scanning the links, as only one is processed every minute.
Due to how PHP works, it seems I cannot just let the scripts process more than a limited number of links, because of script execution times, memory limits, timeouts, etc.
Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.
What is the best way I could speed this process up?
Is there a way I can have multiple scripts affecting the DB but write them so they do not overwrite each other but queue the results?
Is there some way to create threading in PHP so that a script can process links at its own pace?
Any ideas?
Thanks.
USE CURL MULTI!
curl_multi will let you process the pages in parallel.
http://us3.php.net/curl
Most of the time you are waiting on the websites; the DB insertions and HTML parsing are orders of magnitude faster.
Create a list of the blogs you want to scrape and send them out to curl_multi. Wait, then serially process the results of all the calls. You can then do a second pass on the next level down.
http://www.developertutorials.com/blog/php/parallel-web-scraping-in-php-curl-multi-functions-375/
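A minimal curl_multi sketch (the URLs are placeholders):
$urls = ['http://blog-one.example/', 'http://blog-two.example/'];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Let all requests run in parallel
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

// Serially process the results of all the calls
foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    // ... parse $html for MP3 links and queue them for the second pass ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);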
Outline for running parallel scanners (sketched here with mysqli against an InnoDB table):
function start_a_scan(mysqli $conn): void {
    // Start a MySQL transaction (needs InnoDB)
    $conn->begin_transaction();

    // Get the first entry that has timed out and is not being scanned by someone,
    // and acquire an exclusive lock on the affected row
    $result = $conn->query(
        "SELECT * FROM scan_targets
         WHERE being_scanned = false
           AND scanned_at < (NOW() - INTERVAL 60 SECOND)
         ORDER BY scanned_at ASC
         LIMIT 1 FOR UPDATE"
    );
    $row = $result ? $result->fetch_assoc() : null;
    if (!$row) {
        $conn->rollback();
        return; // nothing to do right now
    }

    // Let everyone know we're scanning this one, so they'll keep out
    $conn->query("UPDATE scan_targets SET being_scanned = true WHERE id = " . (int) $row['id']);
    // Commit the transaction (releases the row lock)
    $conn->commit();

    // Scan
    scan_target($row['url']);

    // Update the entry's state to allow it to be scanned again in the future
    $conn->query(
        "UPDATE scan_targets SET being_scanned = false, scanned_at = NOW()
         WHERE id = " . (int) $row['id']
    );
}
You'd probably also need a 'cleaner' that periodically checks whether there are any aborted scans hanging around, and resets their state so they can be scanned again.
And then you can have several scan processes running in parallel! Yey!
cheers!
EDIT: I forgot that you need to make the first SELECT with FOR UPDATE. Read more here
This surely isn't the answer to your question, but if you're willing to learn Python I recommend you look at Scrapy, an open-source web crawler/scraper framework that should fill your needs. Again, it's not PHP but Python. It is, however, very distributable etc... I use it myself.
Due to how PHP works, it seems I cannot just let the scripts process more than a limited number of links, because of script execution times, memory limits, timeouts, etc.
The memory limit is only a problem if your code leaks memory. You should fix that rather than raising the memory limit. Script execution time is a security measure, which you can simply disable for your CLI scripts.
Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.
You can construct your application in such a way that instances don't override each other. A typical way to do it would be to partition per site; e.g. start a separate script for each site you want to crawl.
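A sketch of that partitioning (the script path and table name are assumptions): a small dispatcher, run from cron, that launches one background crawler process per blog so the instances never touch each other's rows:
// dispatcher.php, run by cron
$sites = $conn->query('SELECT id FROM blogs')->fetch_all(MYSQLI_ASSOC);

foreach ($sites as $site) {
    // One CLI process per blog; output discarded, process backgrounded
    $cmd = sprintf(
        '/usr/bin/php /path/to/crawl_site.php %s > /dev/null 2>&1 &',
        escapeshellarg((string) $site['id'])
    );
    exec($cmd);
}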
CLI scripts are not limited by max execution times. Memory limits are not normally a problem unless you have large sets of data in memory at any one time. Timeouts should be handled gracefully by your application.
It should be possible to change your code so that you can run several instances at once - you would have to post the script for anyone to advise further though. As Peter says, you probably need to look at the design. Providing the code in a pastebin will help us to help you :)
