Location-based caching system - PHP

I'd like to have a location-based data caching system (on the server) for supplying data to a mobile application. That is, if a user requests data for a location (which is common to all users from the same area), I'll fetch the values from the DB and show them. But if a second user requests the same page within the next 5 minutes from the same location, I don't want to query the millions of records in the DB again; I can just take the result from a file cache if it's there. Is anything like this available in PHP?

I am not aware of any such thing built into PHP, but it's not too hard to make your own caching engine in PHP. You need to create a cache directory and, for each request you get, check whether a file corresponding to that request exists in the cache directory.
E.g. suppose your main parameters are lat and long, and you get a request with lat = 123 and long = 234 (taking some random values). Check whether a file named 123_234.data is present in your cache folder. If it is, read the file and send its contents as the output instead of querying the database; otherwise, read from the database and, before sending the response, write it to cache/123_234.data. This way you can serve later requests without querying the database again.
Challenges:
Time: the cache will expire at some point or other. So while checking whether the file exists, you also need to check its last-modified timestamp to ensure the cache has not expired. It depends on your application requirements whether the cache expires in a minute, 10 minutes, hours, days or months.
Naming: making intelligent cache file names is going to be challenging here, because even for a distance of 100 m the lat/long combination will be different. One option is to choose the file names by reducing the precision. E.g. a real lat/long combination has the form 28.631541,76.945281; you may want to make a cache file named 28.63154_76.94528.data (reducing the precision to 5 places after the decimal). It again depends on whether you want to cache a single point on the globe or a geographical region, and if a region, what its radius is.
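Putting those pieces together, here is a minimal sketch of such a file cache (the fetch_from_db() helper, the 5-minute lifetime and the cache/ directory are assumptions for illustration):

<?php
define('CACHE_DIR', __DIR__ . '/cache'); // assumed to exist and be writable
define('CACHE_TTL', 300);                // assumed 5-minute lifetime
define('PRECISION', 5);                  // decimal places kept in the cache key

function cached_lookup(float $lat, float $long): string
{
    // Round the coordinates so nearby requests share one cache file.
    $key  = number_format($lat, PRECISION, '.', '') . '_'
          . number_format($long, PRECISION, '.', '') . '.data';
    $file = CACHE_DIR . '/' . $key;

    // Serve from cache if the file exists and is younger than the TTL.
    if (is_file($file) && time() - filemtime($file) < CACHE_TTL) {
        return file_get_contents($file);
    }

    $data = fetch_from_db($lat, $long); // hypothetical expensive query
    file_put_contents($file, $data, LOCK_EX);
    return $data;
}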
I don't know why someone downvoted the question; I believe it is a very good and intelligent one. There goes my upvote :)

If all you are concerned about is the queries, one approach might be a DB table that stores query results as JSON or serialized PHP objects, along with whatever fields you need to match locations.
A cron job running at whatever interval best suits you would clear out expired results.
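As a rough sketch of that idea (the table, column and helper names here are invented), the lookup could go like this:

<?php
// Assumes a MySQL table such as:
//   CREATE TABLE query_cache (lat DECIMAL(9,5), lng DECIMAL(9,5),
//     result JSON, created_at DATETIME, PRIMARY KEY (lat, lng));
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$lat = (float)$_GET['lat'];
$lng = (float)$_GET['long'];

$stmt = $pdo->prepare('SELECT result FROM query_cache
    WHERE lat = ? AND lng = ? AND created_at > NOW() - INTERVAL 5 MINUTE');
$stmt->execute([$lat, $lng]);
$result = $stmt->fetchColumn(); // false when there is no fresh row

if ($result === false) {
    $result = json_encode(run_expensive_query($lat, $lng)); // hypothetical helper
    $pdo->prepare('REPLACE INTO query_cache (lat, lng, result, created_at)
        VALUES (?, ?, ?, NOW())')->execute([$lat, $lng, $result]);
}

// The cron job then only needs:
//   DELETE FROM query_cache WHERE created_at < NOW() - INTERVAL 5 MINUTE;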


tracking calls to a PHP WebService for each user/IP

Basically, I have a PHP web service which will be available from:
website
mobile phone clients (Android/iPhone)
Data is retrieved in JSON format. Requests are sent using the GET method over HTTPS. The data is kept in a MongoDB database.
Here come my main questions:
I want to get statistics for each user - how many calls he makes per minute/hour. How should I save this data? Will it be okay to save everything into the Mongo database (1 day of statistics = 43 million rows)? Is there perhaps a solution which keeps data only for X days and then truncates everything automatically?
I also want to get statistics for each IP - how many calls were made from it per minute/hour/day. How should I save this data? Will it be okay to save everything into the Mongo database? Won't it become too large?
Is it always possible to see the IP of a user who is making a call to the web service? What about IPv6?
These 3 questions are the most interesting to me at the moment. I am planning on letting users use the basic services without needing to log in! My main concern is database/file system performance.
Next comes a description of the measures I am planning to use, how I came to this solution, and why the 3 questions above are essential. Feel free to ignore the text below if you are not interested in the details :)
I want to protect my web service against crawling, i.e. somebody passing parameters (which are not hard to guess) to pull the entire data set out of my database :)
Here is a usage example: https://mydomain/api.php?action=map&long=1.23&lat=2.45
As you can see, I am already using the secure HTTPS protocol in order to prevent accidental capture of the entire GET request. It also protects against 'man in the middle' attacks. However, it doesn't stop attackers from visiting the website and reading through the JS AJAX calls to learn the actual request structure, or from decompiling the entire Android .APK file.
After reading a lot of questions all over the internet, I came to the conclusion that there is no way of protecting my data entirely, but I think I have found an approach that makes a crawler's life a lot harder!
And I need your advice on whether this whole thing is worth implementing, and what technologies should be used in my case (see next).
Next come the security measures against non-website (mobile device) use of the service by users who are not logged in.
<?php
/*
Case: "App first started". User: "not logged in"
0. generate UNIQUE_APP_ID for APP when first started
1. send UNIQUE_APP_ID to server (or request new UNIQUE_APP_ID from server to write it on device)
1.1. Verify how many UNIQUE_APP_IDs were created from this IP
if more than X unique values in the last Y minutes ->
ban the IP temporarily
ban all UNIQUE_APP_IDs created by this IP during those Y minutes (use the delay to link them together),
but exclude UNIQUE_APP_IDs from the ban if they are not showing strange behaviour (mainly: no calls to the API)
force affected users to log in to continue using the service, or ask them to retry later when the ban wears off
else register the UNIQUE_APP_ID on the server
Note: this is a 'hard' case, as the IP might belong to some public Wi-Fi AP. Precautions:
* temporary instead of permanent ban
* activity check for each UNIQUE_APP_ID belonging to the IP. There could be legit users who have been using the service for a long time and thus will not be affected by this check (created Z time ago from this IP).
* users will never be banned outright - rather they are forced to log in, where more restrictive actions can be performed individually!
Now that the application is registered and all validations have passed:
2. Case: "call to API is made". User: "not logged-in"
2.1. get IP from which call is made
2.2. get unique app ID of client
2.3. verify the ID against the DB on the server
if not exists -> reject call
2.4. check how many calls this particular ID did in the last X minutes
if more than X calls -> ban only this unique ID
2.5 check how many IDs were banned from this IP in the last X minutes; if more than Y, then temporarily ban new calls for the whole IP
check whether all banned IDs were created from the same IP; if yes, then also ban that IP, if it is different from the new IP
*/
?>
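To make steps 2.1-2.4 concrete, a rough sketch follows (every table, column and threshold name in it is invented for illustration):

<?php
$ip    = $_SERVER['REMOTE_ADDR'];   // 2.1: IP the call is made from
$appId = $_GET['app_id'] ?? '';     // 2.2: unique app ID of the client

$pdo = new PDO('mysql:host=localhost;dbname=api', 'user', 'pass');

// 2.3: verify the ID against the DB; reject unknown or banned IDs.
$stmt = $pdo->prepare('SELECT banned FROM app_ids WHERE id = ?');
$stmt->execute([$appId]);
$row = $stmt->fetch(PDO::FETCH_ASSOC);
if ($row === false || $row['banned']) {
    http_response_code(403);
    exit;
}

// 2.4: count this ID's calls in the last X minutes; ban it when over the limit.
$stmt = $pdo->prepare('SELECT COUNT(*) FROM calls
    WHERE app_id = ? AND made_at > NOW() - INTERVAL 5 MINUTE');
$stmt->execute([$appId]);
if ($stmt->fetchColumn() > 100) { // assumed limit
    $pdo->prepare('UPDATE app_ids SET banned = 1 WHERE id = ?')->execute([$appId]);
    http_response_code(429);
    exit;
}
// 2.5 would run a similar count over the banned IDs belonging to $ip.
?>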
As you can see, my whole solution is based on the idea that I can store data about each single web service call (or maybe every X'th call) and retrieve it for analysis easily. I have no idea what kind of database should be used; I was thinking that Mongo might not be the best choice. Maybe MySQL? Keeping the data safe from the wrong users is one reason; another is that abusive usage would put a huge load on the database (DDoS?). So I think it might be a good idea to count web service calls.
On the other side, a bit of calculation:
If there are 1000 users working simultaneously, each generating 30 calls to the web service per minute, that's 30,000 disc writes per minute. Per hour that's 60 times as many, i.e. 1,800,000 disc writes; if I keep statistics about daily usage, it's 24 times that again, i.e. on average 43,200,000 tracking records kept on the server per day.
Each record contains: time + IP + user unique ID.
I was also thinking about not storing any data at all and using Redis instead. I know some kind of counter exists there: for each individual IP I can create a separate key and start counting calls. In that case everything is kept in the server's RAM, and there is also an expiry parameter which can be set per key. For separate users I can store their IDs instead of the network IP. This solution only came to my mind after I finished writing this whole essay, so let me hear your ideas about the questions above.
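For reference, a minimal sketch of that Redis idea (using the phpredis extension; the limit of 30 calls per minute is an assumption taken from the calculation above):

<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$key = 'calls:' . $_SERVER['REMOTE_ADDR'];

// INCR creates the key with value 1 if it does not exist yet.
$calls = $redis->incr($key);
if ($calls === 1) {
    // First call in this window: start a 60-second expiry clock.
    $redis->expire($key, 60);
}

if ($calls > 30) { // assumed limit: 30 calls per minute
    http_response_code(429);
    exit('Too many requests');
}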

XML to Database, what route should I take?

I have access to a traffic data server from which I get XML files with the information that I need (example: Point A to Point B: travel time 20 min, distance 18 miles, etc.).
I download the XML file (which is archived), extract it, process it and store it in a DB. I only allow the XML file to be downloaded per request if 5 minutes have passed since the last download. The XML on the traffic server gets updated every 30 seconds to maybe 5 minutes. During the 5-minute period, any user requesting the web page will retrieve the data from the DB (no update), thereby limiting the number of requests made to the traffic server.
My problem with the current approach is that when I get a new XML file, the whole process takes some time (3-7 seconds), which makes the user wait too long before getting anything. When no XML download is needed and all the data is displayed straight from the DB, however, the process is very fast.
The archived XML is about 100-200 KB, while the unarchived one is about 2 MB. The XML file contains traffic data for 3 or 4 states, while I only need the data for one state. That is why I currently use the DB method.
Is this approach a good one? I was wondering if I should just extract the data directly from the downloaded XML file for every request and somehow limit how often the XML file gets downloaded from the traffic server. Or can anyone point me to a better way?
Sample of the XML file
This is how it looks on my website
You need to download the XML each time it changes - but only if you'll have active users in the period of time it takes to download the file. As you can't foresee the future, you don't know whether or not you'll get a request from a user within the next 7 seconds.
You can, however, possibly find out with a HEAD request whether the XML file has been updated.
So you could create yourself a service that downloads the XML from the remote system each time it changes. In case the data is indeed not needed that often, you can configure that service not to check and/or download that often.
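A minimal sketch of such a freshness check with cURL (the feed URL and the reliance on the Last-Modified header are assumptions; some servers expose an ETag instead):

<?php
// Ask the server for headers only; no body is transferred.
$ch = curl_init('https://traffic.example.com/feed.xml.gz'); // hypothetical URL
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_FILETIME, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);

$remoteTime = curl_getinfo($ch, CURLINFO_FILETIME); // -1 if unknown
curl_close($ch);

$localTime = @filemtime('cache/feed.xml.gz') ?: 0;
if ($remoteTime > $localTime) {
    // The remote file is newer: download and process it now,
    // before any user request has to wait for it.
}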
The rest of your system can stay independent of that download service, as long as you can work out its best configuration by statistical analysis of your users' behaviour.
If you need this to be even more real-time, you'd have to configure the new service based on changing data from the other system, and then you'd need to start interchanging data bidirectionally between the two systems, which is more complicated and can lead to more side effects. But from the numbers you give, this level of detail probably isn't needed anyway, so I wouldn't worry about it.

PHP Memcache potential problems?

I'll most probably be using Memcache for caching some database results.
As I haven't ever written or done caching before, I thought it would be a good idea to ask those of you who have already done it. The system I'm writing may have concurrent scripts running at some point in time. This is what I'm planning on doing:
1. I'm writing a banner exchange system.
2. The information about the banners is stored in the database.
3. There are different sites, with different traffic, loading a PHP script that generates the code for those banners (so that the banners are displayed on the client's site).
4. When a banner is displayed for the first time, it gets cached with Memcache.
5. The banner cache has a lifetime, for example 1 hour.
6. Every hour the cache is renewed.
The potential problem I see in this task is at steps 4 and 6.
If we have, for example, 100 sites with big traffic, the script may end up with several instances running simultaneously. How can I guarantee that when the cache expires it gets regenerated only once and the data stays intact?
How can I guarantee that when the cache expires it gets regenerated only once and the data stays intact?
The approach to caching I take is, for lack of a better word, a "lazy" implementation: don't cache something until it has been retrieved once, in the hope that someone will need it again. Here's what that algorithm looks like, sketched against the Memcache extension ($memcache is a connected Memcache instance; fetch_from_db() stands in for your query):
// get() returns false if there is no value or the value has expired.
$result = $memcache->get($key);
if ($result === false)
{
    $result = fetch_from_db();
    // Set it for next time, until it expires anyway.
    $memcache->set($key, $result, 0, $expiry);
}
This works pretty well for what we use it for, as long as you use the cache intelligently and understand that not all information is the same. For example, in a hypothetical user comment system, you don't need an expiry time, because you can simply invalidate the cache whenever a user posts a new comment on an article; the next time the comments are loaded, they're re-cached. Some information, however (weather data comes to mind), should get a manual expiry time, since you're not relying on user input to update the data.
For what it's worth, memcache works well in a clustered environment, and you should find that setting something like that up isn't hard to do, so this should scale pretty easily to whatever you need it to be.
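One more note on the "regenerated once" part of the question: the lazy pattern above can still stampede when many script instances see the expired key at the same moment. A common mitigation, sketched here rather than taken from the answer above, is to use Memcache::add(), which succeeds for exactly one client, as a short-lived lock:

$result = $memcache->get($key);
if ($result === false) {
    // add() is atomic: only one process gets to rebuild the entry.
    if ($memcache->add($key . ':lock', 1, 0, 30)) { // 30 s lock, assumed
        $result = fetch_from_db();
        $memcache->set($key, $result, 0, $expiry);
        $memcache->delete($key . ':lock');
    } else {
        usleep(100000);                  // another process is rebuilding;
        $result = $memcache->get($key);  // wait briefly and re-check
    }
}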

Handling HTTP requests

If HTTP requests come to a web server from many clients, the requests will be handled in order.
For all the HTTP requests I want to use a token bucket system.
So when the first request comes in, I write a number to a file and increment the number for the next request, and so on.
I don't want to do it in the DB since the DB size would increase.
Is this the right way to do this? Please suggest.
Edit: so if a user posts a comment, the comment should be stored in a file instead of the DB. To keep track of it, there is a variable that is incremented for every request; this number is used in the file name and referred to later. If there are many requests, is this the right way to do it?
Thanks.
Why not lock files in a folder (http://php.net/manual/en/function.flock.php)?
First call locks 01,
Second call locks 02,
3rd call locks 03,
01 gets unlocked,
4th call locks 01
Basically, each PHP script tries to lock the first file it can, and when it's done it unlocks/erases the file.
I use this in a system with 250+ child processes spawned by a "process manager"; I tried using a database for it, but that slowed everything down.
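A minimal sketch of that slot-locking idea (the locks/ directory, file names and slot count are illustrative):

<?php
// Try to grab the first free slot; each slot is one lock file.
$slots = 5; // assumed number of concurrent slots; locks/ assumed to exist
$fp = null;
for ($i = 1; $i <= $slots; $i++) {
    $fp = fopen(sprintf('locks/%02d.lock', $i), 'c');
    if (flock($fp, LOCK_EX | LOCK_NB)) {
        break; // got this slot
    }
    fclose($fp);
    $fp = null;
}

if ($fp !== null) {
    // ... do the work while holding the lock ...
    flock($fp, LOCK_UN);
    fclose($fp);
} else {
    // All slots are busy: wait, retry, or reject the request.
}
?>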
If you want to keep incrementing the file number for some content, I would suggest using mktime() or time():
$dir = 'data/'; // assumed target directory
$now = time();
$suffix = 0;
while (is_file($dir . $now . '_' . $suffix)) {
    $suffix++;
}
But again, depending on how you want to read or use the data, there are many options. Could you provide more details?
-----EDIT 1-----
Each request has a "lock file", and the lock id (number) is stored in $lock.
Suppose three visitors post at the same time holding lock ids 01, 02 and 03 (the last step in the situation described above):
$now = time();
$suffix = 0;
$post_id = 30;
$dir = 'posts/' . $post_id . '/';
if (!is_dir($dir)) { mkdir($dir, 0777, true); }
while (is_file($dir . $now . '_' . $lock . '_' . $suffix . '.txt')) {
    $suffix++;
}
file_put_contents($dir . $now . '_' . $lock . '_' . $suffix . '.txt', $comment); // $comment holds the posted text
The while loop should not be needed, but I usually keep it anyway, just in case :).
That should create the txt files 30/69848968695_01_0.txt, ..02_0.txt and ..03_0.txt.
When you want to show the comments, you just sort them by filename.
The database size need not increase. All you need is a single row. In concept the logic goes:
Read row, taking lock, getting the current count
Write row with count incremented, releasing lock
Note that you're using the database locks to deal with the possibility that multiple requests are being processed at the same time.
So I'm suggesting using the database as the place to manage your count. You can still write your other data to files if you wish, but you'll still need housekeeping for the files; is that really much harder with a database?
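A sketch of that single-row counter with PDO and MySQL (the table and column names are made up; SELECT ... FOR UPDATE takes the row lock):

<?php
// Assumes a one-row table: CREATE TABLE counter (id INT PRIMARY KEY, value INT);
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->beginTransaction();
// Lock the row so concurrent requests queue up here rather than clash.
$count = $pdo->query('SELECT value FROM counter WHERE id = 1 FOR UPDATE')
             ->fetchColumn();
$pdo->exec('UPDATE counter SET value = value + 1 WHERE id = 1');
$pdo->commit(); // lock released; $count is this request's number
?>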
I agree with some of the other commenters that, whichever problem you are trying to solve, you may be making it more difficult than it needs to be.
Your example is mentioning putting comments in a file and keeping them outside the database.
What is the purpose of your count, exactly? Do you want to count the number of comments a user has made, or the total number of comment requests?
If you don't have to update the count anywhere in real time, you could write a simple script that reads your server access logs and adds up the total.
Also, as Matthew points out above, if you want requests to be handled in a particular order, you will rapidly be heading for strange concurrency bugs and performance issues.
If you update your post to include details more explicitly, we should be able to help you further.
Hope this helps.

Generating scoreboards on large traffic sites

A bit of an odd question, but I'm hoping someone can point me in the right direction. Basically I have two scenarios, and I'd like to know which one is best for my situation (a user checking a scoreboard on a high-traffic site):
1. The top 10 is regenerated every time a user hits the page - increased load on the server, especially under high traffic, but the user sees his/her correct standing immediately.
2. The top 10 is regenerated at a set interval, e.g. every 10 minutes - only one set of results is generated, causing one spike every 10 minutes rather than potentially one every x seconds, but a user who hits the page between refreshes won't see their updated score.
Each one has its pros and cons. In your experience, which one would be best to use, or are there any magical alternatives?
EDIT - an update: after taking on board what everyone has said, I've decided to rebuild this part of the application. Rather than dealing with individual scores I now deal with the totals, which are saved out to a separate table that acts as a kind of cached data source.
Thank you all for the great input.
Adding to Marcel's answer, I would suggest only updating the scoreboards upon write events (like a new score or a deleted score). This way you can keep static answers for popular queries like the Top 10. Use something like Memcache to keep the data cached for requests, or if you can't install something like Memcache on your server, serialize common requests and write them to flat files, then delete/update them upon write events. Have your code look for the cached result (or file) first, and only if it's missing do the query and create the data.
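A minimal sketch of that look-aside cache with the flat-file variant (the cache path and the recompute_top10() helper are hypothetical):

<?php
// On a read: serve the cached Top 10 if it exists, otherwise rebuild it.
function get_top10(): array
{
    $file = 'cache/top10.json'; // assumed cache location
    if (is_file($file)) {
        return json_decode(file_get_contents($file), true);
    }
    $top10 = recompute_top10(); // hypothetical expensive query
    file_put_contents($file, json_encode($top10), LOCK_EX);
    return $top10;
}

// On a write event (new or deleted score): drop the cache so the
// next read rebuilds it.
function on_score_change()
{
    @unlink('cache/top10.json');
}
?>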
Nothing ever needs to be real time when it comes to the web. I would go with option 2; users will not notice that their score isn't changing instantly. You can use some JS to refresh the top 10 every time the cache has been cleared.
To add to Jordan's suggestion: I'd put the scoreboard in a separate (HTML-formatted) file that is produced every time new data arrives, and only then. You can include this file in the PHP page containing the scoreboard, or even let a visitor's browser fetch it periodically using XMLHttpRequest (to save bandwidth). Users with JavaScript disabled, or using a browser that doesn't support XMLHttpRequest (rare these days, but possible), will just see a static page.
The Drupal voting module will handle this for you, giving you the option of when to recalculate. If you're implementing it yourself, then caching the top 10 somewhere is a good idea: you can either regenerate it at regular intervals or invalidate the cache at certain points. You'd need to look at how often people vote, how often that changes the top 10, how often the top 10 page is viewed, and the performance hit that regenerating it involves.
If you're not set on Drupal/MySQL, then CouchDB would be useful here. You can create a view which calculates the top 10 data, and it'll be cached until something happens that makes a recalculation necessary. You can also put an HTTP caching proxy inline to cache results for a set number of minutes.
