I've run into a bit of a pickle. Usually I can find solutions to my problems by some extensive googling (to the right SO thread), but not this time.
I'm using an API that lets me access their data by cURL, and this API has a request limit of 500 requests per 10 minutes. If this limit is repeatedly exceeded, your API key gets suspended.
My application, written in PHP, frequently makes requests through different pages - sometimes simultaneously. I need a solution that ensures that I'm under the request limit at all times. I need to be able to check and update a variable that resets every 10 minutes, from all scripts - at the same time.
Can this even be done? How should I approach this problem?
I'm not asking for the source code of a perfect solution, I'd just like some pointers on how this could be solved - or, if it can't be solved, what the alternative approach is.
Thank you in advance.
Sounds like memcache is what you're looking for. It can hold variables like PHP objects or base types, and give each of them an expiration date. Just google for the memcached extension for PHP and have a look here: http://php.net/manual/en/class.memcache.php.
Furthermore, it can be accessed via TCP and has some failover capabilities. AWS offers a service compatible with the memcache API called ElastiCache.
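For illustration, a rough sketch of that idea, assuming the Memcached PECL extension and a local memcached server; the key name and the 600-second bucket are my own choices for the 500-requests-per-10-minutes limit from the question:

<?php
// Sketch: shared request counter in memcached, bucketed per 10-minute window.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

// All scripts running in the same 10-minute window share one key.
$bucket = (int) floor(time() / 600);
$key    = 'api_calls_' . $bucket;

// Create the counter if it does not exist yet; it expires with the window.
$mc->add($key, 0, 600);

$calls = $mc->increment($key);
if ($calls === false || $calls > 500) {
    // Over the limit (or memcached unavailable): skip the cURL request.
    exit('API request limit reached, try again later.');
}
// ...safe to perform the cURL request here...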
If the pages are run by different users, the only thing I can think of is storing the counter in a table.
Do something like
$datestring = substr(date("Y-m-d H:i"), 0, -1); // e.g. "2016-11-07 17:0"
This way you always get the same string for a whole 10-minute window, e.g. for 2016-11-07 17:00:00 through 2016-11-07 17:09:59.
Then store in a table with this mechanism:
"INSERT INTO table (datestring, count) VALUES ('$datestring', '1') ON DUPLICATE KEY UPDATE count = count + 1";
Of course, declare datestring as CHAR(15) with a UNIQUE key.
You check your count with a SELECT comparing against the datestring you just created: you either get no rows at all, or a row with the current count value.
This is a solution, of course there are many.
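Putting the pieces together, a rough sketch of the whole mechanism (assuming PDO; table and column names are just examples):

<?php
// Sketch: shared request counter in MySQL, keyed by a 10-minute datestring.
// Assumes a table: CREATE TABLE api_counter (datestring CHAR(15) PRIMARY KEY, count INT NOT NULL);
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$datestring = substr(date("Y-m-d H:i"), 0, -1); // same value for a whole 10-minute window

// Atomically create or bump the counter for this window.
$stmt = $pdo->prepare(
    "INSERT INTO api_counter (datestring, count) VALUES (?, 1)
     ON DUPLICATE KEY UPDATE count = count + 1"
);
$stmt->execute([$datestring]);

// Read the current count back and decide whether the API may be called.
$stmt = $pdo->prepare("SELECT count FROM api_counter WHERE datestring = ?");
$stmt->execute([$datestring]);
$count = (int) $stmt->fetchColumn();

if ($count > 500) {
    exit('API request limit reached for this 10-minute window.');
}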
Related
Basically, I have a PHP webservice which will be available from:
website
mobile phone clients (Android/iPhone)
Data is retrieved in JSON format. Requests are sent using the GET method over HTTPS. Data is stored in a MongoDB database.
Here comes my main questions:
I want to get statistics for each user - how many calls he is making per minute/hour. How should I save this data? Will it be OK to save everything into the MongoDB database (1 day of statistics = 43 million rows)? Is there maybe a solution which will keep data only for X days and then auto-truncate everything automatically?
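(I have since read that MongoDB's TTL indexes might cover the auto-truncation part - a sketch, assuming the official MongoDB PHP library; the api_log collection name is made up:)

<?php
// Sketch: let MongoDB expire old tracking documents automatically via a TTL index.
// Assumes the mongodb extension + library; "api_log" is a hypothetical collection.
require 'vendor/autoload.php';

$collection = (new MongoDB\Client)->mydb->api_log;

// Documents are removed automatically once "createdAt" is older than 7 days.
$collection->createIndex(
    ['createdAt' => 1],
    ['expireAfterSeconds' => 7 * 24 * 3600]
);

// Log each call with a proper date object so the TTL index applies.
$userId = 42; // placeholder user identifier
$collection->insertOne([
    'userId'    => $userId,
    'ip'        => $_SERVER['REMOTE_ADDR'],
    'createdAt' => new MongoDB\BSON\UTCDateTime(),
]);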
I also want to get statistics for each IP - how many calls were made from it per minute/hour/day. How should I save this data? Will it be OK to save everything into the MongoDB database? Won't it become too large?
Is it always possible to see the IP of a user who is making a call to a webservice? What about IPv6?
These 3 questions are the most interesting to me currently. I am planning on letting users use basic services without the need of logging in! My main concern is database/file system performance.
Next comes a description of the measures I am planning to use, how I came to this solution, and why the 3 questions above are essential. Feel free to ignore the text below if you are not interested in the details :)
I want to protect my webservice against crawling, i.e. somebody passing parameters (which are not hard to guess) to pull the entire data set off my database :)
Here is a usage example: https://mydomain/api.php?action=map&long=1.23&lat=2.45
As you can see, I am already using HTTPS, in order to prevent casual interception of the entire GET request. It also protects against 'man in the middle' attacks. However, it doesn't stop attackers from going to the website and reading through the JS AJAX calls to learn the actual request structure, or from decompiling the entire Android .APK file.
After reading a lot of questions throughout the internet, I came to the conclusion that there is no way of protecting my data entirely, but I think I have found an approach that makes the life of crawlers a lot harder!
And I need your advice on whether this whole thing is worth implementing, and what technologies should be used in my case (see next).
Next come the security measures against non-website (mobile device) use of the service for users who are not logged in.
<?php
/*
Case: "App first started". User: "not logged in"
0. generate UNIQUE_APP_ID for APP when first started
1. send UNIQUE_APP_ID to server (or request new UNIQUE_APP_ID from server to write it on device)
1.1. Verify how many UNIQUE_APP_IDs were created from this IP
if more than X unique values in last Y minutes ->
ban IP temporarily
ban all UNIQUE_APP_IDs created by this IP during the last Y minutes (use the delay to link them together),
but exclude UNIQUE_APP_IDs from the ban if they are not showing strange behaviour (mainly: no calls to the API)
force affected users to log in to continue using the service, or ask them to retry later when the ban wears off
else register UNIQUE_APP_ID on server
Note: this is a 'hard' case, as the IP might belong to some public Wi-Fi AP. Precautions:
* temporary instead of permanent ban
* activity check for each UNIQUE_APP_ID belonging to the IP. There could be legit users who have used the service for a long time and thus will not be affected by this check (created Z time ago from this IP).
* users will never be banned outright - rather forced to log in, where more restrictive actions will be performed individually!
Now that the application is registered and all validations are passed:
2. Case: "call to API is made". User: "not logged-in"
2.1. get IP from which call is made
2.2. get unique app ID of client
2.3. verify the ID against the DB on the server
if not exists -> reject call
2.4. check how many calls this particular ID did in the last X minutes
if more than X calls -> ban only this unique ID
2.5 check how many IDs were banned from this IP in the last X minutes. If more than Y, then temporarily ban new calls for the whole IP
check if all banned IDs were created from the same IP; if yes, then also ban that IP if it is different from the new IP
*/
?>
As you can see, my whole solution is based on the idea that I can store data about each single webservice call (or maybe each X-th call) and retrieve it for analysis easily. I have no idea what kind of database should be used. I was thinking that Mongo might not be the best choice - maybe MySQL? Keeping data safe from the wrong users is one reason. Another reason is that abusive usage will result in a huge load on the database (DDoS?), so I think it might be a good idea to count webservice calls.
On the other side, a bit of calculation.
If there are 1,000 users working simultaneously, each generating 30 calls to the webservice per minute, that's 30,000 disk writes per minute. Per hour it's 60 times that, i.e. 1,800,000 disk writes. If I plan to keep statistics about daily usage, then it's 24 times that, i.e. on average 43,200,000 tracking records kept on the server.
Each record contains information about: time + IP + user unique ID
I was also thinking about not storing any data at all and using Redis instead. I know there is a counter of some kind: for each individual IP I can create a separate key and start counting calls. In this case everything is kept in the server's RAM, and there is also an expiration parameter that can be set for each key. For separate users I can store their IDs instead of the network IP. This solution only came to my mind after I finished writing this whole essay, so let me hear your ideas about the questions above.
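Roughly what I have in mind - a sketch assuming the phpredis extension (key names and limits are made up):

<?php
// Sketch: per-IP call counter in Redis with automatic expiry.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$ip  = $_SERVER['REMOTE_ADDR'];
$key = 'calls:ip:' . $ip;          // or 'calls:id:' . $uniqueAppId for known clients

$calls = $redis->incr($key);       // creates the key at 1 if it does not exist
if ($calls === 1) {
    $redis->expire($key, 60);      // count within a rolling 60-second window
}

if ($calls > 30) {                 // made-up limit: 30 calls per minute
    http_response_code(429);
    exit('Too many requests');
}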
I am a newbie with PHP and therefore this is more of a conceptual question or maybe even a question about 'best practices'.
Often, I see websites with stats drawn from their database. For example, let's say it is a sales lead website. It may have stats at the top of the page like:
NEW SALES LEADS YESTERDAY: 123
NEW SALES LEADS THIS MONTH: 556
NEW SALES LEADS THIS YEAR: 3870
Obviously, this should not be calculated every time the page is displayed, right? That would potentially be a large burden on the server. How do people cache this type of data? Any best practices? I thought about writing a cron job that would calculate it on a daily basis and insert it into the database. What are your ideas? Thank you!
You can calculate it once and then store it in XCache. Here, however, there doesn't seem to be a need for a cron: the query can run once and store the result in XCache. The important thing is to set the expiration time of the stored value according to your use case. E.g. if you need to store daily stats like the above, set the expiration time to a few hours. For data that gets updated every minute, set the expiration time to a few minutes.
Something like this.
if (xcache_isset("newSalesLeadYest")) {
    $newSalesLeadYest = xcache_get("newSalesLeadYest");
} else {
    $newSalesLeadYest = runQueryToFetchStat(); // your stat query goes here
    // Cache the value for X seconds (X = your expiration time)
    xcache_set("newSalesLeadYest", $newSalesLeadYest, X);
}
What you need is to come up with a caching strategy.
Some factors to help you decide:
How frequently does the data change?
How important are the current values - is it OK if they're 1 minute, 1 hour, 1 day old?
How expensive, time-wise, is loading fresh data?
How much traffic are you getting? 10s, 100s, millions of requests?
There are a few ways you can achieve the result.
You can use something like memcached to persist the data and avoid regenerating it on each request.
You can use HTTP caching and load the data client-side using JavaScript from an API.
You can have a background worker (e.g. run by cron) which generates the latest figures and persists them to a lookup database table.
You could improve the queries and indexes so that getting live data is fast enough to do on every request.
You could alter your database schema so that you have more static data.
From the 3 examples you gave, 3 simple counts should not be expensive enough to warrant a complex caching system. If you can paste the SQL queries, we can help optimise them.
The data sounds like it will only get updated once per day, so a simple nightly cron "flatten" query would be a nice fit.
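For illustration, a sketch of such a nightly flatten script (table and column names are invented; schedule it from crontab, e.g. 0 0 * * *):

<?php
// Sketch: nightly CLI script that "flattens" the counts into a stats table.
// Assumes: CREATE TABLE stats (name VARCHAR(50) PRIMARY KEY, value INT NOT NULL);
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$stats = [
    'leads_yesterday'  => "SELECT COUNT(*) FROM leads WHERE created_at >= CURDATE() - INTERVAL 1 DAY AND created_at < CURDATE()",
    'leads_this_month' => "SELECT COUNT(*) FROM leads WHERE created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')",
    'leads_this_year'  => "SELECT COUNT(*) FROM leads WHERE created_at >= DATE_FORMAT(CURDATE(), '%Y-01-01')",
];

$upsert = $pdo->prepare(
    "INSERT INTO stats (name, value) VALUES (?, ?)
     ON DUPLICATE KEY UPDATE value = VALUES(value)"
);

foreach ($stats as $name => $sql) {
    $upsert->execute([$name, (int) $pdo->query($sql)->fetchColumn()]);
}

The front page then just reads three rows from the stats table instead of counting the big one.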
I need to show some basic stats on the front page of our site, like the number of blogs, members, and some other counts - all of which are basic queries.
I'd prefer to find a method to run these queries, say, every 30 minutes and store the output, but I'm not sure of the best approach and I don't really want to use a cron. Basically, I don't want to make thousands of queries per day just to display these results.
Any ideas on the best method for this type of function?
Thanks in advance
Unfortunately, cron is the better and more reliable solution.
Cron is a time-based job scheduler in Unix-like computer operating systems. The name cron comes from the word "chronos", Greek for "time". Cron enables users to schedule jobs (commands or shell scripts) to run periodically at certain times or dates. It is commonly used to automate system maintenance or administration, though its general-purpose nature means that it can be used for other purposes, such as connecting to the Internet and downloading email.
If you store the output in a disk file, you can always check whether its filemtime is less than 30 minutes old before re-running the expensive queries.
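A sketch of that check (the file path and query function are placeholders):

<?php
// Sketch: cache expensive query output in a file, refresh when older than 30 minutes.
$cacheFile = '/tmp/stats_cache.json'; // placeholder path

if (is_file($cacheFile) && filemtime($cacheFile) > time() - 30 * 60) {
    $stats = json_decode(file_get_contents($cacheFile), true);
} else {
    $stats = runExpensiveStatsQueries(); // placeholder for your queries
    file_put_contents($cacheFile, json_encode($stats));
}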
There is nothing at all wrong with using a cron to store this kind of stuff somewhere.
If you're looking for somewhat more sophisticated caching methods, I suggest reading up on memcached or APC, both of which could provide a solution for your problem.
A cron job is the best approach; I haven't seen anything else that is feasible.
There are many ways to do this. One that is good, though maybe not the best: store your data in a table and refresh it every 30 minutes using the function sleep().
I recommend you take a look at the WordPress blog system, and especially at the BuddyPress plugin.
I did the same some time ago, and every time someone loaded the page, the query did the job and retrieved the information from the database. I remember it was something like
SELECT COUNT(*) FROM my_table
and I got the number of posts, in my case.
Anyway, there are many approaches. Good luck.
Don't forget: cron is always your best friend.
Using cron is the simplest way to solve the problem.
One good reason for not using cron: you'll be generating the stats even if nobody requests them.
Depending on how long it takes to generate the data (you might want to keep track of the previous counts and just add the rows whose timestamp is greater than the previous run - with appropriate indexes!), you could trigger the regeneration when a request comes in and the data looks stale.
Note that you should keep the stats in the database, and think about how to implement a mutex to avoid multiple requests trying to update the cache at the same time.
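One simple way to get such a mutex is MySQL's advisory locks - a sketch (the lock name is arbitrary, regenerateStats() is a placeholder):

<?php
// Sketch: advisory lock so only one request regenerates stale stats at a time.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// Try to acquire the lock without waiting (timeout 0).
$got = (int) $pdo->query("SELECT GET_LOCK('stats_refresh', 0)")->fetchColumn();

if ($got === 1) {
    try {
        regenerateStats($pdo);  // placeholder for the expensive recalculation
    } finally {
        $pdo->query("SELECT RELEASE_LOCK('stats_refresh')");
    }
}
// If the lock was taken, another request is already refreshing;
// just serve the (slightly stale) cached stats.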
However, the right solution would be to update the stats every time a record is added. Unless you've got very large traffic volumes, the overhead would be minimal. While SELECT COUNT(*) FROM some_table runs very quickly, you'll obviously run into problems if you don't simply want to count all the rows in a table (e.g. if blogs and replies are held in the same table). Indeed, if you were to implement the stats update as a trigger on the relevant tables, then you wouldn't need to make any changes to your PHP code.
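For illustration, such a trigger could look like this (table and column names are invented):

CREATE TRIGGER blogs_stats_ai AFTER INSERT ON blogs
FOR EACH ROW
  UPDATE stats SET value = value + 1 WHERE name = 'blog_count';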
Reading http://blog.programmableweb.com/2007/04/02/12-ways-to-limit-an-api/ I wondered how to accomplish time-based limits (i.e. 1 call per second). Well, if authentication is required, is the way to compare the seconds tracked by PHP's time function (for instance, if ($previous_call_time == $current_call_time) { ... })? Any other suggestions?
Use a simple cache:
if(filemtime("cache.txt") < (int)$_SERVER["REQUEST_TIME"] - 3600) { // 3600 is one hour in seconds
$data = file_get_contents("http://remote.api/url/goes/here");
file_put_contents("cache.txt", $data);
} else
$data = file_get_contents("cache.txt");
Something like this will save the value of the API and let you get the information whenever you want, while still limiting how often you actually pull data from the feed.
Hope this helps!
The more general limitation criterion is "number of calls per period" - for example per hour, like Twitter does.
You can track the number of requests performed by adding a record to MySQL for each request.
The more performant solution is to use memcached and increment a key made of user_id + the current hour.
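In code that key scheme might look like this (a sketch with the Memcached extension; the hourly limit is arbitrary):

<?php
// Sketch: per-user hourly counter; the key rolls over naturally each hour.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$key = 'calls:' . $user_id . ':' . date('YmdH'); // e.g. calls:42:2016110717
$mc->add($key, 0, 3600);                         // create with a 1-hour TTL if missing
$calls = $mc->increment($key);

if ($calls !== false && $calls > 1000) {         // arbitrary hourly limit
    exit('Rate limit exceeded');
}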
Something to consider: You can use an external service to do this instead of building it yourself. E.g. my company, WebServius ( http://www.webservius.com ) currently supports configurable per-API and per-API-key throttling, and we are likely going to be adding even more features such as "adaptive throttling" (automatically throttle usage when API becomes less responsive).
I need to show the number of online visitors, but I'm having trouble choosing an algorithm to do it!
Maybe I should create a table in the DB where I'll store the IP addresses of visitors and the time of the visit. That way I can show the count of IP addresses whose time >= NOW() - 10 minutes, for example. ("NOW() - 10 minutes" is just to show the logic; I know this is not the real syntax :)
Is this a good way to go?
Please give me an idea.
Thanks
This is a good tutorial. Note that the MySQL (I believe you'll use it) online-users table should use the MEMORY storage engine.
I'm not really sure how you would use AJAX to store the data...
I personally use the database solution.
I store user_id, last_seen, IP and location on the site (but that's not necessary if you just want the count).
When a user requests a page, refresh the last_seen column and delete all entries where NOW() - last_seen is greater than X minutes.
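A sketch of that bookkeeping (assuming PDO; the online_users table and the 10-minute window are examples):

<?php
// Sketch: track online visitors in a table, prune stale rows, count the rest.
// Assumes: CREATE TABLE online_users (ip VARCHAR(45) PRIMARY KEY, last_seen DATETIME) ENGINE=MEMORY;
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$ip  = $_SERVER['REMOTE_ADDR'];

// Upsert this visitor's last_seen timestamp.
$pdo->prepare("INSERT INTO online_users (ip, last_seen) VALUES (?, NOW())
               ON DUPLICATE KEY UPDATE last_seen = NOW()")
    ->execute([$ip]);

// Drop visitors not seen for 10 minutes, then count the remainder.
$pdo->exec("DELETE FROM online_users WHERE last_seen < NOW() - INTERVAL 10 MINUTE");
$online = (int) $pdo->query("SELECT COUNT(*) FROM online_users")->fetchColumn();

echo "Online visitors: $online";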
Keeping track of "visitors" (as opposed to raw page requests, which the web server should track on its own) is a complex art.
You could store IP addresses, as you described, but what about a visitor who's using a proxy that rotates their IP as frequently as every page load? What about a set of visitors all using the same proxy that uses the same IP for all of them?
My recommendation: don't bother doing any of it yourself, and use Google's free Analytics service. It tracks visitors, browsers, traffic sources, and just about anything else you could possibly want to know about who's looking at your site.
Yes, the algorithm is OK in general, but with some corrections:
You may want to delete all outdated records first, and then just count the rest.
10 minutes is too much; 1-3 minutes is the average time a user spends on a page.
The AJAX project is funny, but it has nothing to do with storage. You need to learn more about the terminology of client-server applications: AJAX is the transport, not the storage.
If you expect the website to have a lot of visitors, the cleanup query could make the page pretty slow for one user every 10 minutes...
If so, I would suggest writing a CLI script that cleans the old entries and running it as a cronjob. That way the user wouldn't notice any delay, as the parse time would be spent in the CLI.