Caching location-based data cluster - PHP

How can we serve search results for a particular location without querying, again and again, a database that holds millions of records?
We have a database of billions of records with latitude and longitude, and it is growing every minute. We now need to serve this data to our mobile application, so we planned to show it in the following categories.
Showing the latest 10 inserted results:
For this, we use a queue table: every new record is inserted into it, and the oldest row is removed once the table holds more than 10 rows.
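Roughly, a sketch of this might look like the following (hypothetical latest_items table and PDO connection $pdo; the derived-table wrapper works around MySQL's restriction on deleting from a table that is also being selected from):

$pdo->prepare("INSERT INTO latest_items (record_id, lat, lng, created_at)
               VALUES (?, ?, ?, NOW())")
    ->execute(array($recordId, $lat, $lng));

// Trim the queue back down to the 10 newest rows.
$pdo->exec("DELETE FROM latest_items
            WHERE record_id NOT IN (
                SELECT record_id FROM (
                    SELECT record_id FROM latest_items
                    ORDER BY created_at DESC LIMIT 10
                ) AS newest
            )");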
Showing the latest 10 results near the user's location:
For this, we need to aggregate the data every 5 minutes and show it to the local users; we can then serve the same data to all users from that locality for those 5 minutes.
Now, I need help with the following:
How should I divide the areas? For example, if I divide the world into squares, then I can serve the same data for each square for the next 5 minutes. Are there any algorithms for dividing areas this way using geolocation, or do you think some other model suits this better?
How and where should I cache the content for each area so that it is served only for the next 5 minutes and then refreshed with new data? Is there any caching mechanism in the DB itself, or some other technique for this? For example, if areas A and B are two squares and a user from A requests data, then I need to cache the result and serve that same result, without querying the DB, to every user requesting from square A over the next 5 minutes, and refresh it after that, so that I can save server bandwidth. But how do I do this? Server-side caching? Temporary tables? How? (A rough sketch of the grid idea follows below.)
Please direct me on this, or if you think there is a better approach, please let me know. Any references are also greatly welcome. Thanks everyone in advance.
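For illustration, a minimal sketch of the grid idea in PHP (hypothetical names; a geohash library would produce a similar, better-behaved cell id):

// Snap coordinates to a fixed-size grid cell ("square") and use the cell id as the
// cache key, so everyone inside the same square shares one cache entry.
function grid_cell($lat, $long, $cellDeg = 0.1) {   // ~11 km squares (longitude cells shrink toward the poles)
    return floor($lat / $cellDeg) . ':' . floor($long / $cellDeg);
}

$cacheKey = 'area_' . grid_cell($userLat, $userLong);
// Look $cacheKey up in Memcached/Redis/a file cache with a 300-second TTL;
// on a miss, run the DB query once and store the result under $cacheKey.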

Related

Building a live database (approach)

I just want an approach for building a database with live records, so please don't just downvote; I don't expect any code.
At the moment I have a MySQL database with about 2,000 users, and they are getting more. Each player/user has a number of points, which increase or decrease through certain actions.
My goal is for this database to be refreshed about every second, so that users with more points move up and others move down, and so on.
My question is: what is the best approach for this "live database" where records have to be updated every second? In MySQL I can run time-based actions that execute a SQL command, but I don't think that is the best way. Can someone suggest a good way to handle this, e.g. other database systems such as MongoDB, or anything else?
EDIT
This can't be done client-side, so I can't simply push/post it into the database from some time-based client event. To explain: a user is training his character in the application. This training (to gain one level) takes 12 hours. After the time has elapsed, the record should be updated in the database AUTOMATICALLY, even if the user doesn't send a POST request himself (i.e. is not logged in), and other users should see the updated data in his profile.
You need to accept the fact that rankings will be stale to some extent. Your predicament is no different from that of any other gaming platform (or SO rankings, for that matter). Business decisions were put in place, and constantly get reviewed, about the acceptable level of staleness. Take the leaderboards on tags here, for instance, or the recent change that has profile pages updated a lot more frequently, versus around 4 AM GMT.
Consider the use of MySQL Events. This is built-in functionality that replaces the need for cron tasks. I have 3 event-related links off my profile page if interested. You could calculate ranks on a timed schedule (your tolerance for staleness), and the users' requests for them would be fast (faster than the on-the-fly approach from Gordon below). On the con side, they are stale.
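A hedged sketch of that Events idea, assuming a hypothetical users(id, points) table and a user_ranks(user_id PRIMARY KEY, ranking) snapshot table, with the event scheduler enabled (SET GLOBAL event_scheduler = ON):

// Recalculate ranks every minute into a snapshot table, so reads never have to compute them.
$pdo->exec("
    CREATE EVENT IF NOT EXISTS refresh_ranks
    ON SCHEDULE EVERY 1 MINUTE
    DO
      REPLACE INTO user_ranks (user_id, ranking)
      SELECT u.id,
             (SELECT COUNT(*) + 1 FROM users x WHERE x.points > u.points)
      FROM users u
");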
Consider not saving (writing) rank info at all, but rather just focusing on filling in the slots of your other data, and getting your rankings on the fly. As an example, see this rankings answer from Gordon. It is dynamic, runs upon request, is fresh at that moment, and would not require Events.
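Roughly the kind of on-the-fly calculation being referred to (a sketch under the same hypothetical users table, not a copy of the linked answer):

$stmt = $pdo->query("
    SELECT u.id, u.points,
           (SELECT COUNT(*) + 1 FROM users x WHERE x.points > u.points) AS ranking
    FROM users u
    ORDER BY u.points DESC
");
$leaderboard = $stmt->fetchAll(PDO::FETCH_ASSOC);   // computed fresh on every request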
Know that only you should decide what is tolerable for the UX.

duplicable database

Is there a way I can copy my database (a) on localhost and place its contents into database (b), also on localhost?
I was thinking maybe PHP has some manual way (e.g. a page that Opera refreshes every second) or automatic way to get data from db (a) and post it to db (b). Is anything like this possible?
What I need it for: in db (a) I store new members, and every 30 minutes db (a) deletes itself, but db (b) always keeps everything. So even if myweb.com gets hacked, no one knows to hack myweb1.com, if that makes any sense. One website hides the other.
It's very hard to answer your question because we don't know much about your web and SQL servers.
If you have 2 SQL servers, 2 instances, or 2 separate databases, you can easily create a SQL job
and schedule it to execute every 30 minutes or so.
The job would simply select the data from DB (a), insert it into DB (b), and then delete the records from DB (a). You need to do it in a SQL transaction (or track which rows were copied) so that new records added to DB (a) while you move records don't get deleted.
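If the job has to be driven from PHP/cron rather than a database-side scheduler, a rough sketch might look like this (hypothetical members table; $dbA and $dbB are PDO connections):

$rows = $dbA->query("SELECT id, name, email FROM members ORDER BY id")->fetchAll(PDO::FETCH_ASSOC);
if ($rows) {
    $insert = $dbB->prepare("INSERT INTO members (id, name, email) VALUES (?, ?, ?)");
    $dbB->beginTransaction();
    foreach ($rows as $r) {
        $insert->execute(array($r['id'], $r['name'], $r['email']));
    }
    $dbB->commit();

    // Delete only the rows that were actually copied, so members who signed up
    // while the copy was running are not lost.
    $lastRow = end($rows);
    $dbA->prepare("DELETE FROM members WHERE id <= ?")->execute(array($lastRow['id']));
}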

Kohana Session Table

Currently I am hosting a website with ~10k unique visitors a day and ~6 clicks per user.
So roughly 60k pageviews a day.
I use Kohana 3.2 and save the session data of every user in the "sessions" table. Every page request executes a timestamp refresh in this table! So that's roughly 60k updates (excluding selects/inserts/...) just refreshing timestamps.
The MySQL process is getting pretty slow..
So here are my questions:
Should I stop using the sessions table for saving user session data?
How can I use $_SESSION instead of the values from the table? (See the sketch at the end of this question.)
Is there another alternative to handle this problem right now? We ordered more server capacity but have to wait..
EDIT:
Maybe it would be enough to prevent all these "updates" on every click?
Okay. For the moment, at least, it was enough to truncate the "sessions" table. There were more than 1kk (1,000,000) records; is that why the database operations were getting so slow?
Maybe it's just a simple MySQL problem, and it would be enough to switch it to NoSQL.
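On the $_SESSION question above: a hedged sketch, assuming a stock Kohana 3.2 setup, is to switch the default session adapter back to PHP's native (file-based) sessions so page views stop writing to the sessions table on every request:

// In bootstrap.php (or anywhere before the first Session::instance() call):
Session::$default = 'native';

// Anywhere in the application afterwards:
$session = Session::instance();        // native driver, backed by $_SESSION
$session->set('user_id', $user->id);
$user_id = $session->get('user_id');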

Location-based caching system

I would like to have a location-based data caching system (on the server) for supplying data to a mobile application. That is, if a user requests data for a location (which is common to all users from the same area), I'll fetch the values from the DB and show them. But if a second user requests the same page within the next 5 minutes from the same location, I don't want to query the millions of records in the DB; I can just take the result from a file cache if it is there. Is anything like that available in PHP?
I am not aware of any such thing built into PHP, but it's not too hard to make your own caching engine with it. You create a cache directory and, for each request you get, check whether a file corresponding to that request exists in the cache directory.
E.g. your main parameters are lat and long.
Suppose you get a request with lat = 123 and long = 234 (taking some random values): you check whether a file named 123_234.data is present in your cache folder. If it is, you read the file and send its content as the output instead of querying the database; otherwise you read from the database and, before sending the response, write it to the file cache/123_234.data. That way you can serve later requests without querying the database again.
Challenges:
Time: the cache will expire at some point or another. So while checking whether the file exists, you also need to check the last-modified timestamp to make sure the cache has not expired. Whether the cache expires in a minute, 10 minutes, hours, days or months depends on your application's requirements.
Naming: making intelligent cache file names is going to be challenging, since even for a distance of 100 m the lat/long combination will be different. One option might be to choose the file names by reducing the precision. E.g. a real lat/long combination looks like 28.631541,76.945281; you might create a cache file named 28.63154_76.94528.data (reducing the precision to 5 decimal places). Again, it depends on whether you want to cache just a single point on the globe or a geographical region, and if a region, its radius.
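A minimal sketch of the file cache described above (5-minute TTL, coordinates rounded to 2 decimal places, i.e. roughly 1 km cells; assumes a writable cache/ directory and a hypothetical query_nearby_records() DB helper):

$ttl  = 300;
$key  = round($lat, 2) . '_' . round($long, 2);
$file = __DIR__ . '/cache/' . $key . '.data';

if (is_file($file) && (time() - filemtime($file)) < $ttl) {
    $results = json_decode(file_get_contents($file), true);    // cache hit
} else {
    $results = query_nearby_records($lat, $long);               // expensive DB query
    file_put_contents($file, json_encode($results), LOCK_EX);   // refresh the cache
}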
I don't know why someone down voted the question, I believe it is a very good and intelligent question. There goes my upvote :)
If all you are concerned about is the queries, one approach might be a DB table that stores query results as JSON or serialized PHP objects, along with whatever fields you need to match locations.
A cron job running on whatever interval best suits you would clear out the expired results.
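A hedged sketch of that table-based cache, assuming a hypothetical location_cache(cache_key PRIMARY KEY, payload, created_at) table and the same query_nearby_records() helper as above:

$key  = round($lat, 2) . '_' . round($long, 2);
$stmt = $pdo->prepare("SELECT payload FROM location_cache
                       WHERE cache_key = ? AND created_at > NOW() - INTERVAL 5 MINUTE");
$stmt->execute(array($key));

if ($payload = $stmt->fetchColumn()) {
    $results = json_decode($payload, true);                      // cache hit
} else {
    $results = query_nearby_records($lat, $long);                 // expensive DB query
    $pdo->prepare("REPLACE INTO location_cache (cache_key, payload, created_at)
                   VALUES (?, ?, NOW())")
        ->execute(array($key, json_encode($results)));
}
// The cron job then only needs:
//   DELETE FROM location_cache WHERE created_at < NOW() - INTERVAL 5 MINUTE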

How can I make my curl-based URL monitoring service lightweight to run?

I'm currently building a user panel which will scrape daily information using curl. For each URL it will INSERT a new row into the database. Every user can add multiple URLs to scrape. For example, the database might contain 1,000 users, and every user might have 5 URLs to scrape on average.
How do I run the curl scraping - via a cron job once a day at a specific time? Will a single dedicated server handle this without lag? Are there any techniques to reduce the server load? And about the MySQL database: with 5,000 new rows a day it will be huge after a single month.
In case you're wondering, I'm building a statistics service which will show the daily growth of users' pages (not talking about traffic), so as I understand it I need to insert a new value per user per day.
Any suggestions will be appreciated.
5,000 × 365 is only about 1.8 million rows... nothing to worry about for the database. If you want, you can stuff the data into MongoDB (needs a 64-bit OS). That will let you expand and shuffle load across multiple machines more easily when you need to.
If you want to run curl non-stop until it is finished from a cron job, just "nice" the process so it doesn't use too many system resources. Otherwise, you can run a script which sleeps a few seconds between each curl pull. If each scrape takes 2 seconds, that allows about 43,200 pages per 24-hour period. If you slept 4 seconds between each 2-second pull, that would let you do 14,400 pages per day (5k is roughly 35% of 14.4k, so you should be done in about a third of a day with a 4-second sleep between 2-second scrapes).
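A rough sketch of the throttled single-process version (hypothetical get_urls_to_scrape() and store_scrape_result() helpers); run it from cron with something like 0 3 * * * nice -n 19 php /path/to/scrape.php:

foreach (get_urls_to_scrape() as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $html = curl_exec($ch);
    curl_close($ch);

    store_scrape_result($url, $html);   // INSERT the daily row for this URL
    sleep(4);                           // throttle so the box stays responsive
}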
This seems very doable on a minimal VPS machine for the first year, or at least the first 6 months. After that, you can think about utilizing more machines.
(Edit: also, if you're worried about space, you can store the scraped page source gzipped.)
I understand that each customer's pages need to be checked at the same time each day to make the growth stats accurate. But, do all customers need to be checked at the same time? I would divide my customers into chunks based on their ids. In this way, you could update each customer at the same time every day, but not have to do them all at once.
For the database size problem I would do two things. First, use partitions to break the data into manageable pieces. Second, if a value did not change from one day to the next, I would not insert a new row for that page; when processing the data for presentation, I would then fill in the missing values. Unless all you are storing is small bits of text - then I'm not sure the number of rows is going to be that big a problem, provided you use proper indexing and pagination for your queries.
Edit: adding a bit of an example
function do_curl($start_index, $stop_index) {
    global $db; // assumes an open mysqli connection
    // Query for all pages with ids between start index and stop index
    $query  = "SELECT * FROM db_table WHERE id >= $start_index AND id <= $stop_index";
    $result = mysqli_query($db, $query);
    while ($row = mysqli_fetch_assoc($result)) {
        // do the curl request for $row['url'] here
    }
}
The URLs would look roughly like:
http://xxx.example.com/do_curl?start_index=1&stop_index=10
http://xxx.example.com/do_curl?start_index=11&stop_index=20
Perhaps the best way to deal with the growing workload and database size is to write a single cron script that generates the start_index and stop_index based on the number of pages you need to fetch and how often you intend to run the script.
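For example, the driver might look something like this hedged sketch (hypothetical CHUNK_SIZE; it calls do_curl() directly, though you could just as well request the do_curl URLs above):

define('CHUNK_SIZE', 10);
$row   = mysqli_fetch_row(mysqli_query($db, "SELECT MAX(id) FROM db_table"));
$maxId = (int) $row[0];
for ($start = 1; $start <= $maxId; $start += CHUNK_SIZE) {
    do_curl($start, $start + CHUNK_SIZE - 1);
}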
Use multi-cURL, and properly optimise (not simply normalise) your database design. If I were to run this cron job, I would spend time studying whether it is possible to do the work in chunks. Regarding hardware, start with an average configuration, keep monitoring it, and increase CPU or memory as needed. Remember, there is no silver bullet.
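A hedged sketch of the multi-cURL suggestion: fetch one chunk of URLs in parallel ($urls is assumed to hold one chunk of pages to scrape):

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);              // wait for activity instead of busy-looping
} while ($running > 0);

$pages = array();
foreach ($handles as $url => $ch) {
    $pages[$url] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);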
