I want to create a website using Google Maps and Fusion Tables where users can leave a marker with a message on a map, so that other users will be able to see this marker on their map. And all of this in real time.
I've already created a small prototype.
And I've got a question: Google has limitations on using Fusion Tables, so users won't be able to see markers placed beyond the 100,000th row:
Only the first 100,000 rows of data in a table are mapped or included
in query results.
Queries with spatial predicates only return data from within this
first 100,000 rows.
Therefore, if you apply a filter to a very large table and the filter matches data in rows
after the first 100K, these rows are not displayed.
How can I overcome this limitation?
Or would it be better to create my own database and use marker clustering to handle large numbers of markers, and forget about Fusion Tables altogether?
AFAIK there is no way around the 100K limitation. Perhaps a Google Premier license, costing money, would allow you to overcome this, but I'm not sure. Another possibility is to maintain 5 Fusion Tables, each with a maximum of 100K rows: you can display 5 Fusion Table layers at a time via the Google Maps API, and I don't see why this wouldn't work. You'd just have to run your query code against all the current layers. I've done this with 2 layers (both much smaller than 100K) and it worked fine.
At your broadest view, where you are threatened by the 100K limit, use the following algorithm:
Union(newest 50K records, random sample of 50K records from all older records)
and plot this "limited" set. There is no way that view is actually going to display every single point to your user anyway.
When the user zooms in, grab the current viewport bounds through JavaScript and filter your DB records to those within the viewing area.
This means you would first check whether the filter returns 100K rows (which suggests there are probably 100K+ records that could be displayed); if so, apply the random sampling algorithm again to reduce the result to 100K. If the filter returns fewer than 100K rows, you no longer need to random-sample and you are no longer threatened by the limit.
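If you do go the own-database route raised in the question, a minimal sketch of this sampling-plus-viewport approach against MySQL could look like the following (the markers table, its columns, and the connection details are all assumptions for illustration):

<?php
// Hypothetical sketch: cap the broadest view at 100K markers by combining the
// newest 50K with a random 50K of everything older, then switch to a viewport
// filter once the user zooms in.
$pdo = new PDO('mysql:host=localhost;dbname=mapdb', 'user', 'pass');

// Find the creation time that separates the newest 50K rows from the rest.
$cutoff = $pdo->query(
    "SELECT MIN(created_at) FROM
       (SELECT created_at FROM markers ORDER BY created_at DESC LIMIT 50000) AS newest"
)->fetchColumn();

// ORDER BY RAND() is expensive on large tables; a precomputed random column
// would scale better, but it keeps the sketch short.
$sample = $pdo->prepare(
    "(SELECT id, lat, lng, message FROM markers WHERE created_at >= :cutoff1)
     UNION ALL
     (SELECT id, lat, lng, message FROM markers WHERE created_at < :cutoff2
      ORDER BY RAND() LIMIT 50000)"
);
$sample->execute([':cutoff1' => $cutoff, ':cutoff2' => $cutoff]);
$broadView = $sample->fetchAll(PDO::FETCH_ASSOC);

// Zoomed in: the JavaScript side sends the viewport bounds and only markers
// inside the visible area are returned, capped at the 100K a layer can show.
$visible = $pdo->prepare(
    "SELECT id, lat, lng, message FROM markers
     WHERE lat BETWEEN :south AND :north AND lng BETWEEN :west AND :east
     LIMIT 100000"
);
$visible->execute([
    ':south' => $_GET['south'], ':north' => $_GET['north'],
    ':west'  => $_GET['west'],  ':east'  => $_GET['east'],
]);
$zoomedView = $visible->fetchAll(PDO::FETCH_ASSOC);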
I have a table with over 2 million rows. One of the values is an address, and some rows share a common address. I am using PHP.
On my website, I want the user to put in their zip code, and I will return all results within that zip code. I would then use Google to geocode them on a map. The problem is that since Google charges by the query, I can't waste time and money requesting coordinates for an address I already have. Here is what I believe to be the correct approach:
Ask user for zip code
Run "Select * with 'Zip Code' = $user_zip" (paraphrasing)
Run a Geolocate on first address and plot on map
Check for matching addresses in result and group with the mapped result
Find next new address
Repeat 3-6 until complete
Is there a better way to approach this? I am looking for efficiency, an easy way to manipulate all matching results at once, and the fewest queries possible. If my way is correct, can someone please help me with the logic for steps 3-5?
If I understand this right, what you are trying to do is render a map with a marker for each record in your database that is within a certain zip area, and your challenge is that you need coordinates to render each marker. The biggest issue with your approach, in terms of wasted resources, is that you do not store the coordinates of each address in your database. I would suggest that you:
1 - Alter the endpoint (or script, or whatever) that creates these records in your DB so that it fetches the coordinates and stores them in the database.
2 - Run a one-time migration to fetch coordinates for each existing record. I understand that doing this for 2 million rows could be "costly" with Google's Geocoding API (a rough estimate is $1,000 for 2 million API calls). To save on costs you could look into some of the open-source map tools.
Either way, fetching coordinates during the request lifecycle is both a waste of resources and will significantly affect speed.
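As a rough sketch of the first suggestion, assuming a locations table with address, zip, lat and lng columns (the names and the PDO setup are illustrative, not from the question), the insert path could geocode each address once and store the result:

<?php
// Hypothetical sketch: geocode an address at insert time and store the
// coordinates so they never have to be requested from Google again.
function geocode(string $address, string $apiKey): ?array {
    $url = 'https://maps.googleapis.com/maps/api/geocode/json?address='
         . urlencode($address) . '&key=' . urlencode($apiKey);
    $response = json_decode(file_get_contents($url), true);
    if (empty($response['results'][0]['geometry']['location'])) {
        return null; // the address could not be geocoded
    }
    $loc = $response['results'][0]['geometry']['location'];
    return ['lat' => $loc['lat'], 'lng' => $loc['lng']];
}

$pdo    = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$coords = geocode($address, $apiKey);
$stmt   = $pdo->prepare(
    'INSERT INTO locations (address, zip, lat, lng) VALUES (?, ?, ?, ?)'
);
$stmt->execute([$address, $zip, $coords['lat'] ?? null, $coords['lng'] ?? null]);

With coordinates stored alongside each row, the zip-code lookup becomes a single SELECT and the map can be drawn without any further geocoding calls.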
I am currently monitoring 5 different buildings; each building has around 300 rooms. Each room has 4 sensors: three that monitor temperature at different points of the room and one for the amount of power (kilowatts) the room is consuming.
I am currently polling every sensor every 15 minutes, which produces 576,000 entries per day, and the number of buildings I am monitoring is soon going to increase.
I am currently storing all the information in MySQL. I have a table for each sensor type, so the tables are named 'power', 'temp1', 'temp2', and 'temp3'. The columns within these tables are 'id', 'building_id', 'epoch', and 'value'.
I then use this data to produce graphs with the Chart.js library and statistics such as the amount of power used per building within a certain time period, etc. I do all this using PHP.
I don't believe my MySQL database is going to be able to handle this without serious scaling and clustering.
I need to be able to view historic data for 5 years although some of the granularity can be lost after a certain period of time.
I have been informed that RRD might be able to solve my problem and have done some research on it but still have some questions.
Will it still allow me to create my own graphs, specifically using the Chart.js library? If I can get time/value JSON data out of the RRD, this should be OK.
Also, how many different RRD files will I need to create? Would I need one per building? Per room? Per sensor? Is it still going to be easy to manage?
I have PHP scripts, run at 15-minute intervals, that pull the data from the sensors using SNMP and then insert it into MySQL. If I could use the same scripts to also insert into the RRD, that would be great; from what I have seen you can use PHP to insert into an RRD, so that should be OK.
EDIT: I am currently reading http://michael.bouvy.net/blog/en/2013/04/28/graph-data-rrdtool-sensors-arduino/ which has started to answer some of my questions.
Whether you have one RRD file with 6000 metrics, or 5 files with 1200 metrics, etc., depends on how you are managing the data.
Firstly, you should not group together metrics for which the samples arrive at different points in time. So, if you sample one room at a time, you should probably have one RRD file per room (with 4 metrics in it). This will depend on what manages your sensors, and whether you have one device per room or per building. Retrieving the data and graphing it works whether you have one file or a thousand (though the 'thousand' scenario works much better in the latest version of RRDTool).
Secondly, are you likely to add new data points (i.e., buildings or rooms)? You cannot (easily) add new metrics to an existing RRD file. So, if you expect to add a new building in the future, or to add or remove a room, then maybe one RRD per building or one per room would be better.
Without having any more information, I would guess that you'd be better off with one RRD per room (containing 4 metrics) and update them separately. Name the files according to the building and room IDs, and they can hold the power and 3 temperature values according to Epoch.
For graphing, RRDTool is of course capable of creating its own graphs by accessing the data directly. However, if you want to extract the data and put them into a graph yourself this is possible; the Xport function will allow you to pull out the necessary datapoints (possibly from multiple RRD files and with aggregation) which you can then pass to the graphing library of your choice. There is also the Fetch function if you want the raw data.
If your data samples are coming at 15min intervals, make sure you set your RRD Interval, Heartbeat and RRAs up correctly. In particular, the RRAs will specify what aggregation is performed and for how long data are kept at higher granularities. The RRAs should in general correspond to the resolutions you expect to graph the data at (which is why people generally use 5min/30min/2h/1d as these correspond nicely to daily, weekly, monthly and yearly graphs at a 400px width)
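To make that concrete, here is a rough sketch of creating and updating one RRD per room from the existing 15-minute PHP polling script. The file naming, DS names, heartbeat, and RRA retention choices are my own assumptions (roughly: 15-minute data for ~3 months, 2-hour averages for ~2 years, daily averages for ~5 years); adjust them to the resolutions you actually graph at.

<?php
// Hypothetical sketch: one RRD per room, four data sources, 15-minute step.
$file = sprintf('rrd/building%02d_room%03d.rrd', $buildingId, $roomId);

if (!file_exists($file)) {
    shell_exec(
        'rrdtool create ' . escapeshellarg($file) . ' --step 900'
        . ' DS:power:GAUGE:1800:0:U'
        . ' DS:temp1:GAUGE:1800:U:U DS:temp2:GAUGE:1800:U:U DS:temp3:GAUGE:1800:U:U'
        . ' RRA:AVERAGE:0.5:1:8928'   // 15-minute samples, about 93 days
        . ' RRA:AVERAGE:0.5:8:8760'   // 2-hour averages, about 2 years
        . ' RRA:AVERAGE:0.5:96:1825'  // daily averages, about 5 years
    );
}

// Called from the existing SNMP polling script after each reading.
shell_exec(
    'rrdtool update ' . escapeshellarg($file) . ' '
    . escapeshellarg(sprintf('N:%s:%s:%s:%s', $power, $temp1, $temp2, $temp3))
);

For the Chart.js side, rrdtool xport (DEF plus XPORT statements, with a --json output flag in newer RRDTool releases) or the PECL rrd extension can return the aggregated time/value series for PHP to re-emit as JSON.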
You might want to take a look at time-series databases and test a few systems that have built-in visualization, an API that allows you to perform aggregations and PHP wrappers. Time-series databases are optimized for efficient storage of timestamped data and have built-in functionality for time-series transformations.
https://en.wikipedia.org/wiki/Time_series_database
I've got a list of shops that I have put in a javascript array. I have their addresses as well.
I need to create an autocomplete which allows me to put in a city name and then displays the 3 shops nearest to that location. I imagine it will need to interface with Google's APIs somehow, but I'm not sure where to start.
I've got the actual jQuery autocomplete working against an AJAX script, but I don't know how to work out which locations are nearest.
You need the lat/long locations of the stores (https://developers.google.com/maps/documentation/geocoding/) and the lat/long location of the user. With some relatively simple mathematics (an equirectangular approximation, good enough over these distances) you can then calculate the distance in kilometres between these two points:
$distance = round((6371*3.1415926*sqrt(($lat2-$lat1)*($lat2-$lat1) +
cos($lat2/57.29578)*cos($lat1/57.29578)*($lon2-$lon1)*($lon2-$lon1))/180), 1);
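Building on that formula, here is a rough sketch of the "3 nearest" lookup. The $shops array, its keys, and the already-geocoded user position are illustrative assumptions; in practice the user coordinates would come from geocoding the city typed into the autocomplete.

<?php
// Hypothetical sketch: rank shops by distance from the user's location and
// keep the three nearest. Assumes every shop already has lat/lon stored.
function distanceKm(float $lat1, float $lon1, float $lat2, float $lon2): float {
    return round((6371 * 3.1415926 * sqrt(($lat2 - $lat1) * ($lat2 - $lat1)
        + cos($lat2 / 57.29578) * cos($lat1 / 57.29578)
        * ($lon2 - $lon1) * ($lon2 - $lon1)) / 180), 1);
}

$shops = [
    ['name' => 'Shop A', 'lat' => 51.507, 'lon' => -0.128],
    ['name' => 'Shop B', 'lat' => 53.480, 'lon' => -2.243],
    ['name' => 'Shop C', 'lat' => 52.487, 'lon' => -1.890],
    ['name' => 'Shop D', 'lat' => 55.953, 'lon' => -3.188],
];

// These would come from geocoding the city the user typed.
$userLat = 52.205;
$userLon = 0.119;

usort($shops, function (array $a, array $b) use ($userLat, $userLon): int {
    return distanceKm($userLat, $userLon, $a['lat'], $a['lon'])
        <=> distanceKm($userLat, $userLon, $b['lat'], $b['lon']);
});

$nearest = array_slice($shops, 0, 3); // the three nearest shops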
If you have a large number of stores and a large number of users, I advise caching these distances in a MySQL table; you have to do this for each store in your database. So you create a table for each zip code (for example) that requests this, and set up a cron job to remove these tables every hour or so.
So the process:
User asks for the nearest store
You get their location through the Google API (or your own storage)
Check if there's a table for their location
If yes, give them the results directly; if not, generate the table and then give them the results
Keep in mind that Google only allows a limited number of data requests. Even though this number is large (I believe 25,000 requests per day), it may be advisable to store the lat/lon locations of your stores AND users. That would also improve speed.
I made something similar to this: I fetched the lat/lon location at the moment a location was inserted into the database and stored it in a separate per-zip-code lat/lon table.
How do I build a proper structure for an analytics service? Currently I have one table that stores data about every user that visits a page carrying my client's ID, so that later my clients will be able to see the statistics for a specific date.
I've thought about it a bit today and I'm wondering: let's say I have 1,000 users and each has around 1,000 impressions on their sites daily. That means I get 1,000,000 (1M) new records every day in a single table. How will it work after 2 months or so, when the table reaches 60 million records?
I just think that after some time it will have so many records that the queries PHP runs to pull out the data will be really heavy, slow and resource-hungry. Is that true, and how can I prevent it?
A friend of mine is working on something similar and he is going to create a new table for every client. Is this the correct way to go?
Thanks!
The problem you are facing is an I/O-bound system. One million records a day is roughly 12 write queries per second. That's achievable, but pulling the data out while writing at the same time will leave your system bound at the HDD level.
What you need to do is configure your database to support the I/O volume you'll be doing: use an appropriate storage engine (InnoDB and not MyISAM), make sure you have a fast enough disk subsystem (RAID, not single drives, since they can and will fail at some point), design your schema optimally, inspect queries with EXPLAIN to see where you might have gone wrong with them, and maybe even use a different storage engine; personally, I'd use TokuDB if I were you.
And also, I sincerely hope you'll be doing your querying, sorting and filtering on the database side and not on the PHP side.
Take a look at the Google Analytics Platform Components Overview page and pay special attention to the way the data is written to the database, simply based on the architecture of the entire system.
Instead of writing everything to your database right away, you could write everything to a log file, then process the log later (perhaps at a time when the traffic isn't so high). At the end of the day, you'll still need to make all of those writes to your database, but if you batch them together and do them when that kind of load is more tolerable, your system will scale a lot better.
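As a rough sketch of that idea (the file paths, the impressions table, and its columns are assumptions), the tracking endpoint could append one line per impression and a cron job could bulk-load the file during a quiet period:

<?php
// Hypothetical sketch: append each impression to a log file instead of
// issuing a database write on every request.
$line = implode("\t", [time(), $clientId, $pageId, $_SERVER['REMOTE_ADDR'] ?? '']) . "\n";
file_put_contents('/var/log/tracker/impressions.log', $line, FILE_APPEND | LOCK_EX);

// Later, from a cron job: rotate the file and bulk-load it in one statement.
// LOAD DATA INFILE is far faster than a million individual INSERTs.
$batch = '/var/log/tracker/impressions.' . date('YmdHis') . '.log';
rename('/var/log/tracker/impressions.log', $batch);

$pdo = new PDO('mysql:host=localhost;dbname=analytics', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true, // needed for LOAD DATA LOCAL INFILE
]);
$pdo->exec(
    "LOAD DATA LOCAL INFILE " . $pdo->quote($batch) . "
     INTO TABLE impressions
     FIELDS TERMINATED BY '\\t'
     (epoch, client_id, page_id, ip)"
);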
You could normalize the impression data like this:
CREATE TABLE Clients (
    ID   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    Name VARCHAR(255) NOT NULL
);

CREATE TABLE Pages (
    ID        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    Page_Name VARCHAR(255) NOT NULL
);

CREATE TABLE PagesClientsVisits (
    ID        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    Client_ID INT UNSIGNED NOT NULL,
    Page_ID   INT UNSIGNED NOT NULL,
    Visits    INT UNSIGNED NOT NULL DEFAULT 0,
    UNIQUE KEY client_page (Client_ID, Page_ID)
);
and just increment visits on the final table on each new impression. Then the maximum number of records in there becomes (No. of clients * No. of pages)
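A minimal sketch of that increment, assuming the schema above (including the unique key over Client_ID and Page_ID) and a PDO connection, both of which are illustrative:

<?php
// Hypothetical sketch: one upsert per impression keeps the table at
// (clients x pages) rows instead of one row per hit. Relies on the
// UNIQUE KEY over (Client_ID, Page_ID).
$pdo  = new PDO('mysql:host=localhost;dbname=analytics', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT INTO PagesClientsVisits (Client_ID, Page_ID, Visits)
     VALUES (:client, :page, 1)
     ON DUPLICATE KEY UPDATE Visits = Visits + 1'
);
$stmt->execute([':client' => $clientId, ':page' => $pageId]);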
Having a table with 60 million records can be OK; that is what a database is for. But you should be careful about how many fields you have in the table, and also what data type (and therefore size) each field has.
You create some kind of reports from the data, so think about what data you really need for those reports. For example, you might need only the number of visits per user on every page; a simple count would do the trick.
What you also can do is generate the report every night and delete the raw data afterwards.
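A rough nightly job for that (the impressions and daily_report table names and columns are assumed for illustration) might look like:

<?php
// Hypothetical sketch: roll yesterday's raw rows up into a report table,
// then purge the raw rows that have just been summarised.
$pdo   = new PDO('mysql:host=localhost;dbname=analytics', 'user', 'pass');
$start = strtotime('yesterday 00:00:00');
$end   = $start + 86400;

$pdo->beginTransaction();
$report = $pdo->prepare(
    'INSERT INTO daily_report (report_date, client_id, page_id, visits)
     SELECT DATE(FROM_UNIXTIME(epoch)), client_id, page_id, COUNT(*)
     FROM impressions
     WHERE epoch >= :start AND epoch < :end
     GROUP BY DATE(FROM_UNIXTIME(epoch)), client_id, page_id'
);
$report->execute([':start' => $start, ':end' => $end]);

$purge = $pdo->prepare('DELETE FROM impressions WHERE epoch >= :start AND epoch < :end');
$purge->execute([':start' => $start, ':end' => $end]);
$pdo->commit();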
So, read and think about it.
To describe it some more: I have an image map of a region that, when clicked, runs a query to get more information about that region. My client wants me to display an approximate number of search results while hovering over that region of the image map. My problem is how to cache, or otherwise get, that number without heavily exhausting my server's resources.
By the way, I'm using PHP and MySQL, if that's necessary info.
You could periodically execute the query and then store the results (e.g., in a different table).
The results would not be absolutely up-to-date, but would be a good approximation and would reduce the load on the server.
MySQL can give you the approximate number of rows that would be returned by your query, without actually running the query. This is what EXPLAIN syntax is for.
You run the query with 'EXPLAIN' before the 'SELECT', and then multiply all the results in the rows column.
The accuracy is highly variable. It may work with some types of queries, but be useless on others. It makes use of statistics about your data that MySQL's optimizer keeps.
Note: running ANALYZE TABLE on the table periodically (e.g. once a month) may help improve the accuracy of these estimates.
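A minimal sketch of reading that estimate from PHP (the places table, the region_id column, and the connection details are assumptions):

<?php
// Hypothetical sketch: use the optimizer's row estimate instead of running a
// COUNT(*) over the whole table. The estimate can be far off for some queries.
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $pdo->prepare('EXPLAIN SELECT * FROM places WHERE region_id = :region');
$stmt->execute([':region' => $regionId]);

// Multiply the "rows" column across all rows of the EXPLAIN output
// (joins produce one EXPLAIN row per table involved).
$approx = 1;
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    $approx *= max(1, (int) $row['rows']);
}
echo "About $approx matching results";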
You could create another table keyed by the IDs of the regions, and then create a script that runs over all the regions at a slow time (at night, for example) and populates this extra table with the counts.
Then when the user hovers you read the count for that region from this new table, and it's at most a day old.
The issue with that could be that you do not have a slow time, or that the run-through done at night takes a long time and is very process-heavy.
EDIT, in response to your comment:
Take a large region or country if you can, run a query for that region within your SQL browser of choice, and check how long it takes.
If it is too much, you could distribute the work so that certain countries execute at certain hours: small countries together, and large countries on their own at some point.
Is your concern that you don't want to go to the database for this information at all when the rollover event occurs, or is it that you think the query will be too slow and you want the information from the database, but faster?
If you have a slow query, you can tune it, or use some of the other suggestions already given.
Alternatively, and to avoid a database hit altogether, it seems that you have a finite number of regions on this map, and you can run a query periodically for all regions and keep the numbers in memory.
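A slight variation of that idea keeps the per-region numbers in memory on the PHP side with APCu. In this sketch (the table, column, cache-key names and the one-hour lifetime are assumptions) a region's count is refreshed at most once an hour and hover requests are served from the cache in between:

<?php
// Hypothetical sketch: serve hover requests from an in-memory APCu cache and
// only re-count a region in MySQL when its cached value has expired.
function regionCount(PDO $pdo, int $regionId): int {
    $key   = "region_count_$regionId";
    $count = apcu_fetch($key, $found);
    if (!$found) {
        $stmt = $pdo->prepare('SELECT COUNT(*) FROM places WHERE region_id = ?');
        $stmt->execute([$regionId]);
        $count = (int) $stmt->fetchColumn();
        apcu_store($key, $count, 3600); // keep the approximate count for an hour
    }
    return $count;
}

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
header('Content-Type: application/json');
echo json_encode(['approx' => regionCount($pdo, (int) ($_GET['region'] ?? 0))]);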