I'm building an application where vehicle coordinates are logged by GPS. I want to implement a couple of features to start with, such as:
realtime tracking of vehicles
history tracking of vehicles
keeping locations and areas for customer records
I need some guidelines as to where to start on database and application design. Anything from best practices and hints to experience would really help me get on the right track.
How would one tackle ORM for geometry? For example: a location would map to a SpatialPoint class, while an area would map to a SpatialPolygon class.
How do I keep the massive data stream coming from the vehicles sane? I'm thinking of a table to keep the latest points in (for realtime data), and batch-parsing this data into polylines in a separate table for history purposes (one line per employee shift on a vehicle).
MySQL is probably not the best choice for this, but I'm planning on using Solr as the index for quick location-based searches. We do also need to do some realtime distance calculations, such as which vehicle is nearest to customer X. Any thoughts?
I can help you on one bit: MySQL definitely is the best choice. I've been down the same path as you many times, and the MySQL spatial extension is fantastic; in fact it's awesomely fast even over tables with 5 million+ rows of spatial data. It's all in the index. The spatial extension is one of the best kept MySQL secrets that few use ;)
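To make that concrete, here is a minimal sketch of what the spatial setup could look like. The table, column names and coordinates are made up, and it assumes MySQL 5.7+ (older versions use the same functions without the ST_ prefix, and only MyISAM supports SPATIAL indexes there):

    -- Hypothetical "latest position per vehicle" table.
    CREATE TABLE vehicle_position (
        vehicle_id  INT UNSIGNED NOT NULL PRIMARY KEY,
        recorded_at DATETIME NOT NULL,
        pos         POINT NOT NULL,   -- stored as POINT(longitude latitude)
        SPATIAL INDEX (pos)
    ) ENGINE=InnoDB;   -- InnoDB supports SPATIAL indexes from 5.7; use MyISAM on older versions

    -- Nearest vehicles to customer X: pre-filter with the spatial index (a bounding
    -- rectangle), then order by exact spherical distance in metres.
    SET @cust := ST_GeomFromText('POINT(4.8952 52.3702)');
    SELECT vehicle_id,
           ST_Distance_Sphere(pos, @cust) AS metres
    FROM   vehicle_position
    WHERE  MBRContains(
               ST_GeomFromText('POLYGON((4.82 52.30, 4.97 52.30, 4.97 52.44, 4.82 52.44, 4.82 52.30))'),
               pos)
    ORDER  BY metres
    LIMIT  5;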
ORM I'd recommend skipping for this, to be honest: if you have a huge amount of data, all those class instances will kill your application. Stick with a very simple array structure for dealing with the data.
Re the massive data stream: either consume it live and only store every 10th entry, or just stick it all in the one table. It won't impact speed because of how the table is indexed, but the table's size may be worth keeping an eye on.
As an alternative coming from PHP, you could try PostGIS on PostgreSQL, but I've always favoured MySQL for ease of use, native support and all-round speed.
Good luck!
Yes, I recommend using Solr as well. The current release is 1.4, and it works incredibly well for this problem.
ORM -
You may need sfSolrPlugin with the Doctrine ORM to tie PHP to Solr; see the article from LucidWorks entitled "Building a search application in 15 person-days".
Real-time index updates -
That is coming in the next release of Solr, I believe Solr 1.5. You can get it from SVN.
Geo-spatial search -
I use the Spatial Search Plugin for Apache Solr. Geo-spatial capabilities might be included in Solr 1.5; I believe there is already some rudimentary geo-spatial support without the plugin.
On "how to handle/store a lot of points coming from the vehicles":
I'm working on a very similar project. I've solved this problem by maintaining 2 tables (using MySQL but this holds true for any other DB):
one for tracking objects (vehicles, users, whatever)
this table would have the object id as the primary key, and any update that violates the primary key constraint updates the data stored for that key. This is easily achieved with INSERT ... ON DUPLICATE KEY UPDATE.
This makes lookups extremely fast for tracking and keeps only one location record per object. I have also implemented server-side logic for deleting obsolete records (after a certain amount of time, records are deleted if no updates have been received for them).
one for history/lookup purposes
this table would have the object id and the timestamp as a composite primary key. The table can be partitioned on the timestamp column.
Any update to an object's location inserts into both tables, as sketched below.
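A rough sketch of the two tables and the double write, with hypothetical names and MySQL syntax:

    -- "Latest position" table: exactly one row per tracked object.
    CREATE TABLE object_position (
        object_id  INT UNSIGNED NOT NULL PRIMARY KEY,
        lat        DOUBLE NOT NULL,
        lng        DOUBLE NOT NULL,
        updated_at DATETIME NOT NULL
    );

    -- History table, partitioned by time for cheap pruning and range scans.
    CREATE TABLE object_position_history (
        object_id   INT UNSIGNED NOT NULL,
        recorded_at DATETIME NOT NULL,
        lat         DOUBLE NOT NULL,
        lng         DOUBLE NOT NULL,
        PRIMARY KEY (object_id, recorded_at)
    )
    PARTITION BY RANGE (TO_DAYS(recorded_at)) (
        PARTITION p_old VALUES LESS THAN (TO_DAYS('2015-01-01')),
        PARTITION p_max VALUES LESS THAN MAXVALUE
    );

    -- Every incoming point is written to both tables.
    INSERT INTO object_position (object_id, lat, lng, updated_at)
    VALUES (42, 52.3702, 4.8952, NOW())
    ON DUPLICATE KEY UPDATE lat = VALUES(lat), lng = VALUES(lng), updated_at = VALUES(updated_at);

    INSERT INTO object_position_history (object_id, recorded_at, lat, lng)
    VALUES (42, NOW(), 52.3702, 4.8952);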
I hope this helps.
Currently I have an application where I'm storing location data (lat, lng) along with other fields and whatnot. What I love about MySQL, or SQL in general, is that I can run geospatial queries easily, e.g. select all rows that fall within a given radius of a center point.
What I love about DynamoDB is that it's damn near infinitely scalable on AWS, which is the service I'll be using, and it's fast. I would love to move all my data over to DynamoDB and even insert new data there, but then I wouldn't be able to use those geospatial queries, which are the most important part of my application. They're required.
I know about the geo library for DynamoDB, but it's written in Java and my backend is written in PHP, so that's a no-go; plus they don't seem to update or maintain that library.
One solution I was thinking of was to store just the coordinates in MySQL and to store the corresponding id along with the other data (including the lat and long values) in DynamoDB.
With this I could achieve the geospatial query functionality I want while being able to scale everything well on Amazon specifically, because that's the host I'm using.
So basically I'd query all POIs within a given radius from MySQL, and with all the ids I'd then fetch the full results from DynamoDB. Sounds crazy, or what?
But the potential downside of this is having to query one data source and then immediately querying another one with the result of the first query. Maybe I'm overthinking it and underestimating how fast these technologies have become.
So to sum up my requirements:
Must be on AWS
Must be able to perform geospatial queries
Must be able to connect to DynamoDB and MySQL from PHP
Any help or suggestions would be greatly appreciated.
My instinct says: don't use two data sources unless you have a really specific case for it.
How much data do you have? Is it really true that MySQL (or Aurora) can't handle it? If your application is read-heavy, it can easily scale with read replicas.
I have a few ideas for you which may bring you at least a bit closer:
Why don't you implement your own geo library in PHP? :D
You can do a rough search in the DB where you don't filter by actual distance but by an upper and lower boundary on latitude and longitude, so you're not searching in a circle but in a square. Then it's up to you whether your application is fine with that, or whether it filters the result further; either way the exact filtering runs over a much smaller dataset and is an easy filter. A sketch of such a bounding-box query follows below.
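The sketch uses a hypothetical poi(id, lat, lng) table and made-up centre/radius values. The inner WHERE does the cheap square cut (a composite index on (lat, lng) helps here); the outer WHERE optionally applies the exact haversine circle on the much smaller candidate set:

    -- Centre (40.7128, -74.0060), radius 10 km; 1 degree of latitude is roughly 111.045 km,
    -- and a degree of longitude shrinks by cos(latitude).
    SELECT *
    FROM (
        SELECT id, lat, lng,
               6371 * 2 * ASIN(SQRT(
                   POW(SIN(RADIANS(lat - 40.7128) / 2), 2) +
                   COS(RADIANS(40.7128)) * COS(RADIANS(lat)) *
                   POW(SIN(RADIANS(lng - (-74.0060)) / 2), 2))) AS distance_km
        FROM   poi
        WHERE  lat BETWEEN 40.7128 - (10 / 111.045) AND 40.7128 + (10 / 111.045)
          AND  lng BETWEEN -74.0060 - (10 / (111.045 * COS(RADIANS(40.7128))))
                       AND -74.0060 + (10 / (111.045 * COS(RADIANS(40.7128))))
    ) AS candidates
    WHERE  distance_km <= 10        -- optional exact circle on the pre-filtered rows
    ORDER  BY distance_km;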
Maybe CloudSearch can help you out. It offers geospatial queries on lat/long fields. It works well together with DynamoDB, and it has a PHP SDK (I've never tried that though; I use Node.js).
You write the items that have lat/long fields to DynamoDB. Each item (or item update/deletion) is uploaded to CloudSearch automatically via a DynamoDB stream. So now you have "automatic copies" of your DynamoDB items in CloudSearch, and you can use all the query capabilities of CloudSearch, including geo queries (one limitation: it only queries in boxes, not in circles, so you will need some extra math).
You will need to create a DynamoDB stream that triggers a Lambda function that uploads every item to CloudSearch. You set this up once, and it will do its magic "forever".
This approach will only work if you accept a small delay between the moment you are writing in DynamoDB and the moment it is available in CloudSearch.
With this approach you still have 2 datasources, but they are entirely separated from the perspective of your app. One datasource is for querying and the other one for writing. Keeping them in sync is done automatically for you in the AWS cloud. Your app writes to DynamoDB, and queries from CloudSearch. And you have the scalability advantages that these AWS services offer.
I have an activity records table named revisions (shown in the following image), built for a big learning management system, which mainly keeps records of CRUD operations on tables (e.g. who has done what to which object at what time).
This table may contain up to 3M records. I want to build search functionality for it on the front end with PHP/Laravel.
Now my question is: what should I consider in order to build high-performance search functionality over tables with millions of records? What can be done at the code level and at the database level, and is there third-party tooling that helps with these kinds of issues?
I am experienced in building systems with PHP/Laravel, Python/Django, Ruby, etc., but I have never dealt with a case like this, with millions of records of data. So please keep my knowledge/experience level in mind; I have NO experience at this scale.
Note: the search will be an advanced search, allowing users to search with different criteria and parameters: the object that was changed, who changed it, when it was changed, etc.
Let me know if my question still isn't clear.
I would recommend taking a look at Elasticsearch (https://www.elastic.co/products/elasticsearch) and saving your activity records to its storage whenever you save to the main database. Then you can easily search any field. Elasticsearch stores schema-free JSON documents; if you prefer a more SQL-like approach, there is another search engine, Sphinx (http://sphinxsearch.com/).
There is no problem inserting a zillion rows into a table. Performance problems come when you try to do non-trivial SELECTs on the table. You mentioned "search"; you will have to limit what the 'users' can search for. But at least make a stab at what they might want to search for.
You mentioned "searching for an object", but I don't see a column called object. How many rows might there be for a given object? Do you need all the rows? Or selected ones? (An INDEX on object is likely to make the query efficient, regardless of table size.)
Third-party software sometimes gets in the way of dealing with really large tables. Beware.
I'm working with a Postgres database that I have no control over the administration of. I'm building a calendar that deals with seeing if resources (physical items) were online or offline on a specific day. Unfortunately, if they're offline I can only confirm this by finding the resource name in a text field.
I've been using
SELECT * FROM log WHERE log_text LIKE 'Resource Kit 06%'
The problem is that building the calendar means running LIKE 180+ times (at least 6 resources per day), which is as slow as can be. Does anybody know of a way to speed this up (keep in mind I can't modify the database)? Also, if there's nothing I can do on the database end, is there anything I can do on the PHP end?
I think some form of cache will be required for this. As you cannot change anything in the database, your only option is to pull data from it and store it in a more accessible and faster form. How well this works depends heavily on how frequently data is inserted into the table: if there are more inserts than selects, it will probably not help much; otherwise there is a decent chance of improved performance.
Maybe you can consider using the Lucene search engine, which is capable of full-text indexing. There is an implementation from Zend, and Apache also provides it as an HTTP service (Solr). I haven't had the opportunity to test it, however.
If you don't use something that robust, you can write your own caching mechanism in PHP. It will not be as fast as Postgres, but probably faster than unindexed LIKE queries. If your queries need to be more sophisticated (conditions, grouping, ordering...), you can use an SQLite database, which is file-based and doesn't need an extra service running on the server. A sketch of such a cache table follows below.
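The sketch uses hypothetical names, and the same DDL works in both SQLite and MySQL. The PHP side would fill it periodically from the Postgres log, extracting the resource name from log_text once so it can be indexed:

    CREATE TABLE log_cache (
        log_id        INTEGER PRIMARY KEY,   -- id of the row in the remote log table
        resource_name VARCHAR(255) NOT NULL, -- extracted once from log_text by the importer
        log_date      DATE NOT NULL,
        log_text      TEXT
    );
    CREATE INDEX idx_resource_day ON log_cache (resource_name, log_date);

    -- The calendar then does cheap indexed lookups (or one IN query) instead of 180 LIKE scans:
    SELECT log_date, resource_name
    FROM   log_cache
    WHERE  resource_name IN ('Resource Kit 06', 'Resource Kit 07')
      AND  log_date BETWEEN '2024-05-01' AND '2024-05-31';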
Another way could be using triggers in the database which, on insert, store the required information in some other, better-indexed table. But without rights to administer the database, that is probably a dead end.
Please be more specific with your question, if you want more specific information.
I am developing a project at work for which I need to create and maintain Summary Tables for performance reasons. I believe the correct term for this is Materialized Views.
I have 2 main reasons to do this:
Denormalization
I normalized the tables as much as possible, so there are situations where I would have to join many tables to pull data. We work with MySQL Cluster, which has pretty poor performance when it comes to JOINs.
So I need to create denormalized tables that allow faster SELECTs.
Summarize Data
For example, I have a Transactions table with a few million records. The transactions come from different websites. The application needs to generate a report that will display the daily or monthly transaction counts and total revenue amounts per website. I don't want the report script to calculate this every time, so I need to generate a summary table that has a breakdown by [site, date].
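As an illustration (table and column names are invented), such a summary table and an incremental refresh of it could look like this:

    CREATE TABLE transaction_summary (
        site_id   INT UNSIGNED NOT NULL,
        txn_date  DATE NOT NULL,
        txn_count INT UNSIGNED NOT NULL,
        revenue   DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (site_id, txn_date)
    );

    -- Fold in yesterday's transactions; recomputing the whole day keeps the job re-runnable.
    INSERT INTO transaction_summary (site_id, txn_date, txn_count, revenue)
    SELECT site_id, DATE(created_at), COUNT(*), SUM(amount)
    FROM   transactions
    WHERE  created_at >= CURDATE() - INTERVAL 1 DAY
      AND  created_at <  CURDATE()
    GROUP  BY site_id, DATE(created_at)
    ON DUPLICATE KEY UPDATE
        txn_count = VALUES(txn_count),
        revenue   = VALUES(revenue);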
That is just one simple example. There are many different kinds of summary tables I need to generate and maintain.
In the past I have done these things by writing several cron scripts to keep each summary table updated. But in this new project, I am hoping to implement a more elegant and proper solution.
I would prefer a PHP based solution, as I am not a server administrator, and I feel the most comfortable when I can control everything through my application code.
Solutions that I have considered:
Copying VIEWs
If the resulting table can be represented as a single SELECT query, I can create a VIEW. Since views are slow, a cronjob could copy the VIEW into a real table.
However, some of these SELECT queries can be so slow that it's not acceptable even for cronjobs. It is not very efficient to recreate the whole set of summary data if the older rows are hardly ever updated.
Custom Cronjobs for each Summary Table
This is the solution I have used before, but now I am trying to avoid it if possible. If there will be many summary tables, it can be messy to maintain.
MySQL Triggers
It is possible to add triggers to the main tables so that every time there is an INSERT, UPDATE or DELETE, the summary tables get updated accordingly.
There would be no cronjobs and the summaries would be updated in real time. However, if there is ever a need to rebuild a summary table from scratch, it would have to be done with another solution (probably #1 above).
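A minimal sketch of what such a trigger could look like, reusing the hypothetical transactions/transaction_summary example from above:

    DELIMITER //
    CREATE TRIGGER trg_transactions_ai
    AFTER INSERT ON transactions
    FOR EACH ROW
    BEGIN
        INSERT INTO transaction_summary (site_id, txn_date, txn_count, revenue)
        VALUES (NEW.site_id, DATE(NEW.created_at), 1, NEW.amount)
        ON DUPLICATE KEY UPDATE
            txn_count = txn_count + 1,
            revenue   = revenue + NEW.amount;
    END//
    DELIMITER ;
    -- Matching AFTER UPDATE / AFTER DELETE triggers would subtract the old values
    -- and add the new ones.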
Using ORM Hooks/Triggers
I am using Doctrine as my ORM. There is a way to add event listeners that fire on INSERT/UPDATE/DELETE and can in turn update the summary tables. In a sense this solution is similar to #3 above, but I will have better control over these "triggers", since they will be implemented in PHP.
Implementation Considerations:
Complete Rebuilds
I want to avoid having to rebuild the summary tables, for efficiency, and only update for new data. But in case something goes wrong, I need the capability to rebuild the summary table from scratch using existing data on the main tables.
Ignoring UPDATE/DELETE on Old Data
Some summaries can assume that older records will never be updated or deleted, but only new records will be inserted. The summary process can save a lot of work by making the assumption that it doesn't need to check for updates on older data.
But of course this won't apply to all tables.
Keeping a Log
Let's assume that I won't have access to, or do not want to use the binary MySQL logs.
For summarizing new data, the summary process just needs to remember the last primary key id of the most recent record it summarized. Next time it runs, it can summarize everything after that id. However, to keep track of older records that have been updated/deleted, it needs another log so it can go back and re-summarize that data.
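A sketch of that bookkeeping, again with invented names: a small watermark table remembers the last id processed, and each run only aggregates rows beyond it.

    CREATE TABLE summary_watermark (
        summary_name VARCHAR(64) PRIMARY KEY,
        last_id      BIGINT UNSIGNED NOT NULL DEFAULT 0
    );
    INSERT INTO summary_watermark VALUES ('transactions_by_site', 0);

    -- Grab the current high-water mark first so rows inserted mid-run aren't half-counted.
    SELECT MAX(id) INTO @new_last FROM transactions;

    INSERT INTO transaction_summary (site_id, txn_date, txn_count, revenue)
    SELECT t.site_id, DATE(t.created_at), COUNT(*), SUM(t.amount)
    FROM   transactions t
    WHERE  t.id >  (SELECT last_id FROM summary_watermark WHERE summary_name = 'transactions_by_site')
      AND  t.id <= @new_last
    GROUP  BY t.site_id, DATE(t.created_at)
    ON DUPLICATE KEY UPDATE
        txn_count = txn_count + VALUES(txn_count),   -- add the partial batch to existing totals
        revenue   = revenue   + VALUES(revenue);

    UPDATE summary_watermark
    SET    last_id = @new_last
    WHERE  summary_name = 'transactions_by_site';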
I would appreciate any kind of strategies, suggestions or links that can help. Thank you!
As noted above, materialized views in Oracle are different from indexed views in SQL Server. They are very cool and useful. See http://download.oracle.com/docs/cd/B10500_01/server.920/a96567/repmview.htm for details.
MySQL does not have support for these, however.
One thing you mention several times is poor performance. Have you checked your database design for proper indexing and run EXPLAIN plans on the queries to see why they are slow? See http://dev.mysql.com/doc/refman/5.1/en/using-explain.html. This of course assumes that your server is tuned properly and that MySQL itself is set up and tuned (buffer caches, etc.).
To your direct question: what you describe is something we often do in a data warehouse situation. We have a production database and a DW that pulls in all sorts of information, aggregates it and pre-calculates it to speed up querying. This may be overkill for you, but you can decide.

Depending on the latency you define for your reports (i.e. how often you need them), we normally run an ETL (extract, transform, load) process periodically (daily, weekly, etc.) to populate the DW from the production system. This keeps the impact on the production system low and moves all reporting to another set of servers, which also lessens the load.

On the DW side, I would normally design my schemas differently, i.e. using star schemas (http://www.orafaq.com/node/2286). Star schemas have fact tables (things you want to measure) and dimensions (things you want to aggregate the measures by: time, geography, product categories, etc.). SQL Server also includes an additional engine called SQL Server Analysis Services (SSAS) that looks at fact tables and dimensions, pre-calculates, and builds OLAP data cubes. In these data cubes you can drill down, look at all kinds of patterns, and do data analysis and data mining. Oracle does things slightly differently, but the outcome is the same.
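For illustration, a tiny star schema along those lines might look like this (all names are invented):

    CREATE TABLE dim_date (
        date_key  INT PRIMARY KEY,        -- e.g. 20240131
        full_date DATE NOT NULL,
        month     TINYINT NOT NULL,
        year      SMALLINT NOT NULL
    );
    CREATE TABLE dim_site (
        site_key  INT PRIMARY KEY,
        site_name VARCHAR(100) NOT NULL
    );
    CREATE TABLE fact_transactions (
        date_key  INT NOT NULL,
        site_key  INT NOT NULL,
        txn_count INT NOT NULL,
        revenue   DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (date_key, site_key)
    );

    -- Monthly revenue per site is then a simple roll-up:
    SELECT d.year, d.month, s.site_name, SUM(f.revenue) AS revenue
    FROM   fact_transactions f
    JOIN   dim_date d ON d.date_key = f.date_key
    JOIN   dim_site s ON s.site_key = f.site_key
    GROUP  BY d.year, d.month, s.site_name;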
Whether you want to go the above route really depends on the business need and how much value you get from data analysis. As I said, it is likely overkill if you just have a few summary tables, but some of the concepts may be helpful as you think things through. If your business is moving toward a business intelligence solution, then this is something to consider.
PS: You can actually set a DW up to work in "real time" using something called ROLAP, if that is the business need. MicroStrategy has a good product that works well for this.
PPS: You may also want to look at PowerPivot from MS (http://www.powerpivot.com/learn.aspx). I have only played with it, so I cannot tell you how it works on very large datasets.
Flexviews (http://flexvie.ws) is an open source PHP/MySQL based project. Flexviews adds incrementally refreshable materialized views (like the materialized views in Oracle) to MySQL, using PHP and stored procedures.
It includes FlexCDC, a PHP based change data capture utility which reads binary logs, and the Flexviews MySQL stored procedures which are used to define and maintain the views.
Flexviews supports joins (inner join only) and aggregation, so it can be used to create summary tables. Moreover, you can use Flexviews in combination with Mondrian (a ROLAP server) and its aggregation designer to create summary tables that the ROLAP tool can use automatically.
If you don't have access to the logs (it can read them remotely, by the way, so you don't need server access, but you do need SUPER privileges), then you can use 'COMPLETE' refresh with Flexviews. This automates creating a new table with CREATE TABLE ... AS SELECT under a new table name. It then uses RENAME TABLE to swap the new table in for the old one, renaming the old one with an _old suffix, and finally drops the old table. The advantage here is that the SQL to create the view is stored in the database (flexviews.mview) and can be refreshed with a simple API call which automates the swapping process.
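Done by hand, the swap pattern that Flexviews automates looks roughly like this (the summary table and its defining query are just examples):

    CREATE TABLE summary_new AS
        SELECT site_id, DATE(created_at) AS txn_date,
               COUNT(*) AS txn_count, SUM(amount) AS revenue
        FROM   transactions
        GROUP  BY site_id, DATE(created_at);

    -- Atomic swap: readers see either the old or the new table, never a half-built one.
    RENAME TABLE summary TO summary_old, summary_new TO summary;
    DROP TABLE summary_old;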
I am building a web app, and I am thinking about how I should build the database.
The app will be fed with keywords; it will then retrieve info for those keywords and save it into the database with a datestamp. The info will come from different sources, like the number of results from Yahoo, Diggs from the last month that contain that keyword, etc.
So I was thinking a simple way to do it would be to have a table with an id and a keyword column where the keywords are stored, and another table for ALL the data, with an id (the same as the keyword id), datestamp, data_name and data_content.
Is this a good way to use MySQL, or could this somehow make queries slower? Should I build a table for each type of data I want to use? I am mostly looking for good application performance.
Another reason I would like to use only one table for the data is that I can easily add more data_name(s) without touching the DB.
In the second table, i.e. the table which contains the various information about the keywords, the id column can be used as a foreign key referencing the id column of the first table, as sketched below.
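A sketch of that layout (the column sizes and the sample data_name values are assumptions):

    CREATE TABLE keyword (
        id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        keyword VARCHAR(255) NOT NULL UNIQUE
    ) ENGINE=InnoDB;

    CREATE TABLE keyword_data (
        keyword_id   INT UNSIGNED NOT NULL,
        datestamp    DATETIME NOT NULL,
        data_name    VARCHAR(64) NOT NULL,   -- e.g. 'yahoo_results', 'digg_mentions'
        data_content TEXT NOT NULL,
        PRIMARY KEY (keyword_id, data_name, datestamp),
        FOREIGN KEY (keyword_id) REFERENCES keyword (id)
    ) ENGINE=InnoDB;

    -- Typical query: the history of one metric for one keyword.
    SELECT datestamp, data_content
    FROM   keyword_data
    WHERE  keyword_id = 1 AND data_name = 'yahoo_results'
    ORDER  BY datestamp;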
"Is this a good way to use MySQL, or could this somehow make queries slower? Should I build a table for each type of data I want to use? I am mostly looking for good application performance."
A lot of big MySQL players (Flickr, for example) use MySQL as a simple KV (key-value) store.
Furthermore, if you are concerned about performance you should cache your data in memcached/Redis (there is nothing that can beat memory).
I have another recommendation: index your content. If you plan on storing content and being able to search it using keywords, then use MySQL to store the details of your documents, like author, text and other info, and use Lucene to create an index. Lucene was originally written for Java, but it has ports to many languages, and PHP is no exception.
You can use the Zend Framework to manage Lucene indexes with very little effort; browse through the documentation or look for a tutorial online. The point of this recommendation is simple:
- You'll improve your search time drastically
- Your keyword matching will be more flexible; Lucene gives you much more powerful search
I hope I can help!
Best of luck!