Currently I have an application where I'm storing location data (lat, lng) along with other fields and whatnot. What I love about MySQL, or SQL in general, is that I get geospatial queries easily, e.g. select all rows that fall within a given radius of a center point.
What I love about DynamoDB is that it's damn near infinitely scalable on AWS, which is the service I'll be using, and fast. I would love to move all my data over to DynamoDB and insert new data there too, but then I wouldn't be able to use those geospatial queries, which are the most important part of my application. They're required.
I know about the geo library for DynamoDB, but it's written in Java and my backend is written in PHP, so that's a no-go; plus they don't seem to update or maintain that library.
One solution I was thinking of was to store just the coordinates in MySQL and store the corresponding ID along with the other data (including the lat and lng values) in DynamoDB.
With this I could achieve the geospatial query functionality I want while still being able to scale everything well on Amazon specifically, because that's the host I'm using.
So basically I'd query all POIs within a given radius from MySQL, and with those IDs I'd fetch the full records from DynamoDB. Sounds crazy or what?
The potential downside of this is having to query one data source and then immediately querying another one with the result of the first. Maybe I'm overthinking it and underestimating how fast these technologies have become.
So to sum up my requirements:
Must be on AWS
Must be able to perform geospatial queries
Must be able to connect to dynamodb and MySQL in PHP
Any help or suggestions would be greatly appreciated.
My instinct says: don't use two data sources unless you have a really specific case.
How much data do you have? Can MySQL (or Aurora) really not handle it? If your application is read-heavy, it can easily scale with read replicas.
I have a few ideas for you which may bring you at least a bit closer:
Why don't you implement your own geo-library in php? :D
You can do a dummy search in the DB where you don't filter by the actual distance, but by an upper and lower boundary on lat and long (so you're not searching in a circle, but in a square). Then it's up to you whether your application is fine with that or filters the result further by exact distance, but that's a much smaller dataset and an easy filter; a rough sketch of such a query is below.
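For illustration, a minimal bounding-box sketch in MySQL with PDO-style placeholders (the poi table and column names are made up):

SELECT id, name, lat, lng
FROM poi
WHERE lat BETWEEN :lat_min AND :lat_max
  AND lng BETWEEN :lng_min AND :lng_max;
-- compute the four bounds in PHP: lat ± radius_km/111, lng ± radius_km/(111*cos(lat));
-- post-filter by exact distance if you need a true circle rather than a square.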
Maybe CloudSearch can help you out. It offers geospatial queries on lat/long fields. It works well together with DynamoDB, and it has a PHP SDK (never tried that though, I use Node.js).
You write the items that have lat/long fields to DynamoDB. Each item (or item update/deletion) is uploaded to CloudSearch automatically via a DynamoDB stream. So now you have "automatic copies" of your DynamoDB items in CloudSearch, and you can use all the query capabilities of CloudSearch, including geo queries (one limitation: it only queries in boxes, not circles, so you'll need some extra math).
You will need to create a DynamoDB stream that triggers a Lambda function that uploads every item to CloudSearch. You set this up once, and it will do its magic "forever".
This approach will only work if you accept a small delay between the moment you are writing in DynamoDB and the moment it is available in CloudSearch.
With this approach you still have 2 datasources, but they are entirely separated from the perspective of your app. One datasource is for querying and the other one for writing. Keeping them in sync is done automatically for you in the AWS cloud. Your app writes to DynamoDB, and queries from CloudSearch. And you have the scalability advantages that these AWS services offer.
My stack is PHP and MySQL.
I am trying to design a page to display details of a mutual fund.
Data for a single fund is distributed over 15-20 different tables.
Currently, my front end is a brute-force PHP page that queries/joins these tables using 8 different queries for a single scheme. It's messy and performs poorly.
I am considering alternatives. Good thing is that the data changes only once a day, so I can do some preprocessing.
An option I am considering is to run these queries for every fund (about 2000 funds), create a complex JSON object for each of them, store it in MySQL indexed by the fund code, retrieve the JSON at run time, and show the data. I am thinking of using the simple JSON_OBJECT() MySQL function to create the JSON, and json_decode() in PHP to get the values for display. Is this a good approach?
I was tempted to store them in a separate MongoDB store - would that be overkill for this?
Any other suggestion?
Thanks much!
To meet your objective of quick pageviews, your overnight-run approach is very good. You could generate JSON objects with your distilled data, or even prerendered HTML pages, and store them.
You can certainly store JSON objects in MySQL columns. If you don't need the database server to search the objects, simply use TEXT (or LONGTEXT) data types to store them.
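A minimal sketch of that approach, with made-up names (fund_cache, payload) and MySQL 5.7+ for JSON_OBJECT():

CREATE TABLE fund_cache (
  fund_code VARCHAR(20) NOT NULL PRIMARY KEY,
  payload   LONGTEXT NOT NULL,  -- the prebuilt JSON blob
  built_at  DATETIME NOT NULL
);

-- overnight run: build and store the JSON for each fund (values here are illustrative)
INSERT INTO fund_cache (fund_code, payload, built_at)
VALUES ('FUND123', JSON_OBJECT('name', 'Name1', 'nav', 43.21), NOW())
ON DUPLICATE KEY UPDATE payload = VALUES(payload), built_at = VALUES(built_at);

-- page view: a single primary-key lookup, then json_decode() in PHP
SELECT payload FROM fund_cache WHERE fund_code = 'FUND123';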
To my way of thinking, adding a new type of server (MongoDB) to your operations just to store a few thousand JSON objects does not seem worth the trouble. If you find it necessary to search the contents of your JSON objects, however, another type of server might be useful.
Other things to consider:
Optimize your SQL queries. Read up: https://use-the-index-luke.com and other sources of good info. Consider your queries one by one, starting with the slowest. Use the EXPLAIN or even the EXPLAIN ANALYZE command to get your MySQL server to tell you how it plans each query, and judiciously add indexes. Using the query-optimization tag here on StackOverflow, you can get help. Many queries can be optimized by adding indexes to MySQL without changing anything in your PHP code or your data, so this can be an ongoing project rather than a big new software release.
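For example, a hypothetical query and index just to show the workflow (the table and column names are invented):

-- ask MySQL how it plans the query
EXPLAIN
SELECT f.fund_code, h.holding_name, h.weight
FROM fund f
JOIN holding h ON h.fund_id = f.id
WHERE f.fund_code = 'FUND123';

-- if the plan shows a full table scan on holding, an index on the join column usually helps
ALTER TABLE holding ADD INDEX idx_holding_fund (fund_id);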
Consider measuring your query times. You can do this with MySQL's slow query log. The point of this is to identify your "dirty dozen" slowest queries in a particular time period. Then, see step one.
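Enabling the log is a couple of server settings (the threshold and file path below are just examples):

SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;  -- log anything slower than 1 second
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';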
Make your pages fill up progressively, to keep your users busy reading while you fetch the data they need. Put the top-level stuff (fund name, etc.) in server-side HTML so search engines can see it. Use some sort of front-end tech (React, maybe, or DataTables that fetch data via AJAX) to render the rest of the page client-side, and provide REST endpoints on your server that return the data, in JSON format, for each data block in the page.
In your overnight run create a sitemap file along with your JSON data rows. That lets you control exactly how you want search engines to present your data.
I am about to rebuild my web application to use Elasticsearch instead of MySQL for searching purposes, but I am unsure exactly how to do so.
I watched a Laracon video on it, since my application is built on Laravel 4.2, and I will be using this wrapper to query: https://github.com/elasticsearch/elasticsearch
However, am I still going to use the MySQL database to house the data and have ES search it? Or is it better to have ES house and query the data?
If I go the first route, do I have to do CRUD operations on both sides to keep them updated?
Can ES handle the data load that MySQL can? Meaning hundreds of millions of rows?
I'm just very skittish about starting the whole thing. I could use a little direction; it would be greatly appreciated. I have never worked with any search other than MySQL.
I would recommend keeping MySQL as the system of record and doing all CRUD operations from your application against MySQL. Then start an Elasticsearch machine and periodically move data from MySQL to Elasticsearch (only the data you need to search against).
Then if ElasticSearch goes down, you only lose the search feature - your primary data store is still ok.
ElasticSearch can be configured as a cluster and can scale very large, so it'll handle the number of rows.
To get data into Elastic, you can do a number of things:
Do an initial import (very slow, very big) and then just copy diffs with a process. You might consider something like Mule ESB to move data (http://www.mulesoft.org/).
When you write data from your app, you can write once to MySQL and also write the same data to Elastic. This provides real time data in Elastic, but of course if the second write to Elastic fails, then you'll be missing the data.
I'm working with a Postgres database that I have no control over the administration of. I'm building a calendar that deals with seeing if resources (physical items) were online or offline on a specific day. Unfortunately, if they're offline I can only confirm this by finding the resource name in a text field.
I've been using
select * from log WHERE log_text LIKE 'Resource Kit 06%'
The problem is that when we're building a calendar, using LIKE 180+ times (at least 6 resources per day) is as slow as can be. Does anybody know of a way to speed this up (keep in mind I can't modify the database)? Also, if there's nothing I can do on the database end, is there anything I can do on the PHP end?
I think some form of cache will be required for this. As you cannot change anything in the database, your only option is to pull data from it and store it in some more accessible and faster form. How well this works depends heavily on how frequently data is inserted into the table: if there are more inserts than selects, it probably won't help much; otherwise there is a decent chance of improved performance.
Maybe you can consider using the Lucene search engine, which is capable of full-text indexing. There is an implementation from Zend, and Apache even offers an HTTP service for it. I haven't had the opportunity to test it, however.
If you don't want something that heavyweight, you can write your own caching mechanism in PHP. It will not be as fast as Postgres, but probably faster than unindexed LIKE queries. If your queries need to be more sophisticated (conditions, grouping, ordering...), you can use an SQLite database, which is file-based and doesn't need an extra service running on the server.
Another way could be using triggers in the database, which could, on insert, copy the required information into some other, better-indexed table. But without rights to administer the database, it is probably a dead end.
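For completeness, roughly what that trigger idea would look like in Postgres (a sketch only: the summary table, the trigger names, and the assumption that log has an id column are all made up; log and log_text come from the question):

-- a side table with a proper index on the resource name
CREATE TABLE resource_log_index (
    resource_name text NOT NULL,
    log_id        integer NOT NULL
);
CREATE INDEX idx_resource_name ON resource_log_index (resource_name);

-- on every insert into log, pull out the resource name and store it in indexed form
CREATE FUNCTION log_to_index() RETURNS trigger AS $$
BEGIN
    INSERT INTO resource_log_index (resource_name, log_id)
    VALUES (substring(NEW.log_text from '^Resource Kit [0-9]+'), NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_log_to_index
AFTER INSERT ON log
FOR EACH ROW EXECUTE PROCEDURE log_to_index();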
Please be more specific with your question, if you want more specific information.
I have read many blogs and articles about the pros and cons of Amazon EC2 versus Microsoft Azure (and Google's App Engine). However, I am trying to decide which would better suit my particular case.
I have a data set - which can be thought of as a standard table of the format:
[id] [name] [d0] [d1] [d2] .. [d63]
---------------------------------------
0 Name1 0.43 -0.22 0.11 -0.81
1 Name2 0.23 0.65 0.62 0.41
2 Name3 -0.13 -0.23 0.17 0.00
...
N NameN 0.43 -0.23 0.12 0.01
I ultimately want to do something that (whatever stack I finally choose) would equate to an SQL SELECT statement similar to:
SELECT name FROM [table] WHERE (d0*QueryParameter0) + (d1*QueryParameter1) + (d2*QueryParameter2) + ... + (dN*QueryParameterN) < 0.5
where QueryParameter0, 1, ..., N are parameters supplied at runtime and change each time the query is run (so caching is out of the question).
My main concern is with the speed of the query, so I would like advice on which cloud stack option would provide the fastest query result possible.
I can do this a number of ways:
(1) Use SQL Azure, with the query just as it is written above. I have tried this method, and the queries can be quite slow as expected, since SQL Azure only gives you a single instance. I can spin up multiple SQL instances and shard the data, but that gets real expensive real quick.
(2) Use Azure Storage Tables. Bloggers claim storage tables are faster in general, but would this still be the case for my query requirements?
(3) Use EC2 and spin up several instances with MySQL, possibly incorporating sharding to new instances (cost increases though).
(4) Use EC2 with MongoDB, as I've read it is faster than MySQL. Again this is probably dependent on the type of query.
(5) Google AppEngine. I'm not really sure how GAE would work with this query structure, but I guess that's why I am looking for opinions.
I'd like to find the best stack combination to optimize my specific need (outlined by the pseudo SQL query above).
Does anyone have any experience in this? Which stack option would result in the fastest query containing many math operators in the WHERE clause?
Cheers,
Brett
Your type of query with dynamic coefficients (weights) will require the entire table to be scanned on every query. A SQL database engine is not going to help you here, because there is really nothing that the query optimizer can do.
In other words, what you need is NOT a SQL database, but really a "NoSQL" database which really optimizes table/row access to the fastest speed possible. So you really shouldn't have to try SQL Azure and MySQL to find out this part of the answer.
Also, each row in your type of query is completely independent from each other, so it lends itself to simple parallelism. Your choice of platform should be whichever gives you:
Table/row scan at the fastest speed
Ability to highly parallelize your operation
Each platform you mentioned gives you ability to store huge amounts of blob or table-like data for very fast scan retrieval (e.g. table storage in Azure). Each also gives you the ability to "spin up" multiple instances to process them in parallel. It really depends on which programming environment you're most comfortable in (e.g. Java in Google/Amazon, .NET in Azure). In essence they all do the same thing.
My personal recommendation is Azure, since you can:
Store massive amounts of data in "table storage", optimized for fast scan retrieval, and partitioned (e.g. over d0 ranges) for optimal parallelism
Dynamically "spin up" as many compute instances as you like to process the data in parallel
Use queueing mechanisms to synchronize the collation of results
Azure does what you require in a very "no-frills" way - providing just enough infrastructure for you to do your job, and nothing more.
The problem is not the math operators or the number thereof; the problem is that they are parameterized - you are effectively doing a weighted average across the columns with the weights defined at run time, so the operation must be computed and cannot be inferred.
Even in SQL Server this operation can be parallelized (and this should show up in the execution plan), but it is not amenable to search optimization using indexes, which is where most relational databases really shine. With static weights, an indexed computed column would obviously perform very quickly.
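For contrast, here is roughly what the static-weight case could look like, using MySQL 5.7+ generated-column syntax (the table name and the weights are invented):

ALTER TABLE measurements
  ADD COLUMN score DOUBLE AS (0.4*d0 + 0.2*d1 + 0.1*d2) STORED,
  ADD INDEX idx_score (score);

-- with fixed weights the filter becomes an indexed range scan instead of a full table scan
SELECT name FROM measurements WHERE score < 0.5;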
Because this problem is easily parallelized, you might want to look at something based on a Map-Reduce principle.
Currently neither SQL Azure nor Amazon RDS can scale horizontally (EC2 can at least scale vertically), but if, and only if, your data can be partitioned in a way that still makes it possible to execute your query, the upcoming SQL Federations feature of SQL Azure might be worth looking at and could help you make an informed decision.
MongoDB (which I like a lot) is more geared toward document-oriented workloads and is possibly not the best solution for this type of job, although your mileage may vary (it's blazingly fast as long as most of your working set fits into memory).
Assuming that QueryParameter0, QueryParameter1, ..., QueryParameterN are all supplied at runtime and are different each time, I don't think any of the platforms will be able to provide significant advantages over the others, since none of them will be able to take advantage of any pre-computed indices.
With indices removed, the only other factor for speed comes from the processing power available - you already know about this for the SQL Azure option, and for the other options it pretty much comes down to you deciding what processing to apply: it's up to you to fetch all the data and then process it.
One option you might consider is whether you could host this data yourself on an instance (e.g. using an Azure blob or cloud drive) and then process it in a custom-built worker role. This isn't something I'd think about for general data storage, but if it's just this one table and this one query, then it would be pretty easy to hand-craft a quick solution.
Update - just seen the answer from #Cade too - +1 for his suggestion of parallelization.
I'm building an application where vehicle coordinates are logged by GPS. I want to implement a couple of features to start with, such as:
realtime tracking of vehicles
history tracking of vehicles
keeping locations and areas for customer records
I need some guidelines on where to start with the database and application design. Anything from best practices to hints to experience would really help me get on the right track.
How would one tackle ORM for geometry? For example: a location would map to a SpatialPoint class, whereas an area would map to a SpatialPolygon class.
How do I keep the massive data stream coming from the vehicles sane? I'm thinking of a table to keep the latest points in (for realtime data) and batch-parsing that data into polylines in a separate table for history purposes (one line per employee shift on a vehicle).
MySQL is probably not the best choice for this, but I'm planning on using Solr as the index for quick location-based searches. We do, however, need to do some realtime distance calculations, like which vehicle is nearest to customer X. Any thoughts?
I can help you on one bit: MySQL definitely is the best choice. I've been down the same path as you many times, and the MySQL spatial extension is fantastic; in fact it's awesomely fast even over tables with 5 million+ rows of spatial data - it's all in the index. The spatial extension is one of the best-kept MySQL secrets that few use ;)
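For reference, a minimal sketch of the spatial extension in action (MySQL 5.7+ syntax; the table, column names and coordinates are made up):

CREATE TABLE vehicle_position (
  vehicle_id  INT NOT NULL,
  recorded_at DATETIME NOT NULL,
  pt          POINT NOT NULL,
  SPATIAL INDEX (pt)
) ENGINE=InnoDB;

-- store a GPS fix as a point
INSERT INTO vehicle_position (vehicle_id, recorded_at, pt)
VALUES (42, NOW(), ST_GeomFromText('POINT(4.8952 52.3702)'));

-- bounding-box lookup that uses the spatial index
SELECT vehicle_id, ST_AsText(pt)
FROM vehicle_position
WHERE MBRContains(
  ST_GeomFromText('POLYGON((4.88 52.36, 4.91 52.36, 4.91 52.38, 4.88 52.38, 4.88 52.36))'),
  pt
);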
As for ORM, I'd recommend skipping it for this, to be honest - if you have a huge amount of data, all those class instances will kill your application; stick with a very simple array structure for dealing with the data.
Regarding the massive data stream: either consume it live and only store every 10th entry, or just stick it all in the one table - it won't impact speed because of how the table is indexed, although the table size may be worth keeping an eye on.
As an alternative coming from PHP, you could try PostGIS on PostgreSQL, but I've always favoured MySQL for ease of use, native support and all-round speed.
Good luck!
Yes, I recommend the use of Solr as well. The current release is 1.4. It works incredibly well for this problem.
ORM -
You may need sfSolrPlugin with the Doctrine ORM to tie PHP to Solr; see the article from LucidWorks entitled "Building a search application in 15 person-days".
Real-time index updates -
That is coming in the next release of Solr, I believe Solr 1.5. You can get it from SVN.
Geo-spatial search -
I use the Spatial Search Plugin for Apache Solr. Geospatial capabilities might be included in Solr 1.5; I believe there is already some rudimentary support for geospatial search without the plugin.
On "how to handle/store a lot of points coming from the vehicles":
I'm working on a very similar project. I've solved this problem by maintaining 2 tables (using MySQL but this holds true for any other DB):
one for tracking objects (vehicles, users, whatever)
this table would have the object ID as the primary key, and any insert that violates the primary key constraint would instead update the data stored for that key. This can be easily achieved with "ON DUPLICATE KEY UPDATE".
This makes the lookup extremely fast for tracking and keeps only one instance of location data per object. I have also implemented server-side logic for deleting obsolete records (after a certain amount of time this data needs to be deleted if no updates have been received for it).
one for history/lookup purposes
this table would have the object ID and the timestamp as a composite primary key. The table can be partitioned on the timestamp column.
Any update to an object's location inserts into both tables. A minimal sketch of both tables is shown below.
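A rough sketch of the two tables and the dual write (the names current_position and position_history, the lat/lng columns, and the partition boundaries are all made up):

-- latest known position, exactly one row per object
CREATE TABLE current_position (
  object_id  INT NOT NULL PRIMARY KEY,
  lat        DOUBLE NOT NULL,
  lng        DOUBLE NOT NULL,
  updated_at DATETIME NOT NULL
);

-- full history, composite primary key, partitioned on the timestamp
CREATE TABLE position_history (
  object_id   INT NOT NULL,
  recorded_at DATETIME NOT NULL,
  lat         DOUBLE NOT NULL,
  lng         DOUBLE NOT NULL,
  PRIMARY KEY (object_id, recorded_at)
)
PARTITION BY RANGE (TO_DAYS(recorded_at)) (
  PARTITION p2023 VALUES LESS THAN (TO_DAYS('2024-01-01')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- every incoming GPS fix writes to both tables
INSERT INTO current_position (object_id, lat, lng, updated_at)
VALUES (42, 52.3702, 4.8952, NOW())
ON DUPLICATE KEY UPDATE lat = VALUES(lat), lng = VALUES(lng), updated_at = VALUES(updated_at);

INSERT INTO position_history (object_id, recorded_at, lat, lng)
VALUES (42, NOW(), 52.3702, 4.8952);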
I hope this helps.