I need to be able to quickly find the n closest destinations for a given destination, calculate an n x n distance matrix for n destinations, and perform several other such operations related to distances between two or more destinations.
I have learned a Graph DB will give far better performance compared to a MySQL database. My application is written in PHP.
So my question is: is it possible to use a graph DB with a PHP application? If yes, which one is the best open-source option, how should this data be stored in a graph DB, and how would it be accessed?
Thanks in advance.
Neo4j is a very solid graph DB with flexible (if a bit complex) licensing. It implements the Blueprints API and should be pretty easy to use from just about any language, including PHP. It also has a REST API, which is about as flexible as it gets, and there is at least one good example of using it from PHP.
Depending on what data you have, there are a number of ways to store it.
If you have "route" data, where your points are already connected to each other via specific paths (i.e. you can't jump from one point directly to another), then you simply make each point a node and the connections you have between points in your routes are edges between nodes, with the distances as properties of those edges. This would give you a graph that looks like your classic "traveling salesman" sort of problem, and calculating distances between nodes is just a matter of doing a weighted shortest-path search such as Dijkstra's algorithm (assuming you want the shortest path).
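As a rough illustration of the shortest-path idea, here is a minimal in-memory sketch of Dijkstra's algorithm. It is written in Python rather than PHP for brevity, and the `routes` adjacency dict and node names are made up for the example; in a real graph DB the nodes and weighted edges would live in the database and you would normally use its built-in path functions instead.

```python
import heapq


def shortest_distance(graph, source, target):
    """Dijkstra's algorithm over an adjacency dict {node: {neighbour: distance}}.

    Returns the total distance of the shortest path, or None if unreachable.
    """
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return None


# Hypothetical route data: edges carry the distance as their weight.
routes = {
    "A": {"B": 4.0, "C": 1.0},
    "B": {"A": 4.0, "C": 2.0, "D": 1.0},
    "C": {"A": 1.0, "B": 2.0, "D": 5.0},
    "D": {"B": 1.0, "C": 5.0},
}

print(shortest_distance(routes, "A", "D"))
```

Here the direct-looking route A-C-D costs 6, while A-C-B-D costs 4, which is what the search returns.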
If you can jump from place to place with your data set, then you have a fully connected graph. Obviously this is a lot of data, and it grows quadratically as you add more destinations, but a graph DB is probably better at dealing with this than a relational DB is. To store the distances, as you add nodes to the graph, you also add an edge to each other existing node with the distance pre-calculated as one of its properties. Then, to retrieve the distance between a pair of nodes, you simply find the edge between them and get its distance property.
However, if you have a large number of fully-connected nodes, you would probably be better off just storing the coordinates of those nodes and calculating the distances as-needed, and optionally caching the results to speed things up.
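A minimal sketch of that "store coordinates, compute on demand, cache the result" approach, in Python for brevity. The `DESTINATIONS` table, the function names, and the use of great-circle (haversine) distance are assumptions for the example; swap in whatever distance measure your data actually needs.

```python
import math
from functools import lru_cache

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Hypothetical sample data: name -> (latitude, longitude) in degrees.
DESTINATIONS = {
    "Paris": (48.8566, 2.3522),
    "London": (51.5074, -0.1278),
    "Berlin": (52.5200, 13.4050),
    "Madrid": (40.4168, -3.7038),
}


@lru_cache(maxsize=None)
def distance_km(a, b):
    """Great-circle distance between two named destinations, cached per pair."""
    lat1, lon1 = map(math.radians, DESTINATIONS[a])
    lat2, lon2 = map(math.radians, DESTINATIONS[b])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest(name, n):
    """The n destinations closest to `name`, nearest first."""
    others = (d for d in DESTINATIONS if d != name)
    return sorted(others, key=lambda d: distance_km(name, d))[:n]


def distance_matrix(names):
    """n x n matrix of pairwise distances for the given destination names."""
    return [[distance_km(a, b) for b in names] for a in names]


print(nearest("Paris", 2))
```

Nothing is stored per pair up front, so adding a destination costs one row of coordinates rather than one edge to every existing node. The scan in `nearest` is linear in the number of destinations, which is fine for small sets but would need a spatial index for large ones.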
Lastly, if you use the Blueprints API and the other tools in that stack, like Gremlin and Rexster, you should be able to swap in/out any compatible graph database, which lets you play around with different implementations that may meet your needs better, like using Titan on top of a Cassandra / Hadoop cluster.
Yes, for this kind of work a graph database should give you more performance than an extension for MySQL or Postgres will be able to. One that looks really slick is OrientDB, and there's a beta implementation in PHP using the binary protocol and another one that uses HTTP as the transport layer.
As for the example code, Alessandro (from odino.org) wrote an implementation of Dijkstra's algorithm along with a full explanation of how to use it with OrientDB to find the minimum distance between cities.
Actually it's not so much about the database as about the indexes. I've used MongoDB's geospatial indexing and search (MongoDB is a document DB), which is designed for finding multiple nearest elements to given coordinates, with good results. Still, it only runs simple queries (find nearest) and it gets a bit slow if your index doesn't fit in RAM. I used the GeoNames DB with 8 million places with coordinates and got 0.005-2.5 s per query on a VM, probably because of (1) HDD overhead and (2) the index not fitting in RAM.
So currently I have an application where I'm storing location data (lat, lng) along with other fields and whatnot. What I love about MySQL, or SQL in general, is that I can do geospatial queries easily, e.g. select all rows that fall within a given radius of a center point.
What I love about DynamoDB is that it's damn near infinitely scalable on AWS, which is the service I'll be using, and fast. I would love to move all my data over to DynamoDB and even insert new data there, but I wouldn't be able to use those geospatial queries, which are the most important part of my application. They're required.
I know about the geo library for DynamoDB, but it's written in Java and my backend is written in PHP, so that's a no-go. Plus, they don't seem to update or maintain that library.
One solution I was thinking of was to store just the coordinates in MySQL and store the corresponding ID along with the other data (including the lat and lng values) in DynamoDB.
With this I could achieve the geospatial query functionality I want while being able to scale everything well on Amazon, since that's the host I'm using.
So basically I'd query all POIs within a given radius from MySQL and then use the resulting IDs to get the full records from DynamoDB. Sounds crazy or what?
But the potential downside of this is having to query one data source and then immediately query another with the result of the first query. Maybe I'm overthinking this and underestimating how fast these technologies have become.
So to sum up my requirements:
Must be on AWS
Must be able to perform geospatial queries
Must be able to connect to DynamoDB and MySQL from PHP
Any help or suggestions would be greatly appreciated.
My instinct says: don't use two data sources unless you have a really specific case.
How much data do you have? Is it really more than MySQL (or Aurora) can handle? If your application is read-heavy, it can easily scale with read replicas.
I have a few ideas for you which may bring you at least a bit closer:
Why don't you implement your own geo-library in PHP? :D
You can do a rough search in the DB, where you filter not by actual distance but by upper and lower bounds on latitude and longitude (so you are searching in a square rather than a circle). Then it's up to you whether your application is fine with that, or whether it filters the result by exact distance afterwards; either way that would be a much smaller dataset and an easy filter.
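To make the square search concrete, here is a small sketch of the bounding-box arithmetic, in Python for brevity. The constant of roughly 111.32 km per degree of latitude is an approximation, the function and field names are made up for the example, and the box breaks down near the poles and across the 180° meridian. In SQL the same bounds would simply go into `WHERE lat BETWEEN ... AND lng BETWEEN ...`.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude


def bounding_box(lat, lng, radius_km):
    """Return (min_lat, max_lat, min_lng, max_lng) enclosing the search circle."""
    dlat = radius_km / KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    dlng = radius_km / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return lat - dlat, lat + dlat, lng - dlng, lng + dlng


def in_box(rows, lat, lng, radius_km):
    """Keep the rows whose coordinates fall inside the square."""
    min_lat, max_lat, min_lng, max_lng = bounding_box(lat, lng, radius_km)
    return [r for r in rows
            if min_lat <= r["lat"] <= max_lat and min_lng <= r["lng"] <= max_lng]


# Hypothetical rows as they might come back from the database.
places = [
    {"name": "Louvre", "lat": 48.8606, "lng": 2.3376},
    {"name": "Versailles", "lat": 48.8049, "lng": 2.1204},
    {"name": "London Eye", "lat": 51.5033, "lng": -0.1196},
]

print([p["name"] for p in in_box(places, 48.8566, 2.3522, 10)])
```

Because the box only needs range comparisons on two columns, ordinary indexes on the latitude and longitude columns are enough, and an exact distance check can then run over the handful of rows that survive.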
Maybe CloudSearch can help you out. It offers geospatial queries on lat/long fields. It works well together with DynamoDB, and it has a PHP SDK (never tried that though, I use Node.js).
You write the items that have lat/long fields to DynamoDB. Each item (or item update/deletion) is uploaded to CloudSearch automatically via a DynamoDB stream. So now you have "automatic copies" of your DynamoDB items in CloudSearch, and you can use all query capabilities of CloudSearch, including geo queries (one limitation: it only queries in boxes, not in circles, so you will need some extra math).
You will need to create a DynamoDB stream that triggers a Lambda function that uploads every item to CloudSearch. You set this up once, and it will do its magic "forever".
This approach will only work if you accept a small delay between the moment you are writing in DynamoDB and the moment it is available in CloudSearch.
With this approach you still have 2 datasources, but they are entirely separated from the perspective of your app. One datasource is for querying and the other one for writing. Keeping them in sync is done automatically for you in the AWS cloud. Your app writes to DynamoDB, and queries from CloudSearch. And you have the scalability advantages that these AWS services offer.
We are looking for a technology to implement fast search for an autocomplete list based on large dictionaries (countries, cities, regions, organizations, objects).
Countries 150 rows
Cities 300 000 rows
Regions 70 000 rows
Organizations 500 000 rows
Objects 50 000 rows
The main requirements
1. there should be a single filter field for all dictionaries
2. user input should lead to displaying a list of completions
3. the expected load is more than 200 requests per second
Preferred technology: nginx, PHP, Windows servers
Given the size of the dictionaries, precomputing the results and putting them in memcached is not possible.
Please advise how to solve this problem.
Implementing fast autocomplete for large datasets like yours can be challenging. I have not used Elasticsearch/Solr for autocomplete, but they are possible solutions. I would prefer Elasticsearch though.
You can also try out Redis as the index store for autocomplete. Redis provides a demo of this: http://autocomplete.redis.io/ However, Redis is not officially supported on Windows.
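One common way to build such an index is to keep all names in sorted order and answer a prefix query with a range scan, since every entry sharing a prefix sits in one contiguous run. The sketch below is an in-memory Python stand-in for that idea, not Redis client code; the `PrefixIndex` class and the sample entries are made up for the example.

```python
import bisect


class PrefixIndex:
    """Sorted in-memory index over (name, kind) pairs from all dictionaries."""

    def __init__(self, entries):
        # Sort on the lowercased name so lookups are case-insensitive.
        self._rows = sorted((name.lower(), name, kind) for name, kind in entries)
        self._keys = [row[0] for row in self._rows]

    def complete(self, prefix, limit=10):
        """Return up to `limit` (name, kind) pairs whose name starts with prefix."""
        prefix = prefix.lower()
        start = bisect.bisect_left(self._keys, prefix)  # first key >= prefix
        results = []
        for key, name, kind in self._rows[start:start + limit]:
            if not key.startswith(prefix):
                break  # left the contiguous run of matches
            results.append((name, kind))
        return results


index = PrefixIndex([
    ("Paris", "city"),
    ("Paraguay", "country"),
    ("Parma", "city"),
    ("Peru", "country"),
    ("Oslo", "city"),
])

print(index.complete("par"))
```

A single index over all dictionaries gives you the one shared filter field, and each lookup is a binary search plus at most `limit` comparisons, regardless of how many rows the dictionaries hold.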
I also suggest not using PHP; use Java or C# instead for performance reasons.
If you don't want to implement it yourself, you can use a service called 'Autocomplete as a Service', which is written specifically for this purpose. You can access it here - www.aaas.io. It supports large datasets and you can apply filters as well.
Disclaimer: I am the founder of it. If the current plans do not fit your needs, you can drop me an email. I will be happy to provide the service for you.
I am designing a web app where I need to determine which places listed in my DB are within the user's driving distance.
Here is a broad overview of the process that I am currently using -
Get the user's current location via Google's Maps API
Run through each place in my database (approx. 100), checking whether the place is within the user's driving distance using the Google Places API. I fetch the JSON response and parse it with PHP to see if any locations exist given the user's coordinates.
If the place is within the user's driving distance, display the top locations (limited to 20 by Google Places); otherwise don't display it
This process works fine when I am running through a handful of places, but running through 100 places is much slower and makes 100 API calls. With Google's current limit of 100,000 calls per day, this could become an issue down the road.
So is there a better way to determine which places in my database are within a user's driving distance? I do not want to keep track of addresses in my DB; I want to rely on Google for that.
You can use the formula found here to calculate the distance between zip codes:
This is not precise (door-step to door-step) but you can calculate the distance from the user's zip code to the location's zip code.
Using this method, you won't even have to hit Google's API.
I have an unconventional idea for you. This will be very, very odd when you think about it for the first time, as it does exactly the opposite order of what you will expect to do. However, you might get to see the logic.
In order to put it in action, you'll need a broad category of stuff that you want the user to see. For instance, I'm going to go with "supermarkets".
There is a wonderful API as part of Google Places called nearbySearch. Its true wonder is that it lets you rank places by distance. We will make use of this.
Modify your database and store the unique ID returned for nearbySearch places. This isn't against the ToS, and we'll need it.
Get a list of those IDs.
The plan
When you get the user's location, query nearbySearch for your category, and loop through results with the following constraints:
If the result's ID matches something in your database, you have that result. Bonus #1: it's sorted by distance ascending! Bonus #2: you already get the lat-loc for it!
If the result's ID does not match, you can either silently skip it or use it and add it to your database. This means that you can quite literally update your database on-the-fly with little to no manual work as an added bonus.
When you have run through the request, you will have IDs that never came up in the results. Calculate the point-to-point distance of the furthest result in Google's data and you will have the max distance from your point. If this is too small, use the technique I described here to do a compounded search.
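The matching step of the plan can be sketched as below, in Python for brevity. The shape of `results` (a list of dicts with an `id` key, already ordered nearest first) is an assumption for the example and not the actual Google Places response format.

```python
def match_nearby(results, known_ids):
    """Split nearby-search results against the place IDs stored in the database.

    results:   list of dicts with an "id" key, ordered nearest first (assumed shape)
    known_ids: set of place IDs already stored in the database
    Returns (matched, new, missing).
    """
    matched = [r for r in results if r["id"] in known_ids]    # keeps distance order
    new = [r for r in results if r["id"] not in known_ids]    # skip, or add to the DB
    seen = {r["id"] for r in results}
    missing = known_ids - seen                                # never came up in results
    return matched, new, missing


# Hypothetical data for illustration.
results = [{"id": "a"}, {"id": "x"}, {"id": "b"}]
known_ids = {"a", "b", "c"}

matched, new, missing = match_nearby(results, known_ids)
print([r["id"] for r in matched], [r["id"] for r in new], missing)
```

The `missing` set is what tells you whether a further, compounded search is needed.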
The only requirement is: you need to know roughly what you are searching for. However, consider this: your normal query cycle takes you anywhere between 1 and 100 Google queries. My method takes 1 for a 50km radius. :-)
To calculate distances, you will need the haversine formula rather than doing a zip code lookup, by the way. This has the added advantage of being truly international.
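For reference, the haversine formula gives the great-circle distance d between two points from their latitudes φ1, φ2 and longitudes λ1, λ2 (all in radians), with r the Earth's mean radius (about 6371 km):

```latex
d = 2r \arcsin\left( \sqrt{ \sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right) } \right)
```

It treats the Earth as a sphere, so expect errors of a fraction of a percent, and it gives straight-line distance over the surface, not driving distance.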
Important caveats
This search method directly depends on the trade-off between the places you know about and the distance. If you are looking for less than 10km radii, use this method to only generate one request.
If, however, you have to do compounded searching, bear in mind that each request cycle will cost you 3N, where N is the number of queries generated on the last cycle. Therefore, if you only have 3 places in a 100km radius, it makes more sense to look up each place individually.
I am working on an application using a memcache pool (5 servers) and some processing nodes. I have two possible approaches and was wondering if you have any comments comparing the two on performance (primarily speed).
I extract a big chunk of data from memcache once per request, iterate over it, and discard the bits I don't need for the particular request
I extract small bits from memcached, fetching only the ones I need. That is, I extract the value of a and, based on it, extract the value of either b or c, then use this combination to find the next key I want to extract.
The difference between the two is that approach 1 reduces the number of memcached lookups (against a pool of servers) but increases the size of the response. Has anyone seen any benchmarking reports on this?
Unfortunately I can't use a better key based directly on the request, as I don't have enough memcache to support all possible combinations of values, so I have to construct some of it at run time.
You would have to benchmark for your own setup. The parts that would matter would be the time spent on:
requesting a large amount of data from memcache + retrieving it + extracting data from the response
sending several requests to memcache + retrieving the data
Basically the first thing you have to measure is how large the overhead for interaction with your cache pool is. And there is that small matter of how this whole thing will react when load increases. What might be fast now can turn out to be a terrible decision later, when the users start pouring in.
This kinda depends on your definition of "large chunk". Are we talking megabytes here or an array with 100 keys? You also have to consider that PHP still needs to process that information.
There are two things you can do at this point:
take a hard look at how you are storing the information. Maybe you can cut it down to two small requests: one to retrieve the conditional information, and another to get the specific data for those conditions.
set up your own benchmark for your server. Some random article on the web will not be relevant to your system architecture.
I know this is not the answer you wanted to hear, but that's my two cents .. here ya go.
I'm working on a full text index system for a project of mine. As one part of the process of indexing pages it splits the data into a very, very large number of very small pieces.
I have gotten the size of the pieces down to a constant 20-30 bytes, and it could be less; the actual data is basically two 8-byte integers and a float.
Because of the scale I'm looking for and the number of pieces this creates, I'm looking for an alternative to MySQL, which has shown significant issues at value sets well below my goal.
My current thinking is that a key-value store would be the best option for this and I have adjusted my code accordingly.
I have tried a number but for some reason they all seem to scale even less than mysql.
I'm looking to store on the order of hundreds of millions or billions or more key-value pairs so I need something that won't have a large performance degradation with size.
I have tried MemcacheDB, Membase, and MongoDB, and while they were all easy enough to set up, none of them scaled that well for me.
Membase had the most issues due to the number of keys required and the limited memory available. Write speed is very important here, as this is close to an even read/write workload: I write a thing once, then read it back a few times and store it for eventual update.
I don't need much performance on deletes, and I would prefer something that can cluster well, as I'm hoping to eventually have this scale across machines, but it needs to work on a single machine for now.
I'm also hoping to make this project easy to deploy, so an easy setup would be much better. The project is written in PHP, so the store needs to be easily accessible from PHP.
I don't need rows or other higher-level abstractions; they are mostly useless in this case. I have already reworked the code from some of my other tests to get down to a key-value store, and that seems likely to be the fastest, as I only have two things that would be retrieved from a row keyed off a third, so there is little additional work involved in using a key-value store. Does anyone know of any easy-to-use projects that can scale like this?
I am using this store to hold individual sets of three numbers (the sizes are based on how they were stored in MySQL and may not hold in other storage systems): two 8-byte integers, one for the ID of the document and one for the ID of the word, and a float representing the proportion of the document made up by that word (the number of times the word appeared divided by the number of words in the document). The index for this data is the word ID and the range the document ID falls into; every time I need to retrieve this data, it will be all of the results for a given word ID. I currently turn the word ID, the range, and a counter for that word/range combination each into binary representations and concatenate them to form the key, along with a 2-digit number that says which value I am storing under that key: the document ID or the float value.
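To illustrate the key scheme just described, here is a minimal sketch in Python (PHP's `pack()`/`unpack()` would do the same job). The field widths, the big-endian byte order, and the one-byte tag standing in for the 2-digit number are all assumptions made for the example, not the actual layout used in the project.

```python
import struct

# Assumed layout: word id (8 bytes), document-id range (4), counter (4), tag (1).
# Big-endian, so keys for the same word id sort next to each other as raw bytes.
KEY_FORMAT = ">QIIB"

TAG_DOC_ID = 1  # the value stored under this key is the document id
TAG_WEIGHT = 2  # the value stored under this key is the float proportion


def make_key(word_id, doc_range, counter, tag):
    """Concatenate the binary forms of the key parts into one byte string."""
    return struct.pack(KEY_FORMAT, word_id, doc_range, counter, tag)


def parse_key(key):
    """Recover (word_id, doc_range, counter, tag) from a key."""
    return struct.unpack(KEY_FORMAT, key)


key = make_key(word_id=42, doc_range=7, counter=3, tag=TAG_WEIGHT)
print(len(key), parse_key(key))
```

With this layout each key is 17 bytes, which is worth keeping in mind when the values themselves are only 4 to 8 bytes: the keys will dominate the storage and memory footprint.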
Performance measurement was somewhat subjective: I watched the output of the processes putting data into or pulling data out of the storage to see how fast documents were being processed, rapidly refreshed my statistics counters (which track more accurate figures on how fast the system is working), and compared the differences between storage methods.
You would need to provide some more data about what you really want to do...
depending on how you define fast large scale you have several options:
and sooo on.. the list gets pretty big..
Edit 1:
Per the comments on this post, I would suggest taking a look at Cassandra or Voldemort. Cassandra isn't a simple KV store per se, since you can store much more complex objects than just K -> V.
If you care to try Cassandra with PHP, take a look at phpcassa. Redis is also a good option if you set up a replica.
Here are a few products and ideas that weren't mentioned above:
OrientDB - this is a graph/document database, but you can use it to store very small "documents" - it is extremely fast, highly scalable, and optimized to handle vast amounts of records.
Berkeley DB - Berkeley DB is a key-value store used at the heart of a number of graph and document databases - supposedly has a SQLite-compatible API that works with PHP.
shmop - Shared memory operations might be one possible approach, if you're willing to do some dirty work. If your records are small and have a fixed size, this might work for you - using a fixed record size and padding with zeroes.
handlersocket - this has been in development for a long time, and I don't know how reliable it is. It basically lets you use MySQL at a "lower level", almost like a key/value-store. Because you're bypassing the query parser etc. it's much faster than MySQL in general.
If you have a fixed record-size, few writes and lots of reads, you may even consider reading/writing to/from a flat file. Likely nowhere near as fast as reading/writing to shared memory, but it may be worth considering. I suggest you weigh all the pros/cons specifically for your project's requirements, not only for products, but for any approach you can think of. Your requirements aren't exactly "mainstream", and the solution may not be as obvious as picking the right product.
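As a rough sketch of the fixed-record flat-file idea, in Python for brevity: each record is packed to a constant size, so record number i lives at byte offset i times the record size and can be read or overwritten with a single seek. The 20-byte layout (two 8-byte integers and a 4-byte float) follows the sizes given in the question; the function names and the little-endian layout are assumptions for the example.

```python
import struct
import tempfile

# doc id (8 bytes), word id (8 bytes), weight (4-byte float) = 20 bytes, no padding.
RECORD = struct.Struct("<qqf")


def write_record(f, index, doc_id, word_id, weight):
    """Write one record at its fixed offset in a file opened in binary mode."""
    f.seek(index * RECORD.size)
    f.write(RECORD.pack(doc_id, word_id, weight))


def read_record(f, index):
    """Read back the (doc_id, word_id, weight) tuple stored at `index`."""
    f.seek(index * RECORD.size)
    return RECORD.unpack(f.read(RECORD.size))


with tempfile.TemporaryFile() as f:
    write_record(f, 0, 10, 20, 0.5)
    write_record(f, 2, 11, 21, 0.25)  # slot 1 is left as a zero-filled gap
    print(read_record(f, 2))
```

The same offset arithmetic applies to a shared memory segment; only the read and write calls change. The weight is stored as a 32-bit float, so values that are not exactly representable will come back slightly rounded.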