We are looking for a technology to implement fast search for an autocomplete list that is based on large dictionaries (countries, cities, regions, organizations, objects).
Countries: 150 rows
Cities: 300 000 rows
Regions: 70 000 rows
Organizations: 500 000 rows
Objects: 50 000 rows
The main requirements:
1. a single filter field should cover all dictionaries
2. user input should display a list of completions
3. the expected load is more than 200 requests per second
Preferred technologies: nginx, PHP, Windows servers.
Given the size of the dictionaries, it is not possible to pre-compute all the results and put them in memcached.
Please suggest how to solve this problem.
Implementing fast autocomplete for datasets as large as yours can be challenging. I have not used Elasticsearch or Solr for autocomplete myself, but both are possible solutions; I would prefer Elasticsearch.
You can also try Redis as the index store for autocomplete; there is a demo of exactly that at http://autocomplete.redis.io/. Note, however, that Redis is not supported on Windows.
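If Redis were an option for you (for example on a separate Linux box), the prefix lookup could look roughly like the following. This is an untested sketch: it assumes the Predis client, Redis 2.8.9+ (where ZRANGEBYLEX exists), and an invented key name.

    <?php
    // Sketch only: assumes the Predis client (composer require predis/predis)
    // and Redis >= 2.8.9, where ZRANGEBYLEX is available.
    require 'vendor/autoload.php';

    $redis = new Predis\Client();                 // defaults to 127.0.0.1:6379
    $key   = 'autocomplete:all';                  // hypothetical key holding every dictionary entry

    // Index step: store each entry lower-cased with score 0, so the sorted set
    // orders members lexicographically and prefix ranges become possible.
    foreach (['Germany', 'Geneva', 'Genoa', 'Georgia'] as $entry) {
        $redis->zadd($key, [strtolower($entry) => 0]);
    }

    // Lookup step: everything between "[prefix" and "[prefix\xff" shares the prefix.
    function complete(Predis\Client $redis, $key, $prefix, $limit = 10) {
        $prefix  = strtolower($prefix);
        $matches = $redis->zrangebylex($key, '[' . $prefix, '[' . $prefix . "\xff");
        return array_slice($matches, 0, $limit);
    }

    print_r(complete($redis, $key, 'gen'));       // e.g. ["geneva", "genoa"]

A single sorted set holding all ~920,000 entries is small by Redis standards, and one ZRANGEBYLEX call per keystroke should comfortably cover the 200 requests per second target.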
Also, I suggest not using PHP for this; rather, use Java or C# for performance reasons.
If you don't want to implement it yourself, you can use a service called 'Autocomplete as a Service', which is written specifically for this purpose. You can access it at www.aaas.io. It supports large datasets, and you can apply filters as well.
Disclaimer: I am its founder. If the current plans do not fit your needs, you can drop me an email; I will be happy to provide the service for you.
Related
I need to be able to quickly find the n closest destinations for a given destination, calculate an n x n distance matrix for n destinations, and perform several other such operations related to distances between two or more destinations.
I have learned that a graph DB will give far better performance than a MySQL database. My application is written in PHP.
So my question is: is it possible to use a graph DB with a PHP application? If yes, which open-source option is best, how should this data be stored in the graph DB, and how would it be accessed?
Thanks in advance.
Neo4j is a very solid graph DB and has flexible (if a bit complex) licensing as well. It implements the Blueprints API and should be pretty easy to use from just about any language, including PHP. It also has a REST API, which is about as flexible as it gets, and there is at least one good example of using it from PHP.
Depending on what data you have, there are a number of ways to store it.
If you have "route" data, where your points are already connected to each other via specific paths (ie. you can't jump from one point directly to another), then you simply make each point a node and the connections you have between points in your routes are edges between nodes, with the distances as properties of those edges. This would give you a graph that looks like your classic "traveling salesman" sort of problem, and calculating distances between nodes is just a matter of doing a weighted breadth-first search (assuming you want shortest path).
If you can jump from place to place with your data set, then you have a fully connected graph. Obviously this is a lot of data, and it grows quadratically as you add more destinations, but a graph DB is probably better at dealing with this than a relational DB is. To store the distances, as you add nodes to the graph, you also add an edge to each other existing node with the distance pre-calculated as one of its properties. Then, to retrieve the distance between a pair of nodes, you simply find the edge between them and read its distance property.
However, if you have a large number of fully-connected nodes, you would probably be better off just storing the coordinates of those nodes and calculating the distances as-needed, and optionally caching the results to speed things up.
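A minimal sketch of that "compute on demand, cache the result" idea, assuming plain latitude/longitude values in degrees (the coordinates in the example are arbitrary):

    <?php
    // Great-circle (haversine) distance in kilometres from raw lat/lng degrees,
    // with a tiny in-process cache so repeated pairs are not recomputed.
    function haversineKm($lat1, $lng1, $lat2, $lng2) {
        static $cache = [];
        $key = "$lat1,$lng1,$lat2,$lng2";
        if (isset($cache[$key])) {
            return $cache[$key];
        }
        $r    = 6371.0;                              // mean Earth radius in km
        $dLat = deg2rad($lat2 - $lat1);
        $dLng = deg2rad($lng2 - $lng1);
        $a    = pow(sin($dLat / 2), 2)
              + cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * pow(sin($dLng / 2), 2);
        return $cache[$key] = $r * 2 * atan2(sqrt($a), sqrt(1 - $a));
    }

    // Example: Berlin -> Paris, roughly 880 km.
    echo round(haversineKm(52.52, 13.405, 48.8566, 2.3522)), " km\n";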
Lastly, if you use the Blueprints API and the other tools in that stack, like Gremlin and Rexster, you should be able to swap in/out any compatible graph database, which lets you play around with different implementations that may meet your needs better, like using Titan on top of a Cassandra / Hadoop cluster.
Yes, a graph database will give you more performance than an extension for MySQL or Postgres will be able to. One that looks really slick is OrientDB; there's a beta PHP implementation that uses the binary protocol and another one that uses HTTP as the transport layer.
As for the example code, Alessandro (from odino.org) wrote an implementation of Dijkstra's algorithm, along with a full explanation of how to use it with OrientDB to find the minimum distance between cities.
Actually it's not so much about the database as about the indexes. I've used MongoDB's geospatial indexing and search (it's a document DB), which has geo indexing designed for finding the nearest elements to given coordinates, with good results. Still, it runs only simple queries (find nearest), and it gets a bit slow if your index doesn't fit in RAM (I used the GeoNames DB with 8 million places with coordinates and got 0.005-2.5 s per query on a VM, due to 1. HDD overhead and 2. the index probably not fitting in RAM).
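For reference, a nearest-places lookup with the legacy PHP "mongo" extension looked roughly like this (database, collection and field names invented; newer code would use ext-mongodb and the mongodb/mongodb library instead):

    <?php
    // Sketch using the legacy MongoClient class from the old "mongo" extension.
    $mongo  = new MongoClient();                        // localhost:27017 by default
    $places = $mongo->selectDB('geo')->selectCollection('places');

    // A "2d" index on [lng, lat] pairs enables $near queries.
    $places->ensureIndex(['loc' => '2d']);
    $places->insert(['name' => 'Example town', 'loc' => [13.405, 52.52]]);

    // Ten places nearest to a given point, closest first.
    $cursor = $places->find(['loc' => ['$near' => [13.405, 52.52]]])->limit(10);
    foreach ($cursor as $doc) {
        echo $doc['name'], "\n";
    }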
Are there any tools for identifying and merging non-exact duplicates in MySQL tables?
I have a large data set with many duplicates like:
1348, Auto Motors, 12 Long Road, etc
48264, Auto Mtors, 12 Log Road, etc
82743, Ato Motoers, 12 Lng Road, etc
83821, Auto Motors, 13 Long Road, etc
92743, Auto Motors, 11 Long Road, etc
There are many tables that need to be merged, such as:
Companies
Addresses
Phone Numbers
Employees
There are about 100,000 rows, and 30-40 columns to match on each row (across the joined tables).
So, does anyone know of a tool for sorting this out? I already have MySQL and PHP installed. I have used MongoDB and Solr before and can use them if they would help, and I am open to installing other software if needed.
Alternatively, what kind of queries should I run if I cannot find a tool to handle this?
A simple "find all duplicates" query won't work because the duplicates are not exact.
Doing wildcard LIKE searches would be extremely slow for all the different combinations I would need to try.
Using an Oliver or Levenshtein comparison (in MySQL) may work, but there is too much data to pull into PHP (which would probably be extremely slow as well).
You have data that requires massaging. I don't think this is something you can do entirely in SQL.
Google Refine is a great tool for this kind of massaging. I would load the data into Refine first, clean it up, and then import it into your relational database.
Doing wildcard LIKE searches would be extremely slow for all the different combinations I would need to try.
Using an Oliver or Levenshtein comparison (in MySQL) may work, but there is too much data to pull into PHP (which would probably be extremely slow as well).
You state this as if it were fact, but that is exactly what I would suggest. E.g. load one row into PHP, then loop over all the other rows, matching with whichever algorithms you feel are appropriate (Levenshtein, or perhaps your own list of stopwords, etc.). It'll take a while to run through, but this is presumably something you can do as a one-off task, or at least a periodic one (say, once per day).
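As a very rough, untested sketch of that loop (hypothetical companies table and column names; the thresholds are guesses to be tuned against a sample of known duplicates):

    <?php
    // One-off fuzzy-duplicate pass: pull candidate rows once, compare the company
    // names pairwise with levenshtein()/similar_text(), and log probable duplicates
    // for manual review rather than merging blindly.
    $pdo  = new PDO('mysql:host=localhost;dbname=crm;charset=utf8', 'user', 'pass');
    $rows = $pdo->query('SELECT id, name FROM companies')->fetchAll(PDO::FETCH_ASSOC);

    $normalize = function ($s) {
        return preg_replace('/[^a-z0-9 ]/', '', strtolower(trim($s)));
    };

    $suspects = [];
    $count = count($rows);
    for ($i = 0; $i < $count; $i++) {
        for ($j = $i + 1; $j < $count; $j++) {
            $a = $normalize($rows[$i]['name']);
            $b = $normalize($rows[$j]['name']);
            if (levenshtein($a, $b) <= 2 || similar_text($a, $b) / max(strlen($a), 1) > 0.9) {
                $suspects[] = [$rows[$i]['id'], $rows[$j]['id'], $rows[$i]['name'], $rows[$j]['name']];
            }
        }
    }

    // Dump the candidate pairs somewhere reviewable.
    file_put_contents('duplicate-candidates.csv', implode("\n", array_map(function ($p) {
        return implode(',', $p);
    }, $suspects)));

For ~100,000 rows a full pairwise pass is on the order of 5 billion comparisons, so in practice you would first bucket the rows (for example by metaphone() of the name, or by postcode) and only compare within a bucket.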
I am looking to build a real-estate search engine. The specs are:
Approx 500 000 listings
daily updates of potentially 50 000 listings
Data supplied in clean(ish) CSVs - need to remove characters, encode as UTF-8, the usual.
50+ fields of data (30 images, various property specs etc)
I'm having a lot of problems with Drupal 7, and Joomla cannot handle it - and that's just the data import.
I want Solr to index the data and serve as the search engine. I have a few questions.
Can Solr serve the listings directly from its index? (If so, do I need a data store such as MySQL, or even a CMS, at all?)
Or would I be better off putting the data in a simple single-table MySQL DB, using that to push documents to Solr for indexing, and then loading listings either from the DB or from the Solr index?
Due to data difficulties, it seems I can simply do away with a lot of the complications of trying to figure out the inner workings of D7/Joomla/any other CMS and just put up a few simple PHP files as the front end.
I don't need anything fancy-looking; I was going to use the basic Drupal template for this project.
I need speed and reliability and excellent search results.
IMHO it should be possible to use Solr exclusively for your purpose. 500,000 listings is not very much for Solr, even on a single server, but 50,000 updates spread over roughly 10 hours is indeed a lot: that is about 5,000 updates per hour, which amounts to reindexing a tenth of the whole data set every day.
We use Solr in our enterprise too, with something like 40-120 fields; 40,000 items need about 5 minutes to index completely. If you want to autowarm the caches, you have to add perhaps a few more minutes to that.
As far as I can see, your problem will be the short update periods. If you want to update individual documents as they arrive instead of applying all of the day's 50,000 updates in one batch, your Solr cannot make good use of its caches, or you will have to use multiple Solr servers. (For Solr 4.0 you could perhaps even consider scaling up your Solr server hardware, but I doubt 3.x would see any benefit from that.)
Not being able to use the caches can lead to slower search performance, but it does not have to.
Since Solr offers dynamic fields, you can add a different structure per document. This should cover your requirement for various property fields.
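As a sketch of what "serving listings straight from Solr" could look like from a thin PHP front end (the Solr URL, filter and field names here are invented, and wt=json assumes the JSON response writer is enabled):

    <?php
    // Thin front end hitting Solr's select handler directly; no CMS and no MySQL
    // read path involved. Core location and field names are made up.
    function searchListings($userQuery, $rows = 20) {
        $params = [
            'q'    => $userQuery,            // searches the default field from schema.xml
            'fq'   => 'status:active',       // hypothetical filter query
            'fl'   => 'id,title,price,suburb,thumbnail',
            'rows' => $rows,
            'wt'   => 'json',
        ];
        $url  = 'http://localhost:8983/solr/select?' . http_build_query($params);
        $json = json_decode(file_get_contents($url), true);
        return $json['response']['docs'];
    }

    foreach (searchListings('waterfront apartment') as $doc) {
        echo $doc['title'], ' - $', $doc['price'], "\n";
    }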
A company we do business with wants to give us a 1.2 GB CSV file every day containing about 900,000 product listings. Only a small portion of the file changes every day, maybe less than 0.5%, and it's really just products being added or dropped, not modified. We need to display the product listings to our partners.
What makes this more complicated is that our partners should only be able to see product listings available within a 30-500 mile radius of their zip code. Each product listing row has a field for what the actual radius for the product is (some are only 30, some are 500, some are 100, etc. 500 is the max). A partner in a given zip code is likely to only have 20 results or so, meaning that there's going to be a ton of unused data. We don't know all the partner zip codes ahead of time.
We have to consider performance, so I'm not sure what the best way to go about this is.
Should I have two databases - one with zip codes and latitude/longitude, using the Haversine formula to calculate distances, and the other the actual product database - and then what do I do? Return all the zip codes within a given radius and look for a match in the product database? For a 500-mile radius that's going to be a ton of zip codes. Or should I write a MySQL function?
We could use Amazon SimpleDB to store the database...but then I still have this problem with the zip codes. I could make two "domains" as Amazon calls them, one for the products, and one for the zip codes? I don't think you can make a query across multiple SimpleDB domains, though. At least, I don't see that anywhere in their documentation.
I'm open to some other solution entirely. It doesn't have to be PHP/MySQL or SimpleDB. Just keep in mind our dedicated server is a P4 with 2 GB of RAM. We could upgrade the RAM, it's just that we can't throw a ton of processing power at this. Or we could even store and process the database every night on a VPS somewhere, where it wouldn't be a problem if the VPS were unbearably slow while that 1.2 GB CSV is being processed. We could even process the file offline on a desktop computer and then remotely update the database every day...except then I still have this problem with zip codes and product listings needing to be cross-referenced.
You might want to look into PostgreSQL and PostGIS. It has features similar to MySQL's spatial indexing, without the need to use MyISAM (which, in my experience, tends to become corrupt, as opposed to InnoDB).
In particular look at Postgres 9.1, which allows k-nearest-neighbour search queries using GiST indexes.
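A rough sketch of such a query from PHP, assuming a PostGIS geometry column called geom on an invented products table:

    <?php
    // K-nearest-neighbour lookup using the PostGIS "<->" operator, which a GiST
    // index on the geometry column can serve directly (Postgres 9.1+, PostGIS 2.0+).
    // Note: in older PostGIS versions "<->" orders by bounding-box centroid distance,
    // so you may want to re-check exact distances on the handful of rows returned.
    $pdo = new PDO('pgsql:host=localhost;dbname=listings', 'user', 'pass');

    $sql = "SELECT id, title
              FROM products
          ORDER BY geom <-> ST_SetSRID(ST_MakePoint(:lng, :lat), 4326)
             LIMIT 10";

    $stmt = $pdo->prepare($sql);
    $stmt->execute([':lng' => -73.99, ':lat' => 40.73]);
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        echo $row['id'], ' ', $row['title'], "\n";
    }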
Well, that is an interesting problem indeed.
This is actually two issues: first, how you should index the database, and second, how you keep it up to date. The first you can achieve roughly as you describe, though normalization may or may not be a problem depending on how you are storing the zip code; it primarily comes down to what your data looks like.
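For instance, here is a rough sketch of the radius lookup: a cheap bounding-box prefilter in SQL, then an exact haversine check in PHP on the handful of rows that survive. It assumes a hypothetical products table that already carries lat/lng resolved from each listing's zip code, plus the per-listing radius_miles field.

    <?php
    // Radius lookup sketch (invented table and column names).
    function listingsNear(PDO $pdo, $lat, $lng, $maxRadiusMiles = 500) {
        $latDelta = $maxRadiusMiles / 69.0;                         // ~69 miles per degree of latitude
        $lngDelta = $maxRadiusMiles / (69.0 * max(cos(deg2rad($lat)), 0.01));

        $stmt = $pdo->prepare(
            'SELECT id, name, lat, lng, radius_miles
               FROM products
              WHERE lat BETWEEN :latMin AND :latMax
                AND lng BETWEEN :lngMin AND :lngMax'
        );
        $stmt->execute([
            ':latMin' => $lat - $latDelta, ':latMax' => $lat + $latDelta,
            ':lngMin' => $lng - $lngDelta, ':lngMax' => $lng + $lngDelta,
        ]);

        $hits = [];
        foreach ($stmt as $row) {
            $d = haversineMiles($lat, $lng, $row['lat'], $row['lng']);
            if ($d <= $row['radius_miles']) {                       // honour the per-listing radius
                $row['distance'] = round($d, 1);
                $hits[] = $row;
            }
        }
        return $hits;
    }

    function haversineMiles($lat1, $lng1, $lat2, $lng2) {
        $r = 3959.0;                                                // Earth radius in miles
        $a = pow(sin(deg2rad($lat2 - $lat1) / 2), 2)
           + cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * pow(sin(deg2rad($lng2 - $lng1) / 2), 2);
        return $r * 2 * atan2(sqrt($a), sqrt(1 - $a));
    }

An index on lat (or on lat and lng) keeps the prefilter cheap even on modest hardware.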
As for the second one, this is more my area of expertise. You can have your client upload the CSV to you as they currently do, keep a copy of yesterday's file and run the two through a diff utility, or you can use Perl, PHP, Python, Bash or any other tool you have to find the lines that have changed, and pass those into a second step that updates your database. I have dealt with clients with issues along these lines, and scripting it away tends to be the best choice. If you need help organizing your script, that is always available.
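A rough sketch of the "diff the two files in PHP" variant, assuming the first CSV column is a stable product id (file names invented). It streams line by line and keeps only id => md5(line) in memory, so ~900,000 rows stay within a few hundred MB:

    <?php
    // Diff two daily product CSVs without loading the full 1.2 GB into memory.
    function csvFingerprints($path) {
        $map = [];
        $fh  = fopen($path, 'r');
        while (($line = fgets($fh)) !== false) {
            $id = strtok($line, ',');             // first column = product id (assumed)
            $map[$id] = md5($line);
        }
        fclose($fh);
        return $map;
    }

    $yesterday = csvFingerprints('products-yesterday.csv');
    $today     = csvFingerprints('products-today.csv');

    $addedIds   = array_keys(array_diff_key($today, $yesterday));
    $removedIds = array_keys(array_diff_key($yesterday, $today));
    $changedIds = array_keys(array_diff_assoc(array_intersect_key($today, $yesterday), $yesterday));

    // Only these (a fraction of a percent of the file) need touching in the DB.
    printf("added: %d, removed: %d, changed: %d\n",
        count($addedIds), count($removedIds), count($changedIds));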
I'm building an application where vehicle coordinates are logged by GPS. I want to implement a couple of features to start with, such as:
realtime tracking of vehicles
history tracking of vehicles
keeping locations and areas for customer records
I need some guidelines as to where to start on database and application design. Anything from best practices to hints to experience would really help me get on the right track.
How would one tackle ORM for geometry? For example, a location would map to a SpatialPoint class, while an area would map to a SpatialPolygon class.
How do I keep the massive data stream coming from the vehicles sane? I'm thinking of a table to keep the latest points in (for realtime data) and batch-parsing this data into polylines in a separate table for history purposes (one line per employee shift on a vehicle).
MySQL is probably not the best choice for this, but I'm planning on using Solr as the index for quick location-based searches, although we also need to do some realtime distance calculations, such as which vehicle is nearest to customer X. Any thoughts?
I can help you on one bit: MySQL definitely is the best choice. I've been down the same path as you many times, and the MySQL spatial extension is fantastic; in fact it's blazingly fast even over tables with 5 million+ rows of spatial data, because it's all in the index. The spatial extension is one of the best-kept MySQL secrets that few use ;)
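For illustration, a bounding-box lookup against a spatial index could look roughly like this (invented table, points stored as POINT(lng lat), MyISAM-era function names; newer MySQL spells them ST_GeomFromText and so on):

    <?php
    // Sketch of the MySQL spatial extension in use.
    $pdo = new PDO('mysql:host=localhost;dbname=tracking;charset=utf8', 'user', 'pass');

    $pdo->exec('CREATE TABLE IF NOT EXISTS positions (
                    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
                    vehicle_id INT UNSIGNED NOT NULL,
                    pos POINT NOT NULL,
                    SPATIAL INDEX (pos)
                ) ENGINE=MyISAM');

    // Store points as POINT(lng lat).
    $pdo->exec("INSERT INTO positions (vehicle_id, pos)
                VALUES (42, GeomFromText('POINT(13.405 52.52)'))");

    // Bounding-box lookup that the spatial index can answer directly.
    $bbox = 'POLYGON((13.0 52.3, 13.8 52.3, 13.8 52.7, 13.0 52.7, 13.0 52.3))';
    $stmt = $pdo->prepare(
        'SELECT vehicle_id, X(pos) AS lng, Y(pos) AS lat
           FROM positions
          WHERE MBRContains(GeomFromText(:bbox), pos)'
    );
    $stmt->execute([':bbox' => $bbox]);
    print_r($stmt->fetchAll(PDO::FETCH_ASSOC));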
ORM I'd recommend skipping for this, to be honest - if you have a huge amount of data, all those class instances will kill your application; stick with a very simple array structure for dealing with the data.
Regarding the massive data stream: either consume it live and only store every 10th entry, or just stick it all in the one table - it won't hurt query speed, thanks to how the table is indexed, but table size may be worth keeping an eye on.
As an alternative accessible from PHP, you could try PostGIS on PostgreSQL, but I've always favoured MySQL for ease of use, native support and all-round speed.
Good luck!
Yes, I recommend using Solr as well; the current release is 1.4. It works incredibly well for this problem.
ORM -
You may want sfSolrPlugin with the Doctrine ORM to tie PHP to Solr; see the article from LucidWorks entitled "Building a search application in 15 person-days".
real time index updates -
That is coming in the next release of Solr, I believe Solr 1.5. You can get it from SVN.
Geo-spatial search -
I use the Spatial Search Plugin for Apache Solr. Geo-spatial capabilities might be included in Solr 1.5; I believe there is already some rudimentary support for geo-spatial search without the plugin.
On "how to handle/store a lot of points coming from the vehicles":
I'm working on a very similar project. I've solved this problem by maintaining 2 tables (using MySQL but this holds true for any other DB):
one for tracking objects (vehicles, users, whatever)
this table has the object id as its primary key, and any insert that violates the primary key constraint instead updates the data stored for that key. This is easily achieved with ON DUPLICATE KEY UPDATE (see the sketch at the end of this answer).
This makes the lookup extremely fast for tracking and keeps only one location record per object. I have also implemented server-side logic for deleting obsolete records (after a certain amount of time the data needs to be deleted if no updates have been received for it).
one for history/lookup purposes
this table has the object id and the timestamp as a composite primary key. The table can be partitioned on the timestamp column.
Any update on an object's location would insert to both tables.
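A minimal sketch of that double write (hypothetical table and column names; latest_position has object_id as its primary key, position_history has object_id plus recorded_at):

    <?php
    // One incoming GPS fix, written to both tables: the "latest" table is upserted
    // via ON DUPLICATE KEY UPDATE, the history table just grows (and can be
    // partitioned on recorded_at).
    function storeFix(PDO $pdo, $objectId, $lat, $lng, $recordedAt) {
        $pdo->beginTransaction();

        $latest = $pdo->prepare(
            'INSERT INTO latest_position (object_id, lat, lng, recorded_at)
             VALUES (:id, :lat, :lng, :ts)
             ON DUPLICATE KEY UPDATE
                 lat = VALUES(lat), lng = VALUES(lng), recorded_at = VALUES(recorded_at)'
        );
        $latest->execute([':id' => $objectId, ':lat' => $lat, ':lng' => $lng, ':ts' => $recordedAt]);

        $history = $pdo->prepare(
            'INSERT INTO position_history (object_id, recorded_at, lat, lng)
             VALUES (:id, :ts, :lat, :lng)'
        );
        $history->execute([':id' => $objectId, ':ts' => $recordedAt, ':lat' => $lat, ':lng' => $lng]);

        $pdo->commit();
    }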
I hope this helps.