I am looking to build a real estate search engine. The specs are:
Approx 500 000 listings
Daily updates of potentially 50,000 listings
Data supplied in clean(ish) CSVs; I need to strip stray characters, convert the encoding to UTF-8, the usual.
50+ fields of data (30 images, various property specs, etc.)
I'm having a lot of problems with Drupal 7, and Joomla cannot handle it. That's just the data import.
I want Solr to index the data and serve as the search engine. I have a few questions.
Can Solr serve the listings directly from its index? (If so, do I still need a data store such as MySQL, or even a CMS?)
Or would I be better off putting the data in a simple single-table MySQL database, using that to push documents to Solr for indexing, and then loading listings either from the database or from the Solr index?
Given the data difficulties, it seems I can do away with a lot of the complications of figuring out the inner workings of D7/Joomla/any other CMS and just put a few simple PHP files up as the front end.
I don't need anything fancy looking; I was going to use the basic Drupal template for this project.
I need speed, reliability, and excellent search results.
IMHO it should be possible to use Solr exclusively for your purpose. 500,000 listings is not very much for Solr, even on a single server, but 50,000 updates per day is quite a lot: spread over roughly 10 hours that is about 5,000 updates per hour, so around a tenth of your index churns every day.
We use Solr in our enterprise too, with somewhere between 40 and 120 fields per document. Indexing 40,000 items completely takes about 5 minutes. If you want to autowarm caches, add perhaps a few more minutes to that.
As far as I can see, your problem will be the short update intervals. If you want to update individual documents throughout the day instead of reindexing all of the listings in one batch, your Solr will not be able to make good use of caching, or you will have to use multiple Solr servers. (With Solr 4.0 you could even consider scaling up your Solr server hardware, but I doubt 3.x would gain much benefit from that.)
Not using caches can lead to slower search performance, but it does not have to.
Since Solr offers dynamic fields, you can use a different field structure per document. That should cover your requirement for varied property specs.
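Regarding serving listings straight from the index: if your fields are defined as stored, Solr returns them with each hit, so you can render a listing page without going back to MySQL. Below is a minimal indexing sketch in PHP, assuming a Solr core named "listings" on localhost, the JSON update handler on /update (older 3.x releases exposed it as /update/json), and the dynamic-field suffixes (*_i, *_s, *_t) from the example schema; all field names are made up for illustration.

```php
<?php
// Minimal sketch: push one listing into Solr via the JSON update handler.
// Core name, host and field names are assumptions; *_i / *_s / *_t rely on
// the dynamic-field rules shipped with the example schema.
$doc = [
    'id'            => 'listing-12345',
    'price_i'       => 250000,
    'bedrooms_i'    => 3,
    'suburb_s'      => 'Springfield',
    'description_t' => 'Renovated three-bedroom house close to schools.',
];

$ch = curl_init('http://localhost:8983/solr/listings/update?commit=true');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_POSTFIELDS     => json_encode([$doc]),   // the handler accepts an array of documents
    CURLOPT_RETURNTRANSFER => true,
]);
$response = curl_exec($ch);
curl_close($ch);

// For bulk loads, send documents in batches and commit once at the end
// instead of passing commit=true on every request.
```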
Related
I'm the developer for a real estate syndication website and am currently having trouble figuring out a way to update massive numbers of listings/records efficiently (2,000,000+ listings).
We currently accept XML feeds, containing real estate listings, from about 20 different websites. Most of the incoming feeds are small (~100 or so listings), but we have a couple of XML feeds that contain ~1,000,000 listings. The small feeds are parsed quickly and easily; however, the large feeds take upwards of 2-3 hours each.
The current "live" database table that contains the listings for viewing on the site is MyISAM. I chose MyISAM because ~95% of the queries against the table are SELECTs. Really, the only writes (UPDATE/INSERT queries) happen while the XML feeds are being processed.
The current process is as follows:
There is a CRON in place that starts the main parsing script.
It loops through a feeds table and grabs each external XML feed source file. It then runs through the file and, for each record in the XML, checks against the listings table to see whether that listing needs to be updated or inserted (if it's a new listing).
This is all happening against the live table. What I'd like to find out is if anybody has any better logic to make these updates/inserts happen in the background so as to not slow down the production tables, and ultimately, the user experience.
Would a delta table be the best choice? Maybe do all the heavy work on a separate database and just copy the new table over to the production database? On a separate workhorse domain altogether? Should I have a separate listings table that does all the parsing which would be InnoDB instead of MyISAM?
What we're trying to accomplish is to have our system be able to update listings frequently throughout the day without slowing the site down. Our competitors boast that they are updating their listings every 5 minutes in some cases. I just don't see how that's even possible.
I'm working right now so this is more of a brain dump just to get the ball rolling. If anybody would like me to provide table schematics, I'd be more than happy.
In summary: I'm looking for a way to frequently update millions of records in our database (daily) via a couple dozen external XML feeds/files. I just need some logic on how to effectively, and efficiently, make this happen so as to not drag the production server down with it.
Firstly, for your existing 3-hour import, try wrapping every 100 inserts in a transaction. They will be written to the database in one go, and that might speed things up dramatically. Play around with the value of 100; the best value will depend on how resilient you want the import to be and how much memory your transaction cache has. (This will of course require you to switch to a transactional engine such as InnoDB, since MyISAM does not support transactions.)
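As a rough illustration of the batching idea, here is a PDO sketch; the `listings` table, its columns and the `$records` iterable coming out of your feed parser are placeholders, and it assumes the table is (or becomes) InnoDB.

```php
<?php
// Sketch of batching inserts/updates in transactions of ~100 rows.
// Assumes an InnoDB `listings` table and a $records iterable produced by
// your feed parser; column names are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=realestate;charset=utf8', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$stmt = $pdo->prepare(
    'INSERT INTO listings (listing_id, price, title)
     VALUES (:id, :price, :title)
     ON DUPLICATE KEY UPDATE price = VALUES(price), title = VALUES(title)'
);

$batchSize = 100;   // tune this value
$count = 0;
$pdo->beginTransaction();

foreach ($records as $r) {
    $stmt->execute([
        ':id'    => $r['listing_id'],
        ':price' => $r['price'],
        ':title' => $r['title'],
    ]);
    if (++$count % $batchSize === 0) {
        $pdo->commit();              // flush this batch in one go
        $pdo->beginTransaction();
    }
}
$pdo->commit();                      // commit the final partial batch
```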
For providers that are known to offer larger files, try keeping a copy of the previous XML download and then doing a text diff between the old file and the new one. If you set your context setting (i.e. the number of unchanged lines shown around changed lines) sufficiently large, you should be able to capture the primary keys of the changed items. You would then only need to do a small number of updates.
Of course, it would help if your providers maintain the order of their XML listings. If they don't, a text sort followed by a diff may still be faster than importing everything.
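A sketch of that diff approach, assuming the system `diff` utility is available and that each record carries something like a `<ListingID>` tag; the paths and the regex are guesses about the feed format, so adjust them to yours.

```php
<?php
// Sketch: compare yesterday's XML feed with today's and collect the IDs of
// changed listings. Paths and the <ListingID> tag are assumptions about the
// feed; adjust the regex to whatever uniquely identifies a record.
$old = '/data/feeds/bigprovider/yesterday.xml';
$new = '/data/feeds/bigprovider/today.xml';

// -U3 keeps 3 lines of context around each change so the ID tag is likely
// to be captured even when it is not on the changed line itself.
$diff = shell_exec('diff -U3 ' . escapeshellarg($old) . ' ' . escapeshellarg($new));

$changedIds = [];
if (preg_match_all('/<ListingID>([^<]+)<\/ListingID>/', (string) $diff, $m)) {
    $changedIds = array_unique($m[1]);
}

// Re-import only these listings instead of the whole million-row feed.
print_r($changedIds);
```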
FWIW, I think a complete refresh every 5 minutes is probably not feasible. I expect your providers would not be happy with you downloading 1M records at this frequency!
I'm going to add a simple live search to a website (suggestions shown while the user types in an input box).
Main task:
39k plain-text lines to search (~500 characters per line, 4 MB total size)
1k online users can be typing in the input box simultaneously
In some cases 2k-3k results can match a user's request
I'm worried about the following questions:
Database vs. text file?
Are there any general rules or best practices related to my task aimed at decreasing DB/server memory load? (caching/indexing/etc.)
Are Sphinx/Solr appropriate for such a task?
Any links/advice will be extremely helpful.
Thanks
P.S. Maybe this is the best solution: PHP to search within the txt file and echo the whole matching line?
Put your data in a database (SQLite should do just fine, but you can also use a more heavy-duty RDBMS like MySQL or Postgres), and put an index on the column or columns that will be searched.
Only do the absolute minimum, which means that you should not use a framework, an ORM, etc. They will just slow down your code.
Create a PHP file, grab the search text and do a SELECT query using a native PHP driver, such as SQLite, MySQLi, PDO or similar.
Also, think about how the search box will work. You can prevent many requests if you e.g. put a minimum character limit (it does not make sense to search only for one or two characters), put a short delay between sending requests (so that you do not send requests that are never used), and so on.
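Putting those pieces together, a minimal endpoint might look like the sketch below; the SQLite file path, the `lines` table and the three-character minimum are assumptions, not requirements.

```php
<?php
// search.php - minimal live-search endpoint (a sketch, not production code).
// Assumes an SQLite file with a `lines` table and a `content` column; names
// are placeholders. A plain index on `content` only helps prefix searches,
// but a full scan over ~39k short rows is cheap anyway.
$term = trim($_GET['q'] ?? '');
if (mb_strlen($term) < 3) {                 // minimum character limit
    header('Content-Type: application/json');
    echo json_encode([]);
    exit;
}

$pdo  = new PDO('sqlite:/var/data/search.sqlite');
$stmt = $pdo->prepare('SELECT content FROM lines WHERE content LIKE :q LIMIT 50');
$stmt->execute([':q' => '%' . $term . '%']);

header('Content-Type: application/json');
echo json_encode($stmt->fetchAll(PDO::FETCH_COLUMN));
```

On the client side, a short debounce before firing the AJAX request covers the "short delay between requests" point.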
Whether or not to use an extension such as Solr depends on your circumstances. If you have a lot of data, and a lot of requests, then maybe you should look into it. But if the problem can be solved using a simple solution then you should probably try it out before making it more complicated.
I have implemented 'live search' many times, always using AJAX querying the database (MySQL), and I haven't observed any speed or heavy-load issues yet.
Anyway, I have seen an implementation using Solr, but I cannot say whether it was quicker or consumed fewer resources.
It completely depends on the hardware the server runs on, IMO. As I wrote elsewhere, I have seen a server with a very slow filesystem, so implementing live search by reading and parsing txt files (or using Solr) could be slower than querying the database. On the other hand, you may be on poor shared webhosting with a slow DB connection (which gets even slower with more concurrent connections), in which case the database won't be the best solution either.
My suggestion: use MySQL with AJAX (look at this jQuery plugin or this article), set proper INDEXes on the searched columns, and if that turns out to be slow you can still move to a txt file.
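Since a leading-wildcard LIKE ('%term%') cannot use an ordinary B-tree index, a FULLTEXT index is the usual MySQL answer to "proper INDEXes" for text search. A hedged sketch with placeholder names (note that MySQL ignores very short words by default, see ft_min_word_len):

```php
<?php
// Placeholder table/column names; FULLTEXT needs MyISAM or MySQL 5.6+ InnoDB.
// One-off setup (run once in MySQL):
//   ALTER TABLE lines ADD FULLTEXT INDEX ft_content (content);
$term = trim($_GET['q'] ?? '');            // apply the same minimum-length guard as above
$term = str_replace(['+', '-', '"', '*', '(', ')'], ' ', $term);   // strip boolean-mode operators

$pdo  = new PDO('mysql:host=localhost;dbname=livesearch;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare(
    'SELECT content FROM lines WHERE MATCH(content) AGAINST (:q IN BOOLEAN MODE) LIMIT 50'
);
$stmt->execute([':q' => $term . '*']);     // trailing * turns the last word into a prefix match
$rows = $stmt->fetchAll(PDO::FETCH_COLUMN);
```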
In the past, I have used Zend Search Lucene with great success.
It is a general-purpose text search engine written entirely in PHP 5. It manages the indexing of your sources and is quite fast (in my experience). It supports many query types, search fields, and search ranking.
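From memory, the Zend Framework 1 API looks roughly like the sketch below; the index path and field names are placeholders.

```php
<?php
// Sketch of the Zend Framework 1 API (Zend_Search_Lucene), from memory;
// index path and field names are placeholders.
require_once 'Zend/Search/Lucene.php';

// Build (or rebuild) the index.
$index = Zend_Search_Lucene::create('/var/data/lucene-index');
$doc   = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'Three-bedroom house'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('body', 'Full description text...'));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', '/listing/12345'));
$index->addDocument($doc);
$index->commit();

// Query it later.
$index = Zend_Search_Lucene::open('/var/data/lucene-index');
foreach ($index->find('three bedroom') as $hit) {
    echo $hit->score, ' ', $hit->title, ' ', $hit->url, PHP_EOL;
}
```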
I am currently working on an existing website that lists products, there are currently a little over 500 products.
The website has a text file for every product, and I want to add a search option. I'm thinking of reading all the text files once a day and creating an XML document with the values that can then be searched.
The client indicated that they want to keep adding products and are used to adding them via the text files. There might be over 5,000 products in the future, so I think it's best to do this with MySQL. That means importing the current products and creating a CRUD page for products.
Does anyone have experience with a PHP website that does not use MySQL? Is it possible to keep adding text files and just index them once a day, even if it would mean having over 5,000 products?
5,000 seems like an amount that's still manageable to index with a daily cron job. As long as you don't plan on searching them in real time, it should work. It's not ideal, but it would work.
Yes, it is very much possible, but NOT advisable, to use flat files for this kind of work.
If you do stick with files, it would be better to use XML instead of plain TXT for the job; 5,000 products, depending on what data is associated with them, might create problems in the future.
PS
Why not MySQL?
MySQL was made because file-based databases are slow and error-prone.
Just use MySQL. If you want to keep your old txt-based data, build a simple script that imports each file one by one and fills the corresponding tables in your SQL database (a rough sketch follows below).
Good luck.
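The import script mentioned above could be as small as this sketch; the one-"key: value"-per-line file layout and the `products` table are guesses about the existing setup, so adapt the parsing to the real file format.

```php
<?php
// Rough one-off import: read each product text file and insert it into MySQL.
// File layout and table schema below are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=shop;charset=utf8', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);
$insert = $pdo->prepare(
    'INSERT INTO products (slug, name, price, description) VALUES (?, ?, ?, ?)'
);

foreach (glob('/var/www/products/*.txt') as $file) {
    $fields = [];
    foreach (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        if (strpos($line, ':') === false) {
            continue;                                 // skip lines that are not "key: value"
        }
        list($key, $value) = array_map('trim', explode(':', $line, 2));
        $fields[strtolower($key)] = $value;
    }
    $insert->execute([
        basename($file, '.txt'),                      // use the filename as the product slug
        $fields['name']        ?? '',
        $fields['price']       ?? 0,
        $fields['description'] ?? '',
    ]);
}
```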
It's possible; however, if this is anything more than a simple online catalogue, then managing transactional integrity with flat files is horrendously difficult, and the fact that you're even asking the question implies that you are not in a good position to implement the kind of controls required. And as you've already discovered, it doesn't make for easy searching. (BTW: MySQL's fulltext indexing is a very blunt instrument; it's not a huge amount of effort to implement an effective search engine yourself, or there are excellent ones available off the shelf, e.g. mnogosearch.)
(As an incidental point, why XML? It makes managing the data much more complicated than it needs to be.)
and create a crud page for products
Why? If the client wants to maintain the data via file uploads and you already need to port the data, then just use the same interface - where the data is stored is not relevant just now.
If there are issues with hosting + MySQL, then using SQLite gives most of the benefits (although it won't scale as well).
Wonder if anyone has any advice on a project I might be picking up.
It's a bit vague at the moment, but I think it's basically a scientific app: you supply a product, which is then checked against a database of proteins to see if it has a reaction.
From the information I did get (this could really be any scenario, and it sounds quite simple dev-wise): the user fills out a form on the site, which is then checked in some way against a DB of other, similar attributes to see if it's suitable, and the submission is also added to that DB at the end (so the DB potentially grows as each user does a check).
The part I'm wondering about is that this could get potentially huge (e.g. 10,000+ records, effectively unlimited), so would anything be gained from building a custom PHP app to handle all this? In other words, would WordPress not be suitable as the backend, and should I be catering for time-intensive queries?
Thanks for looking
No.
See ticket http://core.trac.wordpress.org/ticket/9864 (Performance issues with large number of pages). I filed this ticket 3+ years ago, and there is still no resolution. Since that time, the code in question has become even more complicated and started using a heavier version of the internal query library.
Those severe issues are with pages, but posts are equally taxing in queries. And if you have meta-data, it taxes the server even further. To top it off, most leading caching plugins for WP exclude web robots, so every few weeks when Google, Baidu and Yandex start hammering your machine for the 10k pages, it'll bring the machine down.
This just means that you can't use it natively for large content sets, but you can still use most of WordPress with additional code customizations for parts that'll be outside of the WP database/construct.
Edit: to clarify, what I was saying is that solely using WP's native database structure (as defined by this schema in the Codex) and query_posts() / WP_Query() to perform the queries is the inefficiency I was referring to. The native WP storage/query system doesn't handle large volumes of pages/posts very efficiently. However, bypassing some of the native functionality will likely work fine.
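To illustrate the "outside the WP construct" route: keep the large record set in its own table and query it through $wpdb rather than WP_Query. Everything below (table name, columns, the $submitted_* values taken from the user's form) is hypothetical.

```php
<?php
// Sketch: a custom table queried through $wpdb instead of WP_Query.
// Table name, columns and the $submitted_* variables are hypothetical.
global $wpdb;
$table = $wpdb->prefix . 'protein_checks';

$matches = $wpdb->get_results(
    $wpdb->prepare(
        "SELECT id, protein, reaction FROM {$table} WHERE attribute_hash = %s LIMIT 20",
        $submitted_hash
    )
);

// Add the new submission so the data set grows with each check.
$wpdb->insert($table, [
    'protein'        => $submitted_protein,
    'attribute_hash' => $submitted_hash,
    'created_at'     => current_time('mysql'),
]);
```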
A company we do business with wants to give us a 1.2 GB CSV file every day containing about 900,000 product listings. Only a small portion of the file changes every day, maybe less than 0.5%, and it's really just products being added or dropped, not modified. We need to display the product listings to our partners.
What makes this more complicated is that our partners should only be able to see product listings available within a 30-500 mile radius of their zip code. Each product listing row has a field for what the actual radius for the product is (some are only 30, some are 500, some are 100, etc. 500 is the max). A partner in a given zip code is likely to only have 20 results or so, meaning that there's going to be a ton of unused data. We don't know all the partner zip codes ahead of time.
We have to consider performance, so I'm not sure what the best way to go about this is.
Should I have two databases- one with zip codes and latitude/longitude and use the Haversine formula for calculating distance...and the other the actual product database...and then what do I do? Return all the zip codes within a given radius and look for a match in the product database? For a 500 mile radius that's going to be a ton of zip codes. Or write a MySQL function?
We could use Amazon SimpleDB to store the database...but then I still have this problem with the zip codes. I could make two "domains" as Amazon calls them, one for the products, and one for the zip codes? I don't think you can make a query across multiple SimpleDB domains, though. At least, I don't see that anywhere in their documentation.
I'm open to some other solution entirely. It doesn't have to be PHP/MySQL or SimpleDB. Just keep in mind our dedicated server is a P4 with 2 GB of RAM. We could upgrade the RAM; it's just that we can't throw a ton of processing power at this. Or we could store and process the database every night on a VPS somewhere, where it wouldn't matter if the VPS were unbearably slow while that 1.2 GB CSV is being processed. We could even process the file offline on a desktop computer and then remotely update the database every day... except then I still have this problem with zip codes and product listings needing to be cross-referenced.
You might want to look into PostgreSQL and PostGIS. They offer features similar to MySQL's spatial indexing, without the need to use MyISAM (which, in my experience, tends to become corrupt, unlike InnoDB).
In particular Postgres 9.1, which allows k-nearest-neighbour search queries using GiST indexes.
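A hedged sketch of what that could look like, assuming a `products` table with an SRID-4326 point column `geom` and a GiST index on it; it is run through PDO so it slots into a PHP site.

```php
<?php
// Hedged sketch against PostgreSQL/PostGIS via PDO. The products table, its
// SRID-4326 point column `geom` and the GiST index are assumptions.
// One-off setup (psql):  CREATE INDEX products_geom_gist ON products USING GIST (geom);

$pdo = new PDO('pgsql:host=localhost;dbname=catalog', 'user', 'pass');

// $lat / $lon come from the partner's zip-code lookup; radius converted to metres.
$point        = sprintf('SRID=4326;POINT(%F %F)', $lon, $lat);
$radiusMetres = $radiusMiles * 1609.34;

$sql = "SELECT id, name,
               ST_Distance(CAST(geom AS geography), ST_GeogFromText(?)) / 1609.34 AS miles
          FROM products
         WHERE ST_DWithin(CAST(geom AS geography), ST_GeogFromText(?), ?)
         ORDER BY geom <-> ST_GeomFromEWKT(?)    -- KNN ordering, served by the GiST index
         LIMIT 50";

$stmt = $pdo->prepare($sql);
$stmt->execute([$point, $point, $radiusMetres, $point]);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

// For an index-assisted radius filter, index the geography cast (or store a
// geography column) instead of casting on the fly as above.
```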
Well, that is an interesting problem indeed.
This is actually two issues: first, how you should index the databases, and second, how you keep them up to date. The first you can achieve as you describe, though normalization may or may not be a problem depending on how you are storing the zip code; that primarily comes down to what your data looks like.
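For the indexing side, here is a sketch of the zip-code/haversine query from the question, with made-up table and column names (`zips(zip, lat, lon)`, `products(id, name, zip_code, radius_miles)`); 3959 is the Earth's radius in miles.

```php
<?php
// Sketch of the zip-code / haversine lookup from the question, using made-up
// names: zips(zip, lat, lon) and products(id, name, zip_code, radius_miles).
// $pdo is a PDO MySQL connection; $lat / $lon are the partner's coordinates,
// looked up once from the zips table.
$sql = "SELECT d.* FROM (
            SELECT p.id, p.name, p.radius_miles,
                   3959 * 2 * ASIN(SQRT(
                       POW(SIN(RADIANS(z.lat - ?) / 2), 2) +
                       COS(RADIANS(?)) * COS(RADIANS(z.lat)) *
                       POW(SIN(RADIANS(z.lon - ?) / 2), 2)
                   )) AS miles
              FROM products p
              JOIN zips z ON z.zip = p.zip_code
        ) AS d
        WHERE d.miles <= d.radius_miles   -- each listing carries its own 30-500 mile radius
        ORDER BY d.miles
        LIMIT 100";

$stmt = $pdo->prepare($sql);
$stmt->execute([$lat, $lat, $lon]);
$partnerListings = $stmt->fetchAll(PDO::FETCH_ASSOC);
```

Adding a bounding-box pre-filter (simple BETWEEN ranges on z.lat and z.lon) inside the derived table lets MySQL use an index before the trigonometry runs across all 900,000 rows.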
As for the second one, this is more my area of expertise. Have your client upload the CSV to you as they currently do, keep a copy of yesterday's file, and run the two through a diff utility, or leverage Perl, PHP, Python, Bash or any other tool you have to find the lines that have changed. Pass those into a second step that updates your database. I have dealt with clients with issues along these lines, and scripting it away tends to be the best choice. If you need help organizing your script, that help is always available.