Before anything, this is not necessarily a question, but I really want to know your opinion about the performance and possible problems of this "mode" of search.
I need to create a really complex search across multiple tables with lots of filters, ranges and rules... and I realized that I could do something like this:
1. Submit the search form
2. Internally I run every filter and logic step by step (this may take a few seconds)
3. After I find all the matching records (the result that I want), I create a record in my searches table, generating a token for this search (based on the search params) like 86f7e437faa5, and save all the matching record IDs
4. Redirect the visitor to a page like mysite.com/search?token=86f7e437faa5
5. On the results page, I only need to figure out which search I'm dealing with and paginate the result IDs (retrieved from the searches table)
This will make refreshes & pagination much faster, since I don't need to run all the search logic on every pageview. And if the user changes a filter or search criterion, I go back to step 2 and generate a new search token.
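For illustration, here is a minimal sketch of how the results page could work; the searches table layout (token, result_ids, created_at) and the load_records() helper are just placeholders, not something I have built yet:

<?php
// results page: mysite.com/search?token=86f7e437faa5
$token   = $_GET['token'];
$page    = isset($_GET['page']) ? max(1, (int) $_GET['page']) : 1;
$perPage = 20;

$stmt = $pdo->prepare('SELECT result_ids FROM searches WHERE token = ? AND created_at > NOW() - INTERVAL 1 HOUR');
$stmt->execute(array($token));
$row = $stmt->fetch();

if (!$row) {
    // token unknown or expired: send the user back to re-run the search (step 2)
    header('Location: /search-form');
    exit;
}

$ids     = explode(',', $row['result_ids']);                      // e.g. "12,57,91,..."
$pageIds = array_slice($ids, ($page - 1) * $perPage, $perPage);
$records = load_records($pageIds);                                 // hypothetical helper that SELECTs by ID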
I never saw a tutorial or anything about this, but I think that's what some forums like BBForum or Invision do with search, right? After the search I'm redirected to something like search.php?id=1231 (I don't see the search params in the URL or inside the POST args).
This "token" will no last longer than 30min~1h.. So the "static search" is just for performance reasons.
What do you think about this? Will it work? Any considerations? :)
Your system can have a special token like 86f7e437faa5 and cache search requests. It's a very useful mechanism for system efficiency and scalability.
But the user should still see all the parameters, in accordance with usability principles.
So generating a hash of the parameters on the fly on the server side would be a good solution. The system checks for the existence of the generated hash in the searches table and returns the result if found.
If the hash is not found, the system runs the query against the base tables and saves the new result into the searches table.
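A minimal sketch of that flow could look like this (the searches table columns and the run_full_search() helper are assumptions for illustration):

<?php
// hash the normalized search parameters so identical searches share a token
ksort($params);
$token = substr(sha1(serialize($params)), 0, 12);    // e.g. 86f7e437faa5

$stmt = $pdo->prepare('SELECT result_ids FROM searches WHERE token = ?');
$stmt->execute(array($token));
$row = $stmt->fetch();

if ($row === false) {
    // not cached yet: run the expensive multi-table search and store the result
    $ids = run_full_search($params);                 // hypothetical heavy query
    $ins = $pdo->prepare('INSERT INTO searches (token, result_ids, created_at) VALUES (?, ?, NOW())');
    $ins->execute(array($token, implode(',', $ids)));
}

header('Location: /search?token=' . $token);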
Seems logical enough to me.
Having said that, given the description of your application, have you considered using Sphinx? Regardless of the number of tables and/or filters and/or rules, all that time-consuming work is in the indexing, and it is done beforehand/behind the scenes. The filtering/rules/fields/tables are all handled quickly and on the fly after the fact.
So, similar to your situation, Sphinx could give you your set of IDs very quickly, since all the hard work was pre-done.
TiuTalk,
Are you considering keeping searches saved in your "searches" table? If so, remember that your param-based generated token will remain the same for a given set of parameters, lasting over time. If your search base is frequently altered, you can't rely on saved searches, as they may return outdated results. Otherwise, it seems a good solution overall.
I'd rather base the token on the user session. What do you think?
#g0nc1n
Sphinx seems to be a nice solution if you have control of your server (on a VPS, for example).
If you don't, and a simple full-text search isn't enough for you, I guess this is a nice solution. But it doesn't seem so different to me from a paginated search with caching. It does seem better than a paginated search with simple URL-based caching. But you still have the problem of the searches remaining static, so I recommend you flush the saved searches from time to time.
Related
I've recently started learning Redis and am currently building an app using it as the sole datastore, and I'd like to check with other Redis users whether some of my conclusions are correct, as well as ask a few questions. I'm using phpredis, if that's relevant, but I guess the questions should apply to any language as it's more of a pattern thing.
As an example, consider a CRUD interface to save websites (name and domain) with the following requirements:
Check for existing names/domains when saving/validating a new site (duplicate check)
Listing all websites with sorting and pagination
I have initially chosen the following "schema" to save this information:
A key "prefix:website_ids" in which I use INCR to generate new website id's
A set "prefix:wslist" in which I add the website id generated above
A hash for each website "prefix:ws:ID" with the fields name and website
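As a rough phpredis sketch of that schema (the field values are made up), saving a new website could look like:

<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// generate a new ID, register it in the list set, and store the hash
$id = $redis->incr('prefix:website_ids');
$redis->sAdd('prefix:wslist', $id);
$redis->hMSet('prefix:ws:' . $id, array(
    'name'   => 'Example Site',
    'domain' => 'example.com',
));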
The saving/validation issue
With the above information alone I was unable (as far as I know) to check for duplicate names or domains when adding a new website. To solve this issue I've done the following:
Two sets with keys "prefix:wsnames" and "prefix:wsdomains", to which I also SADD the website name and domain respectively.
This way, when adding a new website I can check if the submitted name or domain already exist in either of these sets with SISMEMBER and fail the validation if needed.
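Continuing the sketch above, the validation step might be something like:

<?php
// fail validation if the name or domain is already taken
if ($redis->sIsMember('prefix:wsnames', $name) || $redis->sIsMember('prefix:wsdomains', $domain)) {
    die('Duplicate name or domain');
}

// otherwise register them so future checks see this website too
$redis->sAdd('prefix:wsnames', $name);
$redis->sAdd('prefix:wsdomains', $domain);

Note that this check-then-add is not atomic; under concurrent writes, a WATCH-based transaction or a Lua script would be safer.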
Now, if I'm saving data with 50 fields instead of just 2 and want to prevent duplicates, I'd have to create a similar set for each of the fields I want to validate.
QUESTION 1: Is the above a common pattern to solve this problem or is there any other/better way people use to solve this type of issue?
The listing/sorting issue
To list websites and sort by name or domain (ascending or descending) as well as limiting results for pagination I use something like:
SORT prefix:wslist BY prefix:ws:*->name ALPHA ASC LIMIT 0 10
This gives me 10 website IDs ordered by name. Now, to get these results, I came up with the following options (examples in PHP):
Option 1:
$wslist = $redis->sort('prefix:wslist', array(
    'by' => 'prefix:ws:*->name', 'alpha' => true, 'sort' => 'asc', 'limit' => array(0, 10)
));
$websites = array();
foreach ($wslist as $ws) {
    $websites[$ws] = $redis->hGetAll('prefix:ws:'.$ws);
}
The above gives me a usable array with website IDs as keys and an array of fields as values. Unfortunately, this has the problem that I'm doing multiple requests to Redis inside a loop, and common sense (at least coming from RDBMSs) tells me that's not optimal.
The better way would seem to be to use Redis pipelining/multi and send all the requests in a single go:
Option 2:
$wslist = $redis->sort('prefix:wslist', array(
    'by' => 'prefix:ws:*->name', 'alpha' => true, 'sort' => 'asc', 'limit' => array(0, 10)
));
$redis->multi();
foreach ($wslist as $ws) {
    $redis->hGetAll('prefix:ws:'.$ws);
}
$websites = $redis->exec();
The problem with this approach is that now I don't get each website's respective ID unless I loop over the $websites array again to associate each one. Another option would be to also save an "id" field with the respective website ID inside the hash itself, along with name and domain.
QUESTIONS 2/3: What's the best way to get these results in a usable array without having to loop multiple times? Is it correct or good practice to also save the id number as a field inside the hash just so I can also get it with the results?
Disclaimer: I understand that the coding and schema building paradigms when using a key->value datastores like Redis are different from RDBMs and document stores and so notions of "best way to do X" are likely to be different depending on the data and application at hand.
I also understand that Redis might not even be the most suitable datastore to use in mostly CRUD type apps but I'd still like to get any insights from more experienced developers since CRUD interfaces are very common on most apps.
Answer 1
Your proposal looks pretty common. I'm not sure why you need an auto-incrementing ID though. I imagine the domain name has to be unique, or the website name has to be unique, or at the very least the combination of the two has to be unique. If this is the case it sounds like you already have a perfectly good key, so why invent an integer key when you don't need it?
Having a SET for domains and a SET for website names is a perfect solution for quickly checking to see if a specific domain or website name already exists. Though, if one of those (domain or website name) is your key you might not even need these SETs since you could just look if the key prefix:ws:domain-or-ws-name-here exists.
Also, using a HASH for each website so you can store your 50 fields of details for the website inside is perfect. That is what hashes are for.
Answer 2
First, let me point out that if your websites and domain names are stored in SORTED SETs instead of SETs, they will already be alphabetized (assuming they are all given the same score). If you are trying to support other sort options this might not help much, but I wanted to point it out.
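For example (just a sketch): when every member gets the same score, a sorted set orders them lexicographically, so a page of names is a single range call:

<?php
$redis->zAdd('prefix:wsnames', 0, 'Another Site');
$redis->zAdd('prefix:wsnames', 0, 'Example Site');

// first 10 names, already alphabetized (equal scores sort by the member string)
$firstPage = $redis->zRange('prefix:wsnames', 0, 9);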
Your Option 1 and Option 2 are actually both relatively reasonable. Redis is lightning fast, so Option 1 isn't as unreasonable as it seems at first. Option 2 is clearly more optimal from Redis's perspective, since all the commands will be buffered and executed at once. Though, as you noted, it will require additional processing in PHP afterwards if you want the array to be indexed by the ID.
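That extra processing can be a one-liner, though; since exec() returns the replies in the same order as the queued commands, you can (for example) re-key the result with array_combine:

<?php
// $wslist holds the IDs returned by SORT; exec() returns the hashes in the same order
$websites = array_combine($wslist, $redis->exec());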
There is a third option: Lua scripting. You can have Redis execute a Lua script that returns both the IDs and the hash values in one shot. But, not being super familiar with PHP anymore or with how Redis's multi-bulk replies map to PHP arrays, I'm not 100% sure what the Lua script would look like. You'll need to look for examples or do some trial and error. It should be a pretty simple script, though.
Conclusion
I think redis sounds like a decent solution for your problem. Just keep in mind the dataset needs to always be small enough to keep in memory. If that's not really a concern (unless your fields are huge, you should be able to fit thousands of websites into only a few MB) or if you don't mind having to upgrade your RAM to grow your DB, then Redis is perfectly suitable.
Be familiar with the various persistence options and configurations for redis and what they mean for availability and reliability. Also, make sure you have a backup solution in place. I would recommend having both a secondary redis instance that slaves off of your main instance, and a recurring process that backs up your redis database file at least daily.
I've been looking at different ways to implement an instant text search on my web application; at the moment it uses a very basic SQL LIKE query with wildcards.
I have looked at many ways to implement searches, but I never saw anyone suggest to do the following:
As the user types, when the query gets to 4 or 5 characters, perform the database search.
Display the results to the user, and as they continue typing, just use Javascript to filter the results, so no more database calls are required.
This way there would only ever be one database call per search; if the user makes a typo, they can backspace and JavaScript would take care of displaying the correct results.
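On the server side, the single lookup could be as simple as this sketch (the table, columns and $pdo connection are placeholders; the follow-up filtering would be plain JavaScript on the client):

<?php
// search.php - called once, when the typed prefix reaches 4 characters
$q = isset($_GET['q']) ? trim($_GET['q']) : '';
if (strlen($q) < 4) {
    exit(json_encode(array()));
}

$stmt = $pdo->prepare('SELECT id, title FROM articles WHERE title LIKE ? LIMIT 200');
$stmt->execute(array($q . '%'));

header('Content-Type: application/json');
echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC));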
Are there any downsides to this method?
This seems to work in theory, but I personally prefer either pressing enter or waiting 500 milliseconds of inactivity before searching.
One thing that may cause an extra DB query is the user backspacing below your threshold (4 characters in your case) and typing again.
But I suppose the real downside would be extra JS coding + still needing the PHP coding.
I'm programming a search engine for my website in PHP, SQL and jQuery. I have experience in adding autocomplete with existing data in the database (i.e. searching article titles). But what if I want to use the most common search queries that users type, something similar to what Google has, without having so many users to contribute to the creation of the data (most common queries)? Is there some kind of open-source SQL table with autocomplete data in it, or something similar?
For now, use the static data that you have for autocomplete.
Create another table in your database to store the actual user queries. The schema of the table can be <queryID, query, count>, where count is incremented each time the same query is supplied by another user (a kind of rank). N-gram index the queries (so that you could also autocomplete something like "Manchester United" when a person just types "United", i.e. not just matching from the start of the string) and simply return the top N after sorting by count.
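A rough sketch of the logging and lookup side (MySQL-flavoured; the search_queries table and its columns are placeholders, and the n-gram indexing is left out):

<?php
// record a query (assumes a UNIQUE index on the query column)
$stmt = $pdo->prepare('INSERT INTO search_queries (query, `count`) VALUES (?, 1)
                       ON DUPLICATE KEY UPDATE `count` = `count` + 1');
$stmt->execute(array($userQuery));

// suggest the most popular stored queries containing what has been typed so far
$stmt = $pdo->prepare('SELECT query FROM search_queries WHERE query LIKE ? ORDER BY `count` DESC LIMIT 10');
$stmt->execute(array('%' . $typedSoFar . '%'));
$suggestions = $stmt->fetchAll(PDO::FETCH_COLUMN);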
This query table will gradually keep improving as your user base grows.
One more thing: the algorithm for accomplishing your task is pretty simple. However, the real challenge lies in returning the data to be displayed in a fraction of a second. So when your query database/store grows in size, you can use a search engine like Solr/Sphinx to do the searching for you, which will be pretty fast in returning the results to be rendered.
You can use the Lucene search engine for this functionality. Refer to this link,
or you may also take a look at Lucene Solr Autocomplete...
Google has (and keeps growing) thousands of entries, which are arranged according to day, time, geolocation, language, and so on, and the list increases with users' entries. Whenever a user types a word, the system checks the table of "most used words for that location + day + time" and, if there is no answer, then "general words". So you should categorize every word entered by users, or build a general word-relation table in your database, where the most suitable search answer will be referenced.
Yesterday I stumbled on something that answered my question. Google draws autocomplete suggestions from this XML file, so it is wise to use it if you have too few users to create your own database of keywords:
http://google.com/complete/search?q=[keyword]&output=toolbar
Just replacing [keyword] with some word will give suggestions about that word; then the task is just to parse the returned XML and format the output to suit your needs.
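For example, a quick-and-dirty parse with SimpleXML could look like the snippet below (this assumes the feed returns CompleteSuggestion elements with a data attribute on each suggestion; check the actual response before relying on it):

<?php
$keyword = urlencode('manchester');
$xml = simplexml_load_file('http://google.com/complete/search?q=' . $keyword . '&output=toolbar');

$suggestions = array();
foreach ($xml->CompleteSuggestion as $item) {
    $suggestions[] = (string) $item->suggestion['data'];
}

print_r($suggestions);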
I'm in the design phase of a website and I have a solution for a feature, but I don't know if it will be a good one when the site, hopefully, grows. I want users to be able to perform searches for other users, and the results they find must be ordered: first the "spotlighted" users, then all the rest. The results must be ordered randomly, respecting the previously mentioned order, and with pagination.
One of the solutions I have in mind is to store the query results in a session variable on the server side. For performance, this variable is destroyed when the user leaves the search.
What will happen when the site has thousands of users and thousands of searches are performed every day? Will my solution be viable, or will the server be overloaded?
I have other solutions in mind, like an intermediate table into which users are dumped n times a day in the mentioned order. This way there is no need to create a big array in the user's session, and pagination is done via multiple queries against the database.
Although I appreciate any suggestions, I'm especially interested in hearing opinions from developers seasoned in high-traffic sites.
(The technology employed is LAMP, with InnoDB tables)
Premature optimization is bad, but you should be planning ahead. You don't need to implement it yet, but prepare yourself.
If there are thousands of users running searches every day, then caching the query result in the session is not a good idea, because the same result would be cached separately for some users while others would still need to execute the query. For such a case I'd recommend saving the search result in a user-independent store (file, memory, etc.):
For each search query, save the result, creation date and last access date to disk or in memory.
If any user searches the same query, serve the result from the cache.
Run a cron job that invalidates the cache after some time.
This way, frequent searches will be promptly available most of the time. It also reduces the load on your database.
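A bare-bones sketch of such a user-independent file cache (the cache path, lifetime and run_user_search() helper are made up for illustration; a cron job would sweep files older than the lifetime):

<?php
$cacheKey  = md5(serialize($searchParams));           // same params => same cache file
$cacheFile = '/var/cache/mysite/search_' . $cacheKey;
$maxAge    = 1800;                                     // treat entries older than 30 minutes as stale

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
    $results = unserialize(file_get_contents($cacheFile));
} else {
    $results = run_user_search($searchParams);         // hypothetical expensive query
    file_put_contents($cacheFile, serialize($results));
}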
This is definitely not the answer you are looking for, but I have to say it.
Premature Optimization is the root of all evil.
Get that site up with a simple implementation of that query and come back and ask if that turns out to be your worst bottleneck.
I'm assuming you want to reduce the load on the DB by caching search results so other users searching with the same set of criteria don't have to hit the DB again, especially for very loose query strings on non-indexed fields. If so, you can't store it in a session; that's only available to the single user.
I'd use a caching layer like Cache_Lite and cache the result set from the DB query based on the query string (not the SQL query, but the search parameters from your site). That way identical searches will be cached. Handle the sorting and pagination of the array in PHP, not in the DB.
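With PEAR's Cache_Lite that could look roughly like this sketch (the cache directory, lifetime and fetch_results() helper are placeholders):

<?php
require_once 'Cache/Lite.php';

$cache   = new Cache_Lite(array('cacheDir' => '/tmp/search_cache/', 'lifeTime' => 1800));
$cacheId = md5(serialize($searchParams));              // key the cache on the search parameters

if (($data = $cache->get($cacheId)) !== false) {
    $results = unserialize($data);
} else {
    $results = fetch_results($searchParams);            // hypothetical DB query
    $cache->save(serialize($results), $cacheId);
}

// sorting and pagination handled in PHP
$pageRows = array_slice($results, ($pageNum - 1) * 20, 20);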
I'd like to find a way to take a piece of user supplied text and determine what addresses on the map are mentioned within the text. I'd be happy to use a free web service if it exists or use a script which will not consume too many resources.
One way I can imagine doing this is taking a gigantic database of addresses and searching for each of them individually in the text, but this does not seem efficient. Is there a better algorithm or technique one can suggest?
My basic idea is to take the location information and turn it into markers on a Google Map. If it is too difficult or CPU intensive to determine the locations automatically, I could require users to add information in a location field if necessary but I would prefer not to do this as some of the users are going to be quite young students.
This needs to be done in PHP as that is the scripting language available on my school hosted server.
Note this whole set-up will happen within the context of a Drupal node, and I plan on using a filter to collect the necessary location information from the individual node, so this parsing would only happen once (when the new text enters the database).
You could get something like OpenCalais to tag your text. One of the categories it returns is "city"; you could then use another third-party module to show the location of the city.
If you did have a gigantic list of locations in a relational database, and you're only concerned with 500 to 1000 words, then you could definitely just issue the SQL command to find matches for those 500-1000 words, and it would be quite efficient.
But even if you did have to call a slow API, you could feasibly request the 500 words one by one. If you kept a cache of the matches, the cache would probably quickly fill up with all the stop words (you know, like "the", "if" and "and"), and then, using the cache, you would likely be searching far fewer than 500 words each time.
I think you might be surprised at how fast the brute force approach would work.
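A sketch of that brute-force idea in PHP (the locations table, its columns and the $pdo handle are assumptions; multi-word place names would need phrase handling on top of this):

<?php
// split the submitted text into unique lowercase words and look them all up in one query
$words = array_unique(str_word_count(strtolower($text), 1));

$placeholders = implode(',', array_fill(0, count($words), '?'));
$stmt = $pdo->prepare("SELECT name, latitude, longitude FROM locations WHERE name IN ($placeholders)");
$stmt->execute(array_values($words));

$markers = $stmt->fetchAll(PDO::FETCH_ASSOC);   // these become pins on the Google Map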
For future reference I would just like to mention the Yahoo API called Placemaker and the service GeoMaker that is built on top of it.
Those tools can be used to parse locations out of a text, as requested here. Unfortunately, no Drupal module seems to exist right now, but a custom solution seems easy to code.