Creating a smart predictive search for a website - PHP

I am trying to write a predictive search system for a website I am making.
The finished functionality will be a lot like this:
I am not sure of the best way to do this, but here is what I have so far:
Searches Table:
id - term - count
Every time a search is made it is inserted into the searches table.
When a user enters a character into the search input, the following occurs:
The page makes an AJAX request to a search PHP file
The PHP file connects to MySQL database and executes a query: SELECT * FROM searches WHERE term LIKE 'x%' AND count >= 10 ORDER BY count DESC LIMIT 10 (x = text in search input)
The 10 top results based on past search criteria are then listed on the page
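A minimal sketch of that endpoint, assuming a PDO connection (the DSN and credentials are placeholders) and the searches(id, term, count) table from the question:

<?php
// suggest.php - returns up to 10 popular past searches matching the typed prefix.
$pdo = new PDO('mysql:host=localhost;dbname=mysite;charset=utf8mb4', 'dbuser', 'dbpass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
header('Content-Type: application/json');

$prefix = isset($_GET['q']) ? trim($_GET['q']) : '';
if ($prefix === '') {
    echo json_encode([]);
    exit;
}

// Prepared statement: the typed text is bound as "x%", never concatenated into the SQL.
$stmt = $pdo->prepare(
    "SELECT term FROM searches
     WHERE term LIKE :prefix AND `count` >= 10
     ORDER BY `count` DESC
     LIMIT 10"
);
$stmt->execute([':prefix' => $prefix . '%']);

echo json_encode($stmt->fetchAll(PDO::FETCH_COLUMN));

On the client side this is just an AJAX GET to suggest.php?q=<input value>, rendering the returned JSON array as the suggestion list.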
This solution is far from perfect. If any random person searches for the same term 10 times, it will then show up as a recommended search (if somebody were to search for a term starting with the same characters). In other words, if somebody searched "poo poo" 10 times and then someone on the site typed "po" looking for potatoes, they would see "poo poo" as a popular search. This is not cool.
A few ideas to get around this come to mind. For example, I could limit each insert into the searches table by the user's IP address. This doesn't fully solve the problem though: if the user has a dynamic IP address, they could just restart their modem and perform the search 10 times on each IP address. Sure, the number of times a term has to be entered could remain a secret, so it is a little more secure.
I suppose another solution would be to add a blacklist to stop words like "poo poo" from showing up.
My question is, is there a better way of doing this or am I moving along the right lines? I would like to write code that is going to allow this to scale up.
Thanks

You are on the right track.
What I would do:
Store every query uniquely. Add a table tracking each IP for that search term and only update your count once per IP (see the sketch after this list).
If a certain new/unique keyword gets upcounted more than X times in a given period of time, have your system mail you or your admin so you have the opportunity to blacklist the keyword manually. This has to be manual because some hot topic might also show this behavior.
This is the most interesting one: once the query is complete, check the number of results. It is pointless to suggest keywords that give no results, so only suggest queries that will return at least X results. Queries like "poo poo" will give no results, so they won't show up in your suggestion list.
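A sketch of the first suggestion (count once per IP), assuming a unique key on searches.term plus an extra search_ips(term_id, ip) table with a unique key on (term_id, ip); both of those keys and the extra table are my own illustration, not part of the question's schema:

<?php
// Count a search once per IP: only bump `count` when this IP has not searched
// this term before. Assumes searches(id, term, count) with UNIQUE(term) and an
// illustrative search_ips(term_id, ip) table with UNIQUE(term_id, ip).
function recordSearch(PDO $pdo, string $term, string $ip): void
{
    // Create the term row if it does not exist yet (ignored if it does).
    $pdo->prepare("INSERT IGNORE INTO searches (term, `count`) VALUES (:term, 0)")
        ->execute([':term' => $term]);

    $stmt = $pdo->prepare("SELECT id FROM searches WHERE term = :term");
    $stmt->execute([':term' => $term]);
    $id = $stmt->fetchColumn();

    // INSERT IGNORE does nothing if this IP has already been recorded for the term.
    $stmt = $pdo->prepare("INSERT IGNORE INTO search_ips (term_id, ip) VALUES (:id, :ip)");
    $stmt->execute([':id' => $id, ':ip' => $ip]);

    if ($stmt->rowCount() === 1) {   // first time this IP searched this term
        $pdo->prepare("UPDATE searches SET `count` = `count` + 1 WHERE id = :id")
            ->execute([':id' => $id]);
    }
}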
I hope this helps. Talk to me further in chat if you have questions :)

For example, you could add a new boolean column called validate and avoid using a blacklist. If validate is false, the term does not appear in the recommended list.
This field can be adjusted manually by an administrator (via a query or a back-office tool). You could add another column called audit, which stores the timestamp of the query. If the difference between the maximum and minimum timestamps exceeds a certain value, the validate field could default to false.
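For instance, the suggestion query could simply filter on that flag. A sketch reusing the column names proposed above, with $pdo and $prefix as in the earlier snippet:

<?php
// Only suggest terms an administrator has not flagged: validate = 1 by default,
// set to 0 manually (or via the audit timestamps) to hide a term.
$stmt = $pdo->prepare(
    "SELECT term FROM searches
     WHERE term LIKE :prefix
       AND `count` >= 10
       AND `validate` = 1
     ORDER BY `count` DESC
     LIMIT 10"
);
$stmt->execute([':prefix' => $prefix . '%']);
$suggestions = $stmt->fetchAll(PDO::FETCH_COLUMN);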
This solution is quick and easy to develop for your idea.
Regards and good luck.

Related

PHP/AJax Freemium Tools - How to Limit The Number of Uses Before Showing Results

I've done some searches for this and can't seem to find anything on it. I'm looking for a starting point to create freemium model tools.
The languages that I'll be using are PHP, Ajax and MySQL.
Here's what I would like to get done.
Any random user can use the free tools on the site, but after X number of uses, they are asked to register an account, otherwise, they can't use the tool for another 24 hours.
From what I've seen from other tools, it seems to be done through IP tracking and storing them in a DB. But I can see this getting pretty messy after hitting the millions of results.
Can anyone with experience provide guidance on how I can start limiting the number of uses? I just have no idea where to start at this point.
If they don't register with an email first, then the only solution I can think of is IP tracking. It doesn't have to get messy if you set it up right: you just need a table with a column for the IP, a column for a counter, and a column for the date and time.
Then, when you insert the data, you run another query at the same time to delete data older than 24 hours. Some people use the IP combined with device info.
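A rough sketch of that idea, assuming an illustrative tool_uses(ip, uses, last_used) table (table and column names are placeholders):

<?php
// Rate-limit anonymous tool use by IP: allow up to $maxUses per rolling 24 hours.
function canUseTool(PDO $pdo, string $ip, int $maxUses = 5): bool
{
    // Purge rows older than 24 hours, as suggested above.
    $pdo->exec("DELETE FROM tool_uses WHERE last_used < NOW() - INTERVAL 24 HOUR");

    $stmt = $pdo->prepare("SELECT uses FROM tool_uses WHERE ip = :ip");
    $stmt->execute([':ip' => $ip]);
    $uses = $stmt->fetchColumn();

    if ($uses === false) {
        $pdo->prepare("INSERT INTO tool_uses (ip, uses, last_used) VALUES (:ip, 1, NOW())")
            ->execute([':ip' => $ip]);
        return true;
    }
    if ((int)$uses >= $maxUses) {
        return false;   // over the limit: ask the visitor to register instead
    }
    $pdo->prepare("UPDATE tool_uses SET uses = uses + 1, last_used = NOW() WHERE ip = :ip")
        ->execute([':ip' => $ip]);
    return true;
}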

Reduce joined-table queries while performing pagination and order by?

Sorry, this may be a noob question, but I don't know how to search for this.
Use case
A full-site search function: when the user inputs a keyword and submits the form, the system searches both the title & content of forum posts, blog posts, and products. The search results for all those page types should be displayed in one single list with pagination. The user can also choose to order the results by relevance or recency.
What I did
I am using LAMP. I have data tables for those three page types, and I have made the title & content columns index keys.
I know that joining the tables is a very bad idea, so I make three separate queries for searching the forum, blog, and products. I get all the data into PHP, put it into arrays, and write a function that assigns a relevance value to every row of the search results. For recency, there is an "updateDate" column in all those tables, so that is OK.
Now I have three nice arrays. I can array_merge() them and sort() them easily. I can also render pagination with array_slice().
What makes me frown
Unnecessary performance waste. Yes, what I did handles everything in the use case, but (I don't know how to do better, I am a beginner) I am sure the performance could be a lot better.
After the first query, all the data we need has already been fetched from the database. But with my solution, whenever the user clicks to another page of the search results, or changes the "sort by" option, PHP starts over and does the [SQL query, relevance function, merge] all over again. Can I somehow store the result array somewhere, so the system can save some work on the next user action?
Most users will not click through every page of the search results. I would guess 90% of users stop looking after the 10th page, which means (maybe) the first 200 records. So, can I do anything to stop the SQL query at some point instead of fetching all results?
Furthermore, as traffic grows, some keywords may become common and get searched repeatedly; what can I do to reduce the repetition of those searches? (Please slap me if you think I am overthinking this.)
Thank you for reading. Please correct me if my concept is incorrect, or tell me if I missed something in this use case. Thank you, and may God's love be with you.
Edit : I am not using any php framework.
Giving you the full story would probably be like writing a book. Here are some extracted thoughts:
Fully blown page indicators cost you extra result-set counts - just present "Next" buttons, which can be built with select ... limit [nr_of_items_per_page+1] and then if(isset($result[nr_of_items_per_page])) output the next button (see the sketch after this list)
These days network traffic costs are not as high as ten years ago, and users demand more. Increase your nr_of_items_per_page to 100, 200, 500 (depending on the data size per record)
Zitty Yams' comments work out - I have loaded >10000 records in one go to a client and presented them piece by piece - it just rocks - e.g. a list of 10000 names with 10 characters on average makes just 100000 bytes. Most of the images you get on the net are bigger than that. Of course there are limits...
PHP caching via $_SESSION works as well - however, keep in mind that each byte reserved for PHP cannot be dedicated to the database (at least not on a shared server). As long as not all data in the database fits into memory, in most cases it is more efficient to extend database memory rather than increase PHP or OS caches.
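A sketch of the first point, the fetch-one-extra-row trick, assuming a FULLTEXT index on (title, content); the blog table and column names here are placeholders, with updateDate taken from the question:

<?php
// "Next" button without a separate COUNT(*): fetch one row more than the page
// size. If that extra row exists there is a next page; drop it before rendering.
$perPage = 100;
$page    = max(1, (int)($_GET['page'] ?? 1));
$offset  = ($page - 1) * $perPage;

$stmt = $pdo->prepare(
    "SELECT id, title, updateDate FROM blog
     WHERE MATCH(title, content) AGAINST (:kw)
     ORDER BY updateDate DESC
     LIMIT " . ($perPage + 1) . " OFFSET " . $offset
);
$stmt->execute([':kw' => $_GET['q'] ?? '']);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

$hasNext = count($rows) > $perPage;           // the extra row signals another page
$rows    = array_slice($rows, 0, $perPage);   // keep only the page to display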

Autocomplete concept

I'm programming a search engine for my website in PHP, SQL and jQuery. I have experience adding autocomplete with existing data in the database (i.e. searching article titles). But what if I want to use the most common search queries that users type, similar to what Google has, without having that many users to contribute to the creation of the data (most common queries)? Is there some kind of open-source SQL table with autocomplete data in it, or something similar?
For now, use the static data that you have for autocomplete.
Create another table in your database to store the actual user queries. The schema of the table can be <queryID, query, count>, where count is incremented each time the same query is submitted by another user (a kind of rank). N-gram index the queries (so that you can also autocomplete something like "Manchester United" when a person types just "United", i.e. not only matching on the starting string) and simply return the top N after sorting by count.
The above table will gradually keep improving as your user base grows.
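A sketch of that query log, assuming a user_queries(queryID, query, count) table with a unique key on query; the wildcard LIKE below is only a crude stand-in for a real n-gram index:

<?php
// Log each submitted query and suggest the most popular matches.
// Assumes user_queries(queryID, query, count) with UNIQUE(query).
function logQuery(PDO $pdo, string $q): void
{
    // Insert the query, or bump its count ("rank") if it was seen before.
    $pdo->prepare(
        "INSERT INTO user_queries (query, `count`) VALUES (:q, 1)
         ON DUPLICATE KEY UPDATE `count` = `count` + 1"
    )->execute([':q' => $q]);
}

function suggest(PDO $pdo, string $typed, int $n = 10): array
{
    // '%typed%' also matches "Manchester United" when the user types just "United";
    // a real n-gram index (or Solr/Sphinx, see below) would scale better.
    $stmt = $pdo->prepare(
        "SELECT query FROM user_queries
         WHERE query LIKE :frag
         ORDER BY `count` DESC
         LIMIT " . (int)$n
    );
    $stmt->execute([':frag' => '%' . $typed . '%']);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}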
One more thing: the algorithm for accomplishing your task is pretty simple. The real challenge, however, lies in returning the data to be displayed in a fraction of a second. So when your query database/store grows, you can use a search engine like Solr or Sphinx to do the searching for you, which will be pretty fast at returning the results to be rendered.
You can use the Lucene search engine for this functionality. Refer to this link,
or you may also have a look at Lucene Solr Autocomplete...
Google has (and keeps growing) thousands of entries which are arranged according to day, time, geolocation, language, etc., and the data increases with users' entries. Whenever a user types a word, the system checks the table of "most used words for that location + day + time", and (if there is no answer) then falls back to "general words". So you should categorize every word entered by users, or make a general word-relation table in your database, where the most suitable search answer is referenced.
Yesterday I stumbled on something that answered my question. Google draws autocomplete suggestions from this XML feed, so it is wise to use it if you have too few users to build your own database of keywords:
http://google.com/complete/search?q=[keyword]&output=toolbar
Just replacing [keyword] with some word will give suggestions for that word; then the task is just to parse the returned XML and format the output to suit your needs.
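A rough sketch of fetching and parsing that feed; the element and attribute names used here (CompleteSuggestion, suggestion, data) are assumptions about the toolbar XML's structure, so inspect a real response before relying on them:

<?php
// Fetch autocomplete suggestions from the toolbar feed and parse the XML.
// Element/attribute names are assumptions - check an actual response first.
function googleSuggestions(string $keyword): array
{
    $url = 'http://google.com/complete/search?q=' . urlencode($keyword) . '&output=toolbar';
    $xml = @simplexml_load_file($url);
    if ($xml === false) {
        return [];
    }

    $suggestions = [];
    foreach ($xml->CompleteSuggestion as $item) {
        $suggestions[] = (string)$item->suggestion['data'];
    }
    return $suggestions;
}

print_r(googleSuggestions('potato'));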

Computing word, image, video and audio file counts in a scalable way?

I am attempting to gather as much interesting metadata as possible to display for readers of an expression engine site I'm developing and am looking for guidance on methods (or indeed the feasibility) of computing specific bits of this metadata in a scalable way.
Expression Engine allows for quite a few bits of data to be gathered and displayed natively, for example post totals and dates, comment totals and dates, tag totals, etc. However I'm specifically interested in finding a method to count and display totals for data like number of words, images, videos, or audio files, not only within individual posts but across a channel, as well as site-wide.
These totals would be displayed contextually depending on where they were accessed. So for example search results would display the number of words/images/etc contained in individual posts, a channel's "about" page would display totals for the entire channel, and the site's "about" page would display site-wide totals. I'm not clear on the best approach or whether this is even really feasible.
I'm not a professional web designer, so my knowledge of anything beyond html5/css3/ee is somewhat limited, but I've pondered:
Entering these numbers on a per-post basis, in custom fields, but am not clear on whether they can be added together for channel and site-wide totals.
Using PHP's count() function, but I am not very familiar with PHP, so I'm unsure whether it's appropriate.
Using some MySQL method to query the database; again, unsure.
Utilizing the Expression Engine "Query Module." !?
Using some jQuery plug-in to do the counting individually and then adding the numbers up after the fact.
It may be that the counting of words, images, video, and audio files and the scalability are different questions altogether, but the truth is I'm very confused as to which avenue to even explore. So any and all suggestions or guidance would be greatly appreciated.
Update: I'm looking into database methods to collect and add the results but am still interested in identifying the best ways to actually perform the word/image/video/audio file counts.
There are many solutions, but I have a few in mind that may help you out. I'll just show the one I like best, which I use for my own site.
One solution is to add a count column to the tables you are interested in that is automatically updated when someone posts or does something. You can also make a new table called globalcount or whatever that counts everything site-wide. This can then simply be displayed later. You would first need a method/function for counting words and such if you want that info. And when someone makes a post, just count one up from the previous value.
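For the counting function itself, something as simple as this might do for a first pass (the tag-based media detection is a naive assumption about how your entries are marked up):

<?php
// Naive per-post counters: words via str_word_count() on the stripped text,
// images/videos/audio by counting their HTML tags. Assumes the entry body is HTML.
function postCounts(string $html): array
{
    $lower = strtolower($html);
    return array(
        'words'  => str_word_count(strip_tags($html)),
        'images' => substr_count($lower, '<img'),
        'videos' => substr_count($lower, '<video'),
        'audio'  => substr_count($lower, '<audio'),
    );
}

// Store these per post, then add them into your channel and site-wide totals.
print_r(postCounts('<p>Two potatoes</p><img src="spud.jpg">'));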
The above is what I use. I use a misc table (it has one row that contains all the data; you could instead make each row contain a 'name' and 'value' pair) that looks something like:
(`views`, `totalusers`, `totalgroups`, `totalthreads`, `totalposts`, `totalarticles`, `totalcomments`, `totalpms`, `activeusercount`)
And in something like my 'news' table I use 'totalcomments' to count the local comments posted in that article. So I have both the local and global comments.
In my case, if I wanted to update 'totalusers' in the 'misc' table after a new user registers, I'd just take my $misc array and do: $newtotalusers = intval($misc['totalusers'] + 1);
mysql_query("UPDATE `misc` SET `totalusers`='$newtotalusers'");
Or you could instead just use "totalusers+1" directly in the query.
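That variant keeps the increment atomic, so two simultaneous updates cannot both read the same old value:

// Let MySQL do the arithmetic instead of read-then-write in PHP.
mysql_query("UPDATE `misc` SET `totalusers` = `totalusers` + 1");

(Note that the old mysql_* functions were removed in PHP 7; mysqli or PDO is the modern equivalent.)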
The same can be done with anything else you wish to track, such as any file count. Hope this helps :)
One last thing: you could also write a script that, if the data ever becomes off because of an error, recounts and fixes any table's count values.

Twitter competition ~ saving tweets (PHP & MySQL)

I am creating an application to help our team manage a twitter competition. So far I've managed to interact with the API fine, and return a set of tweets that I need.
I'm struggling to decide on the best way to handle the storage of the tweets in the database, how often to check for them and how to ensure there are no overlaps or gaps.
You can get a maximum of 100 tweets per page. At the moment, my current idea is to run a cron script, say, once every 5 minutes or so, grab a full 100 tweets at a time, and loop through them, looking in the db to see if I can find them before adding them.
This has the obvious drawback of running 100 queries against the db every 5 minutes, plus however many INSERTs there are as well, which I really don't like. Plus I would much rather have something a little more real-time. As Twitter is a live service, it stands to reason that we should update our list of entrants as soon as they enter.
This again throws up a drawback of having to repeatedly poll Twitter, which, although might be necessary, I'm not sure I want to hammer their API like that.
Does anyone have any ideas for an elegant solution? I need to ensure that I capture all the tweets and don't leave anyone out, while keeping each db user unique. I have considered just adding everything and then grouping the resulting table by username, but it's not tidy.
I'm happy to deal with the display side of things separately as that's just a pull from mysql and display. But the backend design is giving me a headache as I can't see an efficient way to keep it ticking over without hammering either the api or the db.
100 queries in 5 minutes is nothing, especially since a tweet has essentially only four pieces of data associated with it: user ID, timestamp, tweet text, tweet ID - say, about 170 characters' worth of data per tweet. Unless you're running your database on a 4.77MHz 8088, your database won't even blink at that kind of "load".
The Twitter API offers a streaming API that is probably what you want to do to ensure you capture everything:
http://dev.twitter.com/pages/streaming_api_methods
If I understand what you're looking for, you'll probably want a statuses/filter, using the track parameter with whatever distinguishing characteristics (hashtags, words, phrases, locations, users) you're looking for.
Many Twitter API libraries have this built in, but basically you keep an HTTP connection open and Twitter continuously sends you tweets as they happen. See the streaming API overview for details on this. If your library doesn't do it for you, you'll have to check for dropped connections and reconnect, check the error codes, etc - it's all in the overview. But adding them as they come in will allow you to completely eliminate duplicates in the first place (unless you only allow one entry per user - but that's client-side restrictions you'll deal with later).
As far as not hammering your DB: once you have Twitter just sending you stuff, you're in control on your end - you could easily have your client cache up the tweets as they come in, and then write them to the db at given time or count intervals - write whatever it has gathered every 5 minutes, or write once it has 100 tweets, or both (obviously these numbers are just placeholders). This is when you could check for existing usernames if you need to - writing from a cached-up list gives you the best chance to make things as efficient as you want.
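A sketch of that cache-then-flush idea; how you read tweets depends entirely on your streaming library, so that part is left as comments, and the 100-tweet/5-minute thresholds are just the placeholders mentioned above:

<?php
// Buffer incoming tweets in memory and flush them to MySQL in one multi-row
// INSERT every 100 tweets or every 5 minutes, whichever comes first.
// Assumes a tweets table with a UNIQUE key on tweet_id so duplicates are dropped.
function flushTweets(PDO $pdo, array &$buffer): void
{
    if (!$buffer) {
        return;
    }
    $placeholders = implode(',', array_fill(0, count($buffer), '(?, ?, ?, ?)'));
    $stmt = $pdo->prepare(
        "INSERT IGNORE INTO tweets (tweet_id, user_id, created_at, `text`)
         VALUES $placeholders"
    );
    $params = array();
    foreach ($buffer as $t) {
        array_push($params, $t['id'], $t['user_id'], $t['created_at'], $t['text']);
    }
    $stmt->execute($params);
    $buffer = array();
}

// Inside the streaming loop of whatever client library you use:
// $buffer[] = $tweet;
// if (count($buffer) >= 100 || time() - $lastFlush >= 300) {
//     flushTweets($pdo, $buffer);
//     $lastFlush = time();
// }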
Update:
My solution above is probably the best way to do it if you want live results (which it seems like you do). But as mentioned in another answer, it may well be possible to just use the Search API to gather entries after the contest is over and not worry about storing them at all - you can specify pages when you ask for results (as outlined in the Search API link), but there are limits on how many results you can fetch overall, which may cause you to miss some entries. Which solution works best for your application is up to you.
I read over your question and it seems to me that you want to duplicate data already stored by Twitter. Without more specifics on the competition you're running (how users enter, for example, or the estimated number of entries), it's impossible to know whether or not storing this information locally in a database is the best way to approach the problem.
A better solution might be to skip storing duplicate data locally and pull the entrants directly from Twitter, i.e. when you're attempting to find a winner.
You could then eliminate duplicate entries on the fly while the code is running. You would just need to call "the next page" once it has finished processing the 100 entries it has already fetched. Although I'm not sure if this is possible directly through the Twitter API.
I think running a cron every X minutes and basing it on the tweets' creation dates may work. You can query your database to find the date/time of the last recorded tweet, then only process tweets newer than that to prevent duplicates. Then, when you do your inserts into the database, use one or two INSERT statements containing all the entries you want to record, to keep performance up.
INSERT INTO `tweets` (id, date, ...) VALUES (..., ..., ...), (..., ..., ...), ...;
This doesn't seem too intensive... it also depends on the number of tweets you expect to record, though. Also make sure to index the table properly.
