I have been doing a bit of searching round StackOverflow and the Interweb and I have not had much luck.
I have a URL which looks like this...
nr/online-marketing/week-in-review-mobile-google-and-facebook-grab-headlines
I am getting the article name from the URL and replacing the '-' with ' ' to give me:
week in review mobile google and facebook grab headlines
At this point this is all the information I have on the article, so I need to use it to query the database for the rest of the article information. The problem is that this string does not match the actual headline of the article; in this instance the actual headline is:
Week in review: Mobile, Google+ and Facebook grab headlines
As you can see, it includes extra punctuation, so I need to find a way of using MySQL LIKE to match the article.
Hope someone can help. A standard SELECT * FROM table WHERE field LIKE $name does not work. I'm hoping to find a way of doing it without splitting up each individual word, but if that's what it comes down to then so be it!
Thanks.
Try MySQL MyISAM engine's full-text search. In your case the query will be:
SELECT * FROM table
WHERE MATCH (title) AGAINST ('week in review mobile google and facebook grab headlines');
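That requires you to convert the table to MyISAM (InnoDB only gained FULLTEXT indexes in MySQL 5.6) and to have a FULLTEXT index on the title column. Also, depending on the size of the table, test the performance of the query.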
See more info under:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
This really seems more like a database design issue... If you're using large texts with different fields as forms of primary keys it could lead to duplicates or synchronization problems.
One potential solution is to give each entry a unique identifier (perhaps an int or uniqueidentifier field, if MySQL supports that), and use that field to map the actual headline to the URL.
Another potential solution is to create a table that associates each headline with its URL and use that table for lookups. This will incur a little extra overhead, but will ensure that special characters in the title never affect the lookup process.
As for a way to do this with your current design, you may be able to do some kind of regular expression search by tokenizing each word individually and then searching for an entry that includes all tokens, but I'm fairly certain that MySQL doesn't provide this functionality in a basic command.
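To illustrate the lookup-table idea above, here is a minimal sketch in Python. It assumes you store a URL-safe "slug" next to each headline when the article is saved, so the URL can be matched exactly instead of with LIKE. The table name articles, the column names, and the helper functions are all made up for the example, and sqlite3 is only a stand-in for MySQL.

```python
import re
import sqlite3


def slugify(headline):
    """Lower-case the headline and collapse each run of non-alphanumerics into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", headline.lower()).strip("-")


def create_store():
    # In-memory SQLite database standing in for the real MySQL table (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles ("
        "id INTEGER PRIMARY KEY, "
        "headline TEXT NOT NULL, "
        "slug TEXT NOT NULL UNIQUE)"
    )
    return conn


def add_article(conn, headline):
    # Compute the slug once, at insert time, and store it beside the headline.
    conn.execute(
        "INSERT INTO articles (headline, slug) VALUES (?, ?)",
        (headline, slugify(headline)),
    )


def find_by_slug(conn, slug):
    # Exact match on the stored slug; returns (id, headline) or None.
    return conn.execute(
        "SELECT id, headline FROM articles WHERE slug = ?", (slug,)
    ).fetchone()
```

With this in place, the last path segment of the URL (week-in-review-mobile-google-and-facebook-grab-headlines) can be passed straight to find_by_slug, and the punctuation in the real headline no longer matters.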
I'm working on a tool in PHP that scans Instagram to gather analytics on a bunch of hashtags. The aim is to monitor the evolution / growth of certain hashtags and provide a search engine for people to get up to date statistics on each hashtag.
So far I've got a fairly simple search engine in place, and I run a SQL query that looks for LIKE '%travel%'. So if someone types "#travel", they'll get anything that contains the word "travel", such as "travelagent", "iliketotravel", etc.
The issue I'm facing is I'd like to broaden the search results to include things that are related to #travel, much like websites like http://displaypurposes.com or http://best-hashtags.com/ and I'm trying to figure out just HOW they do it.
I'm especially fascinated by the first one, and the Graph function: https://displaypurposes.com/graph?tag=travel
It looks like they've effectively mapped all the links between a huge number of hashtags and provide results based on that.
I have about 45 000 hashtags in my database, how would I go about linking them together to enable a "relevancy search" like the two websites I mentioned above? How does one go about building something similar? I've spent ages looking online and can't find the answer to my question.
Thanks for your help! :)
This isn't really a programming question, but I'll try to answer it from a programming angle.
It's possible to have multiple tags on a single Instagram post. For example, you might have someone posting a picture of Rome with the hashtags #rome #travel. This now associates #rome with #travel and counts this as a connection between the two.
As long as we have a table structure with the following attributes:
PostNumber
Hashtag
We can find the top relations by running something like the following code:
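-- Count how often each other hashtag shares a post with #travel
SELECT COUNT(*) AS `Relation Occurrences`,
b.Hashtag
FROM
Posts a
JOIN
Posts b
ON
a.PostNumber = b.PostNumber
WHERE
a.Hashtag = '#travel'
AND
b.Hashtag != '#travel'
GROUP BY
b.Hashtag
ORDER BY
`Relation Occurrences` DESC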
You can refine the query to limit to 100 top relations and so on if required.
To further expand on this, the key is splitting the post out into a table with 1 row per post per hashtag. If you're doing wildcard searches on large text, this will lead to long processing times and be inefficient.
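If it helps to see the same co-occurrence idea outside SQL, here is a rough Python sketch. It assumes the data is already available as a list of (PostNumber, Hashtag) pairs, one row per post per hashtag as described above; the function name related_hashtags is made up for the example.

```python
from collections import Counter


def related_hashtags(rows, tag, limit=100):
    """Return (hashtag, count) pairs for tags that share a post with `tag`.

    `rows` is a list of (post_number, hashtag) tuples.
    """
    # Step 1: find every post that carries the tag we are interested in.
    posts_with_tag = {post for post, hashtag in rows if hashtag == tag}
    # Step 2: count the other hashtags that appear on those same posts.
    counts = Counter(
        hashtag
        for post, hashtag in rows
        if post in posts_with_tag and hashtag != tag
    )
    # Most frequent co-occurring tags first.
    return counts.most_common(limit)
```

For 45,000 hashtags you would probably still want the database to do the counting, but the sketch shows that a "relation" is nothing more than two tags appearing on the same post.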
Good evening,
I am facing a small problem whilst trying to build a little search algorithm.
I have a database table containing video game names and software names. Now I would like to add new offers by fetching and parsing xml files on other servers. The issue is:
How can I compare the strings for the product name so that it works even if the offer name doesn't match the product name stored in my database 100%?
As an example I am currently using this PHP + SQL code to compare the strings:
$query_GID = "select ID,game from gkn_catalog where game like '%$batch_name%' or meta like '%$batch_name%' ";
I am currently using the like operator in conjunction with two wild-cards to compare the offer name (batch_name) with the name in the database (game).
I would like to know how I can improve on this as this method isn't very failsafe or whatever you want to call it, what happens is:
If the database says the game title is:
Deus Ex Human Revolution Missing Link
and the batch_name says:
Deus Ex Human Revolution Missing Link DLC
the result will be empty/wrong/false ... well it won't find the game in my database at all.
Same goes for something like this:
Database = Lego Star Wars The Complete Saga
batch_name = Lego Star Wars : The Complete Saga
Result: False
Is there a better way to do the SQL query? Or how can I try to get that query working so it can deal with strings that come with special characters (like -minus- & [brackets]) and or characters which aren't included in the names within the database (like DLC, CE...)?
You're looking for fuzzy search algorithms and fuzzy search results. This is a whole field of study. However, there are also some straightforward tutorials to get you started if you take a quick google around.
You might be tempted to try something like PHP's wonderful levenshtein method, which calculates the "closeness" of two strings. However, this would require matching it against every record. If there will be thousands of records, that's out of the question.
MySQL has some matching tools which may help. I see that as I'm writing this, somebody has already mentioned FULLTEXT and MATCH() in the comments. Those are a great way to go.
There are a few other good solutions to look into as well. Storing an index of keywords (with all the articles and helper words like of/the/an/am/is/are/was/from removed) and then searching on each word in the query is a simple solution. However, it doesn't produce great results, in that the returned values are not weighted well, and it doesn't localize at all.
There are lots of cheap and wonderful third party search tools (Lucene comes to mind) as well that will do most of this work for you. You just call an API and they manage the caching, keywords, indexing, fuzzying, et al for searches.
Here are some SO questions that are related to fuzzy searches, which will help you find more terminology and ideas:
Lightweight fuzzy search library
Fuzzy queries to database
Fuzzy matching on string
fuzzy searching an array in php
MySQL queries, as you found out, can use the percent character (%) as a wildcard in conjunction with the LIKE operator.
You have multiple solutions depending on what you want exactly.
you can make a fulltext search
you can search using a phonetic algorithm like Soundex
you can search by keywords
Remember that you can search in multiple passes (search for an exact match, then with a percent on each side, then split into words and insert % between every word, then search by keyword, etc.), depending on whether an exact match has priority over a close match.
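The multi-pass idea can be sketched roughly like this in Python. It is only an illustration under a few assumptions: the catalog is a plain list of titles already loaded from the database, suffixes such as DLC and CE are treated as noise words (the NOISE_WORDS set is a guess, extend it to taste), and the final pass uses a simple word-overlap score rather than any particular MySQL feature.

```python
import re

# Hypothetical list of tokens that offers add but the catalog omits.
NOISE_WORDS = {"dlc", "ce"}


def normalize(name):
    """Lower-case, drop punctuation, and remove noise words."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return [w for w in words if w not in NOISE_WORDS]


def find_product(offer_name, catalog, min_overlap=0.6):
    # Pass 1: exact match.
    if offer_name in catalog:
        return offer_name

    # Pass 2: match after normalizing both sides.
    offer_words = normalize(offer_name)
    for title in catalog:
        if normalize(title) == offer_words:
            return title

    # Pass 3: best word overlap (shared words / all words), if good enough.
    offer_set = set(offer_words)
    best_title, best_score = None, 0.0
    for title in catalog:
        title_set = set(normalize(title))
        union = offer_set | title_set
        if not union:
            continue
        score = len(offer_set & title_set) / len(union)
        if score > best_score:
            best_title, best_score = title, score
    return best_title if best_score >= min_overlap else None
```

Both examples from the question ("... Missing Link DLC" and "Lego Star Wars : The Complete Saga") are caught by the second pass; the third pass is a fallback for names that differ by a word or two, and the 0.6 threshold is an arbitrary starting point.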
I run a photo website where users are free to enter any tag they like, even tags not used before. As a result, one photo may be tagged as "insect" whilst somebody else tags theirs as "insects".
I'd like to keep the free-tagging capability, yet would like to have a way to filter out such near-duplicates. The total collection of tags is currently at 1,500. My idea is to read all of them from the DB into memory and then run an algorithm on them that displays "suspects".
My idea of a suspect is that x% of the characters in the string are the same (same char and order), where x is configurable. I could probably code a really inefficient way to do this but I was wondering if there is an existing solution to this problem?
Edit: Forgot to mention: just sorting the tags isn't enough, as that would require me to go through the entire set to find dupes.
There are some flaws in your logic. For example, what happens when the plural of an object is spelled quite differently from the singular (e.g. person vs. people, or even candy vs. candies)?
If English is the primary language, check out Soundex which allows phonetic matches. Also consider using a crowd-sourced synonym model where users can create links to existing tags.
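For reference, here is a small Python sketch of the classic American Soundex code, to show what a phonetic match looks like in practice. The function name soundex is just for the example; MySQL also has a built-in SOUNDEX() function, though its output is not guaranteed to be identical to this sketch.

```python
def soundex(word):
    """Return the 4-character American Soundex code of `word` ('' if it has no letters)."""
    codes = {}
    for letters, digit in (
        ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
        ("l", "4"), ("mn", "5"), ("r", "6"),
    ):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = codes.get(ch, "")  # vowels (and y) have no code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]
```

"insect" and "insects" both come out as I522, so they would be flagged as the same tag. "person" and "people" get different codes (P625 and P140), which is exactly the irregular-plural problem mentioned above.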
Maybe the algorithm you are looking for is approximate string matching:
http://en.wikipedia.org/wiki/Approximate_string_matching
For a given word, you can compare it against the list of words and, if the 'distance' is small, add the pair to the suspects.
A fast implementation is to use dynamic programming like the Needleman–Wunsch algorithm.
I have made a blog example of this in C# where you can configure the 'distance' using a matrix character lookup file.
http://kunuk.wordpress.com/2010/10/17/dynamic-programming-example-with-c-using-needleman-wunsch-algorithm/
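As a rough illustration of the distance idea, here is a Python sketch that uses plain Levenshtein edit distance. Note that this is a simpler relative of Needleman-Wunsch (it is also dynamic programming, but every insertion, deletion, or substitution costs 1), not the algorithm from the blog post. The function names and the max_distance parameter are made up for the example.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def find_suspects(tags, max_distance=1):
    """Return every pair of distinct tags whose edit distance is at most `max_distance`."""
    unique = sorted(set(tags))
    return [
        (a, b)
        for a, b in combinations(unique, 2)
        if edit_distance(a, b) <= max_distance
    ]
```

Comparing every pair among 1,500 tags is about 1.1 million distance calculations, which should be manageable for an occasional batch job on strings as short as tags.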
Is "either contains either" fine? You could do a SQL query something like this, if your images are in a database (which would only make sense):
SELECT * FROM ImageTags WHERE INSTR('theNewTag', TagName) > 0 OR INSTR(TagName, 'theNewTag') > 0 LIMIT 1;
If you really want to do this efficiently, I would suggest some sort of JavaScript implementation that displays possibilities as the user is typing in a tag. Not only will it save the user time to see 5 suggestions as they type, it will also stop them from typing "suspects" when "suspect" shows up as a suggestion. That is, of course, unless they really want "suspects" as a point of urgency.
You could load a huge list of words and narrow them down as the user types. I get the feeling that this could be very simplistic, especially if you want to anticipate correctly spelled words. If someone misses a letter, they'll probably go back to fix it when they see a list of suggestions that isn't at all what they meant to type. And when they do type a word correctly, it'll pop up in the suggestions.
I was wondering what would be the best approach to generate a tag cloud from an input text (while the user is typing it). For example, if the user types a story's text containing the keywords "sci-fi, technology, effects", the tag cloud will be formed from each of these keywords, ordered by relevance according to their frequency across every story. The tag cloud will be displayed in descending order and using the same font size; it's not the display algorithm but the search algorithm I need to implement.
I'm using mysql and php.
Should I stick to MATCH...AGAINST clause? should I implement a tags table?
More details
I have a mysql table containing a lot of stories. When user is typing one of his/her own, I want to display a tag cloud containing the most frequent words, taken from the input text, occurring on this set of stories that are saved on my db.
The tag cloud will only be used to show to the user the relevance of the words he/she has entered on his/her own story according to the frequency they occur on all stories entered by all users.
I think the first thing you need to do is more clearly define the purpose of your tagging system. Do you want to simply build tags based on the words that occur most frequently within the text? This strikes me as something designed with search rankings in mind.
...Or do you want your content to be better organized, and the tag cloud be a way of providing a better user experience and creating more distinct relationships between pieces of content (ie both of these are tagged sci-fi, so display them in the sci-fi category).
If the former is the case, you might not need to do anything but:
Explode the text by a delimiter like a single space: explode(' ', $content);
Have a list (possibly in a config file or within the script itself) of words which occur frequently and which you want to exclude from being tags (and, or, this, the, etc.). You could just pull them from pages like these: http://www.esldesk.com/vocabulary/pronouns , http://www.english-grammar-revolution.com/list-of-conjunctions.html
Then you just need to decide how many times a word has to occur (either percentage or numeric), and store those tags in a table that shows the connection between tags and content.
To implement the "as the user is typing" part you just need to use a bit of jQuery's ajax functionality to continually call your script that builds the tag list (ie on keydown).
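The split / filter / count steps for the first option can be sketched as follows. This is in Python rather than PHP purely for illustration; the STOP_WORDS set is a tiny made-up sample, and min_count and limit are hypothetical knobs for the "how many times a word has to occur" decision.

```python
import re
from collections import Counter

# Small sample list; a real one would be much longer (assumption).
STOP_WORDS = {"a", "an", "and", "or", "the", "this", "that", "of", "to", "in", "is", "it"}


def suggest_tags(content, min_count=2, limit=10):
    """Return (word, count) pairs for frequent non-stop words, most frequent first."""
    # Split into lower-case words, keeping inner hyphens/apostrophes (e.g. "sci-fi").
    words = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", content.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [(w, c) for w, c in counts.most_common(limit) if c >= min_count]
```

Using a regular expression instead of a plain split on spaces means trailing punctuation ("robot." vs "robot") doesn't produce separate words.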
The other option (better user experience) will incorporate a lot of the same elements, but you'll have to think about it a bit more. Some things I would consider:
Do you want to restrict to certain tags (perhaps you don't want to allow just anyone to create new tags)?
How you will deal with synonyms
If you will support multiple languages
If you want a preference towards suggesting existing tags (which might be close) over suggesting new ones
Once you've fully defined the logic and user experience you can come back to the search algorithm. MATCH and AGAINST are good options but you may find that a simple LIKE will do it for you.
Good luck = )
If you want the tag cloud to be generated as the user is typing it, you can do it in two ways.
Directly update the tag cloud from the input text
Send the input text to the backend (in realtime using ajax/comet), which then saves, calculates the word frequency and returns data from which you generate the cloud.
I would go with the former using a jQuery plugin such as - http://plugins.jquery.com/plugin-tags/tag-cloud
I was wondering if there was any sort of way to detect a page's genre/category.
Possibly there is a way to find keywords or something?
Unfortunately I don't have any idea so far, so I don't have any code to show you.
But if anybody has any ideas at all, let me know.
Thanks!
EDIT #Nican
Perhaps there is a way to set, let's say, 10 categories (Entertainment, Funny, Tech).
Then create keywords for these categories (Funny = Laughter, Funny, Joke etc).
Then search through a webpage (maybe using cURL) for these keywords and assign it to the right category.
Hope that makes sense.
What you are talking about is basically what Google Adsense and similar services do, and it's based on analyzing the content of a page and matching it to topics. Generally, this kind of stuff is beyond what you would call simple programming / development and would require significant resources to be invested to get it to work "right".
A basic system might work along the following lines:
Get page content
Get X most commonly used words (omitting stuff like "and" "or" etc.)
Get words used in headings
Assign weights to different words according to a set of factors (is used in heading, is used in more than one paragraph, is used in link anchors)
Match the filtered words against a database of words related to a specific "category"
If cumulative score > threshold, classify site as belonging to category
Rinse and repeat
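A toy version of the steps above might look like this in Python. Everything specific here is an assumption made for the sake of the example: the CATEGORY_KEYWORDS table, the stop-word list, the heading weight of 3, and the threshold of 5. It also assumes the page body and headings have already been fetched and stripped of HTML.

```python
import re
from collections import Counter

# Hypothetical keyword lists per category.
CATEGORY_KEYWORDS = {
    "Funny": {"laughter", "funny", "joke"},
    "Tech": {"software", "computer", "programming"},
}
STOP_WORDS = {"a", "an", "and", "or", "the", "of", "to", "in", "is"}
HEADING_WEIGHT = 3  # a word in a heading counts this much (assumption)


def tokenize(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def classify(body, headings, threshold=5):
    """Return the sorted list of categories whose cumulative score exceeds `threshold`."""
    # Each occurrence in the body text counts 1.
    weights = Counter(tokenize(body))
    # Occurrences in headings get extra weight.
    for heading in headings:
        for word in tokenize(heading):
            weights[word] += HEADING_WEIGHT
    # Sum the weights of each category's keywords.
    scores = {
        category: sum(weights[word] for word in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    return sorted(c for c, score in scores.items() if score > threshold)
```

A real system would need far larger keyword lists and more weighting factors (link anchors, paragraph spread), but the scoring loop stays the same shape.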
Folksonomy may be a way of accomplishing what you're looking for:
http://en.wikipedia.org/wiki/Folksonomy
For instance, in Drupal they have a Folksonomy module:
http://drupal.org/node/19697 (Note this module appears to be dead, see http://drupal.org/taxonomy/term/71)
Couple that with a tag cloud generator, and you may get somewhere:
http://drupal.org/project/searchcloud
Plus, a little more complexity may be able to derive mapped relationships to other terms, especially if you control the structure of the tagging options.
http://intranetblog.blogware.com/blog/_archives/2008/5/22/3707044.html
EDIT
In general, the type of system you're trying to build relies on unique word values on a page. So you would need to...
Get unique word values from your content (index values or create a bot to crawl your site)
Remove all words and symbols you can't use (at, the, or, and, etc...)
Count the number of times the unique words appear on the page
Add them to some type of datastore so you can call them based on the relationships you're mapping
If you have a root label system in place, associate those values with the word counts on the page (such as a query or derived table)
This is very general, and there are a number of ways this can be implemented/interpreted. Folksonomies are meant to "crowdsource" much of the effort for you, in a "natural way", as long as you have a user base that will contribute.