mysql group duplicates based on title and description


Indeed.com groups duplicate job postings by title and description. Here is an example of what I am talking about. How would I go about doing something like that? Is it just a simple Group By statement or something else entirely?

It could be done with a simple group by, but that will only group exact matches.
There are several parameters you can test to determine whether to group entries. In their example: company name, location, and keywords.
"Something else entirely" would involve analyzing the fields of one row to determine their similarity to another row. I think this would probably be too processor intensive to integrate on a large-scale.

I'm not exactly sure what you're looking at in the example, but it wouldn't really make sense to do a SQL GROUP BY on something like the description. That would cause a ton of overhead, especially with the amount of data Indeed is keeping track of.
A good way to store data similar to what Indeed stores would be a document index; try googling Solr or NoSQL.
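The "simple group by" mentioned above only catches exact matches, but normalizing the grouped fields first (case, surrounding whitespace) already folds the most trivial variants together. The sketch below is only an illustration: it uses Python's built-in sqlite3 instead of MySQL so that it is self-contained, and the `jobs` table and its columns are assumptions, not Indeed's actual schema.

```python
import sqlite3

# In-memory SQLite database standing in for the real MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT, company TEXT, location TEXT)"
)
conn.executemany(
    "INSERT INTO jobs (title, company, location) VALUES (?, ?, ?)",
    [
        ("PHP Developer", "Acme", "Austin"),
        ("php developer ", "Acme", "Austin"),   # same posting, different case/spacing
        ("PHP Developer", "Globex", "Austin"),  # different company, separate group
    ],
)

# Group on a normalized title plus company and location, keeping the first id
# of each group as its representative row.
groups = conn.execute(
    """
    SELECT lower(trim(title)) AS norm_title, company, location,
           COUNT(*) AS postings, MIN(id) AS first_id
    FROM jobs
    GROUP BY lower(trim(title)), company, location
    ORDER BY first_id
    """
).fetchall()
```

Anything fuzzier than this (reworded titles, edited descriptions) needs a similarity measure rather than GROUP BY, which is where the document-index suggestion comes in.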

Related

performance issue from 5 queries in one page

I am a junior PHP developer, and I am stuck on the performance problem described here.
I am making a search engine in PHP. My database has one table with 41 columns and millions of rows, so it is a very large dataset. In index.php I have a form for searching data. When the user enters a search keyword and hits submit, the form posts to search.php, which shows the results. The query looks like this:
SELECT * FROM TABLE WHERE product_description LIKE '%mobile%' ORDER BY id ASC LIMIT 10
This is the first query. After the results are shown, I have to run 4 other queries like these:
SELECT DISTINCT(weight_u) as weight from TABLE WHERE product_description LIKE '%mobile%'
SELECT DISTINCT(country_unit) as country_unit from TABLE WHERE product_description LIKE '%mobile%'
SELECT DISTINCT(country) as country from TABLE WHERE product_description LIKE '%mobile%'
SELECT DISTINCT(hs_code) as hscode from TABLE WHERE product_description LIKE '%mobile%'
These queries are for the filters. The problem is that when I submit the search, all the queries run at once, and the page is very slow.
Is there a faster way to fetch weight, country, country_unit and hs_code, and how can I achieve it?
The same functionality is implemented here, where the filter bar appears after the table is filled with data. How can I achieve that? Please help.
The full functionality is implemented here.
I have tried to explain my full problem. If anything is unclear, please let me know and I will improve the question. I am also new to Stack Overflow.

Firstly - are you sure this code is working as you expect it? The first query retrieves 10 records matching your search term. Those records might have duplicate weight_u, country_unit, country or hs_code values, so when you then execute the next 4 queries for your filter, it's entirely possible that you will get values back which are not in the first query, so the filter might not make sense.
If that's true, I would create the filter values in your client code (PHP). Finding the unique values in 10 records is going to be quick and easy, and it reduces the number of database round trips.
Finally, the biggest improvement you can make is to use MySQL's fulltext searching features. The reason your app is slow is because your search terms cannot use an index - you're wild-carding the start as well as the end. It's like searching the phonebook for people whose name contains "ishra" - you have to look at every record to check for a match. Fulltext search indexes are designed for this - they also help with fuzzy matching.
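Building the filter values in client code, as suggested above, amounts to one pass over the rows that were already fetched. Here is a minimal sketch in Python (the same loop translates directly to PHP arrays); the row dictionaries and the helper name `filter_values` are made up for illustration.

```python
def filter_values(rows, columns):
    """Collect the sorted unique values of each filter column from fetched rows."""
    values = {col: set() for col in columns}
    for row in rows:
        for col in columns:
            if row.get(col) is not None:
                values[col].add(row[col])
    return {col: sorted(vals) for col, vals in values.items()}


# Rows as they might come back from the first (LIMIT 10) query; sample data only.
rows = [
    {"id": 1, "weight_u": "kg", "country": "IN", "hs_code": "8517"},
    {"id": 2, "weight_u": "kg", "country": "CN", "hs_code": "8517"},
    {"id": 3, "weight_u": "g", "country": "IN", "hs_code": "8504"},
]
filters = filter_values(rows, ["weight_u", "country", "hs_code"])
```

This only reflects the rows on the current page. If the filters must cover every matching row, the values still have to come from the database, ideally through a query that can use an index.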

I'll give you some tips that prove useful in many situations when querying a large dataset, or almost any dataset.
Listing the fields you want instead of querying for '*' is better practice. This matters more as you have more columns and more rows.
Always try to use the PKs to look up the data. The more specific the filter, the less it will cost.
An index would come in pretty handy in this kind of situation, as it will make the search faster.
LIKE queries are generally pretty slow and resource heavy, even more so in your situation. So again, the more specific you are, the better it will get.
Also, if you just want to retrieve data from these tables again and again, maybe a VIEW would fit nicely.
Those are just some tips that came to mind to ease your problem.
Hope it helps.

MYSQL like + group

Well I'm having a problem mainly caused by bad structure in database. I'm coding this for a company whose code is quite messy and the table is very large so I don't think it's an option to fix the structure.
Anyway, my issue is that I'm trying to somehow group a value that won't be alone in the string...
They are storing values separated with commas... So it would be like
field: "category" value: 'var1, var2, var3'
And I will search using this query:
SELECT name, category
FROM companies
WHERE (MATCH(name, category) AGAINST ('$search' IN BOOLEAN MODE)
OR category LIKE '$search%')
It would match, for example, var2 (it's not limited to 3 variables though; there can be one or many more), and I'd split it manually in PHP, no problem. However, I will not get enough distinct matches: I want e.g. 10 different suggestions per search. To be more specific, I'm making an autosuggest feature, which means I want "moto%" to match motorbike, motor alone, or whatever. But I keep getting the same values: there are a couple of hundred results that contain "motorbike", and I don't know how to filter them, as I'm not able to use GROUP BY due to the bad db structure...
I found this: T-SQL - GROUP BY with LIKE - is this possible?
It SEEMED like something that would be a solution, but as far as I've tried I could not get it to work for what I wanted.
So I'm wondering which solutions there are... If there is ABSOLUTELY no way of working around this, I will probably have to fix the db structure (but this really has to be the last option).

Start taking steps to make the database structure proper. Make an extra table and fill it with the split values.
Then you can use proper queries to select the data you need. Both you and the next developer will have less trouble with this project in the future, not to mention the gain in query speed.
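To make the extra-table idea concrete, here is a rough sketch using Python's sqlite3 as a stand-in for MySQL. The `company_categories` table, its index, and the `suggest` helper are hypothetical names; only `companies`, `name` and `category` come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO companies (name, category) VALUES (?, ?)",
    [
        ("Speedy", "motorbike, motor"),
        ("Wheels", "motorbike, cars"),
        ("Boats R Us", "motorboat"),
    ],
)

# Hypothetical extra table: one row per (company, single category) pair.
conn.execute("CREATE TABLE company_categories (company_id INTEGER, category TEXT)")
conn.execute("CREATE INDEX idx_company_categories ON company_categories (category)")

# One-off migration: split the comma-separated values and store them individually.
for company_id, raw in conn.execute("SELECT id, category FROM companies").fetchall():
    for part in raw.split(","):
        part = part.strip()
        if part:
            conn.execute(
                "INSERT INTO company_categories VALUES (?, ?)", (company_id, part)
            )


def suggest(prefix, limit=10):
    """Return up to `limit` distinct categories starting with `prefix`."""
    rows = conn.execute(
        "SELECT DISTINCT category FROM company_categories "
        "WHERE category LIKE ? ORDER BY category LIMIT ?",
        (prefix + "%", limit),
    )
    return [r[0] for r in rows]


suggestions = suggest("moto")
```

Because each category is now its own row, DISTINCT (or GROUP BY) removes the repeated "motorbike" hits, and the prefix-only LIKE has a chance of using the index.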

I am not sure why I cannot write a comment, but maybe you can try this:
SELECT name, category FROM companies WHERE category LIKE '$search%' OR LOCATE('$search', category) > 0;
That checks whether your '$search' value appears anywhere in category.

I would have to agree that you should make the database right. It'll save you much trouble and time later. However, using SELECT DISTINCT may fix your immediate issue.

Process optimization for large sets of data

I currently have a project where we are dealing with 30million+ keywords for PPC advertising. We maintain these lists in Oracle. There are times where we need to remove certain keywords from the list. The process includes various match-type policies to determine if the keywords should be removed:
EXACT: WHERE keyword = '{term}'
CONTAINS: WHERE keyword LIKE '%{term}%'
TOKEN: WHERE keyword LIKE '% {term} %' OR keyword LIKE '{term} %'
OR keyword LIKE '% {term}'
Now, when a list is processed, it can only use one of the match-types listed above. But, all 30mil+ keywords must be scanned for matches, returning the results for the matches. Currently, this process can take hours/days to process depending on the number of keywords in the list of keywords to search for.
Do you have any suggestions on how to optimize the process so this will run much faster?
UPDATE:
Here is an example query to search for Holiday Inn:
SELECT * FROM keyword_list
WHERE
(
lower(text) LIKE 'holiday inn' OR
lower(text) LIKE '% holiday inn %' OR
lower(text) LIKE 'holiday inn %'
);
Here is the pastebin for the output of EXPLAIN: http://pastebin.com/tk74uhP4
Some additional information that may be useful. A keyword can consist of multiple words like:
this is a sample keyword
i like my keywords
keywords are great

Never use a LIKE match starting with "%" on large sets of data. It cannot use the table index on that field and will do a table scan. This is your source of slowness.
The only matches that can use the index are the ones starting with a hardcoded string (e.g. keyword LIKE '{term} %').
To work around this problem, create a new indexing table (not to be confused with the database's table index) mapping individual terms to the keyword strings containing those terms; then your keyword LIKE '% {term} %' becomes t1.keyword = index_table.keyword AND index_table.term = '{term}'.
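A small sketch of such an indexing table, again with Python's sqlite3 standing in for Oracle so the example runs on its own. The table name `keyword_terms`, its columns, and the `token_match` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keyword_list (id INTEGER PRIMARY KEY, keyword TEXT)")
conn.executemany(
    "INSERT INTO keyword_list (keyword) VALUES (?)",
    [("holiday inn express",), ("cheap holiday deals",), ("holidays in spain",)],
)

# Hypothetical indexing table: one row per (term, keyword row) pair.
conn.execute("CREATE TABLE keyword_terms (term TEXT, keyword_id INTEGER)")
conn.execute("CREATE INDEX idx_keyword_terms_term ON keyword_terms (term)")

for keyword_id, keyword in conn.execute("SELECT id, keyword FROM keyword_list").fetchall():
    # set() avoids duplicate rows when a word occurs twice in one keyword.
    for term in set(keyword.lower().split()):
        conn.execute("INSERT INTO keyword_terms VALUES (?, ?)", (term, keyword_id))


def token_match(term):
    """TOKEN match for a single-word term via an equality lookup on the index table."""
    rows = conn.execute(
        "SELECT k.keyword FROM keyword_terms t "
        "JOIN keyword_list k ON k.id = t.keyword_id "
        "WHERE t.term = ? ORDER BY k.id",
        (term.lower(),),
    )
    return [r[0] for r in rows]


matches = token_match("holiday")
```

This sketch handles single-word terms only. A multi-word term such as "holiday inn" would need one lookup per word plus a check that the words are adjacent in the candidate keywords.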

I know that my approach can look like heresy to RDBMS people, but I have verified it many times in practice and there is no magic. One just needs to know a little about achievable IO and processing rates and do some simple calculations. In short, an RDBMS is not the right tool for this sort of processing.
In my experience Perl can do a regexp scan at roughly millions of rows per second. I don't know how fast you can dump the data from the database (MySQL can do up to 200k rows/s, so you could dump all your keywords in 2.5 min; I know that Oracle is much worse here, but I hope it is not more than ten times worse, i.e. 25 min). If your keywords average 20 chars, your dump will be 600MB; for 100 chars it is 3GB. That means that with a slow 100MB/s HD your IO will take from 6s to 30s. (All the IO involved is sequential!) That is almost nothing compared with the time of the dump and the processing in Perl. Your scan can slow down to 100k/s depending on the number of keywords you would like to remove (I have seen this speed with a regexp of 500 branching patterns), so you can process the dumped data in about 5 minutes. If the resulting cardinality is not huge (in the tens of hundreds), output IO should not be a problem. Either way, your processing should take minutes, not hours. If you generate the whole keyword values for deletion, the delete operation can use an index, so you would generate a series of DELETE FROM <table> WHERE keyword IN (...) statements, each stuffed with as many keywords to remove as the maximal length of an SQL statement allows. You can also try the variant where you upload this data to a temporary table and then use a join. I don't know which would be faster in Oracle. It would take about 10 minutes in MySQL. You are unlucky that you have to deal with Oracle, but you should be able to remove hundreds of {term}s in less than an hour.
P.S.: I would recommend using something with better regular expressions, like http://code.google.com/p/re2/ or the new binary module in Erlang R14A, but Perl's weaker regexp engine would not be the weak point in this task; the RDBMS would be.
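The dump-and-scan approach can be sketched roughly as follows. This uses Python's `re` module in place of Perl, implements the TOKEN policy from the question, and all function names are illustrative; the actual dump from the database and the execution of the DELETE statements are left out.

```python
import re


def build_token_pattern(terms):
    """Compile one regex that matches any term as a whole space-delimited token run."""
    # Longest terms first, so a longer alternative is tried before its own prefix.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term.lower()) for term in ordered)
    return re.compile(r"(?:^| )(?:" + alternation + r")(?: |$)")


def find_keywords_to_delete(keywords, terms):
    """Scan the dumped keywords once, testing all terms in a single pass."""
    pattern = build_token_pattern(terms)
    return [kw for kw in keywords if pattern.search(kw.lower())]


def batches(items, size):
    """Split the matches into chunks, one chunk per DELETE ... WHERE keyword IN (...)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Tiny stand-in for the dumped keyword list.
keywords = ["holiday inn express", "cheap holiday inn", "holiday innovations", "best western"]
to_delete = find_keywords_to_delete(keywords, ["holiday inn", "western"])
delete_batches = list(batches(to_delete, 2))
```

Each batch would then be bound as parameters to one DELETE statement, so the delete itself can use the index on the keyword column.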

I think the problem is one of how the keywords are stored. If I'm interpreting your code correctly, the KEYWORD column is made up of a string of blank-separated keyword values, such as
KEYWORD1 KEYWORD2 KEYWORD3
Because of this you're forced to use LIKE to do your searches, and that's probably the source of the slowness.
Although I realize this may be somewhat painful, it might be better to create a second table, perhaps called KEYWORDS, which would contain the individual keywords which relate to a given base table record (I'll refer to the base table as PPC since I don't know what it's really called). Assuming that your current base table looks like this:
CREATE TABLE PPC
(ID_PPC NUMBER PRIMARY KEY,
KEYWORD VARCHAR2(1000),
<other fields>...);
What you could do would be to rebuild the tables as follows:
CREATE TABLE NEW_PPC
(ID_PPC NUMBER PRIMARY KEY,
<other fields>...);
CREATE TABLE NEW_PPC_KEYWORD
(ID_NEW_PPC NUMBER,
KEYWORD VARCHAR2(25), -- or whatever is appropriate for a single keyword
PRIMARY KEY (ID_NEW_PPC, KEYWORD));
CREATE INDEX NEW_PPC_KEYWORD_1
ON NEW_PPC_KEYWORD(KEYWORD);
You'd populate the NEW_PPC_KEYWORD table by pulling out the individual keywords from the old PPC.KEYWORD field, putting them into the NEW_PPC_KEYWORD table. With only one keyword in each record in NEW_PPC_KEYWORD you could now use a simple join to pull all the records in NEW_PPC which had a keyword by doing something like
SELECT P.*
FROM NEW_PPC P
INNER JOIN NEW_PPC_KEYWORD K
ON (K.ID_NEW_PPC = P.ID_PPC)
WHERE K.KEYWORD = '<whatever>';
Share and enjoy.

The info is insufficient to give any concrete advice. If the expensive LIKE matching is unavoidable then the only thing I see at the moment is this:
Currently, this process can take hours/days to process depending on the number of keywords in the list of keywords to search for.
Have you tried to cache the results of the queries in a table? Keyed by the input keyword?
Because I do not believe that the whole data set, all the keywords, can change overnight. And since they do not change very often, it makes sense to simply keep the results precomputed in an extra table, so that future queries for the keyword can be resolved via the cache instead of going over the 30 million entries again. Obviously, some sort of periodic maintenance has to be done on the cache table: when keywords are modified or deleted and when the lists are modified, the cache entries have to be updated or recomputed. To simplify the update, one would also keep in the cache table the IDs of the original rows in the keyword_list table which contributed to the results.
To the UPDATE: Insert data into the keyword_list table already lower-cased. Use an extra column if the original case is needed later.
In the past I participated in the design of an ad system. I do not remember all the details, but the most striking difference is that we were tokenizing everything and giving every unique word an id. And keywords were not free form: they were also in a DB table and were also tokenized. So we never actually matched the keywords as strings; queries were like:
select AD.id
from DICT, AD
where
DICT.word = :input_word and
DICT.word_id = AD.word_id
DICT is a table with words and AD (analogue of your keyword_list) with the words from ads.
Essentially, one can summarize the problem you are experiencing as a "full table scan". This is a pretty common issue, often highlighting a poor design of the data layout. Search the net for more information on what can be done. SO has many entries too.

Your explain plan says this query should take a minute, but it's actually taking hours? A simple test on my home PC verifies that a minute seems reasonable for this query. And on a server with some decent IO this should probably only take a few seconds.
Is the problem that you're running the same query dozens of times sequentially for different keywords? If so, you need to combine all the searches together so you only scan the table once.

You could look into Oracle Text indexing. It is designed to support the kind of in-text search you are talking about.

My advice is to raise the cache size to hundreds of GB. Throw hardware at it. If you can't, build a Beowulf cluster or build a binary space search engine.

How can I search for multiple terms in multiple table columns?

I have a table that lists people and all their contact info. I want for users to be able to perform an intelligent search on the table by simply typing in some stuff and getting back results where each term they entered matches at least one of the columns in the table. To start I have made a query like
SELECT * FROM contacts WHERE
firstname LIKE '%Bob%'
OR lastname LIKE '%Bob%'
OR phone LIKE '%Bob%' OR
...
But now I realize that this will completely fail on something as simple as 'Bob Jenkins', because it is not smart enough to search for the first and last name separately. What I need to do is split up the search terms, search for them individually, and then intersect the results from each term somehow. At least that seems like the solution to me. But what is the best way to go about it?
I have heard about fulltext and MATCH()...AGAINST() but that sounds like a rather fuzzy search and I don't know how much work it is to set up. I would like precise yes or no results with reasonable performance. The search needs to be done on about 20 columns by 120,000 rows. Hopefully users wouldn't type in more than two or three terms.
Oh sorry, I forgot to mention I am using MySQL (and PHP).
I just figured out fulltext search and it is a cool option to consider (is there a way to adjust how strict it is? LIMIT would just chop off the results regardless of how well they matched). But this requires a fulltext index, and my website is using a view, and you can't index a view, right? So...

I would suggest using MATCH / AGAINST. Full-text searches are more advanced searches, more like Google's, less elementary.
It can match across multiple tables and rank them to how many matches they have.
Otherwise, if the word is there at all, esp. across multiple tables, you have no ranking. You can do ranking server-side, but that is going to take more programming/time.
Depending on what database you're using, the ability to do cross columns can become more or less difficult. You probably don't want to do 20 JOINs as that will be a very slow query.
There are also engines such as Sphinx and Lucene dedicated to do these types of searches.
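For the plain LIKE route described in the question (split the input into terms, require every term to match at least one column), the query can be generated rather than written by hand. The sketch below uses Python's sqlite3 as a stand-in for MySQL; the three-column `COLUMNS` list is an assumed subset of the real 20 columns, and `build_search` is a hypothetical helper.

```python
import sqlite3

# Assumed subset of the searchable columns; the real table has about 20.
COLUMNS = ["firstname", "lastname", "phone"]


def build_search(terms):
    """Build a parameterized query: every term must match at least one column."""
    if not terms:
        raise ValueError("at least one search term is required")
    per_term = "(" + " OR ".join(f"{col} LIKE ?" for col in COLUMNS) + ")"
    where = " AND ".join(per_term for _ in terms)
    # One bound parameter per (term, column) pair, in the same order as the SQL.
    params = [f"%{term}%" for term in terms for _ in COLUMNS]
    sql = f"SELECT firstname, lastname FROM contacts WHERE {where} ORDER BY id"
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO contacts (firstname, lastname, phone) VALUES (?, ?, ?)",
    [
        ("Bob", "Jenkins", "555-0100"),
        ("Bob", "Smith", "555-0101"),
        ("Alice", "Jenkins", "555-0102"),
    ],
)

sql, params = build_search("Bob Jenkins".split())
results = conn.execute(sql, params).fetchall()
```

The AND between the per-term groups performs the "intersection" without separate queries. Every LIKE still has a leading wildcard, so this scans the table, which may be acceptable at 120,000 rows but will not scale much further.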

BOOLEAN MODE
SELECT * FROM contacts WHERE
MATCH(firstname,lastname,email,webpage,country,city,street...)
AGAINST('+bob +jenkins' IN BOOLEAN MODE)
Boolean mode is very powerful. It might even fulfil all my needs. I will have to do some testing. By placing + in front of the search terms those terms become required. (The row must match 'bob' AND 'jenkins' instead of 'bob' OR 'jenkins'). This mode even works on non-indexed columns, and thus I can use it on a view although it will be slower (that is what I need to test). One final problem I had was that it wasn't matching partial search terms, so 'bob' wouldn't find 'bobby' for example. The usual % wildcard doesn't work, instead you use an asterisk *.
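Turning raw user input into the '+term*' form described above is a small string transformation. A possible sketch follows; `to_boolean_query` is a made-up helper name, and keeping only word characters is one simple way (an assumption, not the only option) to stop users from injecting their own boolean operators.

```python
import re


def to_boolean_query(user_input, prefix_match=True):
    """Turn free text into a boolean-mode string where every word is required."""
    # Keep only word characters so stray operators in the input are dropped.
    words = re.findall(r"\w+", user_input.lower())
    suffix = "*" if prefix_match else ""
    return " ".join(f"+{word}{suffix}" for word in words)


query = to_boolean_query("Bob Jenkins")
exact_query = to_boolean_query("Bob Jenkins", prefix_match=False)
```

The resulting string would be passed as a bound parameter to AGAINST(... IN BOOLEAN MODE) rather than concatenated into the SQL.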

How-to: Ranking Search Results

I have a webapp development problem that I've developed one solution for, but am trying to find other ideas that might get around some performance issues I'm seeing.
problem statement:
- a user enters several keywords/tokens
- the application searches for matches to the tokens
  - need one result for each token
  - i.e., if an entry has 3 tokens, I need the entry id 3 times
- rank the results
  - assign X points for each token match
  - sort the entry ids based on points
  - if point values are the same, use date to sort results
What I want to be able to do, but have not figured out, is to send 1 query that returns something akin to the results of an in(), but returns a duplicate entry id for each token matches for each entry id checked.
Is there a better way to do this than what I'm doing, of using multiple, individual queries running one query per token? If so, what's the easiest way to implement those?
edit
I've already tokenized the entries, so, for example, "see spot run" has an entry id of 1, and three tokens, 'see', 'spot', 'run', and those are in a separate token table, with entry ids relevant to them so the table might look like this:
'see', 1
'spot', 1
'run', 1
'run', 2
'spot', 3

You could achieve this in one query using 'UNION ALL' in MySQL.
Just loop through the tokens in PHP, creating a UNION ALL for each token.
E.g. if the tokens are 'x', 'y' and 'z', your query may look something like this:
SELECT * FROM `entries`
WHERE token like "%x%" union all
SELECT * FROM `entries`
WHERE token like "%y%" union all
SELECT * FROM `entries`
WHERE token like "%z%" ORDER BY score etc...
The order clause should operate on the entire result set as one, which is what you need.
In terms of performance it won't be all that fast (I'm guessing), however with databases the main overhead in terms of speed is often sending the query to the database engine from PHP and receiving the results. With this technique this only happens once instead of once per token, so performance will increase, I just don't know if it'll be enough.

I know this isn't strictly an answer to the question you're asking, but if your table is thousands rather than millions of rows, then a FULLTEXT solution might be the best way to go here.
In MySQL, when you use MATCH on your indexed column, each keyword you supply will be given a relevance score (calculated roughly by the number of times each keyword was mentioned) that will be more accurate than your method and certainly more efficient for multiple keywords.
See here:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

If you're using the UNION ALL pattern you may also want to include the following parts to your query:
SELECT COUNT(*) AS C
...
GROUP BY ID
ORDER BY c DESC
While this is a really trivial example it does get you the frequency of the matches for each result and this could be a pseudo rank to start with.
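Since the question already has a separate token table, the COUNT/GROUP BY idea can be applied to it directly with a single IN query instead of one query per token. The sketch below runs on Python's sqlite3 rather than MySQL; the `entries` table with a `created` date and the `POINTS_PER_MATCH` constant are assumptions added to illustrate the date tie-break.

```python
import sqlite3

POINTS_PER_MATCH = 1  # the "X points" from the question; the value is arbitrary here

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, created TEXT)")
conn.execute("CREATE TABLE tokens (token TEXT, entry_id INTEGER)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [(1, "2010-01-01"), (2, "2010-01-02"), (3, "2010-01-03")],
)
# Token rows taken from the example in the question.
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?)",
    [("see", 1), ("spot", 1), ("run", 1), ("run", 2), ("spot", 3)],
)


def ranked_entries(search_tokens):
    """One query: count matching tokens per entry, then break ties by newest date."""
    placeholders = ", ".join("?" for _ in search_tokens)
    sql = (
        "SELECT t.entry_id, COUNT(*) * ? AS points "
        "FROM tokens t JOIN entries e ON e.id = t.entry_id "
        f"WHERE t.token IN ({placeholders}) "
        "GROUP BY t.entry_id, e.created "
        "ORDER BY points DESC, e.created DESC"
    )
    return conn.execute(sql, [POINTS_PER_MATCH, *search_tokens]).fetchall()


ranking = ranked_entries(["spot", "run"])
```

Entry 1 matches both tokens and ranks first; entries 2 and 3 tie on points, so the newer entry 3 comes before entry 2.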

You'll probably get much better performance if you used a data structure designed for search tasks rather than a database. For example, you might try looking at building an inverted index. Rather than writing it yourself, however, you might also want to look into something like Lucene, which does most of the work for you.
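To show what an inverted index is in its simplest form, here is a toy in-memory version in Python. It is only a sketch of the data structure (a map from token to the set of entry ids containing it), not how Lucene is implemented, and the function names are illustrative.

```python
from collections import defaultdict


def build_inverted_index(entries):
    """Map each token to the set of entry ids whose text contains it."""
    index = defaultdict(set)
    for entry_id, text in entries.items():
        for token in text.lower().split():
            index[token].add(entry_id)
    return index


def search(index, query):
    """Score each entry by how many query tokens it contains, best first."""
    scores = defaultdict(int)
    for token in query.lower().split():
        for entry_id in index.get(token, ()):
            scores[entry_id] += 1
    # Sort by score descending, then by entry id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


entries = {1: "see spot run", 2: "run", 3: "spot"}
index = build_inverted_index(entries)
results = search(index, "spot run")
```

A lookup touches only the postings of the query tokens, never the full set of entries, which is why this structure suits search better than scanning rows.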
