I have a MySQL database table with people's names, containing thousands of rows.
I'm coding a search script for this table to display the most similar names stored in it.
So I thought of fetching ALL the rows of the table, then using a foreach loop that calls similar_text() (a function that returns a similarity percentage) and displaying only the names that match with at least 60% similarity.
Will my website's performance slow down too much if I do this (fetching all rows)?
Will my server bandwidth suffer because of that?
PS: MySQL's 'SOUNDS LIKE' operator doesn't help much in this case.
Let the database do the searching.
See this question; it looks like what you need: How to find similar results and sort by similarity?
Yes, this will most likely slow down your site, especially as your site grows and you have many users searching simultaneously.
If possible, use a stored procedure or user-defined function inside the database to do the searching. Also, even if you don't know the exact spelling of the entry you are looking for, if you know the first letter you can speed up the search. You can use something like WHERE name LIKE 'F%' AND similar_text(name, 'FOOBAR') > 0.6 (with similar_text standing in here for a database-side UDF, since the PHP function itself isn't available in MySQL), because then an index can be used to find only those rows that start with F.
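A minimal sketch of that idea, assuming a people table with an indexed name column and a database-side similarity() UDF returning a 0-1 score (both the table name and the UDF are illustrative, not from the question):

CREATE INDEX idx_people_name ON people (name);

-- The prefix narrows the candidate set via the index; the hypothetical
-- similarity() UDF (or a PHP-side similar_text() pass) then only has to
-- look at those few rows.
SELECT name
FROM people
WHERE name LIKE 'F%'
  AND similarity(name, 'FOOBAR') > 0.6;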
I am a junior PHP developer, growing day by day, and I'm stuck on a performance problem described here:
I am making a search engine in PHP. My database has one table with 41 columns and millions of rows, so obviously it is a very large dataset. In index.php I have a form for searching data. When the user enters a search keyword and hits submit, the request goes to search.php, which shows the results. The query looks like this:
SELECT * FROM TABLE WHERE product_description LIKE '%mobile%' ORDER BY id ASC LIMIT 10
This is the first query. After the results are shown I have to run 4 other queries like these:
SELECT DISTINCT(weight_u) as weight from TABLE WHERE product_description LIKE '%mobile%'
SELECT DISTINCT(country_unit) as country_unit from TABLE WHERE product_description LIKE '%mobile%'
SELECT DISTINCT(country) as country from TABLE WHERE product_description LIKE '%mobile%'
SELECT DISTINCT(hs_code) as hscode from TABLE WHERE product_description LIKE '%mobile%'
These queries are for FILTERS. The problem is that when I hit the search button, all of these queries run, at a heavy performance cost; it is very slow.
Is there any faster method to fetch weight, country, country_unit and hs_code, and how can I achieve it?
The same functionality is implemented here, where the filter bar appears after the table is filled with data. How can I achieve that? Please help.
The full functionality is implemented here.
I have tried to explain my full problem; if there is any mistake please let me know and I will improve the question. I am also new to Stack Overflow.
Firstly, are you sure this code is working as you expect? The first query retrieves only 10 records matching your search term, but the next 4 filter queries run over every matching row, so it's entirely possible that you will get filter values back which do not appear in the first query's results, and the filter might not make sense.
If that's what you intend, I would create the filter values in your client code (PHP): finding the unique values in 10 records is quick and easy, and it reduces the number of database round trips.
Finally, the biggest improvement you can make is to use MySQL's fulltext searching features. The reason your app is slow is that your search terms cannot use an index - you're wild-carding the start as well as the end. It's like searching the phonebook for people whose name contains "ishra" - you have to look at every record to check for a match. Fulltext search indexes are designed for this - they also help with fuzzy matching.
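A minimal sketch of that approach, assuming the table is called products (the question only calls it TABLE) and a MySQL version whose storage engine supports FULLTEXT indexes on the column:

ALTER TABLE products ADD FULLTEXT INDEX ft_desc (product_description);

-- Uses the fulltext index instead of scanning every row:
SELECT * FROM products
WHERE MATCH(product_description) AGAINST('mobile')
ORDER BY id ASC LIMIT 10;

-- The filter queries can use the same index:
SELECT DISTINCT country FROM products
WHERE MATCH(product_description) AGAINST('mobile');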
I'll give you some tips that will prove useful in many situations when querying a large dataset, or almost any dataset.
Listing the fields you want instead of querying for '*' is better practice. The cost of '*' grows as you have more columns and more rows.
Always try to use the PKs to look up the data. The more specific the filter, the less it will cost.
An index in this kind of situation would come in pretty handy, as it will make the search faster.
LIKE queries are generally pretty slow and resource heavy, and even more so in your situation. So again, the more specific you are, the better it will get.
I'd also add that if you just want to retrieve the same data from this table again and again, a VIEW might fit nicely.
Those are just some tips that came to my mind to ease your problem.
Hope it helps.
I have a dilemma that I'm trying to solve right now. I have a table called "generic_pricing" that has over a million rows. It looks like this....
I have a list of 25000 parts that I need to get generic_pricing data for. Some parts have a CLEI, some have a partNumber, and some have both. For each of the 25000 parts, I need to search the generic_pricing table to find all rows that match either clei or partNumber.
Making matters more difficult, I have to do matches based on substring searches. For example, one of my parts may have a CLEI of "IDX100AB01", but I need the results of a query like:
SELECT * FROM generic_pricing WHERE clei LIKE 'IDX100AB%';
Currently, my lengthy PHP code for finding these matches uses the following logic to loop through the 25000 items: for each item, I use the query above on clei. If a match is found, I use that row for my calculations. If not, I execute a similar query on partNumber to try to find the matches.
As you can imagine, this is very time consuming. And this has to be done for about 10 other tables similar to generic_pricing to run all of the calculations. The system is now bogging down and timing out trying to crunch all of this data. So now I'm trying to find a better way.
One thought I have is to just query the database one time to get all rows, and then use loops to find matches. But for 25000 items each having to compare against over a million rows, that just seems like it would take even longer.
Another thought I have is to get 2 associative arrays of all of the generic_pricing data. i.e. one array of all rows indexed by clei, and another all indexed by partNumber. But since I am looking for substrings, that won't work.
I'm at a loss here for an efficient way to handle this task. Is there anything that I'm overlooking to simplify this?
Do not query the db for all rows and sort them in your app. That will cause a lot more headaches.
Here are a few suggestions:
Use parameterized queries. This allows your db engine to compile the query once and use it multiple times. Otherwise it will have to optimize and compile the query each time.
Figure out a way to make IN work. Instead of using LIKE, try something like ... left(clei,8) IN ('IDX100AB','IDX100AC','IDX101AB'...)
Do the calculations/math on the db side. Build a stored proc which takes a list of part/clei numbers and outputs the same list with the computed prices. You'll have a lot more control of execution and a lot less network overhead. If not a stored proc, build a view.
Paginate. If this data is being displayed somewhere, switch to processing in batches of 100 or less.
Build a cheat sheet. If speed is an issue, try precomputing prices into a separate table nightly, including some partial CLEI/part numbers if needed, then use the precomputed lookup table (see the sketch after this list).
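A rough sketch of that cheat-sheet idea in MySQL; the price column and the 8-character prefix length are assumptions, since the question elides the generic_pricing structure:

-- Precomputed lookup keyed by a fixed-length CLEI prefix (rebuilt nightly).
CREATE TABLE price_lookup (
  clei_prefix CHAR(8) NOT NULL,
  part_number VARCHAR(50),
  price       DECIMAL(10,2),          -- assumed pricing column
  KEY idx_clei_prefix (clei_prefix),
  KEY idx_part_number (part_number)
);

INSERT INTO price_lookup (clei_prefix, part_number, price)
SELECT LEFT(clei, 8), partNumber, price
FROM generic_pricing;

-- Each of the 25000 per-part searches becomes an index seek instead of a LIKE scan:
SELECT price FROM price_lookup WHERE clei_prefix = 'IDX100AB';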
I have a large table of about 14 million rows. Each row contains a block of text. I also have another table with about 6000 rows, and each row has a word and six numerical values for that word. I need to take each block of text from the first table, find the number of times each word from the second table appears in it, then calculate the mean of the six values for each block of text and store it.
I have a Debian machine with an i7 and 8 GB of memory, which should be able to handle it. At the moment I am using the PHP substr_count() function. However, PHP just doesn't feel like it's the right solution for this problem. Other than working around time-out and memory limit problems, does anyone have a better way of doing this? Is it possible to use just SQL? If not, what would be the best way to execute my PHP without overloading the server?
Do each record from the 'big' table one at a time. Load that single 'block' of text into your program (PHP or whatever), do the searching and calculation, then save the appropriate values wherever you need them.
Do each record as its own transaction, in isolation from the rest. If you are interrupted, use the saved values to determine where to start again.
Once you are done with the existing records, you only need to do this in the future when you enter or update a record, so it's much easier. You just need to take the big bite right now to get the existing data updated.
What are you trying to do exactly? If you are trying to create something like a search engine with a weighting function, you should maybe drop that and instead use the MySQL fulltext search functions and indexes that are already there. If you still need this specific solution, you can of course do it completely in SQL. You can do it in one query, or with a trigger that runs each time after a row is inserted or updated. You won't be able to get this done properly in PHP without jumping through a lot of hoops.
To give you a specific answer, we indeed would need more information about the queries, data structures and what you are trying to do.
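As an illustration of the "completely in SQL" route, here is a hedged sketch that counts how often each dictionary word occurs in each text block using the REPLACE trick; the table and column names (blocks.body, words.word) are assumptions, not from the question:

-- Occurrences of every word in every block (heavy over 14M rows, so in
-- practice you would run it per block or from a trigger).
SELECT b.id,
       w.word,
       (CHAR_LENGTH(b.body) - CHAR_LENGTH(REPLACE(b.body, w.word, '')))
         / CHAR_LENGTH(w.word) AS occurrences
FROM blocks b
CROSS JOIN words w;

Grouping that result by b.id and weighting each word's six values by occurrences would give the per-block means the question asks for.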
Redesign it.
If size on disk is not important, just join the tables into one.
Put the 6000-row table into memory (a MEMORY table) and make a backup every hour:
INSERT IGNORE into back.table SELECT * FROM my.table;
Create your "own" index in the big table, e.g. add a "name index" column to the big table holding the id of the matching row.
More info about the query is needed to find a solution.
The query I'd like to speed up (or replace with another process):
UPDATE en_pages, keywords
SET en_pages.keyword = keywords.keyword
WHERE en_pages.keyword_id = keywords.id
Table en_pages has the proper structure but only has non-unique page_ids and keyword_ids in it. I'm trying to add the actual keywords (strings) to this table where they match keyword_ids. There are 25 million rows in table en_pages that need updating.
I'm adding the keywords so that this one table can be queried in real time and return keywords (the join is obviously too slow for "real time").
We apply this query (and some others) to sub-units of our larger dataset. We do this frequently to create custom interfaces for specific sub-units of our data for different user groups (sorry if that's confusing).
This all works fine if you give it an hour to run, but I'm trying to speed it up.
Is there a better way to do this that would be faster using PHP and/or MySQL?
I actually don't think you can speed up the process.
You can still throw raw power at your database by clustering new servers.
Maybe I'm wrong or misunderstood the question, but...
Couldn't you use TRIGGERS?
Like... when a new INSERT is detected on "en_pages", doing an UPDATE afterwards on that same row?
(I don't know how frequent INSERTS are in that table)
This is just an idea.
How often do "en_pages.keyword" and "en_pages.keyword_id" change after being inserted?
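A minimal sketch of that idea. MySQL will not let an AFTER INSERT trigger update the row it has just inserted into the same table, so the natural way to express it is a BEFORE INSERT trigger that fills the column up front:

DELIMITER //
CREATE TRIGGER en_pages_fill_keyword
BEFORE INSERT ON en_pages
FOR EACH ROW
BEGIN
  -- Copy the keyword string over from the keywords table at insert time.
  SET NEW.keyword = (SELECT keyword FROM keywords WHERE id = NEW.keyword_id);
END//
DELIMITER ;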
I don't know about MySQL, but usually this sort of thing runs faster in SQL Server if you process a limited number of records (say 1,000) at a time, in batches in a loop.
You might also consider a WHERE clause (I don't know what MySQL uses for "not equal to", so I used the SQL Server version):
WHERE en_pages.keyword <> keywords.keyword
That way you are only updating records that have a difference in the field you are updating, not all of them.
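In MySQL the same batching idea can be written as a join UPDATE over ranges of keyword_id (the boundaries below are arbitrary examples), combined with that check; MySQL accepts <> for "not equal to" as well:

UPDATE en_pages
JOIN keywords ON en_pages.keyword_id = keywords.id
SET en_pages.keyword = keywords.keyword
WHERE en_pages.keyword_id BETWEEN 1 AND 100000
  AND (en_pages.keyword <> keywords.keyword OR en_pages.keyword IS NULL);
-- repeat with the next keyword_id range until all 25 million rows are covered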
I currently have a project where we are dealing with 30million+ keywords for PPC advertising. We maintain these lists in Oracle. There are times where we need to remove certain keywords from the list. The process includes various match-type policies to determine if the keywords should be removed:
EXACT: WHERE keyword = '{term}'
CONTAINS: WHERE keyword LIKE '%{term}%'
TOKEN: WHERE keyword LIKE '% {term} %' OR keyword LIKE '{term} %' OR keyword LIKE '% {term}'
Now, when a list is processed, it can only use one of the match-types listed above. But, all 30mil+ keywords must be scanned for matches, returning the results for the matches. Currently, this process can take hours/days to process depending on the number of keywords in the list of keywords to search for.
Do you have any suggestions on how to optimize the process so this will run much faster?
UPDATE:
Here is an example query to search for Holiday Inn:
SELECT * FROM keyword_list
WHERE
(
lower(text) LIKE 'holiday inn' OR
lower(text) LIKE '% holiday inn %' OR
lower(text) LIKE 'holiday inn %'
);
Here is the pastebin for the output of EXPLAIN: http://pastebin.com/tk74uhP4
Some additional information that may be useful. A keyword can consist of multiple words like:
this is a sample keyword
i like my keywords
keywords are great
Never use a LIKE match starting with "%" on large sets of data - it cannot use the table index on that field and will do a table scan. This is your source of slowness.
The only matches that can use the index are the ones starting with a hardcoded string (e.g. keyword LIKE '{term} %').
To work around this problem, create a new indexing table (not to be confused with the database's table index) mapping individual terms to the keyword strings containing those terms; then your keyword LIKE '% {term} %' becomes t1.keyword = index_table.keyword and index_table.term = "{term}".
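A rough sketch of such an indexing table in Oracle terms; the names are illustrative and it assumes keyword_list has an id primary key (the question only shows its text column):

CREATE TABLE keyword_terms (
  term       VARCHAR2(100),
  keyword_id NUMBER          -- id of the keyword_list row containing this term
);
CREATE INDEX keyword_terms_term_idx ON keyword_terms (term);

-- keyword LIKE '% {term} %' becomes an indexed equality lookup plus a join:
SELECT k.*
FROM keyword_list k
JOIN keyword_terms t ON t.keyword_id = k.id
WHERE t.term = 'holiday';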
I know that my approach can look like heresy to RDBMS guys, but I have verified it many times in practice and there is no magic. One just needs to know a little bit about possible IO and processing rates and do some simple calculations. In short, an RDBMS is not the right tool for this sort of processing.
From my experience, perl can do regexp scans at roughly millions of rows per second. I don't know how fast you can dump from your database (MySQL can do up to 200k rows/s, so you can dump all your keywords in 2.5 minutes; I know Oracle is much worse here, but I hope it is not more than ten times slower, i.e. 25 minutes). If your rows average 20 chars your dump will be 600 MB; for 100 chars it is 3 GB. It means that with a slow 100 MB/s HD your IO will take from 6 s to 30 s. (All involved IO is sequential!) That is almost nothing in comparison with the time of the dump and the processing in perl. Your scan can slow down to 100k rows/s depending on the number of keywords you would like to remove (I have seen this speed with a regexp of 500 branching patterns), so you can process the resulting data in less than 5 minutes. If the resulting cardinality is not huge (in the tens of hundreds), the output IO should not be a problem.
Anyway, your processing should take minutes, not hours. If you generate the whole keyword values for deletion, you can use an index in the delete operation, so you would generate a series of DELETE FROM <table> WHERE keyword IN (...) statements stuffed with keywords to remove, up to the maximal length of an SQL statement. You can also try a variant where you upload this data to a temporary table and then use a join. I don't know which would be faster in Oracle; it would take about 10 minutes in MySQL. You are unlucky that you have to deal with Oracle, but you should still be able to remove hundreds of {term}s in less than an hour.
P.S.: I would recommend using something with better regular expressions, like http://code.google.com/p/re2/ (included in V8, aka node.js) or the new binary module in Erlang R14A, but the weak regexp engine in perl would not be the weak point in this task; the RDBMS would be.
I think the problem is one of how the keywords are stored. If I'm interpreting your code correctly, the KEYWORD column is made up of a string of blank-separated keyword values, such as
KEYWORD1 KEYWORD2 KEYWORD3
Because of this you're forced to use LIKE to do your searches, and that's probably the source of the slowness.
Although I realize this may be somewhat painful, it might be better to create a second table, perhaps called KEYWORDS, which would contain the individual keywords that relate to a given base table record (I'll refer to the base table as PPC since I don't know what it's really called). Assuming that your current base table looks like this:
CREATE TABLE PPC
(ID_PPC NUMBER PRIMARY KEY,
KEYWORD VARCHAR2(1000),
<other fields>...);
What you could do would be to rebuild the tables as follows:
CREATE TABLE NEW_PPC
(ID_PPC NUMBER PRIMARY KEY,
<other fields>...);
CREATE TABLE NEW_PPC_KEYWORD
(ID_NEW_PPC NUMBER,
KEYWORD VARCHAR2(25), -- or whatever is appropriate for a single keyword
PRIMARY KEY (ID_NEW_PPC, KEYWORD));
CREATE INDEX NEW_PPC_KEYWORD_1
ON NEW_PPC_KEYWORD(KEYWORD);
You'd populate the NEW_PPC_KEYWORD table by pulling the individual keywords out of the old PPC.KEYWORD field (a sketch of that population step follows the query below). With only one keyword in each NEW_PPC_KEYWORD record you could then use a simple join to pull all the records in NEW_PPC which have a given keyword, by doing something like
SELECT P.*
FROM NEW_PPC P
INNER JOIN NEW_PPC_KEYWORD K
ON (K.ID_NEW_PPC = P.ID_PPC)
WHERE K.KEYWORD = '<whatever>';
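For the population step mentioned above, one hedged possibility in Oracle is the REGEXP_SUBSTR / CONNECT BY idiom for splitting the blank-separated KEYWORD string into one row per keyword:

-- One output row per blank-separated keyword in each PPC row.
INSERT INTO NEW_PPC_KEYWORD (ID_NEW_PPC, KEYWORD)
SELECT DISTINCT ID_PPC,
       REGEXP_SUBSTR(KEYWORD, '[^ ]+', 1, LEVEL)
  FROM PPC
CONNECT BY REGEXP_SUBSTR(KEYWORD, '[^ ]+', 1, LEVEL) IS NOT NULL
   AND PRIOR ID_PPC = ID_PPC
   AND PRIOR SYS_GUID() IS NOT NULL;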
Share and enjoy.
The info is insufficient to give any concrete advice. If the expensive LIKE matching is unavoidable then the only thing I see at the moment is this:
Currently, this process can take hours/days to process depending on the number of keywords in the list of keywords to search for.
Have you tried to cache the results of the queries in a table? Keyed by the input keyword?
Because I do not believe that the whole data set (all keywords) can change overnight. And since they do not change very often, it makes sense to simply keep the results precomputed in an extra table so that future queries for the keyword can be resolved via the cache instead of going over the 30M entries again. Obviously, some sort of periodic maintenance has to be done on the cache table: when keywords are modified or deleted, and when the lists are modified, the cache entries have to be recomputed. To simplify the update, one would also keep in the cache table the IDs of the original rows in the keyword_list table which contributed the results.
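One hedged shape for such a cache table in Oracle (names and types are illustrative, and it assumes keyword_list rows have an id):

CREATE TABLE keyword_match_cache (
  search_term     VARCHAR2(200),
  keyword_list_id NUMBER,       -- id of the contributing keyword_list row
  CONSTRAINT keyword_match_cache_pk PRIMARY KEY (search_term, keyword_list_id)
);

-- A repeated search becomes an indexed read instead of a 30M-row scan:
SELECT keyword_list_id
  FROM keyword_match_cache
 WHERE search_term = 'holiday inn';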
Regarding the UPDATE: insert data into the keyword_list table already lower-cased. Use an extra column if the original case is needed later.
In the past I have participated in the design of an ad system. I do not remember all the details, but the most striking difference is that we tokenized everything and gave every unique word an id. Keywords were not free-form either - they were also in a DB table and also tokenized. So we never actually matched keywords as strings; queries looked like:
select AD.id
from DICT, AD
where
DICT.word = :input_word and
DICT.word_id = AD.word_id
DICT is a table with words, and AD (the analogue of your keyword_list) holds the words from the ads.
Essentially one can summarize the problem you experience as a "full table scan". This is a pretty common issue, often highlighting a poor data layout design. Search the net for more information on what can be done. SO has many entries too.
Your explain plan says this query should take a minute, but it's actually taking hours? A simple test on my home PC verifies that a minute seems reasonable for this query. And on a server with some decent IO this should probably only take a few seconds.
Is the problem that you're running the same query dozens of times sequentially for different keywords? If so, you need to combine all the searches together so you only scan the table once.
You could look into Oracle Text indexing. It is designed to support the kind of in-text search you are talking about.
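For reference, a minimal sketch of what that can look like; the column name text comes from the example query, and the CTXSYS.CONTEXT index type ships with Oracle Text:

CREATE INDEX keyword_list_text_idx ON keyword_list (text)
  INDEXTYPE IS CTXSYS.CONTEXT;

SELECT *
  FROM keyword_list
 WHERE CONTAINS(text, 'holiday inn') > 0;

Note that a CONTEXT index is not maintained transactionally, so it needs periodic synchronization (for example with CTX_DDL.SYNC_INDEX) after new keywords are loaded.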
My advice is to raise the cache size to hundreds of GB - throw hardware at it. If you can't, build a Beowulf cluster or a binary space search engine.