I'm working on an application which is a large database of chemical substances (approx 250,000 but rising) and associated data. I'm looking at ways to optimise the way searching is performed.
The application is running under PHP 7.0.27, MariaDB 5.5.56, and Apache 2.4.6
The application allows searching by chemical name and various chemical codes (such as EC number and CAS number). The schema is such that there are separate tables to hold the data, and the relationships of which codes apply to which chemicals.
These tables are in the database:
substances - unique ID and name for each chemical substance.
ecs - a list of EC Numbers
ecs_substances - which EC Number(s) apply to which substances
cas - a list of CAS Numbers
cas_substances - which CAS Number(s) apply to which substances
Note: there are other tables than the ones above where similar logic will apply, but for now I want to focus on these for this example.
It is possible for a substance to have multiple EC/CAS numbers, and a small number do not have them - i.e. it's not a simple 1:1 relationship.
The application has search fields for the substance name (substances.name), EC number (ecs.value) CAS number (cas.value). These can be used on their own, or in conjuction with each other. For example: find a substance by name, or find a substance by name and CAS number.
I believe the "quickest" way of performing a search for any given value would be to use a LIKE condition on the specific table required. So if I want to look up substances which have "acids" as part of the name:
SELECT id FROM substances WHERE name LIKE '%acids%' LIMIT 0,250
However the results that the application gives are shown in a table which includes headings for substance name, CAS number, EC number. It also allows the results to be ordered on a column (e.g. order by substance name, CAS, EC, etc). This requires JOIN conditions.
I'm doing it like this:
$sql = 'SELECT
DISTINCT(substances.`id`),
substances.`name`,
"" AS cas_number,
"" AS ec_number
FROM
substances ';
// Search - EC Number, or if trying to order by EC column (JOIN has to occur to make that possible)
if ( (isset($search['ecNumber'])) || (isset($order['column']) && ($order['column'] == 'ec_number')) ) {
$sql .= ' LEFT JOIN ecs_substances ON substances.id = ecs_substances.substance_id LEFT JOIN ecs ON ecs_substances.ec_id = ecs.id ';
}
// Search - CAS Number, or if trying to order by CAS column (JOIN has to occur to make that possible)
if ( (isset($search['casNumber'])) || (isset($order['column']) && ($order['column'] == 'cas_number')) ) {
$sql .= ' LEFT JOIN cas_substances ON cas_substances.substance_id = substances.id LEFT JOIN cas ON cas_substances.cas_id = cas.id ';
}
The problem is that because of all the JOINs that are occurring it's slowing down how quickly the results can be obtained.
Benchmark: The first query I posted which just uses a LIKE condition on 1 table will execute in 140ms, whereas it's taking 506ms for the same search criteria with all of the JOIN statements in the second block of code.
I'd like to know if there are ways to optimise this such that the time taken to present results to the user decreases.
It's worth mentioning that the results are displayed in DataTables and PHP is producing a JSON feed of the results. The LIMIT 0,250 is something the end user can override by setting results per page, but I'm happy to limit them to say no more than 500 per page.
Some things I've looked into are:
Caching the JSON. Not a big fan of this because the data is updated quite regularly. The data presented must always be what is in the database, not some cached copy.
Do a search on the required table as in the first code sample. Update the other columns with ajax. This would "appear" to give instant results on the column the user has searched and then quickly thereafter populate the other columns required by the DataTable. This seems incredibly fiddly to do and I don't know whether it's really a good idea.
Consider FULLTEXT because it allows for much faster searching than LIKE with a leading wildcard %. `MATCH(col) AGAINST('+acid' IN BOOLEAN MODE)
Sounds like you need a "many:many" mapping table. Tips on efficiency in such: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
Consider using GROUP_CONCAT(cas) for provideing a comma-list of CASs.
JSON does not seem practical. And even less so since you are using only MySQL 5.5.
I think a response time of half a second is quite good, given what you want to do. You must have done all necessary database optimizations? (db type, indexes, etc).
There are several things you could explore:
Prepare all possible searches and store them in a database for quick access. This may sound stupid but this is how I often achieve fast searches. It's difficult for me to judge what the best way to do this, with your data, would be. You could start by adding a TEXT column to your substances table and store all the information about the substance in it: It's name, and all EC/CAS numbers. Separate the items with something like '|', or any other character not used in searches. I would call that the 'search' column. Alternatively you could make a new table, just for searching with that column in it, and the id of the substance. Now you can make one search input field for all three types of data and search in one column only. Would that work for you? Would it be faster? Possibly, but I cannot guarantee it. I don't know, but it's quite easy to try. There is a disadvantage: You would have to update that column with every change in the database.
Use a proper search engine. Several are available for mariadb. Start at: https://mariadb.com/kb/en/library/about-sphinxse It basically does something far more advanced than what I described under point 1: Prepare a database with data for optimized searching.
Still, a response of half a second would be something I could live with.
Related
I am working on big eCommerce shopping website. I have around 40 databases. i want to create search page which show 18 result after searching by title in all databases.
(SELECT id_no,offers,image,title,mrp,store from db1.table1 WHERE MATCH(title) AGAINST('$searchkey') AND title like '%$searchkey%')
UNION ALL (SELECT id_no,offers,image,title,mrp,store from db3.table3 WHERE MATCH(title) AGAINST('$searchkey') AND title like '%$searchkey%')
UNION ALL (SELECT id_no,offers,image,title,mrp,store from db2.table2 WHERE MATCH(title) AGAINST('$searchkey') AND title like '%$searchkey%')
LIMIT 18
currently i am using the above query its working fine for 4 or more character keyword search like laptop nokia etc but takes 10-15 sec for processes but for query with keyword less than 3 characters it takes 30-40sec or i end up with 500 internal server error. Is there any optimized way for searching in multiple databases. I generated two index primary and full text index with title
Currently my search page is in php i am ready to code in python or any
other language if i gets good speed
You can use the sphixmachine:http://sphinxsearch.com/. This is powerfull search for database. IMHO Sphinx this best decision
for search in your site.
FULLTEXT is not configured (by default) for searching for words less than three characters in length. You can configure that to handle shorter words by setting a ...min_token_size parameter. Read this. https://dev.mysql.com/doc/refman/5.7/en/fulltext-fine-tuning.html You can only do this if you control the MySQL server. It won't be possible on shared hosting. Try this.
FULLTEXT is designed to produce more false-positive matches than false-negative matches. It's generally most useful for populating dropdown picklists like the ones under the location field of a browser. That is, it requires some human interaction to choose the correct record. To expect FULLTEXT to be able to do absolutely correct searches is probably a bad idea.
You simply cannot use AND column LIKE '%whatever%' if you want any reasonable performance at all. You must get rid of that. You might be able to rewrite your python program to do something different when the search term is one or two letters, and thereby avoid many, but not all, LIKE '%a%' and LIKE '%ab%' operations. If you go this route, create ordinary indexes on your title columns. Whatever you do, don't combine the FULLTEXT and LIKE searches in a single query.
If this were my project I'd consider using a special table with columns like this to hold all the short words from the title column in every row of each table.
id_pk INT autoincrement
id_no INT
word VARCHAR(3)
Then you can use a query like this to look up short words
SELECT a.id_no,offers,image,title,mrp,store
FROM db1.table1 a
JOIN db1.table1_shortwords s ON a.id_no = s.id_no
WHERE s.word = '$searchkey'
To do this, you will have to preprocess the title columns of your other tables to populate the shortwords tables, and put an index on the word column. This will be fast, but it will require a special-purpose program to do the preprocessing.
Having to search multiple tables with your UNION ALL operation is a performance problem. You will be able to improve performance dramatically by redesigning your schema so you need search only one table.
Having to search databases on different server machines is a performance problem. You may be able to rig up your python program to search them in parallel: that is, to somehow use separate tasks to search each one, then aggregate the results. Each of those separate search tasks requires its own connection to the data base, so this is not a cheap or simple solution.
If this system faces the public web, you will have to redesign it sooner or later, because it will never perform well enough as it is now. (Sorry to be the bearer of bad news.) Many system designers like to avoid redesigning systems after they become enormous. So, if I were you I would get the redesign done.
If your focus is on searching, then bend the schema to facilitate searching rather than the other way around.
Collect all the strings to search for in a single table. Whereas a UNION of 40 tables does work, it will be ~40 times as slow as having the strings collected together.
Use FULLTEXT when the words are long enough, use some other technique when they are not. (This addresses your 3-char problem; see also the Answer discussing innodb_ft_min_token_size. You are using InnoDB, correct?)
Use + and boolean mode to say that a word is mandatory: MATCH(col) AGAINST("+term" IN BOOLEAN MODE)
Do not add on a LIKE clause unless there is a good reason.
I am a field service technician and I have an inventory of parts that is either issued to me by the company I work for or through orders for specific jobs. I am trying to design a website to manage my parts, both on-hand inventory and parts that have been returned or transferred to someone else. Here is the information I need to track:
part number(10 digit)
req number(8 digit, unique)
description(up to 50 characters)
location(Van or shed).
WorkOrder("w"+9 digits ex: 'W212141234')
BOL(15 digit bill of lading #)
TransferDate(date I get rid of part)
TransferMethod(enum 'DEF','RTS','OBF')
I will probably use PHP to make a website and interact with the MySQL database.
What is the best design? A multi-table approach or one table with webpages that display queries of only certain fields? I need a list of on hand parts that list part number, req number, description, and location. I will also need to be able to have "defective returns" view that will list what parts I returned as DEF with all the remaining fields filled in.
Besides the "on hand" fields, the rest of the fields won't have data until they are no longer "on hand".
I really appreciate any help because I am new to both SQL and PHP. I have experimented with Ruby on Rails and django but I am not sure if I need to tackle all that at this point.
Even though you give some information on your issue, it is hard to actually approach it as the question itself on "what is the best design" is vague.
What I would do is this:
MYSQL TABLE DESIGN
Table parts
req number(int(8), unique, KEY)
part number(int(10))
description(varchar(50))
location(enum 'Van','shed')
WorkOrder(varchar(10))
BOL(varchar(15))
TransferDate(date)
TransferMethod(enum 'DEF','RTS','OBF')
onhand (boolean)
PHP SCRIPTS
and then i would make 2 php scripts with a single query each and a table displaying the info
onhand.php
select *fields filled for on hand parts* from parts where onhand = 1
notonhand.php
select *fields filled for not on hand parts* from parts where onhand = 0
I'm working on the next version of a local online dating site, PHP & MySQL based and I want to do things right. The user table is quite massive and is expected to grow even more with the new version as there will be a lot of money spent on promotion.
The current version which I guess is 7-8 years old was done probably by someone not very knowledgeable in PHP and MySQL so I have to start over from scratch.
There community has currently 200k+ users and is expected to grow to 500k-1mil in the next one or two years. There are more than 100 attributes for each user's profile and I have to be able to search by at least 30-40 of them.
As you can imagine I'm a little wary to make a table with 200k rows and 100 columns. My predecessor split the user table in two ... one with the most used and searched columns and one with the rest (and bulk) of the columns. But this lead to big synchronization problems between the two tables.
So, what do you think it's the best way to go about it?
This is not an answer per se, but since few answers here suggested the attribute-value model, I just wanted to jump in and say my life experience.
I've tried once using this model with a table with 120+ attributes (growing 5-10 every year), and adding about 100k+ rows (every 6 months), the indexes is growing so big that it takes for ever to add or update a single user_id.
The problem I find with this type of design (not that it's completely unfit to any situation) is that you need to put a primary key on user_id,attrib on that second table. Unknowing the potential length of attrib, you would usually use a greater length value, thus increasing the indexes. In my case, attribs could have from 3 to 130 chars. Also, the value most certainly suffer from the same assumption.
And as the OP said, this leads to synchronization problems. Imagine if every attributes (or say at least 50% of them) NEED to exist.
Also, as the OP suggest, the search needs to be done on 30-40 attributes, and I can't just imagine how a 30-40 joins would be efficient, or even a group_concat() due to length limitation.
My only viable solution was to go back to a table with as much columns as there are attributes. My indexes are now greatly smaller, and searches are easier.
EDIT: Also, there are no normalization problems. Either having lookup tables for attribute values or have them ENUM().
EDIT 2: Of course, one could say I should have a look-up table for attribute possible values (reducing index sizes), but I should then make a join on that table.
What you could do is split the user data accross two tables.
1) Table: user
This will contain the "core" fixed information about a user such as firstname, lastname, email, username, role_id, registration_date and things of that nature.
Profile related information can go in its own table. This will be an infinitely expandable table with a key => val nature.
2) Table: user_profile
Fields: user_id, option, value
user_id: 1
option: profile_image
value: /uploads/12/myimage.png
and
user_id: 1
option: questions_answered
value: 24
Hope this helps,
Paul.
The entity-attribute-value model might be a good fit for you:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
Rather than have 100 and growing columns, add one table with three columns:
user_id, property, value.
In general, you shouldn't sacrifice database integrity for performance.
The first thing that I would do about this is to create a table with 1 mln rows of dummy data and test some typical queries on it, using a stress tool like ab. It will most probably turn out that it performs just fine - 1 mln rows is a piece of cake for mysql. So, before trying to solve a problem make sure you actually have it.
If you find the performance poor and the database really turns out to be a bottleneck, consider general optimizations, like caching (on all levels, from mysql query cache to html caching), getting better hardware etc. This should work out in most cases.
In general you should always get the schema formally correct before you worry about performance!
That way you can make informed decisions about adapting the schema to resolve specific performance problems, rather than guessing.
You definitely should go down the 2 table route. This will significantly reduce the amount of storage, code complexity, and the effort to changing the system to add new attributes.
Assuming that each attribute can be represented by an Ordinal number, and that you're only looking for symmetrical matches (i.e. you're trying to match people based on similar attributes, rather than an expression of intention)....
At a simple level, the query to find suitable matches may be very expensive. Effectively you are looking for nodes within the same proximity in a N-dimensional space, unfortunately most relational databases aren't really setup for this kind of operation (I believe PostgreSQL has support for this). So most people would probably start with something like:
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.user_id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user.attr_type
AND candidate_attrs.attr_value=current_user.attr_value
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
However this forces the system to compare every available candidate to find the best match. Applying a little heurisitics and you could get a very effective query:
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.user_id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user.attr_value+$tolerance
AND current_user.attr_value-$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
(the value of $tolerance will affect the number of rows returned and query performance - if you've got an index on attr_type, attr_value).
This can be further refined into a points scoring system:
SELECT candidate.id,
SUM(1/1+
((candidate_attrs.attr_value - current_user.attr_value)
*(candidate_attrs.attr_value - current_user.attr_value))
) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.user_id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user.attr_value+$tolerance
AND current_user.attr_value-$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
This approach lets you do lots of different things - including searching by a subset of attributes, e.g.
SELECT candidate.id,
SUM(1/1+
((candidate_attrs.attr_value - current_user.attr_value)
*(candidate_attrs.attr_value - current_user.attr_value))
) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs,
attribute_subsets s
WHERE current_user_attrs.user_id=$current_user
AND candidate.user_id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user.attr_type
AND candidate_attrs.attr_value
AND s.subset_name=$required_subset
AND s.attr_type=current_user.attr_type
BETWEEN current_user.attr_value+$tolerance
AND current_user.attr_value-$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
Obviously this does not accomodate non-ordinal data (e.g. birth sign, favourite pop-band). Without knowing a lot more about te structure of the existing data, its rather hard to say exactly how effective this will be.
If you want to add more attributes, then you don't need to make any changes to your PHP code nor the database schema - it can be completely data-driven.
Another approach would be to identify sterotypes - i.e. reference points within the N-dimensional space, then work out which of these a particular user is closest to. You collapse all the attributes down to a single composite identifier - then you just need to apply the same approach to find the best match within the subset of candidates whom also have been matched to the stereotype.
Can't really suggest anything without seeing the schema. Generally - Mysql database have to be normalized to at least 3NF or BNCF. It rather sounds like it is not normalized right now with 100 columns in 1 table.
Also - you can easily enforce referential integrity with foreign keys using transactions and INNODB engine.
I have many fields which are multi valued and not sure how to store them? if i do 3NF then there are many tables. For example: Nationality.
A person can have single or dual nationality. if dual this means it is a 1 to many. So i create a user table and a user_nationality table. (there is already a nationality lookup table). or i could put both nationalities into the same row like "American, German" then unserialize it on run-time. But then i dont know if i can search this? like if i search for only German people will it show up?
This is an example, i have over 30 fields which are multi-valued, so i assume i will not be creating 61 tables for this? 1 user table, 30 lookup tables to hold each multi-valued item's lookups and 30 tables to hold the user_ values for the multi valued items?
You must also keep in mind that some multi-valued fields group together like "colleges i have studied at" it has a group of fields such as college name, degree type, time line, etc. And a user can have 1 to many of these. So i assume i can create a separate table for this like user_education with these fields, but lets assume one of these fields is also fixed list multi-valued like college campuses i visited then we will end up in a never ending chain of FK tables which isn't a good design for social networks as the goal is it put as much data into as fewer tables as possible for performance.
If you need to keep using SQL, you will need to create these tables. you will need to decide on how far you are willing to go, and impose limitations on the system (such as only being able to specify one campus).
As far as nationality goes, if you will only require two nationalities (worst-case scenario), you could consider a second nationality field (Nationality and Nationality2) to account for this. Of course this only applies to fields with a small maximum number of different values.
If your user table has a lot of related attributes, then one possibility is to create one attributes table with rows like (user_id, attribute_name, attribute_value). You can store all your attributes to one table. You can use this table to fetch attributes for given users, also search by attribute names and values.
The simple solution is to stop using a SQL table. This what NoSQL is deigned for. Check out CouchDB or Mongo. There each value can be stored as a full structure - so this whole problem could be reduced to a single (not-really-)table.
The downside of pretty much any SQL based solution is that it will be slow. Either slow when fetching a single user - a massive JOIN statement won't execute quickly or slow when searching (if you decide to store these values as serialized).
You might also want to look at ORM which will map your objects to a database automatically.
http://en.wikipedia.org/wiki/List_of_object-relational_mapping_software#PHP
This is an example, i have over 30
fields which are multi-valued, so i
assume i will not be creating 61
tables for this?
You're right that 61 is the maximum number of tables, but in reality it'll likely be less, take your own example:
"colleges i have studied at"
"college campuses i visited"
In this case you'll probably only have one "collage" table, so there would be four tables in this layout, not five.
I'd say don't be afraid of using lots of tables if the data set you're modelling is large - just make sure you keep an up to date ERD so you don't get lost! Also, don't get caught up too much in the "link table" paradigm - "link tables" can be "entities" in their own rights, for example you could think of the "colleges i have studied at" link table as an "collage enrolments" table instead, give it it's own primary key, and store each of the times you pay your course fees as rows in a (linked) "collage enrolment payments" table.
I currently have a project where we are dealing with 30million+ keywords for PPC advertising. We maintain these lists in Oracle. There are times where we need to remove certain keywords from the list. The process includes various match-type policies to determine if the keywords should be removed:
EXACT: WHERE keyword = '{term}'
CONTAINS: WHERE keyword LIKE '%{term}%'
TOKEN: WHERE keyword LIKE '% {term} %' OR keyword LIKE '{term} %'
OR keyword LIKE '% {term}'
Now, when a list is processed, it can only use one of the match-types listed above. But, all 30mil+ keywords must be scanned for matches, returning the results for the matches. Currently, this process can take hours/days to process depending on the number of keywords in the list of keywords to search for.
Do you have any suggestions on how to optimize the process so this will run much faster?
UPDATE:
Here is an example query to search for Holiday Inn:
SELECT * FROM keyword_list
WHERE
(
lower(text) LIKE 'holiday inn' OR
lower(text) LIKE '% holiday inn %' OR
lower(text) LIKE 'holiday inn %'
);
Here is the pastebin for the output of EXPLAIN: http://pastebin.com/tk74uhP4
Some additional information that may be useful. A keyword can consist of multiple words like:
this is a sample keyword
i like my keywords
keywords are great
Never use a LIKE match starting with "%" o large sets of data - it can not use the table index on that field and will do a table scan. This is your source of slowness.
The only matches that can use the index are the ones starting with hardcoded string (e.g. keyword LIKE '{term} %').
To work around this problem, create a new indexing table (not to be confused with database's table index) mapping individual terms to keyword strings contining those terms; then your keyword LIKE '% {term} %' becomes t1.keyword = index_table.keyword and index_table.term="{term}".
I know that mine approach can look like heresies for RDBMS guys but I verified it many times in practice and there is no magic. One should just know little bit about possible IO and processing rates and some of simple calculation. In short, RDBMS is not right tool for this sort of processing.
From mine experience perl is able do regexp scan roughly in millions per second. I don't know how fast you are able dump it from database (MySQL can up to 200krows/s so you can dump all your keywords in 2.5 min, I know that Oracle is much worse here but I hope it is not more than ten times i.e. 25 min). If your data are average 20 chars your dump will be 600MB, for 100 chars it is 3GB. It means that with slow 100MB/s HD your IO will take from 6s to 30s. (All involved IO is sequential!) It is almost nothing in comparison with time of dump and processing in perl. Your scan can slow down to 100k/s depending of number of keywords you would like to remove (I have experienced regexp with 500 branching patterns with this speed) so you can process resulting data in less than 5 minutes. If resulting cardinality will not be huge (in tens of hundreds) output IO should not be problem. Anyway your processing should be in minutes, not hours. If you generate whole keyword values for deletion you can use index in delete operation, so you will generate series of DELETE FROM <table> WHERE keyword IN (...) stuffed with keywords to remove in amount up to maximal length of SQL statement. You can also try variant where you will upload this data to temporary table and then use join. I don't know what would be faster in Oracle. It would take about 10 minutes in MySQL. You are unlucky that you have to deal with Oracle but you should be able remove hundreds of {term}'s in less than hour.
P.S.: I would recommend you to use something with better regular expressions like http://code.google.com/p/re2/ (included in V8 aka node.js) or new binary module in Erlang R14A but weak regexp engine in perl would not be weak point in this task, it would be RDBMS.
I think the problem is one of how the keywords are stored. If I'm interpreting your code correctly, the KEYWORD column is made up of a string of blank-separated keyword values, such as
KEYWORD1 KEYWORD2 KEYWORD3
Because of this you're forced to use LIKE to do your searches, and that's probably the souce of the slowness.
Although I realize this may be somewhat painful, it might be better to create a second table, perhaps called KEYWORDS, which would contain the individual keywords which relate to a given base table record (I'll refer to the base table as PPC since I don't know what's it really called). Assuming that your current base table looks like this:
CREATE TABLE PPC
(ID_PPC NUMBER PRIMARY KEY,
KEYWORD VARCHAR2(1000),
<other fields>...);
What you could do would be to rebuild the tables as follows:
CREATE TABLE NEW_PPC
(ID_PPC NUMBER PRIMARY KEY,
<other fields>...);
CREATE TABLE NEW_PPC_KEYWORD
(ID_NEW_PPC NUMBER,
KEYWORD VARCHAR2(25), -- or whatever is appropriate for a single keyword
PRIMARY KEY (ID_NEW_PPC, KEYWORD));
CREATE INDEX NEW_PPC_KEYWORD_1
ON NEW_PPC_KEYWORD(KEYWORD);
You'd populate the NEW_PPC_KEYWORD table by pulling out the individual keywords from the old PPC.KEYWORD field, putting them into the NEW_PPC_KEYWORD table. With only one keyword in each record in NEW_PPC_KEYWORD you could now use a simple join to pull all the records in NEW_PPC which had a keyword by doing something like
SELECT P.*
FROM NEW_PPC P
INNER JOIN NEW_PPC_KEYWORD K
ON (K.ID_NEW_PPC = P.ID_NEW_PPC)
WHERE K.KEYWORD = '<whatever>';
Share and enjoy.
The info is insufficient to give any concrete advice. If the expensive LIKE matching is unavoidable then the only thing I see at the moment is this:
Currently, this process can take hours/days to process depending on the number of keywords in the list of keywords to search for.
Have you tried to cache the results of the queries in a table? Keyed by the input keyword?
Because I do not believe that the whole data set, all keywords can change overnight. And since they do not change very often it makes sense to simply keep the results in a extra table precomputed so that future queries for the keyword can be resolved via cache instead of going again over the 30Mil entries. Obviously, some sort of periodic maintenance has to be done on the cache table: when keywords are modified/deleted and when the lists are modified the cache entries have to be updated recomputed. To simplify the update, one would keep in the cache table also the ID of the original rows in keyword_list table which contributed the results.
To the UPDATE: Insert data into the keyword_list table already lower-cased. Use extra row if the original case is needed for later.
In the past I have participated in the design of one ad system. I do not remember all the details but the most striking difference is that we were tokenizing everything and giving every unique word an id. And keywords were not free form - they were also in DB table, were also tokenized. So we never actually matched the keywords as strings: queries were like:
select AD.id
from DICT, AD
where
DICT.word = :input_word and
DICT.word_id = AD.word_id
DICT is a table with words and AD (analogue of your keyword_list) with the words from ads.
Essentially one can summarize the problem you experience as "full table scan". This is pretty common issue, often highlighting poor design of data layout. Search the net for more information on what can be done. SO has many entries too.
Your explain plan says this query should take a minute, but it's actually taking hours? A simple test on my home PC verifies that a minute seems reasonable for this query. And on a server with some decent IO this should probably only take a few seconds.
Is the problem that you're running the same query dozens of times sequentially for different keywords? If so, you need to combine all the searches together so you only scan the table once.
You could look into Oracle Text indexing. It is designed to support the kind of in-text search you are talking about.
My advice is to raise the cach size to hundreds of gb. Throw hardware at it. If you cant build a Beowulf cluster or build a binAry space search engine.