I need a query that will pull data from a field when it contains fewer than 4 words. Is this possible, and if so, could you give some pointers? If it's not possible, I suppose the code could cycle through the results until it found one with 3 words and then end the while loop once it has found one.
SELECT column FROM table where LENGTH(column) - LENGTH(REPLACE(column, ' ', ''))+1 < 4
Here is a regexp solution:
select * from TABLE where column not regexp '^.+( .+){3}$'
Calling functions in a select on a per-record basis is almost always a bad idea in terms of performance - disk space is usually much 'cheaper' than CPU capacity (I enclose that in quotes since I'm not talking about dollar costs).
Most databases tend to be "more read than write", as my DBA mentor once told me; the ideal solution in that situation is to have a trigger that, when inserting or updating the row, calculates the number of words in that column and stores the count in another column. The reason it's ideal is that the number of words only changes when you update the field, so it's better to perform the time-intensive calculation at that point rather than (unnecessarily) every time you read the record.
The added space will be minimal since you're adding a short integer to a long string, but you'll find your queries scream along by using just the integer field (make sure you index it).
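For illustration, a minimal MySQL sketch of that trigger-plus-counter-column approach. The table name articles and column body are assumptions; adapt them to your schema:

ALTER TABLE articles
  ADD COLUMN word_count INT NOT NULL DEFAULT 0,
  ADD INDEX idx_word_count (word_count);

-- Recompute the count whenever the text is written:
CREATE TRIGGER articles_wc_ins BEFORE INSERT ON articles FOR EACH ROW
  SET NEW.word_count = LENGTH(NEW.body) - LENGTH(REPLACE(NEW.body, ' ', '')) + 1;

CREATE TRIGGER articles_wc_upd BEFORE UPDATE ON articles FOR EACH ROW
  SET NEW.word_count = LENGTH(NEW.body) - LENGTH(REPLACE(NEW.body, ' ', '')) + 1;

-- Reads then become a cheap indexed lookup:
SELECT body FROM articles WHERE word_count < 4;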
I was going to suggest a LIKE '% % % %' but I'm not entirely sure if that would work or not.
I am probably thinking about this wrong but here goes.
A computer starts spitting out a gazillion random numbers between 11111111111111111111 and 99999999999999999999, in a linear row:
Sometimes the computer adds a number to one end of the line.
Sometimes the computer adds a number to the other end of the line.
Each number has a number that comes, or will come, before.
Each number has a number that comes, or will come, after.
Not all numbers are unique; many, though not most, are repeated.
The computer never stops spitting out numbers.
As I record all of these numbers, I need to be able to make an educated guess, at any given time:
If this is the second time I have seen a number I must know what number preceded it in line last time.
If it has appeared more than two times, I must know the probability/frequency of numbers preceding it.
If this is the second time I have seen a number, I must also know what number came after it in line last time.
If it has appeared more than two times, I must know the probability/frequency of numbers coming after it.
How the heck do I structure the tables in a MySQL database to store all these numbers? Which engine do I use, and why? How do I formulate my queries? Speed matters, but capacity is also important, because when will the thing ever stop spitting them out?
My ill-conceived plan:
2 Tables:
1. Unique ID/#
2. #/ID/#
My thoughts:
Unique IDs are almost always going to be shorter than the number = faster match.
Numbers repeat = fewer ID rows = faster match initially.
SELECT * FROM table2 WHERE id=(SELECT id FROM table1 WHERE #=?)
OR:
3 Tables:
1. Unique ID/#
2. #/ID
3. ID/#
My thoughts:
If I only need left/before, or only need after/right, I'm shrinking the size of the second query.
SELECT # FROM table2 (or 3) WHERE id=(SELECT id FROM table1 WHERE #=?)
OR
1 Table:
1. #/#/#
Thoughts:
Fewer queries = less time.
SELECT * FROM table WHERE col2=#.
I'm lost.... :( Each number has four attributes: that which comes before + frequency, and that which comes after + frequency.
Would I be better off thinking of it in that way? If I store and increment frequency in the table, do I do away with repetition and thus speed up my queries? I was initially thinking that if I store every occurrence, it would be faster to figure the frequency programmatically...
Such simple data, but I just don't have the knowledge of how databases function to know which is more efficient.
In light of a recent comment, I would like to add a bit of information about the actual problem: I have a string of indefinite length. I am trying to store a Markov chain frequency table of the various characters, or chunks of characters, in this string.
Given any point in the string I need to know the probability of the next state, and the probability of the previous state.
I am anticipating user input, based on a corpus of text and past user input. A major difference compared to other applications I have seen is that I am going farther down the chain, more states, at a given time and I need the frequency data to provide multiple possibilities.
I hope that clarifies the picture a lot more. I didn't want to get into the nitty gritty of the problem, because in the past I have created questions that are not specific enough to get a specific answer.
This seems maybe a bit better. My primary question with this solution is: would providing the "key" (the first few characters of the state) increase the speed of the system? I.e., query for state_key, then query only the results of that query for the full state?
Table 1:
name: state
col1:state_id - unique, auto incrementing
col2:state_key - the first X characters of the state
col3:state - fixed length string or state
Table 2:
name: occurence
col1:state_id_left - non unique key from table 1
col2:state_id_right - non unique key from table 1
col3:frequency - int, incremented every time the two states occur next to each other.
QUERY TO FIND PREVIOUS STATES:
SELECT * FROM occurence WHERE state_id_right=(SELECT state_id FROM state WHERE state_key=? AND state=?)
QUERY TO FIND NEXT STATES:
SELECT * FROM occurence WHERE state_id_left=(SELECT state_id FROM state WHERE state_key=? AND state=?)
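For what it's worth, a rough DDL sketch of that layout (types and key lengths here are assumptions; the table names follow the question's spelling):

CREATE TABLE state (
  state_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  state_key CHAR(4)  NOT NULL,   -- first X characters of the state
  state     CHAR(20) NOT NULL,   -- fixed-length state string
  KEY idx_state_key (state_key)
);

CREATE TABLE occurence (
  state_id_left  INT UNSIGNED NOT NULL,   -- references state.state_id
  state_id_right INT UNSIGNED NOT NULL,   -- references state.state_id
  frequency      INT UNSIGNED NOT NULL DEFAULT 1,
  PRIMARY KEY (state_id_left, state_id_right)
);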
I'm not familiar with Markov Chains but here is an attempt to answer the question. Note: To simplify things, let's call each string of numbers a 'state'.
First of all, I imagine a table like this:
Table states:
order : auto-incrementing integer (add an index here)
state_id : integer (add an index here)
state : varchar (?)
order: just use a sequential number (1, 2, 3, ..., n); this will make it easy to search for the previous or next state.
state_id: a unique number associated with the state. As an example, you can use the number 1 to represent the state '1111111111...1' (whatever the length of the sequence is). What's important is that a reoccurrence of a state needs to use the same state_id that was used before. You may be able to formulate the state_id based on the string (maybe by subtracting a number). Of course a state_id only makes sense if the number of possible states fits in a MySQL int field.
state: that is the string of numbers '11111111...1' to '99999999...9'. I'm guessing this can only be stored as a string, but if it fits in an integer/number column you should try it, as it may well be that you don't need the state_id.
The point of state_id is that searching numbers is quicker than searching text, but there will always be trade-offs when it comes to performance; profile and identify your bottlenecks to make better design decisions.
So, how do you look for a previous occurrence of the state S_i ?
"SELECT order, state_id, state FROM states WHERE state_id = " and then attach get_state_id(S_i) where get_state_id ideally uses a formula to generate a unique id for the state.
Now, with order - 1 or order + 1 you can access the neighboring states issuing an additional query.
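A quick sketch of that neighbor lookup (note that order is a reserved word in MySQL, so it is quoted here; a column name like seq would avoid that):

SELECT `order`, state_id, state
FROM states
WHERE `order` = @current_order - 1;   -- use + 1 for the following state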
Next we need to track the frequency of different occurrences. You can do that in a different table that could look like this:
Table state_frequencies:
state_id integer (indexed)
occurrences integer
And only add records as you get the numbers.
Finally, you can have tables to track frequency for the neighboring states:
Table prev_state_frequencies (next_state_frequencies is the same):
state_id: integer (indexed)
prev_state_id: integer (indexed)
occurrences: integer
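A hedged sketch of how these counters could be maintained as the numbers arrive, assuming MySQL and unique keys on state_id and on (state_id, prev_state_id):

INSERT INTO state_frequencies (state_id, occurrences)
VALUES (?, 1)
ON DUPLICATE KEY UPDATE occurrences = occurrences + 1;

INSERT INTO prev_state_frequencies (state_id, prev_state_id, occurrences)
VALUES (?, ?, 1)
ON DUPLICATE KEY UPDATE occurrences = occurrences + 1;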
You will be able to infer probabilities (I guess this is what you are trying to do) by looking at the number of occurrences of a state (in state_frequencies) vs the number of occurrences of its predecessor state (in prev_state_frequencies).
I'm not sure if I got your problem right but if this makes sense I'm guessing I have.
Hope it helps,
AH
It seems to me that the Markov chain is finite, so first I would start by defining the limit of the chain (i.e. 26 characters with x number of spaces to fill). Then you can calculate the total number of possible combinations. To determine the probability of a certain arrangement of characters, the math, if I remember correctly, is:
x = ((C)(C))(P)
where
C = the number of possible characters and
P = the total potential outcomes.
This is a ton of data to store, and creating procedures to filter through the data could turn out to be a seemingly endless task.
If you are using an auto-incremented id in your table, you could query the table and use preg_match to test the new result against the previous results, then insert the number of total matches along with the new result into the table. This would also allow you to query the preceding results to see what came before it. That should give you a general idea of the pattern within the results, as well as a general basis for statistical relevance and new algorithm generation.
Suppose Table1 contains column orderid (not a key, although it's NOT NULL and unique). It contains 5-digit numbers.
What's the best way to generate a PHP variable $unique_var that is not in that column?
What may be important: from 10% to 30% of all 5-digit numbers are already in the table, so generating a random number and re-checking in a loop (while (mysql_num_rows() == 0) {}) may not be the best way to find one. Is there a better solution for performance?
Thank you.
If only 10-30% of the numbers are already taken, then only 10-30% of attempts will need to be performed at least twice. That is not a big performance issue at all.
Otherwise, just create an all-5-digit-numbers table (just 100k rows) and remove all the numbers that already exist. When you need another random number, just pick one and delete it.
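A sketch of that pool-table idea (the table and column names are assumptions; populate it once, e.g. from a script or a digits cross-join):

CREATE TABLE free_numbers (n INT PRIMARY KEY);

-- After populating 0..99999, drop the 10-30% already used:
DELETE FROM free_numbers
WHERE n IN (SELECT orderid FROM Table1);

-- Each pick is then a cheap indexed read plus a delete:
SELECT n FROM free_numbers ORDER BY RAND() LIMIT 1;
DELETE FROM free_numbers WHERE n = ?;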
I would suggest finding the biggest number (with a MAX() clause) and starting from there.
Here are a couple of suggestions. Each has its drawbacks.
Pre-populate your table and add a column to indicate that a number is unused. Select an unused number using LIMIT 1 and mark it used. This uses a lot of space.
Keep a separate table containing previously used numbers. If that table is empty, generate numbers sequentially from the last used number (or from 00001 if Table1 is empty). This requires some extra bookkeeping.
From someone with more experience than myself, would it be a better idea to simply count the number of items in a table (such as counting the number of topics in a category) or to keep a variable that holds that value and just increment and call it (an extra field in the category table)?
Is there a significant difference between the two or is it just very slight, and even if it is slight, would one method still be better than the other? It's not for any one particular project, so please answer generally (if that makes sense) rather than based on something like the number of users.
Thank you.
To get the number of items (rows in a table), you'd use standard SQL and do it on demand:
SELECT COUNT(*) FROM MyTable
Note, in case I've missed something, each item (row) in the table has some unique identifier, whether it's a part number, some code, or an auto-increment. So adding a new row could trigger the "auto-increment" of a column.
This is unrelated to "counting rows". Because of DELETEs or ROLLBACK, numbers may not be contiguous.
Trying to maintain row counts separately will end in tears and/or disaster. Trying to use COUNT(*)+1 or MAX(id)+1 to generate a new row identifier is even worse.
I think there is some confusion in your question. My interpretation is that you're asking whether to do a SELECT COUNT(*) or to keep a column in which you track the actual count.
I would not add such a column, if you don't have reasons to do so. This is premature optimization and you complicate your software design.
Also, you want to avoid having the same information stored in different places. Counting is a trivial task, so a separate count column actually duplicates information, which is a bad idea.
I'd go with just counting. If you notice a performance issue, you can consider other options, but as soon as you keep a value that's separate, you have to do some work to make sure it's always correct. Using COUNT() you always get the actual number "straight from the horse's mouth" so to speak.
Basically, don't start optimizing until you have to. If everything works fine and fast using COUNT(), then do that. Otherwise, store the count somewhere, but rather than adding/subtracting to update the stored value, run COUNT() when needed to get the new number of items.
In my forum I count the sub-threads in a forum like this:
SELECT COUNT(forumid) AS count FROM forumtable
As long as you filter on an identifier that consistently specifies the forum and/or sub-section, and the column has an index key, it's very fast. So there's no reason to add more columns than you need.
I currently have a project where we are dealing with 30million+ keywords for PPC advertising. We maintain these lists in Oracle. There are times where we need to remove certain keywords from the list. The process includes various match-type policies to determine if the keywords should be removed:
EXACT: WHERE keyword = '{term}'
CONTAINS: WHERE keyword LIKE '%{term}%'
TOKEN: WHERE keyword LIKE '% {term} %' OR keyword LIKE '{term} %' OR keyword LIKE '% {term}'
Now, when a list is processed, it can only use one of the match-types listed above. But, all 30mil+ keywords must be scanned for matches, returning the results for the matches. Currently, this process can take hours/days to process depending on the number of keywords in the list of keywords to search for.
Do you have any suggestions on how to optimize the process so this will run much faster?
UPDATE:
Here is an example query to search for Holiday Inn:
SELECT * FROM keyword_list
WHERE
(
lower(text) LIKE 'holiday inn' OR
lower(text) LIKE '% holiday inn %' OR
lower(text) LIKE 'holiday inn %'
);
Here is the pastebin for the output of EXPLAIN: http://pastebin.com/tk74uhP4
Some additional information that may be useful. A keyword can consist of multiple words like:
this is a sample keyword
i like my keywords
keywords are great
Never use a LIKE match starting with "%" on large sets of data - it cannot use the table index on that field and will do a full table scan. This is your source of slowness.
The only matches that can use the index are the ones starting with a hardcoded string (e.g. keyword LIKE '{term} %').
To work around this problem, create a new indexing table (not to be confused with the database's table index) mapping individual terms to the keyword strings containing those terms; then your keyword LIKE '% {term} %' becomes t1.keyword = index_table.keyword AND index_table.term = '{term}'.
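A sketch of what that indexing table could look like (all names here are assumptions; Oracle syntax, since that's the platform in the question):

CREATE TABLE keyword_terms (
  term    VARCHAR2(100),
  keyword VARCHAR2(1000)
);
CREATE INDEX keyword_terms_term_idx ON keyword_terms(term);

-- keyword LIKE '% {term} %' becomes an indexed equality lookup:
SELECT t.keyword
FROM keyword_terms t
WHERE t.term = 'holiday';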
I know that my approach can look like heresy to RDBMS folks, but I have verified it many times in practice and there is no magic. One just needs to know a little about possible IO and processing rates and do some simple calculations. In short, an RDBMS is not the right tool for this sort of processing.
In my experience, Perl can do regexp scans at roughly millions of rows per second. I don't know how fast you can dump data from your database (MySQL manages up to 200k rows/s, so you could dump all your keywords in 2.5 minutes; I know Oracle is much worse here, but I hope it is not more than ten times worse, i.e. 25 minutes). If your rows average 20 characters, your dump will be 600MB; for 100 characters it is 3GB. That means that with a slow 100MB/s disk your IO will take from 6s to 30s. (All involved IO is sequential!) That is almost nothing in comparison with the time of the dump and the processing in Perl. Your scan can slow down to 100k rows/s depending on the number of keywords you would like to remove (I have seen a regexp with 500 branching patterns run at this speed), so you can process the resulting data in less than 5 minutes. If the resulting cardinality is not huge (tens to hundreds), the output IO should not be a problem.

In any case, your processing should take minutes, not hours. If you generate the complete keyword values for deletion, you can use the index in the delete operation: generate a series of DELETE FROM <table> WHERE keyword IN (...) statements stuffed with keywords to remove, up to the maximal length of an SQL statement. You can also try a variant where you upload this data to a temporary table and then use a join. I don't know which would be faster in Oracle; it would take about 10 minutes in MySQL. You are unlucky that you have to deal with Oracle, but you should be able to remove hundreds of {term}s in less than an hour.
P.S.: I would recommend using something with better regular expressions, like http://code.google.com/p/re2/ (included in V8, aka node.js) or the new binary module in Erlang R14A, but even Perl's weaker regexp engine would not be the weak point in this task; the RDBMS would be.
I think the problem is one of how the keywords are stored. If I'm interpreting your code correctly, the KEYWORD column is made up of a string of blank-separated keyword values, such as
KEYWORD1 KEYWORD2 KEYWORD3
Because of this you're forced to use LIKE to do your searches, and that's probably the source of the slowness.
Although I realize this may be somewhat painful, it might be better to create a second table, perhaps called KEYWORDS, which would contain the individual keywords which relate to a given base table record (I'll refer to the base table as PPC since I don't know what it's really called). Assuming that your current base table looks like this:
CREATE TABLE PPC
(ID_PPC NUMBER PRIMARY KEY,
KEYWORD VARCHAR2(1000),
<other fields>...);
What you could do would be to rebuild the tables as follows:
CREATE TABLE NEW_PPC
(ID_PPC NUMBER PRIMARY KEY,
<other fields>...);
CREATE TABLE NEW_PPC_KEYWORD
(ID_NEW_PPC NUMBER,
KEYWORD VARCHAR2(25), -- or whatever is appropriate for a single keyword
PRIMARY KEY (ID_NEW_PPC, KEYWORD));
CREATE INDEX NEW_PPC_KEYWORD_1
ON NEW_PPC_KEYWORD(KEYWORD);
You'd populate the NEW_PPC_KEYWORD table by pulling the individual keywords out of the old PPC.KEYWORD field (a sketch for doing so follows the query below). With only one keyword in each record in NEW_PPC_KEYWORD you could now use a simple join to pull all the records in NEW_PPC which had a keyword by doing something like
SELECT P.*
FROM NEW_PPC P
INNER JOIN NEW_PPC_KEYWORD K
ON (K.ID_NEW_PPC = P.ID_NEW_PPC)
WHERE K.KEYWORD = '<whatever>';
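For the population step, one possible sketch that splits the blank-separated values in SQL, assuming Oracle 11g or later (REGEXP_COUNT) and at most 50 keywords per row (adjust the cap as needed):

INSERT INTO NEW_PPC_KEYWORD (ID_NEW_PPC, KEYWORD)
SELECT DISTINCT p.ID_PPC,
       REGEXP_SUBSTR(p.KEYWORD, '[^ ]+', 1, n.pos)
FROM   PPC p
JOIN  (SELECT LEVEL AS pos FROM DUAL CONNECT BY LEVEL <= 50) n
  ON   n.pos <= REGEXP_COUNT(p.KEYWORD, '[^ ]+');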
Share and enjoy.
The info is insufficient to give any concrete advice. If the expensive LIKE matching is unavoidable then the only thing I see at the moment is this:
Currently, this process can take hours/days to process depending on the number of keywords in the list of keywords to search for.
Have you tried to cache the results of the queries in a table? Keyed by the input keyword?
Because I do not believe that the whole data set, all the keywords, can change overnight. And since they do not change very often, it makes sense to simply keep the results precomputed in an extra table, so that future queries for the keyword can be resolved via the cache instead of going over the 30M entries again. Obviously, some periodic maintenance has to be done on the cache table: when keywords are modified/deleted, and when the lists are modified, the cache entries have to be updated/recomputed. To simplify the updates, one would also keep in the cache table the IDs of the original rows in the keyword_list table which contributed the results.
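The cache table might look something like this (a sketch; the names and types are assumptions):

CREATE TABLE keyword_match_cache (
  input_term      VARCHAR2(200) NOT NULL,
  keyword_list_id NUMBER        NOT NULL,  -- row in keyword_list that matched
  PRIMARY KEY (input_term, keyword_list_id)
);

-- A repeated lookup becomes an indexed read instead of a 30M-row scan:
SELECT keyword_list_id
FROM keyword_match_cache
WHERE input_term = 'holiday inn';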
Regarding the UPDATE: insert data into the keyword_list table already lower-cased. Use an extra column if the original case is needed later.
In the past I have participated in the design of one ad system. I do not remember all the details, but the most striking difference is that we were tokenizing everything and giving every unique word an id. And keywords were not free-form: they were also in a DB table and also tokenized. So we never actually matched the keywords as strings; queries were like:
select AD.id
from DICT, AD
where
DICT.word = :input_word and
DICT.word_id = AD.word_id
DICT is a table with words and AD (analogue of your keyword_list) with the words from ads.
Essentially one can summarize the problem you experience as "full table scan". This is pretty common issue, often highlighting poor design of data layout. Search the net for more information on what can be done. SO has many entries too.
Your explain plan says this query should take a minute, but it's actually taking hours? A simple test on my home PC verifies that a minute seems reasonable for this query. And on a server with some decent IO this should probably only take a few seconds.
Is the problem that you're running the same query dozens of times sequentially for different keywords? If so, you need to combine all the searches together so you only scan the table once.
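One hedged way to combine them: load the search terms into a table and match them all in a single pass. The search_terms name is an assumption; this still scans keyword_list, but only once for the whole batch rather than once per term:

SELECT DISTINCT k.*
FROM keyword_list k
JOIN search_terms s
  ON LOWER(k.text) LIKE '%' || s.term || '%';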
You could look into Oracle Text indexing. It is designed to support the kind of in-text search you are talking about.
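For reference, a minimal Oracle Text sketch (assumes the CTXSYS components are installed):

CREATE INDEX keyword_list_text_idx ON keyword_list(text)
  INDEXTYPE IS CTXSYS.CONTEXT;

-- Word-level search without a full table scan:
SELECT * FROM keyword_list
WHERE CONTAINS(text, 'holiday inn') > 0;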
My advice is to raise the cache size to hundreds of GB and throw hardware at it. If you can't, build a Beowulf cluster or a binary space search engine.
I have a field in a table recipes that was inserted using mysql_real_escape_string. I want to count the number of line breaks in that field and order the records by this number.
P.S. The field is called Ingredients.
Thanks everyone
This would do it:
SELECT *, LENGTH(Ingredients) - LENGTH(REPLACE(Ingredients, '\n', '')) as Count
FROM Recipes
ORDER BY Count DESC
The way I am getting the number of line breaks is a bit of a hack, however, and I don't think there's a better way. If performance is a huge issue, I would recommend keeping a column that stores the number of line breaks. For medium-sized data sets, though, the above should be fine.
If you wanted to have a cache column as described above, you would do:
UPDATE
Recipes
SET
IngredientAmount = LENGTH(Ingredients) - LENGTH(REPLACE(Ingredients, '\n', ''))
After that, whenever you are updating/inserting a new row, you could calculate the count (probably with PHP) and fill in this column beforehand. Or, if you're into that sort of thing, try out triggers.
I'm assuming a lot here, but from what I'm reading in your post, you could change your database structure a little bit, and both solve this problem and open your dataset up to more interesting uses.
If you separate ingredients into their own table, and use a linking table to index which ingredients occur in which recipes, it'll be much easier to be creative with data manipulation. It becomes easier to count ingredients per recipe, to find similarities in recipes, to search for recipes containing sets of ingredients, etc. Your data would also be more normalized and smaller (storing one global list of all ingredients vs. storing a set for each recipe).
If you're using a single text entry field to enter ingredients for a recipe now, you could do something like break up that input by lines and use each line as an ingredient when saving to the database. You can use something like PHP's built-in levenshtein() or similar_text() functions to deal with misspelled ingredient names and keep the data as normalized as possible without having to hand-groom your users' data entry too much.
This is just a suggestion, take it as you like.
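A sketch of the normalized layout described above (table and column names are assumptions):

CREATE TABLE ingredients (
  ingredient_id INT AUTO_INCREMENT PRIMARY KEY,
  name          VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE recipe_ingredients (
  recipe_id     INT NOT NULL,
  ingredient_id INT NOT NULL,
  PRIMARY KEY (recipe_id, ingredient_id)
);

-- Counting ingredients per recipe is then a plain indexed aggregate:
SELECT recipe_id, COUNT(*) AS ingredient_count
FROM recipe_ingredients
GROUP BY recipe_id;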
You're going a bit beyond the capabilities and intent of SQL here. You could write a stored procedure to scan the string and return the number and then use this in your query.
However, I think you should revisit the design of whatever inserts the Ingredients so that you avoid searching the strings of every row whenever you run this query. Add a 'num_linebreaks' column, calculate the number of line breaks, and set this column when you're adding the Ingredients.
If you've no control over the app that's doing the insertion, then you could use a stored procedure to update num_linebreaks based on a trigger.
Got it thanks, the php code looks like:
$check = explode("\r\n", $_POST['ingredients']);
$lines = count($check);
So how could I update Ingred_count based on the Ingredients field for all previous records in one fell swoop?
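For the record, a one-off backfill could look like this (a sketch, assuming the column is named Ingred_count and that you want the line count, i.e. breaks + 1; counting '\n' works for both Unix and Windows line endings, since "\r\n" contains one "\n"):

UPDATE Recipes
SET Ingred_count = LENGTH(Ingredients) - LENGTH(REPLACE(Ingredients, '\n', '')) + 1;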