MySQL Finding a larger string by knowing only a part of it - php

I have a string like this:
x26y6z8/0|x999y0z1/1|x1y5z40/9999|etc...
Let's say I know this:
x1y5z40
How can I find the whole part of the string I'm looking for? Which is
x1y5z40/9999
It is actually simple, but the way I'm doing it is absolutely not the correct one, as I'm spamming the database with queries and doing it all with php, which obviously results in it being slow.
To make things more difficult, once I found
x1y5z40/9999
I need to replace it with, for example:
x1y5z40/0
I would like to do it entirely in MySQL if possible, ideally with one query. Does anybody have an idea how I could do this?

Use REPLACE().
update TBL
set content = REPLACE(content , 'x1y5z40/9999', 'x1y5z40/0')
Source: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_replace

I think you are leaving out that the components of the string are separated by |. You can do this using MySQL string functions. The following should return the part after the slash (9999 in the example):
select substring_index(substring_index(col, '|x1y5z40/', -1), '|', 1)
Ah, now to replace that part with 0. That will be a bit uglier:
select replace(col,
concat('|x1y5z40/',
substring_index(substring_index(col, '|x1y5z40/', -1), '|', 1),
'|'
),
'|x1y5z40/0|'
)
These expressions rely on the surrounding | delimiters, so an item at the very start or end of the string needs col wrapped as concat('|', col, '|') first.
You can also express this as an update, if you are trying to change the data in the database.
By the way, storing lists of things in strings is a very bad idea in SQL. Perhaps you don't have choice. But SQL has an excellent data structure for storing lists, with all sorts of built-in functionality. It is called a table. You should have a junction table with one row per entity and value in the string.
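To make the junction-table suggestion concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for MySQL, and the table and column names (entity_values, entity_id, item_key, item_value) are hypothetical, not taken from the question. With one row per key, the lookup and the replacement each become a single indexed statement.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity_values ("
    " entity_id INTEGER, item_key TEXT, item_value INTEGER,"
    " PRIMARY KEY (entity_id, item_key))"
)

# Split the packed string from the question into one row per key/value pair.
packed = "x26y6z8/0|x999y0z1/1|x1y5z40/9999"
rows = [(1, key, int(value))
        for key, value in (part.split("/") for part in packed.split("|"))]
conn.executemany("INSERT INTO entity_values VALUES (?, ?, ?)", rows)


def get_value(entity_id, key):
    """Return the value stored for a key, or None if the key is absent."""
    row = conn.execute(
        "SELECT item_value FROM entity_values"
        " WHERE entity_id = ? AND item_key = ?",
        (entity_id, key),
    ).fetchone()
    return row[0] if row else None


before = get_value(1, "x1y5z40")
# The "find and replace" from the question is now one UPDATE.
conn.execute(
    "UPDATE entity_values SET item_value = 0"
    " WHERE entity_id = ? AND item_key = ?",
    (1, "x1y5z40"),
)
after = get_value(1, "x1y5z40")
```

The same two statements should carry over to MySQL more or less unchanged; only the connection code differs.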

Related

Database/datasource optimized for string matching?

I want to store large amount (~thousands) of strings and be able to perform matches using wildcards.
For example, here is a sample content:
Folder1
Folder1/Folder2
Folder1/*
Folder1/Folder2/Folder3
Folder2/Folder*
*/Folder4
*/Fo*4
(each line has additional data too, like tags, but the matching is only against that key)
Here is an example of what I would like to match against the data:
Folder1
Folder1/Folder2/Folder3
Folder3
(* being a wildcard here, it can be a different character)
I naively considered storing it in a MySQL table and using % wildcards with the LIKE operator, but MySQL indexes will only work for characters on the left of the wildcard, and in my case it can be anywhere (i.e. %/Folder3).
So I'm looking for a fast solution, that could be used from PHP. And I am open: it can be a separate server, a PHP library using files with regex, ...
Have you considered using MySQL's regular expression engine? Try something like this:
SELECT *
FROM your_table
WHERE your_query_string REGEXP pattern_column
This will return rows with regex keys that your query string matches. I expect it will perform better than running a query to pull all of the data and doing the matching in PHP.
More info here: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
You might want to use a multicore approach to run that search in a fraction of the time. For search and matching I would recommend FPGAs, but that is probably the hardest way to do it. Consider this article on using CUDA, with which you can run such searches about 16x faster than usual. On multicore CPU systems you can use POSIX threads or a cluster of computers (MPI, for example), or you can call a Gearman service to run the searches using advanced algorithms.
Were it me, I'd store the key field twice: once forward and once reversed (see MySQL's REVERSE function). You can then search the index with left(main_field) and left(reversed_field). It won't help when you have a wildcard both in the middle of the string AND at the beginning (e.g. "*Folder1*Folder2"), but it will when you have a wildcard at the beginning or the end.
e.g. if you want to search */Folder1 then search where left(reverse_field, 8) = '1redloF/';
for Folder1/*/FolderX search where left(reverse_field, 8) = 'XredloF/' and left(main_field, 8) = 'Folder1/'
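The forward/reversed idea can be sketched outside the database as well. The snippet below is only an illustration in Python, with two sorted lists standing in for the two indexed columns. The function names and the sample keys are made up for the example. A suffix search becomes a prefix search on the reversed keys, which is what lets a B-tree index help.

```python
from bisect import bisect_left


def prefix_range(sorted_keys, prefix):
    """Return every key in a sorted list that starts with prefix.

    Binary search finds the first candidate, then we walk forward while
    the prefix still matches, much like an index range scan.
    """
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


# Hypothetical sample keys; "forward" and "backward" play the role of
# main_field and reverse_field.
keys = ["Folder1", "Folder1/Folder2", "Folder1/Folder2/Folder3", "Folder2/Folder4"]
forward = sorted(keys)
backward = sorted(k[::-1] for k in keys)


def starts_with(prefix):
    """Search of the form prefix* using the forward list."""
    return prefix_range(forward, prefix)


def ends_with(suffix):
    """Search of the form *suffix using the reversed list."""
    return sorted(k[::-1] for k in prefix_range(backward, suffix[::-1]))
```

For a search such as Folder1/*/Folder4 you would intersect starts_with("Folder1/") with ends_with("/Folder4").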
If your strings represent some kind of hierarchical structure (as it looks like in your sample content), actually not "real" files, but you say you are open to alternative solutions - why not consider something like a file-based index?
Choose a new directory like myindex
Create an empty file for each entry using the string key as location & file name in myindex
Now you can find matches using glob - thanks to the hierarchical file structure a glob search should be much faster than searching up all your database entries.
If needed you can match the results to your MySQL data - thanks to your MySQL index on the key this action will be very fast.
But don't forget to update the myindex structure on INSERT, UPDATE or DELETE in your MySQL database.
This solution will only compete on a huge data-set (but not too huge as @Kyle mentioned) with a rather deep than wide hierarchical structure.
EDIT
Sorry, this would only work if the wildcards are in your search terms, not in the stored strings themselves.
As the wildcards (*) are in your data and not in your queries I think you should start with breaking up your data into pieces. You should create an index-table having columns like:
dataGroup INT(11),
exactString varchar(100),
wildcardEnd varchar(100),
wildcardStart varchar(100),
If you have a value like "Folder1/Folder2" store it in "exactString" and assign the ID of the value in the main data table to "dataGroup" in the above index table.
If you have a value like "Folder1/*" store a value of "Folder1/" to "wildcardEnd" and again assign the id of the value in the main table to the "dataGroup" field in above Table.
You can then do a match within your query using:
indexTable.wildcardEnd = LEFT('Folder1/WhatAmILookingFor/Data', LENGTH(indexTable.wildcardEnd))
This will truncate the search string ('Folder1/WhatAmILookingFor/Data') to "Folder1/" and then match it against the wildcardEnd field. I assume mysql is clever enough not to do the truncate for every row but to start with the first character and match it against every row (using B-Tree indexes).
A value like "*/Folder4" will go into the field "wildcardStart", but reversed. To cite Missy Elliott: "Is it worth it, let me work it, I put my thing down, flip it and reverse it" (http://www.youtube.com/watch?v=Ke1MoSkanS4). So store a value of "4redloF/" in "wildcardStart". Then a WHERE like the following will match rows:
indexTable.wildcardStart = LEFT(REVERSE('Folder1/WhatAmILookingFor/Folder4'), LENGTH(indexTable.wildcardStart))
of course you could do the "REVERSE" already in your application logic.
Now about the tricky part. Something like "*/Fo*4" should get split up into two records:
# Record 1
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> oF/
wildcardEnd ==> /Fo
# Record 2
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> 4
Now if you match something you have to take care that every index-record of a dataGroup gets returned for a complete match and that no overlapping occurs. This could also get solved in SQL but is beyond this question.
A database isn't the right tool for these kinds of searches. You can still use a database (any database and any structure) to store the strings, but you have to write the code to do all the searches in memory. Load all the strings from the database (a few thousand strings is really no big deal), cache them and run your search/match algorithm on them.
You will probably have to code your algorithm yourself, because the standard tools will be overkill for what you are trying to achieve and there is no guarantee that they will be able to do exactly what you need.
I would build a regex representation of your wildcard-based strings and run those regexes on your input. You will probably have to do some work until you get the regex right, but it will be the fastest way to go.
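A minimal sketch of that approach, written in Python for illustration (the original poster works in PHP, where preg_quote and preg_match would play the same roles). It assumes that * may match any run of characters, including /, which the question does not actually specify. The function names are made up for the example.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a stored key such as '*/Fo*4' into a compiled regex.

    Literal pieces are escaped so characters like '.' stay literal;
    each '*' becomes '.*' (assumption: '*' may also span '/').
    """
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(literal_parts))


# Sample keys from the question, compiled once and cached in memory.
stored_keys = [
    "Folder1",
    "Folder1/Folder2",
    "Folder1/*",
    "Folder1/Folder2/Folder3",
    "Folder2/Folder*",
    "*/Folder4",
    "*/Fo*4",
]
compiled = [(key, wildcard_to_regex(key)) for key in stored_keys]


def matching_keys(path):
    """Return every stored key whose pattern matches the whole path."""
    return [key for key, regex in compiled if regex.fullmatch(path)]
```

With a few thousand keys, the compiled list can be built once per request or cached between requests.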
I suggest reading the keys and their associated payload into a binary tree representation ordered alphanumerically by key. If your keys are not terribly "clumped" then you can avoid the (slight additional) overhead of building a balanced tree. You can also avoid any tree maintenance code as, if I understand your problem correctly, the data will be changing frequently and it would be simplest to rebuild the tree rather than add/remove/update nodes in place. The overhead of reading into the tree is similar to performing an initial sort, and tree traversal to search for your value is straightforward and much more efficient than just running a regex against a bunch of strings. You may even find while working it through that your wildcards in the tree will lead to some shortcuts to prune the search space. A quick search shows lots of resources and PHP snippets to get you started.
If you run SELECT folder_col, count(*) FROM your_sample_table group by folder_col do you get duplicate folder_col values (ie count(*) greater than 1)?
If not, that means you can produce an SQL that would generate a valid sphinx index (see http://sphinxsearch.com/).
I wouldn't recommend to do text search on large collection of data in MySQL. You need a database to store the data but that would be it. For searching use a search engine like:
Solr (http://lucene.apache.org/solr/)
Elastic Search (http://www.elasticsearch.org/)
Sphinx (http://sphinxsearch.com/)
Those services will allow you doing all sort of funky text search (including Wildcards) in a blink of an eye ;-)

How to find records in a database which differ only from one character to the search string?

I have database with a field 'clinicNo' and that field contains records like 1234A, 2343B, 9999Z ......
If by mistake I use '1234B' instead of '1234A' in the select statement, I want to get a result set containing the clinicNos which differ by only one character from the given string (i.e. 1234B above).
Eg. Field may contain following values.
1234A, 1235B, 5433A, 4444S, 2978C
If I use '1235A' for the select query, it should give 1234A and 1235B as the result.
You could use SUBSTRING for your column selection. The example below returns rows starting with '1235' followed by any character (SUBSTRING positions start at 1 in MySQL):
select * from TableName WHERE SUBSTRING(clinicNo, 1, 4) = '1235'
What you're looking for is called the Levenshtein Distance algorithm. While there is a levenshtein function in PHP, you really want to do this in MySQL.
There are two ways to implement a Levenshtein function in MySQL. The first is to create a stored function, which operates much like a stored procedure, except that it has distinct inputs and an output. This is fine for small datasets, but a little slow on anything approaching several thousand rows. You can find more info here: http://kristiannissen.wordpress.com/2010/07/08/mysql-levenshtein/
The second method is to implement a User Defined Function in C/C++ and link it into MySQL as a shared library (*.so file). This method also uses a STORED FUNCTION to call the library, which means the actual query for this or the first method may be identical (providing the inputs to both functions are the same). You can find out more about this method here: http://samjlevy.com/2011/03/mysql-levenshtein-and-damerau-levenshtein-udfs/
With either of these methods, your query would be something like:
SELECT clinicNo FROM words WHERE levenshtein(clinicNo, '1234A') < 2;
It's important to remember that the 'threshold' value should change in relation to the original word length. It's better to think of it in terms of a percentage value, i.e. half your word = 50%, half of 'term' = 2. In your case, you would probably be looking for a difference of < 2 (i.e. a 1 character difference), but you could go further to account for additional errors.
Also see: Wikipedia: Levenshtein Distance.
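For reference, here is a minimal sketch of the distance computation itself, in Python for illustration (PHP has a built-in levenshtein() with the same meaning). It is the standard dynamic-programming version, not the code from either linked implementation, and near_matches is a hypothetical helper name.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete from a
                current[j - 1] + 1,                   # insert into a
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def near_matches(query, candidates, max_distance=1):
    """Keep the candidates within max_distance edits of the query."""
    return [c for c in candidates if levenshtein(c, query) <= max_distance]


# Sample values from the question.
clinic_numbers = ["1234A", "1235B", "5433A", "4444S", "2978C"]
```

Calling near_matches("1235A", clinic_numbers) keeps only the values that are one edit away, which is the same filter as levenshtein(...) < 2 in the query above.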
SELECT * FROM TableName
WHERE ClinicNo LIKE CONCAT(LEFT('1235A', 4), '%')
In general development, you could use a function like Levenshtein to find the difference between two strings and it returns you a number of "how similar they are". You probably want then the result with the most similarity.
To get Levenshtein also in MySQL, read this post.
Or just get all results and use the Levenshtein function of PHP.

How to find 'similar' records in a MySQL table based on 'title' and 'description' columns?

I have a MySQL table storing some user generated content. For each piece of content, I have a title (VARCHAR 255) and a description (TEXT) column.
When a user is viewing a record, I want to find other records that are 'similar' to it, based on the title/description being similar.
What's the best way to go about doing this? I'm using PHP and MySQL.
My initial ideas are:
1) Either to strip out common words from the title and description to be left with 'unique' keywords, and then find other records which share those keywords.
E.g in the sentence: "Bob woke up at 5 am and went to school", the keywords would be: "Bob, woke, 5, went, school". Then if there's another record whose title talks about 'bob' and 'school', they would be considered 'similar'.
2) Or to use MySQL's full text search, though I don't know if this would be any good for something like this?
Which method would be better out of the two, or is there another method which is even better?
I'll keep this short (it could be way too long)...
I would not select the keywords 'manually' or modify your original data.
MySQL supports full-text search with the MyISAM engine (InnoDB gained FULLTEXT indexes only in MySQL 5.6). A full description of the options available when querying the DB is available here. The query can automatically get rid of common stop-words and of words that are too common in the data set (more than 50% of the rows contain them), depending on the querying method. Query expansion is also available, and the query type should be chosen depending on your needs.
Consider also using a separate engine like Lucene. With Lucene you will probably have more functionalities and better indexing/searching. You can automatically get rid of common words (they get a low score and do not influence the search) and use things as stemming for instance. There is a little bit of a learning curve but I'll definitely look into it.
EDIT:
The MySQL 'full-text natural language search' returns the most similar rows (and their relevance score) and is not a boolean matching search.
You would start by defining what similar means to you and how you want to score the similarity between two different documents.
Using that algorithm you can process all your documents and build a table of similarity scores.
Depending on the complexity of your scoring algorithm and size of data set, this may not be something you would run realtime, but instead batch it through something like Hadoop.
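As one possible definition of "similar", here is a small sketch in Python that combines the keyword idea from the question with a Jaccard score (shared keywords divided by all keywords). The stop-word list is a tiny made-up sample and the function names are hypothetical. A real scoring function would likely need stemming and a fuller list.

```python
import re

# Tiny illustrative stop-word list (assumption, not a real corpus list).
STOP_WORDS = {"up", "at", "am", "and", "to", "the", "a", "an", "of", "in"}


def keywords(text):
    """Lower-case the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {word for word in words if word not in STOP_WORDS}


def jaccard_similarity(text_a, text_b):
    """Score from 0.0 (no shared keywords) to 1.0 (identical keyword sets)."""
    keys_a, keys_b = keywords(text_a), keywords(text_b)
    if not keys_a and not keys_b:
        return 0.0
    return len(keys_a & keys_b) / len(keys_a | keys_b)
```

Scores computed this way for each pair of records could be stored in the similarity table mentioned above and refreshed in a batch job.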
I have done something like this. I replace all of the spaces in the string with % then use LIKE in the where clause. Here, I will give you my code. It is from MSSQL but minor adjustments can be made to work it with MySQL. Hope it helps.
CREATE FUNCTION [dbo].[fss_MakeTextSearchable] (@text NVARCHAR(MAX)) RETURNS NVARCHAR(MAX)
--replaces spaces with wildcard characters to return more matches in a LIKE condition
-- for example:
-- @text = 'my file' will return '%my%file%'
-- SELECT WHERE 'my project files' like @text would return true
AS
BEGIN
DECLARE @searchableText NVARCHAR(MAX)
SELECT @searchableText = '%' + replace(@text, ' ', '%') + '%'
RETURN @searchableText
END
Then use the function like this:
SELECT @searchString = dbo.fss_MakeTextSearchable(@String)
Then in your query:
Select * from Table where title LIKE @searchString

How do I find records when data entry has been inconsistent?

A group of people have been inconsistently entering data for a while.
Some people will enter this:
101mxeGte - TS 200-10
And other people will enter this
101mxeGte-TS-200-10
The sad thing is, those are supposed to be identical records.
They will also search inconsistently. If a record was entered one way, some people will search the other way.
Now, I know all about how you can fix data entry for the future, but that's NOT what I am asking about. I want to know how it is possible to:
Leave the data alone, but...
Search for the right thing.
Am I asking for the impossible here?
The best thing I found so far was a suggestion to simply muck about with the existing data, using the REPLACE function in mySQL.
I am uncomfortable with this option, as it means it will certainly actively piss off half of the users. The unfocused angst of all is less than the active ire of half.
The problem is that it has to go both ways:
Entering spaces in the query has to find both space and not-space entries,
and NOT entering spaces ALSO has to find both space and not-space entries.
Thanks for any help you can offer!
The "ideal" solution is pretty straightforward:
Decide what is the canonical way of representing a record
When someone saves a record, canonicalize it before saving
When someone searches for a record, canonicalize the input before searching for it
You could also write a small program to convert all existing data to the canonical form (you will have the code for it anyway, as "canonicalize" in steps 2 and 3 require that you write code that does so).
Edit: some specific information on how to canonicalize
With the sample data you give, the algorithm might be:
Replace all spaces with hyphens
Replace all runs of one or more hyphens with a single hyphen (a regex would be easiest for this -- actually, a regex can do both steps in one go)
Is there any practical problem with this approach?
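The two canonicalization steps can indeed be collapsed into one regular expression. A minimal sketch, in Python for illustration (preg_replace would do the same job in PHP); the function name is made up:

```python
import re


def canonicalize(value):
    """Collapse any run of spaces and/or hyphens into a single hyphen."""
    return re.sub(r"[\s-]+", "-", value.strip())


# Both spellings from the question end up identical.
first = canonicalize("101mxeGte - TS 200-10")
second = canonicalize("101mxeGte-TS-200-10")
```

Applying the same function to saved records and to search input is what makes the two sides comparable.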
Trim whitespaces from BOTH the existing data and the input of the search. That way the intended record(s) will always be returned. Hope your data size is small, though, because it's going to perform pretty poorly.
Edit: by "existing data" I meant "the query of existing data". My answer was based on assumption that the actual data could not be touched (which might not be correct).
If it were up to me, I'd have the data in the database updated with REPLACE, and on future searches remove all spaces from the input before matching it against a given row.
Presumably your users enter the search terms (or record details, when creating a record) in an HTML form, which then goes to a PHP script. It looks like your data can always be written in a way that contains no spaces, so why don't you do this:
Run a query that strips spaces from the existing data
Add code in the PHP script(s) that receives the form(s), so that it strips spaces from submitted data - whether that data is to be used for search or for writing new data.
Edit: I guess you would also need to change some spaces to hyphens. Shouldn't be too hard to write logic to accomplish that.
Something like this.
pseudo code:
$myinput = mysql_real_escape_string('101mxeGte-TS-200-10');
$query = " SELECT * FROM table1
WHERE REPLACE(REPLACE(f1, ' ', ''),'-','')
= REPLACE(REPLACE('$myinput', ' ', ''),'-','') ";
Alternatively you might write your own function to trim the data so it can be compared.
DELIMITER $$
-- MySQL requires a length for varchar; 255 is an assumed value
CREATE FUNCTION myTrim(AStr varchar(255)) RETURNS varchar(255) DETERMINISTIC
BEGIN
declare Result varchar(255);
SET Result = REPLACE(AStr, ' ','');
SET Result = ......
.....
RETURN Result;
END$$
DELIMITER ;
And then use this in your select
$query = " SELECT * FROM table1
WHERE MyTrim(f1) = MyTrim('$myinput') ";
have you ever heard of SQL's LIKE?
http://dev.mysql.com/doc/refman/4.1/en/string-comparison-functions.html
there's also regex
http://dev.mysql.com/doc/refman/4.1/en/regexp.html#operator_regexp
101mxeGte - TS 200-10
101mxeGte-TS-200-10
how about this?
SELECT 'justalnums' REGEXP '101mxeGte[[:blank:]]*(\-[[:blank:]]*)?TS[[:blank:]-]*200[[:blank:]-]*10'
digits can be represented by [0-9] and alphas as [a-z] or [A-Z] or [a-zA-Z]
Append a + to match one or more of those. Parens allow you to group and even capture what is inside them and reuse it later in a replace or something else.
RLIKE is the same as REGEXP.

mysql select query within a serialized array

I'm storing a list of items in a serialized array within a field in my database (I'm using PHP/MySQL).
I want to have a query that will select all the records that contain a specific one of these items that is in the array.
Something like this:
select * from table WHERE (an item in my array) = '$n'
Hopefully that makes sense.
Any ideas would be greatly appreciated.
Thanks
As GWW says in the comments, if you need to query things this way, you really ought to be considering storing this data as something other than a big-ole-string (which is what your serialized array is).
If that's not possible (or you're just lazy), you can use the fact that the serialized array is just a big-ole-string, and figure out a LIKE clause to find matching records. The way PHP serializes data is pretty easy to figure out (hint: those numbers indicate lengths of things).
Now, if your serialized array is fairly complex, this will break down fast. But if it's a flat array, you should be able to do it.
Of course, you'll be using LIKE '%...%', so you'll get no help from any indexes, and performance will be very poor.
Which is why folks are suggesting you store that data in some normalized fashion, if you need to query "inside" it.
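To illustrate the hint about lengths: PHP serializes a string value as s:LENGTH:"value"; where LENGTH is the byte count. The sketch below, in Python for illustration, builds that fragment and wraps it in a LIKE pattern. It only covers string values in a flat array (integers are serialized as i:N; instead), and the function names are hypothetical.

```python
def php_serialized_string(value):
    """Build the fragment PHP's serialize() emits for a string value."""
    byte_length = len(value.encode("utf-8"))  # PHP counts bytes, not characters
    return 's:%d:"%s";' % (byte_length, value)


def like_pattern(value):
    """Wrap the fragment in % wildcards, escaping LIKE's special characters.

    Backslash is MySQL's default LIKE escape character.
    """
    fragment = php_serialized_string(value)
    escaped = (fragment.replace("\\", "\\\\")
                       .replace("%", "\\%")
                       .replace("_", "\\_"))
    return "%" + escaped + "%"
```

The pattern should be passed to the query as a bound parameter. Keep in mind that array keys which are strings are serialized the same way, so a value search can also match a key.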
If you have control of the data model, stuffing serialized data in the database will bite you in the long run just about always. However, oftentimes one does not have control over the data model, for example when working with certain open source content management systems. Drupal sticks a lot of serialized data in dumpster columns in lieu of a proper model. For example, ubercart has a 'data' column for all of its orders. Contributed modules need to attach data to the main order entity, so out of convenience they tack it onto the serialized blob. As a third party to this, I still need a way to get at some of the data stuffed in there to answer some questions.
a:4:{s:7:"cc_data";s:112:"6"CrIPY2IsMS1?blpMkwRj[XwCosb]gl<Dw_L(,Tq[xE)~(!$C"9Wn]bKYlAnS{[Kv[&Cq$xN-Jkr1qq<z](td]ve+{Xi!G0x:.O-"=yy*2KP0#z";s:7:"cc_txns";a:1:{s:10:"references";a:1:{i:0;a:2:{s:4:"card";s:4:"3092";s:7:"created";i:1296325512;}}}s:13:"recurring_fee";b:1;s:12:"old_order_id";s:2:"25";}
See that 'old_order_id'? That's the key I need to find out where this recurring order came from, but since not everybody uses the recurring orders module, there isn't a proper place to store it in the database, so the module developer opted to stuff it in that dumpster table.
My solution is to use a few targeted SUBSTRING_INDEX's to chisel off insignificant data until I've sculpted the resultant string into the data gemstone of my desires.
Then I tack on a HAVING clause to find all that match, like so:
SELECT uo.*,
SUBSTRING_INDEX(
SUBSTRING_INDEX(
SUBSTRING_INDEX( uo.data, 'old_order_id' , -1 ),
'";}', 1),
'"',-1)
AS `old order id`
FROM `uc_orders` AS `uo`
HAVING `old order id` = 25
The innermost SUBSTRING_INDEX gives me everything past the old_order_id, and the outer two clean up the remainder.
This complicated hackery is not something you want in code that runs more than once, more of a tool to get the data out of a table without having to resort to writing a php script.
Note that this could be simplified to merely
SELECT uo.*,
SUBSTRING_INDEX(
SUBSTRING_INDEX( uo.data, '";}' , 1 ),
'"',-1)
AS `old order id`
FROM `uc_orders` AS `uo`
HAVING `old order id` = 25
but that would only work in this specific case (the value I want is at the end of the data blob)
So you mean to use MySQL to search in a PHP array that has been serialized with the serialize command and stored in a database field? My first reaction would be: OMG. My second reaction would be: why? The sensible thing to do is either:
Retrieve the array into PHP, unserialize it and search in it
Forget about storing the data in MySQL as serialized and store it as a regular table and index it for fast search
I would choose the second option, but I don't know your context.
Of course, if you'd really want to, you could try something with SUBSTRING or another MySQL function and try to manipulate the field, but I don't see why you'd want to. It's cumbersome, and it would be an unnecessary ugly hack. On the other hand, it's a puzzle, and people here tend to like puzzles, so if you really want to then post the contents of your field and we can give it a shot.
You can do it like this:
SELECT * FROM table_name WHERE some_field REGEXP '.*"item_key";s:[0-9]+:"item_value".*'
But anyway you should consider storing that data in a separate table.
How about you serialize the value you're searching for?
$sql = sprintf("select * from tbl WHERE serialized_col like '%%%s%%'", serialize($n));
or
$sql = sprintf("select * from tbl WHERE serialized_col like '%s%s%s'", '%', serialize($n), '%');
Working with php serialized data is obviously quite ugly, but I've got this one liner mix of MySQL functions that help to sort that out:
select REPLACE(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(searchColumn, 'fieldNameToExtract', -1), ';', 2), ':', -1), '"', '') AS extractedFieldName
from tableName as t
having extractedFieldName = 'expressionFilter';
Hope this can help!
Well, i had the same issue, and apparently it's a piece of cake, but maybe it needs more tests.
Simply use the IN statement, but put the field itself as array!
Example:
SELECT id, title, page FROM pages WHERE 2 IN (child_of)
~ where '2' is the value i'm looking for inside the field 'child_of' that is a serialized array.
This serialized array was necessary because I cannot duplicate the records just for storing what id they were children of.
Cheers
If I have attribute_dump field in log table and the value in one of its row has
a:69:{s:9:"status_id";s:1:"2";s:2:"id";s:5:"10215"}
If I want to fetch all rows having status_id is equal to 2, then the query would be
SELECT * FROM log WHERE attribute_dump REGEXP '.*"status_id";s:[0-9]+:"2".*'
There is a good REGEX answer above, but it assumes a key and value implementation. If you just have values in your serialized array, this worked for me:
value only
SELECT * FROM table WHERE your_field_here REGEXP '.*;s:[0-9]+:"your_value_here".*'
key and value
SELECT * FROM table WHERE your_field_here REGEXP '.*"array_key_here";s:[0-9]+:"your_value_here".*'
For easy method use :
column_field_name LIKE '%VALUE_TO_BE_SEARCHED_FOR%'
in MySQL query
You may be looking for an SQL IN statement.
http://www.w3schools.com/sql/sql_in.asp
You'll have to break your array out a bit first, though. You can't just hand an array off to MySQL and expect it will know what to do with it. For that, you may try serializing it out with PHP's explode.
http://php.net/manual/en/function.explode.php
Select * from table where table_field like '%"enter_your_value"%'
select * from postmeta where meta_key = 'your_key' and meta_value REGEXP ('6')
foreach( $result as $value ) {
$hour = unserialize( $value->meta_value );
if( $hour['date'] < $data['from'] ) {
$sum = $sum + $hour['hours'];
}
}
