CI & MySQL; using LIKE on a delimited string - php

Maybe LIKE isn't the proper solution for what I'm attempting to do; but what I'm looking to achieve is to take a string retrieved from the database. (i.e. tags [ tag1,tag2,tag3,tag4 ]) - then compare each value (delimited by the comma) and return like rows accordingly.
<?
$string = 'tag1,tag2,tag3,tag4';
$tags = explode(',', $string);
foreach ( $tags as $tag )
{
$this->db->limit(2); $this->db->like('Tags', $tag);
$result = $this->db->get('table');
$recommend[] = $result;
}
This is the best I could understand how to do what I'm looking for, but I'm quite confused on the LIKE method in general of CI's Active Record Class.
Is performing searches on delimited string in a database even really entirely wise?
Thanks for the input, and ideas you two. MySQL goes a bit over my head, but after reading a bit about the things possible with relational databases; I'm going to attempt to learn this process so I can build my database more efficient.
My database is going to be a catalog of video games and files, so obviously that means it's going to reach insane amounts of entries - so attempting to look through any sort of string for possible returned rows would be by far the dumbest thing I could do in the long run.
Again, thanks for the information and ideas.

If you have a properly normalized database, tags should be split up to a new table as this is a one-to-many (or maybe even a many-to-many) relationship.
However, the easiest way to achieve this is to also place a comma before the first tag, and after the last, and simply search for:
SELECT * FROM table WHERE Tags LIKE "%,coleslaw,%";
But yes, this is bad, and can be slow if your table grows as you will not be able to utilize indexes.

There is a FIELD command in MySQL that you might be able to take advantage of. You can read up on it here.
But in general, performing searches on delimited strings, or substrings in general, in a database will kill performance as you will lose any indexing when matching on the substring. If you have a thousand rows, MySQL will have to scan through all one thousand rows before discovering that there is no match. This applies to both Like and Substring appearing in a where clause. Many database implementations specifically provide a BEGINS WITH operator that can still take advantage of indexing, but I'm not aware of this in MySQL.
Consider instead breaking the string up during insertion, and storing each word seperately in a different table. If you find that you do this a lot, consider an engine built for text searching, such as Lucene. You can use PHP with Lucene through SOLR. While I throw that out there, it may very well be complete overkill for your project.
I know this probably isn't what you want to hear, but your gut is correct on this one.

Related

php sql search in_array in sql [duplicate]

Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code was worth it in my situation, is this a defensible design choice, or should I have normalized it from the start?
Some more context, this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and make it more maintainable. There are some things in there I'm not entirely happy with, one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as #OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma separated list
how to get records that have only the same 2/3/etc specific value from that comma separated list
Another problem with the comma separated list is ensuring the values are consistent - storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
I needed a multi-value column, it could be implemented as an xml field
It could be converted to a comma delimited as necessary
querying an XML list in sql server using Xquery.
By being an xml field, some of the concerns can be addressed.
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.**
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.**
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with delimited list AND can be converted to a delimited list as needed
Yes, it is that bad. My view is that if you don't like using relational databases then look for an alternative that suits you better, there are lots of interesting "NOSQL" projects out there with some really advanced features.
Well I've been using a key/value pair tab separated list in a NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries but on the other hand, if you have a library that persists/derpersists the key value pair then it's not a that bad idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use a INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR (0) (nullable) for each. You could also use a SET (I forget the exact syntax).

How to use implode() in sql operator? [duplicate]

Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code was worth it in my situation, is this a defensible design choice, or should I have normalized it from the start?
Some more context, this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and make it more maintainable. There are some things in there I'm not entirely happy with, one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as #OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma separated list
how to get records that have only the same 2/3/etc specific value from that comma separated list
Another problem with the comma separated list is ensuring the values are consistent - storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
I needed a multi-value column, it could be implemented as an xml field
It could be converted to a comma delimited as necessary
querying an XML list in sql server using Xquery.
By being an xml field, some of the concerns can be addressed.
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.**
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.**
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with delimited list AND can be converted to a delimited list as needed
Yes, it is that bad. My view is that if you don't like using relational databases then look for an alternative that suits you better, there are lots of interesting "NOSQL" projects out there with some really advanced features.
Well I've been using a key/value pair tab separated list in a NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries but on the other hand, if you have a library that persists/derpersists the key value pair then it's not a that bad idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use a INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR (0) (nullable) for each. You could also use a SET (I forget the exact syntax).

MongoDB search - autocomplete

We are currently running our app on MySQL and are planning to move to MongoDB. We have moved some parts already but having issues with MongoRegex performance.
We have an autocomplete search box that joins 6 tables (indexed / non-indexed fields) and returns results super fast on mysql. The same thing on MongoDB performs really slow. It takes about 2.3 seconds only on one collection. The user has to wait for a long time. The connection time is 0.064 secs. Query time 2.36 seconds. I did a bit of Googling and couldn't find a perfect answer. Everyone said MongoRegex is slow. If that's true how are other companies overcoming this problem?
What is best way to improve autocomplete performance / experience when running in on MongoDB?
First of all, you will have to design your query carefully. Carefully as in, selecting properly indexed fields and designing accordingly. Also if you are using regex make sure you are writing the regex in a way which forces the query to use an indexed field. Something like /^prefix/ will do. [ See this link : http://docs.mongodb.org/manual/reference/operator/query/regex/#index-use ]
I have seen many implementations using range query of mongodb, but Im not sure if thats the best one, since instantaneous results are a key thing.
Apart from that, I have seen some one who recommended prefix-trees. Which effectively stores the prefixes in a field, and then stores all the words starting with that particular prefix in the next fields as an array. This solution sounds convincing and fast, since the prefix fields are supposed to be indexed, but you will have to think about the storage factor also.
It is hard to tell as we don't have the search query, but worth mentioning that, after taking a look at the documentation, it appears that MongoDB is able to use an index when using RE search if the pattern is prefixed by a constant string only if it is anchored:
http://docs.mongodb.org/manual/reference/operator/query/regex/#index-use
So, to quote the doc:
A regular expression is a “prefix expression” if it starts with a caret (^) or a left anchor (\A), followed by a string of simple symbols. For example, the regex /^abc.*/ will be optimized by matching only against the values from the index that start with abc.
But /abc.*/ ou /^.*abc/ will not use the index.

MYSQL search database for similar results

Essentially what I want to do is search a number of MYSQL databases and return results where a certain field is more than 50% similar to another record in the databases.
What am I trying to achieve?
I have a number of writers who add content to a network of websites that I own, I need a tool that will tell me if any of the pages they have written are too similar to any of the pages currently published on the network. This could run on post/update or as a cron... either way would work for me.
I've tried making something with php, drawing the records from the database and using the function similar_text(), which gives a % difference between two strings - this however is not a workable solution as you have to compare every entry against every other entry & I worked out with microtime that it would take around 80 hours to completely search all of the entries!
Wondering if it's even possible!?
Thanks!
You are probably looking for is SOUNDEX. It is the only sound based search in mysql. If you have A LOT of data to compare, you're probably going to need to pregenerate the soundex and compare the soundex columns or use it live like this:
SELECT * FROM data AS t1 LEFT JOIN data AS t2 ON SOUNDEX(t1.fieldtoanalyse) = SOUNDEX(t2.fieldtoanalyse)
Note that you can also use the
t1.fieldtoanalyze SOUNDS LIKE t2.fieldtoanalyze
syntax.
Finaly, you can save the SOUNDEX to a column, just create a column and:
UPDATE data SET fieldsoundex = SOUNDEX(fieldtoanalyze)
and then compare live with pregenerated values
More on Soundex
Soundex is a function that analyzes the composition of a word but in a very crude way. It is very useful for comparisons of "Color" vs "Colour" and "Armor" vs "Armour" but can also sometimes dish out weird results with long words because the SOUNDEX of a word is a letter + a 3 number code. There is just so much you can do sadly with these combinations.
Note that there is no levenstein or metaphone implementation in mysql... not yet, but probably, levenstein would have been the best for your case.
Anything is possible.
Without knowing your criteria for similar, it's difficult to offer a specific solution. However, my suggestion would be pre-build a similarity table, utilize a function such as similar_text(). Use this as your index table when searching by term.
You'll take an initial hit to build such an index. However, you can manage it easier as new records are added.
Thanks for your answers guys, for anyone looking for a solution to a problem similar to this I used the SOUNDEX function to pull out entries that had a similar title then compared them with the similar_text() function. Not quite a complete database comparison, but near as I could get it!

What method should be used for searching strings in mysql data?

I've got a mysql dataset that contains 86 million rows.
I need to have a relatively fast search through this data.
The data I'll be searching through is all strings.
I also need to do partial matches.
Now, if I have 'foobar' and search for '%oob%' I know it'll be really slow - it has to look at every row to see if there is a match.
What methods can be used to speed up queries like this?
I don't think fulltext search will allow for partial matches (I could be wrong on this).
You might take a look at Sphinx Search which is a full-text and partial match search engine system. You can easily import your mysql data into it, and then use simple PHP queries to search the data. It is far more efficient than using MySQL to do the query.
What you seek is Full Text Search. This would enable you to do partial matches quickly and searches on multiple columns quickly.

Categories