Algorithm to sanitize MySQL data

Let's say I have a MySQL table of 100,000 records with two columns: title and description.
There's also a table containing all the bad words that need to be sanitized.
For example, let's say the title column contains the string "Forget this" and the profanity table says that the string "Forget" should be replaced with "F*****".
I currently do this with a brute-force method, but it is far too slow: it checks every substring of the sentence against every string in the profanity table.
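public function sanitizeSiteProfanity($word, $replacement)
{
    // result_array() returns a plain PHP array, so count() it instead of calling num_rows()
    $rows = $this->_ci->db->select('title, description')->get('top_sites')->result_array();
    $n = count($rows);
    for ($i = 0; $i < $n; $i++)
    {
        // str_replace() returns the new string; it does not modify its argument
        $rows[$i]['title'] = str_replace($word, $replacement, $rows[$i]['title']);
        $rows[$i]['description'] = str_replace($word, $replacement, $rows[$i]['description']);
    }
}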
Is there a faster method to sanitize all the substrings?

I don't know of a fast way to sanitize the data. It seems that you have to loop through all the words for the replacement, because one title could contain multiple offensive words.
If you are looking for complete words, a full-text index with MATCH ... AGAINST should speed things up. Essentially, you would set up a loop over the words and, for each one, run:
update table
set title = replace(title, 'Fuck', 'F***')
where match (title) against ('Fuck' in boolean mode);
You would need to put this in a stored procedure loop. But the match() would be quite fast, so this would probably be significantly faster than the current process.
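If the replacement stays in PHP instead, the per-word loop does not have to be written by hand: str_replace() accepts arrays of search and replacement strings. The sketch below is not the stored-procedure approach described above; sanitizeText() is a hypothetical helper, and the word list is assumed to have been loaded from the profanity table as word => replacement pairs.

```php
<?php
// Hypothetical helper: applies every word => replacement pair from the
// profanity list to one string in a single str_replace() call.
function sanitizeText(string $text, array $replacements): string
{
    // The pairs are applied in array order, so a later pair can also
    // rewrite text produced by an earlier one.
    return str_replace(array_keys($replacements), array_values($replacements), $text);
}

echo sanitizeText('Forget this', ['Forget' => 'F*****']), "\n"; // F***** this
```

Note that str_replace() is case-sensitive; str_ireplace() takes the same arguments if case should be ignored.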

The best way to optimize this is to delegate the replacement step to the database and let MySQL do the heavy lifting. You'll need the built-in MySQL REPLACE() function. The (minor) drawback is that you'll have to write explicit SQL instead of using the CodeIgniter query builder.
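A minimal sketch of what that explicit SQL could look like, built in PHP. buildSanitizeUpdate() is a hypothetical helper, not part of CodeIgniter; the table and column names are the ones from the question, and the word and its replacement are left as ? placeholders to be bound by the caller.

```php
<?php
// Hypothetical helper: returns an UPDATE statement that replaces one word
// in one column. Identifiers cannot be bound as parameters, so they are
// validated against a strict pattern before being placed in the SQL.
function buildSanitizeUpdate(string $table, string $column): string
{
    foreach ([$table, $column] as $identifier) {
        if (!preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $identifier)) {
            throw new InvalidArgumentException("Unsafe identifier: $identifier");
        }
    }
    // INSTR() restricts the update to rows that contain the word.
    return "UPDATE `$table` SET `$column` = REPLACE(`$column`, ?, ?) WHERE INSTR(`$column`, ?) > 0";
}

echo buildSanitizeUpdate('top_sites', 'title'), "\n";
```

With CodeIgniter's query bindings this could presumably be run once per profanity word and column as $this->db->query($sql, array($word, $replacement, $word)), so the 100,000 rows never have to travel to PHP.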

Search in MySQL with permutations

I need help.
I have a table with only two columns, ID and NAME, and this data:
ID | NAME
1  | HOME
2  | GAME
3  | LINK
I want to show, for example, the row with the name HOME if the user searches for HOME, OMEH, EMOH, HMEO, etc.: any permutation of the word HOME.
I can't store all these permutations in MySQL and search that column, because some words will be too long (9-10 characters): that is more than 40 MB for each 9-character word.

One way to solve this problem is to store the sorted characters of each name in an additional column, and then sort the characters of the user's input before searching. For example, the database has:
ID NAME CHARS
1 HOME EHMO
2 GAME AEGM
3 LINK IKLN
Then when searching in PHP you would do this:
$search = 'MEHO'; // user input = MEHO
$chars = str_split($search);
sort($chars);
$search = implode('', $chars); // now contains EHMO
$sql = "SELECT ID, NAME FROM table1 WHERE CHARS = '$search'";
// perform query etc.
Output
ID NAME
1 HOME
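The sorting step can be wrapped in a small function and used both when filling the CHARS column and when normalising user input. This is only a sketch: anagramKey() is a hypothetical name, and upper-casing the input is an assumption made here so that "meho" and "MEHO" produce the same key.

```php
<?php
// Hypothetical helper: returns the characters of $word in sorted order,
// so every permutation of the same letters yields the same key.
function anagramKey(string $word): string
{
    $chars = str_split(strtoupper($word));
    sort($chars, SORT_STRING);
    return implode('', $chars);
}

echo anagramKey('MEHO'), "\n"; // EHMO
echo anagramKey('GAME'), "\n"; // AEGM
```

Because the lookup is then a plain equality test on CHARS, an ordinary index on that column can be used, and the key can be bound as a query parameter instead of being interpolated into the SQL string.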

This sounds like a "please do my homework for me" question. It is hard to conceive of a real-world problem this applies to, and there is no standard solution. It is OK to ask for help with your homework here, but you should state that this is the case.
"more than 40 MB for each 9 chars words"
Your maths is a bit wonky (nine distinct letters give 9! = 362,880 permutations, roughly 3.3 MB at 9 bytes each), but indeed the storage does not scale well. On the other hand, leaving the amount of storage aside, it does scale well as a solution in terms of processing workload.
You could simply brute-force a dynamic query:
function mkqry($word)
{
    $qry = "SELECT * FROM yourtable WHERE 1 ";
    $last = strlen($word);
    for ($x = 0; $x < $last; $x++) {
        // one LIKE test per character of the search word
        $qry .= " AND word LIKE '%" . substr($word, $x, 1) . "%'";
    }
    return $qry;
}
However this will always result in a full table scan (slow) and won't correctly handle cases where a letter occurs twice in a word.
The solution is to use an indexing function which is independent of the order in which the characters appear - a non-cryptographic hash. An obvious candidate would be to XOR the characters together, although this only results in a one character identifier which is not very selective. So I would suggest simply adding the character codes:
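function pos_ind_hash($word)
{
    // sum of the character codes: independent of the order of the characters
    $sum = 0;
    $last = strlen($word);
    for ($x = 0; $x < $last; $x++) {
        $sum += ord(substr($word, $x, 1));
    }
    return $sum;
}
function mkqry($word)
{
    $qry = "SELECT * FROM yourtable WHERE 1 ";
    $last = strlen($word);
    for ($x = 0; $x < $last; $x++) {
        $qry .= " AND word LIKE '%" . substr($word, $x, 1) . "%'";
    }
    // the indexed hash column narrows the candidates before the LIKE tests run
    $qry .= " AND yourtable.hash=" . pos_ind_hash($word);
    return $qry;
}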
Note that the hash mechanism here does not uniquely identify a single word, but is specific enough to reduce the volume to the point where an index (on the hash) would be effective.
Multiplying rather than adding would create fewer collisions but at a greater risk of overflowing (which would create ambiguity between implementations).
But both the hash and the single-character LIKE only reduce the number of potential matches. To get the query to behave definitively, you need to go further. You could add an attribute to the table (and to the index with the hash) containing the string length. This would be more selective (i.e. improve the effectiveness of the index) but still not definitive.
For a definitive method you would need to specify in your query that the data does NOT contain characters which are NOT in the word you are looking for.
The wrong way to do that would be to add a loop specifying "AND NOT LIKE....".
A valid way would be to add a test to the query that removes from the table attribute every letter that appears in the search word and checks that the result is a zero-length string.
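One possible reading of that last test, sketched as a PHP function that generates nested REPLACE() calls. onlyLettersFrom() is a hypothetical name, the column name is assumed to be a trusted identifier supplied by the developer, and the search word is restricted to plain letters so that it can be placed in the SQL safely.

```php
<?php
// Hypothetical helper: builds a SQL condition that is true only when the
// column contains no characters other than those occurring in $word.
function onlyLettersFrom(string $column, string $word): string
{
    if (!preg_match('/^[A-Za-z]+$/', $word)) {
        throw new InvalidArgumentException('Only plain letters are supported');
    }
    $expr = $column;
    // Strip each distinct letter of the search word from the column value.
    foreach (array_unique(str_split($word)) as $letter) {
        $expr = "REPLACE($expr, '$letter', '')";
    }
    // Nothing may be left over once all those letters are removed.
    return "$expr = ''";
}

echo onlyLettersFrom('word', 'HOME'), "\n";
```

On its own this condition ignores how often a letter repeats, so it is meant to be combined with the length and hash tests described above rather than to replace them.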

Optimization of search function MySQL or PHP wise

After running a few tests I realized that my search method does not perform very well if some words of the query are short (2-3 letters).
The search works by running a MySQL query for every word in the string the visitor entered, and then intersecting the results to see which rows matched every word. Any row returned for all of the words is a match, and I show it to the visitor.
But I was wondering whether that's an effective way to do it. Is there a better way that keeps the same functionality?
Currently the MySQL queries take about 0.7 s, and the rest of the code takes under 0.1 s.
Normally I would not care much about my search taking 0.7 s, but I'd like to create a "LiveSearch", and it is critical that it loads faster than that.
Here is my code:
public static function Search($Query){
    $Querys = explode(' ',$Query);
    foreach($Querys as $Query)
    {
        $MatchingRow = \Database\dbCon::$dbCon -> prepare("
            SELECT
                `Id`
            FROM
                `product_products` as pp
            WHERE
                CONCAT(
                    `Id`,
                    ' ',
                    (SELECT `Name` FROM `product_brands` WHERE `Id` = pp.BrandId),
                    ' ',
                    `ModelNumber`,
                    ' ',
                    `Title`,
                    IF(`isFreeShipping` = 1 OR `isFreeShippingOnOrder` = 1, ' FreeShipping', '')
                )
                LIKE :Title;
        ");
        $MatchingRow -> bindValue(':Title','%'.$Query.'%');
        try{
            $MatchingRow -> execute();
            foreach($MatchingRow -> fetchAll(\PDO::FETCH_ASSOC) as $QueryInfo)
                $Matchings[$Query][$QueryInfo['Id']] = $QueryInfo['Id'];
        }catch(\PDOException $e){
            echo 'Error MySQL: '.$e->getMessage();
        }
    }
    $TmpMatch = $Matchings;
    $Matches = array_shift(array_values($TmpMatch));
    foreach($TmpMatch as $Query)
    {
        $Matches = array_intersect($Matches,$Query);
    }
    foreach($Matches as $Match){
        $Products[] = new Product($Match);
    }
    return $Products;
}

As others have already suggested, fulltext search is your friend.
The logic should go more or less like this.
Add a text column to your "product_products" table called, say, "FtSearch"
Write a small script that will run only once, in which you write, for each existing product, the text that has to be searched for into the "FtSearch" column. This is, of course, the text you compose in your query (id + brand name + title and so forth, including the FreeShipping part). You can probably do this with a single SQL statement, but I have no mysql at hand and can't provide you the code for that... you might be able to find it out by yourself.
Create a fulltext index on the "FtSearch" column (doing this after having populated the FtSearch column saves you a little execution time).
Now you have to add the code necessary to ensure that every time any of the fields involved in the search string is inserted or updated, you insert or update the search string as well. Pay attention here, since this includes not only the "Title", "ModelNumber" and "FreeShipping" fields of "product_products", but also the "Name" of "product_brands". This means that if the name of a brand is updated, you will have to regenerate the search strings of all products having that brand. This might be a little slow, depending on how many products are bound to that brand. However, I assume it does not happen too often that a brand changes its name, and if it does, it certainly happens in some sort of administration interface where such delays are usually acceptable.
At this point you can query the table using the MATCH construct, which is far faster than anything you could get with your current approach.
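The steps above could be sketched in PHP roughly as follows. The function names, the FT_SEARCH_SQL constant and the :Terms placeholder are hypothetical; the column names come from the question, and "FtSearch" is the column proposed in this answer.

```php
<?php
// Hypothetical helper mirroring the CONCAT() in the question: composes the
// text stored in the assumed "FtSearch" column for one product row.
function buildFtSearch(array $product, string $brandName): string
{
    $parts = [$product['Id'], $brandName, $product['ModelNumber'], $product['Title']];
    if ($product['isFreeShipping'] == 1 || $product['isFreeShippingOnOrder'] == 1) {
        $parts[] = 'FreeShipping';
    }
    return implode(' ', $parts);
}

// Hypothetical helper: turns "acme drill" into "+acme +drill" so that, in
// boolean mode, every word is required to match.
function toBooleanTerms(string $query): string
{
    // Drop characters that boolean mode would treat as operators.
    $clean = (string) preg_replace('/[^\p{L}\p{N}\s]+/u', ' ', $query);
    $words = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_map(fn($word) => '+' . $word, $words));
}

// One query for the whole search string instead of one query per word.
const FT_SEARCH_SQL = "SELECT `Id` FROM `product_products` WHERE MATCH(`FtSearch`) AGAINST (:Terms IN BOOLEAN MODE)";

echo toBooleanTerms('acme drill'), "\n"; // +acme +drill
```

Keep in mind that full-text indexes skip words shorter than a configurable minimum length, so the 2-3 letter terms mentioned in the question may still need separate handling.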

Is it worth saving the keyword <-> link relation in a "hashtable"-like structure in MySQL?

I'm working on a PHP + MySQL application that will crawl an HDD/shared drive and index all files and directories into a database, to provide "fulltext" search over them. So far it is going well, but I'm stuck on the question of whether I chose a good way to store the data in the database.
In the picture below you can see part of my database schema. The idea is that I save a domain (which represents the part of the disk I want to index), then there are links (which represent files and folders, with content, file path, etc.), and then I have a table storing the unique keywords that I find in file/folder names or content.
Finally, I have 16 linkkeyword tables to store the relations between links and keywords. I have 16 of them because I thought it might be good to make something like a hashtable, since I'm expecting a high number of link <-> keyword relations (so far, for 15k links and 400k keywords, I have about 2.5 million linkkeyword records). So, to avoid storing so much data in one table (and later searching over it), I thought this hashtable could be faster. It works like this: when I want to search for a word, I compute its md5 and look at the first character of the hash, which tells me which linkkeyword table to use. That way there are only about 150-200k records in each linkkeyword table (versus 2.5 million).
So I'm curious whether this approach is of any use, or whether it would be better to store all the linkkeyword information in a single table and let MySQL take care of it (and how many link <-> keyword relations can it handle?).
So far this was a great solution for me, but I hit a wall when I tried to implement regular-expression search, where the user can enter e.g. "tem*", which can match temp, temporary, temple, etc. In a normal word search, I compute the md5 hash and then I know which linkkeyword table to look in. But for a regular expression I need to get all keywords from the keyword table that match the expression and then process them one by one.
I'm also attaching part of the code for the normal keyword search:
private function searchKeywords($selectedDomains) {
    $searchValues = $this->searchValue;
    $this->resultData = array();
    foreach (explode(" ", $searchValues) as $keywordName) {
        $keywordName = strtolower($keywordName);
        $keywordMd5 = md5($keywordName);
        $selection = $this->database->table('link');
        $results = $selection->where('domain.id', $selectedDomains)->where('domain.searchable = ?', '1')->where(':linkkeyword' . $keywordMd5[0] . '.keyword.keyword LIKE ?', $keywordName)
            ->select('link.*,:linkkeyword' . $keywordMd5[0] . '.weight,:linkkeyword' . $keywordMd5[0] . '.keyword.keyword');
        foreach ($results as $result) {
            $keyExists = array_key_exists($result->linkId, $this->resultData);
            if ($keyExists) {
                $this->resultData[$result->linkId]->updateWeight($result->weight);
                $this->resultData[$result->linkId]->addKeyword($result->keyword);
            } else {
                $domain = $result->ref('domain');
                $linkClass = new search\linkClass($result, $domain);
                $linkClass->updateWeight($result->weight);
                $linkClass->addKeyword($result->keyword);
                $this->resultData[$result->linkId] = $linkClass;
            }
        }
    }
}
and the regular-expression search function:
private function searchRegexp($selectedDomains) {
    //get stored search value
    $searchValues = $this->searchValue;
    //replace asterisk and exclamation mark (the user's wildcard characters) with their MySQL LIKE equivalents
    $searchValues = str_replace("*", "%", $searchValues);
    $searchValues = str_replace("!", "_", $searchValues);
    //empty the result array so previous results do not interfere
    $this->resultData = array();
    //the searched phrase can contain multiple keywords, so split it by space and get results for each keyword
    foreach (explode(" ", $searchValues) as $keywordName) {
        //set default link result weight to -1 (default value)
        $weight = -1;
        //select all keywords which match the searched keyword (or its pattern)
        $keywords = $this->database->table('keyword')->where('keyword LIKE ?', $keywordName);
        foreach ($keywords as $keyword) {
            //compute the keyword's md5 sum to determine which table holds its links
            $md5 = md5($keyword->keyword);
            //get all link ids from the linkkeyword relation table
            $keywordJoinLink = $keyword->related('linkkeyword' . $md5[0])->where('link.domain.searchable','1');
            //loop over the found links
            foreach ($keywordJoinLink as $link) {
                //store link weight, for later result sorting
                $weight = $link->weight;
                //get link ID
                $linkId = $link->linkId;
                //check if the link already exists in the results, to prevent duplicates
                $keyExists = array_key_exists($linkId, $this->resultData);
                //if the link already exists in the result set, just update its weight and add the matching keyword
                if ($keyExists) {
                    $this->resultData[$linkId]->updateWeight($weight);
                    $this->resultData[$linkId]->addKeyword($keyword->keyword);
                //if the link isn't in the results yet, insert it
                } else {
                    //get link reference
                    $linkData = $link->ref('link', 'linkId');
                    //get information about the domain the link belongs to (location, flagPath,...)
                    $domainData = $linkData->ref('domain', 'domainId');
                    //if the domain is searchable and was selected before the search, add the link to the result set; otherwise ignore it
                    if ($domainData->searchable == 1 && in_array($domainData->id, $selectedDomains)) {
                        //create new link instance
                        $linkClass = new search\linkClass($linkData, $domainData);
                        //add the matching keyword to the link's keyword set
                        $linkClass->addKeyword($keyword->keyword);
                        //set the link's weight
                        $linkClass->updateWeight($weight);
                        //insert the link into the result set
                        $this->resultData[$linkId] = $linkClass;
                    }
                }
            }
        }
    }
}

Your question is mostly one of opinion, so you may want to include the criteria that would allow us to answer "worth it" more objectively.
It appears you've re-invented the concept of database sharding (though without distributing your data across multiple servers).
I assume you are trying to optimize search time; if that's the case, I'd suggest that 2.5 million records on modern hardware is not a particularly big performance challenge, as long as your queries can use an index. If you can't use an index (e.g. because you're doing a regular expression search), sharding will probably not help at all.
My general recommendation with database performance tuning is to start with the simplest possible relational solution, keep tuning that until it breaks your performance goals, then add more hardware, and only once you've done that should you go for "exotic" solutions like sharding.
This doesn't mean using prayer as a strategy. For performance-critical applications, I typically build a test database where I can experiment with solutions. In your case, I'd build a database with your schema without the "sharding" tables, and then populate it with test data (either write your own population routines, or use a tool like DBMonster). Typically, I'd go for at least double the size I expect in production. You can then run and tune queries to prove, one way or another, whether your schema is good enough. It sounds like a lot of work, but it's much less work than your sharding solution is likely to bring along.
There are (as @danFromGermany comments) solutions that are optimized for text search, and you could use MySQL's fulltext search features rather than regular expressions.
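On the index point: a LIKE pattern with a literal prefix, such as 'tem%', can use an ordinary index on the keyword column, whereas a pattern starting with a wildcard cannot. As a hedged sketch for the single-table variant, the wildcard translation from the question could be written so that literal % and _ characters in the input are escaped first; wildcardToLike() is a hypothetical name.

```php
<?php
// Hypothetical helper: converts the question's user-facing wildcards
// ("*" = any run of characters, "!" = exactly one character) into a
// MySQL LIKE pattern.
function wildcardToLike(string $term): string
{
    // Escape the characters that are special to LIKE before translating.
    $escaped = addcslashes($term, '\\%_');
    return strtr($escaped, ['*' => '%', '!' => '_']);
}

echo wildcardToLike('tem*'), "\n"; // tem%
```

The resulting pattern would still be passed as a bound parameter, as the question's code already does with 'keyword LIKE ?'.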

How to search partial/masked strings?

I am storing social security numbers in the database, but instead of storing the whole number, I store only a 5-digit sequence. So, if the SSN is 123-12-1234, my database would store it as #23121### or ####21234 or anything else, as long as it has 5 digits in a row.
Therefore, when a user enters a whole SSN, I want the database to locate all matches.
So, I can do this:
SELECT * FROM user WHERE ssn like 123121234
But the query above would not work, since I have some masked characters in the SSN field (#23121###). Is there a good way of doing this?
Maybe a good way would be to use
SELECT * FROM user WHERE REPLACE (ssn, '#', '') like 123121234
Although there could be an issue - the query might return non-relevant matches since 5 numbers that I store in the DB could be anywhere in a sequence.
Any idea how to do a better search?

If the numbers are always in a sequential block, you can generate a very efficient query by just generating the 5 variations of the ssn that could be stored in the DB and search for all of them with an exact match. This query can also use indexes to speed things up.
SELECT *
FROM user
WHERE ssn IN ('12312####',
'#23121###',
'##31212##',
'###12123#',
'####21234');
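The five variants for the IN list can be generated rather than typed by hand. A minimal sketch, assuming the stored values always use '#' as the mask character and keep the digits in their original positions; maskedVariants() is a hypothetical helper.

```php
<?php
// Hypothetical helper: returns every way a run of $visible consecutive
// digits of $ssn could be stored, with the other positions masked by '#'.
function maskedVariants(string $ssn, int $visible = 5): array
{
    $digits = preg_replace('/\D/', '', $ssn); // drop dashes and spaces
    $length = strlen($digits);
    $variants = [];
    for ($i = 0; $i + $visible <= $length; $i++) {
        $variants[] = str_repeat('#', $i)
            . substr($digits, $i, $visible)
            . str_repeat('#', $length - $i - $visible);
    }
    return $variants;
}

print_r(maskedVariants('123-12-1234'));
```

Each variant can then be bound to its own placeholder in WHERE ssn IN (?, ?, ?, ?, ?).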

I think you can do something like this:
Extract all possible 5-character sequences from the queried SSN.
Make an IN() query on those sequences. I'm not sure, though, how many results you would get from this.
$n = '123121234'; // keep the SSN as a string so that leading zeros survive
$sequences = array();
for ($i = 0; $i + 5 <= strlen($n); $i++) {
    // every run of 5 consecutive digits
    $sequences[] = substr($n, $i, 5);
}
var_dump($sequences);
Tell me if you need the hash signs surrounding the strings.

How to implement an effective search algorithm when using PHP and a MySQL database?

I'm new to web design, especially backend design, so I have a few questions about implementing a search function in PHP. I have already set up a MySQL connection, but I don't know how to access specific rows in the MySQL table. Also, is the similar_text function used correctly, considering that I want to return results that are nearly the same as the search term? Right now, I can only return results that are exactly the same, or it gives "no result." For example, I would like a search for "tex" to return results containing "text". I realize that there are a lot of mistakes in my coding and logic, so please help if possible. Event is the name of the row I am trying to access.
$input = $_POST["searchevent"];
while ($events = mysql_fetch_row($Event)) {
    $eventname = $events[1];
    $eventid = $events[0];
    // $hold receives the similarity between the two strings as a percentage
    $diff = similar_text($input, $eventname, $hold);
    if ($hold == 100) {
        echo $eventname;
        break;
    } else {
        echo "no result";
    }
}
Thank you.
I've noticed that some of the comments mention more efficient ways of performing the search than the similar_text function. If I were to use LIKE, how would it be implemented?

There are a couple of different ways of doing this.
The faster one (performance-wise) is:
SELECT * FROM Table WHERE keyword LIKE '%value%'
The trick here is the placement of %, which is a wildcard matching any sequence of characters: '%value%' matches the value anywhere in the field, 'value%' matches fields that begin with it, and '%value' matches fields that end with it.
A more flexible but (slightly) slower one is the REGEXP operator:
SELECT * FROM Table WHERE keyword REGEXP 'value'
This uses the power of regular expressions, so you can get as elaborate as you want with it. However, leaving it as above gives you a "poor man's Google" of sorts, allowing the search term to match bits and pieces of whole fields.
The sticky part comes in if you're trying to search names. For example, either would find the name "smith" if you searched for SMI. However, neither would find "Jon Smith" if the first and last names were stored in separate fields. So you'd have to do some concatenation for the search to find Jon OR Smith OR Jon Smith OR Smith, Jon. It can really snowball from there.
Of course, if you're doing some sort of advanced search, you'll have to build your query conditions accordingly. For instance, if you wanted to search first, last and address, your query would have to test each of them:
SELECT * FROM table WHERE first LIKE '%value%' OR last LIKE '%value%' OR address LIKE '%value%'
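Since the value comes from a form field, it is safer to bind it as a parameter than to paste it into the SQL. A minimal sketch of building the multi-column condition that way; buildLikeSearch() and the people table are hypothetical, and the column names are assumed to be fixed by the developer, never taken from user input.

```php
<?php
// Hypothetical helper: builds a WHERE clause that tests several columns
// with LIKE, plus the list of values to bind to its placeholders.
function buildLikeSearch(array $columns, string $value): array
{
    // Escape LIKE's own wildcards so the user's text is matched literally.
    $pattern = '%' . addcslashes($value, '\\%_') . '%';
    $tests = [];
    foreach ($columns as $column) {
        $tests[] = "`$column` LIKE ?";
    }
    return [implode(' OR ', $tests), array_fill(0, count($columns), $pattern)];
}

[$where, $params] = buildLikeSearch(['first', 'last', 'address'], 'smi');
echo "SELECT * FROM people WHERE $where\n";
```

The $params array can then be passed to a prepared statement, for example PDOStatement::execute($params).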

Look at the example below:
$word2compare = "stupid";
$words = array(
    'stupid',
    'stu and pid',
    'hello',
    'foobar',
    'stpid',
    'upid',
    'stuuupid',
    'sstuuupiiid',
);
foreach ($words as $str) {
    // similar_text() stores the similarity of the two strings, as a percentage, in $percent
    similar_text($str, $word2compare, $percent);
    if ($percent > 90) // Change percentage value to 80,70,60 and see changes
        print "Comparing '$word2compare' with '$str': $percent\n";
}
You can use the $percent parameter to control how strong a match you require.
