I'm using this function to clean strings for Elasticsearch:
function cleanString($string){
$string = mb_convert_encoding($string, "UTF-8");
$string = str_ireplace(array('<', '>'), array(' <', '> '), $string);
$string = strip_tags($string);
$string = filter_var($string, FILTER_SANITIZE_STRING);
$string = str_ireplace(array("\t", "\n", "\r", " "," ",":"), ' ', $string);
$string = str_ireplace(array("","«","»","£"), '', $string);
return trim($string, ",;.:-_*+~#'\"´`!§$%&/()=?«»");
}
It does all sorts of things, but the problem I am facing is with the trim call at the very end. It is supposed to trim away whitespace and special characters, and it worked fine until recently, when I added two more special characters to trim: « and ». This caused problems with another special character:
When I pass the word België into the function, the ë gets corrupted and Elasticsearch throws an error.
Why does trim corrupt a completely different character?
How can I fix that, so that I strip out « and » and preserve ë?
trim is not encoding-aware; it treats its character list as a set of individual bytes. If you tell it to trim '«»' and that string is encoded in UTF-8, it looks for the bytes C2 AB C2 BB (the second C2 is redundant, so the effective set is C2, AB and BB). "ë" in UTF-8 is C3 AB, so its second byte gets removed and the character is left broken.
You'll need to use an encoding-aware function to safely remove multibyte characters, e.g.:
preg_replace('/^[«»]+|[«»]+$/u', '', $str)
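The difference can be reproduced with a minimal sketch like the one below. It assumes the source file is saved as UTF-8, and the helper name trimGuillemets is made up for illustration.

```php
<?php
// Hypothetical helper: strips leading/trailing « and » as whole code points.
function trimGuillemets(string $str): string
{
    // The u modifier makes PCRE treat the pattern and the subject as UTF-8.
    return preg_replace('/^[«»]+|[«»]+$/u', '', $str);
}

$word = 'België';                 // ends with the bytes C3 AB in UTF-8

// trim() treats its second argument as a set of bytes (C2, AB, BB here),
// so it strips the trailing AB and leaves a dangling C3 behind.
$broken = trim($word, '«»');
echo bin2hex($broken), "\n";      // 42656c6769c3

// The regex version only removes complete « and » characters.
echo trimGuillemets('«België»'), "\n";   // België
```

The hex dump shows the orphaned C3 byte at the end, which is no longer valid UTF-8.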
Related
I'm trying to remove non printable characters in a string, except some characters that I need.
$arr = ['Ù', 'é', '€'];
$string = "é & Ù # ♣ ☂ % & € À";
$acceptedChars = implode('\\', $arr);
$string = preg_replace('/[^[:print:] ' . $acceptedChars . ']/', '', $string);
echo 'Test : ' . $string;
My issue is that instead of the unwanted characters being replaced by an empty string, as set in the second parameter, I get this:
To remove all chars other than printable ASCII chars and $acceptedChars, you can use
$string = preg_replace('/[^ -~' . $acceptedChars . ']/u', '', $string);
See the PHP demo.
The " -~" range (a space through a tilde) is a well-known way to match any printable ASCII character.
The u modifier is necessary to make the regex work with Unicode strings.
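As a rough sketch of the same idea, the accepted characters can also be joined and passed through preg_quote instead of being joined with backslashes. This is a swapped-in way of building the character class, not the code from the answer, and the variable names $accepted and $class are assumptions.

```php
<?php
$accepted = ['Ù', 'é', '€'];
$string   = "é & Ù # ♣ ☂ % & € À";

// preg_quote escapes characters that are special in a regex (including the
// "/" delimiter given as second argument); the letters here pass through as-is.
$class = preg_quote(implode('', $accepted), '/');

// Keep printable ASCII (space through tilde) plus the accepted characters.
$result = preg_replace('/[^ -~' . $class . ']/u', '', $string);

// ♣, ☂ and À are removed; the spaces that surrounded them remain.
echo 'Test : ' . $result . "\n";
```

Going through preg_quote keeps the pattern valid if a regex metacharacter such as ] or ^ is ever added to the accepted list.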
$string = @iconv("UTF-8", "UTF-8", $string);
I'm using this code to replace Unicode characters in my string, but what it actually does is remove all characters after the first Unicode character in the string. Is there any other function that can help me do this?
I suggest doing this with preg_replace like this:
preg_replace('/[\x00-\x1F\x7F]/u', '', $string);
or even better:
preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $string);
If the above does not work for your case, this might:
preg_replace('/[[:cntrl:]]/', '', $string);
There is also the option to filter what you need instead of removing what you do not. Something like this should work:
filter_var($string, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW|FILTER_FLAG_STRIP_HIGH);
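To compare these options side by side, here is a small sketch. The sample string is an assumption chosen to contain a NUL byte, a unit separator, a DEL and a non-breaking space. Note that FILTER_FLAG_STRIP_HIGH works on bytes, so it also drops non-ASCII letters such as é.

```php
<?php
// Sample input (an assumption for illustration).
$string = "Caf\u{00E9}\x00 au\x1F lait\x7F\u{00A0}!";

// Remove ASCII control characters only; the non-breaking space survives.
$noControls = preg_replace('/[\x00-\x1F\x7F]/u', '', $string);

// Also remove U+00A0. Under the u modifier, \xA0 in the pattern means the
// code point U+00A0, not the raw byte A0.
$noControlsOrNbsp = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $string);

// Byte-wise filter: strips the newline and both bytes of the UTF-8 "é".
$asciiOnly = filter_var(
    "Caf\u{00E9}\n",
    FILTER_UNSAFE_RAW,
    FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH
);

echo $noControlsOrNbsp, "\n";   // Café au lait!
echo $asciiOnly, "\n";          // Caf
```

If accented letters must be kept, the regex variants are the safer choice of the three.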
I'm wondering how I can replace all special chars in my string, like: hello this is a test!
I've written this code:
$text = preg_replace("/[^A-Za-z0-9]/", ' ', $text);
This works, but I need more flexibility: I want to allow special chars like áéíóú... and remove only certain chars like :!"#$%&/()=?¿¡...
Any ideas?
Use $text = preg_replace("/[^\p{L}\p{N}]/u", ' ', $text);
This will match all characters that are not letters or numbers and will treat Unicode letters appropriately.
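A short sketch of how this behaves on accented text follows. The sample sentence is an assumption, and the second variant (collapsing runs with + and trimming) is an optional extension rather than part of the answer.

```php
<?php
$text = '¡Hola, señor! ¿Qué tal?';

// Every character that is not a Unicode letter or number becomes a space.
$spaced = preg_replace('/[^\p{L}\p{N}]/u', ' ', $text);

// Optional: replace each run of such characters with a single space,
// then trim the ends (plain trim() is safe here, it only removes ASCII spaces).
$compact = trim(preg_replace('/[^\p{L}\p{N}]+/u', ' ', $text));

echo $compact, "\n";   // Hola señor Qué tal
```

The accented letters ñ and é are kept because \p{L} covers letters from any script.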
OK I have read many threads and have found some options that work but now I am just more curious than anything...
I am trying to remove characters like Â and é, as Google does not like them in the XML product feed.
Why does this work:
But neither of these 2 do?
$string = preg_replace("/[^[:print:]]+/", ' ', $string);
$string = preg_replace("/[^[:print:]]/", ' ', $string);
To put it all in context here is the full function:
// Remove all unprintable characters
$string = ereg_replace("[^[:print:]]", ' ', $string);
// Convert back into HTML entities after printable characters removed
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
// Decode back
$string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
// Return the UTF-8 encoded string
$string = strip_tags(stripslashes($string));
// Return the UTF-8 encoded string
return utf8_encode($string);
}
That code doesn't work because it only removes characters that are not in the POSIX [:print:] class, which consists of the printable characters. á, É, etc. are all printable, so they are left in place.
You can find more about POSIX sets here.
Also, removing accented characters might not always be the best option... Check out this question for alternatives.
I know this question has been asked several times for sure, but I have my problems with regular expressions... So here is the (simple) thing I want to do in PHP:
I want to make a function which replaces unwanted characters of strings. Accepted characters should be:
a-z A-Z 0-9 _ - + ( ) { } # äöü ÄÖÜ space
I want all other characters to change to a "_". Here is some sample code, but I don't know what to fill in for the ?????:
<?php
// sample strings
$string1 = 'abd92 s_öse';
$string2 = 'ab! sd$ls_o';
// Replace unwanted chars in string by _
$string1 = preg_replace(?????, '_', $string1);
$string2 = preg_replace(?????, '_', $string2);
?>
Output should be:
$string1: abd92 s_öse (the same)
$string2: ab_ sd_ls_o
I was able to make it work for a-z, 0-9 but it would be nice to allow those additional characters, especially äöü. Thanks for your input!
To allow only the exact characters you described:
$str = preg_replace("/[^a-zA-Z0-9_+(){}#äöüÄÖÜ -]/u", "_", $str);
To allow all whitespace, not just the (space) character:
$str = preg_replace("/[^a-zA-Z0-9_+(){}#äöüÄÖÜ\s-]/u", "_", $str);
To allow letters from different alphabets -- not just the specific ones you mentioned, but also things like Russian and Greek, or other types of accent marks:
$str = preg_replace("/[^\w+(){}#\s-]/u", "_", $str);
If I were you, I'd go with the last one. Not only is it shorter and easier to read, but it's less restrictive, and there's no particular advantage to blocking stuff like и if äöüÄÖÜ are all fine.
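A quick check of the strict and the permissive pattern against the sample strings from the question, as a sketch. The u modifier is used so that multibyte letters are matched as whole characters rather than byte by byte; the Cyrillic sample and the variable names $exact and $anyWord are added assumptions.

```php
<?php
$exact   = '/[^a-zA-Z0-9_+(){}#äöüÄÖÜ -]/u';
$anyWord = '/[^\w+(){}#\s-]/u';

echo preg_replace($exact, '_', 'abd92 s_öse'), "\n";   // abd92 s_öse
echo preg_replace($exact, '_', 'ab! sd$ls_o'), "\n";   // ab_ sd_ls_o

// The strict list rejects every Cyrillic letter, one underscore per character.
echo preg_replace($exact, '_', 'Москва!'), "\n";       // _______

// With the u modifier, \w also covers letters from other alphabets.
echo preg_replace($anyWord, '_', 'Москва!'), "\n";     // Москва_
```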
Replace [^a-zA-Z0-9_\-+(){}#äöüÄÖÜ ] with _.
$string1 = preg_replace('/[^a-zA-Z0-9_\-+(){}#äöüÄÖÜ ]/u', '_', $string1);
This replaces any character except the ones after ^ in the [character set].
Edit: escaped the - dash.