OK I have read many threads and have found some options that work but now I am just more curious than anything...
I am trying to remove characters like Â and é, because Google does not accept them in the XML product feed.
Why does this work:
But neither of these 2 do?
$string = preg_replace("/[^[:print:]]+/", ' ', $string);
$string = preg_replace("/[^[:print:]]/", ' ', $string);
To put it all in context here is the full function:
// Remove all unprintable characters
$string = ereg_replace("[^[:print:]]", ' ', $string);
// Convert to HTML entities after the unprintable characters have been removed
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
// Decode back
$string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
// Remove backslashes and HTML tags
$string = strip_tags(stripslashes($string));
// Return the UTF-8 encoded string
return utf8_encode($string);
}
That code doesn't work because it only removes characters that are not in the POSIX [:print:] character class, which consists of printable characters. á, É, etc. are all printable.
You can find more about POSIX sets here.
Also, removing accented characters might not always be the best option... Check out this question for alternatives.
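If the goal really is a feed that contains only printable ASCII, one option is an explicit byte range instead of [:print:]. This is a minimal sketch under two assumptions: the input is valid UTF-8, and dropping accented letters is acceptable. The helper name toPrintableAscii is made up for illustration.

```php
<?php
// Hypothetical helper: replace every run of characters outside printable
// ASCII (0x20-0x7E) with a single space. Assumes $string is valid UTF-8.
function toPrintableAscii(string $string): string
{
    // The "u" modifier makes PCRE read the subject as UTF-8, so a multibyte
    // character such as "é" is handled as one character, not as two bytes.
    $result = preg_replace('/[^\x20-\x7E]+/u', ' ', $string);

    // preg_replace() returns null on error (for example, invalid UTF-8 input).
    return $result ?? '';
}

// "Café Â£5": é and the run "Â£" are each replaced by one space.
echo toPrintableAscii("Caf\u{00E9} \u{00C2}\u{00A3}5"), "\n";
```

The explicit range avoids any dependence on locale or on how a given regex engine defines "printable".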
I'm using this function to clean strings for elastic search:
function cleanString($string){
$string = mb_convert_encoding($string, "UTF-8");
$string = str_ireplace(array('<', '>'), array(' <', '> '), $string);
$string = strip_tags($string);
$string = filter_var($string, FILTER_SANITIZE_STRING);
$string = str_ireplace(array("\t", "\n", "\r", " "," ",":"), ' ', $string);
$string = str_ireplace(array("","«","»","£"), '', $string);
return trim($string, ",;.:-_*+~#'\"´`!§$%&/()=?«»");
}
It does all sorts of stuff, but the problem I am facing has to do with the trim function at the very end. It is supposed to trim away whitespace and special characters, and it worked fine until recently, when I added two more special characters to trim away from the string: « and ». This caused problems with another special character:
When I pass the word België into the function, the ë gets corrupted and elastic throws an error.
Why does trim corrupt a completely different character?
How can I fix that, so that I parse out « and » and preserve ë?
trim is not encoding aware and just looks at individual bytes. If you tell it to trim '«»', and that's encoded in UTF-8, it will look for the bytes C2 AB C2 BB (where C2 is redundant, so AB BB C2 are the actual search terms). "ë" in UTF-8 is C3 AB, so half of it gets removed and the character is thereby broken.
You'll need to use an encoding-aware function to safely remove multibyte characters, e.g.:
preg_replace('/^[«»]+|[«»]+$/u', '', $str)
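The byte-level behaviour described above can be reproduced in a few lines. This is only a sketch: it writes the characters as \u{...} escapes (PHP 7.0+) and as \x{...} in the pattern so that it does not depend on how the source file is encoded.

```php
<?php
$word       = "Belgi\u{00EB}";          // "België": ë is the bytes C3 AB
$guillemets = "\u{00AB}\u{00BB}";       // "«»": the bytes C2 AB C2 BB

// trim() compares single bytes: the trailing AB of "ë" is in the list,
// so it is stripped and a lone C3 byte is left behind.
$broken = trim($word, $guillemets);
echo bin2hex($broken), "\n";            // 42656c6769c3

// With the "u" modifier the character class matches whole characters.
$safe = preg_replace('/^[\x{AB}\x{BB}]+|[\x{AB}\x{BB}]+$/u', '', "\u{00AB}" . $word . "\u{00BB}");
echo $safe, "\n";                       // België
```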
I have been trying to remove junk characters from a stream of HTML strings using PHP but haven't been successful yet. Is there any special syntax or logic to remove special characters from the string?
I have tried this so far, but it isn't working:
$new_string = preg_replace("�", "", $HtmlText);
echo '<pre>'.$new_string.'</pre>';
You can use \p{S}. It matches math symbols, currency signs, dingbats, box-drawing characters, etc.
See demo.
https://www.regex101.com/r/rK5lU1/30
$re = "/\\p{S}/u";
$str = "asdas�sadsad";
$subst = "";
$result = preg_replace($re, $subst, $str);
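One caveat when applying \p{S} to HTML: the ASCII characters < and > are themselves math symbols, so they are removed as well. The sketch below assumes the junk character is U+FFFD (the replacement character) and compares the broad class with a pattern that targets only that code point. Both need the u modifier so that the subject is read as UTF-8.

```php
<?php
$html = "<pre>asdas\u{FFFD}sadsad</pre>";

// \p{S} removes U+FFFD, but also "<" and ">" (category Sm, math symbols).
$allSymbols = preg_replace('/\p{S}/u', '', $html);
echo $allSymbols, "\n";     // preasdassadsad/pre

// Targeting the single code point leaves the markup intact.
$onlyJunk = preg_replace('/\x{FFFD}/u', '', $html);
echo $onlyJunk, "\n";       // <pre>asdassadsad</pre>
```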
This is due to a charset mismatch between the database and the front end. Correcting it will fix the issue.
function clean($string) {
return preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars.
}
I've been reading up on a few solutions but have not managed to get anything to work as yet.
I have a JSON string that I read in from an API call and it contains Unicode characters - \u00c2\u00a3 for example is the £ symbol.
I'd like to use PHP to convert these into either £ or &pound;.
I'm looking into the problem and found the following code (using my pound symbol to test) but it didn't seem to work:
$title = preg_replace("/\\\\u([a-f0-9]{4})/e", "iconv('UCS-4LE','UTF-8',pack('V', hexdec('U$1')))", '\u00c2\u00a3');
The output is Â£.
Am I correct in thinking that this is UTF-16 encoded? How would I convert these to output as HTML?
UPDATE
It seems that the JSON string from the API has 2 or 3 unescaped Unicode strings, e.g.:
That\u00e2\u0080\u0099s (right single quotation)
\u00c2\u00a3 (pound symbol)
It is not UTF-16 encoding. It rather looks like bogus encoding, because the \uXXXX escape is independent of any UTF or UCS encoding of Unicode. \u00c2\u00a3 really does map to the string Â£.
What you should have is \u00a3 which is the unicode code point for £.
{0xC2, 0xA3} is the UTF-8 encoded 2-byte character for this code point.
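The relationship between the code point and its UTF-8 bytes can be checked directly. This sketch assumes the mbstring extension is available. The last step (mapping each code point back to one byte through ISO-8859-1) is shown only to illustrate the diagnosis.

```php
<?php
// The code point U+00A3 (£) is two bytes in UTF-8: C2 A3.
echo bin2hex("\u{00A3}"), "\n";                 // c2a3

// A JSON decoder reads \u00c2\u00a3 as two code points, U+00C2 and U+00A3.
$decoded = json_decode('"\u00c2\u00a3"');
echo $decoded, "\n";                            // Â£

// Mapping each code point back to a single byte (ISO-8859-1) recovers C2 A3,
// which is exactly the UTF-8 encoding of £.
$repaired = mb_convert_encoding($decoded, 'ISO-8859-1', 'UTF-8');
echo bin2hex($repaired), "\n";                  // c2a3
```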
If, as I suspect, the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point, then you need to convert each pair of Unicode code points to a UTF-8 encoded character, and then decode it to the native PHP encoding to make it printable.
function fixBadUnicode($str) {
return utf8_decode(preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str));
}
Example here: http://phpfiddle.org/main/code/6sq-rkn
Edit:
If you want to fix the string in order to obtain a valid JSON string, you need to use the following function:
function fixBadUnicodeForJson($str) {
$str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3")).chr(hexdec("$4"))', $str);
$str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3"))', $str);
$str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str);
$str = preg_replace("/\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1"))', $str);
return $str;
}
Edit 2: fixed the previous function to transform any wrongly unicode escaped utf-8 byte sequence into the equivalent utf-8 character.
Be careful: some of these characters, which probably come from an editor such as Word, are not translatable to ISO-8859-1 and will therefore appear as '?' after utf8_decode.
The output is correct.
\u00c2 == Â
\u00a3 == £
So nothing is wrong here. And converting to HTML entities is easy:
htmlentities($title);
Here is an updated version of the function using preg_replace_callback instead of preg_replace.
function fixBadUnicodeForJson($str) {
// The captured hex pairs are read from $matches, not from "$1"-style placeholders.
$str = preg_replace_callback(
'/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])).chr(hexdec($matches[4])); },
$str
);
$str = preg_replace_callback(
'/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])); },
$str
);
$str = preg_replace_callback(
'/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])); },
$str
);
$str = preg_replace_callback(
'/\\\\u00([0-9a-f]{2})/',
function($matches) { return chr(hexdec($matches[1])); },
$str
);
return $str;
}
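A more compact variant of the same idea handles runs of any length in one pass. This is a sketch and not the function above: collapseByteEscapes is a hypothetical name, it only considers escapes in the range \u0080 to \u00ff, and it assumes that any run which decodes to valid UTF-8 was a mis-escaped byte sequence. A lone escape such as \u00e9 is left untouched because a single E9 byte is not valid UTF-8.

```php
<?php
// Hypothetical helper: collapse runs of \u0080-\u00ff escapes into raw bytes,
// but only when those bytes form valid UTF-8. Requires the mbstring extension.
function collapseByteEscapes(string $json): string
{
    return preg_replace_callback(
        '/(?:\\\\u00[89a-f][0-9a-f])+/i',
        function (array $m): string {
            // "\u00e2\u0080\u0099" -> "e28099" -> the three bytes E2 80 99
            $bytes = hex2bin(str_ireplace('\u00', '', $m[0]));

            // Keep the original escapes if the bytes are not valid UTF-8.
            return mb_check_encoding($bytes, 'UTF-8') ? $bytes : $m[0];
        },
        $json
    );
}

$raw  = '{"title":"That\u00e2\u0080\u0099s \u00c2\u00a35"}';
$data = json_decode(collapseByteEscapes($raw), true);
echo $data['title'], "\n";      // That’s £5
```

Because the repaired text is raw UTF-8 inside the JSON string, json_decode accepts it without further conversion.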
I'm wondering how I can replace all special chars in a string like: hello this is a test!
I've written this code:
$text = preg_replace("/[^A-Za-z0-9]/", ' ', $text);
This works, but I need more flexibility: allow special chars like áéíóú... and remove only certain chars like: :!"#$%&/()=?¿¡...
Any ideas?
Use $text = preg_replace("/[^\p{L}\p{N}]/u", ' ', $text);
This will match all characters that are not letters or numbers and will treat Unicode letters appropriately.
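A short sketch of that pattern in use. Two details here are additions for illustration and not part of the answer: the + quantifier, so that a run of unwanted characters becomes a single space, and the final trim().

```php
<?php
// "¿Qué tal, señor? 100%!" written with escapes so the file encoding does not matter.
$text = "\u{00BF}Qu\u{00E9} tal, se\u{00F1}or? 100%!";

// \p{L} = any Unicode letter, \p{N} = any Unicode number; "u" enables UTF-8 mode.
$clean = trim(preg_replace('/[^\p{L}\p{N}]+/u', ' ', $text));

echo $clean, "\n";      // Qué tal señor 100
```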
I don't know if this is the place to ask this question, so be kind if I am wrong.
I was wondering if someone can explain to me in detail what the following 3 code snippets below do.
Snippet 1
if($str !== mb_convert_encoding(mb_convert_encoding($str, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32')){
$str = mb_convert_encoding($str, 'UTF-8');
}
Snippet 2
$str = preg_replace('`&([a-z]{1,2})(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig);`i', '\\1', $str);
Snippet 3
$str = preg_replace(array('`[^a-z0-9]`i','`[-]+`'), '-', $str);
Here is the full code below for reference.
function to_permalink($str){
if($str !== mb_convert_encoding(mb_convert_encoding($str, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32')){
$str = mb_convert_encoding($str, 'UTF-8');
}
$str = htmlentities($str, ENT_NOQUOTES, 'UTF-8');
$str = preg_replace('`&([a-z]{1,2})(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig);`i', '\\1', $str);
$str = html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');
$str = preg_replace(array('`[^a-z0-9]`i','`[-]+`'), '-', $str);
$str = strtolower(trim($str, '-'));
return $str;
}
Snippet 1 makes sure the string is in UTF-8 encoding.
Snippet 2 converts all special characters to their base form (ie, 'é' -> 'e').
Snippet 3 will convert spaces to hyphens (-).
All in all, taking into account the function's name and content, I'd say it is used to make URL friendly links, for example, convert
I discovered a new french word: église
to
i-discovered-a-new-french-word-eglise
Usually used for SEO.
Many of your questions can be answered by looking up what the functions do in your code.
Go here to get started: http://php.net/docs.php
Snippet #1: Checking if the string is valid UTF-8 data by round-trip converting it from source-> UTF-32 -> UTF-8. If the result is NOT the same as the input, then try to let the MB library determine the input encoding and output as UTF-8 regardless. Seems to be rather much work for little gain.
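Snippet #2: Looks for named character entities for accented characters and ligatures (such as &eacute; or &AElig;) and replaces each one with its first capture group, i.e. the base letter(s). The replacement '\\1' is a backreference, not a literal backslash, so &eacute; becomes e and &AElig; becomes AE.
Snippet #3: Runs two replacements in order. First every character that is NOT a-z or 0-9 becomes a -, then each run of one or more - collapses into a single -.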
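Snippets 2 and 3 can be tried in isolation. The input strings below are made up for illustration, and the patterns are copied from the function in the question.

```php
<?php
$entityPattern = '`&([a-z]{1,2})(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig);`i';

// '\\1' is a backreference to the first group: the base letter(s) of the entity.
$stripped = preg_replace($entityPattern, '\\1', '&eacute;glise &AElig;on');
echo $stripped, "\n";       // eglise AEon

// The patterns in the array run in order: each non-alphanumeric character
// becomes "-", then runs of "-" collapse into one.
$slug = preg_replace(array('`[^a-z0-9]`i', '`[-]+`'), '-', 'a new  word: eglise');
echo $slug, "\n";           // a-new-word-eglise
```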