PHP remove white space not working because of encoding? - php

I've got a string that is UTF-8 encoded according to mb_detect_encoding(). I want to trim it like this:
$string = trim($string);
But it has no effect.
When I look at the string with urlencode($string) it displays:
"++++++++++++++++String+more+text++++++++++++"
Following https://markushedlund.com/dev/trim-unicodeutf-8-whitespace-in-php/ I tried this code, but it had no effect:
preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $string);
How do I trim this?
How can I find out what the space character actually is and then replace it? All I know is urlencode(), but that just tells me it's a space by showing +++.
Update:
Thanks to @Stefanov.sm in the comments below, I learned that you can output the string as hex with bin2hex($string). Then I see a whole lot of 20202020, and 20 is the space character in UTF-8.
Strangely, trim() won't work, but this does:
$string = str_replace("\x20","",$string);
Maybe I can figure out why. But at least the objective of getting rid of them is achieved.
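To see exactly which characters a string contains, one option is to list the code point of every character instead of reading raw hex. This is only a minimal sketch: describe_chars is a hypothetical helper name, and it assumes PHP 7.2 or later with the mbstring extension.

```php
<?php
// Hypothetical helper: list the Unicode code point of every character in a UTF-8 string.
function describe_chars(string $s): array
{
    // The u flag makes preg_split work on whole UTF-8 characters, not bytes.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return []; // $s was not valid UTF-8
    }
    $points = [];
    foreach ($chars as $ch) {
        $points[] = sprintf('U+%04X', mb_ord($ch, 'UTF-8'));
    }
    return $points;
}

// An ordinary space, a letter, and a no-break space (U+00A0).
print_r(describe_chars(" a\u{00A0}"));
```

An ordinary space shows up as U+0020; anything else in a "whitespace" position (for example U+00A0) is a character that trim() does not remove by default.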

The "+" signs stand for whitespace.
What you should do is use the mb_detect_encoding function to be sure of the encoding. https://www.php.net/manual/fr/function.mb-detect-encoding.php
<?php
mb_detect_encoding($str, 'UTF-8', true); // Returns 'UTF-8' if $str is valid UTF-8, otherwise false
?>

Try explicitly naming "+" for removal:
$string = trim($string, "+ ");
Note the space after "+", which means "remove both spaces and plus signs".
Encoding probably has nothing to do with this, unless those pluses are a misrepresentation of some other character.

You could try this multibyte trim function:
function mb_trim($str) {
return preg_replace("/^\s+|\s+$/u", "", $str);
}
No guarantee it will solve the problem, but it can't hurt.
I found it here: Multibyte trim in PHP?
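As a rough illustration of how such a function behaves, the sketch below trims a string wrapped in a no-break space (U+00A0) and an ideographic space (U+3000). The name unicode_trim is hypothetical; it is used here instead of mb_trim on the assumption that newer PHP releases ship a built-in function of that name, which would make a redeclaration fail.

```php
<?php
// Hypothetical name, chosen so it cannot clash with a built-in mb_trim().
function unicode_trim(string $str): string
{
    // With the u flag, \s also matches Unicode whitespace such as U+00A0 and U+3000.
    $result = preg_replace('/^\s+|\s+$/u', '', $str);
    return $result === null ? $str : $result; // null means $str was not valid UTF-8
}

var_dump(unicode_trim("\u{00A0} text \u{3000}"));
```

Plain trim() leaves the same input untouched at the U+00A0 end, because its default character list only covers ASCII whitespace and the NUL byte.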

Related

PHP trim and space not working

I have some data imported from a csv. The import script grabs all email addresses in the csv and after validating them, imports them into a db.
A client has supplied this csv, and some of the emails seem to have a space at the end of the cell. No problem, trim that sucker off... nope, won't work.
The space seems not to be a space and isn't being removed, so a bunch of the emails are failing validation.
Question: Any way I can actually detect what this erroneous character is, and how I can remove it?
Not sure if it's some funky encoding or something else going on, but I don't fancy going through and removing them all manually! If I UTF-8 encode the string first, it shows this character as a:
Â
If that "space" is not affected by trim(), the first step is to identify it.
Use urlencode() on the string. Urlencode will percent-escape any non-printable and a lot of printable characters besides ASCII, so you will see the hexcode of the offending characters instantly. Depending on what you discover, you can act accordingly or update your question to get additional help.
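A small sketch of that identification step, using a made-up address followed by a no-break space (U+00A0). The address and variable names are illustrative only.

```php
<?php
// Illustrative value: an address followed by a no-break space (U+00A0, bytes C2 A0).
$email = "user@example.com\u{00A0}";

echo urlencode($email), "\n";       // the trailing %C2%A0 exposes the hidden character
echo urlencode(trim($email)), "\n"; // same output: trim() does not strip U+00A0 by default

// Once identified, the byte sequence can be removed explicitly.
$clean = str_replace("\u{00A0}", '', $email);
echo urlencode($clean), "\n";
```

Note that str_replace removes the character everywhere in the string, not only at the ends.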
I had a similar problem, also loading emails from CSVs and having issues with "undetectable" whitespaces.
Resolved it by replacing the most common urlencoded whitespace chars with ''. This might help if you can't use mb_detect_encoding() and/or iconv().
$urlEncodedWhiteSpaceChars = '%81,%7F,%C5%8D,%8D,%8F,%C2%90,%C2,%90,%9D,%C2%A0,%A0,%C2%AD,%AD,%08,%09,%0A,%0D';
$temp = explode(',', $urlEncodedWhiteSpaceChars); // turn them into a temp array so we can loop across
$email_address = urlencode($row['EMAIL_ADDRESS']);
foreach($temp as $v){
$email_address = str_replace($v, '', $email_address); // replace the current char with nuffink
}
$email_address = urldecode($email_address); // undo the url_encode
Note that this does NOT strip the 'normal' space character and that it removes these whitespace chars from anywhere in the string - not just start or end.
Replace all UTF-8 spaces with standard spaces and then do the trim!
$string = preg_replace('/\s/u', ' ', $string);
echo trim($string);
This is it.
In most of the cases a simple strip_tags($string) will work.
If the above doesn't work, then you should try to identify the characters resorting to urlencode() and then act accordingly.
I see a couple of possible solutions:
1) Get the last character of the string and check whether it is a normal character (with a regexp, for example). If it is not, remove it.
// removes the last character; $string[$length-1] = '' is rejected by PHP 7.1 and later
$string = mb_substr($string, 0, -1, 'UTF-8');
2) Convert the character from UTF-8 to the encoding of your CSV file and use str_replace. For example, if your CSV is encoded in ISO-8859-2:
echo iconv('UTF-8', 'ISO-8859-2', "Â");

How to decode hex content?

I have $_SERVER['REDIRECT_SSL_CLIENT_S_DN'] content that contains some kind of hex data. How can I decode it?
$_SERVER['REDIRECT_SSL_CLIENT_S_DN'] = '../CN=\x00M\x00\xC4\x00,\x00I\x00S\x00,\x004\x000\x003\x001\x002\x000\x000\x002/SN=..';
$pattern = '/CN=(.*)\\/SN=/';
preg_match($pattern, $_SERVER['REDIRECT_SSL_CLIENT_S_DN'], $server_matches);
print_r($server_matches[1]);
The result is:
\x00M\x00\xC4\x00,\x00I\x00S\x00,\x004\x000\x003\x001\x002\x000\x000\x002
The result I need is:
MÄ,IS,40312002
I tried to decode it with chr(hexdec($value)); and it almost works, but in an HTML input I see a lot of question marks.
EDIT:
Additional test with results. Not yet perfect. Array reveals some errors: http://pastebin.com/BC4xxqmE
After using utf8_encode, you now have a multibyte string. This means you need to use PHP's multibyte (mb_) functions.
So, str_split won't work anymore. You need to use either mb_split or preg_split with the u flag.
$splitted = preg_split('//u', $string);
Here's a demo showing that your code is now working: http://ideone.com/nqeC0U
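The difference between the two ways of splitting can be seen in a short sketch; the sample string is illustrative. Without PREG_SPLIT_NO_EMPTY, preg_split('//u', ...) also returns an empty string at each end of the result.

```php
<?php
$string = "M\u{00C4},IS"; // "MÄ,IS" in UTF-8; Ä occupies two bytes

$bytes = str_split($string); // splits per byte, so Ä is cut in half
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY); // splits per character

echo count($bytes), ' ', count($chars), "\n";
```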
Have you tried a Unicode equivalent of chr()? chr() takes its input modulo 256, which is why you see all those question marks.
The code below is from one of the comments on the chr() page of the PHP manual:
function unichr($u) {
return mb_convert_encoding('&#' . intval($u) . ';', 'UTF-8', 'HTML-ENTITIES');
}
Update
//New function
function unichr($intval) {
return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}
I tested with xC4 = 196 and it gives me an Ä.
http://codepad.viper-7.com/3htuwW
Your input is in UTF-8; using that conversion is similar to utf8_decode, which converts to ISO-8859-1. UTF-8, though, supports more characters than ISO-8859-1. This is why xC4 shows up as a question mark for you.
Try using something more powerful like iconv.
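One possible reading of the sample value is that it is UTF-16BE text in which the non-printable bytes are written as literal \xHH escapes. Under that assumption (which the question does not confirm), a sketch could first turn the escapes back into bytes and then convert the result to UTF-8. The function name is hypothetical and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: assumes the input is UTF-16BE with some bytes written as literal \xHH.
function decode_escaped_utf16be(string $escaped): string
{
    // Replace every literal "\xHH" sequence with the byte it denotes.
    $bytes = preg_replace_callback(
        '/\\\\x([0-9A-Fa-f]{2})/',
        function (array $m): string {
            return chr(hexdec($m[1]));
        },
        $escaped
    );
    // Two bytes per character, big-endian: convert to UTF-8 for output.
    return mb_convert_encoding($bytes, 'UTF-8', 'UTF-16BE');
}

// Single quotes keep the backslashes literal, as in the server variable.
$cn = '\x00M\x00\xC4\x00,\x00I\x00S\x00,\x004\x000\x003\x001\x002\x000\x000\x002';
echo decode_escaped_utf16be($cn), "\n";
```

For the value from the question this yields the string the asker wanted, MÄ,IS,40312002. A certificate whose name is stored in a different encoding would need a different source charset.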

Is replacing a line break UTF-8 safe?

If I have a UTF-8 string and want to replace line breaks with the HTML <br> , is this safe?
$var = str_replace("\r\n", "<br>", $var);
I know str_replace isn't UTF-8 safe but maybe I can get away with this. I ask because there isn't an mb_strreplace function.
UTF-8 is designed so that multi-byte sequences never contain anything that looks like an ASCII character. That is, any time you encounter a byte with a value in the range 0-127, you can safely assume it to be an ASCII character.
And that means that as long as you only try to replace ASCII characters with ASCII characters, str_replace should be safe.
str_replace() is safe for any ASCII character.
Btw, you could also look at nl2br().
1st: Use the code-sample markup for code in your questions.
2nd: Yes, it is safe.
3rd: It may not be what you want to achieve. This could be better:
$var = str_replace(array("\r\n", "\n", "\r"), "<br/>", $var);
Don't forget that different operating systems handle line breaks differently. The code above should replace all line breaks, no matter where they come from.
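As a quick check of that claim, the sketch below runs the replacement on a UTF-8 string that mixes all three line-break styles. The sample text is made up.

```php
<?php
// "Gračišće" followed by Windows, Unix and old Mac line breaks.
$var = "Gra\u{010D}i\u{0161}\u{0107}e\r\nsecond\nthird\rfourth";

// "\r\n" is listed first so a Windows line break becomes one <br>, not two.
$html = str_replace(["\r\n", "\n", "\r"], '<br>', $var);
echo $html, "\n";
```

The multi-byte letters pass through unchanged because none of their bytes fall in the ASCII range that "\r" and "\n" occupy.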

How to filter a Font Character in php

I have an arial character giving me a headache. U+02DD turns into a question mark after I turn its document into a phpquery object. What is an efficient method for removing the character in php by referring to it as 'U+02DD'?
You can use iconv() to convert character sets and strip invalid characters.
<?PHP
/* This will convert ISO-8859-1 input to UTF-8 output and
* strip invalid characters
*/
$output = iconv("ISO-8859-1", "UTF-8//IGNORE", $input);
/* This will attempt to convert invalid characters to something
* that looks approximately correct.
*/
$output = iconv("ISO-8859-1", "UTF-8//TRANSLIT", $input);
?>
See the iconv() documentation at http://php.net/manual/en/function.iconv.php
Use preg_replace and do it like this:
$str = "your text with that character";
echo preg_replace("#\x{02DD}#u", "", $str); //EDIT: inserted the u tag for unicode
To refer to characters beyond the single-byte range, you can use preg_replace and specify the Unicode character with the \x{abcd} pattern. The second parameter is an empty string, which makes preg_replace replace your character with nothing, effectively removing it.
[EDIT] Another way:
Did you try running htmlentities() on it? Its numeric HTML entity is &#733;, so doing that or replacing the character with &#733; may solve your issue too. Like this:
echo preg_replace("#\x{02DD}#u", "&#733;", $str);
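Since the question asks about referring to the character as 'U+02DD', here is a minimal sketch that accepts that notation directly. The helper name is hypothetical, and mb_chr requires PHP 7.2 or later with mbstring.

```php
<?php
// Hypothetical helper: remove one character identified by "U+XXXX" notation.
function remove_code_point(string $text, string $notation): string
{
    // Accept only "U+" followed by 4 to 6 hex digits.
    if (!preg_match('/^U\+([0-9A-Fa-f]{4,6})$/', $notation, $m)) {
        throw new InvalidArgumentException("Not a code point: $notation");
    }
    $char = mb_chr(hexdec($m[1]), 'UTF-8'); // UTF-8 bytes of that code point
    if ($char === false) {
        throw new InvalidArgumentException("Invalid code point: $notation");
    }
    return str_replace($char, '', $text);
}

echo remove_code_point("a\u{02DD}b", 'U+02DD'), "\n";
```

This only helps if the text is still valid UTF-8 at the point where it is called; if the character has already been turned into a literal question mark by an earlier conversion, there is nothing left to match.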

Replace diacritic characters with "equivalent" ASCII in PHP?

Related questions:
How to replace characters in a java String?
How to replace special characters with their equivalent (such as " á " for " a") in C#?
As in the questions above, I'm looking for a reliable, robust way to reduce any unicode character to near-equivalent ASCII using PHP. I really want to avoid rolling my own look up table.
For example (stolen from 1st referenced question): Gračišće becomes Gracisce
The iconv module can do this, more specifically, the iconv() function:
$str = iconv('Windows-1252', 'ASCII//TRANSLIT//IGNORE', "Gracišce");
echo $str;
//outputs "Gracisce"
The main hassle with iconv is that you just have to watch your encodings, but it's definitely the right tool for the job (I used 'Windows-1252' for the example due to limitations of the text editor I was working with ;) The feature of iconv that you definitely want to use is the //TRANSLIT flag, which tells iconv to transliterate any characters that don't have an ASCII match into the closest approximation.
I found another solution, based on @zombat's answer.
The issue with that answer was that I was getting:
Notice: iconv() [function.iconv]: Wrong charset, conversion from `UTF-8' to `ASCII//TRANSLIT//IGNORE' is not allowed in D:\www\phpcommand.php(11) : eval()'d code on line 3
And after removing //IGNORE from the function, I got:
Gr'a'e~a~o^O"ucisce
So, the š character was translated correctly, but the other characters weren't.
The solution that worked for me is a mix of preg_replace (to remove everything except [a-zA-Z0-9.], including spaces) and @zombat's solution:
preg_replace('/[^a-zA-Z0-9.]/','',iconv('UTF-8', 'ASCII//TRANSLIT', "GráéãõÔücišce"));
Output:
GraeaoOucisce
My solution is to create two strings: the first with the unwanted letters and the second with the letters that will replace them.
$from = 'čšć';
$to = 'csc';
$text = 'Gračišće';
// mb_str_split (PHP 7.4+) splits per character; str_split would cut the two-byte letters apart
$result = str_replace(mb_str_split($from), str_split($to), $text);
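A small lookup table can also be written as an array keyed by whole characters and applied with strtr(). Because č, š and ć each occupy two bytes in UTF-8, keying by the full character keeps each letter and its replacement lined up. The map below is illustrative, not a complete table, and the file is assumed to be saved as UTF-8.

```php
<?php
// Illustrative map; extend it with whatever characters the data contains.
$map = ['č' => 'c', 'š' => 's', 'ć' => 'c'];

$result = strtr('Gračišće', $map);
echo $result, "\n";
```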
Try this:
function normal_chars($string)
{
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
$string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', $string);
$string = preg_replace(array('~[^0-9a-z]~i', '~-+~'), ' ', $string);
return trim($string);
}
Examples:
echo normal_chars('Álix----_Ãxel!?!?'); // Alix Axel
echo normal_chars('áéíóúÁÉÍÓÚ'); // aeiouAEIOU
echo normal_chars('üÿÄËÏÖÜŸåÅ'); // uyAEIOUYaA
Based on the selected answer in this thread: URL Friendly Username in PHP?
You should also try:
transliterator_transliterate('Any-Latin; Latin-ASCII; Lower()', "ÀÖØöøįĴőŔžǍǰǴǵǸțȞȟȤȳɃɆɏ");
//Will output
aooooijorzajggnthhzybey
I found this from here:
https://www.php.net/manual/en/transliterator.transliterate.php#111939
