I have some whitespace at the beginning of a paragraph in a text field in MySQL.
Using trim($var_text_field) in PHP or TRIM(text_field) in MySQL statements does absolutely nothing. What could this whitespace be and how do I remove it by code?
If I go into the database and backspace it out, it saves properly. It's just not being removed via the trim() functions.
function UberTrim($s) {
    $s = preg_replace('/\xA0/u', ' ', $s); // replaces UTF-8 NBSP ("\xC2\xA0") with a plain space
    $s = trim($s);
    return $s;
}
The UTF-8 character encoding for a no-break space, Unicode (U+00A0), is the 2-byte sequence C2 A0. I tried to make use of the second parameter to trim() but that didn't do the trick. Example use:
assert("abc" === UberTrim(" \r\n \xc2\xa0 abc \t \xc2\xa0 "));
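If you only want to strip no-break spaces at the ends and leave interior ones alone, something along these lines might do. This is just a sketch: trimWithNbsp is a name I made up, and it assumes PHP 7+ with input that is supposed to be UTF-8.

```php
<?php
// Hypothetical helper (not part of the answer above): trims ASCII whitespace
// and U+00A0 from both ends, leaving no-break spaces inside the text untouched.
function trimWithNbsp(string $s): string
{
    // \x{00A0} is the NBSP code point; /u makes PCRE treat pattern and subject as UTF-8.
    $trimmed = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $s);
    // preg_replace() returns null when the subject is not valid UTF-8; fall back to trim().
    return $trimmed === null ? trim($s) : $trimmed;
}

var_dump(trimWithNbsp(" \r\n \xc2\xa0 abc \t \xc2\xa0 ") === "abc"); // bool(true)
var_dump(trimWithNbsp("a\xc2\xa0b") === "a\xc2\xa0b");               // bool(true)
```

The difference from UberTrim is that a no-break space in the middle of the text survives.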
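A MySQL replacement for TRIM(text_field) that also removes UTF-8 no-break spaces, thanks to @RudolfRein's comment. MySQL itself does not understand \x escapes, so the statement has to sit inside a double-quoted PHP string, where PHP expands them to the raw bytes: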
TRIM(REPLACE(text_field, '\xc2\xa0', ' '))
UTF-8 checklist:
(more checks here)
Make sure your PHP source code editor saves files as UTF-8 without a BOM (this is usually set in the editor's preferences).
Make sure your MySQL client is set for UTF-8 character encoding (more here and here), e.g.
$pdo = new PDO('mysql:host=...;dbname=...;charset=utf8',$userid,$password);
$pdo->exec("SET CHARACTER SET utf8");
Make sure your HTTP server is set for UTF-8, e.g. for Apache:
AddDefaultCharset UTF-8
Make sure the browser expects UTF-8.
header('Content-Type: text/html; charset=utf-8');
or
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
If the problem is with UTF-8 NBSP, another simple option is:
REPLACE(the_field, UNHEX('C2A0'), ' ')
The best solution is a combination of a few things mentioned to you already.
First run ORD() on the string in question. In my case I had to run a reverse first because my problem character was at the end of the string.
ORD(REVERSE([col name]))
Once you discover the problematic char, run a
REPLACE([col_name], char([char_value_returned]), char(32))
Finally, call a proper
TRIM([col_name])
This will completely eradicate the problem character from all aspects of the string, and trim off the leading (in my case trailing) character.
Try using the MySQL ORD() function on the text_field to check the character code of the left-most character. It may be a non-whitespace character that merely looks like whitespace.
You have to detect these "whitespace" characters first. If it's an HTML entity, like &nbsp;, no trimming function will help, of course.
I'd suggest to print it out like this
echo urlencode($row['field']);
and see what it says
Well, as it's the A0 (160 decimal) non-breaking space character, you can convert it to an ordinary space first:
<pre><?php
$str = urldecode("%A0")."bla";
var_dump(trim($str));
$str = str_replace(chr(160)," ",$str);
$str = trim($str);
var_dump($str);
and, ta-dam! -
string(4) " bla"
string(3) "bla"
Try to check what character each "whitespace" is by writing its character code out. It might be a non-visible character type that isn't removed by trim.
Trim only removes a few characters such as whitespace, tab, newline, CR and NUL but there exist other non-visible characters that might cause this problem.
Try
$var_text_field = str_ireplace(array("\r", "\n", "\t"), '', $var_text_field);
Related
I am retrieving text data from a database which includes bullets and newlines. I have successfully removed the newlines and converted them to <br /> using the nl2br() function in PHP, but the bullets act weird and display "â€¢" instead of "•" (see screenshot).
I have tried using htmlspecialchars() function in PHP but it still displays the same output.
I have used htmlentities() now instead of htmlspecialchars. I have solved my own problem but I hope this thread will help others in the future.
The Unicode character U+2022 (BULLET) is encoded in UTF-8 as the octets E2 80 A2. If your page contains these octets, and the page is incorrectly interpreted using a different character encoding, such as Windows-1252, the resulting page will display the three characters â, €, ¢.
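You can reproduce the misreading yourself. A rough sketch, assuming the iconv extension is available and that Windows-1252 really is the encoding the browser falls back to:

```php
<?php
$bullet = "\xE2\x80\xA2"; // U+2022 BULLET as its three UTF-8 bytes

// Treat each of those bytes as a Windows-1252 character and re-encode as UTF-8,
// which is effectively what a browser does when it guesses the wrong charset.
$misread = iconv('Windows-1252', 'UTF-8', $bullet);

echo $misread, "\n";          // â€¢
echo bin2hex($misread), "\n"; // c3a2e282acc2a2
```

Three bytes in, three visible characters out, which is exactly the "â€¢" pattern.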
To properly display the bullet character, you need to declare the correct character encoding for your document:
header ('Content-Type: text/html; charset=utf-8');
If it is not feasible to use the UTF-8 encoding, you can convert the string using htmlentities(), which should convert the bullet characters, and other undisplayable characters, into HTML character references (&bull;):
$s = "Bullet \xe2\x80\xa2 character";
echo htmlentities ($s), "\n";
Or, if PHP's character encoding is not configured correctly:
$s = "Bullet \xe2\x80\xa2 character";
echo htmlentities ($s, ENT_NOQUOTES, 'utf-8'), "\n";
I have some data imported from a csv. The import script grabs all email addresses in the csv and after validating them, imports them into a db.
A client has supplied this csv, and some of the emails seem to have a space at the end of the cell. No problem, trim that sucker off... nope, won't work.
The space seems to not be a space, and isn't being removed so is failing a bunch of the emails validation.
Question: Any way I can actually detect what this erroneous character is, and how I can remove it?
Not sure if it's some funky encoding, or something else going on, but I don't fancy going through and removing them all manually! If I UTF-8 encode the string first it shows this character as a:
Â
If that "space" is not affected by trim(), the first step is to identify it.
Use urlencode() on the string. Urlencode will percent-escape any non-printable and a lot of printable characters besides ASCII, so you will see the hexcode of the offending characters instantly. Depending on what you discover, you can act accordingly or update your question to get additional help.
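For instance, with a made-up address that ends in a UTF-8 no-break space (the sample value is mine, not the asker's data), the offending bytes show up right away:

```php
<?php
// Hypothetical sample value: an address followed by a UTF-8 NBSP (C2 A0).
$email = "user@example.com\xC2\xA0";

echo urlencode($email), "\n";           // user%40example.com%C2%A0
echo bin2hex(substr($email, -2)), "\n"; // c2a0  (the last two bytes in hex)
```

bin2hex() is handy as a second opinion because it shows every byte, including the ones urlencode() leaves alone.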
I had a similar problem, also loading emails from CSVs and having issues with "undetectable" whitespaces.
Resolved it by replacing the most common urlencoded whitespace chars with ''. This might help if you can't use mb_detect_encoding() and/or iconv().
$urlEncodedWhiteSpaceChars = '%81,%7F,%C5%8D,%8D,%8F,%C2%90,%C2,%90,%9D,%C2%A0,%A0,%C2%AD,%AD,%08,%09,%0A,%0D';
$temp = explode(',', $urlEncodedWhiteSpaceChars); // turn them into a temp array so we can loop across
$email_address = urlencode($row['EMAIL_ADDRESS']);
foreach($temp as $v){
$email_address = str_replace($v, '', $email_address); // replace the current char with nuffink
}
$email_address = urldecode($email_address); // undo the url_encode
Note that this does NOT strip the 'normal' space character and that it removes these whitespace chars from anywhere in the string - not just start or end.
Replace all UTF-8 spaces with standard spaces and then do the trim!
$string = preg_replace('/\s/u', ' ', $string);
echo trim($string);
This is it.
In most of the cases a simple strip_tags($string) will work.
If the above doesn't work, then you should try to identify the characters resorting to urlencode() and then act accordingly.
I see a couple of possible solutions
1) Get last char of string in PHP and check if it is a normal character (with regexp for example). If it is not a normal character, then remove it.
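$length = strlen($string);
$string = substr($string, 0, $length - 1); // drops the last byte; assigning '' to a string offset is an error in PHP 7.1+
2) Convert your character from UTF-8 to the encoding of your CSV file and use str_replace. For example, if your CSV is encoded in ISO-8859-2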
echo iconv('UTF-8', 'ISO-8859-2', "Â");
If I have a UTF-8 string and want to replace line breaks with the HTML <br> , is this safe?
$var = str_replace("\r\n", "<br>", $var);
I know str_replace isn't UTF-8 safe but maybe I can get away with this. I ask because there isn't an mb_strreplace function.
UTF-8 is designed so that multi-byte sequences never contain anything that looks like an ASCII character. That is, any time you encounter a byte with a value in the range 0-127, you can safely assume it to be an ASCII character.
And that means that as long as you only try to replace ASCII characters with ASCII characters, str_replace should be safe.
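A quick way to convince yourself, using sample text I made up (a bullet and an š, both multi-byte) and the preg_match('//u', ...) trick as a UTF-8 validity check:

```php
<?php
// Sample text: "•" is E2 80 A2 and "š" is C5 A1, so every byte of both is >= 0x80.
$text = "\xE2\x80\xA2 one\r\n\xC5\xA1 two\r\n";

// The needle "\r\n" is pure ASCII, so it can never match inside a multi-byte sequence.
$html = str_replace("\r\n", "<br>", $text);

var_dump($html === "\xE2\x80\xA2 one<br>\xC5\xA1 two<br>"); // bool(true)
var_dump(preg_match('//u', $html) === 1);                   // bool(true): still valid UTF-8
```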
str_replace() is safe for any ascii-safe character.
Btw, you could also look at the nl2br()
1st: Use the code-sample markup for code in your questions.
2nd: Yes, it is safe.
3rd: It may not be what you want to achieve. This could be better:
$var = str_replace(array("\r\n", "\n", "\r"), "<br/>", $var);
Don't forget that different operating systems handle line breaks differently. The code above should replace all line breaks, no matter where they come from.
I'm having a problem where PHP (5.2) cannot find the character 'Â' in a string, though it is clearly there.
I realize the underlying problem has to do with character encoding, but unfortunately I have no control over the source content. I receive it as UTF-8, with those characters already in the string.
I would simply like to remove it from the string. strpos(), str_replace(), preg_replace(), trim(), etc. cannot correctly identify it.
My string is this:
"Â Â Â A lot of couples throughout the World "
If I do this:
$string = str_replace('Â','',$string);
I get this:
"� � � A lot of couples throughout the World"
I even tried utf8_encode() and utf8_decode() before the str_replace, with no luck.
What's the solution? I've been throwing everything I can find at it...
$string = str_replace('Â','',$string);
How is this 'Â' encoded? If your script file is saved as ISO-8859-1, the string 'Â' is encoded as the single byte xC2, while its UTF-8 representation is the two bytes xC3 x82. PHP's str_replace() works at the byte level, i.e. it only "knows" single-byte characters.
see http://docs.php.net/intro.mbstring
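To make the byte-level point concrete, here is a small sketch. It assumes (my guess, not something the question confirms) that the visible "Â" glyphs are really the lead bytes of UTF-8 no-break spaces, C2 A0:

```php
<?php
// Assumed input: two UTF-8 NBSPs (C2 A0) in front of the text.
$string = "\xC2\xA0\xC2\xA0A lot of couples ";

// Needle as an ISO-8859-1 source file would store 'Â': the single byte C2.
$broken = str_replace("\xC2", '', $string);
var_dump(preg_match('//u', $broken)); // bool(false): the stray A0 bytes are invalid UTF-8

// Replacing the whole two-byte sequence keeps the string valid.
$fixed = str_replace("\xC2\xA0", ' ', $string);
var_dump(trim($fixed)); // string(16) "A lot of couples"
```

Those orphaned A0 bytes are what a UTF-8 page would show as "�".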
I use this:
function replaceSpecial($str){
    $chunked = str_split($str,1);
    $str = "";
    foreach($chunked as $chunk){
        $num = ord($chunk);
        // Remove non-ascii & non html characters
        if ($num >= 32 && $num <= 123){
            $str.=$chunk;
        }
    }
    return $str;
}
From the PHP Manual Comment Page:
http://www.php.net/manual/en/function.preg-replace.php#96847
And from StackOverflow:
Remove accents without using iconv
I have problems displaying the Unicode character of U+009A.
It should look like "š", but instead looks like a rectangular block with the numbers 009A inside.
Converting it to the entity "&#154;" displays the character correctly, but I don't want to store entities in the database.
The encoding of the webpage is in UTF-8.
The character is URL-encoded as "%C2%9A".
Reproduce:
# php -r 'echo urldecode("%C2%9A");' > /tmp/test ; less /tmp/test
This gives me <U+009A> in less or <9A> in vim.
The Unicode character "š" is U+0161, not U+009A.
I suspect that it's 0x9A in another character set.
The box with 009A is usually shown when you don't have a font installed with that character.
If you’re using UTF-8 as your input encoding, then you can simply use the plain š. Or you could use the hexadecimal representation "\xC2\x9A" (in double quotes) that’s independent from the input encoding. Or utf8_encode("\x9A") since the first 256 characters of Unicode and ISO 8859-1 are identical.
If I do a hexdump of the output of echo urldecode("%C2%9A"); I get c2 9a, which is the correct UTF-8 encoding for character 0x9a.
You get that same encoding from the output of utf8_encode("\x9A")
When I try to view Unicode char 0x9a, I get a square box too - suspect it's not the char you think it should be (Aha: as Azquelt has posted, unicode character "š" is U+0161, not U+009A)
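Comparing the bytes side by side may help. The Windows-1252 line is only my assumption about where the 0x9A came from (in that code page 0x9A is "š"); the sketch needs PHP 7+ for the \u{...} syntax and the iconv extension:

```php
<?php
echo bin2hex(urldecode('%C2%9A')), "\n"; // c29a  (U+009A, a C1 control character)
echo bin2hex("\u{0161}"), "\n";          // c5a1  (U+0161, the real "š")

// Assumption: the original byte 0x9A was Windows-1252, so convert from that charset.
echo bin2hex(iconv('Windows-1252', 'UTF-8', "\x9A")), "\n"; // c5a1
```

If that assumption holds, converting from Windows-1252 rather than ISO 8859-1 gives you the character you actually wanted.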
CodeIgniter has trouble saving UTF-8 input data on some hosting servers, such as Etisalat. system/core/Utf8.php has a function that detects illegal characters in input data (POST/GET). In some cases a UTF-8 character is considered illegal and the save fails. To avoid this, do the following in the clean_string() function of Utf8.php, at line 85:
$str = !mb_detect_encoding($str, 'UTF-8', TRUE) ? utf8_encode($str) : $str;
$str = @iconv('UTF-8', 'UTF-8//IGNORE', $str);