If I have a UTF-8 string and want to replace line breaks with the HTML <br> tag, is this safe?
$var = str_replace("\r\n", "<br>", $var);
I know str_replace isn't UTF-8 safe but maybe I can get away with this. I ask because there isn't an mb_strreplace function.
UTF-8 is designed so that multi-byte sequences never contain anything that looks like an ASCII character. That is, any time you encounter a byte with a value in the range 0-127, you can safely assume it to be an ASCII character.
And that means that as long as you only try to replace ASCII characters with ASCII characters, str_replace should be safe.
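To see why this holds, one can dump the bytes of a few multi-byte characters. A minimal sketch, assuming PHP 7 or later for the "\u{...}" escape; the sample strings are made up for illustration:

```php
<?php
// U+00E9 (é) takes two bytes in UTF-8, U+27A4 (➤) takes three.
$hexE = bin2hex("\u{00E9}");
$hexArrow = bin2hex("\u{27A4}");
echo $hexE, "\n";     // c3a9
echo $hexArrow, "\n"; // e29ea4

// Every one of those bytes is 0x80 or above, so an ASCII search
// string such as "\r\n" cannot match inside a multi-byte character.
$out = str_replace("\r\n", "<br>", "caf\u{00E9}\r\n\u{27A4}");
echo $out, "\n"; // café<br>➤
```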
str_replace() is safe for any ascii-safe character.
Btw, you could also look at nl2br(), which inserts a <br /> before each line break.
1st: Use the code-sample markup for code in your questions.
2nd: Yes, it is safe.
3rd: It may not be what you want to achieve. This could be better:
$var = str_replace(array("\r\n", "\n", "\r"), "<br/>", $var);
Don't forget that different operating systems handle line breaks differently. The code above should replace all line breaks, no matter where they come from.
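The order of the search array matters: str_replace works through the entries one after another, so "\r\n" has to come first, or a Windows line break ends up as two tags. A small sketch with a made-up input string:

```php
<?php
$var = "one\r\ntwo\nthree\rfour";

// "\r\n" first: each line break becomes exactly one tag.
$good = str_replace(array("\r\n", "\n", "\r"), "<br/>", $var);
echo $good, "\n"; // one<br/>two<br/>three<br/>four

// "\r\n" last: the "\n" and "\r" passes each hit the Windows break first.
$bad = str_replace(array("\n", "\r", "\r\n"), "<br/>", $var);
echo $bad, "\n"; // one<br/><br/>two<br/>three<br/>four
```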
Related
I want to do a search & replace in PHP with a symbol.
This is the symbol: ➤
I want to replace it with a dash, but that doesn't work. It looks like the symbol cannot be found, even though it's there.
Other 'normal' search and replace operations work as expected. But replacing this symbol does not.
Any ideas how to address this symbol, so that the search and replace function actually can find it and replace it?
Your problem is (almost certainly) related to text/character encoding.
Special characters such as the ➤ you are referring to, are not part of the classical ISO-8859-1 character set; they are however part of Unicode family (codepoint U+27A4 to be exact). This means that, in order to use this (multibyte)character, you have to use a unicode character set, which generally means UTF-8.
All the basic characters (think A-Z, numbers, spaces, ...) overlap between UTF-8 and ISO-8859-1 (which is effectively the default character set), so when you don't use any special characters, you could use the wrong charset and things will pretty much continue to work just fine; that is until you try to use a character that is not part of the basic set.
Since your problem takes place entirely on the server side (inside PHP), and doesn't really touch upon the HTTP and HTML layers, we won't have to go into utf-8 content-type headers and the like, but you should be aware of them for future issues (if you weren't already).
The issue you have should be resolved once you meet 2 criteria:
Not all PHP functions are multibyte-aware; I'm not 100% sure, but I think str_replace is one of those which is not (it compares raw bytes, which still works as long as the search string and the text use the same encoding). The preg_replace function with its u flag enabled definitely is multibyte-aware, and can serve the exact same function.
The text editor or IDE that you used to create the .php file may or may not be set to UTF-8 encoding, if it wasn't then you should switch that in order to be able to use such characters literally inside the source code.
Something like this should function correctly assuming the .php-file is stored in UTF-8 format:
$output = preg_replace('#➤#u', '-', $input);
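If the encoding of the source file is in doubt, the symbol can also be written as a code point escape instead of a literal. A sketch, assuming PHP 7 or later (for "\u{...}") and UTF-8 input; the sample $input is made up:

```php
<?php
$input = "This is the symbol: \u{27A4}";

// \x{27A4} inside the pattern is the PCRE escape for the code point;
// the u flag makes PCRE treat pattern and subject as UTF-8.
$viaRegex = preg_replace('/\x{27A4}/u', '-', $input);

// str_replace compares raw bytes, which gives the same result when
// the search string and the text are both UTF-8.
$viaBytes = str_replace("\u{27A4}", '-', $input);

var_dump($viaRegex === $viaBytes); // bool(true)
echo $viaRegex, "\n";              // This is the symbol: -
```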
Most likely you did not set the header of your PHP script to use the UTF-8 character set. Consider the following:
header('Content-type: text/plain; charset=utf-8');
$input = "This is the symbol: ➤";
$output = str_replace("➤", "-", $input);
echo $input . "\n" . $output;
This prints:
This is the symbol: ➤
This is the symbol: -
This is simply replaceable using the built-in PHP str_replace function, so it would be better if you could share your code so we can check it further.
$str = "hey same let's change this to a dash: ➤";
echo "before: $str \n";
echo "after: ".str_replace("➤", "-", $str);
before: hey same let's change this to a dash: ➤
after: hey same let's change this to a dash: -
Currently, I'm facing an issue of reading a file that contains non-English characters. I need to read that file line by line using the following code:
while(!feof($handle)) {
$line = fgets($handle);
}
The case is this file has 1711 lines, but the strange thing is it shows 1766 lines when I tried traversing that file.
$text = file_get_contents($filePath);
$numOfLines = count(explode(PHP_EOL, $text));
I would appreciate it very much if anyone could help me out with this issue.
You've tagged 'character-encoding', so at least you know where the problem starts. You've probably got some UTF-8 characters in there, and I'm betting some are multi-byte. You are counting your 'lines' by exploding on PHP_EOL, which I'm guessing is 0x0A. explode works on bytes, not characters, so if some multi-byte characters contained 0x0A as one of their bytes, it would treat that byte as the end of a 'line'. (That can happen in encodings such as UTF-16; in valid UTF-8 the byte 0x0A only ever means a line feed, so check which encoding the file really uses.) var_dump your exploded array and you'll see the issue easily enough.
Try count(mb_split('(\r?\n)', $text)) and see what you get. My regex is poor though and that might not work. I would see this question for more help on the regex you need to split on a new line:
Match linebreaks - \n or \r\n?
Remember that your line ending might possibly be \u0085, but I doubt it as PHP_EOL is being too aggressive.
If mb_split works, remember that you'll need to be using PHP's mb_ functions for all of your string manipulations. PHP's standard string functions assume single-byte characters and provide the separate mb_ functions to handle multi-byte wide characters.
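For comparison, a hedged sketch of counting lines without relying on PHP_EOL, which is "\n" on Linux and "\r\n" on Windows and may not match the line endings the file actually uses. Note also that explode() returns one extra empty element when the text ends with a line break. The function name count_lines and the sample strings are made up:

```php
<?php
// Hypothetical helper: count lines for any mix of \r\n, \n and \r endings.
function count_lines(string $text): int
{
    if ($text === '') {
        return 0;
    }
    // Try \r\n first so a Windows line break is not counted twice.
    $lines = preg_split('/\r\n|\r|\n/', $text);
    // A trailing line break yields one empty element at the end; drop it.
    if (end($lines) === '') {
        array_pop($lines);
    }
    return count($lines);
}

echo count_lines("a\r\nb\nc\r"), "\n";                // 3
echo count_lines("\u{65E5}\u{672C}\n\u{8A9E}"), "\n"; // 2
```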
I've got a string that is UTF-8 encoded according to mb_detect_encoding(). I want to trim it like this:
$string = trim($string);
But it has no effect.
When I look at the string with urlencode($string) it displays:
"++++++++++++++++String+more+text++++++++++++"
According to: https://markushedlund.com/dev/trim-unicodeutf-8-whitespace-in-php/ I tried this code, but no effect:
preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $string);
How do I trim this?
How can I find what the space character stands for and then replace it. All I know is urlencode, but this just tells me it's a space by showing +++.
Update:
Thanks to #Stefanov.sm in the comments below, I learned that you can output the string to hex with: bin2hex($string); Then I see a whole lot of 20202020 and I see 20 stands for space in UTF-8 encoding.
Strange though the trim won't work, but what does is:
$string = str_replace("\x20","",$string);
Maybe I can figure this out why. But at least the objective to get rid of them is completed.
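A sketch of the bin2hex() check from the update, spaced out byte by byte so that look-alike spaces are easy to tell apart. The helper name dump_bytes and the sample strings are made up; "\u{...}" assumes PHP 7 or later:

```php
<?php
// Hypothetical helper: show every byte of a string as two hex digits.
function dump_bytes(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

echo dump_bytes("  ab"), "\n";       // 20 20 61 62    (plain spaces)
echo dump_bytes("\u{00A0}ab"), "\n"; // c2 a0 61 62    (no-break space)
echo dump_bytes("\u{3000}ab"), "\n"; // e3 80 80 61 62 (ideographic space)
```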
The "+" signs stand for whitespace.
What you should try to do is to use mb_detect_encoding function to be sure of the encoding. https://www.php.net/manual/fr/function.mb-detect-encoding.php
<?php
var_dump(mb_detect_encoding($str, 'UTF-8', true)); // string "UTF-8" if $str is valid UTF-8, otherwise false
?>
Try explicitly naming "+" for removal:
$string = trim($string, "+ ");
Note the space after "+", which means "remove both spaces and plus-signs".
Encoding probably has nothing to do with this, unless those pluses are a misrepresentation of some other character.
You could try this multibyte trim function:
function mb_trim($str) {
return preg_replace("/^\s+|\s+$/u", "", $str);
}
No guarantee it will solve the problem, but it can't hurt.
I found it here: Multibyte trim in PHP?
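A slightly more explicit variant, as a sketch: it names the Unicode separator category \p{Z} alongside \s, so it does not depend on how \s is interpreted in UTF-8 mode. The function name unicode_trim is made up; using a name other than mb_trim also avoids a clash with the built-in mb_trim() that PHP 8.4 added.

```php
<?php
// Hypothetical helper: trim ASCII whitespace and Unicode separators
// (category \p{Z}, which includes U+00A0 and U+3000) from both ends.
function unicode_trim(string $str): string
{
    // preg_replace returns null on invalid UTF-8; fall back to the input.
    return preg_replace('/^[\s\p{Z}]+|[\s\p{Z}]+$/u', '', $str) ?? $str;
}

var_dump(unicode_trim("\u{00A0} \t abc \u{3000}\n")); // string(3) "abc"
```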
I have some data imported from a csv. The import script grabs all email addresses in the csv and after validating them, imports them into a db.
A client has supplied this csv, and some of the emails seem to have a space at the end of the cell. No problem, trim that sucker off... nope, won't work.
The space seems to not be a space, and isn't being removed so is failing a bunch of the emails validation.
Question: Any way I can actually detect what this erroneous character is, and how I can remove it?
Not sure if it's some funky encoding, or something else going on, but I don't fancy going through and removing them all manually! If I UTF-8 encode the string first it shows this character as a:
Â
If that "space" is not affected by trim(), the first step is to identify it.
Use urlencode() on the string. Urlencode will percent-escape any non-printable and a lot of printable characters besides ASCII, so you will see the hexcode of the offending characters instantly. Depending on what you discover, you can act accordingly or update your question to get additional help.
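As an illustration of what that looks like, here is a sketch with a made-up address that ends in a UTF-8 no-break space (an assumption; the actual character in the CSV may differ):

```php
<?php
// Made-up sample: an address followed by a UTF-8 no-break space (U+00A0).
$email = "user@example.com\u{00A0}";

// trim() leaves the string untouched...
var_dump(trim($email) === $email); // bool(true)

// ...but urlencode() exposes the two offending bytes at the end.
$encoded = urlencode($email);
echo $encoded, "\n"; // user%40example.com%C2%A0
```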
I had a similar problem, also loading emails from CSVs and having issues with "undetectable" whitespaces.
Resolved it by replacing the most common urlencoded whitespace chars with ''. This might help if you can't use mb_detect_encoding() and/or iconv()
$urlEncodedWhiteSpaceChars = '%81,%7F,%C5%8D,%8D,%8F,%C2%90,%C2,%90,%9D,%C2%A0,%A0,%C2%AD,%AD,%08,%09,%0A,%0D';
$temp = explode(',', $urlEncodedWhiteSpaceChars); // turn them into a temp array so we can loop across
$email_address = urlencode($row['EMAIL_ADDRESS']);
foreach($temp as $v){
$email_address = str_replace($v, '', $email_address); // replace the current char with nuffink
}
$email_address = urldecode($email_address); // undo the urlencode
Note that this does NOT strip the 'normal' space character and that it removes these whitespace chars from anywhere in the string - not just start or end.
Replace all UTF-8 spaces with standard spaces and then do the trim!
$string = preg_replace('/\s/u', ' ', $string);
echo trim($string);
This is it.
In most of the cases a simple strip_tags($string) will work.
If the above doesn't work, then you should try to identify the characters resorting to urlencode() and then act accordingly.
I see a couple of possible solutions
1) Get last char of string in PHP and check if it is a normal character (with regexp for example). If it is not a normal character, then remove it.
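// Remove any trailing bytes that are not printable ASCII (fine for email addresses, too aggressive for general text)
$string = preg_replace('/[^\x20-\x7E]+$/', '', $string);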
2) Convert your character from UTF-8 to the encoding of your CSV file and use str_replace. For example, if your CSV is encoded in ISO-8859-2
echo iconv('UTF-8', 'ISO-8859-2', "Â");
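Option 2 could look like the following sketch, assuming the CSV is ISO-8859-1 and the stray character is a no-break space (both are assumptions; the sample bytes are made up). It also shows where the Â comes from: encoding the single byte A0 as UTF-8 gives C2 A0, and C2 displayed as ISO-8859-1 is Â.

```php
<?php
// Made-up sample: raw ISO-8859-1 bytes with a trailing no-break space (0xA0).
$raw = "user@example.com\xA0";

// Convert the UTF-8 no-break space into the CSV's encoding, then strip it.
$nbsp = iconv('UTF-8', 'ISO-8859-1', "\u{00A0}");
$clean = str_replace($nbsp, '', $raw);

echo bin2hex($nbsp), "\n"; // a0
echo $clean, "\n";         // user@example.com

// Going the other way shows the two bytes behind the "Â".
$asUtf8 = iconv('ISO-8859-1', 'UTF-8', "\xA0");
echo bin2hex($asUtf8), "\n"; // c2a0
```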
I have some whitespace at the beginning of a paragraph in a text field in MySQL.
Using trim($var_text_field) in PHP or TRIM(text_field) in MySQL statements does absolutely nothing. What could this whitespace be and how do I remove it by code?
If I go into the database and backspace it out, it saves properly. It's just not being removed via the trim() functions.
function UberTrim($s) {
$s = preg_replace('/\xA0/u', ' ', $s); // replaces UTF-8 NBSP ("\xC2\xA0") with a plain space so trim() can remove it
$s = trim($s);
return $s;
}
The UTF-8 character encoding for a no-break space, Unicode (U+00A0), is the 2-byte sequence C2 A0. I tried to make use of the second parameter to trim() but that didn't do the trick. Example use:
assert("abc" === UberTrim(" \r\n \xc2\xa0 abc \t \xc2\xa0 "));
A MySQL replacement for TRIM(text_field) that also removes UTF-8 no-break spaces, thanks to #RudolfRein's comment (written here with a hex literal, because MySQL string literals have no \x escape):
TRIM(REPLACE(text_field, X'C2A0', ' '))
UTF-8 checklist:
Make sure your PHP source code editor saves files as UTF-8 without a BOM (usually set in the editor's preferences).
Make sure your MySQL client is set for UTF-8 character encoding, e.g.
$pdo = new PDO('mysql:host=...;dbname=...;charset=utf8',$userid,$password);
$pdo->exec("SET CHARACTER SET utf8");
Make sure your HTTP server is set for UTF-8, e.g. for Apache:
AddDefaultCharset UTF-8
Make sure the browser expects UTF-8.
header('Content-Type: text/html; charset=utf-8');
or
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
If the problem is with UTF-8 NBSP, another simple option is:
REPLACE(the_field, UNHEX('C2A0'), ' ')
The best solution is a combination of a few things mentioned to you already.
First run ORD() on the string in question. In my case I had to run a reverse first because my problem character was at the end of the string.
ORD(REVERSE([col_name]))
Once you discover the problematic char, run a
REPLACE([col_name], char([char_value_returned]), char(32))
Finally, call a proper
TRIM([col_name])
This will completely eradicate the problem character from all aspects of the string, and trim off the leading (in my case trailing) character.
Try using the MySQL ORD() function on the text_field to check the character code of the left-most character. It may be a non-whitespace character that merely looks like whitespace.
You have to detect these "whitespace" characters first. If it's some HTML entity, like &nbsp;, no trimming function would help, of course.
I'd suggest to print it out like this
echo urlencode($row['field']);
and see what it says
Well, as it's A0 (or 160 decimal), the non-breaking space character, you can convert it to an ordinary space first:
<pre><?php
$str = urldecode("%A0")."bla";
var_dump(trim($str));
$str = str_replace(chr(160)," ",$str);
$str = trim($str);
var_dump($str);
and, ta-dam! -
string(4) " bla"
string(3) "bla"
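The snippet above deals with the single-byte form (A0, as in ISO-8859-1). If the text is UTF-8, the no-break space is the two bytes C2 A0, and replacing chr(160) alone would leave a stray C2 behind. A sketch of the UTF-8 variant, with a made-up sample string:

```php
<?php
$str = "\xC2\xA0bla"; // UTF-8 no-break space followed by "bla"

// Replacing only the A0 byte leaves the lead byte C2 in place.
$wrong = trim(str_replace(chr(160), ' ', $str));
echo bin2hex($wrong), "\n"; // c220626c61

// Replacing the full two-byte sequence works.
$right = trim(str_replace("\xC2\xA0", ' ', $str));
var_dump($right); // string(3) "bla"
```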
Try to check what character each "whitespace" is by writing out its character code. It might be a non-visible character that isn't removed by trim.
Trim only removes a few characters such as space, tab, newline, CR and NUL, but there are other non-visible characters that might cause this problem.
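According to the PHP manual, the default list is space, tab, line feed, carriage return, NUL and vertical tab. A short sketch of what that means in practice, and of the optional second argument, which accepts byte ranges written with ".."; the sample strings are made up:

```php
<?php
// The six default bytes are all removed.
$a = trim(" \t\n\r\0\x0Babc");
var_dump($a); // string(3) "abc"

// A backspace (0x08) is not on the default list and survives.
$b = trim("\x08abc");
var_dump(strlen($b)); // int(4)

// With an explicit range, every control byte and the space are stripped.
$c = trim("\x08\x1F abc", "\x00..\x20");
var_dump($c); // string(3) "abc"
```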
try
$var_text_field = str_ireplace(array("\r", "\n", "\t"), '', $var_text_field);