non-breaking utf-8 0xc2a0 space and preg_replace strange behaviour

In my string I have a UTF-8 non-breaking space (0xc2a0) and I want to replace it with something else.
When I use
$str=preg_replace('~\xc2\xa0~', 'X', $str);
it works OK.
But when I use
$str=preg_replace('~\x{C2A0}~siu', 'W', $str);
the non-breaking space is not found (and so not replaced).
Why? What is wrong with the second regexp?
The format \x{C2A0} is correct, and I also used the u flag.

The documentation about escape sequences in PHP is misleading on this point. The \xc2\xa0 syntax (without the u flag) matches the two raw bytes C2 A0, which happen to be the UTF-8 encoding of the character. The \x{c2a0} syntax with the u flag names a Unicode code point instead, so it looks for U+C2A0 (a Hangul syllable), not for the byte sequence.
A non-breaking space is U+00A0 in Unicode, encoded as C2 A0 in UTF-8. So if you use the pattern ~\x{00a0}~siu, it will work as expected.
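A minimal sketch of the three patterns side by side, assuming a short made-up input string (the variable names are illustrative only):

```php
<?php
// Assumed sample input: "a", a UTF-8 non-breaking space (bytes C2 A0), "b".
$str = "a\xC2\xA0b";

// Byte-level pattern, no u flag: matches the two raw bytes C2 A0.
$bytes = preg_replace('~\xc2\xa0~', 'X', $str);

// Code point pattern with the u flag: \x{00a0} is U+00A0.
$codepoint = preg_replace('~\x{00a0}~u', 'X', $str);

// \x{c2a0} with the u flag is the code point U+C2A0, which is not in $str.
$wrong = preg_replace('~\x{c2a0}~u', 'X', $str);

var_dump($bytes === 'aXb', $codepoint === 'aXb', $wrong === $str);
```

All three comparisons should come out true: the first two patterns replace the non-breaking space, the third leaves the string untouched.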

I've aggregated the previous answers so people can copy and paste the following code and choose their favorite method:
// The two leading non-breaking spaces are written as byte escapes so they survive copy/paste.
$some_text_with_non_breaking_spaces = "\xc2\xa0\xc2\xa0some text with 2 non breaking spaces at the beginning";
echo 'Qty non-breaking space : ' . substr_count($some_text_with_non_breaking_spaces, "\xc2\xa0") . '<br>';
echo $some_text_with_non_breaking_spaces . '<br>';
# Method 1 : regular expression
$clean_text = preg_replace('~\x{00a0}~siu', ' ', $some_text_with_non_breaking_spaces);
# Method 2 : convert to hex -> replace -> convert back to binary
$clean_text = hex2bin(str_replace('c2a0', '20', bin2hex($some_text_with_non_breaking_spaces)));
# Method 3 : my favorite
$clean_text = str_replace("\xc2\xa0", " ", $some_text_with_non_breaking_spaces);
echo 'Qty non-breaking space : ' . substr_count($clean_text, "\xc2\xa0"). '<br>';
echo $clean_text . '<br>';
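One caveat about the hex round trip is worth a quick sketch: the search string 'c2a0' can also match across byte boundaries in the hex dump. The input below is a made-up example chosen to trigger that; the plain byte-level str_replace is shown for comparison.

```php
<?php
// Made-up input with no non-breaking space at all: "L", "*", newline.
// Its hex dump is 4c 2a 0a, i.e. "4c2a0a", which contains "c2a0" at an odd offset.
$input = "L*\n";

// Hex round trip: the misaligned match is replaced and the data is corrupted.
$viaHex = hex2bin(str_replace('c2a0', '20', bin2hex($input)));

// Byte-level replace: only the real byte pair C2 A0 would match.
$viaBytes = str_replace("\xc2\xa0", ' ', $input);

var_dump($viaHex === "B\n");     // the hex version turned "L*" into "B"
var_dump($viaBytes === $input);  // the byte version left the input alone
```

This suggests preferring the direct str_replace (or the u-flag regex) over the hex round trip.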

The two patterns do different things, in my opinion: the first, \xc2\xa0, matches TWO bytes, \xc2 and \xa0, and replaces them.
In UTF-8, that byte pair happens to be the encoding of the code point U+00A0.
Does \x{00A0} work? With the u flag, that is the code point form of the character encoded as \xc2\xa0.

The variant ~\x{c2a0}~siu did not work for me.
The variant \x{00A0} works. I have now tried a second option as well, and here is the result:
I converted the string to hex and replaced the no-break space 0xC2 0xA0 (c2a0) with a space 0x20 (20).
Code:
$hex = bin2hex($item);
$_item = str_replace('c2a0', '20', $hex);
$item = hex2bin($_item);

/\x{00A0}/, /\xC2\xA0/ and the hex2bin/str_replace/bin2hex approach both worked and didn't work: if I printed the result to the screen, it was all good, but if I tried to save it to a file, the file would be blank!
I ended up using iconv('UTF-8', 'ISO-8859-1//IGNORE', $str);
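Note that converting to ISO-8859-1 changes the encoding of the whole string rather than removing the non-breaking space. A small sketch, assuming a made-up three-character input and leaving out the //IGNORE suffix for simplicity:

```php
<?php
// Assumed sample input: "a", a UTF-8 non-breaking space (bytes C2 A0), "b".
$str = "a\xC2\xA0b";

// ISO-8859-1 has its own non-breaking space at byte A0,
// so the character is transcoded, not dropped.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $str);

echo bin2hex($latin1), "\n"; // 61a062
```

The result is no longer valid UTF-8, so anything downstream that expects UTF-8 would need to be told about the new charset. This is only an observation about the conversion, not a diagnosis of the blank file.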

Related

Robustly detect dash in PHP string [duplicate]

preg_replace does not return desired result when I use it on string fetched from database.
$result = DB::connection("connection")->select("my query");
foreach($result as $row){
//prints run-d.m.c.
print($row->artist . "\n");
//should print run.d.m.c
//prints run-d.m.c
print(preg_replace("/-/", ".", $row->artist) . "\n");
}
This occurs only when I try to replace - (dash). I can replace any other character.
However if I try this regex on simple string it works as expected:
$str = "run-d.m.c";
//prints run.d.m.c
print(preg_replace("/-/", ".", $str) . "\n");
What am I missing here?

It turns out you have Unicode dashes in your strings. To match all Unicode dashes, use
/[\p{Pd}\xAD]/u
\p{Pd} matches any character in the Unicode category 'Punctuation, Dash'. That category does not include the soft hyphen (U+00AD), so \xAD is combined with \p{Pd} in a character class.
The /u modifier makes the pattern Unicode-aware and makes the regex engine treat the input string as a sequence of Unicode code points, not a sequence of bytes.
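A minimal sketch of the difference, assuming the database value contains U+2010 HYPHEN (the sample string is made up; the real data may hold a different dash):

```php
<?php
// Hypothetical artist name containing U+2010 HYPHEN instead of ASCII "-".
$artist = "run\u{2010}d.m.c";

// Inspecting the bytes helps spot the impostor:
// e28090 is U+2010 in UTF-8, whereas an ASCII "-" would show up as 2d.
echo bin2hex($artist), "\n";

// The ASCII-only pattern finds nothing to replace.
$asciiOnly = preg_replace('/-/', '.', $artist);

// The Unicode-aware pattern matches any dash punctuation or soft hyphen.
$anyDash = preg_replace('/[\p{Pd}\xAD]/u', '.', $artist);

var_dump($asciiOnly === $artist, $anyDash === 'run.d.m.c');
```

Dumping the string with bin2hex is a quick way to confirm which character is actually stored before choosing a pattern.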

PHP trim special character destroys other special character

I'm using this function to clean strings for elastic search:
function cleanString($string){
$string = mb_convert_encoding($string, "UTF-8");
$string = str_ireplace(array('<', '>'), array(' <', '> '), $string);
$string = strip_tags($string);
$string = filter_var($string, FILTER_SANITIZE_STRING);
$string = str_ireplace(array("\t", "\n", "\r", " "," ",":"), ' ', $string);
$string = str_ireplace(array("","«","»","£"), '', $string);
return trim($string, ",;.:-_*+~#'\"´`!§$%&/()=?«»");
}
It does all sorts of stuff, but the problem I am facing has to do with the trim function at the very end. It is supposed to trim away whitespace and special characters, and it worked fine until recently, when I added two more special characters to trim away from the string: « and ». This caused problems with another special character:
When I pass the word België into the function, the ë gets corrupted and elastic throws an error.
Why does trim corrupt a completely different character?
How can I fix that, so that I strip « and » and preserve ë?

trim is not encoding-aware and just looks at individual bytes. If you tell it to trim '«»', and that's encoded in UTF-8, it will look for the bytes C2 AB C2 BB (where the second C2 is redundant, so AB, BB and C2 are the actual search terms). "ë" in UTF-8 is C3 AB, so its second byte gets removed and the character is thereby broken.
You'll need to use an encoding-aware function to safely remove multibyte characters, e.g.:
preg_replace('/^[«»]+|[«»]+$/u', '', $str)
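A short sketch that reproduces the corruption and the fix. The characters are written as escapes (\u{AB} is «, \u{BB} is », \u{EB} is ë) so the example does not depend on the file's own encoding; the variable names are illustrative.

```php
<?php
$word = "Belgi\u{EB}"; // "België", ending in the bytes C3 AB

// Byte-wise trim: the character list is the bytes C2, AB, BB,
// so the trailing AB of "ë" is stripped and a lone C3 is left behind.
$broken = trim($word, "\u{AB}\u{BB}");
echo bin2hex(substr($broken, -1)), "\n"; // c3

// With the u modifier, preg_match returns false for a subject that is not valid UTF-8.
var_dump(preg_match('//u', $broken) === false);

// Code point aware trim of the guillemets only.
$quoted = "\u{AB}Belgi\u{EB}\u{BB}";
$clean = preg_replace('/^[\x{AB}\x{BB}]+|[\x{AB}\x{BB}]+$/u', '', $quoted);
var_dump($clean === $word);
```

Both var_dump calls should print true: the byte-wise trim produces invalid UTF-8, while the regex version returns the word intact.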

Delete spaces php

I need to delete all tags from a string and leave it without spaces.
I have the string
"<span class="left_corner"> </span><span class="text">Adv</span><span class="right_corner"> </span>"
After using strip_tags I get string
" Adv "
Using the trim function I can't delete the spaces.
The JSON string looks like "\u00a0...\u00a0".
Please help me delete these spaces.

Solution to this problem:
$str = trim($str, chr(0xC2).chr(0xA0));

You should use preg_replace() to do this in a multibyte-safe way.
$str = preg_replace('/^[\s\x00]+|[\s\x00]+$/u', '', $str);
Notes:
this fixes #Андрей-Сердюк's original problem: it trims \u00a0, because with the /u modifier \s matches Unicode non-breaking spaces too
the /u modifier (PCRE_UTF8) tells PCRE to treat the subject as a UTF-8 string
\x00 matches null bytes, to mimic the default trim() behavior
The accepted trim() answer from #Андрей-Сердюк will corrupt multibyte strings.
Example:
// This works:
echo trim(' Hello ', ' '.chr(0xC2).chr(0xA0));
// > "Hello"
// And this doesn't work:
echo trim(' Solidarietà ', ' '.chr(0xC2).chr(0xA0));
// > "Solidariet?" -- invalid UTF-8 character sequence
// This works for both single-byte and multi-byte sequences:
echo preg_replace('/^[\s\x00]+|[\s\x00]+$/u', '', ' Hello ');
// > "Hello"
echo preg_replace('/^[\s\x00]+|[\s\x00]+$/u', '', ' Solidarietà ');
// > "Solidarietà"

How about:
$string = '" Adv "';
$noSpace = preg_replace('/\s/', '', $string);
?
http://php.net/manual/en/function.preg-replace.php

I used the accepted solution for years, and I was wrong all that time. If I can still find this solution in 2022, others can too, so please change the accepted solution to the one from #e1v, who was right all along.
You SHOULD NOT DO THIS!
echo trim('Au delà', ' '.chr(0xC2).chr(0xA0));
As it corrupts the UTF-8 encoding:
Au del�
Note that a "modern" (PHP 7) way to write this could be:
echo trim('Au delà', " \u{a0}");//This is WRONG, don't do it!
Personally, when I have to deal with non-breaking spaces (Unicode 00A0, UTF-8 C2A0) in strings, I replace the leading/trailing ones with regular spaces (Unicode 0020, UTF-8 20), and then trim the string. Like this:
echo trim(preg_replace('/^\s+|\s+$/u', ' ', "Au delà\u{a0}"));
(I would have posted a comment or just voted the answer up, but I can't.)

$str = '<span class="left_corner"> </span><span class="text">Adv</span><span class="right_corner"> </span>';
$rgx = '#(<[^>]+>)|(\s+)#';
$cleaned_str = preg_replace( $rgx, '' , $str );
echo '['. $cleaned_str .']';

Why is my attempt to replace a character in a string failing?

I have a string (taken from a MySQL database if it makes any difference) which looks normal enough:
Manufacture: Blah
The problem is that the space between Manufacture: and the <a> tag has a charcode of 194, not 32 as I would expect.
This is causing a preg_match with the following pattern to fail (please ignore the attempts to parse HTML with regex, I know it's not a good idea but this particular dataset is predictable enough to get away with it):
/Manufacture: *(<a[^>]*>([A-Za-z- 0-9]+)<\/a>)/i
If I replace the rogue space with a normal space character in a text editor and try again, the expression matches as expected, but I need to alter it programmatically.
I tried str_replace:
$text = str_replace(chr(194), ' ', $text);
But the preg_match still fails. I then tried preg_replace:
$text = preg_replace('/[\xC2]/', ' ', $text);
But that doesn't work either, even though running that same pattern through preg_match does find the expected match.
Does anyone have any ideas?

Can you please check the structure of the MySQL table where you get the contents of $text from? If the collation is utf8_general_ci or something like that, then your string most likely contains a multi-byte UTF-8 character. (Char code 194 is 0xC2, the first byte of the two-byte sequence C2 A0 that encodes a non-breaking space.)
If that is the case, the PHP function iconv should do the trick. Here's the example from the PHP manual. The //IGNORE option drops characters that cannot be represented in the target charset.
<?php
$text = "This is the Euro symbol '€'.";
echo 'Original : ', $text, PHP_EOL;
echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL;
echo 'IGNORE : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL;
echo 'Plain : ', iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
?>
The above example will output something similar to:
Original : This is the Euro symbol '€'.
TRANSLIT : This is the Euro symbol 'EUR'.
IGNORE : This is the Euro symbol ''.
Plain :
Notice: iconv(): Detected an illegal character in input string in .\iconv-example.php on line 7
This is the Euro symbol '

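What if you try to match any whitespace character?
Like so (note the u modifier: without it, \s only matches ASCII whitespace and will not match a UTF-8 non-breaking space):
/Manufacture:\s*(<a[^>]*>([A-Za-z- 0-9]+)<\/a>)/iu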
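A minimal sketch of that idea, assuming the rogue space is a UTF-8 non-breaking space (C2 A0) and using a made-up sample string. It suggests that \s only covers that character once the u modifier is present:

```php
<?php
// Hypothetical input: a non-breaking space (U+00A0) between the label and the link.
$text = "Manufacture:\u{A0}<a href=\"#\">Blah</a>";

// Without u, \s is limited to ASCII whitespace, so the match fails.
$withoutU = preg_match('/Manufacture:\s*(<a[^>]*>([A-Za-z- 0-9]+)<\/a>)/i', $text);

// With u, \s also matches U+00A0 and the link text is captured in group 2.
$withU = preg_match('/Manufacture:\s*(<a[^>]*>([A-Za-z- 0-9]+)<\/a>)/iu', $text, $m);

var_dump($withoutU, $withU, $m[2]);
```

This also hints at why replacing chr(194) alone did not help: it removes only the first byte of the pair and leaves a stray 0xA0 behind.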
