PHP: Multibyte unicode conversion

I've been googling for a bit and have also searched here, but can't find a solution. I'm using PHP. I'm reading a text string (part of an X509 cert) in which é is encoded as \xC3\xA9 (André => Andr\xC3\xA9).
I've tried MonkeyPhysics's solution:
preg_replace("#(\\\x[0-9A-F]{2})#ei", "chr(hexdec('\\1'))", $string);
but then I get André
I've played around with the replacement part;
mb_convert_encoding('&#' . hexdec('\\1') . ';', 'ISO-8859-1', 'UTF-8')
(I also varied the to_encoding and from_encoding arguments.)
I've also looked at How to transliterate non-latin scripts? but got no closer.
Surely this should be a standard conversion?

The e modifier was deprecated in PHP 5.5 and removed in PHP 7.0. Use preg_replace_callback instead, with the u modifier for handling Unicode strings.
$string = 'His nickname was \xE2\x80\x98the Angel\xE2\x80\x99,
which is kind of a clich\xC3\xA9 in my opinion.';
// Capture only the two hex digits so that hexdec() receives valid input.
$repl = preg_replace_callback('#\\\\x([0-9A-F]{2})#ui',
function ($m) { return chr(hexdec($m[1])); }, $string);
OUTPUT:
His nickname was ‘the Angel’,
which is kind of a cliché in my opinion.
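
As a minimal sketch of why the callback approach works: each \xHH escape stands for one raw byte, and the bytes C3 A9 together form the UTF-8 encoding of é. Converting every escape back to its byte with chr() therefore rebuilds the UTF-8 string, with no further encoding conversion needed. The helper name decodeHexEscapes is hypothetical, and the validity check assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper name; it wraps the callback approach from the answer above.
function decodeHexEscapes(string $s): string
{
    // Each literal \xHH escape represents one raw byte; chr() emits that byte.
    return preg_replace_callback(
        '#\\\\x([0-9A-Fa-f]{2})#',
        function (array $m): string {
            return chr(hexdec($m[1]));
        },
        $s
    );
}

// Single quotes keep the backslashes literal, as in the certificate text.
$decoded = decodeHexEscapes('Andr\xC3\xA9');

// Inspect the resulting bytes and confirm they form valid UTF-8.
var_dump(bin2hex($decoded));
var_dump(mb_check_encoding($decoded, 'UTF-8'));
```

Treating each byte as its own code point (for example by building one numeric entity per escape) would instead turn the two bytes of é into two separate characters.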

Related

str_replace for UTF-16 characters

I have some strings containing characters such as \x{1f601} which I want to replace with some text.
When I do this using preg_replace, it would be something like:
preg_replace('/\x{1f601}/u', '######', $str)
However, this doesn't seem to work with str_replace:
str_replace("\x{1f601}", '######', $str)
How can I make such replacements work with str_replace?
preg_replace uses a Perl-compatible regular expression (PCRE) engine, whereas str_replace does not: it performs a plain, literal text replacement.
Your preg_replace pattern can be examined on regex101, which explains that it:
matches the character 😁 with position 0x1f601 (128513 decimal or 373001 octal) in the character set
The same replacement can be done without a regex by copying and pasting the smiley character directly into str_replace:
$str = str_replace("😁", '######', $str);
Or, by reading deceze's comment which gives you a clean, small solution.
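
If pasting the character into the source is not desirable, it can also be written with an escape sequence. The \x{1f601} notation is PCRE syntax and is not interpreted inside a PHP string literal, so str_replace ends up searching for that literal text. A minimal sketch, assuming PHP 7.0 or later for the "\u{...}" escape and a UTF-8 encoded $str (the sample sentence is made up for illustration):

```php
<?php
// Sample input for illustration only; "\u{...}" needs PHP 7.0+ and double quotes.
$str = "Good morning \u{1F601} everyone";

// Option 1: Unicode code point escape.
$a = str_replace("\u{1F601}", '######', $str);

// Option 2: the same character spelled out as its four UTF-8 bytes.
$b = str_replace("\xF0\x9F\x98\x81", '######', $str);

// Both variants search for the same byte sequence.
var_dump($a === $b);
echo $a, PHP_EOL;
```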
Additional:
You are working with multibyte characters, so it may be useful to explore mb_str_replace (GitHub), which is an accompaniment to (but not directly part of) PHP's mbstring collection of functions.
Finally:
Why do you need str_replace when you are already using preg_replace? Also, please read the manual, which states all of this fairly clearly.

preg_replace with :alnum: and UTF-8

I discovered that the u modifier sometimes helps when working with UTF-8 strings, but on my Linux server it replaces the umlaut with - instead of leaving it intact, as my Windows server does.
mb_internal_encoding('UTF-8');
function clean($string) {
return preg_replace('/[^[:alnum:]]/ui', '-', $string);
}
echo clean("Test: föG");
Linux:
Test--f-G
Windows (as it should):
Test--föG
From the PHP documentation of the PCRE module:
In UTF-8 mode, characters with values greater than 128 do not match any of the POSIX character classes.
This is probably for efficiency reasons, since there are many Unicode characters. You can write your regular expression using Unicode character properties instead of the POSIX character class, though this will be somewhat slower.
<?php
mb_internal_encoding('UTF-8');
function clean($string) {
return preg_replace('/[^\\p{L}\\p{N}]/ui', '-', $string);
}
echo clean("Test: föG");

PHP: How to encode U+FFFD in order to do a replace?

I'm trying to display a data feed on a page. We're experiencing encoding issues with a weird character. For some reason, in the feed there's the U+FFFD character. And htmlentities() will not escape the character, so I need to replace it manually. (I'm using PHP 5.3)
I've tried the following:
$string = str_replace( "\xFFFD", "_", $string );
$string = str_replace( "\XFFFD", "_", $string );
$string = str_replace( "\uFFFD", "_", $string );
$string = str_replace("\x{FFFD}", "_", $string );
$string = str_replace("\X{FFFD}", "_", $string );
$string = str_replace("\P{FFFD}", "_", $string );
$string = str_replace("\p{FFFD}", "_", $string );
None of the above work.
After reading this page - http://php.net/manual/en/regexp.reference.unicode.php - I'm not sure what I'm doing wrong. Do I need to compile UTF-8 support into PCRE?
You should try to fix the original problem. U+FFFD (the Unicode replacement character) is in most cases not meant to be a real text character; it is a sign that some bytes were decoded as a UTF encoding although they were not actually encoded that way. Decoders emit it as an alternative to silently discarding invalid bytes or halting the decoding process entirely. Either way, if you see it, there was an error.
There is no way to know what the original character was. Especially with your solution, since you replace the character with _, you cannot even know that the original source was decoded incorrectly. You should go back to the source and decode it properly.
Note: It's possible for a source text to use � as a literal, normal character, for instance when talking about it, and there is no error then. I am excluding this possibility in my answer.
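
A rough sketch of what "decode it properly" could look like when the raw bytes of the feed are still available. Once U+FFFD is in the data, the original character is lost, so this has to run before any lossy decoding step. The helper name toUtf8 is hypothetical, the ISO-8859-1 fallback is only an assumption about the feed's real encoding, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper; the fallback encoding is an assumption about the feed.
function toUtf8(string $raw, string $assumedEncoding = 'ISO-8859-1'): string
{
    // Bytes that already form valid UTF-8 are passed through untouched.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Otherwise reinterpret the bytes in the assumed source encoding.
    return mb_convert_encoding($raw, 'UTF-8', $assumedEncoding);
}

// "caf\xE9" is "café" in ISO-8859-1; the lone E9 byte is invalid as UTF-8.
$fixed = toUtf8("caf\xE9");
var_dump(bin2hex($fixed));
```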
Use preg_replace instead like this:
$string = preg_replace('#\x{FFFD}#u', '_', $string);
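In UTF-8, '�' (U+FFFD) is encoded as the three bytes EF BF BD.
str_replace works on bytes, so to replace the character you have to write out that multi-byte sequence as hex escapes:
\xEF \xBF \xBD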
$string = str_replace("\xEF\xBF\xBD",'X','My ��� some text');
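
The attempts in the question fail because PHP string literals only understand \x followed by at most two hex digits, so "\xFFFD" is the single byte FF followed by the text "FD", and "\x{FFFD}" is left as literal text. A small sketch comparing the working variants; the "\u{FFFD}" escape assumes PHP 7.0 or later, so it would not be available on the PHP 5.3 mentioned in the question, while the byte and regex variants do not depend on it.

```php
<?php
// Build the sample input from raw bytes so it does not depend on file encoding.
$input = "My \xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD some text";

// Replace by explicit UTF-8 bytes (works on old PHP versions too).
$viaBytes = str_replace("\xEF\xBF\xBD", '_', $input);

// Replace by code point escape (PHP 7.0+ only).
$viaEscape = str_replace("\u{FFFD}", '_', $input);

// Replace with PCRE in UTF-8 mode, where \x{FFFD} is understood.
$viaRegex = preg_replace('/\x{FFFD}/u', '_', $input);

// Show the bytes behind the code point escape.
var_dump(bin2hex("\u{FFFD}"));
var_dump($viaBytes === $viaEscape && $viaEscape === $viaRegex);
```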

PHP Regex Problem:

$string1 = preg_replace('/[^A-Za-z0-9äöü!&_=\+-]/', ' ', $string4);
This Regex shouldn't replace the chars äöü.
In Ruby it worked as expected.
But in PHP it replaces also the ä ö and ü.
Can someone give me a hint how to fix it?
Set the u pattern modifier (to tell php to treat the regex as a UTF-8 string).
'/[^A-Za-z0-9äöü!&_=\+-]/u'
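
A short sketch of the u modifier in use, assuming both the script file and the input are UTF-8 encoded (the sample input is made up for illustration). With u, each of ä, ö and ü inside the character class is treated as one character rather than as separate bytes.

```php
<?php
// Assumes this file is saved as UTF-8; the sample text is illustrative only.
$input = 'Grüße & more';

// Everything outside the allowed set (including ß and spaces) becomes a space.
$clean = preg_replace('/[^A-Za-z0-9äöü!&_=+-]/u', ' ', $input);

echo $clean, PHP_EOL;
```

If the pattern and the input are in different encodings (for example a Latin-1 source file and UTF-8 data), the class will not match the umlauts as intended, so it is worth checking both.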
I think this should work:
$string1 = preg_replace('/[^A-Za-z0-9\pL!&_=+-]/u', ' ', $string4);
Unicode support was one of the features promised for PHP 6, which was never released.
In PHP 5, use the multibyte string functions such as mb_ereg, or add the u modifier: with preg_match and preg_replace, PHP interprets '/regex/u' as a UTF-8 string.

How to get rid of "®" and "™" in a string?

I have a string like "Welcome to McDonalds®: I'm loving it™" ... I want to get rid of ":", "'", ® and ™ symbols. I have tried the following so far:
$string = "Welcome to McDonalds®: I'm loving it™";
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string);
But on the output I receive:
"Welcome to McDonaldsreg Im loving ittrade"... so preg_replace somehow converts ® to 'reg' and ™ to 'trade', which is not good for me and I cannot understand, why such a conversion happens at all.
How do I get rid of this conversion?
Solved: Thanks for ideas, guys. I solved the problem:
$string = preg_replace(
array('/[^a-zA-Z0-9 -]/', '/&[^\s]*;/'),
'',
preg_replace(
array('/&[^\s]*;/'),
'',
htmlentities($string)
)
);
You probably have the special characters in entity form, i.e. ® is really &reg; in your string, so the replacement operation does not see it.
To fix this, you could filter for &SOMETHING; substrings and remove them. There might also be built-in functions for this, perhaps html_entity_decode.
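
One way to sketch the html_entity_decode idea: decode the entities back into real characters first, then strip everything outside the allowed set. This assumes the input really contains &reg; and &trade; entities, as the output in the question suggests; the sample string is a reconstruction for illustration.

```php
<?php
// Assumed input: the symbols arrive as HTML entities rather than raw characters.
$string = 'Welcome to McDonalds&reg;: I&#039;m loving it&trade;';

// Turn the entities back into real UTF-8 characters.
$decoded = html_entity_decode($string, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Remove everything except ASCII letters, digits, spaces and hyphens.
$clean = preg_replace('/[^a-zA-Z0-9 -]/u', '', $decoded);

echo $clean, PHP_EOL;
```

Decoding first avoids leaving fragments such as "reg" and "trade" behind, because the whole symbol is removed as a single character.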
If you are looking to replace only the mentioned characters, use
$cleaned = str_replace(array('&reg;', '&trade;', '®', '™', ":", "'"), '', $string);
Regular string replacement methods are usually faster and there is nothing in your example you want to replace that would need the pattern matching power of the Regular Expression engine.
EDIT due to comments:
If you need to replace character patterns (as indicated by the solution you gave yourself), a Regex is indeed more appropriate and practical.
In addition, I'm sure McD requires both symbols to be in place if that slogan is used on any public website
® is &reg;, and ™ is &trade;. As such, you'll want to remove anything that follows the pattern &[#0-9a-z]+; beforehand:
$input = "Remove all &trade; and &reg; symbols, please.";
$string = preg_replace("/&[#0-9a-z]+;/i", "", $input);
