PHP regular expressions character encoding issue - php

My regular expression won't handle accented characters, so it finds no matches when I search for words containing ü, õ, ö or ä.
$data is HTML data stripped of its tags using strip_tags. It contains words with ü, õ, ö and ä and is loaded via cURL from a website whose character encoding is UTF-8 (according to the returned headers):
$data = strip_tags( curl_exec('my_website_url') );
$match = preg_match( '/ü/' , $data , $matches );
I have tried using the following (also with 'ISO-8859-1'):
mb_internal_encoding("UTF-8");
mb_regex_encoding('UTF-8');
or:
$data = utf8_decode($data);
No success yet.

Make sure your PHP source file is UTF-8 encoded as well.
If it's for example ISO-8859-1, the ü in your preg_match directive will be a different character from the üs in your UTF-8 data.

You should tell PCRE that you are using UTF-8, which is done by adding the u modifier: '/ü/u'. But if possible, do not put these characters directly into source code. If you (or your editor) change the encoding of the file, your code will stop working, and tracing this down would be quite a PITA. Instead of using '/ü/' directly, I'd suggest replacing the character in question with its code point: '/\x{fc}/u'. U+00FC is your letter; 0xC3 0xBC are its UTF-8 bytes, and in a /u pattern \x{...} takes the code point, not the bytes.
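As a minimal sketch of that suggestion (the sample string is made up, and $data is assumed to be valid UTF-8): in a /u pattern, \x{...} names a Unicode code point, so ü (U+00FC) is written \x{fc}.

```php
<?php
// Sample UTF-8 data; the ü is written as its two UTF-8 bytes (0xC3 0xBC)
// so the encoding of this source file does not matter.
$data = "M\xC3\xBCnchen";

// With the u modifier, \x{fc} means code point U+00FC (ü).
$found = preg_match('/\x{fc}/u', $data, $matches);

// $found is the match count; $matches[0] holds the matched bytes.
var_dump($found, bin2hex($matches[0]));
```

Because both the pattern and the data are spelled with escapes, re-saving the file in another encoding should not change the result.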

Related

Change encoding from windows-1251 to utf-8

I'm trying to decode files created in windows-1251 and encode them to UTF-8. Everything works except some special characters such as ÅÄÖåäö. E.g. Ä becomes Ž, which I then fix with preg_replace; that works fine, as shown below:
$file = preg_replace("/\Ž/", 'Ä', $file);
I'm having trouble with Å which shows up like this <U+008F>, which I see translates to single shift three and I can't seem to use preg_replace on it?
You have two major built-in functions to do the job, just pick one:
Multibyte String:
$file = mb_convert_encoding($file, 'UTF-8', 'Windows-1251');
iconv:
$file = iconv('Windows-1251', 'UTF-8', $file);
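A small sketch comparing the two calls, assuming both the mbstring and iconv extensions are enabled. The sample bytes are an illustrative Windows-1251 string ("Привет"), not data from the question.

```php
<?php
// "Привет" encoded as Windows-1251 bytes (illustrative sample input).
$file = "\xCF\xF0\xE8\xE2\xE5\xF2";

$viaMb    = mb_convert_encoding($file, 'UTF-8', 'Windows-1251');
$viaIconv = iconv('Windows-1251', 'UTF-8', $file);

// Both calls are expected to yield the same UTF-8 byte sequence.
var_dump($viaMb === $viaIconv, bin2hex($viaMb));
```

Note the argument order: mb_convert_encoding takes the target encoding before the source encoding, while iconv takes the source first.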
To determine why your homebrew alternative doesn't work we'd need to spend some time reviewing the complete codebase but I can think of some potential issues:
You're working with mixed encodings yet you aren't using hexadecimal notation or string entities of any kind. It's also unclear what encoding the script file itself is saved as.
There's no \Ž escape sequence in PCRE (no idea what the intention was).
Perhaps you're replacing some strings more than once.
Last but not least, have you compiled a complete and correct character mapping database of at least the 128 code points that differ between both encodings?

PHP Unicode Character Detection

I'm trying to get the contents of a certain webpage and replace this mark: ’ with another substring. It's not a regular apostrophe, and even substr_count($content,"’") returns 0.
It seems I cannot detect that mark, and therefore can't replace it using substr_replace.
How can I handle this problem?
Thanks in advance.
Most likely the $content and the ’ character in your source code are simply not in the same encoding. substr_count compares byte by byte. The ’ character in your source code has the byte representation of however your PHP file is encoded. The $content has the encoding of whatever encoding it's in. If the two don't match, the substring won't be found.
Convert the $content to some standardized encoding you're working in.
Read What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
If you are working with Unicode characters, it's wise to use the multibyte string functions:
http://www.php.net/manual/en/function.mb-substr-count.php
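The byte-by-byte mismatch can be sketched as follows. This assumes, purely for illustration, that the page arrived as Windows-1252; the real source encoding has to be taken from the response headers or the page itself.

```php
<?php
// Assumption for illustration: the content is Windows-1252, where the
// right single quotation mark (U+2019) is the single byte 0x92.
$content = "It\x92s here";

// The same character in UTF-8 is three bytes: E2 80 99.
$needle = "\xE2\x80\x99";

// Before conversion the byte sequences differ, so nothing is found.
$before = substr_count($content, $needle);

// Convert the content to UTF-8, the encoding the needle is written in.
$utf8  = mb_convert_encoding($content, 'UTF-8', 'Windows-1252');
$after = substr_count($utf8, $needle);

// Now the replacement works on matching bytes.
$replaced = str_replace($needle, "'", $utf8);
var_dump($before, $after, $replaced);
```

Writing the needle as byte escapes keeps the comparison independent of how the PHP file itself is saved.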

PHP: html_entity_decode removing/not showing character

I am having a problem with  character on my website.
I have a website where users can use a wysiwyg editor (ckeditor) to fill out their profile. The content is run through htmlpurify before being put into a database (for security reasons).
The database has all tables set up with the UTF-8 charset. I also call 'SET NAMES utf-8' at the beginning of script execution to prevent problems (which has worked for years, as I haven't had this problem in a long time). The webpage the text is displayed on has a content-type of utf-8, and I also use the header() function to set the content-type and charset.
When displaying the text all seemed fine until I tried running a regular expression on the content. html_entity_decode (called with the encoding param of 'utf-8') is removing/not showing the  character for some reason and it leaves behind something which is causing all of my regexes to fail (it seems there is a character there but I cannot view it in the source).
How can I prevent and/or remove this character so I can run the regular expression?
EDIT: I have decided to abandon ckeditor and go with the markdown format like this site uses to have more flexibility. I have hated wysiwyg editors for as long as I remember. Updating all the profiles to the new format will give me a chance to remove all of the offending text and give the site a clean start. Thanks for all the input.
You are probably facing the situation that the string actually is not properly UTF-8 encoded (as you wrote it is, but it ain't). html_entity_decode might then remove any invalid UTF-8 byte sequences (e.g. single-byte-charset encoding of Â) or replace them with a substitution character.
Depending on the PHP version you're using, you've got more control over how to deal with this by making use of the flags.
Additionally, to find the character you can't see, create a hex dump of the string.
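One way to produce such a hex dump is bin2hex, sketched here on a made-up string that contains an invisible character and a stray byte. mb_check_encoding (from the mbstring extension, assumed available) reports whether the bytes are valid UTF-8 at all.

```php
<?php
// Illustrative input: "a", a UTF-8 no-break space (C2 A0), "b", then a
// lone 0xC2 byte, which is not valid UTF-8 on its own.
$text = "a\xC2\xA0b\xC2";

// Two hex digits per byte, separated by spaces.
$hex = trim(chunk_split(bin2hex($text), 2, ' '));

// Check whether the whole string is well-formed UTF-8.
$valid = mb_check_encoding($text, 'UTF-8');

echo $hex, "\n";
var_dump($valid);
```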
Since the character you are talking about exists within the ANSI charset, you can do this:
utf8_encode(preg_replace($match, $replace, utf8_decode($utf8_text)));
This will however destroy any unicode character not existing within the ANSI charset. To avoid this you can always try using mb_ereg_replace which has multibyte (unicode) support:
string mb_ereg_replace ( string $pattern , string $replacement , string $string [, string $option = "msr" ] )

Replace unicode character

I am trying to replace a certain character in a string with another. They are quite obscure latin characters. I want to replace character (hex) 259 with 4d9, so I tried this:
str_replace("\x02\x59","\x04\xd9",$string);
This didn't work. How do I go about this?
EDIT: Additional information.
Thanks bobince, that has done the trick. Although, I want to replace the uppercase schwa also and it is not working for some reason. I calculated U+018F (Ə) as UTF-8 0xC68F and this is to be replaced with U+04D8 (0xD398):
$string = str_replace("\xC9\x99", "\xD3\x99", $_POST['string_with_schwa']); //lc 259->4d9
$string = str_replace( "\xC6\8F", "\xD3\x98" , $string); //uc 18f->4d8
I am copying the 'Ə' into a textbox and posting it. The first str_replace works fine on the lowercase, but does not detect the uppercase in the second str_replace, strange. It remains as U+018F. Guess I could run the string through strtolower but this should work though.
U+0259 Latin Small Letter Schwa is only encoded as the byte sequence 0x02,0x59 in the UTF-16BE encoding. It is very unlikely you will be working with byte strings in the UTF-16BE encoding as it's not an ASCII-compatible encoding and almost no-one uses it.
The encoding you want to be working with (the only ASCII-superset encoding to support both Latin Schwa and Cyrillic Schwa, as it supports all Unicode characters) is UTF-8. Ensure your input is in UTF-8 format (if it is coming from form data, serve the page containing the form as UTF-8). Then, in UTF-8, the character U+0259 is represented using the byte sequence 0xC9,0x99.
str_replace("\xC9\x99", "\xD3\x99", $string);
If you make sure to save your .php file as UTF-8-no-BOM in the text editor, you can skip the escaping and just directly say:
str_replace('ə', 'ә', $string);
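A sketch covering both cases with byte escapes, assuming the input is UTF-8. Every hex escape needs its own \x: a sequence such as "\8F" is not a hex escape in PHP, so it stays as a literal backslash followed by "8F". The sample input is made up.

```php
<?php
// Byte-escaped UTF-8, so the encoding of this file is irrelevant.
$map = [
    "\xC9\x99" => "\xD3\x99", // U+0259 (Latin small schwa)   -> U+04D9
    "\xC6\x8F" => "\xD3\x98", // U+018F (Latin capital schwa) -> U+04D8
];

// Sample input: Latin capital schwa followed by Latin small schwa.
$string = "\xC6\x8F\xC9\x99";

// strtr() with an array applies all replacements in a single pass.
$string = strtr($string, $map);

var_dump(bin2hex($string));
```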
A couple of possible suggestions. Firstly, remember that you need to assign the new value to $string, i.e.:
$string = str_replace("\x02\x59","\x04\xd9",$string);
Secondly, verify that your byte stream occurs in the $string. I mention this because your hex string begins with a low-byte, so you'll need to make sure your $string is not UTF8 encoded.

Route-problem regarding Url-encoded Umlauts (using the Zend-framework)

Today I stumbled upon a problem which seems to be a bug in the Zend Framework. Given the following route:
<test>
<route>citytest/:city</route>
<defaults>
<controller>result</controller>
<action>test</action>
</defaults>
<reqs>
<city>.+</city>
</reqs>
</test>
and three Urls:
mysite.local/citytest/Berlin
mysite.local/citytest/Hamburg
mysite.local/citytest/M%FCnchen
the last Url does not match and thus the correct controller is not called. Anybody got a clue why?
FYI, we are using Zend Framework 1.0 (yeah, I know that's ancient, but I am not in a position to change that :-/ )
Edit: From what I hear, we are going to upgrade to Zend 1.5.6 soon, but I don't know when, so a Patch would be great.
Edit: I've tracked it down to the following line (Zend/Controller/Router/Route.php:170):
$regex = $this->_regexDelimiter . '^' .
$part['regex'] . '$' .
$this->_regexDelimiter . 'iu';
If I change that to
$this->_regexDelimiter . 'i';
it works. From what I understand, the u-modifier is for working with Asian characters. As I don't use them, I'm fine with that patch for now. Thanks for reading.
This works perfectly for me:
/^[\p{L}\-. ]*$/u
^         Start of the string
[ ... ]*  Zero or more of the following:
\p{L}     Unicode letter characters
\-        dashes (escaped so the hyphen is not read as a range)
.         periods
(space)   spaces
$         End of the string
/u        Enable Unicode mode in PHP
EXAMPLE:
$str = 'Füße';
if (!preg_match('/^[\p{L}\-. ]*$/u', $str))
{
    echo 'error';
}
else
{
    echo 'success';
}
The problem is the following:
Using the /u pattern modifier prevents words from being mangled but instead PCRE skips strings of characters with code values greater than 127. Therefore, \w will not match a multibyte (non-lower ascii) word at all (but also won't return portions of it). From the pcrepattern man page:
In UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. This is true even when Unicode character property support is available.
From Handling UTF-8 with PHP.
Therefore it's actually irrelevant if your URL is ISO-8859-1 encoded (mysite.local/citytest/M%FCnchen) or UTF-8 encoded (mysite.local/citytest/M%C3%BCnchen), the default regex won't match.
I also experimented with umlauts in URLs in Zend Framework and came to the conclusion that you wouldn't really want umlauts in your URLs. The problem is that you cannot rely on the encoding used by the browser for the URL. Firefox (prior to 3.0), for example, does not UTF-8 encode URLs entered into the address textbox (unless specified in about:config), and IE has a checkbox within its options to choose between regular and UTF-8 encoding for its URLs. But if you click on links within a page, both browsers use the URL in the given encoding (UTF-8 on a UTF-8 page). Therefore you cannot be sure in which encoding the URLs are sent to your application, and detecting the encoding used is not that trivial to do.
Perhaps it's better to use transliterated parameters in your URLs (e.g. change Ä to Ae and so on). There is a really simple way to do this (I don't know if this works with every language, but I'm using it with German strings and it works quite well):
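function createUrlFriendlyName($name) // $name must be a UTF-8 encoded string
{
    // Turn non-ASCII characters into named HTML entities (ü becomes &uuml;).
    $name = mb_convert_encoding(trim($name), 'HTML-ENTITIES', 'UTF-8');
    // Reduce entities to ASCII letters and replace all other non-word characters with dashes.
    $name = preg_replace(
        array('/ß/', '/&(..)lig;/', '/&([aouAOU])uml;/', '/&(.)[^;]*;/', '/\W/'),
        array('ss', '$1', '$1e', '$1', '-'),
        $name);
    // Collapse runs of dashes and strip them from both ends.
    $name = preg_replace('/-{2,}/', '-', $name);
    return trim($name, '-');
}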
The u modifier makes the regexp expect utf-8 input. This would suggest that ZF expects utf-8 encoded input, and not ISO-8859-1 (I'm not too familiar with ZF, so I'm just guessing here).
If that's the case, you'll have to utf-8 encode the ü before using it in a URL. It would then become: mysite.local/citytest/M%C3%BCnchen
Note that since the rest of your application probably speaks ISO-8859-1 (which is the default for PHP <= 5), you will have to explicitly decode the variable with utf8_decode before you can use it.
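The effect of the u modifier on the two URL forms can be checked directly. In this sketch, the pattern is a hypothetical stand-in for the one the router builds, not Zend Framework's actual code; preg_match returns false when a /u subject is not valid UTF-8, and preg_last_error reports why.

```php
<?php
// Hypothetical stand-in for the router's pattern: the route regex ".+"
// anchored and wrapped with the i and u modifiers.
$regex = '#^.+$#iu';

$latin1 = rawurldecode('M%FCnchen');     // a lone 0xFC byte is not valid UTF-8
$utf8   = rawurldecode('M%C3%BCnchen');  // C3 BC is ü in UTF-8

// With /u, an invalid UTF-8 subject makes preg_match() fail outright.
$r1   = preg_match($regex, $latin1);
$err1 = preg_last_error();

// The UTF-8 encoded form is accepted and matches.
$r2 = preg_match($regex, $utf8);

var_dump($r1, $err1 === PREG_BAD_UTF8_ERROR, $r2);
```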
