Route-problem regarding Url-encoded Umlauts (using the Zend-framework) - php

Today I stumbled upon a problem which seems to be a bug in the Zend Framework. Given the following route:
<test>
<route>citytest/:city</route>
<defaults>
<controller>result</controller>
<action>test</action>
</defaults>
<reqs>
<city>.+</city>
</reqs>
</test>
and three URLs:
mysite.local/citytest/Berlin
mysite.local/citytest/Hamburg
mysite.local/citytest/M%FCnchen
the last URL does not match and thus the correct controller is not called. Anybody got a clue why?
FYI, we are using Zend Framework 1.0 (yeah, I know that's ancient, but I am not in a position to change that :-/ )
Edit: From what I hear, we are going to upgrade to Zend 1.5.6 soon, but I don't know when, so a patch would be great.
Edit: I've tracked it down to the following line (Zend/Controller/Router/Route.php:170):
$regex = $this->_regexDelimiter . '^' .
$part['regex'] . '$' .
$this->_regexDelimiter . 'iu';
If I change that to
$this->_regexDelimiter . 'i';
it works. From what I understand, the u-modifier is for working with Asian characters. As I don't use them, I'm fine with that patch for now. Thanks for reading.

This works perfectly for me:
/^[\p{L}. -]*$/u
^ Start of the string
[ ... ]* Zero or more of the following:
\p{L} Unicode letter characters
. periods
(space) spaces
- dashes (placed last in the class so the hyphen cannot be read as part of a range)
$ End of the string
/u Enable Unicode mode in PHP
EXAMPLE:
$str = 'Füße';
if (!preg_match('/^[\p{L}. -]*$/u', $str))
{
    echo 'error';
}
else
{
    echo 'success';
}

The problem is the following:
Using the /u pattern modifier prevents words from being mangled but instead PCRE skips strings of characters with code values greater than 127. Therefore, \w will not match a multibyte (non-lower ascii) word at all (but also won't return portions of it). From the pcrepattern man page:
In UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. This is true even when Unicode character property support is available.
From Handling UTF-8 with PHP.
Therefore it's actually irrelevant if your URL is ISO-8859-1 encoded (mysite.local/citytest/M%FCnchen) or UTF-8 encoded (mysite.local/citytest/M%C3%BCnchen), the default regex won't match.
I also experimented with umlauts in URLs in Zend Framework and came to the conclusion that you don't really want umlauts in your URLs. The problem is that you cannot rely on the encoding used by the browser for the URL. Firefox (prior to 3.0), for example, does not UTF-8 encode URLs entered into the address bar (unless configured in about:config), and IE has a checkbox in its options to choose between regular and UTF-8 encoding for its URLs. But if you click on links within a page, both browsers use the URL in the page's encoding (UTF-8 on a UTF-8 page). Therefore you cannot be sure in which encoding the URLs are sent to your application, and detecting the encoding used is not trivial.
Perhaps it's better to use transliterated parameters in your URLs (e.g. change Ä to Ae and so on). There is a really simple way to do this (I don't know if this works with every language, but I'm using it with German strings and it works quite well):
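function createUrlFriendlyName($name) // $name must be a UTF-8 encoded string
{
    // Turn every non-ASCII character into its HTML entity (ü becomes &uuml;).
    $name = mb_convert_encoding(trim($name), 'HTML-ENTITIES', 'UTF-8');
    $name = preg_replace(
        // ligatures, umlauts, any other entity, then remaining non-word characters
        array('/ß/', '/&(..)lig;/', '/&([aouAOU])uml;/', '/&(.)[^;]*;/', '/\W/'),
        array('ss', '$1', '$1e', '$1', '-'),
        $name);
    // Collapse runs of dashes and strip them from both ends.
    $name = preg_replace('/-{2,}/', '-', $name);
    return trim($name, '-');
}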
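For German-only input, the same transliteration idea can also be sketched with an explicit character map instead of the HTML-entity round trip. This is only an illustration under assumptions: slugifyGerman is a hypothetical helper name, the map covers just the German umlauts and ß, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper (not the function above): explicit map, German characters only.
function slugifyGerman(string $name): string
{
    $map = [
        'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
        'ß' => 'ss',
    ];
    // strtr() replaces the UTF-8 byte sequences of the keys with the ASCII values.
    $name = strtr(trim($name), $map);
    // Replace every run of characters other than ASCII letters and digits with one dash.
    $name = preg_replace('/[^A-Za-z0-9]+/', '-', $name);
    return trim($name, '-');
}

echo slugifyGerman('München'), "\n";        // Muenchen
echo slugifyGerman('Füße & Straßen'), "\n"; // Fuesse-Strassen
```

Characters outside the map (for example é) are simply turned into dashes here, which is why the entity-based version is more forgiving for other languages.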

The u modifier makes the regexp expect utf-8 input. This would suggest that ZF expects utf-8 encoded input, and not ISO-8859-1 (I'm not too familiar with ZF, so I'm just guessing here).
If that's the case, you'll have to utf-8 encode the ü before using it in a URL. It would then become: mysite.local/citytest/M%C3%BCnchen
Note that since the rest of your application probably speaks ISO-8859-1 (Which is default for PHP <= 5), you will have to explicitly decode the variable with utf8_decode, before you can use it.
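The effect of the u modifier on the two URL variants can be checked in isolation. The following is a minimal sketch, not Zend Framework code: the patterns are hypothetical stand-ins built from the .+ requirement in the question, and the variable names are made up for illustration.

```php
<?php
// The two URL variants from the question, decoded to raw bytes.
$latin1 = rawurldecode('M%FCnchen');    // ISO-8859-1: ü is the single byte 0xFC
$utf8   = rawurldecode('M%C3%BCnchen'); // UTF-8: ü is the two bytes 0xC3 0xBC

// Stand-ins for the generated route regex, with and without the "u" modifier.
$withU    = '#^.+$#iu';
$withoutU = '#^.+$#i';

$latin1WithU = preg_match($withU, $latin1);   // false: the subject is not valid UTF-8
$errorCode   = preg_last_error();             // error code left by the call above
$utf8WithU   = preg_match($withU, $utf8);     // 1: a valid UTF-8 subject matches
$latin1NoU   = preg_match($withoutU, $latin1); // 1: without "u" the bytes are not validated

var_dump($latin1WithU, $errorCode === PREG_BAD_UTF8_ERROR, $utf8WithU, $latin1NoU);
```

With the u modifier the ISO-8859-1 variant does not merely fail to match: preg_match reports an error, which a router would most likely treat the same as a non-matching route.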

Related

How to replace a symbol in a text string in PHP?

I want to do a search & replace in PHP with a symbol.
This is the symbol: ➤
I want to replace it with a dash, but that doesn't work. The problem looks like that the symbol cannot be found, even though it's there.
Other 'normal' search and replace operations work as expected. But replacing this symbol does not.
Any ideas how to address this symbol, so that the search and replace function actually can find it and replace it?
Your problem is (almost certainly) related to text/character encoding.
Special characters such as the ➤ you are referring to, are not part of the classical ISO-8859-1 character set; they are however part of Unicode family (codepoint U+27A4 to be exact). This means that, in order to use this (multibyte)character, you have to use a unicode character set, which generally means UTF-8.
All the basic characters (think A-Z, numbers, spaces, ...) overlap between UTF-8 and ISO-8859-1 (which is effectively the default character set), so when you don't use any special characters, you could use the wrong charset and things will pretty much continue to work just fine; that is until you try to use a character that is not part of the basic set.
Since your problem takes place entirely on the server side (inside PHP), and doesn't really touch upon the HTTP and HTML layers, we won't have to go into utf-8 content-type headers and the like, but you should be aware of them for future issues (if you weren't already).
The issue you have should be resolved once you meet 2 criteria:
1. Not all PHP functions are multibyte-aware; I'm not 100% sure, but I think str_replace is one of those which is not. The preg_replace function with its u flag enabled definitely is multibyte-aware, and can serve the exact same function.
2. The text editor or IDE that you used to create the .php file may or may not be set to UTF-8 encoding; if it wasn't, then you should switch that in order to be able to use such characters literally inside the source code.
Something like this should function correctly assuming the .php-file is stored in UTF-8 format:
$output = preg_replace('#➤#u', '-', $input);
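To rule out the editor's file encoding as a factor, the symbol can be written as explicit UTF-8 bytes. The sketch below is illustrative only (the variable names are made up); it also shows that str_replace, which compares raw bytes, finds the symbol as long as needle and haystack use the same encoding.

```php
<?php
$arrow = "\xE2\x9E\xA4"; // U+27A4 as UTF-8 bytes, independent of the file encoding
$input = 'This is the symbol: ' . $arrow;

// Byte-wise replacement: works because needle and haystack share one encoding.
$viaStrReplace = str_replace($arrow, '-', $input);

// Regex replacement in UTF-8 mode, addressing the character by its code point.
$viaPregReplace = preg_replace('/\x{27A4}/u', '-', $input);

// A needle in another encoding (here the UTF-16BE bytes 0x27 0xA4) is never found.
$wrongNeedle = str_replace("\x27\xA4", '-', $input);

echo $viaStrReplace, "\n";  // This is the symbol: -
echo $viaPregReplace, "\n"; // This is the symbol: -
var_dump($wrongNeedle === $input); // bool(true): nothing was replaced
```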
Most likely you did not set the header of your PHP script to use the UTF-8 character set. Consider the following:
header('Content-type: text/plain; charset=utf-8');
$input = "This is the symbol: ➤";
$output = str_replace("➤", "-", $input);
echo $input . "\n" . $output;
This prints:
This is the symbol: ➤
This is the symbol: -
That symbol can be replaced with the built-in PHP str_replace function, so it would be better if you could share your code so we can check it further.
$str = "hey same let's change this to a dash: ➤";
echo "before: $str \n";
echo "after: ".str_replace("➤", "-", $str);
before: hey same let's change this to a dash: ➤
after: hey same let's change this to a dash: -

Which middot character is this?

$string = 'Single · Female';
I copied it from Facebook.
In the HTML source it's just that dot; how did they type it?
When echoing it in PHP, it's an A with circumflex (Â) concatenated with that same dot.
How can I explode this string with that dot?
It is U+00B7 MIDDLE DOT, a character used for many purposes, e.g. as a separator between links, alternatives, or other items.
If your code displays it as Â·, then the reason is that the UTF-8 encoded form of U+00B7, namely 0xC2 0xB7, is being misinterpreted as being ISO-8859-1 or Windows-1252 encoded. You should fix this basic problem (instead of trying to deal with some of its symptoms). See UTF-8 all the way through.
Regarding the question “how did they type it?”, we cannot really know, and we need not know. There are zillions of ways to type characters, and anyone can invent a few more. (On my keyboard, I use AltGr Shift X. If I needed to type “·” on a Windows computer with vanilla settings, I would use Alt 0183.)
I believe this is an interpunct. It can be used through the HTML entities &middot; or &#183; and in PHP with the Unicode value U+00B7.
If you want to echo the unicode character without HTML entities, you can set the character encoding to UTF-8. Splitting is done through explode("·", $textToSplit) given that your PHP file is using UTF-8 as character encoding.
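A minimal sketch of the split, writing the middle dot as explicit UTF-8 bytes so the result does not depend on how the .php file is saved (the variable names are illustrative):

```php
<?php
$middot = "\xC2\xB7"; // U+00B7 MIDDLE DOT as UTF-8 bytes
$string = 'Single ' . $middot . ' Female';

// explode() splits on the exact byte sequence; trim() removes the surrounding spaces.
$parts = array_map('trim', explode($middot, $string));

print_r($parts);             // Array ( [0] => Single [1] => Female )
echo bin2hex($middot), "\n"; // c2b7
```

If the input string is not UTF-8 (for example, it was converted to ISO-8859-1 somewhere along the way), the separator bytes differ and the split will not happen.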

PHP: html_entity_decode removing/not showing character

I am having a problem with  character on my website.
I have a website where users can use a WYSIWYG editor (ckeditor) to fill out their profile. The content is run through htmlpurify before being put into a database (for security reasons).
The database has all tables setup with UTF-8 charset. I also call 'SET NAMES utf-8' at the beginning of script execution to prevent problems (which has worked for years, as I haven't had this problem in a long time). The webpage the text is displayed on has a content-type of utf-8 and I also use the header() function to set the content-type and charset as well.
When displaying the text all seemed fine until I tried running a regular expression on the content. html_entity_decode (called with the encoding param of 'utf-8') is removing/not showing the  character for some reason and it leaves behind something which is causing all of my regexes to fail (it seems there is a character there but I cannot view it in the source).
How can I prevent and/or remove this character so I can run the regular expression?
EDIT: I have decided to abandon ckeditor and go with the markdown format like this site uses to have more flexibility. I have hated wysiwyg editors for as long as I remember. Updating all the profiles to the new format will give me a chance to remove all of the offending text and give the site a clean start. Thanks for all the input.
You are probably facing the situation that the string actually is not properly UTF-8 encoded (you wrote that it is, but it ain't). html_entity_decode might then replace any invalid UTF-8 byte sequences (e.g. a single-byte-charset encoding of Â) with a substitution character.
Depending on the PHP version you're using you've got more control how to deal with this by making use of the flags.
Additionally, to find the character you can't see, create a hexdump of the string.
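One way to produce such a hexdump is bin2hex. The sketch below uses a hypothetical helper name (hexDump) and a made-up sample string containing &nbsp;, which is only a guess at the kind of invisible character involved:

```php
<?php
// Hypothetical helper: render a string as space-separated hex bytes.
function hexDump(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

// Sample input (an assumption): an entity that decodes to an invisible character.
$decoded = html_entity_decode('a&nbsp;b', ENT_QUOTES, 'UTF-8');

echo hexDump($decoded), "\n"; // 61 c2 a0 62
```

Here the two bytes c2 a0 between a and b are the UTF-8 form of U+00A0 (no-break space), which looks like an ordinary space in the page source but is a different character to a regular expression.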
Since the character you are talking about exists within the ANSI charset, you can do this:
utf8_encode(preg_replace($match, $replace, utf8_decode($utf8_text)));
This will however destroy any unicode character not existing within the ANSI charset. To avoid this you can always try using mb_ereg_replace which has multibyte (unicode) support:
string mb_ereg_replace ( string $pattern , string $replacement , string $string [, string $option = "msr" ] )

Cakephp sanitizing and special characters

I'm using Sanitize::paranoid on a string, but I need to exclude a few special characters and it doesn't seem to work.
$content=sanitize::paranoid($content,array('à',' '));
I've changed the encoding of my file from ANSI to UTF-8, but CakePHP doesn't really like it, so I need to find another way.
That array should contain the list of characters to exclude from sanitization, but it keeps removing the "à" and I want those characters in the final string.
Sanitize::paranoid is a simple preg_replace ($allow is just additional characters, escaped):
preg_replace("/[^{$allow}a-zA-Z0-9]/", '', $string);
As you can see, paranoid is quite paranoid: it doesn't accept non-ASCII letters by default.
The file where you had the à was probably saved in another encoding (working on Windows?).
Anyway, if you want, you can write a better filter by using /[^\p{L}]/u, which keeps letters in any language and strips everything else.
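Such a filter could look roughly like the following. This is a sketch and not CakePHP code: paranoidUnicode is a hypothetical function name, it additionally keeps Unicode digits (\p{N}), and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical stand-in for Sanitize::paranoid that keeps letters and digits of any script.
function paranoidUnicode(string $string, array $allowed = []): string
{
    // Escape the extra allowed characters so they are safe inside a character class.
    $allow = '';
    foreach ($allowed as $char) {
        $allow .= preg_quote($char, '/');
    }
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8.
    $cleaned = preg_replace('/[^' . $allow . '\p{L}\p{N}]/u', '', $string);
    // preg_replace returns null on invalid UTF-8 input; fall back to an empty string.
    return $cleaned ?? '';
}

echo paranoidUnicode("voilà, c'est ça!", [' ']), "\n"; // voilà cest ça
```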
Taken from the Sanitize::paranoid function:
$cleaned = preg_replace("/[^{$allow}a-zA-Z0-9]/", '', $string);
Because your character (à) is not in this range it will not be returned.
If you're using Cake 2.x you can override the Sanitize class in your app folder and replace all occurrences of:
a-zA-Z0-9
with:
\w
This should return the accented character (it does for me). You can also look at the multibyte functions if you like, but that might be a problem if you're building a CMS.
It must be some special encoding problem that CakePHP's paranoid doesn't know about.
Sanitize::paranoid($badString, array(' ', '#')); // ' ' and '#' are the allowed characters
It should be working. I tried this example myself.

\w in PHP preg_replace covers only second byte of UTF-8 chars

we have this code:
$value = preg_replace("/[^\w]/", '', $value);
where $value is in UTF-8. After this transformation, the first byte of each multibyte character is stripped. How can we make \w cover UTF-8 characters completely?
Sorry, I am not very good at PHP.
You could try with the /u modifier:
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
If that won't do, try
mb_ereg_replace - Replace regular expression with multibyte support
instead.
There is this nasty u modifier to pcre patterns in PHP. It states that the regex is encoded in UTF8, but I found that it treats the input as UTF8, too.
Append u to regex, to turn on the multibyte unicode mode of PCRE:
$value = preg_replace("/[^\w]/u", '', $value);
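The difference can be seen side by side in a small sketch. It assumes a current PHP build, where the u modifier also enables Unicode character properties for \w (older PCRE versions behaved differently), and it writes the sample text as explicit UTF-8 bytes; the variable names are illustrative.

```php
<?php
$value = "Gr\xC3\xBC\xC3\x9Fe, W\xC3\xB6rld!"; // "Grüße, Wörld!" as UTF-8 bytes

// Without "u", PCRE works byte by byte, so the bytes of ü, ß and ö are
// usually treated as non-word characters and stripped (locale-dependent).
$withoutU = preg_replace('/[^\w]/', '', $value);

// With "u", the subject is read as UTF-8 and whole characters are kept.
$withU = preg_replace('/[^\w]/u', '', $value);

echo bin2hex($withoutU), "\n"; // inspect what is left of the multibyte characters
echo $withU, "\n";             // GrüßeWörld
```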
Corollary
In unicode mode, PCRE expects everything to be multibyte, and if it is not then there will be problems meeting deadlines. Therefore, to convert anything to UTF-8 (and drop any unconvertible junk), we first use:
$value = iconv('ISO-8859-1', 'UTF-8//IGNORE//TRANSLIT', $value);
to clean and prep the input.
Because any byte sequence can be read as ISO-8859-1 (even if some obscure characters appear incorrectly), and since most web browsers run natively in 8859 (unless told to use UTF-8), we've found this function to be a general, safe, effective method to 'take anything, drop any junk, and convert into UTF-8'.
The ereg_* functions are deprecated as of PHP 5.3.0 (the mb_ereg_* functions are not), so using plain ereg_replace is not the right way to go.
Try this function instead: http://php.net/manual/en/function.mb-ereg-replace.php
Use [^\w]+ instead of [^\w]
You can also use \W in place of [^\w]
