How to replace a symbol in a text string in PHP? - php

I want to do a search & replace in PHP with a symbol.
This is the symbol: ➤
I want to replace it with a dash, but that doesn't work. It looks as if the symbol cannot be found, even though it's there.
Other 'normal' search and replace operations work as expected. But replacing this symbol does not.
Any ideas how to address this symbol, so that the search and replace function actually can find it and replace it?

Your problem is (almost certainly) related to text/character encoding.
Special characters such as the ➤ you are referring to are not part of the classical ISO-8859-1 character set; they are, however, part of Unicode (code point U+27A4, to be exact). This means that, in order to use this (multibyte) character, you have to use a Unicode encoding, which generally means UTF-8.
All the basic characters (think A-Z, numbers, spaces, ...) overlap between UTF-8 and ISO-8859-1 (which is effectively the default character set), so when you don't use any special characters, you could use the wrong charset and things will pretty much continue to work just fine; that is until you try to use a character that is not part of the basic set.
Since your problem takes place entirely on the server side (inside PHP), and doesn't really touch upon the HTTP and HTML layers, we won't have to go into utf-8 content-type headers and the like, but you should be aware of them for future issues (if you weren't already).
The issue you have should be resolved once you meet 2 criteria:
Not all PHP functions are multibyte-aware, and str_replace is one of those that are not. It compares raw bytes, though, so it still finds the symbol as long as the search string and the subject use the same encoding. The preg_replace function with its u flag enabled definitely is multibyte-aware, and can serve the exact same function.
The text editor or IDE that you used to create the .php file may or may not be set to UTF-8 encoding; if it wasn't, you should switch that in order to be able to use such characters literally inside the source code.
Something like this should function correctly assuming the .php-file is stored in UTF-8 format:
$output = preg_replace('#➤#u', '-', $input);
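To check whether the symbol in your data really is the UTF-8 encoded U+27A4, a minimal sketch like the following may help. It assumes PHP 7 or later for the "\u{...}" escape, which avoids depending on how the .php file is saved; the variable names are illustrative only.

```php
<?php
// "\u{27A4}" yields the UTF-8 bytes of the symbol regardless of the source file's encoding (PHP 7+).
$symbol = "\u{27A4}";
$input  = "This is the symbol: " . $symbol;

// The three UTF-8 bytes of U+27A4, as hex.
$symbolHex = bin2hex($symbol);

// str_replace compares raw bytes, so it works when both strings are UTF-8.
$viaStrReplace = str_replace($symbol, '-', $input);

// preg_replace with the u flag treats pattern and subject as UTF-8.
$viaPreg = preg_replace('/\x{27A4}/u', '-', $input);
```

If bin2hex() on the symbol taken from your real input shows different bytes, the data is probably in another encoding, which would explain why the search fails.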

Most likely you did not set the header of your PHP script to use the UTF-8 character set. Consider the following:
header('Content-type: text/plain; charset=utf-8');
$input = "This is the symbol: ➤";
$output = str_replace("➤", "-", $input);
echo $input . "\n" . $output;
This prints:
This is the symbol: ➤
This is the symbol: -

This should be a simple replacement with PHP's built-in str_replace function, so it would help if you could share your code so we can check it further.
$str = "hey same let's change this to a dash: ➤";
echo "before: $str \n";
echo "after: ".str_replace("➤", "-", $str);
before: hey same let's change this to a dash: ➤
after: hey same let's change this to a dash: -

Related

Manipulating Thai Characters in PHP

I'm struggling getting Thai characters and PHP working together. This is what I'd like to do:
<?php
mb_internal_encoding('UTF-8');
$string = "ทาง";
echo $string[0];
?>
But instead of giving me the first character of $string (ท), I just get some messed up output. However, displaying $string itself works fine.
The file itself is of course UTF-8 as well. The Content-Type header is also set to UTF-8. I changed the necessary lines in php.ini according to this site.
utf8_encode() and utf8_decode() also don't help. Maybe one of you has an idea?
In PHP, when you access a string with $string[0], it doesn't return the first character, but the first byte.
You should use mb_substr instead. For example:
mb_substr($string, 0, 1, 'UTF-8');
Note: Since you are using mb_internal_encoding('UTF-8'); you may as well omit the last parameter.
This happens because PHP is not aware of the encoding a string is in (that is: the encoding is not stored in the string object). So it will treat it as a sequence of single bytes (ANSI/ASCII) by default. If you don't want that, then you must use the Multibyte String Functions (mb_*).
When you set mb_internal_encoding('UTF-8'); you are telling it to use UTF-8 for all the Multibyte String Functions, but not for anything else.
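The difference between bytes and characters can be sketched as follows. This is only an illustration, assuming PHP 7 or later and the mbstring extension; the string is written with escapes so the result does not depend on the file's encoding.

```php
<?php
mb_internal_encoding('UTF-8');

// The Thai word from the question, written as code points U+0E17 U+0E32 U+0E07.
$string = "\u{0E17}\u{0E32}\u{0E07}";

$firstByte = $string[0];               // one byte (0xE0), not a complete character
$firstChar = mb_substr($string, 0, 1); // the first character, using the internal encoding set above

$byteLength = strlen($string);    // counts bytes
$charLength = mb_strlen($string); // counts characters
```

Each of these Thai characters takes three bytes in UTF-8, which is why echoing $string[0] on its own produces broken output.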

PHP: html_entity_decode removing/not showing character

I am having a problem with  character on my website.
I have a website where users can use a WYSIWYG editor (ckeditor) to fill out their profile. The content is run through htmlpurify before being put into a database (for security reasons).
The database has all tables setup with UTF-8 charset. I also call 'SET NAMES utf-8' at the beginning of script execution to prevent problems (which has worked for years, as I haven't had this problem in a long time). The webpage the text is displayed on has a content-type of utf-8 and I also use the header() function to set the content-type and charset as well.
When displaying the text all seemed fine until I tried running a regular expression on the content. html_entity_decode (called with the encoding param of 'utf-8') is removing/not showing the  character for some reason and it leaves behind something which is causing all of my regexes to fail (it seems there is a character there but I cannot view it in the source).
How can I prevent and/or remove this character so I can run the regular expression?
EDIT: I have decided to abandon ckeditor and go with the markdown format like this site uses to have more flexibility. I have hated wysiwyg editors for as long as I remember. Updating all the profiles to the new format will give me a chance to remove all of the offending text and give the site a clean start. Thanks for all the input.
You are probably facing the situation that the string is not actually properly UTF-8 encoded (you wrote that it is, but it isn't). html_entity_decode might then remove any invalid UTF-8 byte sequences (e.g. a single-byte-charset encoding of Â) or replace them with a substitution character.
Depending on the PHP version you're using you've got more control how to deal with this by making use of the flags.
Additionally to find the character you can't see, create a hexdump of the string.
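A hexdump can be produced with bin2hex(). The helper below is a hypothetical sketch (the name hexDump is not a built-in), and the sample string with a non-breaking space is only an assumed example of an invisible character.

```php
<?php
// Hypothetical helper: space-separated hex bytes of a string.
function hexDump(string $text): string
{
    return implode(' ', str_split(bin2hex($text), 2));
}

// Example: "a", a UTF-8 encoded non-breaking space (U+00A0), then "b".
$sample = "a\u{00A0}b";
$dump   = hexDump($sample);

// mb_check_encoding reports whether the bytes form valid UTF-8.
$isUtf8 = mb_check_encoding($sample, 'UTF-8');
```

In the dump, a valid UTF-8 non-breaking space shows up as the two bytes c2 a0, while a lone a0 byte would indicate a single-byte charset.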
Since the character you are talking about exists within the ANSI charset, you can do this:
utf8_encode(preg_replace($match, $replace, utf8_decode($utf8_text)));
This will however destroy any unicode character not existing within the ANSI charset. To avoid this you can always try using mb_ereg_replace which has multibyte (unicode) support:
string mb_ereg_replace ( string $pattern , string $replacement , string $string [, string $option = "msr" ] )
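As a rough sketch of the mb_ereg_replace route, assuming the mbstring extension with regex support is available and that the text is UTF-8 (the sample text is made up for illustration):

```php
<?php
// Tell the mb_ereg_* functions that patterns and subjects are UTF-8.
mb_regex_encoding('UTF-8');

// Sample text containing i-diaeresis (U+00EF), an en dash (U+2013) and e-acute (U+00E9).
$text = "na\u{00EF}ve \u{2013} caf\u{00E9}";

// Replace the en dash with a plain hyphen; mb_ereg patterns take no delimiters.
$result = mb_ereg_replace("\u{2013}", '-', $text);
```

Unlike the utf8_decode round trip, the other non-ASCII characters in the string pass through unchanged.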

remove invalid chars from html document

I have a bunch of files which are supposed to be HTML documents for the most part. However, sometimes the editor(s) copied and pasted text from other sources into them, so now I come across some weird chars every now and then, for example a non-encoded copyright sign, or weird things that look like a dash or minus but are something else (ascii #146?), or a single char that looks like "...".
I had a look at get_html_translation_table(); however, this will only replace the "usual" special chars like &, euro signs etc. It seems like I need a regex that specifies only the allowed chars and discards all the unknown chars. I tried this, but it didn't work at all:
function fixNpChars($string)
{
//characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference.
$pattern = '/[\x{0000}-\x{0008}][\x{000B}-\x{000C}][\x{000E}-\x{001F}][\x{0080}-\x{009F}][x{007F}]/u';
$replacement = '';
return preg_replace($pattern, $replacement, $string);
}
Any idea whats wrong here?
EDIT:
The database where i store my imported files and the php side is all set to utf-8 (content type utf-8, db table charset utf8/utf8_general_ci, mysql_set_charset('utf8',$this->mHandle); executed after db connection is established. Most of the imported files are either utf8 or iso-8859-1.
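Your regex syntax looks a little problematic: separate bracket groups have to match one after another, and the last group is missing its backslash. Maybe a single character class like this?:
$pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u';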
Don't think of removing the invalid characters as the best option, this problem can be solved using htmlentities and html_entity_decode functions.
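Since the imported files are a mix of encodings, one possible approach is to normalise to UTF-8 first and strip the disallowed code points afterwards. The sketch below uses a hypothetical helper name and assumes that anything which is not valid UTF-8 is Windows-1252 (typical for copy and paste from office documents, but only an assumption here).

```php
<?php
// Hypothetical helper; assumes non-UTF-8 input is Windows-1252.
function stripInvalidHtmlChars(string $text): string
{
    // Convert legacy input first, so bytes such as 0x92 become real characters (U+2019).
    if (!mb_check_encoding($text, 'UTF-8')) {
        $text = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
    }

    // One character class covering 00-08, 0B-0C, 0E-1F, 7F and 80-9F as code points.
    $clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u', '', $text);

    // preg_replace returns null on failure; fall back to the converted text.
    return $clean ?? $text;
}
```

Converting before stripping means a Windows-1252 apostrophe (byte 0x92) survives as a proper character instead of being deleted as a control code.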

Correct character encoding

I'm currently scraping a website for various pieces of textual data (with permission, of course). The issue I'm seeing is that certain characters aren't correctly encoded in the process. This is particularly prominent with apostrophes ('): leading to characters such as: .
Currently, I use the following code to convert various HTML entities from the scraped data:
htmlentities($content, ENT_COMPAT, 'UTF-8', FALSE)
Is there a better way to handle this sort of thing?
HTML entities have two goals:
Escape characters that have a special meaning in HTML, such as angle quotes, so they can be used as literals.
Display characters that are not supported by the character set you are using, such as the euro symbol in an ISO-8859-1 document.
They are not exactly an encoding tool.
If you want to convert from one charset into another one, I suggest you use iconv(). However, you must know both the source and the target charset. The source charset should be mentioned in the Content-Type response header and the target charset is something you decided when you started the site (although in your case it looks like UTF-8 is the most reasonable option).
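A minimal sketch of such a conversion, assuming the scraped page turned out to be ISO-8859-1 and the target is UTF-8 (the sample string is illustrative):

```php
<?php
// "München" as ISO-8859-1 bytes: the u-umlaut is the single byte 0xFC.
$latin1 = "M\xFCnchen";

// iconv(from, to, string): both charsets must be known in advance.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
```

After the conversion the umlaut occupies two bytes, so the string grows from 7 to 8 bytes while still holding 7 characters.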
You don't want to use htmlentities right away; I would use that on the data at the last point before you store it. One of the problems you'll run into is that people don't always encode their entities properly anyway. Not everyone uses &trade;, they just copy the trademark symbol in. If you put some logic in to try and grab whatever they put in and encode it properly you may be better off. For example:
// Each pattern is paired with the replacement at the same index.
$patterns = array();
$patterns[0] = '/—/';
$patterns[1] = '/&nbsp;/';
$patterns[2] = '/®/';
$replacements = array();
$replacements[0] = '&#8212;';
$replacements[1] = '&#160;';
$replacements[2] = '&#174;';
$ourhtml = preg_replace($patterns, $replacements, $html);
You could find all the "gotcha" characters like dashes and single quotes, apostrophes etc and encode them by hand, as well as use a set standard to the entities (text or numeric).
You could also use regular expressions to do the same thing, and would probably be a more elegant solution. But my suggestion would be to take some time filtering out what you don't want by hand, and then you know your data will be prepared exactly how you like.
It's a little bit difficult to suggest things based on the information provided. Can you provide an example snippet of text maybe?
Failing that, I'll employ the shotgun approach (i.e., suggesting a bunch of things and hoping one of them hits).
First of all, are you sure the page you're accessing is encoded in UTF-8? What does mb_detect_encoding say?
One option (may not work depending on your needs) would be to use iconv with the TRANSLIT option to convert the characters into something easier to handle using PHP. You could also look at using the mb_* functions for working with multibyte strings.
Are you sure htmlentities is the problem? If the content is UTF-8, and your site is set to serve ISO-8859-1, you're going to see odd characters. Check the encoding your browser is using to make sure it matches the encoding of the characters you're producing.
I don't see any issue with using htmlentities() as long as you pass false as the last parameter. This will ensure that you don't encode anything twice (such as turning &amp; into &amp;amp;).
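The effect of that last parameter (double_encode) can be seen in a small sketch; the sample string is made up for illustration:

```php
<?php
// Input that already contains one entity and one bare ampersand.
$content = 'Tom &amp; Jerry & friends';

// double_encode = false: the existing &amp; is left alone, the bare & is encoded.
$once = htmlentities($content, ENT_COMPAT, 'UTF-8', false);

// double_encode = true (the default): the existing entity is encoded again.
$twice = htmlentities($content, ENT_COMPAT, 'UTF-8', true);
```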

Route-problem regarding Url-encoded Umlauts (using the Zend-framework)

Today I stumbled upon a problem which seems to be a bug in the Zend Framework. Given the following route:
<test>
<route>citytest/:city</route>
<defaults>
<controller>result</controller>
<action>test</action>
</defaults>
<reqs>
<city>.+</city>
</reqs>
</test>
and three Urls:
mysite.local/citytest/Berlin
mysite.local/citytest/Hamburg
mysite.local/citytest/M%FCnchen
the last Url does not match and thus the correct controller is not called. Anybody got a clue why?
FYI, we are using Zend Framework 1.0 (yeah, I know that's ancient, but I am not in a position to change that :-/ )
Edit: From what I hear, we are going to upgrade to Zend 1.5.6 soon, but I don't know when, so a Patch would be great.
Edit: I've tracked it down to the following line (Zend/Controller/Router/Route.php:170):
$regex = $this->_regexDelimiter . '^' .
$part['regex'] . '$' .
$this->_regexDelimiter . 'iu';
If I change that to
$this->_regexDelimiter . 'i';
it works. From what I understand, the u-modifier is for working with Asian characters. As I don't use them, I'm fine with that patch for now. Thanks for reading.
This works perfectly for me:
/^[\p{L}\-. ]*$/u
^ Start of the string
[ ... ]* Zero or more of the following:
\p{L} Unicode letter characters
\- dashes (escaped, so the hyphen is not read as a range)
. periods
(space) spaces
$ End of the string
/u Enable Unicode mode in PHP
EXAMPLE:
$str = 'Füße';
if (!preg_match('/^[\p{L}\-. ]*$/u', $str))
{
echo 'error';
}
else
{
echo 'success';
}
The problem is the following:
Using the /u pattern modifier prevents words from being mangled but instead PCRE skips strings of characters with code values greater than 127. Therefore, \w will not match a multibyte (non-lower ascii) word at all (but also won't return portions of it). From the pcrepattern man page:
In UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. This is true even when Unicode character property support is available.
From Handling UTF-8 with PHP.
Therefore it's actually irrelevant if your URL is ISO-8859-1 encoded (mysite.local/citytest/M%FCnchen) or UTF-8 encoded (mysite.local/citytest/M%C3%BCnchen), the default regex won't match.
I also made experiments with umlauts in URLs in Zend Framework and came to the conclusion that you wouldn't really want umlauts in your URLs. The problem is, that you cannot rely on the encoding used by the browser for the URL. Firefox (prior to 3.0) for example does not UTF-8 encode URLs entered into the address textbox (if not specified in about:config) and IE does have a checkbox within its options to choose between regular and UTF-8 encoding for its URLs. But if you click on links within a page both browsers use the URL in the given encoding (UTF-8 on an UTF-8 page). Therefore you cannot be sure in which encoding the URLs are sent to your application - and detecting the encoding used is not that trivial to do.
Perhaps it's better to use transliterated parameters in your URLs (e.g. change Ä to Ae and so on). There is a really simple way to do this (I don't know if this works with every language, but I'm using it with German strings and it works quite well):
function createUrlFriendlyName($name) // $name must be an UTF-8 encoded string
{
$name=mb_convert_encoding(trim($name), 'HTML-ENTITIES', 'UTF-8');
$name=preg_replace(
array('/ß/', '/&(..)lig;/', '/&([aouAOU])uml;/', '/&(.)[^;]*;/', '/\W/'),
array('ss', '$1', '$1e', '$1', '-'),
$name);
$name=preg_replace('/-{2,}/', '-', $name);
return trim($name, '-');
}
The u modifier makes the regexp expect utf-8 input. This would suggest that ZF expects utf-8 encoded input, and not ISO-8859-1 (I'm not too familiar with ZF, so I'm just guessing here).
If that's the case, you'll have to utf-8 encode the ü before using it in a URL. It would then become: mysite.local/citytest/M%C3%BCnchen
Note that since the rest of your application probably speaks ISO-8859-1 (Which is default for PHP <= 5), you will have to explicitly decode the variable with utf8_decode, before you can use it.
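If you cannot control which encoding the browser uses, one option is to normalise the decoded parameter yourself. The following is only a sketch with a hypothetical helper name; it assumes that anything which is not valid UTF-8 is ISO-8859-1, and it is not part of Zend Framework.

```php
<?php
// Hypothetical helper: decode a URL segment and return it as UTF-8.
// Assumes anything that is not valid UTF-8 is ISO-8859-1.
function decodeCityParam(string $raw): string
{
    $decoded = rawurldecode($raw);
    if (!mb_check_encoding($decoded, 'UTF-8')) {
        $decoded = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1');
    }
    return $decoded;
}

$fromLatin1 = decodeCityParam('M%FCnchen');
$fromUtf8   = decodeCityParam('M%C3%BCnchen');

// With the u modifier, a subject that is not valid UTF-8 makes preg_match fail (it returns false).
$matchesRaw = preg_match('/^.+$/iu', rawurldecode('M%FCnchen'));
```

Both URL forms end up as the same UTF-8 string, which can then be passed to a pattern that uses the u modifier.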
