preg_replace UTF-8 doesn't work [duplicate] - php

This question already has answers here:
Matching Unicode letter characters in PCRE/PHP
(5 answers)
Closed 4 years ago.
I've got the following code which works fine on my offline test version but it fails on the online server.
$names = "dimitris giannIs micHalis";
echo preg_replace("/s\b/", "w", mb_convert_case($names, MB_CASE_TITLE, "UTF-8"));
The result I get is Dimitriw Gianniw Michaliw.
My real data, however, is UTF-8 text rather than English words. The example above works fine as it is (in English), so I'm guessing I'm doing something wrong with UTF-8.

Typically (but see the note below the Edit), you need to use the u modifier on your regex to make it work with UTF-8 characters. e.g.
$words = "qθαεqθε γραεcισ cονσεcτε";
echo preg_replace("/ε\b/u", "α", mb_convert_case($words, MB_CASE_TITLE, "UTF-8"));
Output:
Qθαεqθα Γραεcισ Cονσεcτα
This example on rextester demonstrates the use of the u modifier (note that rextester doesn't support mb_convert_case but that doesn't really affect the result).
Edit
As was pointed out by @CasimiretHippolyte, it is possible to compile the PCRE extension (used by PHP for regex) to handle Unicode characters by default with the --enable-unicode-properties option. This may explain the difference between the results on the offline test version and the online server.
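To make the effect of the u modifier concrete, here is a minimal sketch. The sample string and variable names are made up for illustration. With /u, PHP reads both pattern and subject as UTF-8, so the Greek letter is a single character that \b can treat as a word character. Without /u the match runs byte by byte, and the result can depend on the PCRE build and locale, so no output is claimed for that case.

```php
<?php
// Hypothetical sample text; any UTF-8 string behaves the same way.
$words = "καλε φιλε";

// With /u, "ε" is one character and \b recognises it as a word character,
// so the trailing "ε" of each word is replaced.
$withU = preg_replace('/ε\b/u', 'α', $words);

// Without /u, "ε" is two raw bytes; whether \b fires after the second
// byte depends on the PCRE build and locale.
$withoutU = preg_replace('/ε\b/', 'α', $words);

echo $withU, "\n";     // καλα φιλα
echo $withoutU, "\n";  // environment dependent
```

If the two lines differ on one machine and match on another, that points to the same kind of environment difference described in the question.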

Related

Writing source code in PHP without special characters [duplicate]

This question already has answers here:
Unicode character in PHP string
(8 answers)
Closed 4 years ago.
Is there a way to print special characters in PHP using only ASCII characters in the source code?
For example, in JavaScript, we can use \u00e1 in the middle of text.
In Java we can use \u2202 for example.
And in PHP? How can I use it?
I don't want to include special chars in my source code.
I found 3 ways for this.
Php Documentation: http://php.net/manual/en/language.types.string.php#language.types.string.syntax.double
A good explanation in Portuguese: https://pt.stackoverflow.com/questions/293500/escrevendo-c%C3%B3digo-em-php-sem-caracteres-especiais
Syntax added in PHP 7:
\u{[0-9A-Fa-f]+}
The sequence of characters matching this regular expression is a Unicode code point, which is output to the string as that code point's UTF-8 representation.
Examples:
<?php
echo "\u{00e1}\n";
echo "\u{2202}\n";
echo "\u{aa}\n";
echo "\u{0000aa}\n";
echo "\u{9999}\n";
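Syntax for PHP 7 and older PHP versions:
\x[0-9A-Fa-f]{1,2}
The sequence of characters matching this regular expression is a single byte in hexadecimal notation, so a multi-byte UTF-8 character has to be written byte by byte.
Examples:
<?php
echo "\xc3\xa1\n";   // the two UTF-8 bytes of U+00E1
echo "\u{00e1}\n";   // the same character via the PHP 7 syntax, for comparison
Using integer-to-byte conversion functions: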
<?php
printf('%c%c', 0xC3, 0xA1);
echo chr(0xC3) . chr(0xA1);
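As a quick check that the three approaches above are equivalent, the following sketch builds the same character each way and compares the resulting bytes. The variable names are only illustrative, and the first line needs PHP 7 or later.

```php
<?php
// PHP 7+: Unicode code point escape, stored as UTF-8 bytes.
$fromCodePoint = "\u{00e1}";

// Any PHP version: the two UTF-8 bytes written as hex escapes.
$fromHexBytes = "\xc3\xa1";

// Any PHP version: the same two bytes built from integers.
$fromChr = chr(0xC3) . chr(0xA1);
$fromSprintf = sprintf('%c%c', 0xC3, 0xA1);

// bin2hex() shows the raw bytes, independent of terminal encoding.
echo bin2hex($fromCodePoint), "\n"; // c3a1
echo bin2hex($fromHexBytes), "\n";  // c3a1
echo bin2hex($fromChr), "\n";       // c3a1
echo bin2hex($fromSprintf), "\n";   // c3a1
```

All four strings hold identical bytes, so the choice is mainly about readability and the minimum PHP version you need to support.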
printf() Extended Unicode Characters?
http://phptester.net/

character encoding for mixed data [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 8 years ago.
I'm having an issue with getting the correct character encoding for data being POSTed which is built up from multiple sources (I get the data as a single POST variable). I think they're not in the same character encoding...
For instance, take the symbol £. If I do nothing to the character encoding I get two results:
a = £ and b = £
I've tried using various configurations of iconv(), like so:
$data = iconv('UTF-8', 'windows-1252//TRANSLIT', $_POST['data']);
The above results in a = £ and b = �
I've also tried utf8_encode/decode as well as html_entity_decode, as I think there's a possibility that one of the pound symbols is being generated using html_entities.
I've tried setting the character encoding in the header which didn't work. I just can't get both instances to work at the same time.
I'm not sure what to try next, any ideas?
I've managed to work around this issue by finding the piece of content that was not in UTF-8 (everything else was) and converting it with utf8_encode().
This appears to work for the £ symbol. I've not found any other characters causing an issue so far.
Note that I am still using iconv() in conjunction with this.
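The workaround described above can be sketched roughly as follows. The helper name ensure_utf8 is hypothetical, and it assumes that any piece that is not valid UTF-8 is ISO-8859-1, which is the same assumption utf8_encode() makes. It has to be applied to each source separately, before the pieces are joined into one value.

```php
<?php
// Hypothetical helper: returns $value as UTF-8.
// Assumption: input that is not valid UTF-8 is ISO-8859-1.
function ensure_utf8(string $value): string
{
    // With the u modifier, preg_match() returns 1 only if the subject
    // is valid UTF-8; invalid input makes it return false.
    if (preg_match('//u', $value) === 1) {
        return $value;
    }
    $converted = iconv('ISO-8859-1', 'UTF-8', $value);
    // Fall back to the original bytes if the conversion fails.
    return $converted === false ? $value : $converted;
}

$a = ensure_utf8("\xC2\xA3"); // pound sign already encoded as UTF-8
$b = ensure_utf8("\xA3");     // pound sign encoded as ISO-8859-1

echo bin2hex($a), " ", bin2hex($b), "\n"; // c2a3 c2a3
```

Checking validity first avoids double encoding, which is what typically produces the "Â£" form shown in the question.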

preg_match with cyrillic text [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to match Cyrillic characters with a regular expression
I have a simple PHP script which uses preg_match to compare a string against some Cyrillic text inside a variable (e.g. $var = 'страница').
However, when I put the Cyrillic text into the variable, it shows up as ???????? in my code.
$var1 = '/?????????????/';
I get the following warning when I run the script:
preg_match(): Compilation failed: nothing to repeat at offset 0
Can anyone suggest a solution?
thanks very much.
Change the encoding of your script, or of all project source files, to UTF-8, for example in your IDE.
Use the u modifier for Unicode:
preg_match('/abcdef/u', $some_string);
Maybe it's caused by a wrong code page. Which code page does your interpreter use, and which one does the database connection (if any) use?
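For context, the warning in the question arises because ? is a regex quantifier: once the editor has replaced the Cyrillic letters with question marks, the pattern starts with a quantifier that has nothing to repeat. A minimal sketch of the working version, assuming the file itself is saved as UTF-8 (the sample text is made up):

```php
<?php
// This file must be saved as UTF-8 so the literals below survive.
$var = 'страница';
$text = 'Это страница сайта'; // hypothetical sample input

// The u modifier makes PCRE treat pattern and subject as UTF-8.
$found = preg_match('/' . preg_quote($var, '/') . '/u', $text);

// Alternative that keeps the pattern itself ASCII-only:
// match by Unicode script instead of literal letters.
$allCyrillic = preg_match('/^\p{Cyrillic}+$/u', $var);

var_dump($found);       // int(1)
var_dump($allCyrillic); // int(1)
```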

Unknown character � after importing excel to MySQL, how to avoid it? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Problem in utf-8 encoding PHP + MySQL
I've imported about 1000 records into MySQL from an Excel file. But now I'm seeing � in some of the text. It seems these were double quotes.
How can I avoid this while importing data?
Can I use str_replace() function to handle this issue while printing data in web page?
Use preg_replace to do a regex replacement of all unrecognized characters.
Example:
$data = preg_replace("/[^a-zA-Z0-9]/", "", $data);
This example removes all non-alphanumeric characters (anything that is not a-z, A-Z, 0-9), which also includes spaces and punctuation.
http://php.net/manual/en/function.preg-replace.php
If your database is simple enough (no serialised values and no gigabytes in size), you could export it entirely (e.g. using PhpMyAdmin), open in a text editor, do search-replace and import it back.
$original_string = str_replace('“', '"', $original_string);
There are a few characters Word does this with, so you will probably also want to do:
$original_string = str_replace("‘", "'", $original_string);
If you see other characters causing the same issue, you can open the document in Word, copy and paste the offending character into your editor, and do a similar replacement.
Since you are most likely looking to replace the character with an equivalent version, you probably do not want to use a regex as suggested in another answer. str_replace is also faster than preg_replace for this type of use.
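The individual replacements above can be combined into one call. The following is only a sketch: the helper name normalize_quotes is hypothetical, the character list is illustrative rather than complete, and it assumes the text reaches PHP as valid UTF-8. The \u{...} escapes need PHP 7 or later.

```php
<?php
// Hypothetical helper: maps common Word "smart" quotes to ASCII.
// The map is illustrative, not exhaustive.
function normalize_quotes(string $text): string
{
    $map = [
        "\u{2018}" => "'", // left single quotation mark
        "\u{2019}" => "'", // right single quotation mark / apostrophe
        "\u{201C}" => '"', // left double quotation mark
        "\u{201D}" => '"', // right double quotation mark
    ];
    // str_replace() accepts parallel arrays of search and replace values.
    return str_replace(array_keys($map), array_values($map), $text);
}

echo normalize_quotes("\u{201C}It\u{2019}s fine\u{201D}"), "\n"; // "It's fine"
```

Writing the search characters as escapes keeps the source file ASCII-only, so the replacement does not depend on the encoding the script is saved in.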

Convert special character (i.e. Umlaut) to most likely representation in ascii [duplicate]

This question already has answers here:
PHP: Replace umlauts with closest 7-bit ASCII equivalent in an UTF-8 string
(7 answers)
Closed 9 years ago.
I am looking for a method, or maybe a conversion table, that knows how to convert umlauts and special characters to their most likely representation in ASCII.
Example:
Ärger = aerger
Bôhme = bohme
Søren = soeren
pjérà = pjera
Anyone any idea?
Update:
Apart from the good accepted answer, I also found PECL's Normalizer quite interesting, though I cannot use it because the server does not have it and will not be changed for me.
Also check out this question if the answers here do not help you enough.
I find iconv completely unreliable, and I dislike preg_match solutions and big arrays ... so my favorite way is ...
function toASCII($str)
{
    // utf8_decode() converts to single-byte ISO-8859-1 so strtr() can map byte for byte.
    return strtr(
        utf8_decode($str),
        utf8_decode('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy'
    );
}
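A different approach from the function above, shown only as a sketch: strtr() with an array argument can map UTF-8 characters to multi-letter ASCII replacements directly, which is what the examples in the question (Ärger = aerger, Søren = soeren) call for. The helper name to_ascii_sketch and the small map are hypothetical and cover just the characters from the question.

```php
<?php
// Hypothetical helper; the map is a small illustrative subset.
// Assumes $str is UTF-8 and this file is saved as UTF-8.
function to_ascii_sketch(string $str): string
{
    $map = [
        'Ä' => 'Ae', 'ä' => 'ae',
        'Ö' => 'Oe', 'ö' => 'oe',
        'Ü' => 'Ue', 'ü' => 'ue',
        'Ø' => 'Oe', 'ø' => 'oe',
        'ß' => 'ss',
        'ô' => 'o', 'é' => 'e', 'à' => 'a',
    ];
    // The array form of strtr() replaces whole keys, so multi-byte
    // characters and multi-letter replacements both work.
    // strtolower() then lowercases the now-ASCII result.
    return strtolower(strtr($str, $map));
}

foreach (['Ärger', 'Bôhme', 'Søren', 'pjérà'] as $word) {
    echo $word, ' = ', to_ascii_sketch($word), "\n";
}
```

Characters missing from the map pass through unchanged, so a real table would need to be extended for the languages you expect.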
