I'm looking for a way to convert characters like āžšķūņrūķīš to azskunrukis. In other words, to replace ā with a, ž with z and so on. Is there anything built in, or should I create my own "library" of from-to symbols?
Take a look at iconv's transliteration capabilities:
<?php
$text = "This is the Euro symbol '€'.";
echo 'Original : ', $text, PHP_EOL;
echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL;
echo 'IGNORE : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL;
echo 'Plain : ', iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
?>
The above example will output something similar to:
Original : This is the Euro symbol '€'.
TRANSLIT : This is the Euro symbol 'EUR'.
IGNORE : This is the Euro symbol ''.
Plain :
Notice: iconv(): Detected an illegal character in input string in .\iconv-example.php on line 7
This is the Euro symbol '
Your example text can be transliterated using:
$translit = iconv('UTF-8', 'US-ASCII//TRANSLIT', 'āžšķūņrūķīš');
Here's an example with the text you provided: http://ideone.com/MJHvf
I'm not sure of any functions that do this directly, but there are some implementations of translation tables that do something like that in the comments on strtr's documentation page. They end up using a table that directly translates each character to its equivalent, i.e. "ž" => "z".
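A minimal sketch of the table approach, assuming the source file is saved as UTF-8. The function name translit_table is made up for illustration, and the table only covers the Latvian letters, so extend it for other alphabets:

```php
<?php
// Hypothetical helper: maps each accented character to an ASCII stand-in.
// The table below is illustrative and only covers Latvian letters.
function translit_table(string $text): string
{
    $map = [
        'ā' => 'a', 'č' => 'c', 'ē' => 'e', 'ģ' => 'g', 'ī' => 'i',
        'ķ' => 'k', 'ļ' => 'l', 'ņ' => 'n', 'š' => 's', 'ū' => 'u',
        'ž' => 'z',
    ];
    // With an array argument, strtr() replaces whole keys, so the
    // multi-byte UTF-8 characters are matched as complete sequences.
    return strtr($text, $map);
}

echo translit_table('āžšķūņrūķīš'), PHP_EOL; // azskunrukis
```

Characters that are not in the table pass through unchanged, which makes gaps in the table easy to spot.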
As an alternative to iconv, you could check out the Normalizer class of the intl extension (if available).
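One way the intl route could look, as a rough sketch: decompose the text and then drop the combining marks. The helper name strip_accents is hypothetical, and the code falls back to returning the input unchanged when intl is not loaded:

```php
<?php
// Hypothetical helper; needs the intl extension for the Normalizer class.
function strip_accents(string $text): string
{
    if (!class_exists('Normalizer')) {
        return $text; // intl not loaded: leave the input untouched
    }
    // Form D splits "ā" into "a" followed by a combining macron (U+0304).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // normalization failed, e.g. invalid UTF-8
    }
    // \p{Mn} matches the non-spacing combining marks left behind.
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

echo strip_accents('āžšķūņrūķīš'), PHP_EOL;
```

This only helps for letters that have a decomposition. Letters such as ł or ø do not decompose, so they would still need a table entry.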
In my string I have utf-8 non-breaking space (0xc2a0) and I want to replace it with something else.
When I use
$str=preg_replace('~\xc2\xa0~', 'X', $str);
it works OK.
But when I use
$str=preg_replace('~\x{C2A0}~siu', 'W', $str);
non-breaking space is not found (and replaced).
Why? What is wrong with second regexp?
The format \x{C2A0} is correct, also I used u flag.
The documentation on escape sequences in PHP is misleading here. The pattern \xc2\xa0 (without the u flag) matches two raw bytes, which happen to be the UTF-8 encoding of the non-breaking space. The \x{...} syntax with the u flag takes a Unicode code point instead, so \x{c2a0} looks for the character U+C2A0, not for the bytes C2 A0.
A non-breaking space is U+00A0 in Unicode but is encoded as the bytes C2 A0 in UTF-8. So if you use the pattern ~\x{00a0}~siu, it will work as expected.
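To see the difference side by side, here is a small sketch that runs the three patterns against a string with one non-breaking space. The variable names are only for illustration:

```php
<?php
$nbsp = "\xC2\xA0";          // UTF-8 bytes of U+00A0
$text = 'a' . $nbsp . 'b';

// Byte pattern, no u flag: matches the two raw bytes.
$byBytes = preg_replace('~\xc2\xa0~', 'X', $text);

// Code point pattern with the u flag: matches U+00A0.
$byCodePoint = preg_replace('~\x{00A0}~u', 'X', $text);

// U+C2A0 is a different character, so nothing is replaced.
$wrongCodePoint = preg_replace('~\x{C2A0}~u', 'X', $text);

var_dump($byBytes === 'aXb');
var_dump($byCodePoint === 'aXb');
var_dump($wrongCodePoint === $text);
```

All three var_dump calls should report true.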
I've aggregated the previous answers so people can just copy / paste the following code and choose their favorite method:
$some_text_with_non_breaking_spaces = "\xc2\xa0\xc2\xa0some text with 2 non breaking spaces at the beginning";
echo 'Qty non-breaking space : ' . substr_count($some_text_with_non_breaking_spaces, "\xc2\xa0") . '<br>';
echo $some_text_with_non_breaking_spaces . '<br>';
# Method 1 : regular expression
$clean_text = preg_replace('~\x{00a0}~siu', ' ', $some_text_with_non_breaking_spaces);
# Method 2 : convert to hex -> replace -> convert back to binary
$clean_text = hex2bin(str_replace('c2a0', '20', bin2hex($some_text_with_non_breaking_spaces)));
# Method 3 : my favorite
$clean_text = str_replace("\xc2\xa0", " ", $some_text_with_non_breaking_spaces);
echo 'Qty non-breaking space : ' . substr_count($clean_text, "\xc2\xa0"). '<br>';
echo $clean_text . '<br>';
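One caveat with Method 2 that is worth checking before you pick it: the hex string has no byte boundaries, so 'c2a0' can also match across two neighbouring bytes. A small sketch with a made-up three-byte input that contains no non-breaking space at all:

```php
<?php
// Illustrative input: bytes 0x0C 0x2A 0x00, no non-breaking space in it.
$data = "\x0C\x2A\x00";

$hex = bin2hex($data);                        // "0c2a00"
// 'c2a0' is found at an odd offset, straddling byte boundaries.
$replaced = str_replace('c2a0', '20', $hex);  // "0200"
$broken = hex2bin($replaced);                 // two bytes: 0x02 0x00

// The byte-level replacement (Method 3) leaves this input alone.
$safe = str_replace("\xC2\xA0", ' ', $data);

var_dump($broken === $data); // false: the data was corrupted
var_dump($safe === $data);   // true
```

For ordinary text this collision is unlikely, but Method 3 avoids it entirely.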
In my opinion the two patterns do different things: the first, \xc2\xa0, replaces TWO bytes, \xc2 followed by \xa0.
In UTF-8 encoding, this happens to be the codepoint for U+00A0.
Does \x{00A0} work? This should be the representation for \xc2\xa0.
The variant ~\x{c2a0}~siu did not work for me.
The variant \x{00A0} works. I also tried a second option, and here is the result:
I tried to convert it to hex and replace no-break space 0xC2 0xA0 (c2a0) to space 0x20 (20).
Code:
$hex = bin2hex($item);
$_item = str_replace('c2a0', '20', $hex);
$item = hex2bin($_item);
All three approaches (/\x{00A0}/, /\xC2\xA0/ and the bin2hex / str_replace / hex2bin round trip) both worked and didn't: printed to the screen the result looked fine, but when I tried to save it to a file, the file came out blank!
I ended up using iconv('UTF-8', 'ISO-8859-1//IGNORE', $str);
I have a string (taken from a MySQL database if it makes any difference) which looks normal enough:
Manufacture: Blah
The problem is that the space between Manufacture: and the <a> tag has a charcode of 194, not 32 as I would expect.
This is causing a preg_match with the following pattern to fail (please ignore the attempts to parse HTML with regex, I know it's not a good idea but this particular dataset is predictable enough to get away with it):
/Manufacture: *(<a[^>]*>([A-Za-z- 0-9]+)<\/a>)/i
If I replace the rogue space with a normal space character in a text editor and try again, the expression matches as expected, but I need to alter it programmatically.
I tried str_replace:
$text = str_replace(chr(194), ' ', $text);
But the preg_match still fails. I then tried preg_replace:
$text = preg_replace('/[\xC2]/', ' ', $text);
But that doesn't work either, even though running that same pattern through preg_match does contain the expected match.
Does anyone have any ideas?
Can you please check the structure of the MySQL table that the contents of $text come from? If the collation is utf8_general_ci or something similar, your string most likely contains a multi-byte UTF-8 character. Character code 194 is 0xC2, the lead byte of a two-byte UTF-8 sequence (a non-breaking space, for example, is 0xC2 0xA0), so replacing chr(194) alone leaves the second byte behind.
If that is the case, the PHP function iconv should do the trick. Here's the example from the PHP manual. The //IGNORE option drops characters that cannot be represented in the target charset.
<?php
$text = "This is the Euro symbol '€'.";
echo 'Original : ', $text, PHP_EOL;
echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL;
echo 'IGNORE : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL;
echo 'Plain : ', iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
?>
The above example will output something similar to:
Original : This is the Euro symbol '€'.
TRANSLIT : This is the Euro symbol 'EUR'.
IGNORE : This is the Euro symbol ''.
Plain :
Notice: iconv(): Detected an illegal character in input string in .\iconv-example.php on line 7
This is the Euro symbol '
What if you try to match any whitespace character instead?
Like so:
/Manufacture:\s*(<a[^>]*>([A-Za-z- 0-9]+)<\/a>)/i
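Here is a small sketch of the whole problem, assuming the rogue character really is a UTF-8 non-breaking space (bytes 0xC2 0xA0). The href value is a placeholder, since the real markup isn't shown in the question:

```php
<?php
// Sample input: "Manufacture:" followed by a UTF-8 non-breaking space.
// The link markup is a made-up stand-in for the real data.
$text = "Manufacture:\xC2\xA0<a href=\"#\">Blah</a>";
$pattern = '/Manufacture: *(<a[^>]*>([A-Za-z- 0-9]+)<\/a>)/i';

// The original pattern fails because of the two extra bytes.
$before = preg_match($pattern, $text);

// Replacing only chr(194) leaves the stray 0xA0 byte in place.
$halfFixed = str_replace(chr(194), ' ', $text);
$stillFails = preg_match($pattern, $halfFixed);

// Replacing both bytes together fixes the match.
$fixed = str_replace("\xC2\xA0", ' ', $text);
$after = preg_match($pattern, $fixed, $m);

var_dump($before, $stillFails, $after); // int(0), int(0), int(1)
echo $m[2], PHP_EOL;                    // Blah
```

Running bin2hex() on a short slice of the real string is a quick way to confirm which bytes are actually there before choosing the replacement.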
I want to remove all HTML entities like &quot; &euro; &aacute; ... from a string using a regex.
String: "This is a string &quot; &euro; &aacute; &amp;"
Output Required: This is a string
you can try
$str = 'This is a string &quot; &euro; &aacute; &amp;';
$new_str = preg_replace("/&#?[a-z0-9]+;/i",'',$str);
echo $new_str;
I hope this works.
DESC:
& - starts with an ampersand
#? - optionally followed by # (numeric entities use the # sign)
[a-z0-9]+ - followed by one or more letters or digits
; - ends with a semicolon
i - case insensitive.
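As a quick check of that pattern, here is a sketch that assumes the input holds literal entities such as &quot; and also throws in two numeric ones. The trailing rtrim() is my addition, to drop the spaces that used to separate the entities:

```php
<?php
// Sample input with named, decimal and hexadecimal entities.
$str = 'This is a string &quot; &euro; &aacute; &amp; &#8364; &#x20AC;';

// Remove anything shaped like &name; or &#123; or &#x1F;
$stripped = preg_replace('/&#?[a-z0-9]+;/i', '', $str);

// The spaces between the removed entities remain, so trim the tail.
$clean = rtrim($stripped);

echo '[', $clean, ']', PHP_EOL; // [This is a string]
```

Note that the pattern removes anything with that shape, whether or not it is a real entity name.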
If you're trying to totally remove entities (i.e. not decode them), then try this:
$string = 'This is a string &quot; &euro; &aacute; &amp;';
$pattern = '/&([#0-9A-Za-z]+);/';
echo preg_replace($pattern, '', $string);
$str = preg_replace_callback('/&[^; ]+;/', function($matches){
return html_entity_decode($matches[0], ENT_QUOTES) == $matches[0] ? $matches[0] : '';
}, $str);
This will work, but won't strip &euro; since that is not an entity in HTML 4. If you have PHP 5.4 or later, you can use the flags ENT_QUOTES | ENT_HTML5 to have it work correctly with HTML5 entities like &euro;.
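The callback idea with the HTML5 flags could be wrapped up roughly like this. The function name strip_known_entities and the sample input are made up for the example, and the charset is assumed to be UTF-8:

```php
<?php
// Hypothetical helper: removes only the entities PHP can decode and
// leaves anything else that merely looks like an entity in place.
function strip_known_entities(string $text, int $flags): string
{
    return preg_replace_callback(
        '/&[^; ]+;/',
        function (array $m) use ($flags): string {
            $decoded = html_entity_decode($m[0], $flags, 'UTF-8');
            // Unchanged after decoding means it was not a known entity.
            return $decoded === $m[0] ? $m[0] : '';
        },
        $text
    );
}

// "&zzz;" is not a real entity, so it should survive.
$input = 'a&quot;&euro;&apos;&zzz;b';
echo strip_known_entities($input, ENT_QUOTES | ENT_HTML5), PHP_EOL; // a&zzz;b
```

Compared with the plain regex answers, this keeps text like "&zzz;" that only happens to look like an entity.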
preg_replace('#&[^;]+;#', '', 'This is a string &quot; &euro; &aacute; &amp;');
Try this:
preg_replace('/[^\w\d\s]*/', '', htmlspecialchars_decode($string));
Although it might remove some things you don't want removed. You may need to modify the regex.
str_replace does not replace accented letters by letters without accent. What's wrong with that?
This returns the expected result:
<?php
$string = get_post_custom_values ("text");
// Say get_post_custom_values ("text") equals "José José"
$string = str_replace(" ", "-", $string);
echo $string [0];
// Output "José-José"
?>
This does not work:
<?php
$string = get_post_custom_values ("text");
// Say get_post_custom_values ("text") equals "José José"
$string = str_replace("é", "e", $string);
echo $string [0];
// Output "José José". Nothing has changed
?>
Note: Translated from the Portuguese language with GoogleTranslate.
The easy, safe way to strip the accent from every accented letter is to use iconv:
setlocale(LC_ALL, "fr_CA.utf8"); // for instance
$output = iconv("utf-8", "ascii//TRANSLIT", $input);
Your current problem is most likely caused by a different encoding.
The character é as saved in your source code is not in the same encoding as the data you get back from get_post_custom_values. Encoding doesn't match → not recognized as the same character → not replaced.
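The mismatch is easy to reproduce with explicit byte strings. This sketch assumes the needle is UTF-8 and compares a UTF-8 haystack with an ISO-8859-1 one (the variable names are illustrative):

```php
<?php
$needle = "\xC3\xA9";                 // "é" encoded as UTF-8 (two bytes)
$utf8   = "Jos\xC3\xA9 Jos\xC3\xA9";  // "José José" in UTF-8
$latin1 = "Jos\xE9 Jos\xE9";          // "José José" in ISO-8859-1

// Same encoding on both sides: the replacement happens.
$fromUtf8 = str_replace($needle, 'e', $utf8);

// Different encodings: the bytes never match, so nothing changes.
$fromLatin1 = str_replace($needle, 'e', $latin1);

echo $fromUtf8, PHP_EOL;                  // Jose Jose
var_dump($fromLatin1 === $latin1);        // bool(true)
echo bin2hex($needle), ' vs ', bin2hex("\xE9"), PHP_EOL; // c3a9 vs e9
```

Printing bin2hex() of the real data next to bin2hex('é') from your source file shows quickly whether the two encodings agree.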
OK I have read many threads and have found some options that work but now I am just more curious than anything...
I'm trying to remove characters like Â and é, as Google does not like them in the XML product feed.
Why does this work:
But neither of these 2 do?
$string = preg_replace("/[^[:print:]]+/", ' ', $string);
$string = preg_replace("/[^[:print:]]/", ' ', $string);
To put it all in context here is the full function:
// Remove all unprintable characters
$string = ereg_replace("[^[:print:]]", ' ', $string);
// Convert back into HTML entities after printable characters removed
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
// Decode back
$string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
// Strip slashes and HTML tags
$string = strip_tags(stripslashes($string));
// Return the UTF-8 encoded string
return utf8_encode($string);
}
That code doesn't work because it only removes characters outside the POSIX [:print:] class, which consists of the printable characters. á, É, etc. are all printable, so they are left in place.
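A small sketch of what that means in practice, assuming UTF-8 input and the u modifier so that the pattern works on characters rather than bytes. The sample string contains an accented letter and a control character (bell, 0x07):

```php
<?php
// "café" in UTF-8, then a bell control character, then "!".
$input = "caf\xC3\xA9\x07!";

// With /u the class is Unicode-aware: "é" counts as printable and
// stays, while the control character is replaced by a space.
$unicodeAware = preg_replace('/[^[:print:]]/u', ' ', $input);

// Without /u the pattern runs byte by byte. In the default C locale
// the bytes of a multi-byte character are typically not in [:print:],
// so accented letters can be replaced as well.
$byteMode = preg_replace('/[^[:print:]]/', ' ', $input);

var_dump($unicodeAware === "caf\xC3\xA9 !");
echo bin2hex($byteMode), PHP_EOL;
```

So [:print:] on its own is the wrong tool for stripping accented letters; it is meant for control characters and the like.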
You can find more about posix sets here.
Also, removing accented characters might not always be the best option... Check out this question for alternatives.