I'm trying to decode files created in windows-1251 and encode them to UTF-8. Everything works except some special characters such as ÅÄÖåäö. E.g., Ä becomes Ž, which I then fix with preg_replace; that works fine, like below:
$file = preg_replace("/\Ž/", 'Ä', $file);
I'm having trouble with Å which shows up like this <U+008F>, which I see translates to single shift three and I can't seem to use preg_replace on it?
You have two major builtin functions to do the job, just pick one:
Multibyte String:
$file = mb_convert_encoding($file, 'UTF-8', 'Windows-1251');
iconv:
$file = iconv('Windows-1251', 'UTF-8', $file);
To determine why your homebrew alternative doesn't work we'd need to spend some time reviewing the complete codebase but I can think of some potential issues:
You're working with mixed encodings yet you aren't using hexadecimal notation or string entities of any kind. It's also unclear what encoding the script file itself is saved as.
There's no \Ž escape sequence in PCRE (no idea what the intention was).
Perhaps you're replacing some strings more than once.
Last but not least, have you compiled a complete and correct character mapping database of at least the 128 code points that differ between both encodings?
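A minimal sketch of the two built-in conversions, with the input written in hexadecimal notation so the result does not depend on how the script file itself is saved. The sample bytes and variable names are illustrative assumptions, not taken from the question's files.

```php
<?php
// "Привет" encoded as Windows-1251, written as hex escapes (one byte per letter).
$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";

// Both calls take the same bytes and return the UTF-8 encoding of the same text.
$viaMbstring = mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1251');
$viaIconv    = iconv('Windows-1251', 'UTF-8', $cp1251);

// bin2hex() shows the raw bytes, which sidesteps terminal and editor encodings.
echo bin2hex($viaMbstring), "\n";
echo bin2hex($viaIconv), "\n";
```

Comparing the hex dumps rather than the printed text is a quick way to check a conversion when the display encoding is itself in doubt.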
I have a file, which contains some cyrillic characters. When I open this file in Notepad++ I see, that it has ANSI encoding. If I manually encode it into UTF-8 using Notepad++, then everything is absolutely ok - I can use this file in my parsers and get results. But what I want is to do it programmatically, using PHP. This is what I tried after searching through SO and documentation:
file_put_contents($file, utf8_encode(file_get_contents($file)));
In this case when my algorithm parses the resulting files, it meets such letters as "è", "í" , "â". In other words, in this case I get some rubbish. I also tried this:
file_put_contents($file, iconv('WINDOWS-1252', 'UTF-8', file_get_contents($file)));
But it produces the very same rubbish. So, I really wonder how I can achieve programmatically what Notepad++ does. Thanks!
Notepad++ may report your encoding as ANSI but this does not necessarily equate to Windows-1252. 1252 is an encoding for the Latin alphabet, whereas 1251 is designed to encode Cyrillic script. So use
file_put_contents($file, iconv('WINDOWS-1251', 'UTF-8', file_get_contents($file)));
to convert from 1251 to utf-8 with iconv.
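The difference between the two source encodings can be seen on three bytes. This is a small sketch with made-up sample data (the bytes for the Cyrillic letters "инв"); it shows why declaring the input as Windows-1252 yields Latin letters such as "è", "í" and "â".

```php
<?php
// The Cyrillic letters "инв" as Windows-1251 bytes (hypothetical sample input).
$bytes = "\xE8\xED\xE2";

// Wrong source encoding: the same bytes are read as Latin letters (è, í, â).
$wrong = iconv('WINDOWS-1252', 'UTF-8', $bytes);

// Right source encoding: the bytes are read as the Cyrillic letters they were.
$right = iconv('WINDOWS-1251', 'UTF-8', $bytes);

echo $wrong, "\n", $right, "\n";
```

Neither call fails, because every one of these bytes is valid in both code pages. Only the declared source encoding decides which characters come out.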
So today I was updating some code I made that took some data from a webpage and emailed it to people for convenience. However, I noticed that whoever was typing the text used a program which used some other encoding which had a weird ’ character which was 0xD5 (213) in the Mac Roman set. But when they uploaded it to their website, it came out as Õ. So I used php and did this:
$parsed = str_ireplace("Õ", "'", $parsed);
So I did this and tested it, but it didn't seem to work. Can anyone help me? Thanks!
If this is just a single anomaly you're correcting you can specify it with a hex escape sequence like:
$parsed = str_replace("\xD5", "'", $parsed);
The reason just "Õ" isn't working is that the encoding of your PHP file doesn't represent Õ as 0xD5. Strings are just byte sequences, and what you're giving str_ireplace doesn't match. (Well, that and str_ireplace is gonna do funky things with it; str_replace is preferred here.)
More appropriate to handle the problem in general would be to use iconv to convert the input string from whatever its source encoding is into the output encoding you need.
Examples:
$parsed = iconv('MACINTOSH', 'UTF-8', $parsed);
or
$parsed = iconv('MACINTOSH', 'ASCII//TRANSLIT', $parsed);
The //TRANSLIT here means that when a character can't be represented in the target charset, it'll be approximated through one or several similarly looking characters. There's a lot ASCII (and others) can't represent, so transliteration can come in handy if you're not outputting UTF-8 (which would be ideal.)
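As a rough illustration of the byte in question, here is a sketch that converts a short Mac Roman string and also shows what happens when the same byte is read as ISO-8859-1. The sample string is an assumption, and the 'MACINTOSH' charset name relies on the local iconv implementation supporting it.

```php
<?php
// "It's" with a typographic apostrophe, as Mac Roman bytes (0xD5 is the apostrophe).
$macRoman = "It\xD5s";

// Read as Mac Roman, 0xD5 becomes U+2019 (right single quotation mark).
$utf8 = iconv('MACINTOSH', 'UTF-8', $macRoman);

// Read as ISO-8859-1, the very same byte becomes Õ, which is what the website showed.
$misread = iconv('ISO-8859-1', 'UTF-8', $macRoman);

echo $utf8, "\n", $misread, "\n";
```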
I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medúlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt
To transform a UTF-8 string into a URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $str);
The IGNORE part tells iconv() to skip characters it can't convert instead of failing (without it, iconv() emits a notice and returns false), and the TRANSLIT part converts a UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such). The exact transliteration depends on the iconv implementation and the current locale.
Next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within an URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing the charset of an existing table is somewhat like changing a file extension in Windows: it doesn't convert a JPEG into a GIF. But don't worry, and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
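The two steps described above (transliterate, then replace unsafe characters) could be combined roughly as follows. The function name slugify() and the fallback behaviour are assumptions made for this sketch, and the transliteration result varies between iconv implementations and locales, so treat it as a starting point rather than a finished routine.

```php
<?php
// Hypothetical helper: turn a UTF-8 string into a lowercase ASCII slug.
function slugify(string $text): string
{
    // Step 1: approximate non-ASCII characters; @ hides the notice on failure.
    $ascii = @iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $text);
    if ($ascii === false) {
        // Assumption for this sketch: fall back to the raw input if iconv gives up.
        $ascii = $text;
    }

    // Step 2: collapse every run of characters outside a-z and 0-9 into one underscore.
    $slug = preg_replace('/[^a-z0-9]+/', '_', strtolower($ascii));

    // Step 3: drop underscores left over at either end.
    return trim($slug, '_');
}

echo slugify('Hello World!'), "\n";
echo slugify("Med\xC3\xBAlla"), "\n"; // result depends on the local iconv
```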
I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters, e.g.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should encode it as UTF-8 and then %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
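A quick sketch of the difference between the two functions, using the page title from the example above written as UTF-8 bytes. The variable names are illustrative.

```php
<?php
// "Medúlla" as UTF-8 bytes, so the script's own file encoding does not matter.
$title = "Med\xC3\xBAlla";

// Path component: rawurlencode() percent-encodes each UTF-8 byte.
$url = 'http://en.wikipedia.org/wiki/' . rawurlencode($title);

// The two functions differ on spaces: %20 for paths, + for query strings.
$pathSpace  = rawurlencode('a b');
$querySpace = urlencode('a b');

echo $url, "\n", $pathSpace, "\n", $querySpace, "\n";
```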
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. In PHP versions before 5.4, unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(*) Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.
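To make the comparison concrete, here is a small sketch that runs both functions on the same UTF-8 input with the charset passed explicitly. The sample text is made up for illustration.

```php
<?php
// Sample text containing markup-significant characters and one non-ASCII letter (ú).
$text = "1 < 2 & \xC3\xBA";

// htmlspecialchars() escapes only the markup characters and leaves ú alone.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// htmlentities() additionally turns ú into a named character reference.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n", $entities, "\n";
```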
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write "\xc3\xba" (in double quotes, since PHP does not interpret \x escapes in single-quoted strings) to avoid the problem.
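A short sketch of the byte-escape approach. One detail worth noting: PHP interprets \x escapes only inside double-quoted strings, which the last line demonstrates. The variable names are illustrative.

```php
<?php
// "medúlla" as it would arrive from a UTF-8 database connection.
$word = "med\xC3\xBAlla";

// The search string is spelled out as bytes, so the file's own encoding is irrelevant.
$plain = str_replace("\xC3\xBA", 'u', $word);

// In single quotes the escapes are NOT interpreted: this is 8 literal characters.
$literal = '\xC3\xBA';

echo $plain, "\n", strlen($literal), "\n";
```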
Running SET NAMES utf8 on the database before querying
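Use mysql_set_charset() in preference (or mysqli_set_charset() if you use the mysqli extension; the mysql_* functions were removed in PHP 7).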
I create a folder as follows.
function create(){
if ($this->input->post('name')){
...
...
$folder = $this->input->post('name');
$folder = strtolower($folder);
$forbidden = array(" ", "å", "ø", "æ", "Å", "Ø", "Æ");
$folder = str_replace($forbidden, "_", $folder);
$folder = 'images/'.$folder;
$this->_create_path($folder);
...
However, it does not replace Norwegian characters with _ (underscore).
For example, Åtest øre will create a folder called ã…test_ã¸re.
I have
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
in a header.
I am using PHP/codeigniter on XAMPP/Windows Vista.
How can I solve this problem?
You have to remember to save your PHP file in the correct encoding. Try saving it in ISO-8859-1 or UTF8. Also remember to reopen it after saving, so that you'll see if it is saved correctly or if the characters were converted. Your IDE may convert them to bytes (weird characters) without displaying the change in the editor.
When you save your file with Save As..., the dialog should show an Encoding option below the filename. Here you should choose ISO-8859-1 (or Latin-1) or UTF8. If you use Notepad this won't be an option, you need to get a proper editor.
Apply the same encoding to all other PHP files in that application. I think ISO-8859-1 will do it, but UTF8 is a good default, so choose it if that works for this.
Try explicitly setting the internal encoding used by PHP:
mb_internal_encoding('UTF-8');
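Edit: actually, now that I think about it... I'd advise using strtr. Note that its three-argument form translates byte by byte, so it mangles multibyte UTF-8 characters; pass an array instead, which replaces whole substrings:
$map = array(' ' => '_', 'å' => '_', 'ø' => '_', 'æ' => '_', 'Å' => '_', 'Ø' => '_', 'Æ' => '_');
$fixed = strtr($string, $map);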
Most of the normal string functions don't handle Unicode chars well, if at all.
In this situation, you could use a regular expression to work around that.
<?php
$string = 'Åtest øre';
$regexp = '/( |å|ø|æ)/iu';
$replace_char = '_';
echo preg_replace($regexp, $replace_char, $string)
?>
Returns:
_test__re
The interface you get to the Windows filesystem from PHP is the C standard library one. Windows maps its Unicode filesystem naming scheme into bytes for PHP using the system default codepage. Probably your system default codepage is 1252 Western European if you are in Norway, but that's a deployment detail that can change when you move to put it on a live server and it's not something that's easy to fix.
Your page/site encoding is UTF-8. Unfortunately whilst modern Linux servers typically use UTF-8 as their filesystem access encoding, Windows can't because the default code page is never UTF-8. You can convert a UTF-8 string into cp1252 using iconv; naturally all characters that don't fit in this code page will be lost or mangled. The alternative would be to make the whole site use charset=iso-8859-1, which can (for most cases) be stored in cp1252. It's a bit backwards to be using a non-UTF-8 charset though and of course it'll still break if you deploy it to a machine using a different default code page.
For this reason and others, filenames are hard. You should do everything you can to avoid making a filename out of an arbitrary string. There are many more characters you would need to block to make a string fit in a filename on Windows and avoid directory traversal attacks. Much better to store an ID like 123.jpeg on the filesystem, and use scripted-access or URL rewriting if you want to make it appear under a different string name.
If you must make a Windows-friendly filename from an arbitrary string, it would be easiest to do something similar to slug generation: preg_replace away all characters (Unicode or otherwise) that don't fit known-safe ones like `[A-Za-z0-9_-]`, check the result isn't empty and doesn't match one of the bad filenames (if so, prepend an underscore) and finally add the extension.
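The steps in that last paragraph might look roughly like this. The function name safe_filename() is hypothetical, the extension is assumed to come from trusted code rather than user input, and the reserved-name list covers the classic Windows device names.

```php
<?php
// Hypothetical helper: build a conservative filename from an arbitrary string.
function safe_filename(string $name, string $extension): string
{
    // Drop every byte outside the known-safe set; this also strips all non-ASCII bytes.
    $base = preg_replace('/[^A-Za-z0-9_-]/', '', $name);

    // Device names that Windows reserves, matched case-insensitively.
    $reserved = '/^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])$/i';

    // Empty or reserved results get an underscore prefix, as suggested above.
    if ($base === '' || preg_match($reserved, $base)) {
        $base = '_' . $base;
    }

    return $base . '.' . $extension;
}

echo safe_filename("\xC3\x85test \xC3\xB8re", 'jpeg'), "\n";
```

Because the pattern is applied without the /u modifier, it works byte by byte and never fails on malformed UTF-8. The price is that accented letters are dropped rather than transliterated.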
Use this.
$string = $this->input->post('name');
$regexp = '/( |å|ø|æ|Å|Ø|Æ|Ã¥|ø|æ|Ã…|Ø|Æ)/iU';
$replace_char = '_';
$folder = preg_replace($regexp, $replace_char, $string);
So I'm working on a project that takes data from a file. Some lines in the file need UTF-8 symbols but are encoded oddly: they contain \xC6, for example, rather than Æ.
If I do as follows:
$name = "\xC6ther";
$name = preg_replace('/x([a-fA-F0-9]{2})/', '&#$1;', $name);
echo utf8_encode($name);
It works fine. I get this:
Æther
But if I pull the same data from MySQL, and do as follows:
$name = $row['OracleName'];
$name = preg_replace('/x([a-fA-F0-9]{2})/', '\&#$1;', $name);
$name = utf8_encode($name);
Then I receive this as output:
\&#C6;ther
Anyone know why this is?
As requested, vardump of $row['OracleName'];
string(15) "xC6ther Barrier"
Why is there a \ in the replacement of your second preg_replace? It should be:
preg_replace('/x([a-fA-F0-9]{2})/', '&#$1;', $name);
OK, I think there is some confusion here. Your regular expression matches something like x66 and replaces it with '&#66;', which looks like an HTML character reference to me, but you are then calling utf8_encode, which does this (from the manual):
utf8_encode: Encodes an ISO-8859-1 string to UTF-8
So the reference would never get converted (or, to be more precise, '&#66;' would remain '&#66;', since those characters are the same in ISO-8859-1 and UTF-8).
Also note that in your first snippet you use "\xC6", but this would never be caught by the preg_replace, since it is already an encoded character. In a double-quoted string, \x followed by a hex number (0x00 to 0xFF) inserts that byte into the string as is; it does not produce the text xC6.
So I am kind of confused about what you really want to do. What is the preg_replace for?
If you want to convert HTML entities to UTF-8, look into mb_convert_encoding; if you want to do the reverse and encode UTF-8 text as HTML entities, look into htmlentities.
And if it has nothing to do with any of that and you simply want to change the encoding, mb_convert_encoding is still there.
Figured out the problem, on the SQL pull I missed an 'x' in the preg_replace
preg_replace('/x([a-fA-F0-9]{2})/', '&#x$1;', $name);
Once I added in the x, it worked like a charm.
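For completeness, a sketch of the whole round trip: the literal text xC6 is turned into a hexadecimal character reference and then decoded to UTF-8. The function name decode_hex_markers() is made up for this example. The pattern is the one from the question, so it would also catch ordinary words that happen to contain "x" followed by two hex digits, and html_entity_decode() will decode any other entities present in the text.

```php
<?php
// Hypothetical helper: turn literal markers such as "xC6" into UTF-8 characters.
function decode_hex_markers(string $text): string
{
    // "xC6" becomes the hexadecimal character reference "&#xC6;".
    $withEntities = preg_replace('/x([a-fA-F0-9]{2})/', '&#x$1;', $text);

    // Decode the references into real characters, encoded as UTF-8.
    return html_entity_decode($withEntities, ENT_QUOTES, 'UTF-8');
}

$name = decode_hex_markers('xC6ther Barrier');
echo $name, "\n";
```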