I have a problem where I can't seem to write "certain" Korean characters. Let me try to explain. These are the steps I take.
MS Access DB file (US version) has a table with Korean in it. I export this table as a text file with UTF-8 encoding. Let's call it "A.txt"
When A.txt is read, stored in an array, then written to a new file (B.txt), all characters display properly. I'm using header("Content-Type: text/plain; charset=UTF-8"); at the very beginning of my PHP script. I simply use fwrite($fh, $someStr).
When I read B.txt in another script and write to yet another file (C.txt), there's a certain column (obviously in the PHP code I'm not working with a table or matrix, but effectively speaking, when output back to the original text file format) that causes the characters to show up something like this: ¸ì¹˜ ì–´ëœíŠ¸ 나ì¼ë¡. This entire column has broken characters, so if I have 5 columns in a text file, delimited by commas and encapsulated in double quotes, this column will break all of the other columns' Korean characters. If I omit this column when writing the text file, all is well.
Now, I noticed that certain PHP functions/operations break the Unicode characters. For example, if I use preg_replace() for any of the Korean strings and try to fwrite() that, it will break. However, I'm not performing anything that I'm not already doing on other fields/columns (speaking in terms of text file format), and other sections are not broken.
Does anyone have any idea how to rectify this? I've tried utf8_encode() and mb_convert_encoding() in different ways with no success; from what I've read, utf8_encode() shouldn't even be necessary if my file is UTF-8 to begin with. I've also tried setting my computer language to Korean.
I've spent 2 days on this already, and it's becoming a huge waste of time. Please help!
UPDATE:
I think I may have found the culprit. In the script that creates B.txt, I split a long Korean string into two (using the string <br /><br /> as the delimiter) and assign the halves to different columns. I think this splitting operation is ultimately causing the problem.
NEW QUESTION:
How do I go about splitting this long string into two while preserving the Unicode? Previously I used strpos() and substr(), but I'm reading that the mb_*() functions might be what I need. Testing now.
Try the unicode modifier (u) for preg
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
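The split the question asks about can be sketched with the `mb_*()` functions; the sample strings below are invented for illustration, the delimiter is the one from the question:

```php
<?php
// Multibyte-safe split on an ASCII delimiter. The Korean sample
// text is invented; the delimiter matches the question.
mb_internal_encoding('UTF-8');

$delim = '<br /><br />';
$text  = '안녕<br /><br />하세요';

// mb_strpos()/mb_substr() count characters, not bytes, so the cut
// always lands on a character boundary.
$pos    = mb_strpos($text, $delim);
$first  = mb_substr($text, 0, $pos);
$second = mb_substr($text, $pos + mb_strlen($delim));

// In fact, explode() is binary-safe too: since the delimiter is pure
// ASCII, it can never match in the middle of a UTF-8 sequence.
list($a, $b) = explode($delim, $text, 2);
```

Whichever variant you use, the /u modifier above is what matters once the preg_* functions get involved.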
Related
I have an annoying problem with preg_replace() and charsets. I'm doing a couple of preg_replace() calls in a row, but unfortunately, the first time any special character like äöüß is inserted by preg_replace(), I get PREG_BAD_UTF8_ERROR on subsequent calls.
The inserted special characters themselves display just fine; they just break any subsequent preg_replace() call. Is preg_* UTF-8 only?
The text preg_replace() works on comes from a MySQL database, and the replacement is also built in the PHP file from MySQL values. mb_detect_encoding() says ASCII for the text until the first replacement with special characters; after that it detects UTF-8, so the encoding changes, and this might be the problem.
For your information, I'm working with ISO-8859-1 encoding (PHP, MySQL, meta charset). I do have a workaround for now: running htmlentities() on the replacement string.
Any ideas on how to solve it?
What you are looking for is probably mb_ereg_replace(). It handles multibyte encodings and should perform fine with different ones. Be sure to use mb_regex_encoding() along with it.
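A minimal sketch of that suggestion (the text and patterns are invented):

```php
<?php
// mb_ereg_replace() honours the encoding set via mb_regex_encoding(),
// so replacements that insert non-ASCII bytes don't break later calls.
mb_regex_encoding('UTF-8');

$text = 'Gruesse aus Muenchen';

// The first replacement inserts a non-ASCII character...
$text = mb_ereg_replace('ue', 'ü', $text);

// ...and a second pass is not confused by the bytes it inserted,
// unlike preg_replace() without the /u modifier.
$text = mb_ereg_replace('Gr', 'gr', $text);
```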
I have an unknown number of files with garbage data interspersed and I want to remove said garbage data dynamically, perhaps using regex.
It'll usually look something like this in an HTML file in a browser:
this is the beginning of the file, ��
In the file itself, it appears like this:
this is the beginning of the file, \xE2\xA0
I tried using a regex editor to remove it, but to no avail: it can't find it at all. How can I remove this garbage data? Again, some of the files have all kinds of HTML markup.
Thank you for any help.
Those appear because something is wrong with a character set on your site.
For example, your files are stored in Unicode, but your Content-Type is set as text/html; charset=ISO-8859-1. The problem could also be how text is stored in your database, or with how text is represented internally by your programming language.
Rather than try to strip them out, it is better to get the character set correct. This is generally a frustrating process because there are so many points where the problem could have been introduced.
You don't say what technologies you use. Generally you can search for how to solve character set issues with specific technologies such as "character set problems mysql" to find solutions.
I recommend using command line tools like file to examine what character set a text file is stored in and iconv to convert text files from one character set to another.
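The same conversion is also available from inside PHP via iconv(); a minimal sketch, assuming the source text is ISO-8859-1 (the sample bytes are invented — \xE9 is 'é' in Latin-1):

```php
<?php
// Convert an ISO-8859-1 byte string to UTF-8. "\xE9" is 'é' in
// Latin-1; after conversion it becomes the two bytes C3 A9.
$latin1 = "caf\xE9";
$utf8   = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo bin2hex($utf8); // prints "636166c3a9"
```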
There are two possibilities. The first, unlikely, one is that you are getting 0xe2 0xa0 ... because there are Braille patterns in the document.
As for the second possibility, 0xa0 is NBSP. 0xe2 makes me think of ISO-8859-5.
Is there any chance someone copied & pasted stuff from a Russian version of some software package?
Also, you can get & use iconv on Windows.
I'd really appreciate some help with this. I've wasted days on this problem and none of the suggestions I have found online seem to give me a fix.
I have a CSV file from a supplier. It appears to have been exported from a Microsoft system.
I'm using PHP to import the data into MySQL (both latest versions).
I have one particular record which contains a strange character that I can't get rid of. Manual editing to remove the character is possible, but I would prefer an automated solution as this will happen multiple times a day.
The character appears to be an interpretation of a “smart quote”. A hex editor tells me that the character codes are C2 and 92. In the hex editor it looks like a weird A followed by a smart quote. In other editors and Calc, Writer etc it just appears as a box. メ
I'm using mb_detect_encoding to determine the encoding. All records in the CSV file are returned as ASCII, except the one with the strange character, which is returned as UTF-8.
I can insert the offending record into MySQL and it just appears in Workbench as a square.
MySQL tables are configured to utf-8 – utf8_unicode_ci and other unusual UTF characters (eg fractions) are ok.
I've tried lots of solutions to this...
How to detect malformed utf-8 string in PHP?
Remove non-utf8 characters from string
Removing invalid/incomplete multibyte characters
How to replace Microsoft-encoded quotes in PHP
etc etc but none of them have worked for me.
All I really want to do is remove or replace the offending character, ideally with a search and replace for the hex values but none of the examples I have tried have worked.
Can anyone help me move forward with this one please?
EDIT:
Can't post answer as not enough reputation:
Thanks for your input. Much appreciated.
I'm just going to go with the hex search and replace:
$DodgyText = preg_replace("/\xEF\xBE\x92/", "", $DodgyText);
I know it's not the elegant solution, but I need a quick fix and this works for me.
Another solution is:
$contents = iconv('UTF-8', 'Windows-1251//IGNORE',$contents);
$contents = iconv('Windows-1251', 'UTF-8//IGNORE',$contents);
Where you can replace Windows-1251 to your local encoding.
At a quick glance, this looks like a UTF-8 file. (UTF-8 is identical to ASCII for the first 128 code points, hence everything is detected as ASCII except for the special character.)
It should work if your database connection is also UTF-8 encoded (which it may not be by default).
How to do that depends on your database library, let us know which one you're using if you need help setting the connection encoding.
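For the two most common libraries it looks like this (host, credentials and database name are placeholders — this is a configuration sketch, not runnable as-is):

```php
<?php
// mysqli: set the connection character set explicitly.
$mysqli = new mysqli('localhost', 'user', 'pass', 'db');
$mysqli->set_charset('utf8');

// PDO: declare the charset in the DSN instead.
$pdo = new PDO('mysql:host=localhost;dbname=db;charset=utf8', 'user', 'pass');
```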
updated code based on established findings
You can do search & replace on strings using hexadecimal notation:
str_replace("\xEF\xBE\x92", '', $value);
This returns the value with the special code removed.
That said, if your database table is UTF-8, you shouldn't need that conversion; instead you could look at the connection (or session) character set (i.e. SET NAMES utf8;). Configuring this depends on what library you use to connect to your database.
To debug the value you could use bin2hex(); this usually helps in doing searches online.
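For example, a minimal sketch with an invented character (U+2019, the right single quotation mark that Word-style smart quotes use):

```php
<?php
// bin2hex() shows the exact UTF-8 byte sequence of a character,
// which is what you'd search and replace on. "\u{2019}" is PHP 7+
// syntax for a Unicode escape.
$char = "\u{2019}";
echo bin2hex($char); // prints "e28099"
```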
I've got a database that outputs a great deal of information.
I'm currently building a PHP application to build this database into an XML format for another application to read.
I'm a little stuck with special characters.
In the database, some characters are printing strangely:
Ø becomes Ã˜
° becomes Â°
I'm using fwrite() to write the XML file in the PHP and I think the error resides there somehow.
I need a way to overcome this, perhaps by detecting where these characters occur and replacing them appropriately.
I'm using PHP, and I'm not sure how to replace these characters on an individual basis; more importantly, I'm not sure what to replace them with!
Can someone help?
Ø becomes Ã˜, ° becomes Â°
It looks like UTF-8-encoded characters are being passed to some display device, and the display device has been told that those are ISO-8859-X or Windows-125X encoded characters.
Tell the display device that this is indeed UTF-8 (which is the default standard encoding for XML).
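In the asker's situation that means declaring UTF-8 both in the HTTP header and in the XML declaration written via fwrite(); a sketch (the file name and root element are placeholders):

```php
<?php
// Declare the encoding to whoever consumes the XML. In CLI the
// header() call is a no-op; in a web context it labels the response.
header('Content-Type: application/xml; charset=UTF-8');

$fh = fopen('output.xml', 'wb');
fwrite($fh, '<?xml version="1.0" encoding="UTF-8"?>' . "\n");
fwrite($fh, "<root>Ø and °</root>\n");
fclose($fh);
```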
I am working on a Flex app that has a MySQL database. Data is retrieved from the DB using PHP, then passed on to Flex via AMFPHP.
The problem I'm having is that the data is copied from Word documents, which sometimes results in some of the more unusual characters not displaying properly. For example, Word uses different characters for opening and closing double quotes instead of just " (the standard double quote). Another example is the long dash instead of -.
All of these characters end up as one or more accented capital A characters instead. Not only that, each time the document is saved the characters are replaced again, resulting in an ever-increasing number of these accented A's appearing.
Doing a search and replace for each troublesome character to swap it for a plain equivalent seems to work, but obviously this requires compiling a list of all the characters that may appear, and the problem recurs whenever a new character is used for the first time. It also seems like a bit of a brute-force way of getting around the problem rather than a proper solution.
Does anyone know what causes this and have any good workarounds / fixes? I have had similar problems when using utf-8 characters in html documents that aren't set to use utf-8. Is this the same thing and if so, how do I get flex to use utf-8?
Many thanks
Adam
It is the same thing, and smart quotes aren't special as such: you will in fact be failing for every non-ASCII character, so a trivial ad-hoc replace for the smart quote characters would be pointless.
At some point, someone is mis-decoding a sequence of bytes as ISO-8859-1 or Windows code page 1252 when it should have been UTF-8. Difficult to say where without detail/code.
What is “the document”? What format is it? Does that format support UTF-8 content? If it does not, you will need to encode the output you put into it, at the document-creation phase, to the encoding the consumer of that document expects, e.g. using iconv.
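For instance, if the consumer expects Windows-1252 (a plausible assumption for something fed by Word content, but an assumption nonetheless), the conversion in PHP would be:

```php
<?php
// Re-encode UTF-8 output for a Windows-1252 consumer. Smart quotes
// exist in cp1252 (0x93/0x94), so they survive; //TRANSLIT
// approximates anything that has no cp1252 equivalent.
$utf8   = "\u{201C}hello\u{201D}";  // "hello" in curly quotes
$cp1252 = iconv('UTF-8', 'Windows-1252//TRANSLIT', $utf8);

echo bin2hex($cp1252); // prints "9368656c6c6f94"
```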