I have an annoying problem with preg_replace and charsets. I'm doing a couple of preg_replace calls in a row, but unfortunately, the first time any special character like äöüß is inserted by preg_replace, I get PREG_BAD_UTF8_ERROR on subsequent calls.
Apart from that, the inserted special characters are displayed just fine; they only break any subsequent preg_replace call. Are the preg_* functions UTF-8 only?
The text preg_replace works on comes from a MySQL database, and the replacement is crafted in the PHP file with values from MySQL. mb_detect_encoding() reports ASCII for the text until the first replacement inserts special characters; after that it detects UTF-8, so the encoding changes mid-stream, and this might be the problem.
For your information, I'm working with ISO-8859-1 encoding throughout (PHP, MySQL, meta charset). I also have a workaround that calls htmlentities() on the replacement string, which works for now.
Any ideas on how to solve it?
What you are looking for is probably mb_ereg_replace(). It handles multibyte encodings and should work fine across different ones. Be sure to use mb_regex_encoding() along with it.
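A minimal sketch, assuming the data really is ISO-8859-1 as described:

    <?php
    // Set the regex encoding to match the data before any mb_ereg_* call.
    mb_regex_encoding('ISO-8859-1');

    $text = 'Gruesse aus Muenchen';

    // mb_ereg_replace() respects the configured encoding, so a single-byte
    // special character in the replacement does not trigger UTF-8 errors
    // the way it does with the /u-mode preg_* functions.
    $result = mb_ereg_replace('ue', "\xFC", $text); // 0xFC is ü in ISO-8859-1
    echo $result; // Grüsse aus München (as ISO-8859-1 bytes)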
Related
I want to be able to store every possible character (Chinese, Arabic, characters like these: ☺♀☻) in a MySQL database and also be able to use them in PHP and HTML. How do I do this?
Edit: when I use htmlspecialchars() on those characters like this: htmlspecialchars('☺♀☻', ENT_SUBSTITUTE, 'UTF-8'); it returns some seemingly random characters. How do I solve this?
Use UTF-8 character encoding for all text/varchar fields in your database, as well as for the page encoding. Be sure to use the multibyte (mb_*) forms of the string functions, such as mb_substr().
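A byte-oriented function like substr() can slice a multibyte character in half, while mb_substr() counts whole characters:

    <?php
    $s = '☺♀☻'; // each symbol is three bytes in UTF-8

    echo substr($s, 0, 1);             // one raw byte: an invalid fragment
    echo mb_substr($s, 0, 1, 'UTF-8'); // one whole character: ☺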
Pick a character set that covers the characters you want. UTF-8 is very broad and the most commonly used.
Storing the characters is not much of a problem, since it's all just binary data. If you also want the text to be searchable, then picking the right collation matters; utf8_general_ci is fine.
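A sketch of the whole setup (connection, table and credential names are made up for illustration). One caveat: MySQL's utf8 charset stores at most three bytes per character, so utf8mb4 with utf8mb4_general_ci is the safer choice if you ever need four-byte characters such as emoji:

    <?php
    // The charset in the DSN makes the connection itself UTF-8,
    // so PHP and MySQL agree on how to interpret the bytes.
    $pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'pass');

    $pdo->exec("
        CREATE TABLE IF NOT EXISTS messages (
            id   INT AUTO_INCREMENT PRIMARY KEY,
            body TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci
        )
    ");

    // Chinese, Arabic and symbol characters all round-trip unchanged.
    $stmt = $pdo->prepare('INSERT INTO messages (body) VALUES (?)');
    $stmt->execute(['中文 العربية ☺♀☻']);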
The Problem
I'm developing a PHP application which displays Wingdings and Webdings characters. If I just put a font tag around a character, it gets displayed correctly. However, once it is copy-pasted, it reverts to the underlying character, e.g. "a".
What I think would be the Solution
This problem could be solved by replacing every Wingdings character on the page with its Unicode equivalent. Unicode holds so many characters that I'm guessing Wingdings characters and the like are also on the list.
Question
How can I map/create UTF-8 characters from Wingdings characters?
Here is a list of Unicode equivalents for the Wingdings characters.
Is this what you are looking for?
http://www.alanwood.net/demos/wingdings.html
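If so, the conversion is just a lookup table. A sketch (the entries below are a small illustrative subset based on tables like the one above, not a complete or verified mapping):

    <?php
    // Map Wingdings letters to the Unicode characters they render as.
    $wingdingsToUnicode = [
        'J' => "\u{263A}", // white smiling face
        'L' => "\u{2639}", // white frowning face
    ];

    // strtr() swaps each mapped character, so copy-pasted text keeps
    // its meaning even without the Wingdings font applied.
    echo strtr('J L', $wingdingsToUnicode); // ☺ ☹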
I have a problem where I can't seem to be able to write "certain" Korean characters. Let me try to explain. These are the steps I take.
An MS Access DB file (US version) has a table with Korean text in it. I export this table as a text file with UTF-8 encoding. Let's call it "A.txt".
When A.txt is read, stored in an array, then written to a new file (B.txt), all characters display properly. I'm using header("Content-Type: text/plain; charset=UTF-8"); at the very beginning of my PHP script. I simply use fwrite($fh, $someStr).
When I read B.txt in another script and write to yet another file (C.txt), there's a certain column (obviously in the PHP code I'm not working with a table or matrix, but effectively speaking, when output back to the original text file format) that causes the characters to show up like this: ¸ì¹˜ ì–´ëœíŠ¸ 나ì¼ë¡. This entire column has broken characters, so if I have 5 columns in a text file, delimited by commas and encapsulated in double quotes, this column will break all of the other columns' Korean characters. If I omit this column when writing the text file, all is well.
Now, I noticed that certain PHP functions/operations break the Unicode characters. For example, if I use preg_replace() for any of the Korean strings and try to fwrite() that, it will break. However, I'm not performing anything that I'm not already doing on other fields/columns (speaking in terms of text file format), and other sections are not broken.
Does anyone have any idea how to rectify this? I've tried utf8_encode() and mb_convert_encoding() in different ways with no success. I'm reading that utf8_encode() wouldn't even be necessary if my file is UTF-8 to begin with. I've also tried setting my computer's language to Korean.
I've spent 2 days on this already, and it's becoming a huge waste of time. Please help!
UPDATE:
I think I may have found the culprit. In the script that creates B.txt, I split a long Korean string into two (using the string ...<br /><br />... as an indicator) and assign the parts to different columns. I think this splitting operation is ultimately causing the problem.
NEW QUESTION:
How do I go about splitting this long string in two while preserving the Unicode? Previously, I used strpos() and substr(), but I am reading that the mb_*() functions might be what I need. Testing now.
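For reference, a sketch of the mb_* version I'm testing ($longText stands in for the real string, assumed UTF-8). Since <br /><br /> is pure ASCII, explode() would also be byte-safe here, but the mb_* calls make the encoding explicit:

    <?php
    mb_internal_encoding('UTF-8');

    $delimiter = '<br /><br />';
    $pos = mb_strpos($longText, $delimiter);

    if ($pos !== false) {
        $first  = mb_substr($longText, 0, $pos);
        $second = mb_substr($longText, $pos + mb_strlen($delimiter));
    } else {
        $first  = $longText; // no delimiter: keep everything in one column
        $second = '';
    }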
Try the Unicode modifier (u) for preg:
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
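A minimal example of the difference (assuming UTF-8 input):

    <?php
    $s = 'äbc'; // UTF-8: ä is two bytes

    echo preg_replace('/^./', '', $s);  // byte mode: leaves half of ä behind
    echo preg_replace('/^./u', '', $s); // UTF-8 mode: cleanly yields "bc"

Note that if the subject is not valid UTF-8, preg_replace() with /u returns null and preg_last_error() reports PREG_BAD_UTF8_ERROR.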
I am using a jQuery editor with PHP. It works fine for plain text (text without special characters), but if I try to post text which contains special characters, it does not store those special characters in the DB table.
When I tried replacing special characters with HTML codes, it worked fine. But it is too difficult to replace every special character one by one.
Is there any script which replaces all special characters in a string?
Do you mean something like PHP's str_replace()?
http://php.net/manual/en/function.str-replace.php
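If the goal really is to swap every special character for its HTML code in one call, htmlentities() does exactly that; a minimal example:

    <?php
    $input = 'Grüße & "quotes"';

    // ENT_QUOTES also converts quotes; the third argument must match
    // the actual encoding of the input string.
    echo htmlentities($input, ENT_QUOTES, 'UTF-8');
    // Gr&uuml;&szlig;e &amp; &quot;quotes&quot;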
Is there any script which replaces all special characters in a string?
This is the wrong approach. You need to get your character sets right, so there will be no need to replace anything.
I don't know what you're doing, but if you are transmitting data through Ajax, it is probably UTF-8 encoded. If your database is in a different character set, you may need to convert it.
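For instance, a sketch with mysqli (the table and column are made up): setting the connection charset to match the UTF-8 data arriving from Ajax means nothing needs escaping by hand:

    <?php
    $mysqli = new mysqli('localhost', 'user', 'pass', 'test');

    // Tell MySQL the bytes on this connection are UTF-8, so the
    // editor's special characters are stored instead of mangled.
    $mysqli->set_charset('utf8mb4');

    $stmt = $mysqli->prepare('INSERT INTO posts (body) VALUES (?)');
    $stmt->bind_param('s', $_POST['body']);
    $stmt->execute();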
Basic (deep) reading: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
For more specific information, you will need to provide more details about your situation. Here are a few questions that deal with the subject, maybe one of them already helps:
Special characters in PHP / MySQL
How to store characters like ♥☆ to DB?
I am working on a Flex app that has a MySQL database. Data is retrieved from the DB using PHP, and then I use AMFPHP to pass the data on to Flex.
The problem I am having is that the data is copied from Word documents, which sometimes results in some of the more unusual characters not displaying properly. For example, Word uses distinct characters for opening and closing double quotes instead of just " (the standard double quote). Another example is the long dash instead of -.
All of these characters result in one or more accented capital A characters appearing instead. Not only that, but each time the document is saved, the characters are replaced again, resulting in an ever-increasing number of these accented A's.
Doing a search and replace for each troublesome character to swap it for a plain equivalent seems to work, but obviously this requires compiling a list of all the characters that may appear, and the problem will recur as new characters are used for the first time. It also seems like a brute-force way around the problem rather than a proper solution.
Does anyone know what causes this, and are there any good workarounds/fixes? I have had similar problems when using UTF-8 characters in HTML documents that aren't set to use UTF-8. Is this the same thing, and if so, how do I get Flex to use UTF-8?
Many thanks
Adam
It is the same thing, and smart quotes aren't special as such: you will in fact be failing for every non-ASCII character, so a trivial ad hoc replace for the smart-quote characters would be pointless.
At some point, someone is mis-decoding a sequence of bytes as ISO-8859-1 or Windows code page 1252 when it should have been UTF-8. It's difficult to say where without details/code.
What is “the document”? What format is it? Does that format support UTF-8 content? If it does not, you will need to encode the output you put into it, at the document-creation phase, to the encoding the consumer of that document expects, e.g. using iconv().
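A sketch of that re-encoding step, assuming the consumer expects Windows-1252 and a made-up output file name:

    <?php
    $utf8 = "\u{201C}smart quotes\u{201D} and a long dash \u{2014}";

    // //TRANSLIT approximates any character the target encoding lacks
    // instead of failing the whole conversion.
    $cp1252 = iconv('UTF-8', 'Windows-1252//TRANSLIT', $utf8);
    file_put_contents('document.txt', $cp1252);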