I've got an Excel importer for my website, which seemed to be working fine until I found a row that contains apostrophes; it tries to save that text into the database using �.
Example:
Branches in Vava’u, Haapai, ‘Eua and Niuatoputapu
Changes to:
Branches in Vava�u, Haapai, �Eua and Niuatoputapu
Is there any way I can fix this easily within php?
Try replacing the � with ' before saving to the database. MS Excel sometimes uses characters with different codes for special characters (outside the printable ASCII range).
Vava’u - contains character code U+2019 (’); use 0x27 (') instead
‘Eua - contains character code U+2018 (‘); use 0x27 (') instead
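As a rough illustration of the replacement suggested above, here is a minimal Python sketch; the same idea carries over to PHP's str_replace. The names QUOTE_MAP, normalize_quotes and decode_cp1252 are hypothetical, and the assumption that the Excel data arrives as Windows-1252 bytes is mine, not something the question confirms.

```python
# Map typographic single quotes to the plain ASCII apostrophe (0x27).
# This table is an illustrative assumption; extend it as needed.
QUOTE_MAP = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
}


def normalize_quotes(text: str) -> str:
    """Return text with curly single quotes replaced by apostrophes."""
    return text.translate(str.maketrans(QUOTE_MAP))


def decode_cp1252(raw: bytes) -> str:
    """Decode bytes that are assumed to be Windows-1252 into a str."""
    return raw.decode("cp1252")


if __name__ == "__main__":
    print(normalize_quotes("Branches in Vava\u2019u, Haapai, \u2018Eua and Niuatoputapu"))
```

If the goal is to keep the curly quotes rather than flatten them, decoding with the right source encoding and storing the result as UTF-8 is usually the less lossy option.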
I am investigating an issue where the browser sends data to Apache (2.4) / PHP (7.2, Mac) and PHP is unable to decode some bytes into a printable character. The character is '-' (the hexadecimal value 2D is given when the character is copied and pasted into https://www.online-toolz.com/tools/text-hex-convertor.php, with the ASCII hex looked up at https://ascii.cl/) but it is displayed as ��� by PHP.
MariaDB displays the character fine and reports the length of the data source's column value as 250 characters. The data is fetched by PHP PDO, passed to an HTML form and used as the value of a text input. The character displays fine in the HTML DOM. However, when the POST data is submitted back through Apache to PHP, PHP says the string length is 251 characters, which then breaks my string length sanitizer.
I found a short Python command to see the binary. I copied and pasted the character out of Sequel Pro and put it into this script.
import binascii
bin(int(binascii.hexlify('-'), 16))
'0b101101'
The history of the encoding is that it was from a Google Docs document, downloaded as .txt, opened in Mac Text Edit and saved with 'UTF-8' encoding, then passed through python into a MySQL database, back out through PHP to HTML and submitted back to PHP.
I have replaced the character in the database with another character '–' (hex value e28093) with binary output below, and everything works fine.
bin(int(binascii.hexlify('–'), 16))
'0b111000101000000010010011'
Any ideas on why PHP fails to recognize the original character correctly and reports the string length as +1 compared to MySQL? I assume that PHP should be able to handle all ASCII characters properly.
UPDATE:
When I print the original string (the one that is unprintable) in the HTML DOM (before posting back to PHP), the string length is reported as 249 characters and the '-' character is printable.
This '–' is the en dash, U+2013. If it is delivered as ASCII, then 3 ASCII chars are sent: 0xe2 0x80 0x93. The first code is â in 8-bit ASCII, but undefined in standard (7-bit) ASCII. The other 2 chars are control characters in 8-bit ASCII. So 3 "?" are OK.
Anyway, you said that the standard minus sign is also delivered as 3 "?". That is very unusual. Please verify this again.
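To make the byte counts above concrete, here is a small Python 3 sketch that compares the two characters. The helper names utf8_hex and char_and_byte_lengths are hypothetical. It also shows why a byte-counting length check (PHP's strlen counts bytes, while mb_strlen counts characters) can disagree with a character-counting one such as MySQL's CHAR_LENGTH.

```python
HYPHEN = "-"        # U+002D HYPHEN-MINUS, one byte in UTF-8
EN_DASH = "\u2013"  # U+2013 EN DASH, three bytes in UTF-8


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of ch as a lowercase hex string."""
    return ch.encode("utf-8").hex()


def char_and_byte_lengths(text: str) -> tuple:
    """Return (number of characters, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


if __name__ == "__main__":
    print(utf8_hex(HYPHEN), utf8_hex(EN_DASH))
    print(char_and_byte_lengths("a" + EN_DASH + "b"))
```

A length sanitizer that compares against a character limit should therefore count characters, not bytes, when the input may contain multi-byte UTF-8 sequences.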
I have an existing program that generates PDF files via TCPDF. In most cases it works fine even when the content contains non-English characters, but when the content includes either of the two simplified Chinese characters 喆 (Unicode code point: 21894) or 旻 (Unicode code point: 26107), all Chinese characters are rendered as rectangles (invalid characters).
I checked uni2cid_ag15.php and found the mappings for those two characters; the mapped CIDs are correct. Does anyone know why the Chinese characters are converted incorrectly when one of those specific characters is present?
References:
https://raw.githubusercontent.com/adobe-type-tools/cmap-resources/master/cmapresources_gb1-5/cid2code.txt
https://github.com/tecnickcom/TCPDF/blob/master/fonts/uni2cid_ag15.php
Thanks for the advice in advance.
I found the solution: use the encoding "GB18030" with the PHP function mb_convert_encoding instead of "GB2312". With that change, those characters are generated in the PDF without problems.
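One plausible explanation for why the switch helps is that these two characters are outside the GB2312 repertoire but inside GB18030. The Python sketch below checks this with Python's own codecs; the helper name can_encode is hypothetical, and PHP's mbstring tables are assumed, not verified, to behave the same way.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for ch in ("\u5586", "\u65fb"):  # 喆 (21894) and 旻 (26107)
        print(ch, can_encode(ch, "gb2312"), can_encode(ch, "gb18030"))
```

A check like this can be run over the input before conversion to find out which characters a target encoding cannot represent.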
I'm working on a script that builds an XML feed using strings from the database. The strings are user-entered image captions from the Facebook Open Graph API. According to Facebook, the strings are all supposed to be UTF-8, so I import the captions into the database and store them as utf8-unicode (I also tried utf8-bin).
But I always get the same error when trying to display the output XML feed, because one of the captions has a weird whitespace character:
This page contains the following errors:
error on line 63466 at column 14: Input is not proper UTF-8, indicate encoding !
Bytes: 0x0B 0x54 0x68 0x6F
Below is a rendering of the page up to the first error.
In the database (phpMyAdmin) and in the page source code (using Chrome), the problematic character appears as an empty square symbol.
If I copy and paste the problematic character into a converter, it gives me hexadecimal 000B.
What's the easiest way to fix this?
I'd also like to understand why the Facebook Graph API is giving me non-UTF-8 characters in the first place, when it's not supposed to.
Failed attempts:
utf8_encode() isn't working because the rest of the strings are valid UTF-8.
I also tried several different ways of stripping out all non-UTF-8 characters, but none of them filters out this specific character. The same goes for filtering out all non-Latin characters.
htmlentities(), htmlspecialchars() and the like don't encode the problematic character
iconv(mb_detect_encoding()) will not detect the string as invalid UTF-8
str_replace() or preg_replace() is of no help: if I try to copy and paste the character into Visual Studio Code, nothing is pasted, not even a whitespace
str_replace("\0", "", ) ...nope
Here is a list of what we have found and/or worked through with the original poster:
MySQL's utf8 is not a full implementation of UTF-8 (it stores at most three bytes per character) - utf8mb4 is;
additional information on character sets and collation differences;
changes that happen to existing data if collation is changed.
We have checked the above and discovered that the initial problem was caused by vertical tabulation symbols creeping into the text fields. A good way to remove said symbols is by running $str = str_replace("\x0b", "", $str);, where $str is the string that is going to be inserted into the text field. It's important to not replace \v, as that might not be desired.
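The bytes in the parser error (0x0B 0x54 0x68 0x6F) are a vertical tab followed by "Tho", which fits this diagnosis. XML 1.0 forbids all control characters below 0x20 except tab, line feed and carriage return, so a slightly more general version of the same band-aid can be sketched in Python as follows. The function name strip_xml_illegal is hypothetical; in PHP a comparable preg_replace with the same character class would be the analogue.

```python
import re

# Control characters that XML 1.0 does not allow: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_xml_illegal(text: str) -> str:
    """Remove control characters that would make an XML 1.0 document invalid."""
    return _XML_ILLEGAL.sub("", text)


if __name__ == "__main__":
    caption = "Nice photo\x0bThe end\n"
    print(repr(strip_xml_illegal(caption)))
```

Newlines and tabs are left alone on purpose, since they are legal in XML and usually meaningful in captions.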
If the 0B is always at the beginning of a string, then trace the strings back to their source and see if they are "BOM" encoded (see Wikipedia on BOM).
At least come back with the various steps the data takes, so we can help with deducing the source of the problem.
Note: although needed for Emoji and Chinese, switching to utf8mb4 will not deal with BOM if that is the 'real' problem.
(using str_replace is just a bandaid)
I have an ODS spreadsheet (managed with OpenOffice). Several cells contain multiple lines. The contents of the data table are displayed on a website.
When I import the file with phpMyAdmin, these cells are truncated at the first newline character.
In the ODS file, the newline character is char(10). In my case this has to be replaced with the string <br/>, the HTML line break tag. Writing a PHP program that does the replacement makes no sense, since everything after the newline character is already cut off by the import. For the moment I run a PC program that replaces each char(10) in the ODS file with the '|' character. After the import, I replace the '|' with <br/> using PHP. Terrible! Is there a way to prevent the phpMyAdmin import from truncating at char(10)?
Thanks, Chris.
I had the same problem. My solution is not the perfect one but did the job for me.
What I did was replace the newline character in the ODS file with a placeholder, so that I could replace it back in PHP.
Open the ODS file, open the search & replace box, then search for \n and replace it with some unique string that you can locate in PHP.
In my case I used something like -EOL-
Then, in my PHP script, I replaced -EOL- with the line break I needed (<br/> in the asker's case).
I know it's not a shortcut, but it is a solution...
Hope it works for you as well
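The placeholder round trip described above can be sketched in Python like this. The function names are hypothetical, and the -EOL- marker is taken from the answer; it only works if that string never occurs in the real cell contents.

```python
PLACEHOLDER = "-EOL-"  # marker from the answer; must not occur in real data


def newline_to_placeholder(cell: str, placeholder: str = PLACEHOLDER) -> str:
    """Step done before the import: hide char(10) behind a marker."""
    return cell.replace("\n", placeholder)


def placeholder_to_br(cell: str, placeholder: str = PLACEHOLDER) -> str:
    """Step done after the import: turn the marker into an HTML line break."""
    return cell.replace(placeholder, "<br/>")


if __name__ == "__main__":
    original = "line one\nline two"
    imported = newline_to_placeholder(original)
    print(placeholder_to_br(imported))
```

The first function stands in for the search and replace done in the spreadsheet, and the second for the replacement done in the PHP script after the import.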
How do I remove special characters?
Montréal-Rive-Sud
The above is a city name that I'm getting from XML feeds. It is inserted directly into my MySQL database. Now I want to remove this kind of special character.
Looks like you've got an encoding problem there: that string appears to be encoded in UTF-8 and interpreted as ASCII four times:
Montréal-Rive-Sud
Montréal-Rive-Sud
Montréal-Rive-Sud
Montréal-Rive-Sud
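To show how this kind of repeated mis-decoding arises and how it can be undone, here is a minimal Python sketch. It assumes the single-byte encoding involved is Latin-1 (the answer says ASCII; Windows-1252 is another common culprit and would need a different codec name). The function names misdecode and repair are hypothetical.

```python
def misdecode(text: str, rounds: int = 1) -> str:
    """Simulate UTF-8 bytes being read as Latin-1, rounds times over."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("latin-1")
    return text


def repair(text: str, rounds: int = 1) -> str:
    """Undo misdecode: re-encode as Latin-1 and decode as UTF-8."""
    for _ in range(rounds):
        text = text.encode("latin-1").decode("utf-8")
    return text


if __name__ == "__main__":
    city = "Montr\u00e9al-Rive-Sud"
    broken = misdecode(city, rounds=4)
    print(len(city), len(broken))
    print(repair(broken, rounds=4))
```

Repairing stored data this way is only a stopgap; the lasting fix is to make every step of the pipeline (feed parser, database connection, table charset) agree on UTF-8 so the text is not re-interpreted at all.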
You need the mysql_real_escape_string()