I'm using Amazon's API to obtain the description of books. The API returns XML responses and the description is marked up (with HTML) very poorly. To deal with this poorly marked up description, which oftentimes breaks the layout of my site, I'm trying to use HTML Tidy to "clean it up."
In order to prevent "weird" characters from being displayed on my web page, I think I need to tell Tidy what the input encoding is and what the desired output encoding is. I know I want the output to be UTF-8. However, I'm not sure how to determine the encoding of the input (Amazon's book description).
I've tried something like this:
mb_detect_encoding($amazon_description);
It's helped, but I'm still occasionally getting weird characters (a black diamond with a question mark in it: �). My guess is that I'm not detecting the encoding properly.
Any suggestions what I need to do?
EDIT:
This is my current solution:
$sanitized_amazon_markup = preg_replace('/[^\w`~!@#$%^&*()-=_+[\]{}|;\':",.\/<>? ]/', '', $sanitized_amazon_markup);
I'm not sure about this, as it may delete stuff that I should be keeping.
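If the goal is only to remove invalid byte sequences rather than to whitelist characters, here is a gentler sketch (the candidate-encoding list is an assumption about what Amazon is likely to send):
// Detect with an explicit, ordered candidate list and strict mode;
// mb_detect_encoding() returns the first candidate the string is valid in.
$charset = mb_detect_encoding($amazon_description, array('UTF-8', 'ISO-8859-1', 'Windows-1252'), true);
if ($charset !== false && $charset !== 'UTF-8') {
    $amazon_description = mb_convert_encoding($amazon_description, 'UTF-8', $charset);
}
// Drop any remaining invalid UTF-8 sequences instead of whitelisting ASCII;
// //IGNORE tells iconv to skip bytes it cannot represent.
$sanitized_amazon_markup = iconv('UTF-8', 'UTF-8//IGNORE', $amazon_description);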
Can you provide your tidy repairString call?
If you tried to use the input-encoding and output-encoding Tidy options, try not using them and use the third argument of repairString instead, something like this:
$oTidy = new tidy();
$page_content = $oTidy->repairString($page_content,
    array("show-errors" => 0, "show-warnings" => false),
    "utf8"
);
Edit:
After doing some tests, I realised that what I said above cannot work unless $page_content is already UTF-8 encoded before calling repairString.
If it isn't UTF-8 already, you will most often be dealing with ISO-8859-1 (latin1).
May I suggest you try:
$charset = mb_detect_encoding($amazon_description, 'UTF-8, ISO-8859-1');
if ($charset == "ISO-8859-1") {
    $amazon_description = utf8_encode($amazon_description);
}
$oTidy = new tidy();
$amazon_description = $oTidy->repairString($amazon_description,
    array("show-errors" => 0, "show-warnings" => false),
    "utf8"
);
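Putting the two steps together, a hypothetical helper along the same lines (the function name is mine, not from the thread):
// Normalise suspected latin1 input to UTF-8, then let Tidy repair the
// markup, emitting UTF-8.
function clean_description($html)
{
    if (mb_detect_encoding($html, 'UTF-8, ISO-8859-1') == 'ISO-8859-1') {
        $html = utf8_encode($html);
    }
    $oTidy = new tidy();
    return $oTidy->repairString($html,
        array('show-errors' => 0, 'show-warnings' => false),
        'utf8'
    );
}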
For some reason my special characters got encoded as the following string in a mysql database:
Ã?
Which shows up as:
Ã?
But actually should show up as:
Ö
What went wrong here? I use UTF-8 everywhere.
How can I fix this without recreating all content?
I executed the following in PHP:
<?php
echo str_replace("&", "&amp;", htmlentities("Ö", 0, "ISO-8859-1")), '<br />';
echo str_replace("&", "&amp;", htmlentities("Ö", 0, "UTF-8")), "<br />";
?>
The str_replace is just there to reveal any HTML mnemonics, which would otherwise
be translated by the browser to the original character, which I don't want to happen.
You will get this as output:
&Atilde;�
&Ouml;
You'll recognise the first value as what you found in the database, and the second one
is a bit like you wanted it to be.
Add to this the fact that the default value for the third argument to htmlentities
depends on your PHP version: it is ISO-8859-1 in the case of version 5.3, the one you use.
Also realise that HTML documents which do not specify a character encoding will
by default post form data in ISO-8859-1 format.
Combining all this might give a clue about the cause of your problem:
My guess is that the data is correctly posted as UTF-8 to the server, but then htmlentities interprets this as a non-UTF-8, single byte encoding, and so turns one, multi-byte character into two single byte characters.
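That mechanism is easy to reproduce; in this sketch, mb_convert_encoding stands in for what htmlentities effectively did to the bytes:
// "Ö" arrives as the two UTF-8 bytes C3 96.
$posted = "Ö";
echo bin2hex($posted), "\n";   // c396
// Treating those bytes as ISO-8859-1 and re-encoding to UTF-8 turns the
// one character into two (Ã plus a control character), doubling the bytes.
$mangled = mb_convert_encoding($posted, 'UTF-8', 'ISO-8859-1');
echo bin2hex($mangled), "\n";  // c383c296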
Now for the measures to take so that this does not keep happening:
First make sure that your HTML form has the UTF-8 encoding, because this determines the
default encoding that a form will use for sending its data to the server:
<head>
<meta charset="UTF-8">
</head>
Make sure this is not overruled by another encoding in the form tag's accept-charset
attribute.
Then, skip the htmlentities call. You should not turn characters into their
HTML mnemonics when storing them in the database. MySQL
supports UTF-8 characters, so just store them as they are.
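To illustrate, a minimal sketch of storing the raw characters over a connection with an explicit charset (host, database, and table names are placeholders):
// An explicit connection charset means no conversion layer sits between
// PHP and MySQL; the UTF-8 bytes are stored exactly as received.
$db = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', $user, $pwd);
$stmt = $db->prepare('INSERT INTO mytable (myfield) VALUES (?)');
$stmt->execute(array('Ö'));   // no htmlentities(), just the character itself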
For the second question, you'll have to find all affected values and bulk-replace them
as you find new instances. You could get a little help by producing some SQL statements
with a PHP script like the following:
<?php
// List all your non-ASCII characters here. Do not use str_split(): it splits bytes, not characters.
$chars = ["Ö","õ","Ũ","ũ"];
foreach ($chars as $ch) {
    $bad = str_replace("&", "&amp;", htmlentities($ch, 0, "ISO-8859-1"));
    echo "update mytable set myfield = replace(myfield, '$bad', '$ch')
        where instr(myfield, '$bad') > 0;<br />";
}
?>
The output of this script will look like this:
update mytable set myfield = replace(myfield, '&Atilde;�', 'Ö') where instr(myfield, '&Atilde;�') > 0;
update mytable set myfield = replace(myfield, '&Atilde;&micro;', 'õ') where instr(myfield, '&Atilde;&micro;') > 0;
update mytable set myfield = replace(myfield, '&Aring;&uml;', 'Ũ') where instr(myfield, '&Aring;&uml;') > 0;
update mytable set myfield = replace(myfield, '&Aring;&copy;', 'ũ') where instr(myfield, '&Aring;&copy;') > 0;
Of course, you could decide to make a PHP script that will even do the updates itself.
Hopefully you can use this information to fix the issues.
For PDO, use something like
$db = new PDO('mysql:host=host;dbname=db;charset=utf8', $user, $pwd);
Ã? is two or three things going wrong, not just one!
C396 is the utf8 hex for Ö, or the latin1 hex for the two characters Ã–. It requires something else to go wrong to get ? or the black diamond.
Let's see what is in the table; do
SELECT col, HEX(col) FROM tbl WHERE ...
(If you have already done the previously suggested replace(), then the table may be in an even worse mess. Or it might be fixed.)
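If the hex check confirms double-encoding (for example C383C296 where C396 was intended), a commonly used repair, worth testing on a backup copy of the table first, is the latin1 round-trip:
-- Reinterpret the stored text as latin1 bytes, then re-read those bytes
-- as utf8; this undoes one layer of double-encoding.
UPDATE tbl SET col = CONVERT(CAST(CONVERT(col USING latin1) AS BINARY) USING utf8);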
I'm trying to "decode" an XML file (and transform it with XSLT), but I'm having trouble decoding both files. The scenario is as follows:
I have a site for data entry which is all encoded in ISO-8859-1 (our Oracle database is in that format, so I can't change it). The problem is that I have those two files (an XML to show the data entry form and an XSLT to transform it into HTML). Both files are saved in ISO-8859-1 encoding, and both have the corresponding header, i.e. <?xml version="1.0" encoding="ISO-8859-1"?>. Whenever I read the files and show them in the browser, the special characters (ñ, á, ¿) are shown either as mis-decoded UTF-8 or as a question mark (depending on the method I use for showing), but never in their "normal" representation.
My code for showing the XML file is:
<?php
$xslString = file_get_contents("catalog.xsl");
$xslString = utf8_decode($xslString);
$xslDoc = simplexml_load_string($xslString);
$xmlString = file_get_contents("questionnaire.xml");
$xmlString = utf8_decode($xmlString);
$xmlDoc = simplexml_load_string($xmlString);
$proc = new XSLTProcessor();
$proc->importStylesheet($xslDoc);
?>
I already tried several combinations of DOMDocument, iconv, and mb_convert_encoding, but they show the XML either as mis-decoded UTF-8, as a question mark, or as a double question mark.
On the other hand, this also messes up my data entry: if I enter one of those characters, it either shows up as ? or ?? in the corresponding field in the DB, or it gets truncated at the first special char (if I use iconv).
What am I missing? Is there a workaround? I can't convert anything to UTF-8 because of the database.
I hope I'm being clear enough, please excuse my English.
Thanks in advance!
Hope this helps others. In the end, there were two things:
1) I was reading the XML/XSL files like this (in my original script):
<?php
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xmlFile);   // or, alternatively:
$xmlDoc->load("xmlfile.xml");
?>
which effectively changed the encoding to UTF-8. I changed the lines to:
<?php
$xmlString = file_get_contents("xmlfile.xml");
$xmlDoc = simplexml_load_string($xmlString);
?>
2) I removed the utf8_decode statement. After that, it worked like a charm: I now get my special chars on screen as intended. As a side effect, the data entered in the form is now saved correctly to my database, so I killed two birds with one stone.
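For completeness, a sketch of the full working pipeline under that approach; the file names are those from the question, and the ISO-8859-1 output header is an assumption based on the site's encoding:
<?php
// libxml reads each file's encoding="ISO-8859-1" declaration and converts
// internally, so no manual utf8_decode()/utf8_encode() calls are needed.
$xslDoc = simplexml_load_string(file_get_contents("catalog.xsl"));
$xmlDoc = simplexml_load_string(file_get_contents("questionnaire.xml"));
$proc = new XSLTProcessor();
$proc->importStylesheet($xslDoc);
// The output encoding follows <xsl:output encoding="..."> in the stylesheet;
// declare the same charset to the browser.
header("Content-Type: text/html; charset=ISO-8859-1");
echo $proc->transformToXML($xmlDoc);
?>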
I'm building a PHP web application, and it works in UTF-8. The database is UTF-8, the pages are served as UTF-8 and I set the charset using a meta tag to UTF-8. Of course, with users using Internet Explorer, and copying & pasting from Microsoft Office, I somehow manage to get not UTF-8 input occasionally.
The ideal solution would be to throw an HTTP 400 Bad Request error, but obviously I can't do that. The next best thing is converting $_GET, $_POST and $_REQUEST to UTF-8. Is there any way to see what character encoding the input is in so I can pass it off to iconv? If not, what's the best solution for doing this?
Check out mb_detect_encoding(). Example:
$utf8 = iconv(mb_detect_encoding($input), 'UTF-8', $input);
There's also utf8_encode() if you can guarantee that the input string is ISO-8859-1.
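One caveat with that one-liner: mb_detect_encoding() can return false, and iconv() will then fail with a warning. A slightly more defensive variant (the candidate list is an assumption):
// Use strict mode and an explicit, ordered candidate list, and fall back
// to ISO-8859-1 when detection fails outright.
$enc = mb_detect_encoding($input, array('UTF-8', 'ISO-8859-1', 'Windows-1251'), true);
$utf8 = iconv($enc !== false ? $enc : 'ISO-8859-1', 'UTF-8', $input);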
In some cases, using just utf8_encode or generic checks is OK, but you might lose some characters within the string. If you can build a basic list of candidate encodings, Windows code pages in this example, you can salvage quite a bit more.
if (!mb_detect_encoding($fileContents, "UTF-8", true)) {
    $checkArr = array("windows-1252", "windows-1251");
    $encodeString = '';
    foreach ($checkArr as $encode) {
        if (mb_check_encoding($fileContents, $encode)) {
            $encodeString .= $encode . ",";
        }
    }
    $encodeString = substr($encodeString, 0, -1);
    $fileContents = mb_convert_encoding($fileContents, "UTF-8", $encodeString);
}
I have a PHP application using Gettext as the i18n engine. The translation works fine, the only problem is that I'm having encoding issues with UTF8 characters. My PHP code to load gettext is something like this:
bindtextdomain( $domain, PATH_BASE . DS . "language" . DS );
$this->utf8Encode = strtolower($encoding) == "utf-8";
bind_textdomain_codeset($domain, $encoding);
textdomain($domain);
My templates render the pages using the UTF-8 charset, and I've tried just about everything to load the proper charset. For the current locale I'm loading SL_sl; the names appear correctly, but there are issues with UTF-8 chars, so where Država should appear, Dr?ava shows up instead.
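(For reference: the usual way to pin gettext's output encoding, independent of the locale's default codeset, is something like the sketch below; the sl_SI.utf8 locale name is an assumption and must actually be installed on the server.)
// Pin the locale, then force gettext to hand back UTF-8 regardless of the
// codeset the system locale would otherwise use.
putenv('LC_ALL=sl_SI.utf8');
setlocale(LC_ALL, 'sl_SI.utf8');
bindtextdomain($domain, PATH_BASE . DS . 'language' . DS);
bind_textdomain_codeset($domain, 'UTF-8');
textdomain($domain);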
So, it has happened before, and now it has happened again: I found the solution myself! The problem was that, like I said to @bozdoz, I was already converting UTF-8 text, but I didn't realize that the gettext function returned a UTF-8 string. So if you do this:
$encoded = utf8_encode($utf8String);
Then you'll have a really nasty bug when $utf8String is an actual UTF-8 string. Therefore I made some modifications to my code, and the translation method (simplified) ended up like this:
$translation = gettext($singular);
$encoded = $this->utf8Encode ? $this->Utf8Encode($translation) : $translation;
And the Utf8Encode method is like this:
private function Utf8Encode($text)
{
    if (mb_check_encoding($text, "utf8") == TRUE) {
        return $text;
    }
    return utf8_encode($text);
}
I hope that if somebody has the same error this can help!
From the partial information, I can only suggest you take a look at the actual .mo/.po files; in Poedit there are several warnings about UTF-8 encoding. Assuming that everything else is correct (meta, headers, etc.), it's the only thing left to check.
Try encoding it with utf8_encode(). I can't really tell from your code, but perhaps it could be implemented like this:
utf8_encode($domain);
I am scraping a list of RSS feeds by using cURL, and then I am reading and parsing the RSS data with SimpleXML. The sorted data is then inserted into a mySQL database.
However, as you can see at http://dansays.co.uk/research/MNA/rss.php, I am having several issues with characters not displaying correctly.
Examples:
âGuitar Hero: Van Halenâ Trailer And Tracklist Available
NV 10/10/09 – Salt Lake City, UT 10/11/09 – Denver, CO 10/13/09 –
I have tried using htmlentities and htmlspecialchars on the data before inserting it into the database, but that doesn't seem to resolve the issue.
How could I resolve this issue?
Thanks for any advice.
Updated
I've tried what Greg suggested, and the issue is still here...
Here is the code I used to do SET NAMES in PDO:
$dbh = new PDO($dbstring, $username, $password);
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$dbh->query('SET NAMES "utf8"');
I did a bit of echoing with the SimpleXML data before it is sorted and inserted into the database, and I now believe it is something to do with cURL...
Here is what I have for cURL:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // must be 1 so curl_exec() returns the body into $data
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, 'UTF-8');
$data = curl_exec($ch);
curl_close($ch);
$doc = new SimpleXmlElement($data, LIBXML_NOCDATA);
Issue Resolved
I had to set the content charset in the RSS/HTML page to "UTF-8" to resolve this issue. I guess this isn't a real fix as the char problems are still there in the raw data. Looking forward to proper support for it in PHP6!
Your page is being served as UTF-8 so I'd point my finger at the database.
Make sure the connection is in UTF-8 before any SELECTs or INSERTS - in MySQL:
SET NAMES "utf8"
Just a quick note about CURLOPT_ENCODING: it sets the Accept-Encoding header, which is not the same thing at all as character encoding. Supported accept-encodings are "identity", "deflate", and "gzip".
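If the intent was simply to let curl negotiate and decompress gzip/deflate transparently, the conventional call is an empty string, which advertises every encoding the curl build supports:
// Accept-Encoding negotiation only; this does not touch the character set.
curl_setopt($ch, CURLOPT_ENCODING, '');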
Like all debugging, you start by isolating the problem:
"I am scraping a list of RSS feeds by using cURL" - look at the XML from the RSS feed that's giving the problem (there's more than one feed, so it's possible for some feeds to be right and for the feeds that are wrong to be wrong in different ways).
"and then I am reading and parsing the RSS data with SimpleXML" - print out the field that SimpleXML read - is it OK, or does a problem show up?
"The sorted data is then inserted into a mySQL database" - print out hex(field), length(field), and char_length(field) for the piece of data that's giving the problem.
EDIT
Take the feed http://hangout.altsounds.com/external.php?type=RSS2 and put it into the validator http://validator.w3.org/feed/ . They're declaring their content type as iso-8859-1, but some of the actual content, such as the quotes, is in something like cp1252: for example, they're using the byte 0x93 to represent the left quote - see http://www.fileformat.info/info/unicode/char/201C/charset_support.htm
What's annoying about this is that this doesn't show up in some tools - Firefox seems to guess what's going on and show the quotes correctly, and more to the point, SimpleXML converts the 0x93 into utf8, so it comes out as 0xc293, which exacerbates the problem.
EDIT 2
A workaround to get that feed to read a bit more correctly is to replace "ISO-8859-1" with "Windows-1252" before passing it to SimpleXML. It won't work 100%, because it turns out that some parts of the feed are in UTF-8.
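A sketch of that workaround, assuming the prologue spells the encoding exactly as shown in the feed:
// Relabel the declared encoding so libxml decodes bytes 0x80-0x9F as
// cp1252 punctuation (curly quotes and dashes) instead of control characters.
$data = str_replace('encoding="ISO-8859-1"', 'encoding="Windows-1252"', $data);
$doc = new SimpleXmlElement($data, LIBXML_NOCDATA);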
The general approach, assuming that you can't get everyone in the world to correct their feeds, is to isolate whatever workarounds you require at the interface with the external system that's emitting the malformed data, and to pass pure, clean UTF-8 into the hub of your system. Save a dated copy of the raw external feed so you can remember in future why the workaround was required, separate off and comment the code lines that implement the workaround so it's easy to get at and change if and when the external organisation corrects its feed (or breaks it in a different way), and check it again from time to time. Unfortunately, instead of programming to a spec you're programming to the current state of a bug, so there's no permanent, clean solution - the best you can do is isolate, document, and monitor.
It may have to do with the XML prologue, which looks like this for that particular feed you linked to:
<?xml version="1.0" encoding="ISO-8859-1" ?>
As far as I know, libxml, on which SimpleXML is based, looks for this kind of thing. I'm not sure about XML files, but I'm sure that with HTML strings it looks for META elements that specify the charset.
Try stripping the XML prologue (I solved a similar problem once by stripping the HTML META tags) and don't forget to utf8_encode() the data before feeding it to SimpleXMLElement.