html2text result deletes some special characters


I am trying to display a message using the html2text class. The result is encoded in UTF-8; the only problem is that in some cases characters are deleted from the words.
Example: instead of n'hésitez I get nhsitez. Here is my code:
$h2t = new html2text($leMessage);
$altBody = $h2t->get_text();
logMessagePreformate($id_dossier, utf8_decode($sujet),$altBody, $pour1, $pour2);
I tried utf8_encode and mb_convert_encoding, but neither worked. Any suggestions?

For those who face the same problem: I added html_entity_decode() to my code to decode the data I send to the database:
$h2t = new html2text(html_entity_decode($leMessage));
Then, to display it, I used:
mb_convert_encoding($h2t->get_text(), "HTML-ENTITIES", 'UTF-8')
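The idea of decoding entities before stripping the markup can be tried in isolation. The sketch below is only an illustration: htmlToPlainText is a hypothetical helper, and strip_tags() stands in for the html2text class, whose internals are not shown in this thread.

```php
<?php
// Hypothetical helper: decode HTML entities first, then remove the markup.
// strip_tags() is only a stand-in for the html2text class used above.
function htmlToPlainText(string $html): string
{
    // ENT_QUOTES also decodes &#39; and &quot;; the result is UTF-8.
    $decoded = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim(strip_tags($decoded));
}

echo htmlToPlainText('<p>n&#39;h&eacute;sitez pas</p>'), "\n";
```

One caveat of decoding first: text such as &lt;b&gt; becomes a real tag after decoding and is then stripped as well.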

Related

DOMDocument and UTF8. MySQL says: Incorrect string value

I am trying to load the meta description of this website (which has a German character) via the following script in PHP:
$page_content = file_get_contents($uri);
$dom_obj = new \DOMDocument();
$dom_obj->loadHTML(mb_convert_encoding($page_content, 'HTML-ENTITIES', 'UTF-8'));
However, when I try to write it into the MySQL database, Laravel reports an error: incorrect string value "\xC3" (the first byte of the German character).
When I simply do the following, writing to the database works, but the character is not displayed correctly (Ã¼ instead of ü):
$dom_obj->loadHTML($page_content)
So far this problem only occurs with this website; others I tried that contain the same character work. Can you think of a possible reason and a fix? Thank you!
Edit:
It works fine when I use PHP's "utf8_decode" to decode the meta description that I get via $dom_obj without mb_convert_encoding. But when I do this, all the other sites that worked before lead to errors (like this: Incorrect string value: '\xE4t').

I found the error. I was using substr to shorten the description. Apparently substr cut one of those special characters in half, which is why it wasn't working.
foreach ($dom_obj->getElementsByTagName('meta') as $meta) {
    if ($meta->getAttribute('name') == 'description') {
        substr($meta->getAttribute('content'), 0, 156);
    }
}
This is a workaround:
mb_substr($foo, 0, 156, "UTF-8");
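To see why the byte-based cut produces the "\xC3" error, here is a minimal sketch using a short sample string (the string and variable names are illustrative only). In UTF-8, "ü" takes two bytes, so substr can stop between them and leave an invalid sequence behind.

```php
<?php
// "für" in UTF-8: the "ü" is the two bytes 0xC3 0xBC.
$text = "f\xC3\xBCr";

$byteCut = substr($text, 0, 2);             // stops after byte 2, splitting "ü"
$charCut = mb_substr($text, 0, 2, 'UTF-8'); // stops after 2 whole characters

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false): dangling 0xC3
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
```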

Decoding XML from UTF-8 to ISO-8859-1 in PHP

I'm trying to "decode" an XML file (and transform it with XSLT), but I'm having trouble decoding both files. The scenario is as follows:
I have a data entry site that is entirely encoded in ISO-8859-1 (our Oracle database uses that encoding, so I can't change it). The problem is that I have two files (an XML file describing the data entry form and an XSLT file to transform it into HTML). Both files are saved in ISO-8859-1 and both declare that encoding in their header, yet whenever I read the files and show them in the browser, the special characters (ñ, á, ¿) are shown either as raw UTF-8 bytes or as question marks (depending on the method I use to display them), but never in their "normal" representation.
My code for showing the XML file is:
<?php
$xslString = file_get_contents("catalog.xsl");
$xslString = utf8_decode($xslString);
$xslDoc = simplexml_load_string($xslString);
$xmlString = file_get_contents("questionnaire.xml");
$xmlString = utf8_decode($xmlString);
$xmlDoc = simplexml_load_string($xmlString);
$proc = new XSLTProcessor();
$proc->importStylesheet($xslDoc);
?>
I already tried several combinations of DOMDocument, iconv, mb_convert_encoding, but they show the XML file as unencoded UTF, a question mark or a double question mark.
On the other hand, this also messes up my data entry, since if I want to enter one of those characters, they either show as ? or ?? on the corresponding data field on the DB, or they get truncated at the first special char (if I use iconv).
What am I missing? Is there a workaround? I can't convert anything to UTF-8 because of the database.
I hope I'm being clear enough, please excuse my English.
Thanks in advance!

Hope this helps others. In the end, there were two things:
1) I was reading the XML/XSL files like this (in my original script):
<?php
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xmlFile);
$xmlDoc->load("xmlfile.xml");
?>
which effectively changed the encoding to UTF-8. I changed the lines to:
<?php
$xmlString = file_get_contents("xmlfile.xml");
$xmlDoc = simplexml_load_string($xmlString);
?>
removing the utf8_decode statement, and it worked like a charm. Now my special characters appear on screen as intended. As a side effect, the data entered in the form is now saved correctly to my database, so I killed two birds with one stone.
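As background, the libxml-based parsers in PHP (SimpleXML, DOMDocument) honour the encoding declared in the XML header when reading, but hand string values back to PHP as UTF-8. The sketch below illustrates this with a made-up one-element document; whether an explicit conversion back to ISO-8859-1 is needed depends on how the output is produced (for instance, an XSLT stylesheet can declare its own output encoding).

```php
<?php
// A tiny ISO-8859-1 document: byte 0xF1 is "ñ" in that encoding.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<q>\xF1</q>";

$doc = simplexml_load_string($xml);

// The value comes back as UTF-8, whatever the source file declared.
$utf8Value = (string) $doc;

// Convert explicitly when the value must go to an ISO-8859-1 page or database.
$latin1Value = mb_convert_encoding($utf8Value, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8Value), "\n";   // c3b1
echo bin2hex($latin1Value), "\n"; // f1
```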

html decimal coded string

I'm parsing HTML from a website using simplehtmldom_1_5. When I echo the parsed text to the screen it is printed correctly, but when I try to save it to a file using file_put_contents, my string is written as HTML decimal character codes:
&#40&#98&#46&#32&#97&#110&#100&#101&#114&#115&#115&#111&#110&#44&#32
I've already tried every combination of utf8_encode, utf8_decode, htmlentities... but nothing worked. I have the same problem when I try to insert the text into a MySQL table.
mb_detect_encoding returns ASCII for the parsed text.
Any suggestions?
header('Content-Type: text/html; charset=utf-8');
ini_set('max_execution_time', 0);
include 'simplehtmldom_1_5/simple_html_dom.php';
$html = file_get_html($curr_url);
$texts = $html->find('div[id=content_h]');
foreach($texts as $text) {
file_put_contents('queries.txt', $text->innertext . "\n", FILE_APPEND);
}

Did you also try html_entity_decode ( http://de1.php.net/html_entity_decode )?
That's the function that converts entities back to plain text.
*edit
I just tested this to verify it's working.
Yes, it works, BUT:
your data is incorrect!
Every single entity is missing the semicolon at its end!
That's why decoding only works in lenient browser rendering engines...
Your data should look like this:
&#40;&#98;&#46;
and not like this:
&#40&#98&#46
See the difference?

Finally this worked for me
preg_replace('/&#(\d+)/me',"chr(\\1)", $text)
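Note that the /e modifier used above was deprecated in PHP 5.5 and removed in PHP 7.0, and that chr() only covers single-byte values. A possible alternative, sketched here under the assumption that the text contains only decimal references with missing semicolons, is to add the semicolons and let html_entity_decode do the rest. The function name is hypothetical.

```php
<?php
// Hypothetical helper, not part of simplehtmldom.
function decodeLooseNumericEntities(string $text): string
{
    // Append the missing ";" to decimal references such as "&#40".
    $fixed = preg_replace('/&#(\d+);?/', '&#${1};', $text);

    // Well-formed references can now be decoded to UTF-8 characters.
    return html_entity_decode($fixed, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decodeLooseNumericEntities('&#40&#98&#46&#32&#97&#110&#100'), "\n"; // (b. and
```

A reference followed directly by literal digits stays ambiguous in the broken input, so this is a best-effort repair.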

php ActiveRecord and json_encode æøå encoding issue

In my own opinion I have now tried everything there is for this encoding problem and looked through a lot of answered questions, but nothing worked for me, so here I go.
I have a MySQL database with a Users table. This table has a "firstname" column whose collation is set to utf8_general_ci (as are all varchar columns). I have inserted a row where the firstname column is set to "Løw", with the Scandinavian special character "ø".
I use the php-ActiveRecord library, with the connection string set to ";charset=utf8", to retrieve the row and then output the user as JSON, like so:
$user = User::find($ID);
$userArr = $user->to_array();
header('Content-Type: application/json; charset=utf-8');
print(json_encode($userArr));
Now the weird things start. The firstname is now NOT "Løw" as displayed in the MySQL database, but "L\u00f8w". I then tried to see whether this was also the case without the json_encode function, like so:
$user = User::find($ID);
$userArr = $user->to_array();
header('Content-Type: text/plain; charset=utf-8');
print_r($userArr);
But here the output was correct: firstname was "Løw". I then tried to encode the fields in the array as UTF-8, since everybody told me that it should work if the strings were UTF-8, like so:
$return[] = array_map('utf8_encode', $userArr);
print_r(json_encode($return));
But this gave me "L\u00c3\u00b8w", so that didn't work. Since I was out of ideas, I then tried utf8_decode:
$return[] = array_map('utf8_decode', $userArr);
print_r(json_encode($return));
But that made the string come back as "null". I then checked which encoding my variables had when they came out of the database, like so:
header('Content-Type: text/plain; charset=utf-8');
print(mb_detect_encoding($userArr['firstname']));
But this returned UTF-8.
So as you can hopefully see, I have tried everything and I still don't know why json_encode changes the "ø" character to "\u00f8". Please help, I don't want to write my own json_encode method.

OK, I found an answer pretty quickly, but I'll let other Scandinavian people know, since I couldn't find anything on the subject.
I solved the problem by adding the following flag to the json_encode call:
print(json_encode($userArr,JSON_UNESCAPED_UNICODE));
This tells the function NOT to escape Unicode characters (I think), or as the PHP documentation puts it:
JSON_UNESCAPED_UNICODE (integer)
Encode multibyte Unicode characters literally (default is to escape as
\uXXXX). Available since PHP 5.4.0.
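It may help to know that the escaped form is still valid JSON: any conforming parser turns \u00f8 back into "ø", so the flag mostly matters for readability and payload size. A small sketch with a sample array (the variable names are illustrative):

```php
<?php
// "Løw" written as explicit UTF-8 bytes so the example does not depend
// on the encoding of this source file.
$user = ['firstname' => "L\xC3\xB8w"];

$escaped = json_encode($user);                         // {"firstname":"L\u00f8w"}
$literal = json_encode($user, JSON_UNESCAPED_UNICODE); // {"firstname":"Løw"}

// Decoding the escaped form gives back the original UTF-8 string.
$roundTrip = json_decode($escaped, true);

var_dump($roundTrip['firstname'] === $user['firstname']); // bool(true)
```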

Error in encoding mysql -> How can I reconvert it to something else?

I started a website some time ago using the wrong CHARSET in my DB and site. The HTML was set to ISO..., the DB to Latin..., and the pages were saved as Western Latin... a big mess.
The site is in French, so I created a function that replaced all accented characters like "é" with "&eacute;", which solved the issue temporarily.
I have since learned a lot more about programming: my files are now saved as Unicode UTF-8, the HTML is in UTF-8 and my MySQL table columns are set to UTF-8...
I tried to change the accents back to "é" instead of "&eacute;", but I get the usual charset issues with (?) or weird characters like "Ã¢", both in MySQL and when the page is displayed.
I need a way to update my SQL data through a function that cleans the strings so that everything finally goes back to normal. At the moment my function looks like this, but it doesn't work:
function stripAcc3($value){
    $ent = array(
        '&agrave;'=>'à',
        '&acirc;'=>'â',
        '&ugrave;'=>'ù',
        '&ucirc;'=>'û',
        '&eacute;'=>'é',
        '&egrave;'=>'è',
        '&ecirc;'=>'ê',
        '&ccedil;'=>'ç',
        '&Ccedil;'=>'Ç',
        "&icirc;"=>'î',
        "&Iuml;"=>'ï',
        "&ouml;"=>'ö',
        "&ocirc;"=>'ô',
        "&euml;"=>'ë',
        "&uuml;"=>'ü',
        "&Auml;"=>'ä',
        "&euro;"=>'€',
        "&prime;"=> "'",
        "Ã©"=> "é"
    );
    return strtr($value, $ent);
}
Any help welcome. Thanks in advance. If you need code, please tell me which part.
UPDATE
If you want the bounty points, I need detailed instructions on how to do it. Thanks.

Try using the following function instead, it should handle all the issues you described:
function makeStringUTF8($data)
{
    if (is_string($data) === true)
    {
        // has html entities?
        if (strpos($data, '&') !== false)
        {
            // if so, revert back to normal
            $data = html_entity_decode($data, ENT_QUOTES, 'UTF-8');
        }
        // make sure it's UTF-8
        if (function_exists('iconv') === true)
        {
            // @ silences the notice iconv raises on invalid byte sequences
            return @iconv('UTF-8', 'UTF-8//IGNORE', $data);
        }
        else if (function_exists('mb_convert_encoding') === true)
        {
            return mb_convert_encoding($data, 'UTF-8', 'UTF-8');
        }
        return utf8_encode(utf8_decode($data));
    }
    else if (is_array($data) === true)
    {
        $result = array();
        foreach ($data as $key => $value)
        {
            $result[makeStringUTF8($key)] = makeStringUTF8($value);
        }
        return $result;
    }
    return $data;
}
As for specific instructions on how to use this, I suggest the following:
1) export the contents of your old latin database (I hope you still have it) as an SQL/CSV dump
2) run the above function on the file contents and save the result to another file *
3) import the file you generated in the previous step into the UTF-8 aware schema / database
* Example:
file_put_contents('utf8.sql', makeStringUTF8(file_get_contents('latin.sql')));
This should do it. If it doesn't, let me know.

You might want to investigate what is used to fix WP database encoding issues:
http://codex.wordpress.org/Converting_Database_Character_Sets
To cut a long story short, most old WP sites were created with Swedish/Latin1 collated tables that were used to store UTF-8 strings. To collate the tables properly, the approach is to change the column to a binary type first, and then change it to UTF-8 text.
This keeps the text from being mangled, which is what happens when converting from Latin1 to UTF-8 directly.
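The two-step conversion can be sketched as a pair of ALTER TABLE statements. The helper below only builds the SQL strings; the function name, the table and column names, and the BLOB/TEXT types are assumptions for illustration, and the real types have to match the column being converted (for example VARBINARY and VARCHAR for a varchar column). Back up the table before running anything like this.

```php
<?php
// Hypothetical helper: builds the two statements for the
// "latin1 column -> binary -> utf8 text" round trip.
function buildBinaryRoundTripSql(string $table, string $column, string $charset = 'utf8'): array
{
    return [
        // Step 1: a binary type drops the charset label without touching the bytes.
        "ALTER TABLE `$table` MODIFY `$column` BLOB",
        // Step 2: relabel the same bytes as text in the target character set.
        "ALTER TABLE `$table` MODIFY `$column` TEXT CHARACTER SET $charset",
    ];
}

foreach (buildBinaryRoundTripSql('posts', 'content') as $statement) {
    echo $statement, ";\n";
}
```

This only helps when the stored bytes are already UTF-8 in a column labelled latin1; genuine Latin1 bytes need a real conversion instead.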

You will need to convert the offending rows using, for example, iconv. The challenge for you will be to know which rows are already UTF-8 and which are Latin-1.
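One way to tell the two kinds of rows apart, sketched here with a hypothetical helper, is to test each value for valid UTF-8 and convert only the values that fail. It assumes every non-UTF-8 value is ISO-8859-1, and it cannot detect text that was already double-encoded, because such text is still valid UTF-8.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, treat everything else
// as ISO-8859-1 (an assumption about the legacy data) and convert it.
function ensureUtf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex(ensureUtf8("\xE9")), "\n";     // Latin-1 "é" becomes c3a9
echo bin2hex(ensureUtf8("\xC3\xA9")), "\n"; // UTF-8 "é" stays c3a9
```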

I'm not completely sure I understand your question, but if you have
- a UTF-8 database
- all special characters in it stored as HTML entities
then
html_entity_decode($string, ENT_QUOTES, "UTF-8");
should do the trick and turn all entities back into their native UTF-8 characters.

Make sure that not just your tables use UTF-8; your database connection should use UTF-8 as well.
$this->db = mysql_connect(MYSQL_SERVER,DB_LOGIN,DB_PASS);
mysql_set_charset ('utf8',$this->getConnection());

If you want to talk to your database in UTF-8, you have to tell the database that the connection uses UTF-8. Send the following query right after connecting, before any other query:
"SET NAMES utf8";
Personally, I put it in the connect.inc.php file that creates the database connection. With this statement the database knows that you are sending UTF-8 encoded strings, and it works perfectly!
In my experience the mysql_set_charset function does not work well; I tried it in the past and it did not do the trick.
As for your whole problem: to convert a latin1 string to UTF-8, first convert the latin1 string to a binary string type, then convert the binary string to a UTF-8 string. All of this can be done inside the database with SQL commands. See this article (in French): http://www.noidea.ca/2009/06/15/comment-convertir-une-db-de-latin1-a-utf8/
I can tell you that this method works because I used it to convert data in a database I created.
