Zend_Cache And UTF-8 Problem - php

I'm trying to save UTF-8 characters with Zend_Cache (like Ť, š etc) but Zend_Cache is messing them up and saves them as Å, ¾ and other weird characters.
Here is a snippet of my code that saves the data to the cache (the UTF-8 characters are messed up only online, when I try it on my PC on localhost it works ok):
// cache the external data
$data = array('nextRound' => $nextRound,
'nextMatches' => $nextMatches,
'leagueTable' => $leagueTable);
$cache = Zend_Registry::get('cache');
$cache->save($data, 'externalData');
Before I save the cached data, I purify it with HTMLPurifier and do some parsing with DOM, something like this:
// fetch the HTML from external server
$html = file_get_contents('http://www.example.com/test.html');
// purify the HTML so we can load it with DOM
include BASE_PATH . '/library/My/htmlpurifier-4.0.0-standalone/HTMLPurifier.standalone.php';
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Doctype', 'XHTML 1.0 Strict');
$purifier = new HTMLPurifier($config);
$html = $purifier->purify($html);
$dom = new DOMDocument();
// hack to preserve UTF-8 characters
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
$dom->preserveWhiteSpace = false;
// some parsing here
Here is how I initialize Zend_Cache in the bootstrap file:
protected function _initCache()
{
$frontend= array('lifetime' => 7200,
'automatic_serialization' => true);
$backend= array('cache_dir' => 'cache');
$this->cache = Zend_Cache::factory('core',
'File',
$frontend,
$backend);
}
Any ideas? It works on localhost (where I have support for the foreign language used in the HTML) but not on the server.

I had a similar problem with an FPDF deployment. There, the HTML non-breaking space entity (&nbsp;) was being converted into the same Å character that you're getting here. It worked on my local Windows machine, but not in my Linux server environment.
Try this:
$str = iconv('UTF-8', 'windows-1252', html_entity_decode($str));
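Before converting anything, it can help to check what the bytes actually are at the point where they reach the cache. The sketch below is not part of the original answer: ensureUtf8 is a hypothetical helper, and the ISO-8859-1 fallback is only an assumption about what the data contains when it is not valid UTF-8. It requires the mbstring extension.

```php
<?php
// Hypothetical helper: return $text unchanged when it is already valid UTF-8,
// otherwise convert it from an assumed fallback encoding.
function ensureUtf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

// "Ťš" given as UTF-8 bytes: already valid, so it is returned as-is.
$alreadyUtf8 = ensureUtf8("\xC5\xA4\xC5\xA1");

// "café" given as ISO-8859-1 bytes: not valid UTF-8, so it is converted.
$fromLatin1 = ensureUtf8("caf\xE9");

echo bin2hex($alreadyUtf8), "\n", bin2hex($fromLatin1), "\n";
```

Running such a check on the data right before $cache->save() on both machines would show whether the bytes already differ before Zend_Cache is involved.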

Related

Library Mpdf (php): Set utf-8 and use WriteHTML with utf-8

I need help with the php Mpdf library. I am generating content for a pdf, it is in a div tag, and sent by jquery to the php server, where Mpdf is used to generate the final file.
In the generated PDF file the UTF-8 characters come out wrong, for example "generaciÃ³n" instead of "generación".
I detail how they are implemented:
HTML
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Sending content for pdf (jquery)
$('#pdf').click(function() {
$.post(
base_url,
{
contenido_pdf: $("#div").html(),
},
function(datos) {
}
);
});
Reception content (PHP)
$this->pdf = new mPDF();
$this->pdf->allow_charset_conversion = true;
$this->pdf->charset_in = 'iso-8859-1';
$contenido_pdf = $this->input->post('contenido_pdf');
$contenido_pdf_formateado = mb_convert_encoding($contenido_pdf, 'UTF-8', 'windows-1252');
$this->m_pdf->pdf->WriteHTML($contenido_pdf_formateado);
Other tested options:
1.
$this->pdf->charset_in = 'UTF-8';
Get error:
Severity: Notice --> iconv(): Detected an illegal character in input string
2.
$contenido_pdf_formateado = mb_convert_encoding($contenido_pdf, 'UTF-8', 'UTF-8');
or
3.
$contenido_pdf_formateado = utf8_encode($contenido_pdf);
Get incorrect characters, like the original case.
What is wrong or what is missing to see the text well? Thanks
Solution
$contenido_pdf_formateado = utf8_decode($contenido_pdf);
$this->m_pdf->pdf->WriteHTML($contenido_pdf_formateado);
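Why re-encoding produces the broken text while utf8_decode works can be reproduced in isolation. This is a sketch under the assumption that the posted string is already valid UTF-8; it uses mb_convert_encoding in place of utf8_encode and utf8_decode, which are deprecated as of PHP 8.2.

```php
<?php
$utf8 = "generaci\xC3\xB3n"; // "generación" encoded as UTF-8

// Treating UTF-8 bytes as ISO-8859-1 and encoding them again (what
// utf8_encode does) double-encodes the text; it then displays as "generaciÃ³n".
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Converting UTF-8 to ISO-8859-1 (what utf8_decode does) yields the single
// byte 0xF3 for "ó", which is what charset_in = 'iso-8859-1' expects.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($doubleEncoded), "\n";
echo bin2hex($latin1), "\n";
```

In other words, the input encoding declared to mPDF and the actual bytes have to agree; either declare UTF-8 and pass UTF-8, or declare ISO-8859-1 and convert first.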
I had used the mode on object creation
$mpdf = new Mpdf(['mode' => 'UTF-8']);
Use this if you are sure that your html is utf-8
A combination of this
$mpdf = new Mpdf(['mode' => 'UTF-8']);
and the below worked for me.
$mpdf->autoScriptToLang = true;
$mpdf->autoLangToFont = true;
The only thing you have to do to activate UTF-8 is to add a default font.
I did this and it worked very well, so try it and let others know whether it is a good solution.
Just add a default font and see:
$mpdf = new \Mpdf\Mpdf([
'default_font_size' => 9,
'default_font' => 'Aegean.otf' ]);

DOMDocument->saveHTML() vs urlencode with commercial at symbol (@)

Using DOMDocument(), I'm replacing links in a $message and adding some things, like [@MERGEID]. When I save the changes with $dom_document->saveHTML(), the links get "sort of" url-encoded. [@MERGEID] becomes %5B@MERGEID%5D.
Later in my code I need to replace [@MERGEID] with an ID. So I search for urlencode('[@MERGEID]') - however, urlencode() changes the commercial at symbol (@) to %40, while saveHTML() has left it alone. So there is no match - '%5B@MERGEID%5D' != '%5B%40MERGEID%5D'
Now, I know I can run str_replace('%40', '@', urlencode('[@MERGEID]')) to get what I need to locate the merge variable in $message.
My question is, what RFC spec is DOMDocument using, and why is it different than urlencode or even rawurlencode? Is there anything I can do about that to save a str_replace?
Demo code:
$message = '<a data-tag="thebottomlink" href="http://www.google.com?ref=abc">Google</a>';
$dom_document = new \DOMDocument();
libxml_use_internal_errors(true); // Suppress content errors
$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'));
$elements = $dom_document->getElementsByTagName('a');
foreach($elements as $element) {
$link = $element->getAttribute('href'); //http://www.google.com?ref=abc
$tag = $element->getAttribute('data-tag'); //thebottomlink
if ($link) {
$newlink = 'http://www.example.com/click/[@MERGEID]?url=' . $link;
if ($tag) {
$newlink .= '&tag=' . $tag;
}
$element->setAttribute('href', $newlink);
}
}
$message = $dom_document->saveHTML();
$urlencodedmerge = urlencode('[@MERGEID]');
die($message . ' and url encoded version: ' . $urlencodedmerge);
//<a data-tag="thebottomlink" href="http://www.example.com/click/%5B@MERGEID%5D?url=http://www.google.com?ref=abc&tag=thebottomlink">Google</a> and url encoded version: %5B%40MERGEID%5D
I believe that those two encodings serve different purposes. urlencode() encodes "a string to be used in a query part of a URL", while $element->setAttribute('href', $newlink); encodes a complete URL to be used as a URL.
For example:
urlencode('http://www.google.com'); // -> http%3A%2F%2Fwww.google.com
This is convenient for encoding the query part, but it cannot be used on <a href='...'>.
However:
$element->setAttribute('href', $newlink); // -> http://www.google.com
will properly encode the string so that it is still usable in href. It cannot encode @ because it cannot tell whether @ is part of the query or part of the userinfo or an email URL (for example: mailto:invisal@google.com or invisal@127.0.0.1).
Solution
Instead of using [@MERGEID], you can use ##MERGEID##. Then, you replace that with your ID later. This solution does not even require you to use urlencode.
If you insist on using urlencode, you can just use %40 instead of @. So your code will look like this: $newlink = 'http://www.example.com/click/[%40MERGEID]?url=' . $link;
You can also do something like $newlink = 'http://www.example.com/click/' . urlencode('[@MERGEID]') . '?url=' . $link;
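The mismatch and the last suggestion can be checked with plain string functions, without any DOM code. This sketch assumes the merge token is '[@MERGEID]' (with the commercial at sign) and that saveHTML() emits '%5B@MERGEID%5D' for it, as reported in the question; the ID value 12345 and the shortened url parameter are made up for illustration.

```php
<?php
$token = '[@MERGEID]';

// urlencode() percent-encodes the brackets and the @ sign.
$encoded = urlencode($token);

// The form reported for saveHTML(): brackets encoded, @ left alone.
$asSeenInHtml = str_replace('%40', '@', $encoded);

// Building the link with the urlencode()d token up front means a later
// search for urlencode($token) matches exactly.
$link  = 'http://www.example.com/click/' . $encoded . '?url=http://www.google.com';
$final = str_replace($encoded, '12345', $link);

echo $encoded, "\n", $asSeenInHtml, "\n", $final, "\n";
```

The design point is to encode the token once, in one place, and search for exactly that encoded form later, instead of trying to mirror whatever the serializer does.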
The urlencode and rawurlencode functions are mostly based on RFC 1738. However, since 2005 the current standard for URIs has been RFC 3986.
On the other hand, the DOM extension uses UTF-8 encoding, which is defined in RFC 3629. Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding, or iconv for other encodings.
The generic URI syntax mandates that new URI schemes that provide for
the representation of character data in a URI must, in effect,
represent characters from the unreserved set without translation, and
should convert all other characters to bytes according to UTF-8, and
then percent-encode those values.
Here is a function to decode URLs according to RFC 3986.
<?php
function myUrlEncode($string) {
$entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40', '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%25', '%23', '%5B', '%5D');
$replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "+", "$", ",", "/", "?", "%", "#", "[", "]");
return str_replace($entities, $replacements, urldecode($string));
}
?>
Update:
Since UTF8 has been used to encode $message:
$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'))
Use urldecode($message) when returning the URL without percents.
die(urldecode($message) . ' and url encoded version: ' . $urlencodedmerge);
The root cause of your problem has been very well explained from a technical point of view.
In my opinion, however, there is a conceptual flaw in your approach, and it created the situation that you are now trying to fix.
By processing your input $message through a DomDocument object, you have moved to a higher level of abstraction. It is wrong to manipulate as a single plain string something that has been "promoted" to an HTML stream.
Instead of trying to reproduce DomDocument's behaviour, use the library itself to locate, extract and replace the values of interest:
$token = 'blah blah [@MERGEID]';
$message = '<a id="' . $token . '" href="' . $token . '"></a>';
$dom = new DOMDocument();
$dom->loadHTML($message);
echo $dom->saveHTML(); // now we have an abstract HTML document
// extract a raw value
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('href');
// do the low-level fiddling
$newstring = str_replace($token, 'replaced', $rawstring);
// push the new value back into the abstract black box.
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', $newstring);
// less code written, but works all the time
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('id');
$newstring = str_replace($token, 'replaced', $rawstring);
$dom->getElementsByTagName('a')->item(0)->setAttribute('id', $newstring);
echo $dom->saveHTML();
As illustrated above, today we are trying to fix the problem when your token is inside a href, but one day we may want to search and replace the tag elsewhere in the document. To account for this case, do not bother making your low-level code HTML-aware.
(an alternative option would be not loading a DomDocument until all low-level replacements are done, but I am guessing this is not practical)
Complete proof of concept:
function searchAndReplace(DOMNode $node, $search, $replace) {
if($node->hasAttributes()) {
foreach ($node->attributes as $attribute) {
$input = $attribute->nodeValue;
$output = str_replace($search, $replace, $input);
$attribute->nodeValue = $output;
}
}
if(!$node instanceof DOMElement) { // this test needs double-checking
$input = $node->nodeValue;
$output = str_replace($search, $replace, $input);
$node->nodeValue = $output;
}
if($node->hasChildNodes()) {
foreach ($node->childNodes as $child) {
searchAndReplace($child, $search, $replace);
}
}
}
$token = '<>&;[@MERGEID]';
$message = '<a/>';
$dom = new DOMDocument();
$dom->loadHTML($message);
$dom->getElementsByTagName('a')->item(0)->setAttribute('id', "foo$token");
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', "http://foo#$token");
$textNode = new DOMText("foo$token");
$dom->getElementsByTagName('a')->item(0)->appendChild($textNode);
echo $dom->saveHTML();
searchAndReplace($dom, $token, '*replaced*');
echo $dom->saveHTML();
If you use saveXML() it won't mess with the encoding the way saveHTML() does:
PHP
//your code...
$message = $dom_document->saveXML();
EDIT: also remove the XML tag:
//this will add an xml tag, so just remove it
$message=preg_replace("/\<\?xml(.*?)\?\>/","",$message);
echo $message;
Output
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>Google</body></html>
Notice that both still correctly convert & to &amp;
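As an alternative to the regular expression, DOMDocument::saveXML() accepts a node argument and then serializes only that node, without the XML declaration. A minimal sketch, using a made-up XML fragment rather than the question's message:

```php
<?php
$dom = new DOMDocument();
// Made-up input for illustration; the ampersand is written as &amp; so the XML is well-formed.
$dom->loadXML('<root><a href="http://www.example.com/click/[@MERGEID]?x=1&amp;y=2">Google</a></root>');

// Whole document: starts with the XML declaration.
$withDeclaration = $dom->saveXML();

// Only the root element: no declaration, so nothing has to be stripped afterwards.
$withoutDeclaration = $dom->saveXML($dom->documentElement);

echo $withoutDeclaration, "\n";
```

Passing the node keeps the output handling inside the DOM API, so it does not depend on the exact text of the declaration.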
Would it not make sense to just urlencode the original [@MERGEID] when saving it in the first place? Your search should then match without the need for the str_replace.
$newlink = 'http://www.example.com/click/'.urlencode('[@MERGEID]').'?url=' . $link;
I know this does not answer the first post of the question, but you cannot post code in comments as far as I can tell.

Get html source of external webpage without header/encode

I just want to know if it's possible to extract UTF-8 encoded content from an HTML file that is served without an encoding header.
My specific case is this website:
http://www.metal-archives.com/band/discography/id/203/tab/all
I want to extract all the info but, as you can see, this word for example, looks bad:
MotÃ¶rhead
I tried file_get_html, htmlentities, utf8_decode, utf8_encode and combinations of them with different options, but I can't find a solution...
Edit:
I just want to see the same website with correct format with this simple code:
$html_discos = file_get_html("http://www.metal-archives.com/band/discography/id/223/tab/all");
//some transform/decode here
print_r($html_discos);
I want the content in correct format in a string or DOM object to get some parts later.
Edit 2:
file_get_html is a function of the "simple html dom" library:
http://simplehtmldom.sourceforge.net/
It has this code:
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
// We DO force the tags to be terminated.
$dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
// For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
$contents = file_get_contents($url, $use_include_path, $context, $offset);
// Paperg - use our own mechanism for getting the contents as we want to control the timeout.
//$contents = retrieve_url_contents($url);
if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
{
return false;
}
// The second parameter can force the selectors to all be lowercase.
$dom->load($contents, $lowercase, $stripRN);
return $dom;
}
The Content-Type of the URL
http://www.metal-archives.com/band/discography/id/203/tab/all
is:
Content-Type: text/html
This will default to ISO-8859-1. But instead you want to use UTF-8. Change the Content-Type so this is correctly signaled:
Content-Type: text/html; charset=utf-8
See: Setting the HTTP charset parameter
header('Content-Type: text/html; charset=utf-8');
echo file_get_contents('http://www.metal-archives.com/band/discography/id/203/tab/all');
As long as you are emitting as UTF-8, the raw data will work properly.
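When processing the page in PHP rather than re-emitting it, the same rule can be applied by hand: read the charset from the Content-Type header and fall back to a default when none is declared. charsetFromContentType is a hypothetical helper, and the ISO-8859-1 default mirrors the answer above; since the question states that this site's content is actually UTF-8, 'UTF-8' would be passed as the default for it.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type
// header line, or return $default when the header does not declare one.
function charsetFromContentType(string $header, string $default = 'ISO-8859-1'): string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $header, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

$declared   = charsetFromContentType('Content-Type: text/html; charset=utf-8');
$undeclared = charsetFromContentType('Content-Type: text/html');
$overridden = charsetFromContentType('Content-Type: text/html', 'UTF-8');

echo $declared, ' ', $undeclared, ' ', $overridden, "\n";
```

After a file_get_contents() call on an HTTP URL, the response header lines are available in $http_response_header, which is where such a header line would come from.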
Try using html_entity_decode: http://php.net/manual/en/function.html-entity-decode.php (the source of that page has encoded characters)

XML signature with xmlseclibs returning invalid data: data and digest do not match

I'm trying to submit a signed XML document (signed with xmlseclibs), but the signature turns out to be invalid.
The code I'm using looks like this:
// input variables:
$tout = __DIR__ . "/" . $firmacert2;
$certBuffer = file_get_contents($tout);
$certTempFile = __DIR__ . '/temp/temp.xml';
$xmlBuffer = base64_decode($xmlstrBase64);
// document creation and loading
$doc = new DOMDocument();
$doc->loadXML($xmlBuffer);
$objDSig = new XMLSecurityDSig();
$objDSig->setCanonicalMethod(XMLSecurityDSig::EXC_C14N);
$objDSig->addReference($doc, XMLSecurityDSig::SHA1, array('http://www.w3.org/2000/09/xmldsig#enveloped-signature'));
$objKey = new XMLSecurityKey(XMLSecurityKey::RSA_SHA1, array('type' => 'private'));
// load private key
$objKey->passphrase = $pass;
$objKey->loadKey($tout, TRUE);
$objDSig->sign($objKey);
// Add associated public key
$objDSig->add509Cert($certBuffer);
$objDSig->appendSignature($doc->documentElement);
$doc->save($certTempFile);
$codif = file_get_contents($certTempFile);
$xml_base64 = base64_encode($codif);
$param1 = new SoapParam($xml_base64, 'xml');
$com = new SoapClient('https://www.aespd.es:443/agenciapd/axis/SolicitudService?wsdl', array('trace' => 1, 'encoding' => 'UTF-8'));
$respuesta2 = $com->probarXml($param1);
$respuesta = base64_decode($respuesta2);
The XML is sent fine, but when I retrieve the XML file and check the signature at http://www.aleksey.com/xmlsec/xmldsig-verifier.html,
the error I'm getting is:
func=xmlSecOpenSSLEvpDigestVerify:file=digests.c:line=229:obj=sha1:subj=unknown:error=12:invalid data:data and digest do not match
I tried to transform the certificate into separate private and public keys, same file, different files, importing and exporting and such.
The flow goes this way:
Java program sends generated unsigned Base64 encoded XML to PHP file, which signs and sends it with a SoapClient, result is printed, then captured and interpreted by Java program, thus avoiding having individual certificates on the machines using this system.

simplexml_load_string doesn't work with soap response

I'm trying to parse the xml response from a soap service. However, I can't get simplexml_load_string to work! Here is my code:
//make soap call
$objClient = new SoapClient('my-wsdl',
array('trace' => true,'exceptions' => 0, 'encoding' => 'UTF-8'));
$soapvar = new SoapVar('my-xml', XSD_ANYXML);
$objResponse = $objClient->__soapCall($operation, array($soapvar));
//process result
$str_xml = $objClient->__getLastResponse();
$rs_xml = simplexml_load_string($str_xml);
...$rs_xml always has just one element with name Envelope.
However, if I use *"print var_export($objClient->__getLastResponse(),true);"* to dump the result to my browser, then cut and paste it into my code as a string variable, it works fine! This is what I mean:
$str_xml = 'my cut and pasted xml';
$rs_xml = simplexml_load_string($str_xml);
So it seems the problem is somehow related to something $objClient->__getLastResponse() is doing to the string it creates... but I'm at a loss as to what the problem is or how to fix it.
Do the following:
$str_xml = $objClient->__getLastResponse();
$str_xml = strstr($str_xml, '<');
$rs_xml = simplexml_load_string($str_xml);
This is a quick and easy hack to strip off anything before the first opening element.
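To see what the hack does, here is a self-contained sketch. The leading byte order mark and newline are only an assumed cause of the failure (the question does not show the raw response), and the XML is a made-up, namespace-free stand-in for a SOAP envelope.

```php
<?php
// Assumed raw response: a UTF-8 BOM and a newline precede the XML.
$raw = "\xEF\xBB\xBF\n" . '<?xml version="1.0"?><Envelope><Body><Result>ok</Result></Body></Envelope>';

// strstr() returns the string from the first '<' onward, or false if there is none.
$clean = strstr($raw, '<');
if ($clean === false) {
    throw new RuntimeException('Response contains no XML');
}

$xml = simplexml_load_string($clean);
if ($xml === false) {
    throw new RuntimeException('Response is not well-formed XML');
}

$result = (string) $xml->Body->Result;
echo $result, "\n";
```

A real SOAP response uses namespaced elements, so reaching the body will usually also require SimpleXMLElement::children() with the envelope namespace; that part is not covered by this sketch.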
