Encoding issues with PHP - php

I have been searching and trying for hours and can't seem to find anything that actually solves my problem.
I'm calling a PHP function that grabs content using the Google translate API and I'm passing a string to be translated.
There are quite a few instances where the encoding is affected but I've done this before and it worked fine as far as I can remember.
Here's the code that calls that function:
$name = utf8_encode(mt($name));
And here's the actual function:
function mt($text) {
$apiKey = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
$url = 'https://www.googleapis.com/language/translate/v2?key=' . $apiKey . '&q=' . rawurlencode($text) . '&source=en&target=es';
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
$response = curl_exec($handle);
echo curl_error($handle);
$responseDecoded = json_decode($response, true);
$responseCode = curl_getinfo($handle, CURLINFO_HTTP_CODE); //Fetch the HTTP response code
curl_close($handle);
if($responseCode != 200) {
$resultxt = 'failed!';
return $resultxt;
}
else {
$resultxt = $responseDecoded['data']['translations'][0]['translatedText'];
return utf8_decode($resultxt); //return($resultxt) won't work either
}
}
What I end up getting is garbled characters for any accented character, for example in Guía del desarrollador de XML.
I've tried every combination of encoding and decoding I can think of and I just can't get it to work...

I have had this kind of issue before. Here is what I suggest you try:
Add this to the <head> tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Send the charset in the PHP header as well:
header("Content-Type: text/html; charset=utf-8");
Check the encoding of your file, for example in Notepad++:
Encoding > UTF-8 without BOM
Set the charset in .htaccess:
AddDefaultCharset utf-8
If you are reading files from users, you can use mb_check_encoding() and mb_convert_encoding() to check the encoding and convert anything that is not already UTF-8. Try this:
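// The function header was missing from the original snippet; the name toUtf8 is a placeholder.
function toUtf8($content) {
if (!mb_check_encoding($content, 'UTF-8')) {
// No source encoding is given, so mbstring assumes its internal encoding.
$content = mb_convert_encoding($content, 'UTF-8');
if (mb_check_encoding($content, 'UTF-8')) {
// log('Converted to UTF-8');
} else {
// log('Could not be converted to UTF-8');
}
}
return $content;
}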
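To see why stacking utf8_encode() and utf8_decode() garbles accented characters, here is a minimal sketch. It assumes the API response is already UTF-8, and it uses mb_convert_encoding() from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode(), which is deprecated as of PHP 8.2. The helper name ensureUtf8 and its ISO-8859-1 fallback are my own assumptions, not part of the original code.

```php
<?php
// "Guía" as UTF-8 bytes (the í is the two bytes C3 AD).
$utf8 = "Gu\u{00ED}a";

// Re-encoding text that is already UTF-8 treats each byte as a Latin-1 character.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // 4775c3ad61
echo bin2hex($double), "\n"; // 4775c383c2ad61 (shows up as garbled text in a UTF-8 page)

// Hypothetical guard: convert only when the input is not valid UTF-8.
// The ISO-8859-1 fallback is an assumption about where legacy text comes from.
function ensureUtf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    return mb_check_encoding($text, 'UTF-8')
        ? $text
        : mb_convert_encoding($text, 'UTF-8', $fallback);
}

$clean = ensureUtf8($utf8);      // already UTF-8, returned untouched
$fixed = ensureUtf8("Gu\xEDa");  // Latin-1 input, converted once
```

If the translated text is valid UTF-8 when it comes back from json_decode(), returning it unchanged and serving the page as UTF-8 should be enough.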

Related

PHP - Encoding issue when saving to XML file using SimpleXml

I am struggling with encoding issues in a PHP app that:
Reads an XML file and parses it according to some rules
Calls the Google Translate API and uses the result to populate a
database that is later used to display data on the browser (that
part works well)
Saves that data to an XML file (it saves but there's something wrong
with the encoding).
The data comes from Google Translate encoded in UTF-8, and in the browser it displays fine whatever the language is, provided that the page sends the proper charset header.
Here's the Google Translate function:
function mt($text, $lang) {
$url = 'https://www.googleapis.com/language/translate/v2?key=' . $apiKey . '&q=' . rawurlencode($text) . '&source=en&target=' . $lang;
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
$response = curl_exec($handle);
$responseDecoded = json_decode($response, JSON_UNESCAPED_UNICODE);
$responseCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
curl_close($handle);
if($responseCode != 200) {
$resultxt = 'not200result';
}
else {
$resultxt = $responseDecoded['data']['translations'][0]['translatedText'];
}
return $resultxt;
}
I'm using SimpleXML to load an XML file, modify its contents and save it with asXML().
The generated XML file is encoded in something other than UTF-8 as it looks like this:
<value>ようこそ%0 ST数学</value>
Here's the code that attributes the translation to the XML node and saves it.
$xml=simplexml_load_file('myfile.xml'); //Load source XML file
$xml->addAttribute('encoding', 'UTF-8');
$xmlFile = 'translation.xml'; //File that will be saved
//Here I have a call to the MT function above and get it to the XML file at face value.
$xml->asXML($xmlFile); //save translated XML file
I've tried using htmlentities() and played with utf8_encode() and utf8_decode() but can't make it work.
I've tried everything and looked at many other posts. For the life of me, I can't figure this one out. Any help is appreciated.
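One possible cause, offered as an assumption because myfile.xml is not shown: addAttribute('encoding', 'UTF-8') only adds an attribute called encoding to the root element, it does not set the encoding of the document. When the loaded document declares no encoding, libxml tends to write non-ASCII characters as numeric character references. The sketch below sets the document encoding through the DOM extension before saving; the sample XML and the translated string are made up for illustration.

```php
<?php
// Stand-in for simplexml_load_file('myfile.xml'); the source has no encoding declaration.
$xml = simplexml_load_string('<root><value>Welcome %0</value></root>');

// Stand-in for the value returned by the translation function.
$xml->value = 'ようこそ%0';

// Set the encoding on the underlying document rather than adding an attribute.
$dom = dom_import_simplexml($xml)->ownerDocument;
$dom->encoding = 'UTF-8';

$out = $dom->saveXML(); // use $dom->save('translation.xml') to write a file instead
echo $out;
```

The output should carry an XML declaration with encoding="UTF-8" and the Japanese text as raw UTF-8 rather than as character references.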

How to make curl call for remote url which contain space

This question is a continuation of my previous question.
<?php
$remoteFile = 'http://cdn/bucket/my textfile.txt';
$ch = curl_init($remoteFile);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); //not necessary unless the file redirects (like the PHP example we're using here)
$data = curl_exec($ch);
print_r($data);
curl_close($ch);
if ($data === false) {
echo 'cURL failed';
exit;
}
$contentLength = 'unknown';
$status = 'unknown';
if (preg_match('/^HTTP\/1\.[01] (\d\d\d)/', $data, $matches)) {
$status = (int)$matches[1];
}
if (preg_match('/Content-Length: (\d+)/', $data, $matches)) {
$contentLength = (int)$matches[1];
}
echo 'HTTP Status: ' . $status . "\n";
echo 'Content-Length: ' . $contentLength;
?>
I am using the code above to get the file size on the server side from a CDN URL, but when the CDN URL contains a space it returns the error below:
page not found 09/18/2014 - 16:54 http://cdn/bucket/my textfile.txt
Can I make a cURL call to a remote URL that contains a space?
To give a little more information:
I have an interface where the user saves a file to the CDN (the user can give it whatever title they want, so it may contain spaces), and all the information is saved in the back-end DB. I have another interface where I retrieve the saved information and show it on my page along with the file size, which I get using the code above.
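You have to encode URLs that contain spaces. Encode only the part that holds the space: urlencode() on the whole URL would also escape the scheme and the slashes, and it turns spaces into +, which only means a space inside a query string.
echo 'http://cdn/bucket/' . rawurlencode('my textfile.txt');
Ref: rawurlencode
The same function works when you build links:
echo '<a href="http://example.com/department_list_script/',
rawurlencode('sales and marketing/Miami'), '">';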
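Yes, you need to URL-encode (URI-encode) it.
In an encoded URL, spaces are written as %20, so your URL would be http://cdn/bucket/my%20textfile.txt and you could simply use that URL.
Or, as this is PHP, you could apply rawurlencode() to the file name. Note that urlencode() produces + instead of %20 and, applied to the whole URL, also escapes the : and / characters.
ref: http://php.net/manual/en/function.rawurlencode.php
e.g.
$remoteFile = 'http://cdn/bucket/' . rawurlencode('my textfile.txt');
or
$ch = curl_init('http://cdn/bucket/' . rawurlencode($fileName)); // $fileName: the stored title, e.g. 'my textfile.txt'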
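As a rough sketch of the idea, the helper below joins a base URL and a user-supplied file name and percent-encodes only the file name. The function name buildCdnUrl is hypothetical, and it assumes the title is a single path segment with no slashes of its own.

```php
<?php
// Hypothetical helper: joins a base URL and a file name, encoding only the file name.
function buildCdnUrl(string $baseUrl, string $fileName): string
{
    return rtrim($baseUrl, '/') . '/' . rawurlencode($fileName);
}

$remoteFile = buildCdnUrl('http://cdn/bucket/', 'my textfile.txt');
echo $remoteFile, "\n"; // http://cdn/bucket/my%20textfile.txt

// For comparison: urlencode() on the full URL escapes the scheme and slashes too.
$broken = urlencode('http://cdn/bucket/my textfile.txt');
echo $broken, "\n"; // http%3A%2F%2Fcdn%2Fbucket%2Fmy+textfile.txt
```

The first result can be passed straight to curl_init(); the second is no longer a usable URL.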

DOMDocument::loadHTML(): input conversion failed due to input error

I am trying to scrape a Chinese website using PHP and cURL. Earlier I had an issue with compressed results, and SO helped me sort it out.
Now I'm having trouble parsing the contents with PHP's DOMDocument.
The error is as follows:
Warning: DOMDocument::loadHTML(): input conversion failed due to input error, bytes 0xE3 0x80 0x90 0xE8 in /var/www/html/ ..
Even though it is only a warning, it prevents me from getting any further results.
My code is as given below:
$agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL,$url);
curl_setopt($curl, CURLOPT_HTTPHEADER, array('text/html; charset=gb2312'));
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_ENCODING, ""); // handling all compressions
curl_setopt($curl, CURLOPT_USERAGENT, $agent);
curl_setopt($curl, CURLOPT_TIMEOUT, 1000);
$html = curl_exec($curl) or die("error: ".curl_error($curl));
curl_close($curl);
$htmlParsed = mb_convert_encoding($html,'utf-8','gb2312'); // $html holds the cURL response
$doc = new DOMDocument();
$doc->loadHTML($htmlParsed);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="test"]//a/@href');
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "<br/>[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
I found this content type in my target website:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
So I tried converting the result to UTF-8.
Since the input conversion fails at the DOMDocument::loadHTML() line, I can't parse the web page to get the results.
I am currently stuck at this point, and any help or suggestions will be highly appreciated. Thanks in advance.
(Earlier I used to work with Simple HTML DOM Parser, which was pretty simple. But after reading about its drawbacks on SO, I decided to switch to PHP's native DOM parser.)
I found a solution today:
$html=new DOMDocument();
$html_source = get_html();
$html_source =mb_convert_encoding( $html_source, "HTML-ENTITIES", "UTF-8");
$html->loadHTML( $html_source );
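The snippet above relies on the "HTML-ENTITIES" pseudo-encoding, which is deprecated as of PHP 8.2. A similar effect can be sketched with mb_encode_numericentity(), which turns every non-ASCII character into a numeric entity so the parser no longer has to guess the character set. The function name loadUtf8Html and the sample markup are my own assumptions, and the input is assumed to be UTF-8 already.

```php
<?php
// Hypothetical helper: parse UTF-8 HTML without depending on charset detection.
function loadUtf8Html(string $html): DOMDocument
{
    // Map every code point from U+0080 upward to a numeric character reference.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Made-up sample page containing the character from the warning.
$sample = '<html><head><title>【测试】</title></head>'
        . '<body><div class="test"><a href="/x">链接</a></div></body></html>';

$doc   = loadUtf8Html($sample);
$xpath = new DOMXPath($doc);
$href  = $xpath->query('//div[@class="test"]//a/@href')->item(0)->nodeValue;
$title = $doc->getElementsByTagName('title')->item(0)->textContent;

echo $title, ' ', $href, "\n";
```

Values read back from the DOM come out as UTF-8 strings, so no further conversion should be needed before output.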
Without seeing the full head of the document that you are parsing I can only guess, but if the <meta> with the character encoding data does not come directly after the <head> tag, you may be running into a situation where DOMDocument falls back to its default of ISO-8859-1 and then hits the 【 character (the first three "invalid" bytes in the warning). Its 0x80 byte would be the first bit of nonsense, since this is an unused code point in ISO-8859-1. This would likely trigger the bug in DOMDocument discussed in the comments above, and it could easily happen if an element such as <title> is included before the content-type meta information.
The only thing I can think of to try would be to run the HTML through a bit of prep and move that content-type meta tag to right after the <head> tag, so that the correct character set is used. If you use mb_convert_encoding or iconv to convert the encoding to ISO-8859-1 or UTF-8, make sure that you modify the meta information as well, because DOMDocument is, unfortunately, brittle in many ways.
<?php
$contents = file_get_contents('xml.xml');
function convert_utf8( $string ) {
if ( strlen(utf8_decode($string)) == strlen($string) ) {
// $string is not UTF-8
return iconv("ISO-8859-1", "UTF-8", $string);
} else {
// already UTF-8
return $string;
}
}
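// Target encoding comes first, then the source encoding; strict detection picks UTF-8 or ISO-8859-1.
$contents = mb_convert_encoding( $contents, "UTF-8", mb_detect_encoding($contents, "UTF-8, ISO-8859-1", true));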
$xml = simplexml_load_string(convert_utf8($contents));
print_r($xml);

PHP GET Request return inconsistent results

I am using cURL via PHP to test service connections, and I'm getting some inconsistent results. When I run the test via PHP & cURL this is my result:
{"response":"\n\n\n\n \n \n
When I put that same URL in my browser I get this:
{"response":"\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n <link href=\"/images/global/global.css\...and so on
The response in my browser is cut short, but you get the idea.
With my PHP, I read in a JSON file, parse out the URL I need and then use cURL to send a GET request. Here is the code that I am using to test the service via PHP:
<?php
include ("serviceURLs.php");
class callService {
function testService($url){
$ch = curl_init($url);
curl_exec($ch);
$info = curl_getinfo($ch);
if ($info['http_code'] == 200){
echo("Test has passed </br>");
}else{
echo("Test Failed.</br> ");
}
var_dump($info);
curl_close($ch);
}
function readFile(){
$myFile = "./service/catalog-adaptation.json";
$fr = fopen($myFile, 'r');
$fileData = fread($fr, filesize($myFile));
$json_a = json_decode($fileData, TRUE);
$prodServer = $json_a['serverRoots']['%SERVER_ROOT']['PROD'];
$demoServer = $json_a['serverRoots']['%SERVER_ROOT']['DEMO'];
$testServer = $json_a['serverRoots']['%SERVER_ROOT']['TEST'];
$testUrls = $json_a['commands'];
foreach($testUrls as $tURL){
$mURL = $tURL['URL'];
if(stripos($mURL, "%")===0){
$testTestService = str_replace("%SERVER_ROOT", $testServer, $mURL);
$testDemoService = str_replace("%SERVER_ROOT", $demoServer, $mURL);
$testProdService = str_replace("%SERVER_ROOT", $prodServer, $mURL);
echo ("Production test: ");
$this->testService($testProdService);
echo ("Demo test: ");
$this->testService($testDemoService);
echo ("Test test: ");
$this->testService($testTestService);
}
}
}
}
$newServiceTest = new callService;
$newServiceTest->readFile();
?>
Can anyone tell me why I am getting different results and how I can fix my code so I can get consistent results?
You need to set the option below so that curl_exec() returns the transfer as a string instead of printing it directly.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
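A minimal sketch of how testService() could look with that option set, so the body can be inspected instead of being echoed into the page. The function name fetchBody, the returned array shape and the connect timeout are my own assumptions.

```php
<?php
// Hypothetical variant of testService(): returns the status and body instead of printing them.
function fetchBody(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // curl_exec() now returns the body as a string
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // assumed timeout, in seconds

    $body = curl_exec($ch);                         // false on failure
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);  // 0 when no response was received

    return [
        'ok'   => $body !== false && $code === 200,
        'code' => $code,
        'body' => $body === false ? null : $body,
    ];
}
```

With this in place, the caller can compare or log $result['body'] itself, for example echo htmlspecialchars($result['body']) when the output goes to a browser, which avoids the markup being interpreted as part of the test page.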

Get the filesize of a js file on another domain using php

How do I get the file size of a JS file on another website? I am trying to create a monitor to check that a JS file exists and that it is more than 0 bytes.
For example on bar.com I would have the following code:
$filename = 'http://www.foo.com/foo.js';
echo $filename . ': ' . filesize($filename) . ' bytes';
You can use an HTTP HEAD request.
<?php
$url = "http://www.neti.ee/img/neti-logo.gif";
$head = get_headers($url, 1);
echo $head['Content-Length'];
?>
Note: by default this is not a real HEAD request. get_headers() sends a GET request and PHP reads the Content-Length from the response headers, so the function name is somewhat misleading. This may be sufficient for small JS files, but for bigger files use a real HTTP HEAD request with cURL, because then the server only sends the headers and does not have to transfer the whole file.
For that case, use the code provided by Jakub.
Just use cURL; here is a good example:
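If you would rather stay with get_headers(), it accepts a stream context as its third argument (PHP 7.1 and later), which can switch the request method to HEAD. This is only a sketch: the function names and the 10 second timeout are my own, and the parsing helper assumes that a redirected request returns repeated headers as an array, in which case the last value is used.

```php
<?php
// Hypothetical helper: pick Content-Length out of an associative get_headers() result.
function contentLengthFromHeaders(array $headers): ?int
{
    foreach ($headers as $name => $value) {
        // Header names are case-insensitive; numeric keys hold the status lines.
        if (is_string($name) && strcasecmp($name, 'Content-Length') === 0) {
            $last = is_array($value) ? end($value) : $value; // after redirects, use the final response
            return (int) $last;
        }
    }
    return null;
}

// Hypothetical wrapper: sends a HEAD request instead of the default GET.
function remoteFileSize(string $url): ?int
{
    $context = stream_context_create(['http' => ['method' => 'HEAD', 'timeout' => 10]]);
    $headers = @get_headers($url, true, $context);

    return $headers === false ? null : contentLengthFromHeaders($headers);
}
```

A monitor could then call remoteFileSize('http://www.foo.com/foo.js') and treat null or 0 as a failure.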
Ref: http://www.php.net/manual/en/function.filesize.php#92462
<?php
$remoteFile = 'http://us.php.net/get/php-5.2.10.tar.bz2/from/this/mirror';
$ch = curl_init($remoteFile);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); //not necessary unless the file redirects (like the PHP example we're using here)
$data = curl_exec($ch);
curl_close($ch);
if ($data === false) {
echo 'cURL failed';
exit;
}
$contentLength = 'unknown';
$status = 'unknown';
if (preg_match('/^HTTP\/1\.[01] (\d\d\d)/', $data, $matches)) {
$status = (int)$matches[1];
}
if (preg_match('/Content-Length: (\d+)/', $data, $matches)) {
$contentLength = (int)$matches[1];
}
echo 'HTTP Status: ' . $status . "\n";
echo 'Content-Length: ' . $contentLength;
?>
Result:
HTTP Status: 302
Content-Length: 8808759
Another solution: http://www.php.net/manual/en/function.filesize.php#90913
This is just a two-step process:
Fetch the JS file and store it in a variable
Check if the length of the JS file is greater than 0
That's it!
Here is how you can do it in PHP:
<?php
$data = file_get_contents('http://www.foo.com/foo.js');
if (strlen($data) > 0):
echo "yay";
else:
echo "nay";
endif;
?>
Note: You can use an HTTP HEAD request as suggested by Uku, but if you also need the page content when the JS file is not empty, you would have to fetch it again :(
