I've got the following script which sends some text to the Google Translate API to translate into Japanese:
<?php
$apiKey = 'myApiKey';
$text = 'Hello world!';
$url = 'https://www.googleapis.com/language/translate/v2?key=' . $apiKey . '&q=' . rawurlencode($text) . '&source=en&target=ja';
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($handle);
$responseDecoded = json_decode($response, true);
curl_close($handle);
echo 'Source: ' . $text . '<br>';
echo 'Translation: ' . $responseDecoded['data']['translations'][0]['translatedText'];
?>
However, this returns: 世界ã“ã‚“ã«ã¡ã¯ï¼
I'm assuming this is an encoding issue, but I'm not sure how to solve it. The PHP manual says "This function only works with UTF-8 encoded strings."
My question is: how do I get the returned translation results to display properly?
Thanks
The JSON contains UTF-8 encoded text, but you need to tell the browser that your page uses UTF-8. Based on your output, the Unicode text is being interpreted as a single-byte encoding such as ISO-8859-1 rather than UTF-8.
You can add the following to your PHP code to set the content type via HTTP headers:
header('Content-Type: text/html; charset=UTF-8');
And/Or, you can add a meta tag within your <head> tags to specify UTF-8:
<!-- html4/xhtml -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<!-- HTML5 -->
<meta charset="utf-8">
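For reference, a minimal sketch combining the header fix with a script like the asker's (the JSON string here is a stand-in for a real Translate API response, which the asker's output suggests has this shape):

```php
<?php
// Send the charset header before any output so the browser decodes
// the UTF-8 bytes correctly instead of showing mojibake.
header('Content-Type: text/html; charset=UTF-8');

// Stand-in for a decoded Translate API response (illustration only).
$response = '{"data":{"translations":[{"translatedText":"こんにちは世界！"}]}}';
$responseDecoded = json_decode($response, true);

echo 'Translation: ' . $responseDecoded['data']['translations'][0]['translatedText'];
```

No re-encoding of the response body is needed; the bytes were correct all along, only the declared charset was missing.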
Make sure you have the <meta charset="utf-8"> tag in your HTML <head>.
For a small project on WordPress, I am trying to scrape some information from a site given a URL (namely a thumbnail and the publisher). I know there are a few plugins doing similar things, but they usually inject the result into the article itself, which is not my goal. Furthermore, the ones I use tend to have the same issue I have.
My overall goal is to display a thumbnail and the publisher name given a URL in a post custom field. I get my data from the OpenGraph meta tags for the moment (I'm a lazy guy).
The overall code works, but I get the usual mangled text when dealing with non-Latin characters (and that's 105% of the cases). Even stranger for me: it depends on the site.
I have tried to use ForceUTF8 and gzip compression in curl, as recommended in various answers here, but the result is still the same (or gets worse).
My only clue for the moment is how the encoding is declared on each page.
For example, for 3 URLs I was given:
https://www.jomo-news.co.jp/life/oricon/25919
<meta charset="UTF-8" />
<meta property="og:site_name" content="上毛新聞" />
Result > ä¸Šæ¯›æ–°è ž
Not OK
https://entabe.jp/21552/rl-waffle-chocolat-corocoro
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta property="og:site_name" content="えん食べ [グルメニュース]" />
Result > えん食べ [グルメニュース]
OK
https://prtimes.jp/main/html/rd/p/000000008.000026619.html
<meta charset="utf-8">
<meta property="og:site_name" content="プレスリリース・ニュースリリース配信シェアNo.1|PR TIMES" />
Result > ãƒ—ãƒ¬ã‚¹ãƒªãƒªãƒ¼ã‚¹ãƒ»ãƒ‹ãƒ¥ãƒ¼ã‚¹ãƒªãƒªãƒ¼ã‚¹é… ä¿¡ã‚·ã‚§ã‚¢No.1|PR TIMES
Not OK
For reference, here is the curl function I use:
function file_get_contents_curl($url)
{
    header('Content-type: text/html; charset=UTF-8');
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
And the scraping function:
function get_news_header_info($url) {
    // Parsing begins here:
    $news_result = array("news_img_url" => "", "news_name" => "");
    $html = file_get_contents_curl($url);
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ suppresses parser warnings from malformed HTML
    $metas = $doc->getElementsByTagName('meta');
    for ($i = 0; $i < $metas->length; $i++) {
        $meta = $metas->item($i);
        if ($meta->getAttribute('property') == 'og:site_name') {
            if (!$news_name)
                $news_name = $meta->getAttribute('content');
        }
        // Script continues
    }
Does anyone know what is different between these three cases and how I could deal with it?
EDIT
It looks like that even though all the websites declared a UTF-8 charset, after looking at curl_getinfo() and testing a bunch of charset conversion combinations, a conversion to ISO-8859-1 was necessary.
So just adding:
$scraped_text = iconv("UTF-8", "ISO-8859-1", $scraped_text);
was enough to solve the problem (note that iconv() returns the converted string, so the result has to be assigned).
For the sake of giving a complete answer, here is the snippet of code to test conversion pairs, from this answer by rid-iculous:
$charsets = array(
    "UTF-8",
    "ASCII",
    "Windows-1252",
    "ISO-8859-15",
    "ISO-8859-1",
    "ISO-8859-6",
    "CP1256"
);

foreach ($charsets as $ch1) {
    foreach ($charsets as $ch2) {
        echo "<h1>Combination $ch1 to $ch2 produces: </h1>" . iconv($ch1, $ch2, $text_2_convert);
    }
}
Problem solved, have fun!
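To see why this particular conversion works (an illustrative sketch, not from the original post): when UTF-8 bytes are mistakenly read as ISO-8859-1 somewhere in the pipeline and re-encoded to UTF-8 ("double encoding"), converting the result from UTF-8 back to ISO-8859-1 restores the original bytes:

```php
<?php
$original = '上毛新聞'; // valid UTF-8

// Simulate the double encoding: the raw UTF-8 bytes are treated as
// ISO-8859-1 characters and re-encoded to UTF-8, producing mojibake
// like the "ä¸Šæ¯›..." seen in the question.
$mojibake = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Converting the mojibake from UTF-8 back to ISO-8859-1 maps each
// character back to its original byte, recovering the UTF-8 string.
$repaired = iconv('UTF-8', 'ISO-8859-1', $mojibake);
var_dump($repaired === $original); // bool(true)
```

This also explains why the fix is site-dependent: only pages that were double-encoded somewhere need the conversion, and applying it to already-correct UTF-8 would mangle the text instead.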
It looks like even though all the pages declared UTF-8, some ISO-8859-1 was hidden in places. Using iconv solved the issue.
Edited the question with all the details, case closed!
When I try to access the Moz API using the code below:
$accessID = 'mozscape-key';
$secretKey = 'secret-key';

// Set your expires time for several minutes into the future.
// An expires time excessively far in the future will not be honored by the Mozscape API.
$expires = time() + 300;

// Put each parameter on a new line.
$stringToSign = $accessID . "\n" . $expires;

// Get the "raw" or binary output of the hmac hash.
$binarySignature = hash_hmac('sha1', $stringToSign, $secretKey, true);

// Base64-encode it and then url-encode that.
$urlSafeSignature = urlencode(base64_encode($binarySignature));

// Specify the URL that you want link metrics for.
$objectURL = "www.seomoz.org";

// Add up all the bit flags you want returned.
// Learn more here: https://moz.com/help/guides/moz-api/mozscape/api-reference/url-metrics
$cols = "103079215108";

// Put it all together and you get your request URL.
// This example uses the Mozscape URL Metrics API.
$requestUrl = "http://lsapi.seomoz.com/linkscape/url-metrics/" . urlencode($objectURL) . "?Cols=" . $cols . "&AccessID=" . $accessID . "&Expires=" . $expires . "&Signature=" . $urlSafeSignature;

echo $requestUrl;
die;

// Use Curl to send off your request.
$options = array(
    CURLOPT_RETURNTRANSFER => true
);

$ch = curl_init($requestUrl);
curl_setopt_array($ch, $options);
$content = curl_exec($ch);
curl_close($ch);

$f = fopen('tte.txt', 'a');
fwrite($f, $content);
fclose($f);

print_r($content);
The output it returns is below:
<html style="height:100%">
<head>
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS">
<meta content="telephone=no" name="format-detection">
<meta content="initial-scale=1.0" name="viewport">
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible">
<title></title>
</head>
<body style="margin:0px;height:100%">
<iframe frameborder="0" height="100%" marginheight="0px" marginwidth="0px"
src="/_Incapsula_Resource?CWUDNSAI=9&xinfo=10-113037580-0%200NNN%20RT(1470041335360%200)%20q(0%20-1%20-1%20-1)%20r(0%20-1)%20B12(8,811001,0)%20U5&incident_id=220010400174850153-812164000562037002&edet=12&cinfo=08000000"
width="100%">Request unsuccessful. Incapsula incident ID:
220010400174850153-812164000562037002</iframe>
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS">
<meta content="telephone=no" name="format-detection">
<meta content="initial-scale=1.0" name="viewport">
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible">
<iframe frameborder="0" height="100%" marginheight="0px" marginwidth="0px"
src="/_Incapsula_Resource?CWUDNSAI=9&xinfo=6-31536099-0%200NNN%20RT(1470041496215%200)%20q(0%20-1%20-1%20-1)%20r(0%20-1)%20B12(8,811001,0)%20U5&incident_id=220010400174850153-224923142338658566&edet=12&cinfo=08000000"
width="100%">Request unsuccessful. Incapsula incident ID:
220010400174850153-224923142338658566</iframe>
</body>
</html>
It seems like Incapsula is treating the request as a robot. Can anyone please help me fix it?
You said that requesting $requestUrl with a GET in the browser works fine, so try combining your options array.
It should look like this:
$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_URL => $requestUrl,
    CURLOPT_USERAGENT => 'Maybe You Need Agent?'
));
Note about the user agent (taken from the web):
cURL is a behemoth, and has many, many possibilities. Some sites might
only serve pages to some user agents, and when working with APIs, some
might request that you send a specific user agent; this is something to
be aware of.
Also worth checking: you have the ID of the failure from Incapsula - 220010400174850153-224923142338658566.
Can you check the logs and see what is there?
I am struggling with encoding issues in a PHP app that:
Reads an XML file and parses it according to some rules
Calls the Google Translate API and uses the result to populate a database that is later used to display data on the browser (that part works well)
Saves that data to an XML file (it saves, but there's something wrong with the encoding)
The data comes from Google Translate encoded in UTF-8, and in the browser, provided that you have the proper heading, it displays fine whatever the language is.
Here's the Google Translate function:
function mt($text, $lang) {
    $url = 'https://www.googleapis.com/language/translate/v2?key=' . $apiKey . '&q=' . rawurlencode($text) . '&source=en&target=' . $lang;
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
    $response = curl_exec($handle);
    $responseDecoded = json_decode($response, true); // second parameter is the assoc flag, not a JSON_* option
    $responseCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);
    if ($responseCode != 200) {
        $resultxt = 'not200result';
    } else {
        $resultxt = $responseDecoded['data']['translations'][0]['translatedText'];
    }
    return $resultxt;
}
I'm using SimpleXML to load an XML file, modify its contents, and save it with asXML().
The generated XML file is encoded in something other than UTF-8, as it looks like this:
<value>ようこそ%0 ST数学</value>
Here's the code that assigns the translation to the XML node and saves it:
$xml = simplexml_load_file('myfile.xml'); // Load source XML file
$xml->addAttribute('encoding', 'UTF-8');
$xmlFile = 'translation.xml'; // File that will be saved
// Here I call the MT function above and write its result into the XML file at face value.
$xml->asXML($xmlFile); // Save translated XML file
I've tried using htmlentities() and played with utf8_encode() and utf8_decode(), but I can't make it work.
I've tried everything and looked at many other posts. For the life of me, I can't figure this one out. Any help is appreciated.
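An editorial aside on the snippet above: addAttribute() on the root node adds a literal encoding="UTF-8" attribute to that element; it does not change the XML declaration at the top of the file. A minimal sketch of setting the declaration explicitly by round-tripping through DOMDocument (the file name and sample XML are placeholders, not from the question):

```php
<?php
// Placeholder document standing in for the loaded file.
$xml = simplexml_load_string('<root><value>ようこそ</value></root>');

// Re-export through DOMDocument so the encoding appears in the
// XML declaration itself, e.g. <?xml version="1.0" encoding="UTF-8"?>
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadXML($xml->asXML());
$dom->encoding = 'UTF-8';
file_put_contents('translation.xml', $dom->saveXML());
```

Note that asXML() itself writes UTF-8 bytes regardless; a file that "looks wrong" is often just being viewed in an editor or browser that assumes a different charset.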
I have been searching and trying for hours and can't seem to find anything that actually solves my problem.
I'm calling a PHP function that grabs content using the Google Translate API, passing it a string to be translated.
There are quite a few instances where the encoding is affected, but I've done this before and it worked fine as far as I can remember.
Here's the code that calls that function:
$name = utf8_encode(mt($name));
And here's the actual function:
function mt($text) {
    $apiKey = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
    $url = 'https://www.googleapis.com/language/translate/v2?key=' . $apiKey . '&q=' . rawurlencode($text) . '&source=en&target=es';
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
    $response = curl_exec($handle);
    echo curl_error($handle);
    $responseDecoded = json_decode($response, true);
    $responseCode = curl_getinfo($handle, CURLINFO_HTTP_CODE); // Fetch the HTTP response code
    curl_close($handle);
    if ($responseCode != 200) {
        $resultxt = 'failed!';
        return $resultxt;
    } else {
        $resultxt = $responseDecoded['data']['translations'][0]['translatedText'];
        return utf8_decode($resultxt); // return $resultxt; won't work either
    }
}
What I end up getting is garbled characters for any accented character, like GuÃa del desarrollador de XML.
I've tried all combinations of encoding/decoding and I just can't get it to work...
I've had this kind of issue before; here is what I can tell you to try:
In the <head> tag, try adding:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Try adding the header in PHP:
header("Content-Type: text/html; charset=utf-8");
Check the encoding of your file, for example in Notepad++:
Encoding > UTF-8 without BOM
Set the charset in .htaccess:
AddDefaultCharset utf-8
As you said you are reading files from users, you can use mb_convert_encoding to convert the content to UTF-8, and mb_check_encoding to verify the result. Try this:
$content = mb_convert_encoding($content, 'UTF-8');
if (mb_check_encoding($content, 'UTF-8')) {
    // log('Converted to UTF-8');
} else {
    // log('Could not convert to UTF-8');
}
return $content;
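A more defensive variant (my own sketch, not from the answer above): detect the source encoding first in strict mode, and convert only when it is not already UTF-8. The candidate list is an assumption; tailor it to the encodings your users actually send.

```php
<?php
$content = "caf\xE9"; // ISO-8859-1 bytes for "café" (example input)

// Strict detection: the first candidate that the bytes are valid in wins.
$enc = mb_detect_encoding($content, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true);
if ($enc !== false && $enc !== 'UTF-8') {
    $content = mb_convert_encoding($content, 'UTF-8', $enc);
}

var_dump(mb_check_encoding($content, 'UTF-8')); // bool(true)
```

Passing the detected encoding explicitly matters: mb_convert_encoding($content, 'UTF-8') with no source argument assumes the internal encoding rather than inspecting the bytes.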
I am using the Facebook Graph API, and it was working well until I realised that some of the JPG files have a query string at the end that makes them unusable.
e.g.
https://scontent.xx.fbcdn.net/hphotos-xaf1/v/t1.0-9/487872_451835128174833_1613257199_n.jpg?oh=621bed79f5436e81c3e219c86db8f0d9&oe=560F3D0D
I have tried stripping off everything after .jpg in the hope that it would still load the image, but unfortunately it doesn't.
In the following code, take $facebook_image_url to be the one above. This works fine when the URL ends in .jpg but fails on the above. As a note, I am renaming the file to a random number.
$File_Name = $facebook_image_url;
$File_Ext = '.jpg';
$Random_Number = rand(0, 9999999999); // Random number to be added to name.
$NewFileName = $Random_Number . $File_Ext; // New file name
$local_file = $UploadDirectory . $NewFileName;
$remote_file = $File_Name;

$fp = fopen($local_file, 'w+');
$ch = curl_init($remote_file);
curl_setopt($ch, CURLOPT_TIMEOUT, 50);
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_ENCODING, "");
curl_exec($ch);
curl_close($ch);
fclose($fp);

$image = new Imagick(DIR_TEMPORARY_IMAGES . $NewFileName);
The error I'm getting is:
Fatal error: Uncaught exception 'ImagickException' with message 'Not a
JPEG file: starts with 0x3c 0x21 `/mysite/temp-images/1849974705.jpg'
# jpeg.c/EmitMessage/232'
I can confirm the image isn't saving as a proper JPG, just a small 3 KB file with the name 1849974705.jpg (or other random numbers).
Is there either:
A: a way of getting those images from Facebook as raw JPGs, or
B: a way of converting them successfully to JPGs?
You could always download the image using file_get_contents().
This code works for me...
file_put_contents("image.jpg", file_get_contents("https://scontent.xx.fbcdn.net/hphotos-xaf1/v/t1.0-9/522827_10152235166655545_26514444_n.jpg?oh=1d52a86082c7904da8f12920e28d3687&oe=5659D5BB"));
Just because something has .jpg in the URI doesn't mean it's an image.
Getting that URL via wget gives the result:
<!DOCTYPE html>
<html lang="en" id="facebook">
<head>
<title>Facebook | Error</title>
<meta charset="utf-8">
<meta http-equiv="cache-control" content="no-cache">
<meta http-equiv="cache-control" content="no-store">
<meta http-equiv="cache-control" content="max-age=0">
<meta http-equiv="expires" content="-1">
<meta http-equiv="pragma" content="no-cache">
<meta name="robots" content="noindex,nofollow">
<style>
....
....
i.e. it's not an image, exactly as the error message is telling you.
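One way to guard against this (my own addition, not from the answers above): a JPEG always begins with the SOI marker bytes 0xFF 0xD8, while an HTML error page begins with "<!" (0x3C 0x21, exactly the bytes named in the Imagick error). A small helper (the function name is mine) can check the downloaded bytes before anything is written to disk:

```php
<?php
// Returns true only when the payload starts with the JPEG SOI marker.
function looks_like_jpeg(string $data): bool
{
    return strlen($data) >= 2 && substr($data, 0, 2) === "\xFF\xD8";
}

var_dump(looks_like_jpeg("\xFF\xD8\xFF\xE0...")); // bool(true)  - JPEG header bytes
var_dump(looks_like_jpeg('<!DOCTYPE html>'));     // bool(false) - an error page
```

In the question's code, you would fetch the bytes with CURLOPT_RETURNTRANSFER instead of CURLOPT_FILE, run this check on the result, and only write the .jpg file (and construct the Imagick object) when it passes.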