PHP Curl UTF-8 Charset - php

I have a PHP script that fetches another web page and writes out all of its HTML. Everything works except for a charset problem. My PHP file is encoded as UTF-8, and all my other PHP files display correctly, so the server itself is not the problem. What is missing from this code? All the Spanish letters look wrong. PS: when I type those same characters directly into the PHP file, they display correctly.
header("Content-Type: text/html; charset=utf-8");
function file_get_contents_curl($url)
{
$ch=curl_init();
curl_setopt($ch,CURLOPT_HEADER,0);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
$data=curl_exec($ch);
curl_close($ch);
return $data;
}
$html=file_get_contents_curl($_GET["u"]);
$doc=new DOMDocument();
@$doc->loadHTML($html);

Simple:
cURL returns the bytes exactly as the server sent them. If they are UTF-8 and your output needs ISO-8859-1, you just need to decode them.
Description
string utf8_decode ( string $data )
This function decodes data, assumed to be UTF-8 encoded, to ISO-8859-1.

You can use this header:
header('Content-type: text/html; charset=UTF-8');
and then decode the string:
$page = utf8_decode(curl_exec($ch));
It worked for me.

$output = curl_exec($ch);
$result = iconv("Windows-1251", "UTF-8", $output);

function page_title($val){
include(dirname(__FILE__).'/simple_html_dom.php');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$val);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0');
curl_setopt($ch, CURLOPT_ENCODING , "gzip");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
$return = curl_exec($ch);
$encot = false;
$charset = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
curl_close($ch);
$html = str_get_html('"'.$return.'"');
if(strpos($charset,'charset=') !== false) {
$c = str_replace("text/html; charset=","",$charset);
$encot = true;
}
else {
$lookat=$html->find('meta[http-equiv=Content-Type]',0);
$chrst = $lookat->content;
preg_match('/charset=(.+)/', $chrst, $found);
$p = trim($found[1]);
if(!empty($p) && $p != "")
{
$c = $p;
$encot = true;
}
}
$title = $html->find('title')[0]->innertext;
if($encot == true && $c != 'utf-8' && $c != 'UTF-8') $title = mb_convert_encoding($title,'UTF-8',$c);
return $title;
}

I was fetching a Windows-1252 encoded file via cURL, and mb_detect_encoding(curl_exec($ch)) returned UTF-8. I tried utf8_encode(curl_exec($ch)) and the characters came out correct.
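Since mb_detect_encoding() can guess wrong, as described above, one alternative is to check whether the bytes are already valid UTF-8 and convert only when they are not. The following is a minimal sketch, not a definitive implementation: the helper name to_utf8() and the Windows-1252 fallback are assumptions for illustration, and it needs the mbstring extension.

```php
<?php
// Hypothetical helper (not part of cURL or PHP): return $bytes as UTF-8,
// converting from $fallback only when the input is not already valid UTF-8.
function to_utf8(string $bytes, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8, leave it untouched
    }
    // Assumption: anything that is not valid UTF-8 is in the fallback charset.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}
```

A call such as to_utf8(curl_exec($ch)) would then replace the unconditional utf8_encode() call, which would double-encode a response that is already UTF-8.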

First method (internal function)
The best approach I have tried is urlencode(). Keep in mind that you should not apply it to the whole URL; apply it only to the parts that need it. For example, if a request has two fields, 'text-fa' and 'text-en', containing Persian and English text respectively, you might only need to encode the Persian text, not the English one.
Second method (using a cURL option)
Another option is CURLOPT_ENCODING, passed to curl_setopt(). Note that this option controls the Accept-Encoding request header (content compression such as gzip), not the character set; an empty string makes cURL advertise every encoding it supports:
curl_setopt($ch, CURLOPT_ENCODING, "");
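The first method can be sketched as follows. The helper build_url() and the example.com address are assumptions made for illustration. The sketch encodes every field name and value rather than only the non-ASCII ones, which is harmless because urlencode() leaves plain ASCII letters, digits and hyphens unchanged.

```php
<?php
// Hypothetical helper: append URL-encoded query fields to a base URL.
// Only the names and values are encoded, never the URL as a whole.
function build_url(string $base, array $fields): string
{
    $pairs = [];
    foreach ($fields as $name => $value) {
        $pairs[] = urlencode($name) . '=' . urlencode($value);
    }
    return $base . '?' . implode('&', $pairs);
}
```

The resulting string can be passed to CURLOPT_URL. PHP's built-in http_build_query() does much the same job for simple arrays.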

Related

UTF-8 encoded characters show as gibberish in PHP

I am trying to print all the <p> elements of a particular HTML document fetched from a URL. The HTML document is using UTF-8 encoding.
This is my code:
<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);
header('Content-Type: text/plain; charset=utf-8');
header('Access-Control-Allow-Origin: *');
header('Access-Control-Allow-Methods: POST, GET, OPTIONS');
$url = "https://www.sangbadpratidin.in/kolkata/ispat-express-met-an-accident-near-howrah-junction/#.Y7qC6YFeT80.whatsapp";
$user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36";
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$html=curl_exec($ch);
if (!curl_errno($ch)) {
$resultStatus = curl_getinfo($ch, CURLINFO_HTTP_CODE);
if ($resultStatus == 200) {
$DOM = new DOMDocument;
@$DOM->loadHTML($html);
$bodies = $DOM->getElementsByTagName('p');
foreach($bodies as $body){
$para = $body->nodeValue;
echo $para;
}
}
}
?>
The HTML document is filled with Bengali characters. When I try to print the values, this is what gets printed:
সà§à¦¬à§à¦°à¦¤ বিশà§à¦¬à¦¾à¦¸: ফà§à¦° দà§à¦°à§à¦à¦à¦¨à¦¾à¦° à¦à¦¬à¦²à§ দà§à...
Why am I not getting the original text? Please help me
Edit: I just tested it, and this fixed it. See it live at https://dh.ratma.net/test/test2.php
This is a known issue: DOMDocument does not realize the input is UTF-8, defaults to a single-byte Latin encoding, and proceeds to corrupt the actual UTF-8 multibyte characters. With a bit of luck, replacing
@$DOM->loadHTML($html);
with
@$DOM->loadHTML('<?xml encoding="UTF-8">' . $html);
should fix it.
Changing $DOM->loadHTML($html) to $DOM->loadHTML(mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8")) seems to resolve the issue.
Source: PHP DOMDocument loadHTML not encoding UTF-8 correctly
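Here is a self-contained sketch of the encoding-hint approach from the answers above. The function name paragraphs_utf8() is an assumption for illustration, and it uses libxml_use_internal_errors() in place of the @ operator to silence parser warnings. It requires the DOM extension.

```php
<?php
// Hypothetical helper: return the text of every <p> in a UTF-8 HTML string.
function paragraphs_utf8(string $html): array
{
    $previous = libxml_use_internal_errors(true); // collect warnings quietly
    $dom = new DOMDocument();
    // The XML declaration prefix tells the parser that the bytes are UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    $texts = [];
    foreach ($dom->getElementsByTagName('p') as $p) {
        $texts[] = $p->nodeValue;
    }
    return $texts;
}
```

In the question's script, the loop over $bodies would become a loop over paragraphs_utf8($html).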

How to select specific text from a string generated by a PHP script?

I've been trying to scrape an HLS file from Twitch using several PHP scripts. The first one runs a cURL request to get the HLS URL through a Python script that returns said URL, and converts the generated string to plain text. The second (which is the one that isn't working) is supposed to extract the M3U8 file and make it playable.
First script (extract.php)
<?php
header('Content-Type: text/plain; charset=utf-8');
$url = "https://pwn.sh/tools/streamapi.py?url=twitch.tv/cgtn_live_russian&quality=1080p60";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
//for debug only!
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$resp = curl_exec($curl);
curl_close($curl);
var_dump($resp);
$undesirable = array("}");
$cleanurl = str_replace($undesirable, "", $resp);
echo substr($cleanurl, 39, 898);
?>
This script (let's call it extract.php) works, and it returns (in plain text) the same information the Python script would return, which is this:
string(904) "{"success": true, "urls": {"1080p60": "https://video-weaver.fra05.hls.ttvnw.net/v1/playlist/[token].m3u8"}}"
Second script (play.php)
<?php
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>"Referer:https://myserver.com/" .
"User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:51.0) Gecko/20100101 Firefox/51.0"
));
$html = file_get_contents("extract.php");
preg_match_all(
'/(http.*?\.m3u8[^&">]+)/',
$html,
$posts, // will contain the article data
PREG_SET_ORDER // formats data into an array of posts
);
foreach ($posts as $post) {
$link = $post[0];
header("Location: $link");
}
?>
This second script (let's call it play.php) should theoretically return the M3U8 file (without string(904) "{"success": true, "urls": {"1080p60":) and make it able to be played in a media player, such as VLC, but it doesn't return anything.
Can someone tell me what's wrong? Did I make a syntax or regex error when making these PHP files or is the second file not working because of the other elements of the string?
Thanks in advance.
I think you can rely on the regex to get the URL out instead of trying to clean the string manually. The other way would be to use json_decode().
Anyway, the idea is to define a variable in extract.php, in this case $resp. Echoing the value, as you do now, does not make it available to the parent script.
You can then reference that variable in play.php once extract.php has been included.
<?php
//extract.php
$resp = '';
$url = "https://pwn.sh/tools/streamapi.py?url=twitch.tv/cgtn_live_russian&quality=1080p60";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
//for debug only!
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$resp = curl_exec($curl);
curl_close($curl);
//play.php
include('./extract.php');
//$resp is set in extract.php
preg_match_all(
'/(http.*?\.m3u8)/',
$resp,
$posts, // will contain the article data
PREG_SET_ORDER // formats data into an array of posts
);
foreach ($posts as $post) {
$link = $post[0];
}
header("Location: $link");
die();
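The json_decode() route mentioned above could look roughly like this. It is only a sketch: the helper name first_stream_url() is hypothetical, and it assumes the response keeps the shape shown in the question (a "success" flag and a "urls" object keyed by quality).

```php
<?php
// Hypothetical helper: pull the first stream URL out of the JSON response.
// Returns null when the JSON is invalid or has no usable URL.
function first_stream_url(string $json): ?string
{
    $data = json_decode($json, true);
    if (!is_array($data) || empty($data['success'])
        || empty($data['urls']) || !is_array($data['urls'])) {
        return null;
    }
    $first = reset($data['urls']); // first quality entry, whatever its key
    return is_string($first) ? $first : null;
}
```

In play.php, $link = first_stream_url($resp); would replace the preg_match_all() loop, with a null check before sending the Location header.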

Get utf8 DOM from utf8 file

I have the following code:
<?php
header('Content-Type: text/html; charset=utf-8');
function getSource($url)
{
if (!function_exists('curl_init'))
{
die('CURL is not installed!');
}
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, "UTF-8");
$output = curl_exec($ch);
curl_close($ch);
return $output;
}
$source = getSource('http://www.website.com/');
var_dump($source); die();
And the file itself is in UTF-8. The thing is the UTF-8 characters of the output are not displayed properly. Instead they are shown as question marks, or some other trash.
And the only thing to solve this that I found out is to encode the file as ISO-8859-1. But I don't want that. What's wrong here?
The value you pass to CURLOPT_ENCODING is (a) invalid, and (b) meaningless, in that it doesn't force Curl to translate the content it fetches into the encoding you want. If the remote site returns ISO-8859-1, then you have to translate that to UTF-8 yourself.
CURLOPT_ENCODING is used to set the Accept-Encoding: header when fetching a page. Valid values are "identity", "deflate", and "gzip". As you can see, it has no meaning for the character-set encoding.
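Translating the content yourself could be sketched as below. Both helper names are hypothetical, and the sketch assumes the remote server declares its charset in the Content-Type header, which you can read with curl_getinfo($ch, CURLINFO_CONTENT_TYPE) before closing the handle. It needs the mbstring extension, and mb_convert_encoding() will raise an error if the declared charset is one it does not know.

```php
<?php
// Hypothetical helper: extract the charset from a Content-Type value,
// e.g. "text/html; charset=ISO-8859-1". Returns null when none is declared.
function charset_from_content_type(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([^";\s]+)/i', $contentType, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: convert a fetched body to UTF-8 when the server
// declared a different charset; otherwise return it unchanged.
function body_to_utf8(string $body, ?string $contentType): string
{
    $charset = charset_from_content_type($contentType);
    if ($charset === null || strcasecmp($charset, 'UTF-8') === 0) {
        return $body; // nothing declared, or already UTF-8
    }
    return mb_convert_encoding($body, 'UTF-8', $charset);
}
```

Inside getSource(), this would mean capturing the content type before curl_close() and returning body_to_utf8($output, $contentType).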

PHP curl - having a bit of trouble with special/unique/rare characters

I have the following code on my server running on php 5.2.*;
$curl = curl_init();
//$sumName = curl_escape($curl, $sumNameWeb);
$summonerName = urlencode($summonerName);
$url = "https://euw.api.pvp.net/api/lol/euw/v1.4/summoner/by-name/{$summonerName}?api_key=".$key;
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, $url);
$result = curl_exec($curl);
$result = utf8_encode($result);
$obj = json_decode($result, true);
$statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
curl_close($curl);
It works fine; however, when the name contains special characters such as ë Ö å í, it fails to connect. I have been trying different approaches to find a fix, but without success.
OK, I have found my error. However, this is my situation now: it connects to the server and gets the data, and I use $sumNameWeb to access the decoded JSON, but the special character in the returned $sumNameWeb has changed. Here is the code that accesses the JSON:
$sumID = $obj[$sumNameWeb]["id"];
$sumLvl = $obj[$sumNameWeb]["summonerLevel"];
an example is, entering ë and returning ë from the server
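Try this:
Set one more cURL option on your request. With an empty string, cURL advertises every compression encoding it supports and decodes the response for you, which avoids garbled (still-compressed) data in the result.
curl_setopt($curl, CURLOPT_ENCODING, "");
I hope this helps!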
urlencode() percent-encodes the bytes of the string it is given, so the result depends on the string's character encoding. Most likely your text (or source code) is in an encoding other than UTF-8. You have to make sure it is UTF-8.
Send this header before making the cURL request:
header('Content-Type: text/html; charset=utf-8');
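To illustrate why the input encoding matters, here is a small sketch that percent-encodes the same letter from two different byte encodings. The variable names are made up for the example, and it uses rawurlencode(), which suits path segments such as the summoner name; urlencode() gives the same result for these bytes. The last step needs the mbstring extension.

```php
<?php
$utf8   = "\xC3\xAB"; // "ë" encoded as UTF-8 (two bytes)
$latin1 = "\xEB";     // "ë" encoded as ISO-8859-1 (one byte)

// The same visible character produces different percent-encoded output.
$fromUtf8   = rawurlencode($utf8);   // %C3%AB
$fromLatin1 = rawurlencode($latin1); // %EB

// Converting the Latin-1 input to UTF-8 first yields the UTF-8 form.
$normalized = rawurlencode(mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1'));
```

An API that expects UTF-8 will not recognize %EB as "ë", which would explain a lookup that fails only for names with accented letters.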
I faced the same problem. urlencode() would not work with these links, so I had to replace the characters myself.
$curl = curl_init();
//$sumName = curl_escape($curl, $sumNameWeb);
$summonerName = urlencode($summonerName);
$url = "https://euw.api.pvp.net/api/lol/euw/v1.4/summoner/by-name/{$summonerName}?api_key=".$key;
$str = $url;
$str = str_replace("{", "%7B", $str);
$str = str_replace("$", "%24", $str);
$str = str_replace("}", "%7D", $str);
$url = $str;
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, $url);
$result = curl_exec($curl);
$result = utf8_encode($result);
$obj = json_decode($result, true);
$statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
curl_close($curl);
This should work. If additional characters need to be replaced, you can look up their percent-encoded substitutes with a URL encoder.

PHP function to convert from html codes to normal chars

I have a string like this:
La Torre Eiffel paragonata all&#8217;Everest
What PHP function should I use to convert the &#8217; to the actual "normal" character:
La Torre Eiffel paragonata all’Everest
I'm using CURL to fetch a page and this page has that string in it but for some reason the HTML chars are not decoded.
The my_url test page is an Italian blog with iso characters, and all the apostrophes are encoded in html code like above.
$output = curl_download($my_url);
$output = htmlspecialchars_decode($output);
function curl_download($Url){
// is cURL installed yet?
if (!function_exists('curl_init')){
die('Sorry cURL is not installed!');
}
// OK cool - then let's create a new cURL resource handle
$ch = curl_init();
// Now set some options (most are optional)
// Set URL to download
curl_setopt($ch, CURLOPT_URL, $Url);
// Set a referer
curl_setopt($ch, CURLOPT_REFERER, "http://www.example.org/yay.htm");
// User agent
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
// Include header in result? (0 = no, 1 = yes)
curl_setopt($ch, CURLOPT_HEADER, 0);
// Should cURL return or print out the data? (true = return, false = print)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Timeout in seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
// Download the given URL, and return output
$output = curl_exec($ch);
// Close the cURL resource, and free system resources
curl_close($ch);
return $output;
}
html_entity_decode. From the php.net manual: html_entity_decode() is the opposite of htmlentities() in that it converts all HTML entities in the string to their applicable characters.
try this
echo html_entity_decode('La Torre Eiffel paragonata all&#8217;Everest',ENT_QUOTES,'UTF-8');
so in your code change this
$output = curl_download($my_url);
$output = htmlspecialchars_decode($output);
to
$output = curl_download($my_url);
$output = html_entity_decode($output,ENT_QUOTES,'UTF-8');
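The difference between the two functions can be seen in a short sketch. The sample string is made up for illustration and assumes the apostrophe arrives as the numeric entity &#8217;.

```php
<?php
$encoded = 'all&#8217;Everest &amp; &lt;b&gt;';

// htmlspecialchars_decode() only reverses the few special characters
// (&amp; &lt; &gt; and quotes); the numeric entity is left untouched.
$partial = htmlspecialchars_decode($encoded, ENT_QUOTES);

// html_entity_decode() also converts numeric and named entities.
$full = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');
```

Passing 'UTF-8' as the third argument makes the decoded apostrophe come out as a UTF-8 character, so the page that displays it should also be served as UTF-8.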
