Acents become interrogation marks in php when parsing html - php

i'm getting a PT-BR text automatically from downloading a html page and the acentution becomes interrogation marks when I use uft8_decode, this is my function:
function pegaMsg($string)
{
$bot_url = "http://website.com";
//&rnd=&msg="
$rand_msg = rand(0,100);
$url = $bot_url . $rand_msg . "&msg=" . $string;
$url = str_replace(" ", "%20", $url);
//echo "\n" . $url;
$download = http_get($url, $referer="");
$download['FILE'] = utf8_decode($download['FILE']);
$download['FILE'] = str_replace("var resp = ", "", $download['FILE']);
$download['FILE'] = str_replace("\\r\\n", "", $download['FILE']);
$download['FILE'] = str_replace(";", "", $download['FILE']);
$download['FILE'] = str_replace("\'", "", $download['FILE']);
$download['FILE'] = trim($download['FILE']);
return $download['FILE'];
}
this is the output expected:
VOCÊ TINHA DUAS ESCOLHAS:
and this is what I get:
'VOC? TINHA DUAS ESCOLHAS:
what can I do ? I want the ^ displayed ! thanks and sorry for the bad english

utf8_decode replaces invalid code unit sequences ?. The reason you're getting a ? is likely because the text you're passing to utf8_decode was not in UTF-8 to begin with.
In fact, it's possible it was already in ISO-8859-1, which is the encoding of the string returned by utf8_decode. In that case, your solution would be to just omit the call to utf8_decode.
If the original text was neither in UTF-8 nor in ISO-8859-1 (which is what I'm assuming you want, since you're calling utf8_decode), you have to use iconv or mb_convert_encoding.
A final possibility is that whatever is interpreting the script output is assuming the encoding of the script output is different from what it actually and it also converts invalid code unit sequences to ?.

Try to use encode
$download['FILE'] = utf8_encode($download['FILE']);

Related

PHP UTF-8 mb_convert_encode and Internet-Explorer

Since some days I read about Character-Encoding, I want to make all my Pages with UTF-8 for Compability. But I get stuck when I try to convert User-Input to UTF-8, this works on all Browsers, expect Internet-Explorer (like always).
I don't know whats wrong with my code, it seems fine to me.
I set the header with char encoding
I saved the file in UTF-8 (No BOM)
This happens only, if you try to access to the page via $_GET on the internet-Explorer myscript.php?c=äüöß
When I write down specialchars on my site, they would displayed correct.
This is my Code:
// User Input
$_GET['c'] = "äüöß"; // Access URL ?c=äüöß
//--------
header("Content-Type: text/html; charset=utf-8");
mb_internal_encoding('UTF-8');
$_GET = userToUtf8($_GET);
function userToUtf8($string) {
if(is_array($string)) {
$tmp = array();
foreach($string as $key => $value) {
$tmp[$key] = userToUtf8($value);
}
return $tmp;
}
return userDataUtf8($string);
}
function userDataUtf8($string) {
print("1: " . mb_detect_encoding($string) . "<br>"); // Shows: 1: UTF-8
$string = mb_convert_encoding($string, 'UTF-8', mb_detect_encoding($string)); // Convert non UTF-8 String to UTF-8
print("2: " . mb_detect_encoding($string) . "<br>"); // Shows: 2: ASCII
$string = preg_replace('/[\xF0-\xF7].../s', '', $string);
print("3: " . mb_detect_encoding($string) . "<br>"); // Shows: 3: ASCII
return $string;
}
echo $_GET['c']; // Shows nothing
echo mb_detect_encoding($_GET['c']); // ASCII
echo "äöü+#"; // Shows "äöü+#"
The most confusing Part is, that it shows me, that's converted from UTF-8 to ASCII... Can someone tell me why it doesn't show me the specialchars correctly, whats wrong here? Or is this a Bug on the Internet-Explorer?
Edit:
If I disable converting it says, it's all UTF-8 but the Characters won't show to me either... They are displayed like "????"....
Note: This happens ONLY in the Internet-Explorer!
Although I prefer using urlencoded strings in address bar but for your case you can try to encode $_GET['c'] to utf8. Eg.
$_GET['c'] = utf8_encode($_GET['c']);
An approach to display the characters using IE 11.0.18 which worked:
Retrieve the Unicode of your character : example for 'ü' = 'U+00FC'
According to this post, convert it to utf8 entity
Decode it using utf8_decode before dumping
The line of code illustrating the example with the 'ü' character is :
var_dump(utf8_decode(html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", 'U+00FC'), ENT_NOQUOTES, 'UTF-8')));
To summarize: For displaying purposes, go from Unicode to UTF8 then decode it before displaying it.
Other resources:
a post to retrieve characters' unicode

Replacing \r\n (newline characters) after running json_encode

So when I run json_encode, it grabs the \r\n from MySQL aswell. I have tried rewriting strings in the database to no avail. I have tried changing the encoding in MySQL from the default latin1_swedish_ci to ascii_bin and utf8_bin. I have done tons of str_replace and chr(10), chr(13) stuff. I don't know what else to say or do so I'm gonna just leave this here....
$json = json_encode($new);
if(isset($_GET['pretty'])) {
echo str_replace("\/", "/", jsonReadable(parse($json)));
} else {
$json = str_replace("\/", "/", $json);
echo parse($json);
}
The jsonReadable function is from here and the parse function is from here. The str_replaces that are already in there are because I am getting weird formatted html tags like </h1>. Finally, $new is an array which is crafted above. Full code upon request.
Help me StackOverflow. You're my only hope
Does the string contain "\r\n" (as in 0x0D 0x0A) or the literal string '\r\n'? If it's the former, this should remove any newlines.
$json = preg_replace("!\r?\n!", "", $json);
Optionally, replace the second parameter "" with "<br />" if you'd like to replace the newlines with a br tag. For the latter case, try the following:
$json = preg_replace('!\\r?\\n!', "", $json);
Don't replace it in the JSON, replace it in the source before you encode it.
I had a similar issue, i used:
$p_num = trim($this->recp);
$p_num = str_replace("\n", "", $p_num);
$p_num = str_replace("\r", ",", $p_num);
$p_num = str_replace("\n",',', $p_num);
$p_num = rtrim($p_num, "\x00..\x1F");
Not sure if this will help with your requirements.

Using single 'smart quote' in my JSON data is breaking PHP script

I've got a PHP script that is reading in some JSON data provided by a client. The JSON data provided had a single 'smart quote' in it.
Example:
{
"title" : "Lorem Ipsum’s Dolar"
}
In my script I'm using a small function to get the json data:
public function getJson($url) {
$filePath = $url;
$fh = fopen($filePath, 'r') or die();
$temp = fread($fh, filesize($filePath));
$temp = utf8_encode($temp);
echo $temp . "<br />";
$json = json_decode($temp);
fclose($fh);
return $json;
}
If I utf8 encode the data, when I echo it out I see nothing where the quote should be. If I don't utf8 encode the data, when I echo it out I see the funny question mark symbol �
Any thoughts on how to actually see the proper character??
Thanks!
Is it possibe that the server is sending the json data in an encoding like windows-1252? That codepage has some smart code characters where iso-8859 has control characters. Could you try to use iconv("windows-1252", "utf-8", $temp) instead of utf8_encode. Even better would be if the server already sends utf-8 encoded json, since that is the recommended encoding per rfc4627.
The issue is more on the side, that generates the JSON file. There you should escape the ' by \'
If you can't modify this part, you should do it like this with addslashes:
$temp = fread($fh, filesize($filePath));
$temp = utf8_encode($temp);
echo $temp . "<br />";
$temp = addslashes($temp);
$json = json_decode($temp);
Can you possibly do a string replace assuming the data is all utf8?
$text = str_replace($find, $replace, $text);
Looking for the characters below?
'“' // left side double smart quote
'â€' // right side double smart quote
'‘' // left side single smart quote
'’' // right side single smart quote
'…' // elipsis
'—' // em dash
'–' // en dash

problems with encoding (php)

My site have a utf-8 encoding (Drupal).
I use include function to integrate my page with 3rd party service.
But this gives a bad result - bad encoding for include part of page.
i try this, but this don't give any result:
iconv("ASCII","utf-8",include("http://new.velo-travel.ru/themes/themex/spectrum_view.php?$QUERY_STRING"))
before this i use mb_detect_encoding to know encoding
this is included file:
$url = 'http://young.spectrum.ru/cgi-bin/programs_form.pl';
$params = $_GET;
if ($params){
$url .= '?';
foreach ($params as $key => $value) $url .= '&' . $keys . '=' . urlencode($value);
}
#$content = file_get_contents($url);
echo iconv("cp-1251","utf-8", $url);
The include function doesn't return a string (unless you actually call return in the included file), it literally includes the content to your source code. So if the server returns PHP source code, it would be interpreted on your server (which basically means the owners of "new.velo-travel.ru" would have control over your site :)).
Also, as Augenfeind pointed out, the page is in windows-1251. So, you probably want:
echo iconv("windows-1251", "utf-8", file_get_contents("http://new.velo-travel.ru/themes/themex/spectrum_view.php?$QUERY_STRING"))
Well, the page you are including is delivered as "windows-1251" by the web server of velo-travel.ru.
So you might want to change the encoding in your iconv-command?!
Try mb_convert_encoding or iconv for all outputted text and literals in your file. For example:
$text = mb_convert_encoding( $oldText, 'UTF-8', "ISO-8859-1" );
or
$text = iconv( "ISO-8859-1", "UTF-8", $oldText);
Hope you get it working!

removing strange characters from php string

this is what i have right now
Drawing an RSS feed into the php, the raw xml from the rss feed reads:
Paul’s Confidence
The php that i have so far is this.
$newtitle = $item->title;
$newtitle = utf8_decode($newtitle);
The above returns;
Paul?s Confidence
If i remove the utf_decode, i get this
Paul’s Confidence
When i try a str_replace;
$newtitle = str_replace("”", "", $newtitle);
It doesnt work, i get;
Paul’s Confidence
Any thoughts?
This is my function that always works, regardless of encoding:
function RemoveBS($Str) {
$StrArr = str_split($Str); $NewStr = '';
foreach ($StrArr as $Char) {
$CharNo = ord($Char);
if ($CharNo == 163) { $NewStr .= $Char; continue; } // keep £
if ($CharNo > 31 && $CharNo < 127) {
$NewStr .= $Char;
}
}
return $NewStr;
}
How it works:
echo RemoveBS('Hello õhowå åare youÆ?'); // Hello how are you?
Try this:
$newtitle = html_entity_decode($newtitle, ENT_QUOTES, "UTF-8")
If this is not the solution browse this page http://us2.php.net/manual/en/function.html-entity-decode.php
This will remove all non-ascii characters / special characters from a string.
//Remove from a single line string
$output = "Likening ‘not-critical’ with";
$output = preg_replace('/[^(\x20-\x7F)]*/','', $output);
echo $output;
//Remove from a multi-line string
$output = "Likening ‘not-critical’ with \n Likening ‘not-critical’ with \r Likening ‘not-critical’ with. ' ! -.";
$output = preg_replace('/[^(\x20-\x7F)\x0A\x0D]*/','', $output);
echo $output;
I solved the problem. Seems to be a short fix rather than the larger issue, but it works.
$newtitle = str_replace('’', "'", $newtitle);
I also found this useful snippit that may help others with same problem;
<?
$find[] = '“'; // left side double smart quote
$find[] = 'â€'; // right side double smart quote
$find[] = '‘'; // left side single smart quote
$find[] = '’'; // right side single smart quote
$find[] = '…'; // elipsis
$find[] = '—'; // em dash
$find[] = '–'; // en dash
$replace[] = '"';
$replace[] = '"';
$replace[] = "'";
$replace[] = "'";
$replace[] = "...";
$replace[] = "-";
$replace[] = "-";
$text = str_replace($find, $replace, $text);
?>
Thanks everyone for your time and consideration.
Yeah this is not working for me. What is the workaround for this? – vaichidrewar Mar 12 at 22:29
Add this to the HTML head (or modify if already there):
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
This will encode the funny chars like "“" into UTF-8 so that the str_replace() function will interpret them properly.
Or you can do this:
ini_set('default_charset', 'utf-8');
Is the character encoding setting for your PHP server something other than UTF-8? If so, is there a reason or could it be changed to UTF-8? Though we don't store data in UTF-8 in our database, I've found that setting the webserver's character set to UTF-8 seems to help resolve character set issues.
I'd be interested in hearing others' opinions about this... whether I'm setting myself up for problems by setting webserver to UTF-8 while storing submitted data in Latin1 in our mysql database. I know there was a reason I chose Latin1 for the database but can't recall what it was. Interestingly, our current setup seems to allow for non-UTF-8 character entry and subsequent rendering... it seems that storing in Latin1 doesn't prevent subsequent decoding and display of all UTF-8 characters?
Use the below PHP code to remove
html_entity_decode(mb_convert_encoding(stripslashes($name), "HTML-ENTITIES", 'UTF-8'))
Read up on http://us.php.net/manual/en/function.html-entity-decode.php
That & symbol is a html code so you can easily decode it.
Super simple solution is to have the characters decoded when the page is loaded
Simply copy/paste the following at the beginning of the script
header('Content-Type: text/html; charset=UTF-8');
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_regex_encoding('UTF-8');
Reference: http://php.net/manual/en/function.mb-internal-encoding.php
comment left by webfav at web dot de
Many Strange Character be removed by applying
mysqli_set_charset($con,"utf8");
below the mysql connection code.
but in some circumstances of removing this type strange character like â€
we need to use: $title = ' Stefen Suraj'; $newtitle = preg_replace('/[^(\x20-\x7F)]*/','', $title); echo $newtitle;
Output will be: Stefen Suraj
It does not work
You need to use
$arr1 = str_split($str)
then foreach and
echo($arr1[$k])
This will show you exactly which characters are written into the string.
Please Try this.
$find[] = '/“/' //'“'; // left side double smart quote
$find[] = '/”/' //'â€'; // right side double smart quote
$find[] = '/‘/' //'‘'; // left side single smart quote
$find[] = '/’/' //'’'; // right side single smart quote
$find[] = '/â€&#133/' //'…'; // elipsis
$find[] = '/‖/' //'—'; // em dash
$find[] = '/–/' //'–'; // en dash
$replace[] = '“' // '"';
$replace[] = '”' // '"';
$replace[] = '‘' // "'";
$replace[] = '’' // "'";
$replace[] = '⋯' // "...";
$replace[] = '—' // "-";
$replace[] = '–' // "-";
$text = str_replace($find, $replace, $text);
1.The order of the strings in the $find array is significant.
2.This string "‘" should contain a tilde and look like three characters. If I save the .php file with my Genie editor it gits changed to just two characters "â€".
3.This is a useful reference https://www.i18nqa.com/debug/utf8-debug.html
<?php
$text = "‘’“â€1‘ 2’ 3â€â€œâ€™â€˜ 4’ 5 6 7’ ‘, ’, “, â€â€˜";
echo($text . "<br>");
$find = array("‘", "’", "“", "â€");
$replace = array("‘", "’", "“", "”");
$text = str_replace($find, $replace, $text);
echo($text);
?>
Just one simple solution.
if your string contains these type of strange chars
suppose $text contains some of these then just do as shown bellow:
$mytext=mb_convert_encoding($text, "HTML-ENTITIES", 'UTF-8')
and it will work..

Categories