removing strange characters from php string

This is what I have right now.
I'm pulling an RSS feed into PHP. The raw XML from the feed reads:
Paul’s Confidence
The PHP that I have so far is this:
$newtitle = $item->title;
$newtitle = utf8_decode($newtitle);
The above returns:
Paul?s Confidence
If I remove the utf8_decode, I get this:
Paulâ€™s Confidence
When I try a str_replace:
$newtitle = str_replace("”", "", $newtitle);
it doesn't work; I get:
Paulâ€™s Confidence
Any thoughts?

This is the function I use. It works byte by byte and keeps only printable ASCII (plus byte 163, which is £ in ISO-8859-1), so the stray bytes are stripped whatever the encoding:
function RemoveBS($Str) {
    $StrArr = str_split($Str);
    $NewStr = '';
    foreach ($StrArr as $Char) {
        $CharNo = ord($Char);
        if ($CharNo == 163) { $NewStr .= $Char; continue; } // keep £ (ISO-8859-1 byte)
        if ($CharNo > 31 && $CharNo < 127) {
            $NewStr .= $Char; // printable ASCII
        }
    }
    return $NewStr;
}
How it works:
echo RemoveBS('Hello õhowå åare youÆ?'); // Hello how are you?

Try this:
$newtitle = html_entity_decode($newtitle, ENT_QUOTES, "UTF-8");
If this is not the solution, see http://us2.php.net/manual/en/function.html-entity-decode.php

This will remove all non-ASCII and control characters from a string.
// Remove from a single-line string
$output = "Likening â€˜not-criticalâ€™ with";
$output = preg_replace('/[^\x20-\x7E]+/', '', $output);
echo $output;
// Remove from a multi-line string (keeps \n and \r)
$output = "Likening â€˜not-criticalâ€™ with \n Likening â€˜not-criticalâ€™ with \r Likening â€˜not-criticalâ€™ with. ' ! -.";
$output = preg_replace('/[^\x20-\x7E\x0A\x0D]+/', '', $output);
echo $output;

I solved the problem. It seems to be a quick fix rather than a solution to the underlying issue, but it works.
$newtitle = str_replace('â€™', "'", $newtitle);
I also found this useful snippet that may help others with the same problem:
<?php
$find[] = 'â€œ'; // left side double smart quote
$find[] = 'â€'; // right side double smart quote
$find[] = 'â€˜'; // left side single smart quote
$find[] = 'â€™'; // right side single smart quote
$find[] = 'â€¦'; // ellipsis
$find[] = 'â€”'; // em dash
$find[] = 'â€“'; // en dash
$replace[] = '"';
$replace[] = '"';
$replace[] = "'";
$replace[] = "'";
$replace[] = "...";
$replace[] = "-";
$replace[] = "-";
$text = str_replace($find, $replace, $text);
?>
Thanks everyone for your time and consideration.
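
The replacement table above treats the symptom. A sequence such as â€™ is what the three UTF-8 bytes of ’ look like when each byte is read as Windows-1252, so if the string really has been converted twice (and is not merely being displayed with the wrong charset), converting it back once should recover the original. A minimal sketch under that assumption; fix_mojibake is a hypothetical helper name, and the iconv extension is assumed to be available:

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 read as Windows-1252" double encoding.
function fix_mojibake(string $text): string
{
    // Map each character back to the single CP1252 byte it was mistaken for.
    // iconv() returns false if a character has no CP1252 equivalent.
    $bytes = @iconv('UTF-8', 'CP1252', $text);

    // Accept the result only if the recovered bytes are themselves valid UTF-8;
    // otherwise the input was probably not double-encoded, so leave it alone.
    if ($bytes === false || preg_match('//u', $bytes) !== 1) {
        return $text;
    }
    return $bytes;
}
```

Sequences that involve a byte Windows-1252 leaves undefined (the right double quote is one) may not convert; in that case the helper returns its input unchanged, and a lookup table like the one above is still needed.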

Yeah this is not working for me. What is the workaround for this? – vaichidrewar Mar 12 at 22:29
Add this to the HTML head (or modify if already there):
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
This declares the page as UTF-8, so the browser decodes the multi-byte sequences that otherwise show up as "â€œ" instead of rendering each byte as a separate character.
Or you can do this:
ini_set('default_charset', 'utf-8');

Is the character encoding setting for your PHP server something other than UTF-8? If so, is there a reason or could it be changed to UTF-8? Though we don't store data in UTF-8 in our database, I've found that setting the webserver's character set to UTF-8 seems to help resolve character set issues.
I'd be interested in hearing others' opinions about this... whether I'm setting myself up for problems by setting webserver to UTF-8 while storing submitted data in Latin1 in our mysql database. I know there was a reason I chose Latin1 for the database but can't recall what it was. Interestingly, our current setup seems to allow for non-UTF-8 character entry and subsequent rendering... it seems that storing in Latin1 doesn't prevent subsequent decoding and display of all UTF-8 characters?

Use the PHP code below to remove them:
html_entity_decode(mb_convert_encoding(stripslashes($name), "HTML-ENTITIES", 'UTF-8'))

Read up on http://us.php.net/manual/en/function.html-entity-decode.php
That & symbol starts an HTML entity, so you can easily decode it.

A super simple solution is to set the encoding when the page is loaded.
Simply copy/paste the following at the beginning of the script:
header('Content-Type: text/html; charset=UTF-8');
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_regex_encoding('UTF-8');
Reference: http://php.net/manual/en/function.mb-internal-encoding.php
comment left by webfav at web dot de

Many strange characters can be removed by calling
mysqli_set_charset($con,"utf8");
right after the MySQL connection code.
In some cases, though, to remove a strange sequence like â€
we need to use:
$title = 'â€…Stefen Suraj';
$newtitle = preg_replace('/[^(\x20-\x7F)]*/','', $title);
echo $newtitle;
Output will be: Stefen Suraj

It does not work.
To see why, split the string into bytes with
$arr1 = str_split($str);
then loop over the result with foreach and
echo($arr1[$k]);
This will show you exactly which characters are in the string.
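
Echoing the individual bytes is hard to read when they are not printable, so a hex dump is often easier to compare. A small sketch of that idea; hex_bytes is a hypothetical helper name, not something from the answer above:

```php
<?php
// Hypothetical helper: show the raw bytes of a string as space-separated hex pairs.
function hex_bytes(string $text): string
{
    if ($text === '') {
        return '';
    }
    // bin2hex() yields two hex digits per byte; regroup them into pairs.
    return implode(' ', str_split(bin2hex($text), 2));
}
```

A correctly encoded UTF-8 ’ shows up as e2 80 99, while a double-encoded one (the â€™ form) shows up as c3 a2 e2 82 ac e2 84 a2, which tells you whether the string itself is damaged or only its display.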

Please Try this.
$find[] = '/â/'; // 'â€œ' left side double smart quote
$find[] = '/â/'; // 'â€' right side double smart quote
$find[] = '/â/'; // 'â€˜' left side single smart quote
$find[] = '/â/'; // 'â€™' right side single smart quote
$find[] = '/â&#133/'; // 'â€¦' ellipsis
$find[] = '/â/'; // 'â€”' em dash
$find[] = '/â/'; // 'â€“' en dash
$replace[] = '“'; // '"'
$replace[] = '”'; // '"'
$replace[] = '‘'; // "'"
$replace[] = '’'; // "'"
$replace[] = '⋯'; // "..."
$replace[] = '—'; // "-"
$replace[] = '–'; // "-"
$text = str_replace($find, $replace, $text);

1. The order of the strings in the $find array is significant.
2. The string "â€˜" should contain a tilde and look like three characters. If I save the .php file with my Genie editor, it gets changed to just two characters, "â€".
3. This is a useful reference: https://www.i18nqa.com/debug/utf8-debug.html
<?php
$text = "â€˜â€™â€œâ€1â€˜ 2â€™ 3â€â€œâ€™â€˜ 4â€™ 5 6 7â€™ â€˜, â€™, â€œ, â€â€˜";
echo($text . "<br>");
$find = array("â€˜", "â€™", "â€œ", "â€");
$replace = array("‘", "’", "“", "”");
$text = str_replace($find, $replace, $text);
echo($text);
?>

Just one simple solution.
If your string contains these kinds of strange characters,
say in $text, then just do as shown below:
$mytext = mb_convert_encoding($text, "HTML-ENTITIES", 'UTF-8');
and it will work.

Related

unicode chars with wikipedia search in PHP

I pass a PHP string to the Wikipedia search page in order to retrieve part of the definition.
Everything works fine, except Unicode characters, which appear in the \u... form. Here is an example to explain myself better. As you can see, the phonetic transcription of the name is not readable:
Henrik Ibsen, Henrik Ibsen \u02c8h\u025bn\u027eik \u02c8ips\u0259n
(Skien, 20 marzo 1828 - Oslo, 23 maggio 1906) è stato uno scrittore,
drammaturgo, poeta e regista teatrale norvegese.
The code I use to get the snippet from Wikipedia is this:
$word = $_GET["word"];
$html = file_get_contents('https://it.wikipedia.org/w/api.php?action=opensearch&search='.$word);
$utf8html = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $html), ENT_NOQUOTES, 'UTF-8');
The last line of my code does not solve the problem.
Do you know how to get a clean text which is entirely readable?

The output of the Wikipedia search API is JSON. Don't try to scrape bits out of it and parse string literal escapes yourself, that way madness lies. Just use a readily available JSON parser.
Also, you need to URL-escape the word when you add it into a query string, otherwise any searches for words with URL-special characters in will fail.
In summary:
$word = $_GET['word'];
$url = 'https://it.wikipedia.org/w/api.php?action=opensearch&search='.urlencode($word);
$response = json_decode(file_get_contents($url));
$matching_titles_array = $response[1];
$matching_summaries_array = $response[2];
$matching_urls = $response[3];
...etc...
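
To see that json_decode takes care of the \u... escapes by itself, here is a small offline sketch. The response body is a made-up sample in the opensearch shape (query, titles, summaries, URLs), not real API output, so it runs without a network call:

```php
<?php
// A canned, made-up response in the opensearch shape: [query, titles, summaries, urls].
// Single quotes keep the \uXXXX escapes as literal text, as they arrive over HTTP.
$body = '["ibsen",["Henrik Ibsen"],["\u02c8h\u025bn\u027eik \u02c8ips\u0259n"],["https://it.wikipedia.org/wiki/Henrik_Ibsen"]]';

$response = json_decode($body);
if (!is_array($response)) {
    throw new RuntimeException('Unexpected response: ' . json_last_error_msg());
}

$titles    = $response[1];
$summaries = $response[2]; // the \uXXXX escapes are now real UTF-8 characters
```

After decoding, $summaries[0] holds the phonetic transcription as ordinary UTF-8 text, so no regex over the raw response is needed.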

You have some errors in your regex; try using:
<?php
$str = "Henrik Ibsen, Henrik Ibsen \u02c8h\u025bn\u027eik \u02c8ips\u0259n(Skien, 20 marzo 1828 - Oslo, 23 maggio 1906) è stato uno scrittore, drammaturgo, poeta e regista teatrale norvegese.";
$utf8html = preg_replace('#\\\U([0-9A-F]{4})#i', "&#x\\1", $str);
echo $utf8html;

Well, the answer posted by bobince is certainly more effective than my previous procedure, which aimed at scraping and pruning bit by bit what I needed. Just to show you how I was doing it, here is my previous code:
$html = file_get_contents('https://it.wikipedia.org/w/api.php?action=opensearch&search='.$s);
$decoded = preg_replace('#\\\U([0-9A-F]{4})#i', "&#x\\1", $html);
$par = array("[", "]");
$def_no_par = str_replace($par, "", $decoded);
$def_no_vir = str_replace("\"\",", "", $def_no_par);
$def_cap = str_replace("\",", "\",<br>", $def_no_vir);
$def_pulita = str_replace("\"", "", $def_cap);
$def_clean = str_replace(".,", ".", $def_pulita);
$definizione = str_replace("$s,", "", $def_clean);
$out = str_replace("\\", "\"", $definizione);
As you can see, removing parts of the output to make it more readable was quite tiresome (and not completely successful).
Using the JSON approach makes everything more linear. Here is my new workaround:
$search = 'https://it.wikipedia.org/w/api.php?action=opensearch&search='.urlencode($s);
$response = json_decode(file_get_contents($search));
$matching_titles_array = $response[1];
$matching_summaries_array = $response[2];
$matching_urls = $response[3];
echo '<h3><div align="center"><font color=" #A3A375">'.$titolo.'</font></div></h3><br><br>';
foreach($response[1] as $t) {
echo '<font color="#5C85D6"><b>'.$t.'</b></font><br><br>';
}
foreach($response[2] as $s) {
echo $s.'<br><br>';
}
foreach($response[3] as $l) {
$link = preg_replace('!(((f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1', $l);
echo $link.'<br><br>';
}
The advantage is that now I can manipulate the arrays as I wish.
You can see it in action here:

PHP extra whitespace not being deleted

I'm counting words in an article and removing common words such as "and" or "the".
I'm removing them with preg_replace.
After that is done, I do a quick cleanup of extra whitespace using:
$search_body = preg_replace('/\s+/',' ',$search_body);
However, I've got some very stubborn whitespace that will not go away. I've tried
if($word == "" OR $word == " "){
//chop it's head off
}
But the if statement does not see $word as being just whitespace. I've also tried printing it to the screen to get the raw data type of it and it's still just showing up blank.
Here is the full regex that I'm using.
$pattern = array(
'/\&quot\;/',
'/[0-9]/',
'/\,/',
'/\./',
'/\!/',
'/\#/',
'/\#/',
'/\$/',
'/\%/',
'/\^/',
'/\&/',
'/\*/',
'/\(/',
'/\)/',
'/\_/',
'/\"/',
'/\'/',
'/\:/',
'/\;/',
'/\?/',
'/\`/',
'/\~/',
'/\[/',
'/\]/',
'/\{/',
'/\}/',
'/\|/',
'/\+/',
'/\=/',
'/\-/',
'/–/',
'/°/',
'/\bthe\b/',
'/\band\b/',
'/\bthat\b/',
'/\bhave\b/',
'/\bfor\b/',
'/\bnot\b/',
'/\bwith\b/',
'/\byou\b/',
'/\bthis\b/',
'/\bbut\b/',
'/\bhis\b/',
'/\bfrom\b/',
'/\bthey\b/',
'/\bsay\b/',
'/\bher\b/',
'/\bshe\b/',
'/\bwill\b/',
'/\bone\b/',
'/\ball\b/',
'/\bwould\b/',
'/\bthere\b/',
'/\btheir\b/',
'/\bwhat\b/',
'/\bout\b/',
'/\babout\b/',
'/\bwho\b/',
'/\bget\b/',
'/\bwhich\b/',
'/\bwhen\b/',
'/\bmake\b/',
'/\bcan\b/',
'/\blike\b/',
'/\btime\b/',
'/\bjust\b/',
'/\bhim\b/',
'/\bknow\b/',
'/\btake\b/',
'/\bpeople\b/',
'/\binto\b/',
'/\byear\b/',
'/\byour\b/',
'/\bgood\b/',
'/\bsome\b/',
'/\bcould\b/',
'/\bthem\b/',
'/\bsee\b/',
'/\bother\b/',
'/\bthan\b/',
'/\bthen\b/',
'/\bnow\b/',
'/\blook\b/',
'/\bonly\b/',
'/\bcome\b/',
'/\bits\b/', //it's?
'/\bover\b/',
'/\bthink\b/',
'/\balso\b/',
'/\bback\b/',
'/\bafter\b/',
'/\buse\b/',
'/\btwo\b/',
'/\bhow\b/',
'/\bour\b/',
'/\bwork\b/',
'/\bfirst\b/',
'/\bwell\b/',
'/\bway\b/',
'/\beven\b/',
'/\bnew\b/',
'/\bwant\b/',
'/\bbecause\b/',
'/\bany\b/',
'/\bthese\b/',
'/\bgive\b/',
'/\bday\b/',
'/\bmost\b/',
'/\bare\b/',
'/\bwas\b/',
'/\<\w+\>/', '/\<\/\w+\>/',
'/\b\w{1}\b/', //1 letter word
'/\b\w{2}\b/', //2 letter word
'/\//',
'/\</',
'/\>/'
);
$search_body = strip_tags($body);
$search_body = strtolower($search_body);
$search_body = preg_replace($pattern, ' ', $search_body);
$search_body = preg_replace('/\s+/',' ',$search_body);
$search_body = explode(" ", $search_body);
When exploded, blank values show up left and right.
The example text that I am using is too long to post here, but I copied and pasted
this article to give it a test, and it showed 32 counts of whitespace, not including the whitespace in front of or behind other words, even after using trim().
Here's a js.fiddle of the raw data that is being handled by PHP.
htmlentities and htmlspecialchars also show nothing.
Here's the code that counts all the values and puts them into one array.
$inhere = array();
$body_hold = array();
foreach($search_body as $value){
$value = trim($value);
if(in_array($value, $inhere) && $value != ""){
$key = array_search($value, $inhere);
$body_hold[$key]['count'] = $body_hold[$key]['count']+1;
}elseif($value != ""){
$inhere[] = $value;
$body_hold[] = array(
'count' => 1,
'word' => $value
);
}
}
rsort($body_hold);
Basic foreach to see values.
foreach($body_hold as $value){
$count = $value['count'];
$word = trim($value['word']);
echo "Count: ".$count;
echo " Word: ".$word;
echo '<br>';
}
Here's a PHP example of what it's returning

Are you sure you put the exact same data you're processing in the js.fiddle? Or did you get it from a subsequent post-processed step?
It's obviously a Wikipedia article. I went to that article on Wikipedia, opened it in Edit mode, and saw that there are &nbsp;s in the raw wikitext. However, those nbsp's don't appear in your js.fiddle data.
TL;DR: Check for &nbsp; in your processing (and convert to spaces, etc.).

Character 160 looks like a space but it's not. Replacing all of them with regular spaces (32) and then collapsing the repeated spaces will fix your problem.
$search_body = str_replace(chr(160), chr(32), $search_body);
$search_body = trim(preg_replace('/\s+/', ' ', $search_body));
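
One caveat: chr(160) is the non-breaking space only in single-byte encodings such as ISO-8859-1. In UTF-8 the same character is the two bytes 0xC2 0xA0, and replacing byte 160 alone would leave a stray 0xC2 behind. A sketch that assumes the text is UTF-8; normalize_spaces and split_words are hypothetical helper names:

```php
<?php
// Hypothetical helper: turn every run of spaces into a single ASCII space.
function normalize_spaces(string $text): string
{
    // Replace a literal &nbsp; entity left in the markup.
    $text = str_replace('&nbsp;', ' ', $text);

    // With the /u modifier the pattern works on code points, so U+00A0
    // is matched as one character rather than as two separate bytes.
    $clean = preg_replace('/[\s\x{00A0}]+/u', ' ', $text);

    // preg_replace() returns null when the input is not valid UTF-8.
    return trim($clean ?? $text);
}

// Hypothetical helper: split into words without producing empty entries.
function split_words(string $text): array
{
    return preg_split('/ /', normalize_spaces($text), -1, PREG_SPLIT_NO_EMPTY);
}
```

Using PREG_SPLIT_NO_EMPTY also removes the need to test each exploded value against "" afterwards.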

Accents become question marks in PHP when parsing HTML

I'm getting PT-BR text by automatically downloading an HTML page, and the accented characters become question marks when I use utf8_decode. This is my function:
function pegaMsg($string)
{
$bot_url = "http://website.com";
//&rnd=&msg="
$rand_msg = rand(0,100);
$url = $bot_url . $rand_msg . "&msg=" . $string;
$url = str_replace(" ", "%20", $url);
//echo "\n" . $url;
$download = http_get($url, $referer="");
$download['FILE'] = utf8_decode($download['FILE']);
$download['FILE'] = str_replace("var resp = ", "", $download['FILE']);
$download['FILE'] = str_replace("\\r\\n", "", $download['FILE']);
$download['FILE'] = str_replace(";", "", $download['FILE']);
$download['FILE'] = str_replace("\'", "", $download['FILE']);
$download['FILE'] = trim($download['FILE']);
return $download['FILE'];
}
This is the expected output:
VOCÊ TINHA DUAS ESCOLHAS:
and this is what I get:
'VOC? TINHA DUAS ESCOLHAS:
What can I do? I want the ^ displayed! Thanks, and sorry for the bad English.

utf8_decode replaces invalid code unit sequences with ?. The reason you're getting a ? is likely because the text you're passing to utf8_decode was not in UTF-8 to begin with.
In fact, it's possible it was already in ISO-8859-1, which is the encoding of the string returned by utf8_decode. In that case, your solution would be to just omit the call to utf8_decode.
If the original text was neither in UTF-8 nor in ISO-8859-1 (which is what I'm assuming you want, since you're calling utf8_decode), you have to use iconv or mb_convert_encoding.
A final possibility is that whatever is interpreting the script output is assuming the encoding of the script output is different from what it actually is, and it also converts invalid code unit sequences to ?.
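
A sketch of the iconv route, going the other way and normalizing everything to UTF-8. It assumes that any input which is not valid UTF-8 is ISO-8859-1, which may not hold for your source; to_utf8 is a hypothetical helper name:

```php
<?php
// Hypothetical helper: return the text as UTF-8, converting only when needed.
// Assumes that anything that is not valid UTF-8 is ISO-8859-1.
function to_utf8(string $text): string
{
    // An empty pattern with /u matches only if the subject is valid UTF-8.
    if (preg_match('//u', $text) === 1) {
        return $text;
    }
    $converted = iconv('ISO-8859-1', 'UTF-8', $text);
    return $converted === false ? $text : $converted;
}
```

Whatever displays the result (browser, terminal) then has to be told the output is UTF-8 as well, otherwise the last possibility described above applies.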

Try utf8_encode instead:
$download['FILE'] = utf8_encode($download['FILE']);

Replacing \r\n (newline characters) after running json_encode

So when I run json_encode, it grabs the \r\n from MySQL as well. I have tried rewriting strings in the database to no avail. I have tried changing the encoding in MySQL from the default latin1_swedish_ci to ascii_bin and utf8_bin. I have done tons of str_replace and chr(10), chr(13) stuff. I don't know what else to say or do, so I'm gonna just leave this here....
$json = json_encode($new);
if(isset($_GET['pretty'])) {
echo str_replace("\/", "/", jsonReadable(parse($json)));
} else {
$json = str_replace("\/", "/", $json);
echo parse($json);
}
The jsonReadable function is from here and the parse function is from here. The str_replaces that are already in there are because I am getting weird formatted html tags like </h1>. Finally, $new is an array which is crafted above. Full code upon request.
Help me StackOverflow. You're my only hope

Does the string contain "\r\n" (as in 0x0D 0x0A) or the literal string '\r\n'? If it's the former, this should remove any newlines:
$json = preg_replace("!\r?\n!", "", $json);
Optionally, replace the second parameter "" with "<br />" if you'd like to replace the newlines with a br tag. For the latter case, the backslashes have to be escaped for both PHP and the regex engine, so try the following:
$json = preg_replace('!(?:\\\\r)?\\\\n!', "", $json);

Don't replace it in the JSON, replace it in the source before you encode it.
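
A sketch of what cleaning the source could look like, assuming the data is a (possibly nested) array of strings like the $new array in the question; strip_newlines is a hypothetical helper name:

```php
<?php
// Hypothetical helper: remove CR/LF from every string in a nested array
// before it is passed to json_encode().
function strip_newlines(array $data): array
{
    // $data is a local copy, so the caller's array is left untouched.
    array_walk_recursive($data, function (&$value) {
        if (is_string($value)) {
            $value = str_replace(["\r\n", "\r", "\n"], '', $value);
        }
    });
    return $data;
}
```

For example, json_encode(strip_newlines($new), JSON_UNESCAPED_SLASHES) produces JSON without the newline escapes, and the JSON_UNESCAPED_SLASHES flag also avoids the \/ sequences that the question removes with str_replace.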

I had a similar issue; I used:
$p_num = trim($this->recp);
$p_num = str_replace("\n", "", $p_num);
$p_num = str_replace("\r", ",", $p_num);
$p_num = str_replace("\n",',', $p_num);
$p_num = rtrim($p_num, "\x00..\x1F");
Not sure if this will help with your requirements.

Using single 'smart quote' in my JSON data is breaking PHP script

I've got a PHP script that is reading in some JSON data provided by a client. The JSON data provided had a single 'smart quote' in it.
Example:
{
"title" : "Lorem Ipsum’s Dolar"
}
In my script I'm using a small function to get the json data:
public function getJson($url) {
$filePath = $url;
$fh = fopen($filePath, 'r') or die();
$temp = fread($fh, filesize($filePath));
$temp = utf8_encode($temp);
echo $temp . "<br />";
$json = json_decode($temp);
fclose($fh);
return $json;
}
If I utf8 encode the data, when I echo it out I see nothing where the quote should be. If I don't utf8 encode the data, when I echo it out I see the funny question mark symbol �
Any thoughts on how to actually see the proper character??
Thanks!

Is it possible that the server is sending the JSON data in an encoding like Windows-1252? That code page has some smart quote characters where ISO-8859-1 has control characters. Could you try to use iconv("windows-1252", "utf-8", $temp) instead of utf8_encode? Even better would be if the server already sent UTF-8 encoded JSON, since that is the recommended encoding per RFC 4627.
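
A sketch of that suggestion, assuming the file really is Windows-1252 whenever it is not valid UTF-8; decode_json_any is a hypothetical helper name:

```php
<?php
// Hypothetical helper: decode JSON that may have been saved as Windows-1252.
function decode_json_any(string $raw)
{
    // json_decode() requires UTF-8, so convert only if the bytes are not valid UTF-8.
    if (preg_match('//u', $raw) !== 1) {
        // iconv() returns false if a byte is undefined in CP1252.
        $converted = @iconv('CP1252', 'UTF-8', $raw);
        if ($converted !== false) {
            $raw = $converted;
        }
    }
    return json_decode($raw, true); // associative arrays; null on failure
}
```

With the sample data, byte 0x92 becomes the UTF-8 encoding of ’, so the decoded title keeps the smart quote. The page that echoes it still has to be served as UTF-8 for the character to display.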

The issue is more on the side that generates the JSON file. There you should escape the ' with \'
If you can't modify that part, you can do it like this with addslashes:
$temp = fread($fh, filesize($filePath));
$temp = utf8_encode($temp);
echo $temp . "<br />";
$temp = addslashes($temp);
$json = json_decode($temp);

Can you possibly do a string replace assuming the data is all utf8?
$text = str_replace($find, $replace, $text);
Looking for the characters below?
'â€œ' // left side double smart quote
'â€' // right side double smart quote
'â€˜' // left side single smart quote
'â€™' // right side single smart quote
'â€¦' // ellipsis
'â€”' // em dash
'â€“' // en dash
