Encoding problems with exec - php

I'm fetching some info from SVN for an internal company webpage using exec and svn log commands. The webpage is ISO-8859-1 by requirement, but the SVN server/exec output is UTF-8, and special characters come back as decimal byte escapes rather than in the "normal" encoding.
This makes it impossible to use utf8_decode or a similar function, as far as I've been able to tell, and I can't quite work out how the return value is formatted; otherwise str_replace would have worked as a workaround for the moment. For instance, as far as I can see, ä is represented by ?\195?\164, but I can't find and replace that string in the output, so there is probably something else going on that I'm missing.
My SVN server runs CentOS and the webserver is Debian running Apache, in case the culprit is there somewhere.
Pseudocode
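// Collect the lines printed by svn log
exec('svn log PATH', $output);
// Attempt 1: replace the escape sequences directly
foreach ($output as $data) {
    $data = str_replace(array('?\195?\165', '?\195?\182'), array('å', 'ö'), $data);
    echo $data . '<br>';
}
// Attempt 2: utf8_decode
foreach ($output as $data) {
    $data = utf8_decode($data);
    echo $data . '<br>';
}
// Attempt 3: mb_convert_encoding from UTF-8 to ISO-8859-1
foreach ($output as $data) {
    $data = mb_convert_encoding($data, 'ISO-8859-1', 'UTF-8');
    echo $data . '<br>';
}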
Example string echoed is "Buggfix f?\195?\182r 7.1.34" but should be "Buggfix för 7.1.34"
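A minimal sketch of one possible workaround, assuming every escape has exactly the form shown in the question (a question mark, a backslash, and three decimal digits giving one byte value). The helper name decodeSvnEscapes is hypothetical. It rebuilds the raw UTF-8 bytes first and only then converts to ISO-8859-1:

```php
<?php
// Hypothetical helper: turn "?\NNN" decimal byte escapes back into raw bytes.
// Assumes each escape is "?", a backslash, and exactly three decimal digits.
function decodeSvnEscapes(string $line): string
{
    return preg_replace_callback(
        '/\?\\\\(\d{3})/',
        function (array $m): string {
            return chr((int) $m[1]); // e.g. "195" becomes the byte 0xC3
        },
        $line
    );
}

$raw    = 'Buggfix f?\195?\182r 7.1.34';          // line as returned by exec()
$utf8   = decodeSvnEscapes($raw);                 // escapes are now real UTF-8 bytes
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);    // convert for the ISO-8859-1 page
echo $latin1 . '<br>';
```

Escapes like these typically appear when svn cannot represent a character in the locale it is run under, so it may also be worth checking which locale the web server passes to the command. That is an assumption about this setup, not something the question confirms.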

Related

Turkish/Hungarian/Polish letters not displaying properly

I'm a bit of a novice here, so my apologies if I failed to derive this answer from the earlier posts I read.
I put together a file in PHP. Everything works when the URL to the PHP file is requested, except that some of the Polish and Turkish characters come up as question marks (when the file is saved as UTF-8 or Unicode), or simply lose their diacritics and turn into plain Latin letters (when saved as ANSI).
I edited the file in both WordPad and Notepad.
How can I fix this problem, please?
Thanks.
function array_utf8_encode($dat)
{
if (is_string($dat))
return utf8_encode($dat);
if (!is_array($dat))
return $dat;
$ret = array();
foreach ($dat as $i => $d)
$ret[$i] = array_utf8_encode($d);
return $ret;
}
header('Content-Type: application/json');
// Return the array back to Qualtrics
print json_encode(array_utf8_encode($returnarray));
?>
Sounds like you might need to add the json_encode JSON_UNESCAPED_UNICODE option:
print json_encode(array_utf8_encode($returnarray), JSON_UNESCAPED_UNICODE);
Other json_encode options are at: http://php.net/manual/en/json.constants.php
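To illustrate what the option changes, here is a small sketch with made-up sample data (the key and value are not from the question). The string is written as UTF-8 byte escapes so the example does not depend on how the file itself is saved:

```php
<?php
// Sample data (hypothetical): "Łódź" spelled out as UTF-8 bytes.
$data = array('city' => "\xC5\x81\xC3\xB3d\xC5\xBA");

// Default behaviour: non-ASCII characters are written as \uXXXX escapes.
$escaped = json_encode($data);

// With the flag: the UTF-8 characters are emitted as they are.
$unescaped = json_encode($data, JSON_UNESCAPED_UNICODE);

echo $escaped, PHP_EOL, $unescaped, PHP_EOL;
```

Both forms are valid JSON and decode to the same string. Note also that utf8_encode assumes ISO-8859-1 input, so calling it on data that is already UTF-8 would encode it twice; whether that applies here depends on where $returnarray comes from.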

How to use PHP json_encode without UTF8?

I am working on an old existing website. All pages were encoded in ISO-European, including the MySQL database.
I want to add some AJAX using PHP's json_encode, which only supports UTF8.
Is there a solution to use json_encode without UTF8?
The only thing you need to do is to convert your data to UTF-8 before passing it to json_encode. That function requires UTF-8 data, and unless you want to reimplement json_encode yourself it's a lot easier to go along with its requirements:
function recursivelyConvertToUTF8($data, $from = 'ISO-8859-1') {
if (!is_array($data)) {
return iconv($from, 'UTF-8', $data);
}
return array_map(function ($value) use ($from) {
return recursivelyConvertToUTF8($value, $from);
}, $data);
}
echo json_encode(recursivelyConvertToUTF8($myData));
This is not necessarily a complete solution covering every possible use case, but it should illustrate the idea.
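One use case the function above does not cover is mixed data: iconv turns integers and null into strings. A hedged variant, under the assumption that only string values and string keys need converting (the name toUtf8Recursive and the sample data are made up for illustration):

```php
<?php
// Hypothetical variant: convert strings only, leave other scalars untouched.
function toUtf8Recursive($data, string $from = 'ISO-8859-1')
{
    if (is_string($data)) {
        return iconv($from, 'UTF-8', $data);
    }
    if (is_array($data)) {
        $out = array();
        foreach ($data as $key => $value) {
            // Array keys may contain non-ASCII characters too.
            $k = is_string($key) ? iconv($from, 'UTF-8', $key) : $key;
            $out[$k] = toUtf8Recursive($value, $from);
        }
        return $out;
    }
    return $data; // int, float, bool, null pass through unchanged
}

// Sample ISO-8859-1 data: "Ärger" and "café" as Latin-1 bytes.
$latin1 = array('name' => "\xC4rger", 'count' => 3, 'tags' => array("caf\xE9", null));
$json = json_encode(toUtf8Recursive($latin1));
echo $json;
```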
You can use var_export, utf8_encode and eval to convert an array to UTF-8 recursively. It's a bit of a hack, but something like the following works:
$obj = array("key" => "\xC4rger"); // "Ärger" in Latin1
eval('$utf8_obj = ' . utf8_encode(var_export($obj, TRUE)) . ';');
print json_encode($utf8_obj);
This will print
{"key":"\u00c4rger"}

PHP: Converting xml to array

I have an XML string. That XML string has to be converted into a PHP array in order to be processed by other parts of the software my team is working on.
For the XML -> array conversion I'm using something like this:
if(get_class($xmlString) != 'SimpleXMLElement') {
$xml = simplexml_load_string($xmlString);
}
if(!$xml) {
return false;
}
It works fine - most of the time :) The problem arises when my "xmlString" contains something like this:
<Line0 User="-5" ID="7436194"><Node0 Key="<1" Value="0"></Node0></Line0>
Then simplexml_load_string won't do its job (and I know that's because of the "<" character).
As I can't influence any other part of the code (I can't open up the module that generates the XML string and tell it "encode special characters, please!"), I need your suggestions on how to fix that problem BEFORE calling simplexml_load_string.
Do you have any ideas? I've tried
str_replace("<","&lt;",$xmlString)
but that simply ruins the entire $xmlString... :(
Well, then you can replace the special characters inside the quoted attribute values of $xmlString with their HTML entity counterparts, using htmlspecialchars() and preg_replace_callback().
I know this is not performance friendly, but it does the job :)
<?php
$xmlString = '<Line0 User="-5" ID="7436194"><Node0 Key="<1" Value="0"></Node0></Line0>';
$xmlString = preg_replace_callback('~(?:").*?(?:")~',
function ($matches) {
return htmlspecialchars($matches[0], ENT_NOQUOTES);
},
$xmlString
);
header('Content-Type: text/plain');
echo $xmlString; // you will see the special characters are converted to HTML entities :)
echo PHP_EOL . PHP_EOL; // tidy :)
$xmlobj = simplexml_load_string($xmlString);
var_dump($xmlobj);
?>
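A possible refinement of the approach above, offered as a sketch: if some attribute values already contain entities such as &amp;, htmlspecialchars would encode them a second time unless its fourth argument (double_encode) is false. The helper name escapeAttributeValues and the sample string are hypothetical, and the pattern assumes all attribute values use double quotes and that no element text contains double quotes.

```php
<?php
// Hypothetical helper: escape stray "<", ">" and "&" inside double-quoted
// attribute values without re-encoding entities that are already present.
function escapeAttributeValues(string $xml): string
{
    return preg_replace_callback(
        '~"[^"]*"~',
        function (array $m): string {
            // ENT_NOQUOTES keeps the surrounding quotes intact;
            // the final false disables double encoding.
            return htmlspecialchars($m[0], ENT_NOQUOTES, 'UTF-8', false);
        },
        $xml
    );
}

$broken = '<Node0 Key="<1" Value="a &amp; b"></Node0>';
$fixed  = escapeAttributeValues($broken);
echo $fixed;
```

The repaired string can then be passed to simplexml_load_string as in the answer above.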

Spell checking UTF-8 text with HunSpellChecker class

I'm trying to spell check strings using the HunSpellChecker class (see https://web.archive.org/web/20130311163032/http://www.phpkode.com/source/s/php-spell-checker/php-spell-checker/HunSpellChecker.class.php) and the hunspell spelling engine. The relevant function is copied here:
public function checkSpelling ($text, $locale, $suggestions = true) {
$text = trim($text);
if ($this->textIsHtml == true) {
$text = strtr($text, "\n", ' ');
} elseif ($text == "") {
$this->spellingWarnings[] = array(self::SPELLING_WARNING__TEXT_EMPTY=>"Text empty");
return false;
}
$descspec = array(
0=>array('pipe', 'r'),
1=>array('pipe', 'w'),
2=>array('pipe', 'w')
);
$pipes = array();
$cmd = $this->hunspellPath;
$cmd .= ($this->textIsHtml) ? " -H ":"";
$cmd .= " -d ".dirname(__FILE__)."/dictionaries/hunspell/".$locale;
$process = proc_open($cmd, $descspec, $pipes);
if (!is_resource($process)) {
$this->spellingError[] = array(self::SPELLING_ERROR__INTERNAL_ERROR=>"Hunspell process could not be created.");
return false;
}
fwrite($pipes[0], $text);
fclose($pipes[0]);
$out = '';
while (!feof($pipes[1])) {
$out .= fread($pipes[1], 4096);
}
fclose($pipes[1]);
// check for errors
$err = '';
while (!feof($pipes[2])) {
$err .= fread($pipes[2], 4096);
}
if ($err != '') {
$this->spellingError[] = array(self::SPELLING_ERROR__INTERNAL_ERROR=>"Spell checking error: ".$err);
fclose($pipes[2]);
return false;
}
fclose($pipes[2]);
proc_close($process);
if (strlen($out) === 0) {
$this->spellingError[] = array(self::SPELLING_WARNING__EMPTY_RESULT=>"Empty result");
return false;
}
return $this->parseHunspellOutput(explode("\n", $out), $locale, $suggestions);
}
It works fine with ASCII strings, but I must check strings in different languages, which have accented characters (necessário, segurança, etc) or are in non-Latin alphabets (Greek, Arabic, etc.).
The problem in those cases is that non-ASCII words are segmented incorrectly and the "misspelled" word sent to Hunspell is in fact a substring rather than the full word (necess, seguran).
I tried to track where the issue happens, and I assume it must be in line 072 of the class linked above, when the string is converted into a resource (or somewhere after that). Line 072 contains:
fwrite($pipes[0], $text);
The class is not commented so I'm not really sure what's going on there.
Has anyone dealt with similar issues, or could someone provide any help?
That class is included in file examples/HunspellBased.php (package downloaded from http://titirit.users.phpclasses.org/package/5597-PHP-Check-spelling-of-text-and-get-fix-suggestions.html). I tried to use Enchant, but I didn't manage to make it work at all.
Thank you!
Cheers, Manuel
I think your issue is either HTML entities, or a problem with your dictionary files.
Trying your example with the Portuguese dictionary downloaded from Mozilla add-ons, I can reproduce your problem only when using HTML-encoded entities, i.e. segurança is fine, but seguran&ccedil;a gets segmented as you say.
I don't think this is an issue with the class. All the class does is pipe the text to the command line program. You can eliminate the PHP class as an issue by using the program directly as follows:
Change working directory to the place you have your dictionaries, php-spell-checker/dictionaries/hunspell according to your code above. Prepare a text file containing the accented words you want to test and then do:
hunspell -l -d pt-PT test.text
or for HTML:
hunspell -l -d pt-PT -H test.html
Where pt-PT represents the name of the Portuguese dictionary file pair, namely pt-PT.aff and pt-PT.dic.
No output means no errors. If you get partial words like "necess" only when using HTML entities, then this is your issue. If not, then it's either some other kind of string-encoding issue, or an issue with the dictionary you're using.
I suspect this is a limitation of hunspell's HTML parser - that it ignores HTML tags and other punctuating entities, but won't include and decode a word with an entity in the middle.
The only way around this (assuming HTML is your issue) is to do your own pre-processing before sending HTML to the spellcheck. PHP's html_entity_decode will convert &ccedil; -> ç, so you could try calling that on every string. Ideally, though, you'd parse the HTML DOM and pull out only the text nodes.
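A minimal sketch of that pre-processing step, assuming the text is meant to reach hunspell as UTF-8 (the sample string is made up from the words in the question):

```php
<?php
// Sample input with HTML entities in the middle of words.
$html = 'seguran&ccedil;a necess&aacute;rio';

// Decode named and numeric entities into real UTF-8 characters
// before the text is written to the hunspell pipe.
$text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $text;
```

The decoded $text could then be passed to checkSpelling in place of the original string.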
If HTML is not your issue, check that the strings are valid UTF-8.
Failing that, try another dictionary file. The one I grabbed from Mozilla works fine with plain text. Just rename the .xpi file to .zip (an .xpi is a ZIP archive), expand it using whatever decompression software you have, then copy the .dic and .aff files to your dictionary folder.
I think you can fix this by telling hunspell the input encoding. After these lines:
$cmd = $this->hunspellPath;
$cmd .= ($this->textIsHtml) ? " -H ":"";
$cmd .= " -d ".dirname(__FILE__)."/dictionaries/hunspell/".$locale;
add:
$cmd .= " -i UTF-8";

Encoding Conversion with PHP

I'm trying to do a Latin1 to UTF-8 conversion for WordPress and had no luck with the tutorial posted in the Codex. I came up with this to check the encoding and convert:
while($row = mysql_fetch_assoc($sql)) {
if(!mb_check_encoding($row['post_content'], 'UTF-8')) {
$row = mb_convert_encoding($row['post_content'], 'ISO-8859-1', 'UTF-8');
if(!mb_check_encoding($row['post_content'], 'UTF-8')) {
echo 'Can\'t Be Converted<br/>';
}
else {
echo '<br/>'.$row.'<br/><br/>';
}
}
else {
echo 'UTF-8<br/>';
}
}
This works... sort of. I'm not getting any rows that can't be converted, but I did notice that Panamá becomes Panam.
Am I missing a step? Or am I doing this all wrong?
UPDATE
The data stored within the database is corrupt (á characters are stored). So it's looking more like a find-and-replace job than a conversion. I haven't found any great solutions so far for doing this automagically.
This will help you: http://php.net/manual/en/book.iconv.php
Furthermore, you can set your MySQL connection to utf8 this way:
mysql_set_charset ('utf8',$this->getConnection());
$this->getConnection() in my code returns the connection that was returned by
mysql_connect(MYSQL_SERVER,DB_LOGIN,DB_PASS);
Refer to the PHP documentation for mb_convert_encoding:
string mb_convert_encoding ( string $str , string $to_encoding [, mixed $from_encoding ] )
The second argument is the target encoding and the third is the source encoding, so your code is attempting to convert to ISO-8859-1 from UTF-8!
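A short sketch of the conversion with the arguments in the intended order, using a made-up sample value in place of a database row and assuming the stored bytes really are ISO-8859-1:

```php
<?php
// Sample value (hypothetical): "Panamá" as ISO-8859-1 bytes.
$content = "Panam\xE1";

$utf8 = $content;
if (!mb_check_encoding($content, 'UTF-8')) {
    // Target encoding comes first, source encoding second.
    $utf8 = mb_convert_encoding($content, 'UTF-8', 'ISO-8859-1');
}

echo $utf8;
```

Keeping the result in its own variable also avoids overwriting $row with a string, which the loop in the question does.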
