Spell checking UTF-8 text with HunSpellChecker class - php

I'm trying to spell check strings using the HunSpellChecker class (see https://web.archive.org/web/20130311163032/http://www.phpkode.com/source/s/php-spell-checker/php-spell-checker/HunSpellChecker.class.php) and the hunspell spelling engine. The relevant function is copied here:
public function checkSpelling ($text, $locale, $suggestions = true) {
    $text = trim($text);
    if ($this->textIsHtml == true) {
        $text = strtr($text, "\n", ' ');
    } elseif ($text == "") {
        $this->spellingWarnings[] = array(self::SPELLING_WARNING__TEXT_EMPTY=>"Text empty");
        return false;
    }
    $descspec = array(
        0=>array('pipe', 'r'),
        1=>array('pipe', 'w'),
        2=>array('pipe', 'w')
    );
    $pipes = array();
    $cmd = $this->hunspellPath;
    $cmd .= ($this->textIsHtml) ? " -H ":"";
    $cmd .= " -d ".dirname(__FILE__)."/dictionaries/hunspell/".$locale;
    $process = proc_open($cmd, $descspec, $pipes);
    if (!is_resource($process)) {
        $this->spellingError[] = array(self::SPELLING_ERROR__INTERNAL_ERROR=>"Hunspell process could not be created.");
        return false;
    }
    fwrite($pipes[0], $text);
    fclose($pipes[0]);
    $out = '';
    while (!feof($pipes[1])) {
        $out .= fread($pipes[1], 4096);
    }
    fclose($pipes[1]);
    // check for errors
    $err = '';
    while (!feof($pipes[2])) {
        $err .= fread($pipes[2], 4096);
    }
    if ($err != '') {
        $this->spellingError[] = array(self::SPELLING_ERROR__INTERNAL_ERROR=>"Spell checking error: ".$err);
        fclose($pipes[2]);
        return false;
    }
    fclose($pipes[2]);
    proc_close($process);
    if (strlen($out) === 0) {
        $this->spellingError[] = array(self::SPELLING_WARNING__EMPTY_RESULT=>"Empty result");
        return false;
    }
    return $this->parseHunspellOutput(explode("\n", $out), $locale, $suggestions);
}
It works fine with ASCII strings, but I must check strings in different languages, which have accented characters (necessário, segurança, etc) or are in non-Latin alphabets (Greek, Arabic, etc.).
The problem in those cases is that non-ASCII words are segmented incorrectly and the "misspelled" word sent to Hunspell is in fact a substring rather than the full word (necess, seguran).
I tried to track where the issue happens, and I assume it must be in line 072 of the class linked above, when the string is converted into a resource (or somewhere after that). Line 072 contains:
fwrite($pipes[0], $text);
The class is not commented so I'm not really sure what's going on there.
Has anyone dealt with similar issues, or could someone provide any help?
That class is included in file examples/HunspellBased.php (package downloaded from http://titirit.users.phpclasses.org/package/5597-PHP-Check-spelling-of-text-and-get-fix-suggestions.html). I tried to use Enchant, but I didn't manage to make it work at all.
Thank you!
Cheers, Manuel

I think your issue is either HTML entities, or a problem with your dictionary files.
Trying your example with the Portuguese dictionary downloaded from Mozilla add-ons, I can reproduce your problem only when using HTML-encoded entities, i.e. segurança is fine, but seguran&ccedil;a gets segmented as you say.
I don't think this is an issue with the class. All the class does is pipe the text to the command line program. You can eliminate the PHP class as an issue by using the program directly as follows:
Change working directory to the place you have your dictionaries, php-spell-checker/dictionaries/hunspell according to your code above. Prepare a text file containing the accented words you want to test and then do:
hunspell -l -d pt-PT test.text
or for HTML:
hunspell -l -d pt-PT -H test.html
Here pt-PT is the name of the Portuguese dictionary file pair, namely pt-PT.aff and pt-PT.dic.
No output means no errors. If you get partial words like "necess" only when using HTML entities, then this is your issue. If not, then it's either some other kind of string-encoding issue, or an issue with the dictionary you're using.
I suspect this is a limitation of hunspell's HTML parser - that it ignores HTML tags and other punctuating entities, but won't include and decode a word with an entity in the middle.
The only way around this (assuming HTML is your issue) is to do your own pre-processing before sending HTML to the spellcheck. PHP's html_entity_decode will convert &ccedil; -> ç so you could try calling that on every string. Ideally though you'd parse the HTML DOM and pull out only the text nodes.
If HTML is not your issue, check that the strings are valid UTF-8.
Failing that, try another dictionary file. The one I grabbed from Mozilla works fine with plain text. An .xpi file is a ZIP archive, so just rename it to .zip, expand it using whatever decompression software you have, then copy the .dic and .aff files to your dictionary folder.
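As a rough sketch of the pre-processing suggested above: the helper below decodes entities and then checks that the result is valid UTF-8 before it would be handed to the spell checker. The function name prepareForSpellcheck is made up for illustration and is not part of the HunSpellChecker class; it also assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper (not part of HunSpellChecker): normalise text before
// it is written to hunspell's stdin.
function prepareForSpellcheck(string $text): string
{
    // Turn entities such as &ccedil; back into real characters.
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Refuse input that is not valid UTF-8 rather than piping broken bytes.
    if (!mb_check_encoding($decoded, 'UTF-8')) {
        throw new InvalidArgumentException('Text is not valid UTF-8');
    }

    return $decoded;
}
```

You would call it on the string just before checkSpelling(). It does not strip tags, so for real HTML input, pulling out the text nodes is still the cleaner route.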

I think you can tell hunspell the input encoding explicitly. After:
$cmd = $this->hunspellPath;
$cmd .= ($this->textIsHtml) ? " -H ":"";
$cmd .= " -d ".dirname(__FILE__)."/dictionaries/hunspell/".$locale;
add:
$cmd .= " -i UTF-8";
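For illustration only, here is how those command-building lines might look as a stand-alone function with the -i option included. The function name and parameters are hypothetical, and it assumes $hunspellPath holds just the path to the binary (no extra arguments), which is why it can be passed through escapeshellarg.

```php
<?php
// Hypothetical stand-alone version of the command-building lines above.
function buildHunspellCommand(string $hunspellPath, string $dictionary, bool $isHtml): string
{
    $cmd = escapeshellarg($hunspellPath);
    if ($isHtml) {
        $cmd .= ' -H';                        // treat input as HTML
    }
    $cmd .= ' -d ' . escapeshellarg($dictionary);
    $cmd .= ' -i UTF-8';                      // declare the input encoding
    return $cmd;
}
```

Quoting the path and dictionary name is a precaution in case either contains spaces.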

Related

PHP & JSON : Trying to write JSON to Mysql

I am having an issue where my PHP script opens a file with JSON code and needs to insert it into a MySQL database.
For some reason it only displays some of the output from the JSON.
Here is my code
$json = json_decode(file_get_contents('data.json'), true);
$data = $json;
// VAR's
$system = $data['System'];
$cid_from = $data["From"];
$cid_to = $data['To'];
//DEBUG USAGES
$array = print_r($data, true);
////// THIS ONE WORKS FINE
echo $data["System"];
////// THIS ONE DOESN'T WORK
echo $data["To"];
file_put_contents('output/json-local.txt',$array . "\r\n", FILE_APPEND);
////// BUT HERE IT ACTUALLY WORKS
file_put_contents('output/cli-from.txt',$data['From']. "\r\n", FILE_APPEND);
file_put_contents('output/cli-to.txt',$data['To']. "\r\n", FILE_APPEND);
// file_put_contents('json-sysid-local.txt',$systemid . "\r\n", FILE_APPEND);
Here are the contents of data.json
{"action":"call-data-record",
"System":"48130b83e2232f0ecd366a92d4d1261d",
"PrimaryCallID":"n1bWEfCdHcf@MSS.MTN.CO.ZA-b2b_1",
"CallID":"0440b807@pbx",
"From":"<sip:+27722080036@xxx.co.za>",
"To":"<sip:27102850816@xxx.co.za>",
"Direction":"O",
"RemoteParty":"",
"LocalParty":"",
"TrunkName":"",
"TrunkID":"",
"Cost":"",
"CMC":"",
"Domain":"xxx.co.za",
"TimeStart":"2018-08-14 16:03:21",
"TimeConnected":"",
"TimeEnd":"2018-08-14 16:03:23",
"LocalTime":"2018-08-14 18:03:21",
"DurationHHMMSS":"0:00:00",
"Duration":"0",
"RecordLocation":"",
"RecordUsers":"",
"Type":"hunt",
"Extension":"100",
"ExtensionName":"100",
"IdleDuration":"",
"RingDuration":"2",
"HoldDuration":"0",
"IvrDuration":"0",
"AccountNumber":"400",
"IPAdr":"",
"Quality":"VQSessionReport: CallTerm\r\nLocalMetrics:\r\nCallID:0440b807@pbx\r\nFromID:<sip:27102850816@xxx.co.za>\r\nToID:<sip:+27722080036@xxxx.co.za>;tag=1460166964\r\nx-UserAgent:Vodia-PBX/57.0\r\nx-SIPmetrics:SVA=RG SRD=91\r\nx-SIPterm:SDC=OK SDR=OR\r\n"}
Your "To" data is encapsulated in <>. This causes your browser to interpret it as an HTML tag and not display any content.
You can (should!) escape the special HTML control characters:
echo htmlspecialchars($data["To"]);
See http://php.net/htmlspecialchars
Edit: It doesn't hurt to add this to your other outputs as a precaution as well. If the string doesn't contain such characters, it will simply be returned unchanged. You eliminate possible XSS attack vectors this way.
The browser source clearly shows that PHP writes the "To" value to the output correctly, but the browser interprets it as an opening HTML tag and therefore hides the content.
Wrap your output in the PHP htmlspecialchars() function to see the same output as in the file.
Add - echo "TO : ".htmlspecialchars($data["To"]);
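To make the effect concrete, here is a tiny sketch with a made-up value shaped like the "To" field (the address is hypothetical, not taken from data.json):

```php
<?php
// Hypothetical value wrapped in angle brackets, like the "To" field.
$to = '<sip:100@example.com>';

// The angle brackets become entities, so the browser shows them as text
// instead of treating the value as an unknown tag.
$escaped = htmlspecialchars($to, ENT_QUOTES, 'UTF-8');

echo 'TO : ' . $escaped . "\n"; // prints: TO : &lt;sip:100@example.com&gt;
```

Viewed in a browser, the escaped string renders as the original text with its angle brackets visible.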

Check if a string is valid PHP code

I want to do:
$str = "<? echo 'abcd'; ?>";
if (is_valid_php($str)){
echo "{$str} is valid";
} else {
echo "{$str} is not valid PHP code";
}
Is there a simple way to do the is_valid_php check?
All I can find is online php syntax checkers and command-line ways to check syntax, or php_check_syntax which only works for a file, and I don't want to have to write to a file to check if the string is valid.
I'd rather not use system()/exec() or eval()
Related Question - It's old, so I'm hoping there's something new
Other Related Question - all the options (as far as I could tell) either are no longer supported or use command line or have to check files (not strings)
I don't need a full-fledged compiler or anything. I literally only need to know if the string is valid PHP code.
EDIT: By valid php code, I mean that it can be executed, has no compiler/syntax errors, and should contain only PHP code. It could have runtime errors and still be valid, like $y = 33/0;. And it should contain only PHP... Such as, <div>stuff</div><? echo "str"; ?> would be invalid, but <? echo "str"; ?> and echo "$str"; would be valid
You could pipe the string to php -l and call it using shell_exec:
$str1 = "<?php echo 'hello world';";
$str2 = "<?php echo 'hello world'";
echo isValidPHP($str1) ? "Valid" : "Invalid"; // Valid
echo isValidPHP($str2) ? "Valid" : "Invalid"; // Invalid
function isValidPHP($str) {
return trim(shell_exec("echo " . escapeshellarg($str) . " | php -l")) == "No syntax errors detected in -";
}
Just had another idea... this time using eval but safely:
test_code.php
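$code = "return; ?>" . $_GET['code'];
//eval is safe here because it won't do anything
//since the first thing we do is return
//(the closing tag after "return;" lets the submitted string carry its own <?php tag)
//but we still get parse errors if it's not valid
//If that happens, it will crash the whole script,
//so we need it to be in a separate request
eval($code);
//echo rather than return, so the HTTP caller actually receives a body
echo "1";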
elsewhere in your code:
echo isValidPHP("<?php echo \"It works!\";");
function isValidPHP($code) {
$valid = file_get_contents("http://www.yoursite.com/test_code.php?" . http_build_query(['code' => $code]));
return !!$valid;
}
Edit: Just saw that you would rather refrain from using exec().
I think this is going to be very hard without it though; PHP removed the php_check_syntax function for a reason.
Maybe (and I really am just guessing here at how much effort you want to invest) you could run php -l in a container (Docker, pretty much, these days) and pass the code via an HTTP call to the Docker daemon (a run command against the standard PHP CLI image would do). Then you could use something like the code below without worries (provided you don't give your container any permissions, obviously).
You could use exec and the php commandline like this:
$ret = exec( 'echo "<?php echo \"bla\";" | php -l 2> /dev/null' );
echo strpos( $ret, 'Errors parsing' ) !== false ? "\nerror" : "\nno error";
$ret = exec( 'echo "<?php eccho \"bla\";" | php -l 2> /dev/null' );
echo strpos( $ret, 'Errors parsing' ) !== false ? "\nerror" : "\nno error";
which returns:
no error
error
No file needed thanks to piping the output. Also using the redirection to dev null we get no unwanted output elsewhere.
Still surely a little dirty, but the shortest I can come up with.
Off the top of my head, this is easiest.
$str = "<? echo 'abcd'; ?>";
file_put_contents("/some/temp/path", $str);
exec("php -l /some/temp/path", $output, $result);
if ($result == 0){
echo "{$str} is valid";
} else {
echo "{$str} is not valid PHP code";
}
unlink("/some/temp/path");
The reason I didn't use php_check_syntax is because:
it's deprecated
it actually executes the code
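For what it's worth, on PHP 7.0 or later there may be a way that avoids exec(), eval() and temp files altogether: token_get_all() accepts a TOKEN_PARSE flag, and with it the parser should throw a ParseError on a syntax error without running anything. A minimal sketch, with the caveats that it only catches parse errors (not compile-time errors such as redeclaring a function), that it also accepts HTML outside the PHP tags, and that short tags like <? depend on the short_open_tag setting:

```php
<?php
// Sketch: syntax-check a string without executing it (assumes PHP >= 7.0).
function isValidPhp(string $code): bool
{
    try {
        // TOKEN_PARSE runs the parser over the tokens; invalid syntax
        // surfaces as a ParseError instead of a fatal error.
        token_get_all($code, TOKEN_PARSE);
        return true;
    } catch (ParseError $e) {
        return false;
    }
}
```

The string has to include its opening <?php tag, the same as with php -l.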

Detect Encoding and Convert Everything to UTF-8 with PHP

I want to extract various data from URLs and have it converted to UTF-8 no matter which encoding the original page uses (or at least have it work for most source encodings).
So, after reading through many discussions and answers, I finally came up with the following code, with which I am parsing the HTML data twice (once for detecting the encoding and a second time for getting the actual data). This is working at least on all the checked URLs. But I think that the code is poorly written.
Can anyone let me know if there are any better alternatives to do the same or if I need any improvements on the code?
<?php
header('Content-Type: text/html; charset=utf-8');
require_once 'curl.php';
require_once 'curl_response.php';
$curl = new Curl;
$url = "http://" . $_GET['domain'];
$curl_response = $curl->get($url);
$header_content_type = $curl_response->headers['Content-Type'];
$dom_doc = new DOMDocument();
libxml_use_internal_errors(TRUE);
$dom_doc->loadHTML('<?xml encoding="utf-8" ?>' . $curl_response);
libxml_use_internal_errors(FALSE);
$metas = $dom_doc->getElementsByTagName('meta');
foreach ($metas as $meta) {
if (strtolower($meta->getAttribute('http-equiv')) == 'content-type') {
$meta_content_type = $meta->getAttribute('content');
}
if ($meta->getAttribute('charset') != '') {
$html5_charset = $meta->getAttribute('charset');
}
}
if (preg_match('/charset=(.+)/', $header_content_type, $m)) {
$charset = $m[1];
} elseif (preg_match('/charset=(.+)/', $meta_content_type, $m)) {
$charset = $m[1];
} elseif (!empty($html5_charset)) {
$charset = $html5_charset;
} elseif (preg_match('/encoding=(.+)/', $curl_response, $m)) {
$charset = $m[1];
} else {
// browser default charset
// $charset = 'ISO-8859-1';
}
if (!empty($charset) && $charset != "utf-8") {
$tmp = iconv($charset,'utf-8', $curl_response);
libxml_use_internal_errors(TRUE);
$dom_doc->loadHTML('<?xml encoding="utf-8" ?>' . $tmp);
libxml_use_internal_errors(FALSE);
}
$page_title = $dom_doc->getElementsByTagName('title')->item(0)->nodeValue;
$metas = $dom_doc->getElementsByTagName('meta');
foreach ($metas as $meta) {
if (strtolower($meta->getAttribute('name')) == 'description') {
$meta_description = $meta->getAttribute('content');
}
if (strtolower($meta->getAttribute('name')) == 'keywords') {
$meta_tags = $meta->getAttribute('content');
}
}
print $charset;
print "<hr>";
print $page_title;
print "<hr>";
print $meta_description;
print "<hr>";
print $meta_tags;
print "<hr>";
print "Memory Peak Usages: " . memory_get_peak_usage()/1024/1024 . " MB";
?>
Your question is too open-ended, and I've voted to close it. However, I will still provide a stub of an answer that will, hopefully, point you in the right direction.
At the moment, you are checking user-defined input for the charset. This is a very, very, very bad move, for various reasons:
Most webmasters on small sites will just send header("Content-type: text/html; charset=utf-8") because they've heard it is good practice, without actually encoding their content as UTF-8. Not taking this into account will lead to mangled UTF-8 output.
Some webmasters do the opposite: they do not set a header, and their webserver outputs ISO-8859-1 headers despite a UTF-8 encoding. Visibly on a page, this does not matter, but it matters for DOMDocument (I've had this issue recently).
Double UTF-8 encoding with iconv is never fun.
I'd strongly advise using a utility to decode UTF-8 until there are no more entities within the UTF-8 extended range of characters and then encoding once rather than relying on iconv or multibyte encoding. The reason is simple: these can get it wrong. You can also set an error handler to parse DOMDocument errors in order to catch and redirect the loadXML "failed due to malformed XML" errors, which will not be related to your character encoding at all. Basically, the key to your problem is to not blindly do stuff.
If you'd like good targets where you need to worry about UTF-8, parse the home page of Google Play. They send out malformed replies (which is what initially forced me to go through the UTF-8-decode-until-nothing-is-in-the-range approach). It will also show you that DOMDocument can fail due to a wide variety of reasons - not just charset - and that you need to follow the errors to deal with them.
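One simple way to avoid converting blindly, sketched below, is to test whether the body is already valid UTF-8 before trusting whatever charset the page declares. This is not the decode-until-clean utility described above, just a cheaper guard against double encoding. The function name toUtf8 and the ISO-8859-1 fallback are assumptions for illustration, and it relies on the mbstring extension.

```php
<?php
// Hypothetical guard: convert only when the bytes are not already UTF-8.
function toUtf8(string $body, ?string $declaredCharset): string
{
    // If the bytes already form valid UTF-8, leave them alone, whatever
    // the header or meta tag claims.
    if (mb_check_encoding($body, 'UTF-8')) {
        return $body;
    }

    // Otherwise fall back to the declared charset, or to ISO-8859-1
    // (an assumed default) when nothing was declared.
    $from = $declaredCharset ?? 'ISO-8859-1';

    return mb_convert_encoding($body, 'UTF-8', $from);
}
```

An unknown charset label will still make mb_convert_encoding fail, so a real version would want to validate $declaredCharset first.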
Other performance pointers outside of that big encoding snafu include:
Fragmenting your code into functions that return results. You've got a lot of repetition in there - learn to use functions to stop having to explicitly write the same core logic multiple times.
This:
if (preg_match('/charset=(.+)/', $header_content_type, $m)) {
$charset = $m[1];
} elseif (preg_match('/charset=(.+)/', $meta_content_type, $m)) {
is horrible. You can easily replace it with a strpos call, which will speed this particular set of ifs by about 5-10x.
$metas = $dom_doc->getElementsByTagName('meta'); - you're aware that DOMDocument will go through your entire DOM when you use this method, right? Consider restricting the XPath query to just the head tag (which is always the first child of html, which is the document root). XPath indexes start at 1, so that is /html/head[1].
In regard to performance you should be using unset() when you're done with variables or values even if you're going to reset their values, but not if you need the value further down your script. PHP does not necessarily hand that memory back to the operating system, but it will reuse the memory released by unset() for later allocations.
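Here is roughly what the string-function version of the charset lookup could look like. The helper name is hypothetical, and I have not measured the speed-up myself, so treat this as a sketch of the idea rather than a benchmarked replacement.

```php
<?php
// Hypothetical helper: pull the charset label out of a Content-Type value
// with plain string functions instead of a regular expression.
function charsetFromContentType(?string $contentType): ?string
{
    if ($contentType === null) {
        return null;
    }

    $pos = stripos($contentType, 'charset=');
    if ($pos === false) {
        return null;
    }

    // Everything after "charset=", up to the next parameter separator.
    $value = substr($contentType, $pos + strlen('charset='));
    $value = explode(';', (string) $value, 2)[0];

    // Strip whitespace and optional quotes, and normalise the case.
    $value = strtolower(trim($value, " \t\r\n\"'"));

    return $value === '' ? null : $value;
}
```

Because the same function works for the HTTP header and for the meta tag's content attribute, it also removes the repeated branches.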
Another thing you could do is take huge chunks of that code and split it into functions that return resultant values. Remember that function variables and memory are automatically released after execution unless you're working with global variables.
Those will help performance and memory utilization.

Modifying a PHP file from a Bash script

I need to perform some modifications to PHP files (PHTML files to be exact, but they are still valid PHP files), from a Bash script. My original thought was to use sed or similar utility with regex, but reading some of the replies here for other HTML parsing questions it seems that there might be a better solution.
The problem I was facing with the regex was a lack of support for detecting if the string I wanted to match: (src|href|action)=["']/ was in <?php ?> tags or not, so that I could then either perform string concatenation if the match was in PHP tags, or add in new PHP tags should it not be. For example:
(1) <img id="icon-loader-small" src="/css/images/loader-small.gif" style="vertical-align:middle; display:none;"/>
(2) <li><span class="name"><?php echo $this->loggedInAs()?></span> | Logout</li>
(3) <?php echo ($watched_dir->getExistsFlag())?"":"<span class='ui-icon-alert'><img src='/css/images/warning-icon.png'></span>"?><span><?php echo $watched_dir->getDirectory();?></span></span><span class="ui-icon ui-icon-close"></span>
(EDIT: 4) <form method="post" action="/Preference/stream-setting" enctype="application/x-www-form-urlencoded" onsubmit="return confirm('<?php echo $this->confirm_pypo_restart_text ?>');">
In (1) there a src="/css, and as it is not in PHP tags I want that to become src="<?php echo $baseUrl?>/css. In (2), there is a PHP tag but it is not around the href="/Login, so it also becomes href="<?php echo $baseUrl?>/Login.
Unfortunately, (3) has src='/css but inside the PHP tags (it is an echoed string). It is also quoted by " in the PHP code, so the modification needs to pick up on that too. The final result would look something like: src='".$baseUrl."/css.
All the other modifications to my HTML and PHP files have been done using a regex (I know, I know...). If regexes could support matching everything except a certain pattern, like [^(<\?php)(\?>)]* then I would be flying through this part. Unfortunately it seems that this is Type 2 grammar territory. So - what should I use?
Ideally it needs to be installed by default with the GNU suite, but other tools like PHP itself or other interpreters are fine too, just not preferred. Of course, if someone could structure a regex that would work on the above examples, then that would be excellent.
EDIT: (4) is the nasty match, where most regexes will fail.
The way I solved this problem was by separating my file into sections that were encapsulated by <?php ?> tags. The script kept track of the 'context' it was currently in - by default set to html but switching to php when it hit those tags. An operation (not necessarily a regex) is then performed on that section, which is then appended to the output buffer. When the file is completely processed the output buffer is written back into the file.
I attempted to do this with sed, but I faced the problem of not being able to control where newlines would be printed. The context based logic was also hardcoded meaning it would be tedious to add in a new context, like ASP.NET support for example. My current solution is written in Perl and mitigates both problems, although I am having a bit of trouble getting my regex to actually do something, but this might just be me coding my regex incorrectly.
Script is as follows:
#!/usr/bin/perl -w
use strict;
#Prototypes
sub readFile(;\$);
sub writeFile($);
#Constants
my $file;
my $outputBuffer;
my $holdBuffer;
# Regexes should have s and g modifiers
# Pattern is in $_
my %contexts = (
html => {
operation => ''
},
php => {
openTag => '<\?php |<\? ', closeTag => '\?>', operation => ''
},
js => {
openTag => '<script>', closeTag => '<\/script>', operation => ''
}
);
my $currentContext = 'html';
my $combinedOpenTags;
#Initialisation
unshift(@ARGV, '-') unless @ARGV;
foreach my $key (keys %contexts) {
if($contexts{$key}{openTag}) {
if($combinedOpenTags) {
$combinedOpenTags .= "|".$contexts{$key}{openTag};
} else {
$combinedOpenTags = $contexts{$key}{openTag};
}
}
}
#Main loop
while(readFile($holdBuffer)) {
$outputBuffer = '';
while($holdBuffer) {
$currentContext = "html";
foreach my $key (keys %contexts) {
if(!$contexts{$key}{openTag}) {
next;
}
if($holdBuffer =~ /\A($contexts{$key}{openTag})/) {
$currentContext = $key;
last;
}
}
if($currentContext eq "html") {
$holdBuffer =~ s/\A(.*?)($combinedOpenTags|\z)/$2/s;
$_ = $1;
} else {
$holdBuffer =~ s/\A(.*?$contexts{$currentContext}{closeTag}|\z)//s;
$_ = $1;
}
eval($contexts{$currentContext}{operation});
$outputBuffer .= $_;
}
writeFile($outputBuffer);
}
# readFile: read file into $_
sub readFile(;\$) {
my $argref = @_ ? shift() : \$_;
return 0 unless @ARGV;
$file = shift(@ARGV);
open(WORKFILE, "<$file") || die("$0: can't open $file for reading ($!)\n");
local $/;
$$argref = <WORKFILE>;
close(WORKFILE);
return 1;
}
# writeFile: write $_[0] to file
sub writeFile($) {
open(WORKFILE, ">$file") || die("$0: can't open $file for writing ($!)\n");
print WORKFILE $_[0];
close(WORKFILE);
}
I hope that this can be used and modified by others to suit their needs.
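If running PHP itself from the Bash script is acceptable, the same html/php context split can probably be had for free from PHP's own tokenizer: token_get_all() reports everything outside the PHP tags as T_INLINE_HTML tokens. The sketch below rewrites only those tokens, which covers cases (1), (2) and (4); it deliberately leaves strings inside PHP code, as in case (3), untouched. The function name and the $prefix parameter are made up for illustration.

```php
<?php
// Sketch: prefix root-relative src/href/action URLs, but only in the parts
// of a PHTML file that lie outside the PHP tags.
function prefixInlineHtmlUrls(string $source, string $prefix): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            $out .= $token;            // single-character token such as ";"
            continue;
        }
        [$id, $text] = $token;
        if ($id === T_INLINE_HTML) {
            // A callback avoids "$" in $prefix being read as a backreference.
            $text = preg_replace_callback(
                '/\b(src|href|action)=(["\'])\//',
                function (array $m) use ($prefix): string {
                    return $m[1] . '=' . $m[2] . $prefix . '/';
                },
                $text
            );
        }
        $out .= $text;
    }
    return $out;
}
```

From Bash it could be driven by a small PHP CLI script that reads a file, calls the function with '<?php echo $baseUrl?>' as the prefix, and writes the result back.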

Encoding problems with exec

I'm fetching some info from SVN on an internal webpage for our company using exec and SVN log commands. The webpage is ISO-8859-1 by requirement, but the SVN server/exec output is UTF-8, and special characters come back as decimal escapes rather than in the "normal" encoding.
This makes it impossible to use utf8_decode or a similar function as far as I've been able to tell, and I can't really get to grips with exactly how the return value is formatted; otherwise str_replace would have worked as a workaround for the moment. For instance, as far as I can see ä is represented by ?\195?\164, but I can't find and replace that string in the output, so there are probably some other things going on that I'm missing.
My SVN server runs CentOS and the webserver is Debian running Apache, in case the culprit is there somewhere.
Pseudocode
exec('svn log PATH', $output);
// Attempt 1: replace the escaped sequences directly
foreach ($output as $data) {
    $data = str_replace(array('?\195?\165', '?\195?\182'), array('å','ö'), $data);
    echo $data . '<br>';
}
// Attempt 2: utf8_decode
foreach ($output as $data) {
    $data = utf8_decode($data);
    echo $data . '<br>';
}
// Attempt 3: mb_convert_encoding
foreach ($output as $data) {
    $data = mb_convert_encoding($data, 'ISO-8859-1', 'UTF-8');
    echo $data . '<br>';
}
Example string echoed is "Buggfix f?\195?\182r 7.1.34" but should be "Buggfix för 7.1.34"
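A possible workaround, sketched under the assumption that each escape really is a literal question mark, a backslash and three decimal digits standing for one byte (as the example string suggests): turn the escapes back into raw bytes, then convert the resulting UTF-8 to ISO-8859-1. The function name is made up, and mbstring is assumed to be available. It may also be worth checking which locale the svn process runs under, since escapes like these tend to show up when the locale cannot represent the characters.

```php
<?php
// Hypothetical helper: turn "?\195?\182"-style escapes back into raw bytes.
function decodeSvnEscapes(string $line): string
{
    return preg_replace_callback(
        '/\?\\\\(\d{3})/',                 // "?", a backslash, three digits
        function (array $m): string {
            return chr((int) $m[1]);       // decimal byte value -> byte
        },
        $line
    );
}

$utf8  = decodeSvnEscapes('Buggfix f?\195?\182r 7.1.34');
$latin = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
```

$latin can then be echoed on the ISO-8859-1 page; lines without escapes pass through unchanged.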
