Issue with accents and encoding with Bing Spell Check API v7

I am trying to work with Bing's Spell Check API in PHP, but I'm having an issue where accents and other special characters aren't decoded properly, which creates many errors that aren't in the original text and throws off the offsets.
My implementation is quite simple - it's heavily based on the example they give in their documentation. I'm not sure if I am supposed to be doing something differently or if it is an issue on their side with how they decode those special characters (which seems highly unlikely - me messing something up is much more probable..!)
Here's the code:
$host = 'https://api.cognitive.microsoft.com';
$path = '/bing/v7.0/spellcheck?';
$data = array (
'mkt' => $lang,
'mode' => 'proof',
'text' => urlencode($text)
);
$encodedData = http_build_query($data);
$key = 'subscription key redacted for obvious reasons';
$headers = "Content-type: application/x-www-form-urlencoded\r\n" .
"Ocp-Apim-Subscription-Key: $key\r\n";
if (isset($_SERVER['REMOTE_ADDR']))
$headers .= "X-MSEdge-ClientIP: " . $_SERVER['REMOTE_ADDR'] . "\r\n";
$options = array (
'http' => array (
'header' => $headers,
'method' => 'POST',
'content' => $encodedData
)
);
$context = stream_context_create ($options);
$result = file_get_contents ($host . $path, false, $context);
if ($result === FALSE) {
# Handle error
}
$decodedResult = json_decode($result, true);
If, for example, I try to spell check the following string:
d’institutions
$encodedData becomes the following:
mkt=fr-CA&method=proof&text=d%25E2%2580%2599institutions
And the results I get from the API are the following:
array(2) {
["_type"]=>
string(10) "SpellCheck"
["flaggedTokens"]=>
array(1) {
[0]=>
array(4) {
["offset"]=>
int(8)
["token"]=>
string(14) "99institutions"
["type"]=>
string(12) "UnknownToken"
["suggestions"]=>
array(2) {
[0]=>
array(2) {
["suggestion"]=>
string(15) "99 institutions"
["score"]=>
float(0.93191315174102)
}
[1]=>
array(2) {
["suggestion"]=>
string(14) "99 institution"
["score"]=>
float(0.6518044080768)
}
}
}
}
}
As you can see, the decoding seems to be problematic: the % gets encoded twice and is apparently decoded only once. Now, if I remove the urlencode() when setting the value of 'text' in $data, it works fine for the apostrophe, but it doesn't work with accents. For example, the following string:
Responsabilité
is interpreted by the API as
ResponsabilitÃ©
which returns an error.
This could very well be something simple that I'm overlooking, but I've been struggling with this for quite a while and would appreciate any help I can get.
Thanks,
- Émile
[ Edit ] Well, as always... when in doubt, assume you're wrong. The API suggested replacing all of the accented letters with unaccented ones because, even though the specified language was French, it still gave suggestions in English instead of returning an empty array. As for the accents that didn't seem to be decoded, well... I was var_dump-ing that data without any doctype set, so of course it was displayed with the wrong encoding. Sorry about that - in the end, simply removing the urlencode() does the trick!

As per the docs:
The API supports two proofing modes, Proof and Spell. The default mode is Proof. The Proof spelling mode provides the most comprehensive checks, but it's available only in the en-US (English-United States) market. For all other markets, set the mode query parameter to Spell. The Spell mode finds most spelling mistakes but doesn't find some of the grammar errors that Proof catches (for example, capitalization and repeated words).
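Putting the edit and the docs quote together, a minimal sketch of the request body could look like the following. The helper name buildSpellCheckBody() is hypothetical, and the sketch assumes the input text is already UTF-8. It passes the raw text to http_build_query(), which percent-encodes each value once, and it only selects proof mode for en-US.

```php
<?php
// Hypothetical helper: builds the POST body for the spell check request.
function buildSpellCheckBody(string $text, string $market): string
{
    // Per the docs quoted above, Proof is only available in the en-US market.
    $mode = ($market === 'en-US') ? 'proof' : 'spell';

    // http_build_query() percent-encodes each value once, so $text is passed raw.
    return http_build_query([
        'mkt'  => $market,
        'mode' => $mode,
        'text' => $text,
    ]);
}

// "\u{2019}" is the typographic apostrophe from the question (bytes E2 80 99).
echo buildSpellCheckBody("d\u{2019}institutions", 'fr-CA'), PHP_EOL;
echo buildSpellCheckBody('Responsabilité', 'fr-CA'), PHP_EOL;

// Calling urlencode() first makes http_build_query() encode every "%" again as "%25":
echo http_build_query(['text' => urlencode("d\u{2019}institutions")]), PHP_EOL;
```

The last line reproduces the double-encoded text value from the question (d%25E2%2580%2599institutions), which is why dropping urlencode() is enough.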

Related

Puphpeteer - Get text and href-attribute from link

I am using "@nesk/puphpeteer": "^2.0.0" (see the GitHub repo) and want to get the text and the href attribute from a link.
I tried the following:
<?php
require_once '../vendor/autoload.php';
use Nesk\Puphpeteer\Puppeteer;
use Nesk\Rialto\Data\JsFunction;
$debug = true;
$puppeteer = new Puppeteer([
'read_timeout' => 100,
'debug' => $debug,
]);
$browser = $puppeteer->launch([
'headless' => !$debug,
'ignoreHTTPSErrors' => true,
]);
$page = $browser->newPage();
$page->goto('http://example.python-scraping.com/');
//get text and link
$links = $page->querySelectorXPath('//*[@id="results"]/table/tbody/tr/td/div/a', JsFunction::createWithParameters(['node'])
->body('return node.textContent;'));
// iterate over links and print each link and its text
// get single text
$singleText = $page->querySelectorXPath('//*[@id="pagination"]/a', JsFunction::createWithParameters(['node'])
->body('return node.textContent;'));
$browser->close();
When I run the above script I get the nodes from the page, but I cannot access their attributes or text.
Any suggestions on how to do this?
I appreciate your replies!

querySelectorXPath returns an array of ElementHandle objects. Also, querySelectorXPath does not accept a callback function.
First, get the ElementHandle for every matching node:
$links = $page->querySelectorXPath('//*[@id="results"]/table/tbody/tr/td/div/a');
Then loop over the links to access each node's attributes or text:
foreach ($links as $link) {
// for the text
$text = $link->evaluate(JsFunction::createWithParameters(['node'])
->body('return node.innerText;'));
// for the href (stored in its own variable so $link is not overwritten)
$href = $link->evaluate(JsFunction::createWithParameters(['node'])
->body('return node.href;'));
}

Disclaimer: This is just an intermediate answer. I will update it once I have more specific requests on the HTML attributes or other values to be retrieved.
tl;dr: The composer package nesk/puphpeteer is really just a wrapper around the underlying Node.js implementation of Puppeteer. Thus, accessing data and structures has to be similar to the JavaScript counterparts.
Codeception (headless) or symfony/dom-crawler (raw markup) might be better and more mature alternatives.
Anyway, let's take the example from above and go through it step by step:
$links = $page->querySelectorXPath(
'//*[@id="results"]/table/tbody/tr/td/div/a',
JsFunction::createWithParameters(['node'])->body('return node.textContent;')
);
The XPath query ($x() in Puppeteer) returns an array of ElementHandle items.
To access node.textContent, the corresponding data is fetched via ElementHandle.getProperty(prop).
Exporting a scalar value to PHP is then done via ElementHandle.jsonValue().
Thus, after that we would have something like this:
$links = $page->querySelectorXPath(
'//*[@id="results"]/table/tbody/tr/td/div/a',
JsFunction::createWithParameters(['node'])->body('return node.textContent;')
);
/** @var \Nesk\Puphpeteer\Resources\ElementHandle $link */
foreach ($links as $link) {
var_dump($link->getProperty('textContent')->jsonValue());
}
Which outputs the following raw data (as retrieved from http://example.python-scraping.com/):
string(12) " Afghanistan"
string(14) " Aland Islands"
string(8) " Albania"
string(8) " Algeria"
string(15) " American Samoa"
string(8) " Andorra"
string(7) " Angola"
string(9) " Anguilla"
string(11) " Antarctica"
string(20) " Antigua and Barbuda"

How to sort Japanese like Excel

I want to sort Japanese words (kanji) the way the sort feature in Excel does.
I have tried many ways to sort Japanese text in PHP, but the result is never 100% the same as the result in Excel.
First, I tried converting kanji to katakana with this library (https://osdn.net/projects/igo-php/), but in some cases the order differs from Excel's.
I want to sort these words in ascending order:
けやきの家
高森台病院
みのりの里
My Result :
けやきの家
高森台病院
みのりの里
Excel Result:
けやきの家
みのりの里
高森台病院
Second, I tried another approach using this function:
mb_convert_kana($text, "KVc", "utf-8");
The sort result is correct for the words above, but some cases are still wrong:
米田病院
米田病院
高森台病院
My result :
米田病院
米田病院
高森台病院
Excel Result:
高森台病院
米田病院
米田病院
Do you have any ideas about this? (Sorry for my English.) Thank you.

Firstly, Japanese kanji have no inherent sort order. You can sort them by code point, but that order carries no meaning.
Using Igo (or any other morphological analysis library) sounds like a good solution, though it cannot be perfect. Your first sort result looks fine to me. Why do you want the words sorted in Excel's order?
In Excel, if a cell retains the phonetic reading that the user typed in the Japanese IME (Input Method Editor), that reading is used for sorting. Since not every cell is typed manually through an IME, some cells may have no information about how their kanji are read, so the result of sorting kanji in Excel can be quite unpredictable. (When sorting really matters, we usually add a separate yomigana field, in either hiragana or katakana, and sort by that column.)
The second method, mb_convert_kana(), is beside the point. That function normalizes hiragana/katakana, because for historical reasons there are two sets of kana characters (full-width and half-width). Applying it to your Japanese text changes only the kana parts. If it produced the order you expected, that was a coincidence.
You first need to define which Excel Japanese sort order your customer requires. I will be happy to help once that is clear.
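The separate yomigana column mentioned above could be sketched roughly as follows. The field names, the helper sortByYomigana(), and the hiragana readings are assumptions made for illustration only. The sketch compares the readings byte-wise with strcmp(), which for plain hiragana follows code point order; a locale-aware Collator would be the more robust choice where the intl extension is available.

```php
<?php
// Hypothetical rows: each name carries a hand-entered hiragana reading (yomigana).
// The readings are assumed for illustration and may not be the real ones.
$facilities = [
    ['name' => '高森台病院', 'yomi' => 'たかもりだいびょういん'],
    ['name' => 'みのりの里', 'yomi' => 'みのりのさと'],
    ['name' => 'けやきの家', 'yomi' => 'けやきのいえ'],
];

// Sorts rows by their reading instead of by the kanji themselves.
function sortByYomigana(array $rows): array
{
    usort($rows, function (array $a, array $b): int {
        // Byte-wise comparison of UTF-8 hiragana follows code point order.
        return strcmp($a['yomi'], $b['yomi']);
    });
    return $rows;
}

$sorted = array_column(sortByYomigana($facilities), 'name');
print_r($sorted);
```

With these assumed readings the order is けやきの家, 高森台病院, みのりの里, because the kanji name is ordered by its reading rather than pushed after all the kana.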
[Update]
As the OP commented, mb_convert_kana() was used to sort mixed hiragana/katakana. For that purpose, I suggest using the php_intl Collator. For example:
<?php
// demo: Japanese(kana) sort by php_intl Collator
if (version_compare(PHP_VERSION, '5.3.0', '<')) {
exit ('php_intl extension is available on PHP 5.3.0 or later.');
}
if (!class_exists('Collator')) {
exit ('You need to install php_intl extension.');
}
$collator = new Collator('ja_JP');
$textArray = [
'ｶｷｸｹｺ',
'日本語',
'アアト',
'Alphabet',
'アイランド',
'はひふへほ',
'あいうえお',
'漢字',
'たほいや',
'さしみじょうゆ',
'Roma',
'ラリルレロ',
'アート',
];
$result = $collator->sort($textArray);
if ($result === false) {
echo "sort failed" . PHP_EOL;
exit();
}
var_dump($textArray);
This sorts hiragana/katakana mixed texts array. Results are here.
array(13) {
[0]=>
string(8) "Alphabet"
[1]=>
string(4) "Roma"
[2]=>
string(9) "アート"
[3]=>
string(9) "アアト"
[4]=>
string(15) "あいうえお"
[5]=>
string(15) "アイランド"
[6]=>
string(15) "ｶｷｸｹｺ"
[7]=>
string(21) "さしみじょうゆ"
[8]=>
string(12) "たほいや"
[9]=>
string(15) "はひふへほ"
[10]=>
string(15) "ラリルレロ"
[11]=>
string(6) "漢字"
[12]=>
string(9) "日本語"
}
You don't need to normalize them yourself. Both PHP (with the php_intl extension) and databases (such as MySQL) know how to collate text in many languages, so you do not need to write that logic.
Note that this does not solve the original issue, kanji sorting.

Laravel: sort romanized names in hiragana order with a custom function
Note: $modals is a collection of Laravel models (the result of get()).
$alphabets lists the romaji syllables in hiragana order.
Source: https://gist.github.com/mdzhang/899a427eb3d0181cd762
public static function orderByHiranagana ($modals,$column){
$outArray = array();
$alphabets = array("a","i","u","e","o","ka","ki","ku","ke","ko","sa","shi","su","se","so","ta","chi","tsu","te","to","na","ni","nu","ne","no","ha","hi","fu","he","ho","ma","mi","mu","me","mo","ya","yu","yo","ra","ri","ru","re","ro","wa","wo","n","ga","gi","gu","ge","go","za","ji","zu","ze","zo","da","ji","zu","de","do","ba","bi","bu","be","bo","pa","pi","pu","pe","po","(pause)","kya","kyu","kyo","sha","shu","sho","cha","chu","cho","nya","nyu","nyo","hya","hyu","hyo","mya","myu","myo","rya","ryu","ryo","gya","gyu","gyo","ja","ju","jo","bya","byu","byo","pya","pyu","pyo","yi","ye","va","vi","vu","ve","vo","vya","vyu","vyo","she","je","che","swa","swi","swu","swe","swo","sya","syu","syo","si","zwa","zwi","zwu","zwe","zwo","zya","zyu","zyo","zi","tsa","tsi","tse","tso","tha","ti","thu","tye","tho","tya","tyu","tyo","dha","di","dhu","dye","dho","dya","dyu","dyo","twa","twi","tu","twe","two","dwa","dwi","du","dwe","dwo","fa","fi","hu","fe","fo","fya","fyu","fyo","ryi","rye","(wa)","wi","(wu)","we","wo","wya","wyu","wyo","kwa","kwi","kwu","kwe","kwo","gwa","gwi","gwu","gwe","gwe","mwa","mwi","mwu","mwe","mwo");
$existIds = array();
foreach ($alphabets as $alpha){
foreach ($modals as $modal) {
if($alpha == strtolower(substr($modal->$column, 0, strlen($alpha))) && !in_array($modal->id,$existIds)) {
array_push($outArray,$modal);
array_push($existIds,$modal->id);
}
}
}
return $outArray;
}
Call like this :
$students = Students::get();
$students = CommonHelper::orderByHiranagana($students,'lastname');

PHP Mcrypt_decrypt decrypt only parts of the original string

I have a weird problem regarding passing an encrypted string through url. I'm using base64 encryptions from mcrypt() for encryptHTML() and decryptHTML().
I have this piece of code to encrypt:
$link_string = http_build_query(array('index_number'=>30843854, 'extra_attendence_id'=>27982423, 'target_temporary_id'=>378492085, 'date'=>'2016-05-06', 'action'=>'OUT', 'target_id'=>390234), '', '&');
$link_string = encryptHTML($link_string);
then I passed it through this url:
'localhost/website/controller/action/'.$link_string
then I decrypted it with this piece of code:
$id = $this->request->param('id');
$id = decryptHTML($id);
parse_str($id, $arr_id2);
var_dump($arr_id2);
I will get these in return, as expected:
array(6) { ["index_number"]=> string(8) "30843854" ["extra_attendence_id"]=> string(8) "27982423" ["target_temporary_id"]=> string(9) "378492085" ["date"]=> string(10) "2016-05-06" ["action"]=> string(3) "OUT" ["target_id"]=> string(6) "390234" }
The next case is when I still want the encrypted link but I need to attach some other value from DOM element in the page, so I tried to
'localhost/website/controller/action/encrypt='.$link_string.'&DOMvalue=10000'
then I modified the decryption with this piece of code:
$id = $this->request->param('id');
parse_str($id, $arr_id2);
$the_DOMValue = $arr_id2['DOMvalue'];
$id = decryptHTML($arr_id2['crypted']);
parse_str($id, $arr_id);
var_dump($the_DOMValue); echo "<br>";
var_dump($arr_id);
But then, I get these in return, to my surprise:
string(5) "10000"
array(3) { ["index_number"]=> string(13) "58_2016-04-26" ["extra_attendence_id"]=> string(1) "0" ["target_t"]=> string(0) "" }
My original string was cut short! Note that the DOMvalue is fine.
Then I checked, right before decryption in both cases, whether the given encrypted string is different.
In the first case of decryptHTML:
$id = $this->request->param('id');
var_dump($id);
$id = decryptHTML($id);
returns:
string(224) "zCQnh-rNP2R7h4UHyV5Dm5zp494DIIku5LWN51yYGMXBaHf0gJgEDw8UCuHRZxr-CkjkevHQ70kOPnSBQ9CJP6lZrFone-nDMDJhYlL8330wz+zud8-3tSWvdOLB7je5D-22aX4OrE3zlBYZZZtI-rMT73H0JGIRzZge2GzcZGLwS7Rj+GL5Ym-ET6JEHDShST4etgcQaEYXml-+BZ2+0BQKvubZEBOB"
In the second case of decryptHTML:
$id = $this->request->param('id');
parse_str($id, $arr_id2);
$the_DOMValue = $arr_id2['DOMvalue'];
var_dump($arr_id2['crypted']);
$id = decryptHTML($arr_id2['crypted']);
returns:
string(224) "zCQnh-rNP2R7h4UHyV5Dm5zp494DIIku5LWN51yYGMXBaHf0gJgEDw8UCuHRZxr-CkjkevHQ70kOPnSBQ9CJP6lZrFone-nDMDJhYlL8330wz zud8-3tSWvdOLB7je5D-22aX4OrE3zlBYZZZtI-rMT73H0JGIRzZge2GzcZGLwS7Rj GL5Ym-ET6JEHDShST4etgcQaEYXml- BZ2 0BQKvubZEBOB"
They look exactly the same to me, but strangely they were decrypted differently. I of course used the same functions to decrypt in both cases...
Can anybody shed some light on this?

"passing an encrypted string through url"
Passing an encrypted string through a URL is a bad idea. Full stop.
"I'm using base64 encryptions from mcrypt() for encryptHTML() and decryptHTML()."
Without seeing what these functions do, this isn't helpful information, but mcrypt should be avoided. Use Libsodium (if you can; otherwise, use OpenSSL) instead.
"My original string was cut short!"
The + was probably treated as a space. Using urlencode() would fix that one problem, but it wouldn't solve the vulnerability to chosen-ciphertext attacks that using mcrypt introduces into your application in the absence of a Message Authentication Code (MAC).
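The "+ becomes a space" effect can be reproduced without any encryption at all. In this small sketch, $token is a made-up stand-in for the output of encryptHTML(), and the parameter names mirror the ones used in the question.

```php
<?php
// Stand-in for the encrypted token; a real one would come from encryptHTML().
$token = 'wz+zud8-3tSW+GL5Ym';

// Unescaped: parse_str() decodes "+" as a space, corrupting the token.
parse_str('crypted=' . $token . '&DOMvalue=10000', $broken);

// Escaped: urlencode() turns "+" into %2B, which decodes back to "+".
parse_str('crypted=' . urlencode($token) . '&DOMvalue=10000', $fixed);

var_dump($broken['crypted'] === $token); // bool(false)
var_dump($fixed['crypted'] === $token);  // bool(true)
```

This only addresses the transport problem. It does nothing about the missing authentication of the ciphertext discussed above.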

Parse Push Notification - PHP API

Parse Push Notifications are working. The thing is that I am trying to send a multiline notification and the PHP API is not detecting my EOL command. The messages arrived exactly as I send them:
Line1\r\nLine2
Any help will be appreciated.
Many thanks.
EDIT
This is my code:
require 'autoload.php';
$app_id = "zzzzzzzzzzzzzzzzzzzzzzzzz";
$rest_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxx";
$master_key = "cccccccccccccccccccccccccccccccc";
\Parse\ParseClient::initialize( $app_id, $rest_key, $master_key );
use Parse\ParsePush;
$data = array("alert" => $_POST["txtMessage"]);
ParsePush::send(array("channels" => ["Test"], "data" => $data));
EDIT #2:
My data array:
array(1) (
[alert] => (string) Line1\r\nLine2
)

It looks like your string is being escaped somewhere along the way.
This is what the escaped string looks like in PHP:
var_dump("Line1\\r\\nLine2");
string(14) "Line1\r\nLine2"
Simply because you escaped the escape character.
What you need is this:
var_dump("Line1\r\nLine2");
string(12) "Line1
Line2"
The above code should produce what you need. Check the other parts of your code (including the frontend) for anything that escapes the string.
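If the literal backslash sequences come straight from the form field (the user typed the characters \r\n into it), one possible workaround is to convert them into a real line break before sending. This is only a sketch: $raw simulates the posted value, and whether a plain line feed or "\r\n" renders best on the device is an assumption to verify.

```php
<?php
// Simulates form input where the user typed the characters "\", "r", "\", "n".
// Single quotes keep the backslashes literal, so this string is 14 bytes long.
$raw = 'Line1\r\nLine2';

// Replace the literal escape sequences with a real line feed.
// The longest sequence is listed first so '\r\n' becomes one line break, not two.
$message = str_replace(['\r\n', '\n', '\r'], "\n", $raw);

var_dump(strlen($raw));                // int(14)
var_dump($message === "Line1\nLine2"); // bool(true)
```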

json_decode Preservation of Type

I'm using the json_decode function to decode (and verify) a postback from a payment processor. The JSON object received looks as follows:
{
"notification":{
"version":6.0,
"attemptCount":0,
"role":"VENDOR",
.....
"lineItems":[
{
"itemNo":"1",
"productTitle":"A passed in title",
"shippable":false,
"recurring":false,
"customerProductAmount":1.00,
"customerTaxAmount":0.00
}
]
},
"verification":"9F6E504D"
}
The verification works as follows: one takes the notification node and appends a secret key. The first eight characters of the SHA1 hash of this string should match the content of the verification node.
However, I noticed that when using json_decode, double values such as 6.0 and 0.00 are truncated to integers (6, 0, etc.). This changes the string, so it no longer generates the correct SHA1 hash. Note that I cannot use the depth limit to prevent decoding of the notification branch, since I need to support PHP 5.0. How can I tackle this issue? The (defective) validation code I wrote is:
public function IPN_Check(){
$o = (json_decode($this->test_ipn));
$validation_string = json_encode($o->notification);
}

I tried the following:
<?php
var_dump(json_decode('
{
"notification":{
"version":6.0,
"attemptCount":0
}
}
'));
and got this output:
object(stdClass)#1 (1) {
["notification"]=>
object(stdClass)#2 (2) {
["version"]=>
float(6)
["attemptCount"]=>
int(0)
}
}
PHP does distinguish between float and int. Maybe you could call is_float() on each value of $o->notification (note that gettype() reports floats as 'double', not 'float') to check whether you need to add a zero when building the string.
UPD.
PHP distinguishes between float and int, but json_encode() may not. To make sure whole-number floats are encoded as floats, call json_encode() with the JSON_PRESERVE_ZERO_FRACTION flag (available since PHP 5.6.6), and they will keep a fractional part.
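A quick way to see what the flag does, and what it does not do, is to round-trip a small payload. The sample below is a trimmed, hypothetical version of the notification from the question.

```php
<?php
// Trimmed sample payload with whole-number floats written as 6.0 and 1.00.
$raw = '{"version":6.0,"attemptCount":0,"customerProductAmount":1.00}';
$decoded = json_decode($raw);

// Default encoding drops the fractional part of whole-number floats.
echo json_encode($decoded), PHP_EOL;
// prints {"version":6,"attemptCount":0,"customerProductAmount":1}

// With the flag, floats keep a ".0", but the original "1.00" is not restored.
echo json_encode($decoded, JSON_PRESERVE_ZERO_FRACTION), PHP_EOL;
// prints {"version":6.0,"attemptCount":0,"customerProductAmount":1.0}
```

Since 1.00 comes back as 1.0, a hash computed over the re-encoded string can still differ from one computed over the raw text, so the flag alone may not be enough for this verification.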

It looks like ClickBank always sends it in the same format, with only the two top-level fields "notification" and "verification". So you can just use substr to remove the first 16 characters ({"notification":) and the last 27 characters (,"verification":"XXXXXXXX"}) from the raw JSON and proceed from there:
$notification = substr($json, 16, -27);
$verification = strtoupper( substr( hash('sha1', $notification . $secret), 0, 8) );
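Wrapped into a function, the substr approach could look like the sketch below. The function name verifyNotification() and the sample secret are hypothetical. It assumes the raw payload is compact (no extra whitespace), has exactly the two top-level fields in this order, and carries an 8-character verification value. It also uses type hints and hash_equals(), so it needs a much newer PHP than the 5.0 mentioned in the question.

```php
<?php
// Hypothetical helper: checks the verification code against the raw JSON text,
// so no float formatting is lost to a decode/encode round trip.
function verifyNotification(string $json, string $secret): bool
{
    $notification = substr($json, 16, -27); // raw text of the "notification" value
    $received     = substr($json, -10, 8);  // the 8 characters before the final "}
    $expected     = strtoupper(substr(hash('sha1', $notification . $secret), 0, 8));

    // hash_equals() compares the two strings in constant time.
    return hash_equals($expected, $received);
}

// Build a sample payload whose verification code matches a made-up secret.
$secret = 'example-secret';
$body   = '{"version":6.0,"attemptCount":0}';
$code   = strtoupper(substr(sha1($body . $secret), 0, 8));
$json   = '{"notification":' . $body . ',"verification":"' . $code . '"}';

var_dump(verifyNotification($json, $secret)); // bool(true)
```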
