I am using "#nesk/puphpeteer": "^2.0.0" Link to Github-Repo and want get the text and the href-attribute from a link.
I tried the following:
<?php
require_once '../vendor/autoload.php';
use Nesk\Puphpeteer\Puppeteer;
use Nesk\Rialto\Data\JsFunction;
$debug = true;
$puppeteer = new Puppeteer([
'read_timeout' => 100,
'debug' => $debug,
]);
$browser = $puppeteer->launch([
'headless' => !$debug,
'ignoreHTTPSErrors' => true,
]);
$page = $browser->newPage();
$page->goto('http://example.python-scraping.com/');
//get text and link
$links = $page->querySelectorXPath('//*[#id="results"]/table/tbody/tr/td/div/a', JsFunction::createWithParameters(['node'])
->body('return node.textContent;'));
// iterate over links and print each link and its text
// get single text
$singleText = $page->querySelectorXPath('//*[#id="pagination"]/a', JsFunction::createWithParameters(['node'])
->body('return node.textContent;'));
$browser->close();
When I run the above script I get the nodes from the page, BUT I cannot access the attributes or the text?
Any suggestions how to do this?
I appreciate your replies!
querySelectorXPath return array of ElementHandle. one more thing querySelectorXPath does not support callback function.
first get all node ElementHandle
$links = $page->querySelectorXPath('//*[#id="results"]/table/tbody/tr/td/div/a');
then loop over links to access attributes or text of node
foreach($links as $link){
// for text
$text = $link->evaluate(JsFunction::createWithParameters(['node'])
->body('return node.innerText;'));
// for link
$link = $link->evaluate(JsFunction::createWithParameters(['node'])
->body('return node.href;'));
}
Disclaimer: This is just an intermediate answer - I would update once I've got more specific requests on HTML attrs or other expectations to be retrieved.
tl;dr: Mentioned composer package nesk/puphpeteer really is just a wrapper to underlying NodeJS based implementation of puppeteer. Thus, accessing data and structures has to be "similar" to their JavaScript counterparts...
Maybe Codeception (headless) or symfony/dom-crawler (raw markup) might be better and more mature alternatives.
Anyway, let's pick the example from above and go through it step by step:
$links = $page->querySelectorXPath(
'//*[#id="results"]/table/tbody/tr/td/div/a',
JsFunction::createWithParameters(['node'])->body('return node.textContent;')
);
XPath query $x() would result in an array of ElementHandle items
to access exported node.textContent (from JsFunction), corresponding data gets fetched via ElementHandle.getProperty(prop)
exporting a scalar value (to PHP) is then done via ElementHandle.jsonValue()
Thus, after that we would have something like this:
$links = $page->querySelectorXPath(
'//*[#id="results"]/table/tbody/tr/td/div/a',
JsFunction::createWithParameters(['node'])->body('return node.textContent;')
);
/** #var \Nesk\Puphpeteer\Resources\ElementHandle $link */
foreach ($links as $link) {
var_dump($link->getProperty('textContent')->jsonValue());
}
Which outputs the following raw data (as retrieved from http://example.python-scraping.com/):
string(12) " Afghanistan"
string(14) " Aland Islands"
string(8) " Albania"
string(8) " Algeria"
string(15) " American Samoa"
string(8) " Andorra"
string(7) " Angola"
string(9) " Anguilla"
string(11) " Antarctica"
string(20) " Antigua and Barbuda"
I want to sort Japanese words ( Kanji) like sort feature in excel.
I have tried many ways to sort Japanese text in PHP but the result is not 100% like result in excel.
First . I tried to convert Kanji to Katakana by using this lib (https://osdn.net/projects/igo-php/) but some case is not same like excel.
I want to sort these words ASC
けやきの家
高森台病院
みのりの里
My Result :
けやきの家
高森台病院
みのりの里
Excel Result:
けやきの家
みのりの里
高森台病院
Second I tried other way by using this function
mb_convert_kana($text, "KVc", "utf-8");
The sorting result is correct with those text above, but it contain some case not correct
米田病院
米田病院
高森台病院
My result :
米田病院
米田病院
高森台病院
Excel Result:
高森台病院
米田病院
米田病院
Do you guys have any idea about this. (Sorry for my English ) . Thank you
Firstly, Japanese kanji are not sortable. You can sort by its code number, but that order has no meanings.
Your using Igo (or any other morphological analysis libraries) sounds good solution, though it can not be perfect. And your first sort result seems fine for me. Why do you want them to be sorted in Excel order?
In Excel, if a cell keeps remembering its phonetic notations when the user initially typed on Japanese IME (Input Method Editor), that phonetics will be used in sort. That means, as not all cell might be typed manually on IME, some cells may not have information how those kanji-s are read. So results of sorting Kanji-s on Excel could be pretty unpredictable. (If sort seriously needed, usually we add another yomigana field, either in hiragana or katakana, and sort by that column.)
The second method mb_convert_kana() is totally off point. That function is to normalize hiragana/katakana, as there are two sets of letters by historical reason (full-width kana and half-width kana). Applying that function to your Japanese texts only changes kana parts. If that made your expectation satisfied, that must be coincidence.
You must define what Excel Japanese sort order your customer requires first. I will be happy to help you if it is clear.
[Update]
As op commented, mb_convert_kana() was to sort mixed hiragana/katakana. For that purpose, I suggest to use php_intl Collator. For example,
<?php
// demo: Japanese(kana) sort by php_intl Collator
if (version_compare(PHP_VERSION, '5.3.0', '<')) {
exit ('php_intl extension is available on PHP 5.3.0 or later.');
}
if (!class_exists('Collator')) {
exit ('You need to install php_intl extension.');
}
$collator = new Collator('ja_JP');
$textArray = [
'カキクケコ',
'日本語',
'アアト',
'Alphabet',
'アイランド',
'はひふへほ',
'あいうえお',
'漢字',
'たほいや',
'さしみじょうゆ',
'Roma',
'ラリルレロ',
'アート',
];
$result = $collator->sort($textArray);
if ($result === false) {
echo "sort failed" . PHP_EOL;
exit();
}
var_dump($textArray);
This sorts hiragana/katakana mixed texts array. Results are here.
array(13) {
[0]=>
string(8) "Alphabet"
[1]=>
string(4) "Roma"
[2]=>
string(9) "アート"
[3]=>
string(9) "アアト"
[4]=>
string(15) "あいうえお"
[5]=>
string(15) "アイランド"
[6]=>
string(15) "カキクケコ"
[7]=>
string(21) "さしみじょうゆ"
[8]=>
string(12) "たほいや"
[9]=>
string(15) "はひふへほ"
[10]=>
string(15) "ラリルレロ"
[11]=>
string(6) "漢字"
[12]=>
string(9) "日本語"
}
You won't need to normalize them by yourself. Both PHP(though with php_intl extension) and database(such like MySQL) know how to sort alphabets in many languages so you do not need to write it.
And, this does not solve the original issue, Kanji sort.
Laravel Alpha to Hiragana with a custom function
Note : $modals (laravel models with get() )
alphabets : Hiragana orders
Source : https://gist.github.com/mdzhang/899a427eb3d0181cd762
public static function orderByHiranagana ($modals,$column){
$outArray = array();
$alphabets = array("a","i","u","e","o","ka","ki","ku","ke","ko","sa","shi","su","se","so","ta","chi","tsu","te","to","na","ni","nu","ne","no","ha","hi","fu","he","ho","ma","mi","mu","me","mo","ya","yu","yo","ra","ri","ru","re","ro","wa","wo","n","ga","gi","gu","ge","go","za","ji","zu","ze","zo","da","ji","zu","de","do","ba","bi","bu","be","bo","pa","pi","pu","pe","po","(pause)","kya","kyu","kyo","sha","shu","sho","cha","chu","cho","nya","nyu","nyo","hya","hyu","hyo","mya","myu","myo","rya","ryu","ryo","gya","gyu","gyo","ja","ju","jo","bya","byu","byo","pya","pyu","pyo","yi","ye","va","vi","vu","ve","vo","vya","vyu","vyo","she","je","che","swa","swi","swu","swe","swo","sya","syu","syo","si","zwa","zwi","zwu","zwe","zwo","zya","zyu","zyo","zi","tsa","tsi","tse","tso","tha","ti","thu","tye","tho","tya","tyu","tyo","dha","di","dhu","dye","dho","dya","dyu","dyo","twa","twi","tu","twe","two","dwa","dwi","du","dwe","dwo","fa","fi","hu","fe","fo","fya","fyu","fyo","ryi","rye","(wa)","wi","(wu)","we","wo","wya","wyu","wyo","kwa","kwi","kwu","kwe","kwo","gwa","gwi","gwu","gwe","gwe","mwa","mwi","mwu","mwe","mwo");
$existIds = array();
foreach ($alphabets as $alpha){
foreach ($modals as $modal) {
if($alpha == strtolower(substr($modal->$column, 0, strlen($alpha))) && !in_array($modal->id,$existIds)) {
array_push($outArray,$modal);
array_push($existIds,$modal->id);
}
}
}
return $outArray;
}
Call like this :
$students = Students::get();
$students = CommonHelper::orderByHiranagana($students,'lastname');
I have a weird problem regarding passing an encrypted string through url. I'm using base64 encryptions from mcrypt() for encryptHTML() and decryptHTML().
I have this piece of code to encrypt:
$link_string = http_build_query(array('index_number'=>30843854, 'extra_attendence_id'=>27982423, 'target_temporary_id'=>378492085, 'date'=>'2016-05-06', 'action'=>'OUT', 'target_id'=>390234), '', '&');
$link_string = encryptHTML($link_string);
then I passed it through this url:
'localhost/website/controller/action/'.$link_string
then I decrypted it with this piece of code:
$id = $this->request->param('id');
$id = decryptHTML($id);
parse_str($id, $arr_id2);
var_dump($arr_id2);
I will get these in return, as expected:
array(6) { ["index_number"]=> string(8) "30843854" ["extra_attendence_id"]=> string(8) "27982423" ["target_temporary_id"]=> string(9) "378492085" ["date"]=> string(10) "2016-05-06" ["action"]=> string(3) "OUT" ["target_id"]=> string(6) "390234" }
The next case is when I still want the encrypted link but I need to attach some other value from DOM element in the page, so I tried to
'localhost/website/controller/action/encrypt='.$link_string.'&DOMvalue=10000'
then I modified the decryption with this piece of code:
$id = $this->request->param('id');
parse_str($id, $arr_id2);
$the_DOMValue = $arr_id2['DOMvalue'];
$id = decryptHTML($arr_id2['crypted']);
parse_str($id, $arr_id);
var_dump($the_DOMValue); echo "<br>";
var_dump($arr_id);
But then, I get these in return, to my surprise:
string(5) "10000"
array(3) { ["index_number"]=> string(13) "58_2016-04-26" ["extra_attendence_id"]=> string(1) "0" ["target_t"]=> string(0) "" }
My original string was cut short! Note that the DOMvalue is fine.
Then, I checked that right before both decryption, if the given encrypted string is different:
on first case of decryptHTML:
$id = $this->request->param('id');
var_dump($id);
$id = decryptHTML($id);
returns:
string(224) "zCQnh-rNP2R7h4UHyV5Dm5zp494DIIku5LWN51yYGMXBaHf0gJgEDw8UCuHRZxr-CkjkevHQ70kOPnSBQ9CJP6lZrFone-nDMDJhYlL8330wz+zud8-3tSWvdOLB7je5D-22aX4OrE3zlBYZZZtI-rMT73H0JGIRzZge2GzcZGLwS7Rj+GL5Ym-ET6JEHDShST4etgcQaEYXml-+BZ2+0BQKvubZEBOB"
on the second case of decryptHTML:
$id = $this->request->param('id');
parse_str($id, $arr_id2);
$the_DOMValue = $arr_id2['DOMvalue'];
var_dump($arr_id2['crypted']);
$id = decryptHTML($arr_id2['crypted']);
returns:
string(224) "zCQnh-rNP2R7h4UHyV5Dm5zp494DIIku5LWN51yYGMXBaHf0gJgEDw8UCuHRZxr-CkjkevHQ70kOPnSBQ9CJP6lZrFone-nDMDJhYlL8330wz zud8-3tSWvdOLB7je5D-22aX4OrE3zlBYZZZtI-rMT73H0JGIRzZge2GzcZGLwS7Rj GL5Ym-ET6JEHDShST4etgcQaEYXml- BZ2 0BQKvubZEBOB"
It looks exactly the same to me, but strangely it was decrypted differently. I of course used the same functions to decrypt both cases...
Anybody can shed me some light on this?
passing an encrypted string through url
Passing an encrypted string through a URL is a bad idea. Full stop.
I'm using base64 encryptions from mcrypt() for encryptHTML() and decryptHTML().
Without seeing what these functions do, this isn't helpful information, but mcrypt should be avoided. Use Libsodium (if you can; otherwise, use OpenSSL) instead.
My original string was cut short!
It probably treated the + as a space. Using urlencode() would fix one problem, but it wouldn't solve the vulnerability to chosen-ciphertext attacks that using mcrypt introduces into your application in the absence of a Message Authentication Code (MAC).
I'm using the json_decode function to decode (and verify a postback from a payment processor). the json object received looks as follow
{
"notification":{
"version":6.0,
"attemptCount":0,
"role":"VENDOR",
.....
"lineItems":[
{
"itemNo":"1",
"productTitle":"A passed in title",
"shippable":false,
"recurring":false,
"customerProductAmount":1.00,
"customerTaxAmount":0.00
}
]
},
"verification":"9F6E504D"
}
The verification works as follows, one takes the notification node and append a secret key. The first eight characters of the SHA1 hash of this string should match the content of the validation node.
However, I noticed that whilst using json_decode, the double value 6.0, 0.00 etc are truncated to integers (6, 0 ,etc). This messes up the string (in terms of it not generating the correct SHA1-hash). Do note, I cannot use the depth limit to prevent decoding of the notification branch, since I need to support PHP 5.0. How can I tackle this issue. The (defect) validation code I wrote is:
public function IPN_Check(){
$o = (json_decode($this->test_ipn));
$validation_string = json_encode($o->notification);
}
I tried the following:
<?php
var_dump(json_decode('
{
"notification":{
"version":6.0,
"attemptCount":0
}
}
'));
and got this output:
object(stdClass)#1 (1) {
["notification"]=>
object(stdClass)#2 (2) {
["version"]=>
float(6)
["attemptCount"]=>
int(0)
}
}
PHP does make a difference between float and int, maybe you could do something like gettype($o->notification[$i]) == 'float' to check whether you need to add a zero using a string.
UPD.
PHP does make a difference between float and int, but json_encode() - may not. To be sure, that you encode all values as they are - use json_encode() with JSON_PRESERVE_ZERO_FRACTION parameter, and all your float/double types will be saved correctly.
It looks like ClickBank they always send it in the same format with only the two top level fields "notification" and "verification". So you can just use substr to remove the first 16 characters ({"notification":) and the last 27 characters (,"verification":"XXXXXXXX"}) from the raw JSON and then proceed from there:
$notification = substr($json, 16, -27);
$verification = strtoupper( substr( hash('sha1', $notification . $secret), 0, 8) );