I am using "#nesk/puphpeteer": "^2.0.0" Link to Github-Repo and want get the text and the href-attribute from a link.
I tried the following:
<?php
require_once '../vendor/autoload.php';
use Nesk\Puphpeteer\Puppeteer;
use Nesk\Rialto\Data\JsFunction;
$debug = true;
$puppeteer = new Puppeteer([
'read_timeout' => 100,
'debug' => $debug,
]);
$browser = $puppeteer->launch([
'headless' => !$debug,
'ignoreHTTPSErrors' => true,
]);
$page = $browser->newPage();
$page->goto('http://example.python-scraping.com/');
//get text and link
$links = $page->querySelectorXPath('//*[#id="results"]/table/tbody/tr/td/div/a', JsFunction::createWithParameters(['node'])
->body('return node.textContent;'));
// iterate over links and print each link and its text
// get single text
$singleText = $page->querySelectorXPath('//*[#id="pagination"]/a', JsFunction::createWithParameters(['node'])
->body('return node.textContent;'));
$browser->close();
When I run the above script I get the nodes from the page, BUT I cannot access the attributes or the text?
Any suggestions how to do this?
I appreciate your replies!
querySelectorXPath return array of ElementHandle. one more thing querySelectorXPath does not support callback function.
first get all node ElementHandle
$links = $page->querySelectorXPath('//*[#id="results"]/table/tbody/tr/td/div/a');
then loop over links to access attributes or text of node
foreach($links as $link){
// for text
$text = $link->evaluate(JsFunction::createWithParameters(['node'])
->body('return node.innerText;'));
// for link
$link = $link->evaluate(JsFunction::createWithParameters(['node'])
->body('return node.href;'));
}
Disclaimer: This is just an intermediate answer - I would update once I've got more specific requests on HTML attrs or other expectations to be retrieved.
tl;dr: Mentioned composer package nesk/puphpeteer really is just a wrapper to underlying NodeJS based implementation of puppeteer. Thus, accessing data and structures has to be "similar" to their JavaScript counterparts...
Maybe Codeception (headless) or symfony/dom-crawler (raw markup) might be better and more mature alternatives.
Anyway, let's pick the example from above and go through it step by step:
$links = $page->querySelectorXPath(
'//*[#id="results"]/table/tbody/tr/td/div/a',
JsFunction::createWithParameters(['node'])->body('return node.textContent;')
);
XPath query $x() would result in an array of ElementHandle items
to access exported node.textContent (from JsFunction), corresponding data gets fetched via ElementHandle.getProperty(prop)
exporting a scalar value (to PHP) is then done via ElementHandle.jsonValue()
Thus, after that we would have something like this:
$links = $page->querySelectorXPath(
'//*[#id="results"]/table/tbody/tr/td/div/a',
JsFunction::createWithParameters(['node'])->body('return node.textContent;')
);
/** #var \Nesk\Puphpeteer\Resources\ElementHandle $link */
foreach ($links as $link) {
var_dump($link->getProperty('textContent')->jsonValue());
}
Which outputs the following raw data (as retrieved from http://example.python-scraping.com/):
string(12) " Afghanistan"
string(14) " Aland Islands"
string(8) " Albania"
string(8) " Algeria"
string(15) " American Samoa"
string(8) " Andorra"
string(7) " Angola"
string(9) " Anguilla"
string(11) " Antarctica"
string(20) " Antigua and Barbuda"
I want to sort Japanese words ( Kanji) like sort feature in excel.
I have tried many ways to sort Japanese text in PHP but the result is not 100% like result in excel.
First . I tried to convert Kanji to Katakana by using this lib (https://osdn.net/projects/igo-php/) but some case is not same like excel.
I want to sort these words ASC
けやきの家
高森台病院
みのりの里
My Result :
けやきの家
高森台病院
みのりの里
Excel Result:
けやきの家
みのりの里
高森台病院
Second I tried other way by using this function
mb_convert_kana($text, "KVc", "utf-8");
The sorting result is correct with those text above, but it contain some case not correct
米田病院
米田病院
高森台病院
My result :
米田病院
米田病院
高森台病院
Excel Result:
高森台病院
米田病院
米田病院
Do you guys have any idea about this. (Sorry for my English ) . Thank you
Firstly, Japanese kanji are not sortable. You can sort by its code number, but that order has no meanings.
Your using Igo (or any other morphological analysis libraries) sounds good solution, though it can not be perfect. And your first sort result seems fine for me. Why do you want them to be sorted in Excel order?
In Excel, if a cell keeps remembering its phonetic notations when the user initially typed on Japanese IME (Input Method Editor), that phonetics will be used in sort. That means, as not all cell might be typed manually on IME, some cells may not have information how those kanji-s are read. So results of sorting Kanji-s on Excel could be pretty unpredictable. (If sort seriously needed, usually we add another yomigana field, either in hiragana or katakana, and sort by that column.)
The second method mb_convert_kana() is totally off point. That function is to normalize hiragana/katakana, as there are two sets of letters by historical reason (full-width kana and half-width kana). Applying that function to your Japanese texts only changes kana parts. If that made your expectation satisfied, that must be coincidence.
You must define what Excel Japanese sort order your customer requires first. I will be happy to help you if it is clear.
[Update]
As op commented, mb_convert_kana() was to sort mixed hiragana/katakana. For that purpose, I suggest to use php_intl Collator. For example,
<?php
// demo: Japanese(kana) sort by php_intl Collator
if (version_compare(PHP_VERSION, '5.3.0', '<')) {
exit ('php_intl extension is available on PHP 5.3.0 or later.');
}
if (!class_exists('Collator')) {
exit ('You need to install php_intl extension.');
}
$collator = new Collator('ja_JP');
$textArray = [
'カキクケコ',
'日本語',
'アアト',
'Alphabet',
'アイランド',
'はひふへほ',
'あいうえお',
'漢字',
'たほいや',
'さしみじょうゆ',
'Roma',
'ラリルレロ',
'アート',
];
$result = $collator->sort($textArray);
if ($result === false) {
echo "sort failed" . PHP_EOL;
exit();
}
var_dump($textArray);
This sorts hiragana/katakana mixed texts array. Results are here.
array(13) {
[0]=>
string(8) "Alphabet"
[1]=>
string(4) "Roma"
[2]=>
string(9) "アート"
[3]=>
string(9) "アアト"
[4]=>
string(15) "あいうえお"
[5]=>
string(15) "アイランド"
[6]=>
string(15) "カキクケコ"
[7]=>
string(21) "さしみじょうゆ"
[8]=>
string(12) "たほいや"
[9]=>
string(15) "はひふへほ"
[10]=>
string(15) "ラリルレロ"
[11]=>
string(6) "漢字"
[12]=>
string(9) "日本語"
}
You won't need to normalize them by yourself. Both PHP(though with php_intl extension) and database(such like MySQL) know how to sort alphabets in many languages so you do not need to write it.
And, this does not solve the original issue, Kanji sort.
Laravel Alpha to Hiragana with a custom function
Note : $modals (laravel models with get() )
alphabets : Hiragana orders
Source : https://gist.github.com/mdzhang/899a427eb3d0181cd762
public static function orderByHiranagana ($modals,$column){
$outArray = array();
$alphabets = array("a","i","u","e","o","ka","ki","ku","ke","ko","sa","shi","su","se","so","ta","chi","tsu","te","to","na","ni","nu","ne","no","ha","hi","fu","he","ho","ma","mi","mu","me","mo","ya","yu","yo","ra","ri","ru","re","ro","wa","wo","n","ga","gi","gu","ge","go","za","ji","zu","ze","zo","da","ji","zu","de","do","ba","bi","bu","be","bo","pa","pi","pu","pe","po","(pause)","kya","kyu","kyo","sha","shu","sho","cha","chu","cho","nya","nyu","nyo","hya","hyu","hyo","mya","myu","myo","rya","ryu","ryo","gya","gyu","gyo","ja","ju","jo","bya","byu","byo","pya","pyu","pyo","yi","ye","va","vi","vu","ve","vo","vya","vyu","vyo","she","je","che","swa","swi","swu","swe","swo","sya","syu","syo","si","zwa","zwi","zwu","zwe","zwo","zya","zyu","zyo","zi","tsa","tsi","tse","tso","tha","ti","thu","tye","tho","tya","tyu","tyo","dha","di","dhu","dye","dho","dya","dyu","dyo","twa","twi","tu","twe","two","dwa","dwi","du","dwe","dwo","fa","fi","hu","fe","fo","fya","fyu","fyo","ryi","rye","(wa)","wi","(wu)","we","wo","wya","wyu","wyo","kwa","kwi","kwu","kwe","kwo","gwa","gwi","gwu","gwe","gwe","mwa","mwi","mwu","mwe","mwo");
$existIds = array();
foreach ($alphabets as $alpha){
foreach ($modals as $modal) {
if($alpha == strtolower(substr($modal->$column, 0, strlen($alpha))) && !in_array($modal->id,$existIds)) {
array_push($outArray,$modal);
array_push($existIds,$modal->id);
}
}
}
return $outArray;
}
Call like this :
$students = Students::get();
$students = CommonHelper::orderByHiranagana($students,'lastname');
I'm using a Mongo MapReduce to perform a word-count operation on a bunch of documents. The documents are very simple (just an ID and a hash of words):
{ "_id" : 6714078, "words" : { "my" : 1, "cat" : 1, "john" : 1, "likes" : 1, "cakes" : 1 } }
{ "_id" : 6715298, "words" : { "jeremy" : 1, "kicked" : 1, "the" : 1, "ball" : 1 } }
{ "_id" : 6717695, "words" : { "dogs" : 1, "can't" : 1, "look" : 1, "up" : 1 } }
The database is called "words" in my environment, the collections in question are named "wordsX" where X is a category number (I know, don't ask). The field in the document hash where the words are stored is also named "words". Gah.
The problem I'm having is that under certain conditions in my PHP app, the MapReduce doesn't return any data. Annoyingly, running the same commands from the Mongo shell gives perfect results. I'm trying to pin down where this bug is occurring but I'm really stumped, so hoping someone might be able to shed some light on this. The lead-up to this question does go on a bit, because the environment is a bit complicated, but please bear with me.
The commands I've tried running from the Mongo shell to replicate the PHP-based operations are as follows:
m = function () {
if (this.words) {
for (index in this.words) {
emit(index, this.words[index]);
}
}
}
r = function (key, values) {
var total = 0;
for (var i in values) {
total += values[i];
}
return total;
}
res = db.words.mapReduce(m, r, { query : { _id : { $in : [6714078,6715298,6717695] } } });
This results in a temporary collection being created containing the word count data. All OK so far.
However if I run the same commands from PHP (using the standard Mongo library), I end up with no data under certain conditions. It's a bit tricky to describe because I don't want to bore you with the details of the application/environment beyond Mongo, but basically I'm using Sphinx to filter some records, then supplying a list of content IDs to Mongo on which the MapReduce is performed. If I filter back into the data set by 2 or 3 days, I get results back from Mongo; if I don't filter, I get an empty dataset back. The PHP code to run the same operation is as follows. I've not included the Sphinx-based parts as I don't think they're relevant (just know that we get a list of IDs back) because I've tried supplying exactly the same list to Mongo on the command line and got the right results, whereas I don't from within PHP. Hope that makes sense.
The PHP code I'm using looks like this:
$objMongo = new Mongo();
$objDB = $objMongo->words;
$arrWordList = array();
$strMap = '
function() {
if (this.words) {
for (index in this.words) {
emit(index, this.words[index]);
}
}
}
';
$strReduce = '
function(key, values) {
var total = 0;
for (var i in values) {
total += values[i];
}
return total;
}
';
$objMapFunc = new MongoCode($strMap);
$objReduceFunc = new MongoCode($strReduce);
$arrQuery = array(
'_id' => array('$in' => $arrIDs) // <--- list of IDs from Sphinx
);
$arrCommand = array(
'mapreduce' => 'wordsX',
'map' => $objMapFunc,
'reduce' => $objReduceFunc,
'query' => $arrQuery
);
MongoCursor::$timeout = -1;
$arrStatsInfo = $objDB->command($arrCommand);
var_dump($arrStatsInfo);
The contents of the result-info array ($arrStatsInfo) under working and non-working conditions (the filtering as specified above) are as follows.
Working results:
array(4) {
["result"]=>
string(31) "tmp.mr.mapreduce_1279637336_227"
["timeMillis"]=>
int(171)
["counts"]=>
array(3) {
["input"]=>
int(54)
["emit"]=>
int(2517)
["output"]=>
int(1526)
}
["ok"]=>
float(1)
}
Empty results:
array(4) {
["result"]=>
string(31) "tmp.mr.mapreduce_1279637381_228"
["timeMillis"]=>
int(21)
["counts"]=>
array(3) {
["input"]=>
int(0)
["emit"]=>
int(0)
["output"]=>
int(0)
}
["ok"]=>
float(1)
}
So it looks like under the broken condition, no records even make it into the MapReduce. I've spent ages trying to work out what on earth is going on here but I've had no insights thus far. As I've said, running the same commands (as above) directly in the Mongo command line using exactly the same set of IDs returns the right results.
After all that, I guess my question is: is there anything obviously wrong with the PHP-Mongo interaction I'm doing above? Are there other steps I can take to try to debug this?
Please let me know if supplying any further information would be helpful. I appreciate this is a somewhat expansive and ill-defined question but I've tried my best to communicate the issue! Really hope someone can suggest a way out of this.
Many thanks for reading!
For future readers, this issue turned out to be the result of inconsistent handling of ints/numeric strings elsewhere in the app. Sorry about the red herring!