list=allpages does not deliver all pages - php

I have a problem: I want to fill a list with the names of all pages in my wiki. My script:
$TitleList = [];
$nsList = [];
$nsURL = 'wiki/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&format=json';
$nsJson = file_get_contents($nsURL);
$nsJsonD = json_decode($nsJson, true);
foreach ($nsJsonD['query']['namespaces'] as $ns)
{
    if ( $ns['id'] >= 0 )
        array_push($nsList, $ns['id']);
}
# populate the list of all pages in each namespace
foreach ($nsList as $n)
{
    $urlGET = 'wiki/api.php?action=query&list=allpages&apnamespace='.$n.'&format=json';
    $json = file_get_contents($urlGET);
    $json_b = json_decode($json, true);
    foreach ($json_b['query']['allpages'] as $page)
    {
        echo("\n".$page['title']);
        array_push($TitleList, $page["title"]);
    }
}
But about 35% of the pages that I can visit on my wiki are still missing (tested with "random site"). Does anyone know why this could happen?

MediaWiki API doesn't return all results at once, but does so in batches.
A default batch is only 10 pages; you can specify aplimit to change that (500 max for users, 5,000 max for bots).
To get the next batch, you need to specify the continue= parameter; in each batch, you will also get a continue property in the returned data, which you can use to ask for the next batch. To get all pages, you must loop as long as a continue element is present.
For example, on the English Wikipedia, this would be the first API call:
https://en.wikipedia.org/w/api.php?action=query&list=allpages&apnamespace=0&format=json&aplimit=500&continue=
...and the continue object will be this:
"continue":{
"apcontinue":"\"Cigar\"_Daisey",
"continue":"-||"
}
(Updated according to comment by OP, with example code)
You would now want to flatten the continue array into URL parameters, for example using `http_build_query()`, as in the code below.
See the more complete explanation here:
https://www.mediawiki.org/wiki/API:Query#Continuing_queries
A working version of your code should be (tested against Wikipedia with slightly different code):
# populate the list of all pages in each namespace
foreach ($nsList as $n) {
    // aplimit can go up to 5,000 if you are using a bot account
    $baseUrl = 'wiki/api.php?action=query&list=allpages&apnamespace='.$n.'&format=json&aplimit=500&';
    $next = '';
    while ($next !== null) {
        $urlGET = $baseUrl . $next;
        $json = file_get_contents($urlGET);
        $json_b = json_decode($json, true);
        foreach ($json_b['query']['allpages'] as $page)
        {
            echo("\n".$page['title']);
            array_push($TitleList, $page["title"]);
        }
        // Flatten the continue object into URL parameters for the next request,
        // or stop once the API no longer returns one.
        if (isset($json_b['continue'])) {
            $next = http_build_query($json_b['continue']);
        } else {
            $next = null;
        }
    }
}

Related

PHP Looping through JSON data without knowing the number of data items

I am trying to get all images from an image API. It returns a maximum of 500 results at a time, and if the result has a next_page field, I have to grab the value of that field and add it to the URL. The code should continue looping until that field is absent. I used the following code to grab the first two pages:
$key = true;
$link = 'https://url.com/dhj/?prefix=images&type=download';
$json = file_get_contents($link);
$data = json_decode($json, true);
$dataArray = array();
foreach ($data["images"] as $r)
{
    array_push($dataArray, array($r["id"], $r["image"]));
}
while($key)
{
    if($data["next_page"])
    {
        $key = true;
        $link2 = "https://url.com/dhj/?prefix=images&type=download&next_page=" . $data[$next_page];
        $json2 = file_get_contents($link2);
        $data2 = json_decode($json2, true);
        foreach ($data2["images"] as $r2)
        {
            array_push($dataArray, array($r2["id"], $r2["image"]));
        }
    }
    else
    {
        $key = false;
    }
}
This should fetch 2000 records but is only fetching 1000 records, so it appears the loop is not working as expected.
So your problem is that you are only fetching twice. The second time, you never check $data2 for a next page, so everything stops. You do not want to keep going like this, or you will need $data3, $data4, etc.
A do/while loop is similar to a while loop, except that it always runs at least once. The condition is evaluated at the end of the loop instead of the beginning. You can use that behaviour to ensure you always get the first page of data, and then use the condition to check if you should keep getting more.
$page = "";
do {
$link = "https://url.com/dhj/?prefix=images&type=download&next_page=$page";
$json = file_get_contents($link);
$data = json_decode($json, true);
foreach ($data["images"] as $r) {
$dataArray[] = [$r["id"], $r["image"]];
}
$page = $data["next_page"] ?? "";
} while ($page);
Note I've got rid of your array_push() call. This is rarely used in PHP because the $var[] syntax is less verbose and doesn't require predeclaration of the array. Likewise, calls to array() have long been replaced by use of array literal syntax.
The expression $page = $data["next_page"] ?? "" uses the null coalesce operator, and is identical to:
if (isset($data["next_page"])) {
$page = $data["next_page"];
} else {
$page = "";
}

Calls to Office365 API to synchronize events, throttling

I am trying to synchronize a few events from Outlook to my local DB and I call the API as below:
$url = 'https://outlook.office365.com/api/v2.0/users/' . $this->user . '/CalendarView/'
. '?startDateTime=' . $start_datetime
. '&endDateTime=' . $end_datetime
This gives me all the events from Outlook between two specific dates.
Then I go and save all these events using the code below. The problem is that it returns only 10 events at a time.
$http = new \Http_Curl();
$http->set_headers( $this->get_headers() );
$response = $http->get( $url );
$data = array();
$continue = true;
while ( $continue ) {
    if ( isset($response->value) ) {
        $arr = array();
        foreach ( $response->value as $event ) {
            $arr[] = $event;
        }
        $data = array_merge( $data, $arr );
    }
    $property = '#odata.nextLink';
    if ( isset( $response->$property ) ) {
        $url = $response->$property;
        $response = $http->get( $url );
    } else {
        $continue = false;
    }
}
unset( $http );
return $data;
I then tried to call the API as below, setting the top parameter, but I ended up with many empty events.
$url = 'https://outlook.office365.com/api/v2.0/users/' . $this->user . '/CalendarView/'
. '?startDateTime=' . $start_datetime
. '&endDateTime=' . $end_datetime
.'&top=100'
I am trying to avoid making more than 60 calls per minute. Is there any way to first get the number of events between two dates and then retrieve all of them, so that the top parameter is actually the total number of events?
The correct query parameter is $top, not top. Notice the $ in there.
http://docs.oasis-open.org/odata/odata/v4.0/errata03/os/complete/part2-url-conventions/odata-v4.0-errata03-os-part2-url-conventions-complete.html#_Toc453752362
5.1.5 System Query Options $top and $skip
The $top system query option requests the number of items in the queried collection to be included in the result. The $skip query option requests the number of items in the queried collection that are to be skipped and not included in the result. A client can request a particular page of items by combining $top and $skip.
The semantics of $top and $skip are covered in the [OData-Protocol] document. The [OData-ABNF] top and skip syntax rules define the formal grammar of the $top and $skip query options respectively.
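For illustration (this sketch is not from the original answer), paging with $top and $skip could look like the code below. The page size of 100, the Http_Curl wrapper and the surrounding variables are assumptions taken from the question's code. Note that inside a double-quoted PHP string $top would be interpolated as a variable, so single quotes or concatenation are safer:
// Sketch: request pages of 100 events using the $top/$skip system query options.
$http = new \Http_Curl();
$http->set_headers( $this->get_headers() );
$pageSize = 100;
$skip = 0;
$events = array();
do {
    // Single-quoted strings keep the literal $top/$skip in the URL.
    $url = 'https://outlook.office365.com/api/v2.0/users/' . $this->user . '/CalendarView/'
        . '?startDateTime=' . $start_datetime
        . '&endDateTime=' . $end_datetime
        . '&$top=' . $pageSize
        . '&$skip=' . $skip;
    $response = $http->get( $url );
    $batch = isset( $response->value ) ? $response->value : array();
    $events = array_merge( $events, $batch );
    $skip += $pageSize;
} while ( count( $batch ) === $pageSize );
Alternatively, keeping the original nextLink loop and only correcting top to $top should also make the server return larger pages.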

Is it possible to execute a function on all members of an array all at once?

Context:
$a = array('1','2','3');
foreach ($a as $item){
    // rest of code
    // example: file_get_contents(url);
    // the script waits for it to be completed before going to the next
}
The above script goes one by one.
My concern is, when a process on a single element takes too long, the remaining elements have to wait to be processed.
Is it possible to do stuff to all array items all at once?
Doing some searching I found Rolling Curl:
https://github.com/joshfraser/rolling-curl
I took the included example.php and made it a bit shorter:
require("RollingCurl.php");
$urls = array(); // regular array, or from csv or from Database...
function request_callback($response, $info) {
// parse the page title out of the returned HTML
if (preg_match("~<title>(.*?)</title>~i", $response, $out)) {
$title = $out[1];
}
echo "<b>$title</b>";
print_r($info);
echo "<hr>";
}
$rc = new RollingCurl("request_callback");
$rc->window_size = 20;
foreach ($urls as $url) {
$request = new RollingCurlRequest($url);
$rc->add($request);
}
$rc->execute();
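For completeness (this sketch is not from the original post), PHP's built-in curl_multi functions can do the same kind of parallel fetching without an external library; the URLs below are placeholders:
// Sketch: start all requests at once with curl_multi and collect the bodies.
$urls = array('https://example.com/a', 'https://example.com/b'); // placeholder URLs
$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}
// Drive all transfers until every one has finished.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running > 0) {
        curl_multi_select($mh);
    }
} while ($running > 0 && $status === CURLM_OK);
foreach ($handles as $ch) {
    $body = curl_multi_getcontent($ch);
    // ... process $body here, e.g. extract the <title> as in request_callback above
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
Unlike RollingCurl, this simple version has no window_size throttle, so it fires every request at once; RollingCurl's rolling window is gentler on long URL lists.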

How to improve performance iterating a DOMDocument?

I'm using cURL to pull a webpage from a server. I pass it to Tidy and throw the output into a DOMDocument. Then the trouble starts.
The webpage contains about three thousand (yikes) table tags, and I'm scraping data from them. There are two kinds of tables, where one or more type B follow a type A.
I've profiled my script using microtime(true) calls. I've placed calls before and after each stage of my script and subtracted the times from each other. So, if you'll follow me through my code, I'll explain it, share the profile results, and point out where the problem is. Maybe you can even help me solve the problem. Here we go:
First, I include two files. One handles some parsing, and the other defines two "data structure" classes.
// Imports
include('./course.php');
include('./utils.php');
Includes are inconsequential as far as I know, and so let's proceed to the cURL import.
// Execute cURL
$response = curl_exec($curl_handle);
I've configured cURL to not time out, and to post some header data, which is required to get a meaningful response. Next, I clean up the data to prepare it for DOMDocument.
// Run about 25 str_replace calls here, to clean up
// then run tidy.
$html = $response;
//
// Prepare some config for tidy
//
$config = array(
    'indent' => true,
    'output-xhtml' => true,
    'wrap' => 200);
//
// Tidy up the HTML
//
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();
$html = $tidy;
Up until now, the code has taken about nine seconds. Considering this to be a cron job, running infrequently, I'm fine with that. However, the next part of the code really barfs. Here's where I take what I want from the HTML and shove it into my custom classes. (I plan to stuff this into a MySQL database too, but this is a first step.)
// Get all of the tables in the page
$tables = $dom->getElementsByTagName('table');
// Create a buffer for the courses
$courses = array();
// Iterate
$numberOfTables = $tables->length;
for ($i = 1; $i < $numberOfTables; $i++) {
    $sectionTable = $tables->item($i);
    $courseTable = $tables->item($i - 1);
    // We've found a course table, parse it.
    if (elementIsACourseSectionTable($sectionTable)) {
        $course = courseFromTable($courseTable);
        $course = addSectionsToCourseUsingTable($course, $sectionTable);
        $courses[] = $course;
    }
}
For reference, here are the utility functions that I call:
//
// Tell us if a given element is
// a course section table.
//
function elementIsACourseSectionTable(DOMElement $element){
    $tableHasClass = $element->hasAttribute('class');
    $tableIsCourseTable = $element->getAttribute("class") == "coursetable";
    return $tableHasClass && $tableIsCourseTable;
}
//
// Takes a table and parses it into an
// instance of the Course class.
//
function courseFromTable(DOMElement $table){
    $secondRow = $table->getElementsByTagName('tr')->item(1);
    $cells = $secondRow->getElementsByTagName('td');
    $course = new Course;
    $course->startDate = valueForElementInList(0, $cells);
    $course->endDate = valueForElementInList(1, $cells);
    $course->name = valueForElementInList(2, $cells);
    $course->description = valueForElementInList(3, $cells);
    $course->credits = valueForElementInList(4, $cells);
    $course->hours = valueForElementInList(5, $cells);
    $course->division = valueForElementInList(6, $cells);
    $course->subject = valueForElementInList(7, $cells);
    return $course;
}
//
// Takes a table row and parses it into an
// instance of the Section class.
//
function sectionFromRow(DOMElement $row){
    $cells = $row->getElementsByTagName('td');
    //
    // Skip any row with a single cell
    //
    if ($cells->length == 1) {
        return NULL;
    }
    //
    // Skip header rows
    //
    if (valueForElementInList(0, $cells) == "Section" || valueForElementInList(0, $cells) == "") {
        return NULL;
    }
    $section = new Section;
    $section->section = valueForElementInList(0, $cells);
    $section->code = valueForElementInList(1, $cells);
    $section->openSeats = valueForElementInList(2, $cells);
    $section->dayAndTime = valueForElementInList(3, $cells);
    $section->instructor = valueForElementInList(4, $cells);
    $section->buildingAndRoom = valueForElementInList(5, $cells);
    $section->isOnline = valueForElementInList(6, $cells);
    return $section;
}
//
// Takes a table containing course sections
// and puts the parsed results into a
// given course object.
//
function addSectionsToCourseUsingTable(Course $course, DOMElement $table){
    $rows = $table->getElementsByTagName('tr');
    $numRows = $rows->length;
    for ($i = 0; $i < $numRows; $i++) {
        $section = sectionFromRow($rows->item($i));
        // Make sure we have an array to put sections into
        if (is_null($course->sections)) {
            $course->sections = array();
        }
        // Skip "meta" rows, since they're not really sections
        if (is_null($section)) {
            continue;
        }
        $course->addSection($section);
    }
    return $course;
}
//
// Returns the text from a cell
// at the given index in a list.
//
function valueForElementInList($index, $list){
    $value = $list->item($index)->nodeValue;
    $value = trim($value);
    return $value;
}
This code takes 63 seconds. That's over a minute for a PHP script to pull data from a webpage. Sheesh!
I've been advised to split up the workload of my main work loop, but considering the homogeneous nature of my data, I'm not entirely sure how. Any suggestions on improving this code are greatly appreciated.
What can I do to improve my code execution time?
It turns out that my loop is terribly inefficient.
Using a foreach cut time in half to about 31 seconds. But that wasn't fast enough. So I reticulated some splines and did some brainstorming with about half of the programmers that I know how to poke online. Here's what we found:
Using DOMNodeList's item() accessor is linear in the index, which makes looping over the whole list quadratic and painfully slow. So, removing the first element after each iteration makes the loop faster: we then always access the first element of the list. This brought me down to 8 seconds.
After playing some more, I realized that the ->length property of DOMNodeList is just as bad as item(), since it also incurs linear cost. So I changed my for loop to this:
$table = $tables->item(0);
while ($table !== NULL) {
    //
    // We've found a section table, parse it.
    //
    if (elementIsACourseSectionTable($table)) {
        $course = addSectionsToCourseUsingTable($course, $table);
    }
    //
    // Skip the last table if it's not a course section
    //
    else if (elementIsCourseHeaderTable($table)) {
        $course = courseFromTable($table);
        $courses[] = $course;
    }
    //
    // Remove the first item from the list
    //
    $first = $tables->item(0);
    $first->parentNode->removeChild($first);
    //
    // Get the next table to parse
    //
    $table = $tables->item(0);
}
Note that I've done some other optimizations in terms of targeting the data I want, but the relevant part is how I handle progressing from one item to the next.
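As a side note (not from the original answer), one way to target the data directly is DOMXPath, which can select just the coursetable tables instead of walking every table in the document. A minimal sketch, assuming the $dom, $courses and helper functions above, and assuming the matching course header table is the nearest preceding table sibling:
// Sketch: select only the course section tables by class with XPath.
$xpath = new DOMXPath($dom);
$sectionTables = $xpath->query('//table[@class="coursetable"]');
foreach ($sectionTables as $sectionTable) {
    // Walk back to the nearest preceding <table>, assumed to be the course header.
    $courseTable = $sectionTable->previousSibling;
    while ($courseTable !== null && $courseTable->nodeName !== 'table') {
        $courseTable = $courseTable->previousSibling;
    }
    if ($courseTable === null) {
        continue;
    }
    $course = courseFromTable($courseTable);
    $course = addSectionsToCourseUsingTable($course, $sectionTable);
    $courses[] = $course;
}
Whether this beats the remove-first-node trick depends on the markup, but it keeps the loop from visiting unrelated tables at all.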

Multi-dimensional array search to preserve parent

TL;DR
I have this data: var_export and print_r.
And I need to narrow it down to: http://pastebin.com/EqwgpgAP ($data['Stock Information:'][0][0]);
How would one achieve it? (dynamically)
I'm working with vTiger 5.4.0 CRM and am looking to implement a function that would return a particular field information based on search criteria.
Well, vTiger is a pretty weakly written system; it looks and feels old, and everything comes out of hundreds of tables with multiple joins (that's actually not that bad), but a job is a job.
The need arose from getting the usageunit picklist from the Products module, Stock Information block.
Since there is no such function as getField();, I am looking to filter it out from the Blocks, which actually gather the information about fields as well.
getBlocks(); then calls something close to getFields();, which again calls something close to getValues(); and so on.
So...
$focus = new $currentModule(); // Products
$displayView = getView($focus->mode);
$productsBlocks = getBlocks($currentModule, $displayView, $focus->mode, $focus->column_fields); // in theory, $focus->column_fields should/could be narrowed down to my specific field, but vTiger doesn't work that way
echo "<pre>"; print_r($productsBlocks); echo "</pre>"; // = http://pastebin.com/3iTDUUgw (huge dump)
As you can see, the array under the key [Stock Information:] (which actually comes from translations, yada, yada...) contains the information for usageunit under [0][0].
Now, I was trying to array_filter(); the data out of there, but the only thing I've managed to get is $productsBlocks stripped down to contain only [Stock Information:] with all its data:
$getUsageUnit = function($value) use (&$getUsageUnit) {
    if (is_array($value)) return array_filter($value, $getUsageUnit);
    if ($value == 'usageunit') return true;
};
$productsUsageUnit = array_filter($productsBlocks, $getUsageUnit);
echo "<pre>"; print_r($productsUsageUnit); echo "</pre>"; // = http://pastebin.com/LU6VRC4h (not that huge of a dump)
And the result I'm looking for is http://pastebin.com/EqwgpgAP, which I got manually with print_r($productsUsageUnit['Stock Information:'][0][0]);.
How do I achieve this? (dynamically...)
// Returns, for each top-level entry of $data that contains $query somewhere
// inside it, the sub-array that holds the match (for the data in the question,
// $data['Stock Information:'][0][0]).
function helper($data, $query) {
    $result = array();
    // Recursive closure: depth-first search for $query; on the way back up it
    // pushes each enclosing array onto $stack, innermost first.
    $search = function ($data, &$stack) use (&$search, $query) {
        foreach ($data as $entry) {
            if (is_array($entry) && $search($entry, $stack) || $entry === $query) {
                $stack[] = $entry;
                return true;
            }
        }
        return false;
    };
    foreach ($data as $sub) {
        $parentStack = array();
        if ($search($sub, $parentStack)) {
            // The second-to-last element on the stack is the block that was
            // printed manually in the question ([0][0] under the matching key).
            $result[] = $parentStack[sizeof($parentStack) - 2];
        }
    }
    return $result;
}
$node = helper($data, 'usageunit');
print_r($node);
