Calls to Office365 API to synchronize events, throttling - php

I am trying to synchronize a few events from Outlook to my local DB and I call the API as below:
$url = 'https://outlook.office365.com/api/v2.0/users/' . $this->user . '/CalendarView/'
. '?startDateTime=' . $start_datetime
. '&endDateTime=' . $end_datetime;
This gives me all the events from Outlook between two specific dates.
Then I go and save all these events using the code below. The problem is that the API returns only 10 events at a time.
$http = new \Http_Curl();
$http->set_headers( $this->get_headers() );
$response = $http->get( $url );
$data = array();
$continue = true;
while ( $continue ) {
    if ( isset( $response->value ) ) {
        $arr = array();
        foreach ( $response->value as $event ) {
            $arr[] = $event;
        }
        $data = array_merge( $data, $arr );
    }
    $property = '#odata.nextLink';
    if ( isset( $response->$property ) ) {
        $url = $response->$property;
        $response = $http->get( $url );
    } else {
        $continue = false;
    }
}
unset( $http );
return $data;
I then tried to call the API as below, setting the top parameter to 100, but I end up with many empty events.
$url = 'https://outlook.office365.com/api/v2.0/users/' . $this->user . '/CalendarView/'
. '?startDateTime=' . $start_datetime
. '&endDateTime=' . $end_datetime
. '&top=100';
I am trying to avoid making more than 60 calls per minute. Is there any way to first get the number of events between the two dates and then retrieve all of them in one go, so that the top parameter is set to the total number of events?

The correct query parameter is $top, not top. Note the $ prefix.
http://docs.oasis-open.org/odata/odata/v4.0/errata03/os/complete/part2-url-conventions/odata-v4.0-errata03-os-part2-url-conventions-complete.html#_Toc453752362
5.1.5 System Query Options $top and $skip
The $top system query option requests the number of items in the queried collection to be included in the result. The $skip query option requests the number of items in the queried collection that are to be skipped and not included in the result. A client can request a particular page of items by combining $top and $skip.
The semantics of $top and $skip are covered in the [OData-Protocol] document. The [OData-ABNF] top and skip syntax rules define the formal grammar of the $top and $skip query options respectively.
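For illustration, here is a minimal sketch of the corrected request. It reuses the \Http_Curl wrapper, $this->user and $this->get_headers() from the question (all assumptions about the surrounding class), asks for larger pages via $top, and still follows the same next-page property the question's loop checks in case the server caps the page size:

$url = 'https://outlook.office365.com/api/v2.0/users/' . $this->user . '/CalendarView/'
     . '?startDateTime=' . urlencode( $start_datetime )
     . '&endDateTime='   . urlencode( $end_datetime )
     . '&$top=50';                     // note the $ prefix on the OData option

$http = new \Http_Curl();
$http->set_headers( $this->get_headers() );

$data = array();
do {
    $response = $http->get( $url );
    if ( isset( $response->value ) ) {
        $data = array_merge( $data, $response->value );
    }
    // Same next-page property the loop in the question checks.
    $property = '#odata.nextLink';
    $url = isset( $response->$property ) ? $response->$property : null;
} while ( $url !== null );

unset( $http );

Larger pages mean fewer requests against the 60-calls-per-minute budget, and the loop still handles the case where more than one page comes back.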

Related

list=allpages does not deliver all pages

I want to fill a list with the names of all pages in my wiki. My script:
$TitleList = [];
$nsList = [];
$nsURL = 'wiki/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&format=json';
$nsJson = file_get_contents($nsURL);
$nsJsonD = json_decode($nsJson, true);
foreach ($nsJsonD['query']['namespaces'] as $ns)
{
    if ( $ns['id'] >= 0 )
        array_push($nsList, $ns['id']);
}
# populate the list of all pages in each namespace
foreach ($nsList as $n)
{
    $urlGET = 'wiki/api.php?action=query&list=allpages&apnamespace=' . $n . '&format=json';
    $json = file_get_contents($urlGET);
    $json_b = json_decode($json, true);
    foreach ($json_b['query']['allpages'] as $page)
    {
        echo("\n" . $page['title']);
        array_push($TitleList, $page["title"]);
    }
}
But about 35% of the pages are still missing, even though I can visit them on my wiki (testing with the "random page" feature). Does anyone know why this could happen?
MediaWiki API doesn't return all results at once, but does so in batches.
A default batch is only 10 pages; you can specify aplimit to change that (500 max for users, 5,000 max for bots).
To get the next batch, you need to specify the continue= parameter; in each batch, you will also get a continue property in the returned data, which you can use to ask for the next batch. To get all pages, you must loop as long as a continue element is present.
For example, on the English Wikipedia, this would be the first API call:
https://en.wikipedia.org/w/api.php?action=query&list=allpages&apnamespace=0&format=json&aplimit=500&continue=
...and the continue object will be this:
"continue":{
"apcontinue":"\"Cigar\"_Daisey",
"continue":"-||"
}
(Updated according to comment by OP, with example code)
You would now want to flatten the continue array into URL parameters, for example using `http_build_query()`.
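For the continue object above, that flattening produces exactly the extra query parameters needed for the next request; a quick sketch:

// Flatten the 'continue' object returned by the API into query parameters.
$continue = array(
    'apcontinue' => '"Cigar"_Daisey',
    'continue'   => '-||',
);
echo http_build_query($continue);
// apcontinue=%22Cigar%22_Daisey&continue=-%7C%7C

http_build_query() also takes care of the URL encoding, so the values can be appended to the base URL as-is.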
See the more complete explanation here:
https://www.mediawiki.org/wiki/API:Query#Continuing_queries
A working version of your code should look like this (tested against Wikipedia with slightly different code):
# populate the list of all pages in each namespace
foreach ($nsList as $n) {
    // Increase aplimit (up to 5,000) if you are using a bot
    $baseUrl = 'wiki/api.php?action=query&list=allpages&apnamespace=' . $n . '&format=json&aplimit=500&';
    $next = 'continue=';
    while ($next !== null) {
        $urlGET = $baseUrl . $next;
        $json = file_get_contents($urlGET);
        $json_b = json_decode($json, true);
        foreach ($json_b['query']['allpages'] as $page) {
            echo("\n" . $page['title']);
            array_push($TitleList, $page["title"]);
        }
        // Keep looping as long as a 'continue' element is present in the response.
        $next = isset($json_b['continue']) ? http_build_query($json_b['continue']) : null;
    }
}

Building chained function calls dynamically in PHP

I use PHP (with KirbyCMS) and can create this code:
$results = $site->filterBy('a_key', 'a_value')->filterBy('a_key2', 'a_value2');
This is a chain with two filterBy. It works.
However I need to build a function call like this dynamically. Sometimes it can be two chained function calls, sometimes three or more.
How is that done?
Maybe you can play with this code?
$chains is just a random number that can be used to create between 1 and 5 chained calls.
for ( $i = 0; $i < 10; $i++ ) {
    $chains = rand(1, 5);
}
Examples of desired result
Example one, just one function call
$results = $site->filterBy('a_key', 'a_value');
Example two, many chained function calls
$results = $site->filterBy('a_key', 'a_value')->filterBy('a_key2', 'a_value2')->filterBy('a_key3', 'a_value3')->filterBy('a_key4', 'a_value4')->filterBy('a_key5', 'a_value5')->filterBy('a_key6', 'a_value6');
$chains = rand(1, 5);
$results = $site;
$suffix = '';
for ( $i = 1; $i <= $chains; $i++ ) {
    if ($i != 1) {
        $suffix = $i;
    }
    $results = $results->filterBy('a_key' . $suffix, 'a_value' . $suffix);
}
If you are able to pass 'a_key1' and 'a_value1' to the first call to filterBy instead of 'a_key' and 'a_value', you could simplify the code by removing $suffix and the if block and just appending $i.
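A minimal sketch of that simplified loop, assuming the keys and values can be numbered a_key1/a_value1 through a_keyN/a_valueN:

$chains  = rand(1, 5);
$results = $site;
for ($i = 1; $i <= $chains; $i++) {
    // Keys and values are simply numbered 1..N, so there is no special case for the first call.
    $results = $results->filterBy('a_key' . $i, 'a_value' . $i);
}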
You don't need to generate the list of chained calls. You can put the arguments of each call in a list then write a new method of the class that gets them from the list and uses them to invoke filterBy() repeatedly.
I assume from your example code that function filterBy() returns $this or another object of the same class as site.
//
// The code that generates the filtering parameters:
// Store the arguments of the filtering here
$params = array();
// Put as many sets of arguments as you need;
// use whatever method suits you best to produce them
$params[] = array('key1', 'value1');
$params[] = array('key2', 'value2');
$params[] = array('key3', 'value3');
//
// Do the multiple filtering
$site = new Site();
$result = $site->filterByMultiple($params);
//
// The code that does the actual filtering
class Site {
    public function filterByMultiple(array $params) {
        $result = $this;
        foreach ($params as list($key, $value)) {
            $result = $result->filterBy($key, $value);
        }
        return $result;
    }
}
If filterBy() returns $this, then you don't need the working variable $result: call $this->filterBy($key, $value) in the loop and return $this at the end, removing the other occurrences of $result.
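For illustration, a sketch of that variant, under the assumption that filterBy() filters the object in place and returns $this:

class Site {
    public function filterByMultiple(array $params) {
        // Assumes filterBy() mutates this object and returns $this.
        foreach ($params as list($key, $value)) {
            $this->filterBy($key, $value);
        }
        return $this;
    }
}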

xampp crashes when many simultaneous API requests are made

I'm making an application which takes in a user's tweets using the Twitter API, and one component of it is performing sentiment extraction from the tweet texts. For development I'm using XAMPP, with the Apache HTTP Server as my workspace, and Eclipse for PHP as an IDE.
For the sentiment extraction I'm using the uClassify Sentiment Classifier. The Classifier uses an API to receive a number of requests and with each request it sends back XML data from which the sentiment values can be parsed.
Now the application may process a large number of tweets (the maximum allowed is 3200) at once. For example, if there are 3200 tweets, the system will send 3200 API calls at once to this Classifier. Unfortunately the system does not scale well at that volume, and in fact XAMPP crashes after running with these calls for a short while. However, with a modest number of tweets (for example 500) the system works fine, so I am assuming the problem is due to the large number of API calls. It may help to note that uClassify allows a maximum of 5000 API calls per day, but since at most 3200 calls are made, I am pretty sure that limit is not being exceeded.
This is pretty much my first time working on this kind of web development, so I may be making a rookie mistake here; I am not sure what I could be doing wrong or where to start looking. Any advice/insight will help a lot!
EDIT: added source code in question
Update index method
function updateIndex($timeline, $connection, $user_handle, $json_index, $most_recent) {
    // URL arrays for uClassify API calls
    $urls = [];
    $urls_id = [];
    // halt if no more new tweets are found
    $halt = false;
    // set to 1 to skip first tweet after 1st batch
    $j = 0;
    // count number of new tweets indexed
    $count = 0;
    while ( (count($timeline) != 1 || $j == 0) && $halt == false ) {
        $no_of_tweets_in_batch = 0;
        $n = $j;
        while ( ($n < count($timeline)) && $halt == false ) {
            $tweet_id = $timeline[$n]->id_str;
            if ($tweet_id > $most_recent) {
                $text = $timeline[$n]->text;
                $tokens = parseTweet($text);
                $coord = extractLocation($timeline, $n);
                addSentimentURL($text, $tweet_id, $urls, $urls_id);
                $keywords = makeEntry($tokens, $tweet_id, $coord, $text);
                foreach ($keywords as $type) {
                    $json_index[] = $type;
                }
                $n++;
                $no_of_tweets_in_batch++;
            } else {
                $halt = true;
            }
        }
        if ($halt == false) {
            $tweet_id = $timeline[$n - 1]->id_str;
            $timeline = $connection->get('statuses/user_timeline', array(
                'screen_name' => $user_handle,
                'count' => 200,
                'max_id' => $tweet_id
            ));
            // skip 1st tweet after 1st batch
            $j = 1;
        }
        $count += $no_of_tweets_in_batch;
    }
    $json_index = extractSentiments($urls, $urls_id, $json_index);
    echo 'Number of tweets indexed: ' . $count;
    return $json_index;
}
extract sentiment method
function extractSentiments($urls, $urls_id, &$json_index) {
    $responses = multiHandle($urls);
    // add sentiments to all index entries
    foreach ($json_index as $i => $term) {
        $tweet_id = $term['tweet_id'];
        foreach ($urls_id as $j => $id) {
            if ($tweet_id == $id) {
                $sentiment = parseSentiment($responses[$j]);
                $json_index[$i]['sentiment'] = $sentiment;
            }
        }
    }
    return $json_index;
}
Method for handling multiple API calls
This is where the uClassify API calls are being processed at once:
function multiHandle($urls) {
    // curl handles
    $curls = array();
    // results returned in xml
    $xml = array();
    // init multi handle
    $mh = curl_multi_init();
    foreach ($urls as $i => $d) {
        // init curl handle
        $curls[$i] = curl_init();
        $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
        // set url to curl handle
        curl_setopt($curls[$i], CURLOPT_URL, $url);
        // on success, return actual result rather than true
        curl_setopt($curls[$i], CURLOPT_RETURNTRANSFER, 1);
        // add curl handle to multi handle
        curl_multi_add_handle($mh, $curls[$i]);
    }
    // execute the handles
    $active = null;
    do {
        curl_multi_exec($mh, $active);
    } while ($active > 0);
    // get xml and flush handles
    foreach ($curls as $i => $ch) {
        $xml[$i] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
    }
    // close multi handle
    curl_multi_close($mh);
    return $xml;
}
The problem is with giving curl too many URLs in one go. I am surprised you can manage 500 in parallel, as I've seen people complain of problems with even 200. This guy has some clever code to run just 100 at a time and add the next one each time one finishes, but I notice he edited it down to just 5 at a time.
I just noticed the author of that code released an open source library around this idea, so I think this is the solution for you: https://github.com/joshfraser/rolling-curl
As to why you get a crash, a comment on this question suggests the cause might be reaching the maximum number of OS file handles: What is the maximum number of cURL connections set by? Other suggestions are simply that you are using a lot of bandwidth, CPU and memory. (If you are on Windows, opening Task Manager should let you see if this is the case; on Linux, use top.)
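As a rough illustration of the batching idea (not the rolling-curl library itself), here is a minimal sketch; the function name multiHandleBatched and the chunk size of 50 are my own choices to tune. It processes the URLs in fixed-size chunks and waits on curl_multi_select() between transfers instead of spinning:

// Sketch only: run at most $batchSize parallel transfers at a time.
function multiHandleBatched(array $urls, $batchSize = 50) {
    $results = array();
    // array_chunk with preserve_keys keeps result indexes aligned with $urls.
    foreach (array_chunk($urls, $batchSize, true) as $chunk) {
        $mh = curl_multi_init();
        $handles = array();
        foreach ($chunk as $i => $url) {
            $handles[$i] = curl_init($url);
            curl_setopt($handles[$i], CURLOPT_RETURNTRANSFER, 1);
            curl_multi_add_handle($mh, $handles[$i]);
        }
        // Drive the transfers, sleeping until there is activity instead of busy-looping.
        $active = null;
        do {
            curl_multi_exec($mh, $active);
            if ($active && curl_multi_select($mh, 1.0) === -1) {
                usleep(100000);
            }
        } while ($active > 0);
        foreach ($handles as $i => $ch) {
            $results[$i] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
    }
    return $results;
}

Because the keys are preserved, the returned array lines up with $urls_id, so extractSentiments() could be left as it is.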

How to run file_get_contents() in a loop?

I'm trying to run file_get_contents() in a loop. But when executing this script:
for ( $page_number = 1; $page_number <= $pages_count; $page_number++ )
{
    $url = $this->makeURL( $limit, $category->getSources()[0]->getSourceId(), $subCat->getSources()[0]->getSourceId() );
    echo $url . "</br>"; // Returns the correct URL
    $json = json_decode( file_get_contents( $url ) );
    echo "<pre>";
    print_r( $json ); // Only returns the proper data the first time, then always the same, whatever the URL
    echo "</pre>";
    //$this->insert( $json, $subCat );
    $limit = $limit + 10;
}
The response I get from file_get_contents() doesn't match the URL called (the parameters change during the loop, like pagination, but I always get the first page for no reason). Even though the URL is correct, it doesn't seem to call that page and always returns the same results. However, when I copy and paste the URL from the echo output into my browser's address bar, I get the right results.
I have the feeling I'm missing something with file_get_contents(), maybe something to clear the previous call.
EDIT: This is the makeURL()
public function makeURL( $limit, $mainCat, $subCat )
{
    $pagi_start = $limit;
    $url = "http://some.api/V3/event_api/getEvents.php?"
         . "&cityId="     . "39"
         . "&subcat="     . $subCat
         . "&cat="        . $mainCat
         . "&link="       . "enable"
         . "&tags="       . "enable"
         . "&strip_html=" . "name,description"
         . "&moreInfo="   . "popularity,fallbackimage,artistdesc"
         . "&tz="         . "America/Los_ Angeles"
         . "&limit="      . $pagi_start . ",10";
    return $url;
}
Without fuller code it is hard to know exactly what is happening, but I have two ideas based on past experience.
First, perhaps the server is simply unable to handle too many requests at once, so I would suggest adding a sleep() call in your script to pause between requests and give the server a chance to catch up.
for ( $page_number = 1; $page_number <= $pages_count; $page_number++ )
{
    $url = $this->makeURL( $limit, $category->getSources()[0]->getSourceId(), $subCat->getSources()[0]->getSourceId() );
    echo $url . "</br>";
    $json = json_decode( file_get_contents( $url ) );
    // Adding a 'sleep()' command with a 10 second pause.
    sleep(10);
    echo "<pre>";
    print_r( $json );
    echo "</pre>";
    //$this->insert( $json, $subCat );
    $limit = $limit + 10;
}
The other idea is that perhaps the server you are trying to connect to is blocking requests from scripts. What happens if you go to the command line and run the following?
curl "http://some.api/V3/event_api/getEvents.php?[all parameters here]"
Or even check the headers with the curl -I option like this:
curl -I "http://some.api/V3/event_api/getEvents.php?[all parameters here]"
EDIT: Looking at your makeURL() function shows me another glaring issue. Pretty sure you should be using urlencode() on the values.
This is how I would recode your function:
public function makeURL( $limit, $mainCat, $subCat )
{
    $pagi_start = $limit;

    // Set the param values.
    $param_array = array();
    $param_array['cityId']     = 39;
    $param_array['subcat']     = $subCat;
    $param_array['cat']        = $mainCat;
    $param_array['link']       = "enable";
    $param_array['tags']       = "enable";
    $param_array['strip_html'] = "name,description";
    $param_array['moreInfo']   = "popularity,fallbackimage,artistdesc";
    $param_array['tz']         = "America/Los_ Angeles";
    $param_array['limit']      = $pagi_start . ",10";

    // Now roll through the param values and urlencode them.
    $param_array_urlencoded = array();
    foreach ($param_array as $param_key => $param_value) {
        $param_array_urlencoded[$param_key] = urlencode($param_value);
    }

    // Create the final param array.
    $param_array_final = array();
    foreach ($param_array_urlencoded as $final_param_key => $final_param_value) {
        $param_array_final[] = $final_param_key . "=" . $final_param_value;
    }

    // Create the final URL with the imploded `$param_array_final`.
    $url = "http://some.api/V3/event_api/getEvents.php?" . implode("&", $param_array_final);
    return $url;
}
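As a side note, PHP's built-in http_build_query() performs the urlencode-and-implode steps in one call, so a shorter sketch of the same function could be:

public function makeURL( $limit, $mainCat, $subCat )
{
    // http_build_query() urlencodes each value and joins the pairs with '&'.
    $params = array(
        'cityId'     => 39,
        'subcat'     => $subCat,
        'cat'        => $mainCat,
        'link'       => 'enable',
        'tags'       => 'enable',
        'strip_html' => 'name,description',
        'moreInfo'   => 'popularity,fallbackimage,artistdesc',
        'tz'         => 'America/Los_ Angeles',
        'limit'      => $limit . ',10',
    );
    return "http://some.api/V3/event_api/getEvents.php?" . http_build_query($params);
}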

How to improve performance iterating a DOMDocument?

I'm using cURL to pull a webpage from a server. I pass it to Tidy and throw the output into a DOMDocument. Then the trouble starts.
The webpage contains about three thousand (yikes) table tags, and I'm scraping data from them. There are two kinds of tables, where one or more type B follow a type A.
I've profiled my script using microtime(true) calls. I've placed calls before and after each stage of my script and subtracted the times from each other. So, if you'll follow me through my code, I'll explain it, share the profile results, and point out where the problem is. Maybe you can even help me solve the problem. Here we go:
First, I include two files. One handles some parsing, and the other defines two "data structure" classes.
// Imports
include('./course.php');
include('./utils.php');
The includes are inconsequential as far as I know, so let's proceed to the cURL request.
// Execute cURL
$response = curl_exec($curl_handle);
I've configured cURL to not time out, and to post some header data, which is required to get a meaningful response. Next, I clean up the data to prepare it for DOMDocument.
// Run about 25 str_replace calls here, to clean up
// then run tidy.
$html = $response;
//
// Prepare some config for tidy
//
$config = array(
    'indent' => true,
    'output-xhtml' => true,
    'wrap' => 200);
//
// Tidy up the HTML
//
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();
$html = $tidy;
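The step that loads the tidied markup into $dom isn't shown above; a minimal sketch of it, assuming the tidy object is simply cast to a string, might look like this:

// Load the tidied XHTML into a DOMDocument; suppress libxml warnings
// about any markup Tidy could not fully repair.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML((string) $html);
libxml_clear_errors();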
Up until now, the code has taken about nine seconds. Considering this to be a cron job, running infrequently, I'm fine with that. However, the next part of the code really barfs. Here's where I take what I want from the HTML and shove it into my custom classes. (I plan to stuff this into a MySQL database too, but this is a first step.)
// Get all of the tables in the page
$tables = $dom->getElementsByTagName('table');
// Create a buffer for the courses
$courses = array();
// Iterate
$numberOfTables = $tables->length;
for ($i = 1; $i < $numberOfTables; $i++) {
    $sectionTable = $tables->item($i);
    $courseTable = $tables->item($i - 1);
    // We've found a course table, parse it.
    if (elementIsACourseSectionTable($sectionTable)) {
        $course = courseFromTable($courseTable);
        $course = addSectionsToCourseUsingTable($course, $sectionTable);
        $courses[] = $course;
    }
}
For reference, here are the utility functions that I call:
//
// Tell us if a given element is
// a course section table.
//
function elementIsACourseSectionTable(DOMElement $element){
    $tableHasClass = $element->hasAttribute('class');
    $tableIsCourseTable = $element->getAttribute("class") == "coursetable";
    return $tableHasClass && $tableIsCourseTable;
}

//
// Takes a table and parses it into an
// instance of the Course class.
//
function courseFromTable(DOMElement $table){
    $secondRow = $table->getElementsByTagName('tr')->item(1);
    $cells = $secondRow->getElementsByTagName('td');
    $course = new Course;
    $course->startDate = valueForElementInList(0, $cells);
    $course->endDate = valueForElementInList(1, $cells);
    $course->name = valueForElementInList(2, $cells);
    $course->description = valueForElementInList(3, $cells);
    $course->credits = valueForElementInList(4, $cells);
    $course->hours = valueForElementInList(5, $cells);
    $course->division = valueForElementInList(6, $cells);
    $course->subject = valueForElementInList(7, $cells);
    return $course;
}

//
// Takes a table row and parses it into an
// instance of the Section class.
//
function sectionFromRow(DOMElement $row){
    $cells = $row->getElementsByTagName('td');
    //
    // Skip any row with a single cell
    //
    if ($cells->length == 1) {
        return NULL;
    }
    //
    // Skip header rows
    //
    if (valueForElementInList(0, $cells) == "Section" || valueForElementInList(0, $cells) == "") {
        return NULL;
    }
    $section = new Section;
    $section->section = valueForElementInList(0, $cells);
    $section->code = valueForElementInList(1, $cells);
    $section->openSeats = valueForElementInList(2, $cells);
    $section->dayAndTime = valueForElementInList(3, $cells);
    $section->instructor = valueForElementInList(4, $cells);
    $section->buildingAndRoom = valueForElementInList(5, $cells);
    $section->isOnline = valueForElementInList(6, $cells);
    return $section;
}

//
// Takes a table containing course sections
// and puts the parsed results into a
// given course object.
//
function addSectionsToCourseUsingTable(Course $course, DOMElement $table){
    $rows = $table->getElementsByTagName('tr');
    $numRows = $rows->length;
    for ($i = 0; $i < $numRows; $i++) {
        $section = sectionFromRow($rows->item($i));
        // Make sure we have an array to put sections into
        if (is_null($course->sections)) {
            $course->sections = array();
        }
        // Skip "meta" rows, since they're not really sections
        if (is_null($section)) {
            continue;
        }
        $course->addSection($section);
    }
    return $course;
}
//
// Returns the text from a cell
// at a given index in a node list.
//
function valueForElementInList($index, $list){
    $value = $list->item($index)->nodeValue;
    $value = trim($value);
    return $value;
}
This code takes 63 seconds. That's over a minute for a PHP script to pull data from a webpage. Sheesh!
I've been advised to split up the workload of my main work loop, but considering the homogenous nature of my data, I'm not entirely sure how. Any suggestions on improving this code are greatly appreciated.
What can I do to improve my code execution time?
It turns out that my loop is terribly inefficient.
Using a foreach cut time in half to about 31 seconds. But that wasn't fast enough. So I reticulated some splines and did some brainstorming with about half of the programmers that I know how to poke online. Here's what we found:
DOMNodeList's item() accessor is linear in the index, so iterating over the whole list with it makes the loop's total cost quadratic. Removing the first element after each iteration makes the loop faster, because we then always access the first element of the list. This brought me down to 8 seconds.
After playing some more, I realized that the ->length property of DOMNodeList is just as bad as item(), since it also incurs linear cost. So I changed my for loop to this:
$table = $tables->item(0);
while ($table != NULL) {
    $table = $tables->item(0);
    if ($table === NULL) {
        break;
    }
    //
    // We've found a section table, parse it.
    //
    if (elementIsACourseSectionTable($table)) {
        $course = addSectionsToCourseUsingTable($course, $table);
    }
    //
    // Skip the last table if it's not a course section
    //
    else if (elementIsCourseHeaderTable($table)) {
        $course = courseFromTable($table);
        $courses[] = $course;
    }
    //
    // Remove the first item from the list
    //
    $first = $tables->item(0);
    $first->parentNode->removeChild($first);
    //
    // Get the next table to parse
    //
    $table = $tables->item(0);
}
Note that I've done some other optimizations in terms of targeting the data I want, but the relevant part is how I handle progressing from one item to the next.
