Using MediaWiki API with the Continue Command


I need some help using the MediaWiki API with the "continue" or "query-continue" command to pull information from my wiki articles. I have a large number of wiki articles (more than 800 currently) and I need to use the API to pull them in batches of 50 and then print out sections.
My API call works properly:
//Stackoverflow making me use a valid URL here, this api is actually my own localhost server
http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=a&apto=z&apnamespace=0&format=xml&aplimit=50
I am querying all pages, therefore "apfrom" and "apto".
I just need help processing the code with PHP and CURL accessing the API and processing the batches of 50 and using the "continue" to access more records until I hit the end. So far my php code is:
//the CURL commands here work and outputs a data set but only for the first 50 records, so I need to call "continue" to get to the end.
//My api url is localhost but I'm forced to use a valid URL by Stackoverflow.com
$url = sprintf('http://en.wikipedia.org/w/api.php?
action=query&list=allpages&apfrom=a&apto=z&apnamespace=0&format=xml&aplimit=50');
$ch=curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'My site');
$res = curl_exec($ch);
$continue = '';
while ( // I don't know what to set here as true to get the while loop going, maybe continue = true? maybe set query-continue as true?)
{
//Maybe I need something other than $res['query-continue]??
if (empty($res['query-continue']))
{
exit;
}
else
{
$continue = '&apcontinue='.urlencode($res['query-continue']);
foreach ($res['query']['allpages'] as $v)
{
echo $v['title'];
}
}
}
Can someone correct my while loop code above so I can do a simple print out of the title from each wiki article in the loop? I've done a lot of searching online but I'm stuck!! I found a python loop example at http://www.mediawiki.org/wiki/API:Query but I have to do it in PHP. And I am not sure if I call continue or query-continue.

As svick said, please use a client library which handles continuation for you.
The query continuation mechanism has changed multiple times in MediaWiki, so you don't want to have to understand it, let alone rely on it.
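If you still want to see roughly what such a library does under the hood, here is a minimal sketch. It assumes a reasonably recent MediaWiki that returns a top-level "continue" block (older versions used "query-continue" instead) and that you request format=json rather than xml. The names allPageTitles and curlFetch and the localhost endpoint are made up for illustration, so treat this as a starting point, not as the official way to do it.

```php
<?php
// Sketch: walk every batch of list=allpages by feeding the "continue"
// block of each response back into the next request.
// $fetch is a hypothetical callable: it takes the query parameters and
// returns the decoded JSON response as an array.
function allPageTitles(callable $fetch, array $params): array
{
    $titles = [];
    $continue = [];
    do {
        $response = $fetch(array_merge($params, $continue));
        foreach ($response['query']['allpages'] ?? [] as $page) {
            $titles[] = $page['title'];
        }
        // The "continue" block is present only while more results remain.
        $continue = $response['continue'] ?? [];
    } while ($continue !== []);

    return $titles;
}

// Hypothetical cURL-backed fetcher; the endpoint you pass in is an assumption.
function curlFetch(string $endpoint): callable
{
    return function (array $params) use ($endpoint): array {
        $ch = curl_init($endpoint . '?' . http_build_query($params));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_USERAGENT, 'My site');
        $body = curl_exec($ch);
        if ($body === false) {
            throw new RuntimeException('Request failed: ' . curl_error($ch));
        }
        $data = json_decode($body, true);
        return is_array($data) ? $data : [];
    };
}

// Example call (not executed here):
// $titles = allPageTitles(curlFetch('http://localhost/w/api.php'), [
//     'action' => 'query', 'list' => 'allpages', 'apnamespace' => 0,
//     'aplimit' => 50, 'format' => 'json',
// ]);
// foreach ($titles as $title) { echo $title, "\n"; }
```

Passing the fetcher in as a callable keeps the loop itself independent of cURL, which makes it easy to try out with canned responses before pointing it at a real wiki.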

Related

How to get json data without api

I'm trying to get the viewer count so I can check if a streamer is online on https://www.dlive.tv/. If you view the page source on a streamer's page (https://www.dlive.tv/thelongestchain), there's a bunch of json and "watchingCount" is there.
Basically, I want to have the streamer appear on the "Live Now" section of my site if their viewer count is 1 or more, but I can't figure out any way to get the viewer count. I know I could use something like Selenium if I was using Python and could run it from my PC, but I need the site itself to know it.
DLive doesn't have an api yet, so I don't know how to make a call(or request I don't know the terminology) to get this info. When I look in the network tab on chrome I see that there's a call (https://graphigo.prd.dlive.tv/) that provides stream info I think. Would I also need my authkey?
I realize this question is broad and all over the place but it's because so am I with me trying to solve this the last couple days. If I had the viewercount as a variable, I know how to display the streamer on the "Live Now" section of my site, I just don't know how to get the necessary data.
If there's another way I should be checking if a streamer is online or offline other than getting the viewercount, that would work too. If anyone could help me out I would greatly appreciate it, thanks.
I tried scraping the page but I don't think you can scrape dynamic content. When I tried to use SimpleHTMLDom it just returned static elements.
<?php
require 'simple_html_dom.php';
$html = file_get_html('https://www.dlive.tv/thelongestchain');
if(($html->find('video', 0))) {
echo 'online';
}else{
echo 'offline';
}
/* The video element is only on the page if the streamer is live, but it doesn't return because it's not static I presume */
?>
I have no idea at all how to go about making a call/request to get the json data for the viewer count, or how to get any other data that could check if a streamer is online. All the scraping I've done did not return any elements that weren't static (the same no matter if the streamer was online or offline).

Try cURL, it's like a magic wand.
This will return the entire page, and I think including the JSON you're looking for:
<?php
$curl = curl_init('https://www.dlive.tv/thelongestchain');
curl_setopt($curl, CURLOPT_FAILONERROR, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$result = curl_exec($curl);
var_dump($result); // you do as you need to here
curl_close($curl);
?>
Here's the <script> containing the data I believe you need. Assuming "watchingCount" is the same thing you're looking for?
<script>window.__INITIAL_STATE__={"version":"1.2.0","snackbar":{"snackbar":null},"accessToken":{"accessToken":null},"userMeta":{"userMeta":{"fingerprint":null,"referrer":{"isFirstTime":true,"streamer":null,"happyHour":null,"user":null},"ipStats":null,"ip":"34.216.252.62","langCode":"en","trackingInfo":{"rank":"not available","prevPage":"not available","postStatus":"not available"},"darkMode":true,"NSFW":false,"prefetchID":"e278e744-f522-480e-a290-8eed0fe83b07","cashinEmail":""}},"me":{"me":null},"dialog":{"dialog":{"login":false,"subscribe":false,"cashIn":false,"chest":false,"chestWinners":false,"downloadApp":false}},"tabs":{"tabs":{"livestreamMobileActiveTab":"tab-chat"}},"globalInfo":{"globalInfo":{"languages":[],"communityUpdates":[],"recommendChannels":[]}},"ui":{"ui":{"viewPointWidth":1920,"mq":4,"isMobile":false}}};(function(){var s;(s=document.currentScript||document.scripts[document.scripts.length-1]).parentNode.removeChild(s);}());</script> <script>window.__APOLLO_STATE__={"defaultClient":{"user:dlive-00431789":{"id":"user:dlive-00431789","avatar":"https:\u002F\u002Fimages.prd.dlivecdn.com\u002Favatar\u002F5c1330d8-5bc8-11e9-ab17-865634f95b6b","__typename":"User","displayname":"thelongestchain","partnerStatus":"NONE","username":"dlive-00431789","canSubscribe":false,"subSetting":null,"followers":{"type":"id","generated":true,"id":"$user:dlive-00431789.followers","typename":"UserConnection"},"livestream":{"type":"id","generated":false,"id":"livestream:dlive-00431789+i7rCywMWg","typename":"Livestream"},"hostingLivestream":null,"offlineImage":"https:\u002F\u002Fimages.prd.dlivecdn.com\u002Fofflineimage\u002Fvideo-placeholder.png","banStatus":"NO_BAN","deactivated":false,"about":"#lovejonah\n\nJonah's NEW FRIENDS:\nhttps:\u002F\u002Fdlive.tv\u002FFlamenco https:\u002F\u002Fi.gyazo.com\u002F88416fca5047381105da289faba60e7c.png\nhttps:\u002F\u002Fdlive.tv\u002FHamsterSamster 
https:\u002F\u002Fi.gyazo.com\u002F984b19f77a1de5e3028e42ccd71052a0.png\nhttps:\u002F\u002Fdlive.tv\u002Fjayis4justice \nhttps:\u002F\u002Fdlive.tv\u002FDenomic\nhttps:\u002F\u002Fdlive.tv\u002FCutie\nhttps:\u002F\u002Fdlive.tv\u002FTruly_A_No_Life\n\n\n\n\n\n\n\n\n\n\n\n\nOur Socials:\nhttps:\u002F\u002Fwww.twitch.tv\u002Fthelongestchain\nhttps:\u002F\u002Fdiscord.gg\u002Fsagd68Z\n\nLINO website: https:\u002F\u002Flino.network\nLINO Whitepaper: https:\u002F\u002Fdocsend.com\u002Fview\u002Fy9qtwb6\nLINO Tracker : https:\u002F\u002Ftracker.lino.network\u002F#\u002F\nLINO Discord : https:\u002F\u002Fdiscord.gg\u002FTUxp3ww\n\nThe Legend of Lemon's: https:\u002F\u002Fbubbl.us\u002FNTE1OTA4MS85ODY3ODQwL2M0Y2NjNjRlYmI0ZGNkNDllOTljNDMxODExNjFmZDRk-X?utm_source=shared-link&utm_medium=link&s=9867840\n\nPC:\nAMD FX 6core 3.0ghz ddr3\n12GB RAM HyperFury X Blue ddr3\nCooler Master Hyper 6heatpipe cpu cooler\nGigabyte MB\n2 x EVGA 1070 FTW\nKingston SSD 120gb\nKingston SSD 240GB\nREDDRAGON Keyboard\nREDDRAGON Mouse\nBlack Out Blue Yeti Microphone\nLogitech C922\n\nApps Used:\nBig Trades Tracker: https:\u002F\u002Ftucsky.github.io\u002FSignificantTrades\u002F#\nMultiple Charts: 
\nhttps:\u002F\u002Fcryptotrading.toys\u002Fcrypto-panel\u002F\nhttps:\u002F\u002Fcryptowatch.net\n\n\n\n","treasureChest":{"type":"id","generated":true,"id":"$user:dlive-00431789.treasureChest","typename":"TreasureChest"},"videos":{"type":"id","generated":true,"id":"$user:dlive-00431789.videos","typename":"VideoConnection"},"pastBroadcasts":{"type":"id","generated":true,"id":"$user:dlive-00431789.pastBroadcasts","typename":"PastBroadcastConnection"},"following":{"type":"id","generated":true,"id":"$user:dlive-00431789.following","typename":"UserConnection"}},"$user:dlive-00431789.followers":{"totalCount":1000,"__typename":"UserConnection"},"livestream:dlive-00431789+i7rCywMWg":{"id":"livestream:dlive-00431789+i7rCywMWg","totalReward":"3243600","watchingCount":5,"permlink":"dlive-00431789+i7rCywMWg","title":"bybit 0.1eth HIGH LEVERAGE","content":"","category":{"type":"id","generated":false,"id":"category:11455","typename":"Category"},"creator":{"type":"id","generated":false,"id":"user:dlive-00431789","typename":"User"},"__typename":"Livestream","language":{"type":"id","generated":false,"id":"language:1","typename":"Language"},"watchTime({\"add\":false})":true,"disableAlert":false},"category:11455":{"id":"category:11455","backendID":11455,"title":"Cryptocurrency","__typename":"Category","imgUrl":"https:\u002F\u002Fimages.prd.dlivecdn.com\u002Fcategory\u002FCBAOENLDK"},"language:1":{"id":"language:1","language":"English","__typename":"Language"},"$user:dlive-00431789.treasureChest":{"value":"2144482","state":"COLLECTING","ongoingGiveaway":null,"__typename":"TreasureChest","expireAt":"1560400949000","buffs":[],"startGiveawayValueThreshold":"500000"},"$user:dlive-00431789.videos":{"totalCount":0,"__typename":"VideoConnection"},"$user:dlive-00431789.pastBroadcasts":{"totalCount":13,"__typename":"PastBroadcastConnection"},"$user:dlive-00431789.following":{"totalCount":41,"__typename":"UserConnection"},"ROOT_QUERY":{"userByDisplayName({\"displayname\":\"thelongestchain\"})"
:{"type":"id","generated":false,"id":"user:dlive-00431789","typename":"User"}}}};(function(){var s;(s=document.currentScript||document.scripts[document.scripts.length-1]).parentNode.removeChild(s);}());</script>
I assume you'll then just have to throw in a loop and make the url dynamic to get through whatever streamers you're monitoring with your site.
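Once you have the page source as a string, pulling the number out could look something like the sketch below. It assumes the page keeps embedding its state as JSON with a "watchingCount" key, as in the dump above. The function names are made up, and treating a missing key as "offline" is my assumption, not something DLive documents.

```php
<?php
// Returns the first "watchingCount" value found in the page source,
// or null when the key is absent.
function watchingCount(string $html): ?int
{
    if (preg_match('/"watchingCount":\s*(\d+)/', $html, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

// Assumption: a streamer counts as live with at least one viewer,
// and a missing count is treated as offline.
function isLive(string $html): bool
{
    return (watchingCount($html) ?? 0) >= 1;
}
```

You would pass in the $result string from the cURL call above. A regex is brittle if the page layout changes, so decoding the embedded JSON properly may be the safer route in the long run.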

How do you submit a PHP form that doesn't return results immediately using Python?

There is a PHP form which queries a massive database. The URL for the form is https://db.slickbox.net/venues.php. It takes up to 10 minutes after the form is sent for results to be returned, and the results are returned inline on the same page. I've tried using Requests, URLLib2, LXML, and Selenium but I cannot come up with a solution using any of these libraries. Does anyone know of a way to retrieve the page source of the results after submitting this form?
If you know of a solution for this, for the sake of testing just fill out the name field ("vname") with the name of any store/gas station that comes to mind. Ultimately, I need to also set the checkboxes with the "checked" attribute but that's a subsequent goal after I get this working. Thank you!

I usually rely on cURL for this kind of thing.
Instead of submitting the form with the button to retrieve the source, call the response page directly (passing it your request).
As I work in PHP, it's quite easy to do this. With Python, you will need pycURL to do the same thing.
So the only thing to do is to call venues.php with the right argument values, sent with the POST method through cURL.
This way, you will need to prepare your request (country code, cat name), but you won't need to check the checkbox or load the website page in your browser.
ini_set('max_execution_time', 1200); // allow the script to run for up to 20 minutes
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "https://db.slickbox.net/venues.php");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page instead of printing it directly
// prepare arguments for the form
$data = array('adlock' => 1, 'age' => 0,'country' => 145,'imgcnt'=>0, 'lock'=>0,'regex'=>1,'submit'=>'Search','vname'=>'test');
//add arguments to our request
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
//launch request
$result = curl_exec($ch);
if ($result === false)
{
trigger_error(curl_error($ch));
}
echo $result;

How about ghost?
from ghost import Ghost
ghost = Ghost()
with ghost.start() as session:
    page, extra_resources = session.open("https://db.slickbox.net/venues.php", wait_onload_event=True)
    ghost.set_field_value("input[name=vname]", "....")
    # Any other values
    page.fire_on('form', 'submit')
    page, resources = ghost.wait_for_page_loaded()
    content = session.content # or page.content I forgot which
Afterwards you can use BeautifulSoup to parse the HTML, or Ghost may have some rudimentary utilities to do that.

How to collect HTML source response from a remote server?

From within the HTML code in one of my server pages I need to address a search of a specific item on a database placed in another remote server that I don’t own myself.
Example of the search type that performs my request: http://www.remoteserver.com/items/search.php?search_size=XXL
The remote server provides to me - as client - the response displaying a page with several items that match my search criteria.
I don’t want to have this page displayed. What I want is to collect into a string (or local file) the full contents of the remote server HTML response (the code we have access when we click on ‘View Source’ in my IE browser client).
If I collect that data (it could easily reach 50000 bytes) I can then filter out the part in which I am interested (substrings) and assemble a new request to the remote server for only one of the specific items in the response provided.
Is there any way through which I can get HTML from the response provided by the remote server with Javascript or PHP, and also avoid the display of the response in the browser itself?
I hope I have not confused your minds …
Thanks for any help you may provide.

As @mario mentioned, there are several different ways to do it.
Using file_get_contents():
$txt = file_get_contents('http://www.example.com/');
echo $txt;
Using php's curl functions:
$url = 'http://www.mysite.com';
$ch = curl_init($url);
// Tell curl_exec to return the text instead of sending it to STDOUT
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
// Don't include return header in output
curl_setopt($ch, CURLOPT_HEADER, 0);
$txt = curl_exec($ch);
curl_close($ch);
echo $txt;
curl is probably the most robust option because it gives you more control over the exact request parameters and lets you handle errors when things don't go as planned.
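To illustrate the error-handling point, here is a small sketch of the curl variant that fails loudly instead of silently returning false. The function name fetchHtml and the 30 second default timeout are just picked for the example.

```php
<?php
// Fetches a URL and returns the response body as a string.
// Throws if the transfer fails or the server answers with HTTP 400 or above.
function fetchHtml(string $url, int $timeoutSeconds = 30): string
{
    $ch = curl_init($url);
    // Return the body from curl_exec instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Treat HTTP error status codes as failures
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    // Give up after this many seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    return $html;
}
```

The returned string can then be searched with the usual string functions (strpos, preg_match and so on) before you build the follow-up request.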

Looping of a function is so time consuming

I have a PHP function that parses an XML URL and gives me an array. This function takes a particular id, passed in the form, and gives all the information related to that id. Now I have 20 different ids and I am passing these ids to the function using a foreach loop like below:
<?php
$relatedSlides = $result['RelatedSlideshows'];
if(!empty($relatedSlides)){
$k=1;
foreach($relatedSlides as $Related){
RelatedSlides($Related);
if($k%6==0){
echo '</tr><tr>';
}
$k++;
}
}
?>
This is the foreach loop. $relatedSlides is an array of all slide id's. Now I am writing the function that parses the information about a particular id.
function RelatedSlides($slideId){
$secret_key = 'my api key';
$ts=time();
$hash=sha1($secret_key.$ts);
$key = 'my secret key';
$url = 'http://www.slideshare.net/api/2/get_slideshow?api_key='.$key.'&ts='.$ts.'& hash='.$hash.'&slideshow_id='.$slideId.'&detailed=1';
echo $url;
$ch=curl_init() or die (curl_error());
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla Firefox');
$query = curl_exec($ch);
$errorCode = curl_errno($ch);
curl_close($ch);
$array = (array) simplexml_load_string($query);
//echo '<pre>';
//print_r($array);
//return $array;
echo "<font size=\"18\">return code is ".$errorCode."</font>";
echo '<td valign="top"><div id="slide_thumb"><img src=" '.$array['ThumbnailURL'].'" width="100" height="100"/></div><div id="slide_thum_des"><strong>Views:</strong>'.$array['NumViews'].'<br />'.$array['Title'].'....</div></td>';
}
When I call this function my connection times out every time. The function is absolutely correct. It gives all data about a particular id but when I run it in a foreach loop for many id's, "connection has been reset" or "connection timed out" displays.

You could try a couple of things:
Set up your curl handle outside of the RelatedSlides() function. This way you don't have to keep building and tearing down the $ch resource every iteration.
Check the slideshare.net API and see if there are params you can pass to pull down smaller files.
As Luke wisely mentioned, you could make the page asynchronous, meaning you can render the page with 6 tiles, then have each tile make an ajax call for the slide you want. This way at least the user gets to see something while the tiles load, as opposed to being 'hung up' while you pull all the images at once.
I trust slideshare has a pretty robust cdn hosting these images, you may want to see if they have servers closer to your web server.
Quick question, is the curl option how slideshare.net suggested you go about pulling images? Chances are you could just create an image tag with a link directly to their api:
echo '<img src="http://www.slideshare.net/api/2/get_slideshow?api_key='.$key.'&ts='.$ts.'&hash='.$hash.'&slideshow_id='.$slideId.'&detailed=1" />';
If you are doing the curl option for extended data, you may want to consider caching the extended data so you don't have to keep making the extraneous simplexml_load_string call.
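As a rough idea of what the caching could look like, here is a minimal sketch that remembers results per slideshow id. The name cachedLookup is made up, and $fetch is a stand-in for whatever does the real cURL request and XML parsing. Note that this cache only lives for the current request; to survive across page loads you would have to store the data somewhere persistent (a file, a database table, etc.).

```php
<?php
// Wraps a slow lookup in an in-memory cache keyed by slideshow id.
// $fetch is a hypothetical callable standing in for the cURL + XML call.
function cachedLookup(callable $fetch): callable
{
    $cache = [];
    return function ($slideId) use ($fetch, &$cache) {
        // Only hit the remote API the first time an id is requested
        if (!array_key_exists($slideId, $cache)) {
            $cache[$slideId] = $fetch($slideId);
        }
        return $cache[$slideId];
    };
}
```

In the loop from the question you would build the wrapper once before the foreach and call it with each id, so a repeated id costs no extra request.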

The timeout is due to your function taking its time, as you have said already. That is normal, and it can be adjusted in either the PHP config or Apache (I don't remember which; I would check the PHP config first). Remember that the timeout is there for a reason, e.g. it is good to time out when you run into an infinite loop, which is rare but possible.
I think one way to tackle this problem is to split it into parts and use AJAX to make individual calls that won't take as long.
E.g.:
Load the page with some JS/JQuery scripts.
Call async to get list of IDs (done by ajax call via jquery - the easiest)
Parse response (JSON?) on client side and do each request per each id async.
Wait for all results to come back and display them in a way you want.

How can I send GET data to multiple URLs at the same time using cURL?

My apologies, I've actually asked this question multiple times, but never quite understood the answers.
Here is my current code:
while($resultSet = mysql_fetch_array($SQL)){
$ch = curl_init($resultSet['url'] . $fullcurl); //load the urls and send GET data
curl_setopt($ch, CURLOPT_TIMEOUT, 2); //Only load it for two seconds (Long enough to send the data)
curl_exec($ch); //Execute the cURL
curl_close($ch); //Close it off
} //end while loop
What I'm doing here is taking URLs from a MySQL database ($resultSet['url']), appending some extra variables to them, just some GET data ($fullcurl), and simply requesting the pages. This starts the script running on those pages, and that's all that this script needs to do: start those scripts. It doesn't need to return any output. It just needs to load the page long enough for the script to start.
However, currently it's loading each URL (currently 11) one at a time. I need to load all of them simultaneously. I understand I need to use curl_multi_, but I haven't the slightest idea on how cURL functions work, so I don't know how to change my code to use curl_multi_ in a while loop.
So my questions are:
How can I change this code to load all of the URLs simultaneously? Please explain it and not just give me code. I want to know what each individual function does exactly. Will curl_multi_exec even work in a while loop, since the while loop is just sending each row one at a time?
And of course, any references, guides, or tutorials about cURL functions would be nice as well. Preferably not so much from php.net: while it does a good job of giving me the syntax, it's just a little dry and not so good with the explanations.
EDIT: Okay zaf, here is my current code as of now:
$mh = curl_multi_init(); //set up a cURL multiple execution handle
$SQL = mysql_query("SELECT url FROM urls") or die(mysql_error()); //Query the shell table
while($resultSet = mysql_fetch_array($SQL)){
$ch = curl_init($resultSet['url'] . $fullcurl); //load the urls and send GET data
curl_setopt($ch, CURLOPT_TIMEOUT, 2); //Only load it for two seconds (Long enough to send the data)
curl_multi_add_handle($mh, $ch);
} //No more shells, close the while loop
curl_multi_exec($mh); //Execute the multi execution
curl_multi_close($mh); //Close it when it's finished.

In your while loop, you need to do the following for each URL:
create a curl resource by using curl_init()
set options for resource by curl_setopt(..)
Then you need to create a multiple curl handle by using curl_multi_init() and adding all the previous individual curl resources by using curl_multi_add_handle(...)
Then finally you can do curl_multi_exec(...).
A good example can be found here: http://us.php.net/manual/en/function.curl-multi-exec.php
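Roughly, the steps above fit together as in the sketch below. The function name fetchAll and the timeout default are made up for the example. The part that usually trips people up is that curl_multi_exec() needs a second argument (a variable that receives the number of transfers still running) and has to be called in a loop until that number drops to zero.

```php
<?php
// Sends GET requests to every URL at the same time and returns the
// response bodies, keyed like the input array.
function fetchAll(array $urls, int $timeoutSeconds = 2): array
{
    if ($urls === []) {
        return [];
    }
    $multi = curl_multi_init();          // container that drives all transfers
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);           // one ordinary handle per URL
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Each call to curl_multi_exec only advances the transfers a little,
    // so keep calling it until nothing is running any more.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            // Wait for network activity instead of spinning the CPU
            curl_multi_select($multi, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

So the while loop over the database rows only collects the URLs (or adds the handles); the actual sending happens afterwards in the do/while loop, for all handles at once.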
