Quickest and most efficient way of retrieving an article's final URL and images - PHP

I've written a PHP script that parses an RSS feed and tries to get the Open Graph images from the og:image meta tags.
To get the images I need to check whether the URLs in the RSS feed are 301 redirects. This happens often, and it means I need to follow any redirects to the final URLs. That makes the script run really slowly. Is there a quicker, more efficient way of achieving this?
Here is the function that follows the redirects and retrieves the final page:
function curl_get_contents($url) {
    $agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow the 301s to the final URL
    $result = curl_exec($ch);
    curl_close($ch); // free the handle
    return $result;
}
And this is the function to retrieve the og images (if they exist):
function getog($url) {
    $html = curl_get_contents($url);
    if ($html === false || $html === '') { return; }

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // '@' silences warnings from real-world malformed HTML

    $xpath = new DOMXPath($doc);
    // '@property', not '#property' -- '#' is not valid in an XPath attribute test
    $query = '//*/meta[starts-with(@property, \'og:\')]';
    $metas = $xpath->query($query);

    $ogProperty = array('url' => '', 'title' => '', 'image' => '');
    foreach ($metas as $meta) {
        $property = $meta->getAttribute('property');
        $content = $meta->getAttribute('content');
        if ($property == "og:url" && $ogProperty['url'] == "") { $ogProperty['url'] = $content; }
        if ($property == "og:title" && $ogProperty['title'] == "") { $ogProperty['title'] = $content; }
        if ($property == "og:image" && $ogProperty['image'] == "") { $ogProperty['image'] = $content; }
    }
    return $ogProperty;
}
There is quite a bit more to the script, but these functions are the bottleneck. I'm also caching to a text file, which makes it faster after the first run.
How can I speed up my script to retrieve the final URL and get the image URLs from the links in the RSS feed?

You can use Facebook's Open Graph API. Facebook uses it to scrape the important info from any URL, and it is much faster than the usual scraping method.
You can use it like this:
og_scrapping.php:
function meta_scrap($url) {
    // urlencode the target so query characters in $url don't break the request
    $link = 'https://graph.facebook.com/?id=' . urlencode($url) . '&scrape=true&method=post';
    return json_decode(curl_get_contents($link));
}
Then simply call it anywhere after including og_scrapping.php:
print_r(meta_scrap('http://www.example.com'));
You will get an object back, and you can then pick out the content you need. For the title, image, URL and description:
$output = meta_scrap('http://www.example.com');
$title = $output->title;
$image = $output->image[0]->url;
$description = $output->description;
$url = $output->url;
The major cost is scraping for images; getting the title and description is easy. Read this article for a faster way to get the images. It will help you save a few seconds.

I'm afraid there isn't much you can do to speed up the extraction process itself. One possible change would be approaching the image extraction string-wise, that is, targeting the og: tags with a regex (something that is usually strongly advised against).
This has the major downside of breaking easily if the source ever changes, while offering no significant speed advantage over the more robust DOM parsing approach.
I'm also caching to a text file, which means it's faster after the first run.
On the other hand, you might go with an approach that always serves only the cache to the user, and renews it using an asynchronous call if needed upon each request.
As CBroe commented on your question:
There is no way to speed up following redirects. The client has to make a new request, and that takes the time it takes. With CURLOPT_FOLLOWLOCATION cURL does this automatically already, so there is no point where you could possibly interject to make anything faster.
This means it is not a heavy task for your web server, just a lengthy one because of the many requests it has to perform. That is very good ground to start thinking asynchronously:
you receive a request that is looking for the RSS items,
you serve a response very quickly from the cache,
you send an asynchronous request to rebuild the cache if needed - this is the longest part due to the redirects and DOM parsing, but the original client/peer requesting the list of RSS items does not have to wait for this operation to complete; that is, for this list, it only takes time to send the rebuild request itself, a few microseconds,
you return with the cached items.
Asynchronous shell exec in PHP
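A minimal sketch of that fire-and-forget trigger, assuming the rebuild lives in its own script (the rebuild_cache.php name here is hypothetical, and the backgrounding trick works with POSIX shells only):

```php
<?php
// Build the shell command: redirecting output and appending '&' detaches
// the rebuild process, so exec() returns immediately.
function build_rebuild_command($script) {
    return 'php ' . escapeshellarg($script) . ' > /dev/null 2>&1 &';
}

// Fire-and-forget: the request that triggered the rebuild is not held up by it.
function trigger_cache_rebuild($script = 'rebuild_cache.php') {
    exec(build_rebuild_command($script));
}
```

On Windows you would need `start /B` or a proper job queue instead of the trailing `&`.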
If you'd go down this route, in your case, you'd meet the following advantages:
rapid content serving with high loading speed,
no loading speed reduction when the cache is being rebuilt.
But also, the following disadvantages:
the first user to request an updated feed does not immediately* receive the newest item(s),
subsequent users after the first one do not immediately* receive the newest item(s) until the cache is ready.
*Good news is, you can almost perfectly eliminate all disadvantages using a cyclic, timed AJAX request that checks if there are any new items in the RSS items cache.
If there are, you can display a message on top (or on bottom), informing the user about the arrival of new content, and append that content when the user clicks on the notice.
This approach - compared to simply always serving cached content without the cyclic AJAX call - reduces the delay between live RSS appearance and item appearance on your site to a maximum time of n + m, where n is the AJAX-request interval, and m is the time it takes to rebuild the cache.
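Independently of the caching, the rebuild itself does not have to fetch the feed items one after another: since most of the time m is spent waiting on the network, the redirect-following requests can run in parallel. A sketch using curl_multi (the total wall time then approaches that of the slowest single transfer rather than the sum of all of them):

```php
<?php
// Fetch several URLs in parallel; each handle follows its own redirects.
function fetch_all(array $urls) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow the 301s
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }
    do {
        curl_multi_exec($mh, $running);
        if ($running > 0 && curl_multi_select($mh) === -1) {
            usleep(1000); // select can return -1; back off briefly to avoid spinning
        }
    } while ($running > 0);

    $results = array();
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results; // body per input key, false/'' on failure
}
```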

Meta tags are stored in the "head" element, so your XPath can target the head directly (note '@property': the '#property' in your query is not valid XPath):
$query = '//head/meta[starts-with(@property, \'og:\')]';
You also lose time retrieving, storing and parsing the whole HTML file when you could stop the retrieval at the end of the "head" element. Why download a 40 kB page when you only want the first 1 kB?
You "might" consider aborting the retrieval once the closing "head" tag has been seen. It can speed things up when there is nothing else to do, but it is a naughty, not-always-working hack.
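A sketch of that hack, under the stated caveat that it is best-effort: a CURLOPT_WRITEFUNCTION callback can abort the transfer as soon as the closing head tag shows up in the buffer.

```php
<?php
// Pure helper: has the head section been fully received?
function head_complete($html) {
    return stripos($html, '</head>') !== false;
}

// Download only until </head>, then abort. cURL reports error 23
// (CURLE_WRITE_ERROR) on the deliberate abort, so treat that as success.
function fetch_head_only($url) {
    $buffer = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$buffer) {
        $buffer .= $chunk;
        if (head_complete($buffer)) {
            return -1; // any value != strlen($chunk) aborts the transfer
        }
        return strlen($chunk);
    });
    curl_exec($ch);
    curl_close($ch);
    return $buffer; // everything received up to (and including) </head>
}
```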


How to get json data without api

I'm trying to get the viewer count so I can check whether a streamer on https://www.dlive.tv/ is online. If you view the page source on a streamer's page (https://www.dlive.tv/thelongestchain), there's a bunch of JSON, and "watchingCount" is in there.
Basically, I want the streamer to appear in the "Live Now" section of my site if their viewer count is 1 or more, but I can't figure out any way to get the viewer count. I know I could use something like Selenium if I were using Python and could run it from my PC, but I need the site itself to know.
DLive doesn't have an API yet, so I don't know how to make a call (or request, I don't know the terminology) to get this info. When I look in the network tab in Chrome, I see a call (https://graphigo.prd.dlive.tv/) that I think provides stream info. Would I also need my auth key?
I realize this question is broad and all over the place, but that's because so am I after trying to solve this for the last couple of days. If I had the viewer count as a variable, I know how to display the streamer in the "Live Now" section of my site; I just don't know how to get the necessary data.
If there's another way I should be checking whether a streamer is online or offline, other than the viewer count, that would work too. If anyone could help me out I would greatly appreciate it, thanks.
I tried scraping the page, but I don't think you can scrape dynamic content. When I tried to use SimpleHTMLDom it just returned static elements.
<?php
require 'simple_html_dom.php';

$html = file_get_html('https://www.dlive.tv/thelongestchain');

if ($html->find('video', 0)) {
    echo 'online';
} else {
    echo 'offline';
}
/* The video element is only on the page if the streamer is live, but it never
   turns up here because it's rendered client-side rather than in the static HTML */
?>
I have no idea how to make a call/request to get the JSON data for the viewer count, or how to get any other data that could tell whether a streamer is online. All the scraping I've done only returned static elements (the same whether the streamer was online or offline).
Try cURL; it's like a magic wand.
This will return the entire page, and I think it includes the JSON you're looking for:
<?php
$curl = curl_init('https://www.dlive.tv/thelongestchain');
curl_setopt($curl, CURLOPT_FAILONERROR, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$result = curl_exec($curl);
var_dump($result); // you do as you need to here
curl_close($curl);
?>
Here's the <script> containing the data I believe you need. Assuming "watchingCount" is the same thing you're looking for?
<script>window.__INITIAL_STATE__={"version":"1.2.0","snackbar":{"snackbar":null},"accessToken":{"accessToken":null},"userMeta":{"userMeta":{"fingerprint":null,"referrer":{"isFirstTime":true,"streamer":null,"happyHour":null,"user":null},"ipStats":null,"ip":"34.216.252.62","langCode":"en","trackingInfo":{"rank":"not available","prevPage":"not available","postStatus":"not available"},"darkMode":true,"NSFW":false,"prefetchID":"e278e744-f522-480e-a290-8eed0fe83b07","cashinEmail":""}},"me":{"me":null},"dialog":{"dialog":{"login":false,"subscribe":false,"cashIn":false,"chest":false,"chestWinners":false,"downloadApp":false}},"tabs":{"tabs":{"livestreamMobileActiveTab":"tab-chat"}},"globalInfo":{"globalInfo":{"languages":[],"communityUpdates":[],"recommendChannels":[]}},"ui":{"ui":{"viewPointWidth":1920,"mq":4,"isMobile":false}}};(function(){var s;(s=document.currentScript||document.scripts[document.scripts.length-1]).parentNode.removeChild(s);}());</script> <script>window.__APOLLO_STATE__={"defaultClient":{"user:dlive-00431789":{"id":"user:dlive-00431789","avatar":"https:\u002F\u002Fimages.prd.dlivecdn.com\u002Favatar\u002F5c1330d8-5bc8-11e9-ab17-865634f95b6b","__typename":"User","displayname":"thelongestchain","partnerStatus":"NONE","username":"dlive-00431789","canSubscribe":false,"subSetting":null,"followers":{"type":"id","generated":true,"id":"$user:dlive-00431789.followers","typename":"UserConnection"},"livestream":{"type":"id","generated":false,"id":"livestream:dlive-00431789+i7rCywMWg","typename":"Livestream"},"hostingLivestream":null,"offlineImage":"https:\u002F\u002Fimages.prd.dlivecdn.com\u002Fofflineimage\u002Fvideo-placeholder.png","banStatus":"NO_BAN","deactivated":false,"about":"#lovejonah\n\nJonah's NEW FRIENDS:\nhttps:\u002F\u002Fdlive.tv\u002FFlamenco https:\u002F\u002Fi.gyazo.com\u002F88416fca5047381105da289faba60e7c.png\nhttps:\u002F\u002Fdlive.tv\u002FHamsterSamster 
https:\u002F\u002Fi.gyazo.com\u002F984b19f77a1de5e3028e42ccd71052a0.png\nhttps:\u002F\u002Fdlive.tv\u002Fjayis4justice \nhttps:\u002F\u002Fdlive.tv\u002FDenomic\nhttps:\u002F\u002Fdlive.tv\u002FCutie\nhttps:\u002F\u002Fdlive.tv\u002FTruly_A_No_Life\n\n\n\n\n\n\n\n\n\n\n\n\nOur Socials:\nhttps:\u002F\u002Fwww.twitch.tv\u002Fthelongestchain\nhttps:\u002F\u002Fdiscord.gg\u002Fsagd68Z\n\nLINO website: https:\u002F\u002Flino.network\nLINO Whitepaper: https:\u002F\u002Fdocsend.com\u002Fview\u002Fy9qtwb6\nLINO Tracker : https:\u002F\u002Ftracker.lino.network\u002F#\u002F\nLINO Discord : https:\u002F\u002Fdiscord.gg\u002FTUxp3ww\n\nThe Legend of Lemon's: https:\u002F\u002Fbubbl.us\u002FNTE1OTA4MS85ODY3ODQwL2M0Y2NjNjRlYmI0ZGNkNDllOTljNDMxODExNjFmZDRk-X?utm_source=shared-link&utm_medium=link&s=9867840\n\nPC:\nAMD FX 6core 3.0ghz ddr3\n12GB RAM HyperFury X Blue ddr3\nCooler Master Hyper 6heatpipe cpu cooler\nGigabyte MB\n2 x EVGA 1070 FTW\nKingston SSD 120gb\nKingston SSD 240GB\nREDDRAGON Keyboard\nREDDRAGON Mouse\nBlack Out Blue Yeti Microphone\nLogitech C922\n\nApps Used:\nBig Trades Tracker: https:\u002F\u002Ftucsky.github.io\u002FSignificantTrades\u002F#\nMultiple Charts: 
\nhttps:\u002F\u002Fcryptotrading.toys\u002Fcrypto-panel\u002F\nhttps:\u002F\u002Fcryptowatch.net\n\n\n\n","treasureChest":{"type":"id","generated":true,"id":"$user:dlive-00431789.treasureChest","typename":"TreasureChest"},"videos":{"type":"id","generated":true,"id":"$user:dlive-00431789.videos","typename":"VideoConnection"},"pastBroadcasts":{"type":"id","generated":true,"id":"$user:dlive-00431789.pastBroadcasts","typename":"PastBroadcastConnection"},"following":{"type":"id","generated":true,"id":"$user:dlive-00431789.following","typename":"UserConnection"}},"$user:dlive-00431789.followers":{"totalCount":1000,"__typename":"UserConnection"},"livestream:dlive-00431789+i7rCywMWg":{"id":"livestream:dlive-00431789+i7rCywMWg","totalReward":"3243600","watchingCount":5,"permlink":"dlive-00431789+i7rCywMWg","title":"bybit 0.1eth HIGH LEVERAGE","content":"","category":{"type":"id","generated":false,"id":"category:11455","typename":"Category"},"creator":{"type":"id","generated":false,"id":"user:dlive-00431789","typename":"User"},"__typename":"Livestream","language":{"type":"id","generated":false,"id":"language:1","typename":"Language"},"watchTime({\"add\":false})":true,"disableAlert":false},"category:11455":{"id":"category:11455","backendID":11455,"title":"Cryptocurrency","__typename":"Category","imgUrl":"https:\u002F\u002Fimages.prd.dlivecdn.com\u002Fcategory\u002FCBAOENLDK"},"language:1":{"id":"language:1","language":"English","__typename":"Language"},"$user:dlive-00431789.treasureChest":{"value":"2144482","state":"COLLECTING","ongoingGiveaway":null,"__typename":"TreasureChest","expireAt":"1560400949000","buffs":[],"startGiveawayValueThreshold":"500000"},"$user:dlive-00431789.videos":{"totalCount":0,"__typename":"VideoConnection"},"$user:dlive-00431789.pastBroadcasts":{"totalCount":13,"__typename":"PastBroadcastConnection"},"$user:dlive-00431789.following":{"totalCount":41,"__typename":"UserConnection"},"ROOT_QUERY":{"userByDisplayName({\"displayname\":\"thelongestchain\"})"
:{"type":"id","generated":false,"id":"user:dlive-00431789","typename":"User"}}}};(function(){var s;(s=document.currentScript||document.scripts[document.scripts.length-1]).parentNode.removeChild(s);}());</script>
I assume you'll then just have to throw in a loop and make the url dynamic to get through whatever streamers you're monitoring with your site.
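For instance, a sketch of pulling the count out of that blob (the "watchingCount" key and its JSON shape are taken from the dump above and may change whenever DLive redeploys its front end):

```php
<?php
// Pull "watchingCount" out of the inline state blob. The page embeds the
// state as JSON inside a <script> tag, so a narrow regex on the raw HTML
// is enough; no headless browser needed.
function extract_watching_count($html) {
    if (preg_match('/"watchingCount":(\d+)/', $html, $m)) {
        return (int) $m[1];
    }
    return null; // key absent: treat the streamer as offline
}
```

Combined with the cURL fetch above: `$count = extract_watching_count($result);` and show the streamer under "Live Now" when `$count` is not null and at least 1.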

preg_match_all regex

Having issues using regex to grab HTML contained in a certain span.
I'm trying to get it to grab "safeytrfyh is available!" on NameMC.com, to make a fast checker that runs through a pre-specified list of usernames instead of me constantly typing in a username and clicking check.
An example page you guys can use is https://namemc.com/u/safeytrfyh
I'm using cURL for this:
<?php
// URLs to scrape.
$URLs = array();
$URLs[] = 'https://namemc.com/u/safeytrfyh';

$working = '';

// cURL scraper.
foreach ($URLs as $URL) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $URL);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $page = curl_exec($ch);
    curl_close($ch);

    $accounts = array();
    preg_match_all('#<div><span[^>]*>(.*?)</span></div>#', $page, $accounts);
    foreach ($accounts[1] as $account) { // [1] = the captured text, [0] would include the tags
        $working .= $account . PHP_EOL;
    }
}

// Write the scraped results into the new .txt file.
file_put_contents('accounts.txt', $working, FILE_APPEND);
?>
The usually simpler (if less efficient) approach is traversing the HTML structure with a convenient frontend such as QueryPath: qp($html)->find(".alert-danger .alert-link")->text(). Albeit that actually looks less reliable for the concrete task.
Now if for some reason you don't want to look at the HTML source and adapt your regex, or don't know how placeholders work, then a simpler alternative is just matching the raw text:
$text = strip_tags($html);
preg_match_all("/(\w+) \s+ is \s+ available/x", $text, $matches);
Where \w+ stands for word characters, \s+ for spaces, and /x for readability.
You can convert the page into a DOM object and get whatever you want, as in:
<?php
$url = "http://stackoverflow.com/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0); // if page is https (use if you are on localhost)
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0); // keep the headers out of the body we parse
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$page = curl_exec($ch); // can echo to check the page
curl_close($ch);

$dom = new DOMDocument();
@$dom->loadHTML($page); // '@' silences warnings from real-world HTML

$xpath = new DOMXPath($dom);
// '@id' / '@class', not '#id' / '#class'
$query21 = '//div[@id="question-mini-list"]//h3//a[@class="question-hyperlink"]';
$nodes21 = $xpath->query($query21);

$file_title = fopen('questions.txt', 'w');
foreach ($nodes21 as $node21) {
    $tit = trim($node21->nodeValue); // heading text
    fwrite($file_title, $tit . "\r\n");
}
fclose($file_title);
?>
OUTPUT as:
I have an araay in one file and i want to find the size of it in another file using “sizeof” , i dont want to use any extra variables?
No Activity found to handle Intent act=android.intent.action.VIEW when trying to play an audio file
Naming a variable dynamically in Ruby
uanble to use Bootstrap Notify with angular js in mvc application
How do I combine a bootstrap carousel with a sidebar menu?
Stop pausing when mouse hover -Full Slider
How to Let Recordset #2 in the Same Position as the Similar Recordset#1
Bash backup script. Read list of files. [OS X]
extracting multiple columns from mt0 in hspice simulations using awk command
Can't invoke *method=* type methods in instance_eval
Couldnt understand the Array behavior in ruby
Could not connect to sql server using msado15.dll in c++
Slick 3.0.0 AutoIncrement Composite Key
Swift Error type 'usersVC' does not conform to protocol 'UITableViewDataSource'
Installation Error Unknown Failure
Is it possible to 'emulate' a regular post that loads a new page in angularjs? or plain java as a backup?
puppet file protocol handle throws Could not evaluate
Hazard of load address in mips
how to post multipal files to a url from jscript?
How to organize the viewmodel of tableview with section in reactiveUI
CQRS with legacy MSSQL database
Should I use Blob storage or Azure VM storage for files?
Copy cell content from a column to another column in matlab
How do I debug a crash on iOS device from a crash log
Combobox in windows phone 8.1 not showing 4th and 5th element in emulator
How to add padding in printing table in F#?
I don't understand the SpriteAccessor class (Universal Tween Engine)
mule reliable pattern with file streaming and JMS
How to tell Faraday to preserve hashbang in site URL?
maven-license-plugin by mycila (replacing license header)
Customise `JOptionPane.YES_NO_OPTION`
AWS: Boto SQS writing isn't saving
Android expandable listview always scrolls down to bottom
Inconsistency in TypeConverter behavior?
Using function as prototype
Adjust width of inline buttons automatically based on parent width
GetWeek of Month, Week starts from Monday
Has anybody tried to recreate UITableViewController with static cells?
Why shows --“cannot pass objects of non-trivially-copyable type”?
Search and update a string in a text file in JAVA
What is Countdown Latch in Java MultiThreading?
Slim Framework with ORM (Eloquent) connect multiple db
Why isn't the frame centred in this GUI program when it is run?
Custom Logout Handler Not Working Grails
Response to post request to AWS “breaks the pipe”, cannot read
how to set focus to a SearchBox control in windows 8.1 store app?
Removing a word from after a string
need to generate css from scss file on windows 8.1 using gruntjs compass
Arduino YUN - complex JSON response
How to use expandable list view in the following scenario
Unique DB entry to the user
R : Save big objects to disk then only load parts of them
What is wrong based on these dbus system bus log files?
NLP Shift reduce parser is throwing null pointer Exception for Sentiment calculation
Excel VBA - Combine Rows with duplicate values, merge cells if different
what's TransactionID and RowID and Roll Point size in InnoDB
File associations in vscode
Difference Between IEnumerable Model and Model
efficient way of passing Data between Matlab functions
Open new Form in same window silverlight app via c#?
Hibernate configuration to create hbm and POJO
FTP Client gives “ECONNREFUSED - Connection refused by server”
Timer in Selective Repeat ARQ
Can TXL be used for code clone detection
MATLAB - Callback after reparenting
Asynchronous execution with datastax mapper
Stopping gobbler threads in blocking reads on Process InputStream
how to get gabor filter image using opencv?
WebView shows source html with loadDataWithBaseURL, not rendered view
git merge forked repo to local repo
Scrapy (Python): Iterating over 'next' page without multiple functions
android:uiOption=“SplitActionBarWhenNarrow” does not work
md5 hash a large file incrementally?
Instagram relationship request endpoint registration issue
cuda calc distance of two points
How to share contents of ListView row on facebook in Android?
how will the socket act when the receiving speed is larger than process speed
cannot see particle (cocos2d-x 3.5 with Particle Designer2)
Couldn't find FoodObject without an ID
CardView and RecyclerView divider behaviour
Verification google play purchase from server side
dyld: Symbol not found: _iconv when using javac to compile on MacOS
R not producing a figure in jupyter (IPython notebook)
Entity Framework 6 update a table and insert into foreign key related tables
I have integrated CLIPS with VC++(MFC), why there are some function does't execute,such as “strcmp”
Using SelectBoxIt in AngularJS Directive
Where is the Google Information Rights Management API?
Open Graph in Laravel 5
CodeIgniter 3 Unable to locate the model you have specified
how to have a static url for shopify oauth?
Use AnnotationReader under namespace
No such .h file or directory(Android, Cocos2d-x, NDK)
Getting total sum of rows and adding and removing rows using knockoutjs
Dynamic default value for Kendo Grid
Ruby's class expression---how is it different from `Class.new`?
socket.emit is not working in mobile chrome (but it works in incognito mode)

PHP & Facebook: facebook-debug a URL using CURL and Facebook debugger

Facts: I run a simple website that contains articles. They are dynamically acquired by scraping third-party websites/blogs (new articles arrive every half hour or so), and I wish to post them on my Facebook page. Each article typically includes an image, a title and some text.
Problem: Most (almost all) of the articles that I post on Facebook are not posted correctly - their images are missing.
Inefficient Solution: Using Facebook's debugger (this one) I submit an article's URL to it (URL from my website, not the original source's URL) and Facebook then scans/scrapes the URL and correctly extracts the needed information (image, title, text etc). After this action, the article can be posted on Facebook correctly - no missing images or anything.
Goal: What I am after is a way to create a process that will submit a URL to Facebook's debugger, forcing Facebook to scan/scrape the URL so that it can then be posted correctly. I believe that what I need to do is create an HTTP POST request containing the URL and submit it to Facebook's debugger. Is this the correct way to go? And if so, as I have no previous experience with cURL, what is the correct way to do it using cURL in PHP?
Side Notes: As a side note, I should mention that I am using short URLs for my articles, although I do not think that this is the cause of the problem because the problem persists even when I use the canonical URLs.
Also, the Open Graph meta tags are correctly set (og:image, og:description, etc).
You can debug a graph object using the Facebook Graph API with PHP cURL by doing a POST to:
https://graph.facebook.com/v1.0/?id={Object_URL}&scrape=1
To make things easier, we can wrap the debugger in a function:
function facebookDebugger($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'https://graph.facebook.com/v1.0/?id=' . urlencode($url) . '&scrape=1');
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    $r = curl_exec($ch);
    curl_close($ch);
    return $r;
}
This will update and clear Facebook's cache for the passed URL. It is a bit hard to print out each key and its content while avoiding errors at the same time, so I recommend var_dump() or print_r(), or PHP-ref.
Usage with PHP-ref:
r( facebookDebugger('http://retrogramexplore.tumblr.com/') );
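If you would rather work with the scrape result programmatically than dump it, you can decode the JSON. Note the field names below (title, description, image[0]['url']) are assumptions based on typical Graph scrape responses; check a real response before relying on them:

```php
<?php
// Decode the scrape response and pick out the fields of interest.
// Unknown/absent fields come back as null rather than triggering notices.
function parse_og_response($json) {
    $data = json_decode($json, true);
    if (!is_array($data)) { $data = array(); }
    return array(
        'title'       => isset($data['title']) ? $data['title'] : null,
        'description' => isset($data['description']) ? $data['description'] : null,
        'image'       => isset($data['image'][0]['url']) ? $data['image'][0]['url'] : null,
    );
}
```

Combined with the function above: `$og = parse_og_response(facebookDebugger($articleUrl));`.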

Using MediaWiki API with the Continue Command

I need some help using the MediaWiki API with the "continue" or "query-continue" command to pull information from my wiki articles. I have a large number of wiki articles (more than 800 currently), and I need to use the API to pull them in batches of 50 and then print out sections.
My API call works properly:
//Stackoverflow is making me use a valid URL here; this API is actually on my own localhost server
http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=a&apto=z&apnamespace=0&format=xml&aplimit=50
I am querying all pages, therefore "apfrom" and "apto".
I just need help with the PHP and cURL code that accesses the API, processes the batches of 50, and uses "continue" to fetch more records until I hit the end. So far my PHP code is:
//the CURL commands here work and outputs a data set but only for the first 50 records, so I need to call "continue" to get to the end.
//My api url is localhost but I'm forced to use a valid URL by Stackoverflow.com
$url = sprintf('http://en.wikipedia.org/w/api.php?
action=query&list=allpages&apfrom=a&apto=z&apnamespace=0&format=xml&aplimit=50');
$ch=curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'My site');
$res = curl_exec($ch);
$continue = '';
while ( // I don't know what to set here as true to get the while loop going, maybe continue = true? maybe set query-continue as true?)
{
//Maybe I need something other than $res['query-continue]??
if (empty($res['query-continue']))
{
exit;
}
else
{
$continue = '&apcontinue='.urlencode($res['query-continue']);
foreach ($res['query']['allpages'] as $v)
{
echo $v['title'];
}
}
}
Can someone correct my while loop code above so I can do a simple print out of the title from each wiki article in the loop? I've done a lot of searching online but I'm stuck!! I found a python loop example at http://www.mediawiki.org/wiki/API:Query but I have to do it in PHP. And I am not sure if I call continue or query-continue.
As svick said, please use a client library that handles continuation for you.
The query-continuation mechanism has changed multiple times in MediaWiki; you don't want to have to understand it, much less rely on it.
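That said, if you do end up handling continuation by hand on a current MediaWiki, the documented scheme is: send an empty continue= parameter on the first request, then merge whatever the response's continue object contains (e.g. apcontinue) into the next request, until a response comes back with no continue object. A sketch using JSON output instead of XML (the API URL is a placeholder for your localhost endpoint):

```php
<?php
// Merge the continuation tokens from a response into the next request's params.
function merge_continue(array $params, array $res) {
    return array_merge($params, isset($res['continue']) ? $res['continue'] : array());
}

// Walk list=allpages, 50 titles per request, until no 'continue' comes back.
function all_page_titles($api) {
    $titles = array();
    $params = array(
        'action' => 'query', 'list' => 'allpages', 'aplimit' => 50,
        'apnamespace' => 0, 'format' => 'json', 'continue' => '',
    );
    do {
        $ch = curl_init($api . '?' . http_build_query($params));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, 'My site');
        $res = json_decode(curl_exec($ch), true);
        curl_close($ch);

        if (isset($res['query']['allpages'])) {
            foreach ($res['query']['allpages'] as $page) {
                echo $page['title'] . "\n";
                $titles[] = $page['title'];
            }
        }
        $params = merge_continue($params, $res);
    } while (isset($res['continue']));
    return $titles;
}
```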

Looping of a function is so time consuming

I have a PHP function that parses an XML URL and gives me an array. The function takes a particular ID, passed in via the form, and returns all the information related to that ID. I have 20 different IDs, and I am passing them in using a foreach loop like below:
<?php
$relatedSlides = $result['RelatedSlideshows'];
if (!empty($relatedSlides)) {
    $k = 1;
    foreach ($relatedSlides as $Related) {
        RelatedSlides($Related);
        if ($k % 6 == 0) {
            echo '</tr><tr>';
        }
        $k++;
    }
}
?>
This is the foreach loop; $relatedSlides is an array of all the slide IDs. Below is the function that fetches the information for a particular ID:
function RelatedSlides($slideId) {
    $secret_key = 'my api key';
    $ts = time();
    $hash = sha1($secret_key . $ts);
    $key = 'my secret key';
    // no space before 'hash': '& hash=' would send a malformed parameter
    $url = 'http://www.slideshare.net/api/2/get_slideshow?api_key=' . $key . '&ts=' . $ts
         . '&hash=' . $hash . '&slideshow_id=' . $slideId . '&detailed=1';
    echo $url;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla Firefox');
    $query = curl_exec($ch);
    $errorCode = curl_errno($ch);
    curl_close($ch);

    $array = (array) simplexml_load_string($query);
    echo "<font size=\"18\">return code is " . $errorCode . "</font>";
    echo '<td valign="top"><div id="slide_thumb"><img src="' . $array['ThumbnailURL']
        . '" width="100" height="100"/></div><div id="slide_thum_des"><strong>Views:</strong>'
        . $array['NumViews'] . '<br />' . $array['Title'] . '....</div></td>';
}
When I call this function, my connection times out every time. The function itself is correct: it returns all the data for a single ID. But when I run it in a foreach loop over many IDs, I get "connection has been reset" or "connection timed out".
You could try a couple of things:
Set up your cURL handle outside of the RelatedSlides() function. That way you don't keep building and tearing down the $ch resource on every iteration.
Check the slideshare.net API and see if there are params you can pass to pull down smaller files.
As Luke wisely mentioned, you could make the page asynchronous: render the page with 6 tiles, then have each tile make an AJAX call for the slide you want. That way the user at least sees something while the tiles load, instead of being hung up while you pull all the images at once.
I trust slideshare has a pretty robust CDN hosting these images; you may want to see if they have servers closer to your web server.
Quick question: is cURL how slideshare.net suggested you pull images? Chances are you could just create an image tag pointing directly at their API:
echo '<img src="http://www.slideshare.net/api/2/get_slideshow?api_key=' . $key . '&ts=' . $ts . '&hash=' . $hash . '&slideshow_id=' . $slideId . '&detailed=1" />';
If you are using cURL for the extended data, you may want to cache that data so you don't have to keep making the extra simplexml_load_string call.
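A sketch of the first suggestion, re-using one handle for every slide (the URL builder mirrors the question's parameters; the 'my api key' / 'my secret key' values are placeholders exactly as in the question):

```php
<?php
// Build the get_slideshow URL the same way the question does.
function buildSlideUrl($key, $secret, $slideId, $ts) {
    $hash = sha1($secret . $ts);
    return 'http://www.slideshare.net/api/2/get_slideshow?' . http_build_query(array(
        'api_key' => $key, 'ts' => $ts, 'hash' => $hash,
        'slideshow_id' => $slideId, 'detailed' => 1,
    ));
}

// One handle for all slideshows: set the options once, change only the URL.
// Re-using the handle keeps the connection to slideshare.net alive,
// saving a TCP handshake per slide.
function fetchAllSlides(array $slideIds, $key, $secret) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla Firefox');

    $results = array();
    foreach ($slideIds as $slideId) {
        curl_setopt($ch, CURLOPT_URL, buildSlideUrl($key, $secret, $slideId, time()));
        $results[$slideId] = (array) simplexml_load_string(curl_exec($ch));
    }
    curl_close($ch);
    return $results;
}
```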
The timeout is due to your function taking a long time, as you have said already. That is normal, and the limit can be adjusted in the PHP config (check max_execution_time there before looking at Apache). Remember that the timeout is there for a reason, e.g. it is good to time out when you run into an infinite loop: rare, but possible.
I think one way to tackle this problem is to split it into parts and use AJAX to make individual calls that won't take as long, e.g.:
Load the page with some JS/jQuery scripts.
Call async to get the list of IDs (an AJAX call via jQuery is the easiest).
Parse the response (JSON?) on the client side and fire one async request per ID.
Wait for all results to come back and display them the way you want.
