I'm trying to get the viewer count so I can check whether a streamer is online on https://www.dlive.tv/. If you view the page source on a streamer's page (https://www.dlive.tv/thelongestchain), there's a bunch of JSON and "watchingCount" is in it.
Basically, I want the streamer to appear in the "Live Now" section of my site if their viewer count is 1 or more, but I can't figure out any way to get the viewer count. I know I could use something like Selenium if I were using Python and could run it from my PC, but I need the site itself to do this.
DLive doesn't have an API yet, so I don't know how to make a call (or request, I don't know the terminology) to get this info. When I look in the Network tab in Chrome I see a call (https://graphigo.prd.dlive.tv/) that I think provides stream info. Would I also need my auth key?
I realize this question is broad and all over the place, but that's because so am I after trying to solve this for the last couple of days. If I had the viewer count as a variable, I'd know how to display the streamer in the "Live Now" section of my site; I just don't know how to get the necessary data.
If there's another way I should be checking whether a streamer is online or offline, other than getting the viewer count, that would work too. If anyone could help me out I would greatly appreciate it, thanks.
I tried scraping the page but I don't think you can scrape dynamic content. When I tried to use SimpleHTMLDom it just returned static elements.
<?php
require 'simple_html_dom.php';

$html = file_get_html('https://www.dlive.tv/thelongestchain');

// The <video> element is only on the page when the streamer is live,
// but it never gets found here because it's rendered dynamically, I presume.
if ($html->find('video', 0)) {
    echo 'online';
} else {
    echo 'offline';
}
?>
I have no idea how to go about making a call/request to get the JSON data for the viewer count, or how to get any other data that would tell me whether a streamer is online. None of the scraping I've done returned any elements that weren't static (i.e. they were the same whether the streamer was online or offline).
Try cURL, it's like a magic wand.
This will return the entire page, including, I think, the JSON you're looking for:
<?php
$curl = curl_init('https://www.dlive.tv/thelongestchain');
curl_setopt($curl, CURLOPT_FAILONERROR, true);     // fail on HTTP error codes >= 400
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);  // follow any redirects
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false); // skip SSL checks (fine for testing)
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$result = curl_exec($curl);
var_dump($result); // you do as you need to here
curl_close($curl);
?>
Here's the <script> containing the data I believe you need. Assuming "watchingCount" is the same thing you're looking for?
<script>window.__INITIAL_STATE__={"version":"1.2.0","snackbar":{"snackbar":null},"accessToken":{"accessToken":null},"userMeta":{"userMeta":{"fingerprint":null,"referrer":{"isFirstTime":true,"streamer":null,"happyHour":null,"user":null},"ipStats":null,"ip":"34.216.252.62","langCode":"en","trackingInfo":{"rank":"not available","prevPage":"not available","postStatus":"not available"},"darkMode":true,"NSFW":false,"prefetchID":"e278e744-f522-480e-a290-8eed0fe83b07","cashinEmail":""}},"me":{"me":null},"dialog":{"dialog":{"login":false,"subscribe":false,"cashIn":false,"chest":false,"chestWinners":false,"downloadApp":false}},"tabs":{"tabs":{"livestreamMobileActiveTab":"tab-chat"}},"globalInfo":{"globalInfo":{"languages":[],"communityUpdates":[],"recommendChannels":[]}},"ui":{"ui":{"viewPointWidth":1920,"mq":4,"isMobile":false}}};(function(){var s;(s=document.currentScript||document.scripts[document.scripts.length-1]).parentNode.removeChild(s);}());</script> <script>window.__APOLLO_STATE__={"defaultClient":{"user:dlive-00431789":{"id":"user:dlive-00431789","avatar":"https:\u002F\u002Fimages.prd.dlivecdn.com\u002Favatar\u002F5c1330d8-5bc8-11e9-ab17-865634f95b6b","__typename":"User","displayname":"thelongestchain","partnerStatus":"NONE","username":"dlive-00431789","canSubscribe":false,"subSetting":null,"followers":{"type":"id","generated":true,"id":"$user:dlive-00431789.followers","typename":"UserConnection"},"livestream":{"type":"id","generated":false,"id":"livestream:dlive-00431789+i7rCywMWg","typename":"Livestream"},"hostingLivestream":null,"offlineImage":"https:\u002F\u002Fimages.prd.dlivecdn.com\u002Fofflineimage\u002Fvideo-placeholder.png","banStatus":"NO_BAN","deactivated":false,"about":"#lovejonah\n\nJonah's NEW FRIENDS:\nhttps:\u002F\u002Fdlive.tv\u002FFlamenco https:\u002F\u002Fi.gyazo.com\u002F88416fca5047381105da289faba60e7c.png\nhttps:\u002F\u002Fdlive.tv\u002FHamsterSamster https:\u002F\u002Fi.gyazo.com\u002F984b19f77a1de5e3028e42ccd71052a0.png\nhttps:\u002F\u002Fdlive.tv\u002Fjayis4justice \nhttps:\u002F\u002Fdlive.tv\u002FDenomic\nhttps:\u002F\u002Fdlive.tv\u002FCutie\nhttps:\u002F\u002Fdlive.tv\u002FTruly_A_No_Life\n\n\n\n\n\n\n\n\n\n\n\n\nOur Socials:\nhttps:\u002F\u002Fwww.twitch.tv\u002Fthelongestchain\nhttps:\u002F\u002Fdiscord.gg\u002Fsagd68Z\n\nLINO website: https:\u002F\u002Flino.network\nLINO Whitepaper: https:\u002F\u002Fdocsend.com\u002Fview\u002Fy9qtwb6\nLINO Tracker : https:\u002F\u002Ftracker.lino.network\u002F#\u002F\nLINO Discord : https:\u002F\u002Fdiscord.gg\u002FTUxp3ww\n\nThe Legend of Lemon's: https:\u002F\u002Fbubbl.us\u002FNTE1OTA4MS85ODY3ODQwL2M0Y2NjNjRlYmI0ZGNkNDllOTljNDMxODExNjFmZDRk-X?utm_source=shared-link&utm_medium=link&s=9867840\n\nPC:\nAMD FX 6core 3.0ghz ddr3\n12GB RAM HyperFury X Blue ddr3\nCooler Master Hyper 6heatpipe cpu cooler\nGigabyte MB\n2 x EVGA 1070 FTW\nKingston SSD 120gb\nKingston SSD 240GB\nREDDRAGON Keyboard\nREDDRAGON Mouse\nBlack Out Blue Yeti Microphone\nLogitech C922\n\nApps Used:\nBig Trades Tracker: https:\u002F\u002Ftucsky.github.io\u002FSignificantTrades\u002F#\nMultiple Charts: 
\nhttps:\u002F\u002Fcryptotrading.toys\u002Fcrypto-panel\u002F\nhttps:\u002F\u002Fcryptowatch.net\n\n\n\n","treasureChest":{"type":"id","generated":true,"id":"$user:dlive-00431789.treasureChest","typename":"TreasureChest"},"videos":{"type":"id","generated":true,"id":"$user:dlive-00431789.videos","typename":"VideoConnection"},"pastBroadcasts":{"type":"id","generated":true,"id":"$user:dlive-00431789.pastBroadcasts","typename":"PastBroadcastConnection"},"following":{"type":"id","generated":true,"id":"$user:dlive-00431789.following","typename":"UserConnection"}},"$user:dlive-00431789.followers":{"totalCount":1000,"__typename":"UserConnection"},"livestream:dlive-00431789+i7rCywMWg":{"id":"livestream:dlive-00431789+i7rCywMWg","totalReward":"3243600","watchingCount":5,"permlink":"dlive-00431789+i7rCywMWg","title":"bybit 0.1eth HIGH LEVERAGE","content":"","category":{"type":"id","generated":false,"id":"category:11455","typename":"Category"},"creator":{"type":"id","generated":false,"id":"user:dlive-00431789","typename":"User"},"__typename":"Livestream","language":{"type":"id","generated":false,"id":"language:1","typename":"Language"},"watchTime({\"add\":false})":true,"disableAlert":false},"category:11455":{"id":"category:11455","backendID":11455,"title":"Cryptocurrency","__typename":"Category","imgUrl":"https:\u002F\u002Fimages.prd.dlivecdn.com\u002Fcategory\u002FCBAOENLDK"},"language:1":{"id":"language:1","language":"English","__typename":"Language"},"$user:dlive-00431789.treasureChest":{"value":"2144482","state":"COLLECTING","ongoingGiveaway":null,"__typename":"TreasureChest","expireAt":"1560400949000","buffs":[],"startGiveawayValueThreshold":"500000"},"$user:dlive-00431789.videos":{"totalCount":0,"__typename":"VideoConnection"},"$user:dlive-00431789.pastBroadcasts":{"totalCount":13,"__typename":"PastBroadcastConnection"},"$user:dlive-00431789.following":{"totalCount":41,"__typename":"UserConnection"},"ROOT_QUERY":{"userByDisplayName({\"displayname\":\"thelongestchain\"})":{"type":"id","generated":false,"id":"user:dlive-00431789","typename":"User"}}}};(function(){var s;(s=document.currentScript||document.scripts[document.scripts.length-1]).parentNode.removeChild(s);}());</script>
I assume you'll then just have to throw this in a loop and make the URL dynamic to cycle through whatever streamers you're monitoring on your site.
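Once you have the page in $result from the snippet above, you could pull "watchingCount" out of the embedded __APOLLO_STATE__ JSON along these lines. This is only a rough sketch on my part: it assumes DLive keeps embedding that script the same way shown above, which could change at any time since it isn't an official API.
// Rough sketch: extract "watchingCount" from the __APOLLO_STATE__ JSON embedded
// in $result (the page fetched with cURL above). This relies on DLive's current
// page structure, not an official API, so treat it as fragile.
$watching = null;
if ($result && preg_match('/window\.__APOLLO_STATE__\s*=\s*(\{.*?\});\(function/s', $result, $m)) {
    $state = json_decode($m[1], true);
    if (isset($state['defaultClient']) && is_array($state['defaultClient'])) {
        // Walk the cached entries and find the one that carries watchingCount.
        foreach ($state['defaultClient'] as $entry) {
            if (is_array($entry) && isset($entry['watchingCount'])) {
                $watching = (int) $entry['watchingCount'];
                break;
            }
        }
    }
}

if ($watching !== null && $watching >= 1) {
    echo 'online';   // show this streamer in the "Live Now" section
} else {
    echo 'offline';  // no livestream entry found (or zero viewers)
}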
Facts: I run a simple website that contains articles which are dynamically acquired by scraping third-party websites/blogs etc. (new articles arrive on my website every half an hour or so) and which I wish to post on my Facebook page. Each article typically includes an image, a title, and some text.
Problem: Most (almost all) of the articles that I post on Facebook are not posted correctly - their images are missing.
Inefficient Solution: Using Facebook's debugger (this one) I submit an article's URL to it (URL from my website, not the original source's URL) and Facebook then scans/scrapes the URL and correctly extracts the needed information (image, title, text etc). After this action, the article can be posted on Facebook correctly - no missing images or anything.
Goal: What I am after is a way to create a process which will submit a URL to Facebook's debugger, thus forcing Facebook to scan/scrape the URL so that it can then be posted correctly. I believe that what I need to do is create an HTTP POST request containing the URL and submit it to Facebook's debugger. Is this the correct way to go? And if yes, since I have no previous experience with cURL, what is the correct way to do it using cURL in PHP?
Side Notes: I should mention that I am using short URLs for my articles, although I do not think this is the cause of the problem, because the problem persists even when I use the canonical URLs.
Also, the Open Graph meta tags are correctly set (og:image, og:description, etc).
You can debug a graph object using the Facebook Graph API with PHP cURL, by doing a POST to
https://graph.facebook.com/v1.0/?id={Object_URL}&scrape=1
To make things easier, we can wrap the debugger call in a function:
function facebookDebugger($url) {
    // POST the URL to the Graph API with scrape=1 so Facebook re-scrapes it.
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'https://graph.facebook.com/v1.0/?id=' . urlencode($url) . '&scrape=1');
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    $r = curl_exec($ch);
    curl_close($ch);
    return $r;
}
Though this will update and clear Facebook's cache for the passed URL, it's a bit hard to print out each key and its contents while avoiding errors at the same time, so I recommend using var_dump() or print_r(), or PHP-ref.
Usage with PHP-ref:
r( facebookDebugger('http://retrogramexplore.tumblr.com/') );
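If you'd rather not pull in PHP-ref, the response body is just JSON, so a plain json_decode() plus print_r() works as well. A minimal sketch:
// Minimal alternative to PHP-ref: decode the JSON response and dump it.
$response = facebookDebugger('http://retrogramexplore.tumblr.com/');
$data = json_decode($response, true);

if ($data === null) {
    echo 'Could not decode the response: ' . htmlspecialchars($response);
} else {
    print_r($data); // inspect the scraped fields (title, image, etc.) or any error Facebook returned
}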
I need some help using the MediaWiki API with the "continue" or "query-continue" command to pull information from my wiki articles. I have a large number of wiki articles (more than 800 currently) and I need to use the API to pull them in batches of 50 and then print out sections.
My API call works properly:
// Stack Overflow is making me use a valid URL here; this API is actually my own localhost server.
http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=a&apto=z&apnamespace=0&format=xml&aplimit=50
I am querying all pages, hence "apfrom" and "apto".
I just need help with the PHP and cURL code that accesses the API, processes the batches of 50, and uses "continue" to access more records until I hit the end. So far my PHP code is:
// The cURL commands here work and output a data set, but only for the first 50 records, so I need to use "continue" to get to the end.
// My API URL is localhost, but I'm forced to use a valid URL by Stackoverflow.com.
$url = sprintf('http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=a&apto=z&apnamespace=0&format=xml&aplimit=50');
$ch=curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'My site');
$res = curl_exec($ch);
$continue = '';
while ( /* I don't know what to set here as true to keep the loop going -- maybe continue = true? maybe set query-continue as true? */ )
{
    // Maybe I need something other than $res['query-continue']??
    if (empty($res['query-continue']))
    {
        exit;
    }
    else
    {
        $continue = '&apcontinue=' . urlencode($res['query-continue']);
        foreach ($res['query']['allpages'] as $v)
        {
            echo $v['title'];
        }
    }
}
Can someone correct my while loop code above so I can do a simple printout of the title from each wiki article in the loop? I've done a lot of searching online but I'm stuck! I found a Python loop example at http://www.mediawiki.org/wiki/API:Query, but I have to do it in PHP. And I am not sure whether I should use continue or query-continue.
As svick said, please use a client library which handles continuation for you.
The query continuation mechanism has changed multiple times in MediaWiki; you don't want to have to understand it, much less rely on it.
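If you do insist on rolling it yourself, here's a rough sketch of what continuation can look like against a reasonably recent MediaWiki, using format=json and the newer continue parameter. This is an assumption about your wiki's version, which is exactly why a client library is the safer bet.
// Rough sketch of manual continuation using format=json and the newer "continue"
// parameter. The mechanism differs across MediaWiki versions, so a client library
// is still the recommended route.
$base = 'http://en.wikipedia.org/w/api.php?action=query&list=allpages'
      . '&apnamespace=0&aplimit=50&format=json';
$continue = array('continue' => '');

do {
    $url = $base . '&' . http_build_query($continue);

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'My site');
    $res = json_decode(curl_exec($ch), true);
    curl_close($ch);

    if (!isset($res['query']['allpages'])) {
        break; // nothing usable came back
    }
    foreach ($res['query']['allpages'] as $page) {
        echo $page['title'], "\n";
    }

    // The API returns the parameters to send back for the next batch.
    $continue = isset($res['continue']) ? $res['continue'] : null;
} while ($continue);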
Yeah, I'm stumped. I'm getting nothing - curl_exec is returning no content. I've tried file_get_contents, but that completely times out. I'm attempting to get API XML from my Subsonic media server and display it on my web server (they are different servers). The end result would be that people can log in to my web server with their media server account. I can deal with the actual parsing later, but I can't even grab the XML right now. I've tried their forums, but haven't gotten much help since they're not really PHP-inclined, so I figured I'd ask here.
$url = "http://{$subserver}/rest/getUser.view?u={$username}&p={$password}&username={$username}&v=1.8.0&c={$appID}";
$c = curl_init($url);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_HEADER, 0);
$result = curl_exec($c);
curl_close($c);
echo $result;
This returns nothing. The variables are defined correctly, and I get the same result as if I type in the whole URL myself. Here is their API page: http://www.subsonic.org/pages/api.jsp. I've even tried their "ping" function - still empty.
The url itself looks fine. In the web browser, it returns:
This XML file does not appear to have any style information associated with it. The document tree is shown below.
<subsonic-response xmlns="http://subsonic.org/restapi" status="ok" version="1.8.0">
<user username="xxxxxx" email="xxxxxx#xxxxxx.com" scrobblingEnabled="false" adminRole="true" settingsRole="true" downloadRole="true" uploadRole="true" playlistRole="true" coverArtRole="true" commentRole="true" podcastRole="true" streamRole="true" jukeboxRole="true" shareRole="true"/>
</subsonic-response>
I admit I've never used XML, but according to everything I've read... this should work. And it does work, with other random XML files I found on the web.
It might have something to do with the fact that it's not an ".xml" file but XML generated via a URL, as this exact same code works with a random XML file I found (http://www.w3schools.com/xml/note.xml).
Any thoughts?
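In case it helps, here's the same request again but with cURL's own error reporting surfaced, so a silent failure at least shows something. It's just a diagnostic variant of the code above, using the same URL:
// Same request, but report cURL's error info instead of echoing an empty result.
$c = curl_init($url);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_HEADER, 0);
$result = curl_exec($c);

if ($result === false) {
    echo 'cURL error (' . curl_errno($c) . '): ' . curl_error($c);
} else {
    echo 'HTTP ' . curl_getinfo($c, CURLINFO_HTTP_CODE) . "\n" . $result;
}
curl_close($c);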
I have a PHP function that parses an XML URL and gives me an array. The function takes a particular ID and returns all the information related to the ID that is passed to it. Now I have 20 different IDs and I am passing these IDs to the function using a foreach loop, like below:
<?php
$relatedSlides = $result['RelatedSlideshows'];
if (!empty($relatedSlides)) {
    $k = 1;
    foreach ($relatedSlides as $Related) {
        RelatedSlides($Related);
        if ($k % 6 == 0) {
            echo '</tr><tr>';
        }
        $k++;
    }
}
?>
This is the foreach loop; $relatedSlides is an array of all the slide IDs. Now, here is the function that parses the information about a particular ID:
function RelatedSlides($slideId) {
    $secret_key = 'my api key';
    $ts = time();
    $hash = sha1($secret_key . $ts);
    $key = 'my secret key';
    $url = 'http://www.slideshare.net/api/2/get_slideshow?api_key=' . $key . '&ts=' . $ts . '&hash=' . $hash . '&slideshow_id=' . $slideId . '&detailed=1';
    echo $url;

    $ch = curl_init() or die(curl_error());
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla Firefox');
    $query = curl_exec($ch);
    $errorCode = curl_errno($ch);
    curl_close($ch);

    $array = (array) simplexml_load_string($query);
    //echo '<pre>';
    //print_r($array);
    //return $array;
    echo "<font size=\"18\">return code is " . $errorCode . "</font>";
    echo '<td valign="top"><div id="slide_thumb"><img src="' . $array['ThumbnailURL'] . '" width="100" height="100"/></div><div id="slide_thum_des"><strong>Views:</strong>' . $array['NumViews'] . '<br />' . $array['Title'] . '....</div></td>';
}
When I call this function, my connection times out every time. The function itself is absolutely correct: it returns all the data about a particular ID. But when I run it in a foreach loop for many IDs, "connection has been reset" or "connection timed out" is displayed.
You could try a couple of things:
Set up your cURL handle outside of the RelatedSlides() function, so you don't have to keep building and tearing down the $ch resource on every iteration (see the sketch after this list).
Check the slideshare.net API and see if there are params you can pass to pull down smaller responses.
As Luke wisely mentioned, you could make the page asynchronous, meaning you can render the page with 6 tiles and then have each tile make an AJAX call for the slide you want. That way the user at least gets to see something while the tiles load, as opposed to being 'hung up' while you pull all the images at once.
I trust SlideShare has a pretty robust CDN hosting these images; you may want to see if they have servers closer to your web server.
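Here's roughly what that first suggestion could look like - a sketch only, reusing one handle and swapping just the URL each iteration, with variable names borrowed from your function:
// Sketch of suggestion 1: create the cURL handle once, reuse it for every slide ID.
$ch = curl_init();
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla Firefox');

foreach ($relatedSlides as $slideId) {
    $ts   = time();
    $hash = sha1($secret_key . $ts);
    $url  = 'http://www.slideshare.net/api/2/get_slideshow?api_key=' . $key
          . '&ts=' . $ts . '&hash=' . $hash
          . '&slideshow_id=' . $slideId . '&detailed=1';

    curl_setopt($ch, CURLOPT_URL, $url); // only the URL changes per request
    $xml = curl_exec($ch);
    // ... simplexml_load_string($xml) and render the tile as before ...
}

curl_close($ch); // tear down once, after the loop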
Quick question: is the cURL approach how slideshare.net suggested you go about pulling images? Chances are you could just create an image tag pointing directly at their API:
echo '<img src="http://www.slideshare.net/api/2/get_slideshow?api_key=' . $key . '&ts=' . $ts . '&hash=' . $hash . '&slideshow_id=' . $slideId . '&detailed=1" />';
If you are doing the cURL option for extended data, you may want to consider caching the extended data so you don't have to keep making the extraneous simplexml_load_string() call.
The timeout is due to your function taking its time, as you have said already. That's normal, and it can be adjusted in either the PHP config or Apache (I don't remember which; I would check the PHP config first). Remember that the timeout is there for a reason - e.g. it's good to time out when you run into an infinite loop - rare, but possible.
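For reference, the PHP side of that limit can be raised per script. This is only a sketch - pick values that make sense for you, and note that web server or proxy timeouts are configured separately:
// Raise PHP's execution time limit for this script only.
set_time_limit(300);                  // seconds; 0 means no limit
ini_set('max_execution_time', '300'); // the same limit via ini_set
// Web server timeouts (e.g. Apache's Timeout directive) are a separate setting.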
I think one way to tackle this is to split the problem into parts and use AJAX to make individual calls that won't take as long.
E.g.:
Load the page with some JS/jQuery scripts.
Make an async call to get the list of IDs (an AJAX call via jQuery is the easiest).
Parse the response (JSON?) on the client side and make a separate async request for each ID.
Wait for all the results to come back and display them the way you want.
When scraping a page, I would like the images included with the text.
Currently I'm only able to scrape the text. For example, as a test script, I scraped Google's homepage and it only displayed the text - no images (Google logo).
I also created another test script using Redbox, with no success, same result.
Here's my attempt at scraping the Redbox 'Find a Movie' page:
<?php
$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
echo $result;
?>
The page came back broken: missing box art, missing scripts, etc.
Looking at Firefox's Firebug extension's 'Net' tool (which lets me check headers and file paths), I discovered that Redbox's images and CSS files were not loading (404 not found). I noticed why: my browser was looking for Redbox's images and CSS files in the wrong place.
Apparently the Redbox images and CSS files are located relative to the domain, and likewise for Google's logo. So if my script above is using its own domain as the base for the file paths, how can I change this?
I tried altering the host and referer request headers with the script below, and I've googled extensively, but no luck.
My fix attempt:
<?php
$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$referer = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Host: www.redbox.com") );
curl_setopt ($ch, CURLOPT_REFERER, $referer);
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result= curl_exec ($ch);
curl_close ($ch);
echo $result;
?>
I hope I made sense, if not, let me know and I'll try to explain it better.
Any help would be great! Thanks.
UPDATE
Thanks to everyone (especially Marc and Wyatt); your answers helped me figure out a method to implement.
I was able to successfully test by following the steps below:
Download the page and its requisites via Wget.
Add <base href="..." /> to downloaded page's header.
Upload the revised downloaded page and its original requisites via Wput to a temporary server.
Test the uploaded page on the temporary server via a browser.
If the uploaded page is not displayed properly, some of the requisites might still be missing (css, js, etc.). See which ones are missing via a tool that lets you view header responses (e.g. the 'Net' tool from Firefox's Firebug add-on). After locating the missing requisites, visit the original page that the uploaded page is based on, take note of the proper locations of the missing requisites, then revise the downloaded page from step 1 to accommodate those locations and begin at step 3 again. Otherwise, if the page renders properly: success!
Note: When revising the downloaded page I edited the code manually; I'm sure you could use regex or a parsing library on cURL's response to automate the process.
When you scrape a URL, you're retrieving a single file, be it HTML, an image, CSS, JavaScript, etc. The document you see displayed in a browser is almost always the result of MULTIPLE files: the original HTML, each separate image, each CSS file, each JavaScript file. You enter only a single address, but fully building/displaying the page will require many HTTP requests.
When you scrape the Google home page via cURL and output that HTML to the user, there's no way for the user to know that they're actually viewing Google-sourced HTML - it appears as if the HTML came from your server, and your server only. The user's browser will happily suck in this HTML, find the images, and request the images from YOUR server, not Google's. Since you're not hosting any of Google's images, your server responds with a proper 404 "not found" error.
To make the page work properly, you've got a few choices. The easiest is to parse the HTML of the page and insert a <base href="..." /> tag into the document's head block. This will tell any viewing browser that "relative" links within the document should be fetched from this 'base' source (e.g. Google).
A harder option is to parse the document and rewrite any references to external files (images ,css, js, etc...) and put in the URL of the originating server, so the user's browser goes to the original site and fetches from there.
The hardest option is to essentially set up a proxy server, and if a request comes in for a file that doesn't exist on your server, to try and fetch the corresponding file from Google via curl and output it to the user.
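As a rough illustration of that first (easiest) option - not production code, and it assumes the fetched page actually has a <head> tag to hook onto:
// Easiest option: inject a <base href="..."> right after <head> so relative
// image/css/js URLs resolve against the original site instead of yours.
$base = 'http://www.redbox.com/';
$result = preg_replace(
    '/<head([^>]*)>/i',
    '<head$1><base href="' . $base . '" />',
    $result,
    1 // only rewrite the first <head>
);
echo $result;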
If the site you're loading uses relative paths for its resource URLs (i.e. /images/whatever.gif instead of http://www.site.com/images/whatever.gif), you're going to need to rewrite those URLs in the source you get back, since cURL won't do that itself. Wget (whose official site seems to be down) does - and will even download and mirror the resources for you - but it does not provide PHP bindings.
So, you need to come up with a methodology to scrape through the resulting source and change relative paths into absolute paths. A naive way would be something like this:
if (!preg_match('/src="https?:\/\//', $result)) {
    // Prefix relative src="..." paths with your base URL.
    $result = preg_replace('/src="(.*?)"/', 'src="' . $MY_BASE_URL . '$1"', $result);
}
where $MY_BASE_URL is the base URL you want to prepend, e.g. http://www.mydomain.com. That won't work for everything, but it should get you started. It's not an easy thing to do, and you might be better off just spawning a wget command in the background and letting it mirror or rewrite the HTML for you.
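If you do go the wget route, spawning it from PHP can be as simple as this - a sketch only; the flags and the output path are just an example:
// Hand the job to wget: -p fetches page requisites (images, css, js) and
// -k rewrites links in the saved copy to point at the downloaded files.
$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
exec('wget -p -k -P /tmp/mirror ' . escapeshellarg($url), $output, $status);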
Try obtaining the images by having the raw output returned, with the CURLOPT_BINARYTRANSFER option set to true, as below:
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
I've used this successfully to obtain images and audio from a webpage.