Facts: I run a simple website whose articles are acquired dynamically by scraping third-party websites, blogs, etc. (new articles arrive every half hour or so), and I want to post these articles on my Facebook page. Each article typically includes an image, a title and some text.
Problem: Most (almost all) of the articles that I post on Facebook are not posted correctly - their images are missing.
Inefficient Solution: Using Facebook's debugger (this one) I submit an article's URL to it (URL from my website, not the original source's URL) and Facebook then scans/scrapes the URL and correctly extracts the needed information (image, title, text etc). After this action, the article can be posted on Facebook correctly - no missing images or anything.
Goal: What I am after is a process that submits a URL to Facebook's debugger, forcing Facebook to scan/scrape the URL so that it can then be posted correctly. I believe I need to create an HTTP POST request containing the URL and submit it to Facebook's debugger. Is this the correct way to go? And if so, as I have no previous experience with cURL, what is the correct way to do it using cURL in PHP?
Side Notes: I am using short URLs for my articles, although I do not think this is the cause of the problem, because the problem persists even when I use the canonical URLs.
Also, the Open Graph meta tags are correctly set (og:image, og:description, etc).
You can debug a graph object through the Facebook Graph API with PHP and cURL by sending a POST request to
https://graph.facebook.com/v1.0/?id={Object_URL}&scrape=1
To make things easier, we can wrap the debugger call in a function:
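function facebookDebugger($url) {
    $ch = curl_init();
    // scrape=1 asks Facebook to re-scrape the URL and refresh its cache
    curl_setopt($ch, CURLOPT_URL, 'https://graph.facebook.com/v1.0/?id=' . urlencode($url) . '&scrape=1');
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Disables certificate verification; remove this line if your server has a CA bundle
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    $r = curl_exec($ch); // response body as a string, or false on failure
    return $r;
}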
This updates and clears Facebook's cache for the passed URL. The function returns the raw response, and printing each key and its content by hand without running into errors is a bit tedious, so I recommend using var_dump(), print_r() or PHP-ref.
Usage with PHP-ref:
r( facebookDebugger('http://retrogramexplore.tumblr.com/') );
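To check the outcome in code rather than dumping the whole structure, the string returned by facebookDebugger() can be decoded first. The following is a minimal sketch, not part of the answer above: parseDebuggerResponse() is a hypothetical helper, and it assumes only that the Graph API answers with JSON and reports failures under an "error" key with a "message" field.

```php
<?php
// Hypothetical helper: turns the raw return value of facebookDebugger()
// into a small status array. $raw is the response string, or false if
// the cURL request itself failed.
function parseDebuggerResponse($raw): array
{
    if ($raw === false) {
        return ['ok' => false, 'error' => 'cURL request failed'];
    }
    $data = json_decode($raw, true);
    if (!is_array($data)) {
        return ['ok' => false, 'error' => 'Response was not valid JSON'];
    }
    if (isset($data['error'])) {
        // Assumption: errors carry a human-readable "message" field
        $message = $data['error']['message'] ?? 'Unknown error';
        return ['ok' => false, 'error' => $message];
    }
    return ['ok' => true, 'data' => $data];
}
```

A caller could then test the 'ok' flag before reading any of the scraped fields from 'data'.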
I'm trying to get the viewer count so I can check if a streamer is online on https://www.dlive.tv/. If you view the page source on a streamer's page (https://www.dlive.tv/thelongestchain), there's a bunch of json and "watchingCount" is there.
Basically, I want the streamer to appear in the "Live Now" section of my site if their viewer count is 1 or more, but I can't figure out any way to get the viewer count. I know I could use something like Selenium if I were using Python and could run it from my PC, but I need the site itself to know it.
DLive doesn't have an API yet, so I don't know how to make a call (or request, I don't know the terminology) to get this info. When I look in the network tab in Chrome I see that there's a call (https://graphigo.prd.dlive.tv/) that I think provides stream info. Would I also need my authkey?
I realize this question is broad and all over the place but it's because so am I with me trying to solve this the last couple days. If I had the viewercount as a variable, I know how to display the streamer on the "Live Now" section of my site, I just don't know how to get the necessary data.
If there's another way I should be checking if a streamer is online or offline other than getting the viewercount, that would work too. If anyone could help me out I would greatly appreciate it, thanks.
I tried scraping the page but I don't think you can scrape dynamic content. When I tried to use SimpleHTMLDom it just returned static elements.
<?php
require 'simple_html_dom.php';
$html = file_get_html('https://www.dlive.tv/thelongestchain');
if(($html->find('video', 0))) {
echo 'online';
}else{
echo 'offline';
}
/* The video element is only on the page if the streamer is live, but it doesn't return because it's not static I presume */
?>
I have no idea at all how to go about making a call/request to get the json data for the viewer count, or how to get any other data that could check if a streamer is online. All the scraping I've done did not return any elements that weren't static (the same no matter if the streamer was online or offline).
Try cURL, it's like a magic wand.
This will return the entire page, and I think including the JSON you're looking for:
<?php
$curl = curl_init('https://www.dlive.tv/thelongestchain');
curl_setopt($curl, CURLOPT_FAILONERROR, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$result = curl_exec($curl);
var_dump($result); // you do as you need to here
curl_close($curl);
?>
Here's the <script> containing the data I believe you need. Assuming "watchingCount" is the same thing you're looking for?
<script>window.__INITIAL_STATE__={"version":"1.2.0","snackbar":{"snackbar":null},"accessToken":{"accessToken":null},"userMeta":{"userMeta":{"fingerprint":null,"referrer":{"isFirstTime":true,"streamer":null,"happyHour":null,"user":null},"ipStats":null,"ip":"34.216.252.62","langCode":"en","trackingInfo":{"rank":"not available","prevPage":"not available","postStatus":"not available"},"darkMode":true,"NSFW":false,"prefetchID":"e278e744-f522-480e-a290-8eed0fe83b07","cashinEmail":""}},"me":{"me":null},"dialog":{"dialog":{"login":false,"subscribe":false,"cashIn":false,"chest":false,"chestWinners":false,"downloadApp":false}},"tabs":{"tabs":{"livestreamMobileActiveTab":"tab-chat"}},"globalInfo":{"globalInfo":{"languages":[],"communityUpdates":[],"recommendChannels":[]}},"ui":{"ui":{"viewPointWidth":1920,"mq":4,"isMobile":false}}};(function(){var s;(s=document.currentScript||document.scripts[document.scripts.length-1]).parentNode.removeChild(s);}());</script> <script>window.__APOLLO_STATE__={"defaultClient":{"user:dlive-00431789":{"id":"user:dlive-00431789","avatar":"https:\u002F\u002Fimages.prd.dlivecdn.com\u002Favatar\u002F5c1330d8-5bc8-11e9-ab17-865634f95b6b","__typename":"User","displayname":"thelongestchain","partnerStatus":"NONE","username":"dlive-00431789","canSubscribe":false,"subSetting":null,"followers":{"type":"id","generated":true,"id":"$user:dlive-00431789.followers","typename":"UserConnection"},"livestream":{"type":"id","generated":false,"id":"livestream:dlive-00431789+i7rCywMWg","typename":"Livestream"},"hostingLivestream":null,"offlineImage":"https:\u002F\u002Fimages.prd.dlivecdn.com\u002Fofflineimage\u002Fvideo-placeholder.png","banStatus":"NO_BAN","deactivated":false,"about":"#lovejonah\n\nJonah's NEW FRIENDS:\nhttps:\u002F\u002Fdlive.tv\u002FFlamenco https:\u002F\u002Fi.gyazo.com\u002F88416fca5047381105da289faba60e7c.png\nhttps:\u002F\u002Fdlive.tv\u002FHamsterSamster 
https:\u002F\u002Fi.gyazo.com\u002F984b19f77a1de5e3028e42ccd71052a0.png\nhttps:\u002F\u002Fdlive.tv\u002Fjayis4justice \nhttps:\u002F\u002Fdlive.tv\u002FDenomic\nhttps:\u002F\u002Fdlive.tv\u002FCutie\nhttps:\u002F\u002Fdlive.tv\u002FTruly_A_No_Life\n\n\n\n\n\n\n\n\n\n\n\n\nOur Socials:\nhttps:\u002F\u002Fwww.twitch.tv\u002Fthelongestchain\nhttps:\u002F\u002Fdiscord.gg\u002Fsagd68Z\n\nLINO website: https:\u002F\u002Flino.network\nLINO Whitepaper: https:\u002F\u002Fdocsend.com\u002Fview\u002Fy9qtwb6\nLINO Tracker : https:\u002F\u002Ftracker.lino.network\u002F#\u002F\nLINO Discord : https:\u002F\u002Fdiscord.gg\u002FTUxp3ww\n\nThe Legend of Lemon's: https:\u002F\u002Fbubbl.us\u002FNTE1OTA4MS85ODY3ODQwL2M0Y2NjNjRlYmI0ZGNkNDllOTljNDMxODExNjFmZDRk-X?utm_source=shared-link&utm_medium=link&s=9867840\n\nPC:\nAMD FX 6core 3.0ghz ddr3\n12GB RAM HyperFury X Blue ddr3\nCooler Master Hyper 6heatpipe cpu cooler\nGigabyte MB\n2 x EVGA 1070 FTW\nKingston SSD 120gb\nKingston SSD 240GB\nREDDRAGON Keyboard\nREDDRAGON Mouse\nBlack Out Blue Yeti Microphone\nLogitech C922\n\nApps Used:\nBig Trades Tracker: https:\u002F\u002Ftucsky.github.io\u002FSignificantTrades\u002F#\nMultiple Charts: 
\nhttps:\u002F\u002Fcryptotrading.toys\u002Fcrypto-panel\u002F\nhttps:\u002F\u002Fcryptowatch.net\n\n\n\n","treasureChest":{"type":"id","generated":true,"id":"$user:dlive-00431789.treasureChest","typename":"TreasureChest"},"videos":{"type":"id","generated":true,"id":"$user:dlive-00431789.videos","typename":"VideoConnection"},"pastBroadcasts":{"type":"id","generated":true,"id":"$user:dlive-00431789.pastBroadcasts","typename":"PastBroadcastConnection"},"following":{"type":"id","generated":true,"id":"$user:dlive-00431789.following","typename":"UserConnection"}},"$user:dlive-00431789.followers":{"totalCount":1000,"__typename":"UserConnection"},"livestream:dlive-00431789+i7rCywMWg":{"id":"livestream:dlive-00431789+i7rCywMWg","totalReward":"3243600","watchingCount":5,"permlink":"dlive-00431789+i7rCywMWg","title":"bybit 0.1eth HIGH LEVERAGE","content":"","category":{"type":"id","generated":false,"id":"category:11455","typename":"Category"},"creator":{"type":"id","generated":false,"id":"user:dlive-00431789","typename":"User"},"__typename":"Livestream","language":{"type":"id","generated":false,"id":"language:1","typename":"Language"},"watchTime({\"add\":false})":true,"disableAlert":false},"category:11455":{"id":"category:11455","backendID":11455,"title":"Cryptocurrency","__typename":"Category","imgUrl":"https:\u002F\u002Fimages.prd.dlivecdn.com\u002Fcategory\u002FCBAOENLDK"},"language:1":{"id":"language:1","language":"English","__typename":"Language"},"$user:dlive-00431789.treasureChest":{"value":"2144482","state":"COLLECTING","ongoingGiveaway":null,"__typename":"TreasureChest","expireAt":"1560400949000","buffs":[],"startGiveawayValueThreshold":"500000"},"$user:dlive-00431789.videos":{"totalCount":0,"__typename":"VideoConnection"},"$user:dlive-00431789.pastBroadcasts":{"totalCount":13,"__typename":"PastBroadcastConnection"},"$user:dlive-00431789.following":{"totalCount":41,"__typename":"UserConnection"},"ROOT_QUERY":{"userByDisplayName({\"displayname\":\"thelongestchain\"})"
:{"type":"id","generated":false,"id":"user:dlive-00431789","typename":"User"}}}};(function(){var s;(s=document.currentScript||document.scripts[document.scripts.length-1]).parentNode.removeChild(s);}());</script>
I assume you'll then just have to throw in a loop and make the url dynamic to get through whatever streamers you're monitoring with your site.
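Pulling the number out of the fetched page can be sketched as below. This is an illustration under assumptions, not DLive's documented interface: it relies on the "watchingCount" key appearing in the page source exactly as in the dump above, and the helper names extractWatchingCount() and isLive() are made up for this example. Treating a missing key as "offline" is also an assumption.

```php
<?php
// Returns the viewer count found in the page source, or null when no
// "watchingCount" key is present (assumed here to mean "offline").
function extractWatchingCount(string $html): ?int
{
    if (preg_match('/"watchingCount":\s*(\d+)/', $html, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Hypothetical convenience wrapper for the "Live Now" check:
// live means a viewer count of 1 or more.
function isLive(string $html): bool
{
    $count = extractWatchingCount($html);
    return $count !== null && $count >= 1;
}
```

With the cURL snippet above, isLive($result) would replace the var_dump() call once curl_exec() has returned a string.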
I'm trying to retrieve articles through the Wikipedia API using this code:
$url = 'http://en.wikipedia.org/w/api.php?action=parse&page=example&format=json&prop=text';
$ch = curl_init($url);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
$c = curl_exec($ch);
$json = json_decode($c);
$content = $json->{'parse'}->{'text'}->{'*'};
I can view the content on my website and everything is fine, but I have a problem with the links inside the retrieved article. If you open the URL you can see that all the links start with href=\"/, meaning that if someone clicks on any related link in the article, it takes them to www.mysite.com/wiki/.. (Error 404) instead of en.wikipedia.org/wiki/..
Is there any piece of code that I can add to the existing one to fix this issue?
This seems to be a shortcoming in the MediaWiki action=parse API. In fact, someone already filed a feature request asking for an option to make action=parse return full URLs.
As a workaround, you could either try to mangle the links yourself (like adil suggests), or use index.php?action=render like this:
http://en.wikipedia.org/w/index.php?action=render&title=Example
This will only give you the page HTML with no API wrapper, but if that's all you want anyway then it should be fine. (For example, this is the method used internally by InstantCommons to show remote file description pages.)
You should be able to fix the links like this:
$content = str_replace('<a href="/w', '<a href="//en.wikipedia.org/w', $content);
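In case anyone else needs to replace all instances of the URL in JavaScript rather than PHP: replace() with a string pattern only changes the first match, so you'll need a regex with the g flag. (PHP's str_replace() already replaces every occurrence, and PHP regexes have no g flag.)
/<a href="\/w/g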
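On the PHP side, a slightly broader variant of the str_replace() fix can be sketched with preg_replace(). This is only an illustration: absolutizeWikiLinks() is a hypothetical name, and the sketch assumes the links in the parsed HTML are written as href="/..." with double quotes, as in the API output described in the question.

```php
<?php
// Rewrites root-relative href values ("/wiki/...", "/w/...") so they point
// at $host. Protocol-relative URLs ("//upload.wikimedia.org/...") are left
// untouched by the negative lookahead.
function absolutizeWikiLinks(string $html, string $host = 'https://en.wikipedia.org'): string
{
    $result = preg_replace('~href="/(?!/)~', 'href="' . $host . '/', $html);
    return $result ?? $html; // preg_replace() returns null on a regex error
}
```

Calling absolutizeWikiLinks($content) right after the json_decode() step would leave the rest of the original code unchanged.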
I am a green programmer and I was originally trying to make cross domain requests in JS. I quickly learned that this is not allowed. Unlike similar questions posted on here, I would like to see if I can use PHP to make them for me instead of JSONP requests. Is this possible?
Simple workflow...
BROWSER: POST to my PHP the request-payload & request-headers
PHP: POST to Other Domain's URL the request-payload & request-headers
Other Domain: Process Request and send response
PHP: Send the Response-Content and Response-Header Info back to the browser
Here is what I am trying to work with http://msdn.microsoft.com/en-us/library/bb969500%28v=office.12%29.aspx
My goal is to make a Communicator Web Access Client that is web based and mobile friendly.
A link to a working example would be awesome!
cURL would be your option in this case, something as simple as:
<?php
$ch = curl_init('http://otherdomain.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, false);
$result = curl_exec($ch);
var_dump($result);
?>
In this case, $result would contain the HTML code of the site. Please be aware that it is not going to execute any JavaScript, as a browser would when visiting the site.
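The workflow from the question (browser posts to PHP, PHP posts to the other domain and relays the answer) can be sketched as follows. This is a minimal sketch under assumptions, not a complete proxy: buildForwardOptions() and forwardPost() are hypothetical names, and only the request body and a caller-supplied list of headers are forwarded.

```php
<?php
// Builds the cURL options for forwarding a POST body and headers to another
// domain. Kept separate from the request itself so it can be inspected.
function buildForwardOptions(string $targetUrl, string $body, array $headers): array
{
    return [
        CURLOPT_URL            => $targetUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $body,
        CURLOPT_HTTPHEADER     => $headers, // e.g. ['Content-Type: application/json']
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => false,
    ];
}

// Sends the request and returns the remote status code and response body.
function forwardPost(string $targetUrl, string $body, array $headers): array
{
    $ch = curl_init();
    curl_setopt_array($ch, buildForwardOptions($targetUrl, $body, $headers));
    $responseBody = curl_exec($ch); // string on success, false on failure
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ['status' => $status, 'body' => $responseBody];
}
```

In the PHP script the browser talks to, the raw request body is available via file_get_contents('php://input'); the result of forwardPost() can then be relayed with http_response_code() and echo.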
You are talking about web services, and it seems that the goal is to process payments. Any major payment gateway has APIs prepared for that. In any case you can study this on your own; here is a good starting point: http://ajaxonomy.com/2008/xml/web-services-part-1-soap-vs-rest
When scraping a page, I would like the images included with the text.
Currently I'm only able to scrape the text. For example, as a test script, I scraped Google's homepage and it only displayed the text, no images(Google logo).
I also created another test script using Redbox, with no success, same result.
Here's my attempt at scraping the Redbox 'Find a Movie' page:
<?php
$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result= curl_exec ($ch);
curl_close ($ch);
echo $result;
?>
The page was broken: missing box art, missing scripts, etc.
Looking at the 'Net' tool of FF's Firebug extension (which lets me check headers and file paths), I discovered that Redbox's images and css files were not loaded (404 not found). The reason was that my browser was looking for Redbox's images and css files in the wrong place.
Apparently the Redbox images and css files are located relative to the domain, likewise for Google's logo. So if my script above is using its own domain as the base for the file paths, how could I change this?
I tried altering the host and referer request headers with the script below, and I've googled extensively, but no luck.
My fix attempt:
<?php
$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$referer = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Host: www.redbox.com") );
curl_setopt ($ch, CURLOPT_REFERER, $referer);
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result= curl_exec ($ch);
curl_close ($ch);
echo $result;
?>
I hope I made sense, if not, let me know and I'll try to explain it better.
Any help would be great! Thanks.
UPDATE
Thanks to everyone (especially Marc and Wyatt), your answers helped me figure out a method to implement.
I was able to test successfully by following the steps below:
1 - Download the page and its requisites via Wget.
2 - Add <base href="..." /> to the downloaded page's header.
3 - Upload the revised page and its original requisites via Wput to a temporary server.
4 - Test the uploaded page on the temporary server via a browser.
5 - If the uploaded page is not displayed properly, some of the requisites might still be missing (css, js, etc.). Find out which are missing with a tool that shows header responses (e.g. the 'Net' tool from FF's Firebug add-on). After locating the missing requisites, visit the original page that the uploaded page is based on, note the proper locations of the missing requisites, then revise the downloaded page from step 1 to accommodate those locations and begin at step 3 again. Otherwise, if the page renders properly, success!
Note: When revising the downloaded page I edited the code manually; I'm sure you could use a regex or a parsing library on cURL's response to automate the process.
When you scrape a URL, you're retrieving a single file, be it HTML, image, css, javascript, etc. The document you see displayed in a browser is almost always the result of MULTIPLE files: the original HTML, each separate image, each css file, each javascript file. You enter only a single address, but fully building and displaying the page requires many HTTP requests.
When you scrape the Google home page via curl and output that HTML to the user, there's no way for the user's browser to know that it is actually viewing Google-sourced HTML; it appears as if the HTML came from your server, and your server only. The browser will happily suck in this HTML, find the images, and request them from YOUR server, not Google's. Since you're not hosting any of Google's images, your server responds with a proper 404 "not found" error.
To make the page work properly, you've got a few choices. The easiest is to parse the HTML of the page and insert a <base href="..." /> tag into the document's header block. This tells any viewing browser that relative links within the document should be fetched from this 'base' source (e.g. Google).
A harder option is to parse the document and rewrite any references to external files (images, css, js, etc.) so that they contain the URL of the originating server, and the user's browser goes to the original site and fetches from there.
The hardest option is to essentially set up a proxy server, and if a request comes in for a file that doesn't exist on your server, to try and fetch the corresponding file from Google via curl and output it to the user.
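The easiest of these options, inserting a <base> tag, can be sketched as below. It is a minimal illustration: insertBaseTag() is a hypothetical helper, and it assumes the scraped page contains a literal opening <head> tag; if none is found, the HTML is returned unchanged.

```php
<?php
// Inserts a <base> tag right after the opening <head> tag so that relative
// URLs in the scraped HTML resolve against $baseUrl instead of your server.
function insertBaseTag(string $html, string $baseUrl): string
{
    $tag = '<base href="' . htmlspecialchars($baseUrl, ENT_QUOTES) . '" />';
    $result = preg_replace_callback(
        '/<head(\s[^>]*)?>/i',          // matches <head> or <head ...>, not <header>
        function (array $m) use ($tag) {
            return $m[0] . $tag;
        },
        $html,
        1                                // only the first <head> tag
    );
    return $result ?? $html;
}
```

For the script in the question, echo insertBaseTag($result, 'http://www.redbox.com/Titles/'); would replace the plain echo $result; the base URL shown here is derived from the URL in the question and is only an example.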
If the site you're loading is using relative paths for its resource URLs (i.e. /images/whatever.gif instead of http://www.site.com/images/whatever.gif), you're going to need to do some rewriting of those URLs in the source you get back, since cURL won't do that itself, though Wget (official site seems to be down) does (and will even download and mirror the resources for you), but does not provide PHP bindings.
So, you need to come up with a methodology to scrape through the resulting source and change relative paths into absolute paths. A naive way would be something like this:
// Prefix every src value that does not already start with http:// or https://
$result = preg_replace('/src="(?!https?:\/\/)([^"]*)"/', 'src="' . $MY_BASE_URL . '$1"', $result);
where $MY_BASE_URL is the base URL you want to rewrite, i.e. http://www.mydomain.com. That won't work for everything, but it should get you started. It's not an easy thing to do, and you might be better off just spawning off a wget command in the background and letting it mirror or rewrite the HTML for you.
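Try obtaining the images by having the raw output returned, with the CURLOPT_BINARYTRANSFER option set to true, as below. (Since PHP 5.1.3 this option has no effect: raw output is always returned when CURLOPT_RETURNTRANSFER is set.)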
curl_setopt($ch,CURLOPT_BINARYTRANSFER, true);
I've used this successfully to obtain images and audio from a webpage.
I have a form on my site which sends data to a remote site - a simple HTML form.
I want to use the data the user enters into the form for statistical purposes.
So instead of sending the data directly to the remote page, I send it first to my script, which resends it to the remote site.
The thing is, I need it to behave exactly the way the usual form would: taking the user to the remote site and displaying its resources.
When I use this code it kind of works, but not in the way I want it to:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $action);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$result = curl_exec($ch);
curl_close($ch);
The problem is that it displays the response in the same script. For example, if $action is
somesite.com/processform.php and my script name is mysqcript.php, it displays the response of "somesite.com/processform.php" inside "mysqcript.php", so all the relative links stop working.
How do I make it send the user to "somesite.com/processform.php", the same thing that pressing the button would do?
Leonti
I think you will have to do this on your end, as translating relative paths is the client's job. It should be simple: Just take the base directory of the request you made
http://otherdomain.com/my/request/path.php
and add it in front of every outgoing link that does not begin with "/" or a protocol ("http://", "ftp://").
Detecting all the outgoing links is hard, but I am 100% sure there are ready-made PHP classes that do that. Check for example this article and the getLinks() function in the user comments. I am not 100% sure whether this is what you need, but it certainly goes in the right direction.
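The prefixing rule itself can be sketched as follows, leaving link detection aside. This is a simplified illustration: absolutizeLink() is a hypothetical helper, it ignores ports and query strings in the request URL, and it does not resolve "../" segments. It also maps links that begin with "/" to the site root, which goes one step beyond the rule described above.

```php
<?php
// Turns a link found in the fetched page into an absolute URL, based on the
// URL that was requested.
function absolutizeLink(string $link, string $requestUrl): string
{
    // Links with a scheme ("http://", "ftp://", "mailto:") or protocol-relative
    // links ("//host/...") are already absolute enough: leave them alone.
    if (preg_match('~^([a-z][a-z0-9+.-]*:|//)~i', $link)) {
        return $link;
    }
    $parts  = parse_url($requestUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    // Root-relative link: attach it directly to the origin.
    if ($link !== '' && $link[0] === '/') {
        return $origin . $link;
    }
    // Otherwise prepend the base directory of the request path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $link;
}
```

With the example request above, a link such as img/a.png resolves against http://otherdomain.com/my/request/.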
Here are a couple of possible solutions, which I post separately so they don't get mixed up with the one I recommend:
1 - keep using cURL, parse the response and add a <base/> tag to it. It should work for pretty much everything on that page.
<base href="http://realsite.com/form_url.php" />
2 - do not alter the submit URL. Submit the form to the real URL, but capture its content using some Javascript library (YUI does that) and send it to your script via XHR. It's still kind of hacky though.
There are several ways to do that. Here's one of the easiest: just use a 307 redirect.
header('Location: http://realsite.com/form_url.php', true, 307);
You can do your logging and stuff either before or after header() but if you do it after calling header() you will need to start your script with
ignore_user_abort(true);
Note that browsers are supposed to notify the user that their form is being redirected.
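Putting the logging and the 307 redirect together might look like the sketch below. It is an illustration under assumptions: logSubmission() and logAndRedirect() are hypothetical helpers, and the one-JSON-line-per-submission log format is made up for this example. Logging happens before header() here, so ignore_user_abort() is not needed.

```php
<?php
// Appends one JSON line per submission to $logFile (hypothetical log format).
function logSubmission(array $fields, string $logFile): void
{
    $line = json_encode(['time' => date('c'), 'fields' => $fields]);
    file_put_contents($logFile, $line . PHP_EOL, FILE_APPEND | LOCK_EX);
}

// Logs first, then issues the 307 so the browser re-sends the POST
// to the real form handler.
function logAndRedirect(array $fields, string $logFile, string $target): void
{
    logSubmission($fields, $logFile);
    header('Location: ' . $target, true, 307);
}
```

In the script that receives the form, a call such as logAndRedirect($_POST, __DIR__ . '/submissions.log', 'http://realsite.com/form_url.php'); followed by exit; would do the job; the log file name is only an example.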