How to get Wikipedia page HTML with absolute URLs using the API? - php

I'm trying to retrieve articles through the Wikipedia API using this code:
// Request the parsed HTML of the "example" article from the MediaWiki API.
$url = 'http://en.wikipedia.org/w/api.php?action=parse&page=example&format=json&prop=text';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$c = curl_exec($ch);
curl_close($ch);
// The article HTML lives under parse.text.* in the JSON response.
$json = json_decode($c);
$content = $json->{'parse'}->{'text'}->{'*'};
I can view the content on my website and everything is fine, but I have a problem with the links inside the retrieved article. If you open the URL you can see that all the links start with href=\"/
meaning that if someone clicks on any related link in the article it takes them to www.mysite.com/wiki/.. (404 error) instead of en.wikipedia.org/wiki/..
Is there any piece of code that I can add to the existing one to fix this issue?

This seems to be a shortcoming in the MediaWiki action=parse API. In fact, someone already filed a feature request asking for an option to make action=parse return full URLs.
As a workaround, you could either try to mangle the links yourself (like adil suggests), or use index.php?action=render like this:
http://en.wikipedia.org/w/index.php?action=render&title=Example
This will only give you the page HTML with no API wrapper, but if that's all you want anyway then it should be fine. (For example, this is the method used internally by InstantCommons to show remote file description pages.)
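For completeness, here is a minimal cURL sketch of that workaround (the page title is just the example from above, and the variable names are placeholders):
// Sketch: fetch the rendered HTML directly; there is no API wrapper to decode.
$url = 'http://en.wikipedia.org/w/index.php?action=render&title=Example';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$content = curl_exec($ch);
curl_close($ch);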

You should be able to fix the links like this:
$content = str_replace('<a href="/w', '<a href="//en.wikipedia.org/w', $content);

In case anyone else needs to replace all instances of the URL: in PHP, str_replace already replaces every occurrence, so the snippet above covers that. The g flag only applies if you do the same replacement with a JavaScript regex, e.g.
/<a href="\/w/g
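If you prefer a regex on the PHP side, a minimal sketch (assuming the article HTML only uses double-quoted hrefs) could look like this; preg_replace also replaces all matches by default:
// Sketch: prefix every root-relative /wiki/ or /w/ link with the Wikipedia host.
$content = preg_replace('#<a href="/(wiki|w)/#', '<a href="//en.wikipedia.org/$1/', $content);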

Related

PHP & Facebook: facebook-debug a URL using CURL and Facebook debugger

Facts: I run a simple website that contains articles dynamically acquired by scraping third-party websites/blogs etc. (new articles arrive on my website every half hour or so), which I wish to post on my Facebook page. Each article typically includes an image, a title and some text.
Problem: Most (almost all) of the articles that I post on Facebook are not posted correctly - their images are missing.
Inefficient Solution: Using Facebook's debugger (this one) I submit an article's URL to it (URL from my website, not the original source's URL) and Facebook then scans/scrapes the URL and correctly extracts the needed information (image, title, text etc). After this action, the article can be posted on Facebook correctly - no missing images or anything.
Goal: What I am after is a way to create a process which will submit a URL to Facebook's debugger, thus forcing Facebook to scan/scrape the URL so that it can then be posted correctly. I believe that what I need to do is create an HTTP POST request containing the URL and submit it to Facebook's debugger. Is this the correct way to go? And if yes, as I have no previous experience with cURL, what is the correct way to do it using cURL in PHP?
Side Notes: As a side note, I should mention that I am using short URLs for my articles, although I do not think that this is the cause of the problem because the problem persists even when I use the canonical URLs.
Also, the Open Graph meta tags are correctly set (og:image, og:description, etc).
You can debug a graph object using the Facebook Graph API with PHP cURL by doing a POST to
https://graph.facebook.com/v1.0/?id={Object_URL}&scrape=1
To make things easier, we can wrap the debugger in a function:
function facebookDebugger($url) {
    // POST the URL to the Graph API with scrape=1 so Facebook re-scrapes it.
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'https://graph.facebook.com/v1.0/?id='. urlencode($url). '&scrape=1');
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    $r = curl_exec($ch);
    curl_close($ch);
    return $r;
}
This will update and clear Facebook's cache for the passed URL. It's a bit hard to print out each key and its content while avoiding errors at the same time, so I recommend using var_dump(), print_r(), or PHP-ref.
Usage with PHP-ref:
r( facebookDebugger('http://retrogramexplore.tumblr.com/') );
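If PHP-ref isn't available, a plain sketch that decodes the JSON response with built-in functions works too:
// Sketch: decode the Graph API response and dump it without PHP-ref.
$response = facebookDebugger('http://retrogramexplore.tumblr.com/');
$data = json_decode($response, true);   // associative array
print_r($data);                         // or var_dump($data)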

Extract HTML from a site using PHP [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
This is the site I am referring to.
I have searched through Stack Overflow and tried various suggested PHP methods like file_get_contents() and readfile(), but they cannot retrieve the table values from the site.
I tried to view the source of the page and I could not locate the table values either. I tried looking for an iframe src, but to no avail.
Not sure if there is any method I can use to retrieve such values from the site?
Please advise.
The table's HTML seems to be generated on the client side (in your browser) with JavaScript, so it won't show up in the server's response in the way you see it in the browser (you can try disabling JavaScript and check the site). You can either:
Switch technology and use some kind of remote-controlled browser like PhantomJS, or
try to use their raw data. Just open up your browser's developer tools (usually F12) and check which URLs are fetched. You might need to analyze the site's JavaScript code to make sense of these.
In both cases, check with the site's owners whether they are OK with this kind of use (read their data use policy if they have one, or just e-mail them); most site owners are not exactly happy about this kind of crawling.
Use the logic of cURL; please refer to this example:
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "example.com");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
?>

Using curl for scraping large pages

I'm trying to scrape comments from a popular news site for an academic study using curl. It works fine for articles with <300 comments but after that it struggles.
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
curl_close($handle);
echo $html; //just to see what's been scraped
At the moment this page works fine: http://www.guardian.co.uk/commentisfree/2012/aug/22/letter-from-india-women-drink?commentpage=all#start-of-comments
But this one only returns 36 comments despite there being 700+ in total: http://www.guardian.co.uk/commentisfree/2012/aug/21/everyones-talking-about-rape?commentpage=all#start-of-comments
Why is it struggling for articles with a ton of comments?
Your comments page is paginated. Each page contains different comments. You will have to request all the comment pagination links.
The parameter page=x is appended to the URL for each further page.
It might be good to get the base page, then search for all links with the page parameter and request each of those in turn (see the sketch below).
As Mike Christensen pointed out, if you could use Python and Scrapy that functionality is built in. You just have to specify the element the comment is located in and Scrapy will crawl all the links on the page for you :)
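A rough PHP sketch of the pagination approach above; the page parameter name, the upper bound, and the stop condition are assumptions you would confirm against the site's actual pagination links:
// Sketch: request each comment page in turn by incrementing the page parameter.
$baseUrl = 'http://www.guardian.co.uk/commentisfree/2012/aug/21/everyones-talking-about-rape';
$pages = array();
for ($page = 1; $page <= 50; $page++) {          // upper bound is a placeholder
    $ch = curl_init($baseUrl . '?page=' . $page);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        break;                                   // request failed; stop paging
    }
    $pages[] = $html;                            // parse the comments out of each page afterwards
}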

How to display images when using cURL?

When scraping a page, I would like the images included with the text.
Currently I'm only able to scrape the text. For example, as a test script, I scraped Google's homepage and it only displayed the text, no images (Google logo).
I also created another test script using Redbox, with no success, same result.
Here's my attempt at scraping the Redbox 'Find a Movie' page:
<?php
$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result= curl_exec ($ch);
curl_close ($ch);
echo $result;
?>
The page was broken: missing box art, missing scripts, etc.
Looking at Firefox's Firebug extension's 'Net' tool (which lets me check headers and file paths), I discovered that Redbox's images and CSS files were not loaded / missing (404 Not Found). I noticed why: my browser was looking for Redbox's images and CSS files in the wrong place.
Apparently the Redbox images and CSS files are located relative to the domain, and likewise for Google's logo. So if my script above is using its own domain as the base for the file paths, how can I change this?
I tried altering the Host and Referer request headers with the script below, and I've googled extensively, but no luck.
My fix attempt:
<?php
$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$referer = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Host: www.redbox.com") );
curl_setopt ($ch, CURLOPT_REFERER, $referer);
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result= curl_exec ($ch);
curl_close ($ch);
echo $result;
?>
I hope I made sense, if not, let me know and I'll try to explain it better.
Any help would be great! Thanks.
UPDATE
Thanks to everyone (especially Marc and Wyatt), your answers helped me figure out a method to implement.
I was able to successfully test by following the steps below:
Download the page and its requisites via Wget.
Add <base href="..." /> to the downloaded page's header.
Upload the revised downloaded page and its original requisites via Wput to a temporary server.
Test the uploaded page on the temporary server via a browser.
If the uploaded page is not displayed properly, some of the requisites might still be missing (CSS, JS, etc.). View which are missing via a tool that lets you view header responses (e.g. the 'Net' tool from FF's Firebug add-on). After locating the missing requisites, visit the original page that the uploaded page is based on, take note of the proper locations of the missing requisites, then revise the downloaded page from step 1 to accommodate the new proper locations and begin at step 3 again. Otherwise, if the page renders properly, success!
Note: When revising the downloaded page I edited the code manually; I'm sure you could use regex or a parsing library on cURL's response to automate the process.
When you scrape a URL, you're retrieving a single file, be it HTML, image, CSS, JavaScript, etc. The document you see displayed in a browser is almost always the result of MULTIPLE files: the original HTML, each separate image, each CSS file, each JavaScript file. You enter only a single address, but fully building/displaying the page requires many HTTP requests.
When you scrape the Google home page via cURL and output that HTML to the user, there's no way for the user to know that they're actually viewing Google-sourced HTML; it appears as if the HTML came from your server, and your server only. The user's browser will happily suck in this HTML, find the images, and request the images from YOUR server, not Google's. Since you're not hosting any of Google's images, your server responds with a proper 404 "not found" error.
To make the page work properly, you've got a few choices. The easiest is to parse the HTML of the page and insert a <base href="..." /> tag into the document's header block. This tells any viewing browser that "relative" links within the document should be fetched from this 'base' source (e.g. Google).
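A minimal sketch of that easiest option, assuming the fetched HTML is in $result and the page has a normal <head> element:
// Sketch: inject a <base> tag right after <head> so relative links resolve against Google.
$base = '<base href="http://www.google.com/" />';
$result = preg_replace('/<head([^>]*)>/i', '<head$1>' . $base, $result, 1);
echo $result;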
A harder option is to parse the document and rewrite any references to external files (images ,css, js, etc...) and put in the URL of the originating server, so the user's browser goes to the original site and fetches from there.
The hardest option is to essentially set up a proxy server, and if a request comes in for a file that doesn't exist on your server, to try and fetch the corresponding file from Google via curl and output it to the user.
If the site you're loading is using relative paths for its resource URLs (i.e. /images/whatever.gif instead of http://www.site.com/images/whatever.gif), you're going to need to do some rewriting of those URLs in the source you get back, since cURL won't do that itself. Wget (whose official site seems to be down) does, and will even download and mirror the resources for you, but it does not provide PHP bindings.
So, you need to come up with a methodology to scrape through the resulting source and change relative paths into absolute paths. A naive way would be something like this:
if (!preg_match('/src="https?:\/\//', $result))
    $result = preg_replace('/src="(.*?)"/', "src=\"$MY_BASE_URL\\1\"", $result);
where $MY_BASE_URL is the base URL you want to rewrite, e.g. http://www.mydomain.com. That won't work for everything, but it should get you started. It's not an easy thing to do, and you might be better off just spawning off a wget command in the background and letting it mirror or rewrite the HTML for you.
Try obtaining the images by having the raw output returned, using the CURLOPT_BINARYTRANSFER option set to true, as below
curl_setopt($ch,CURLOPT_BINARYTRANSFER, true);
I've used this successfully to obtain images and audio from a webpage.
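For example, a minimal sketch that fetches a single image as raw bytes and saves it locally (the image URL and file name are placeholders):
// Sketch: download one image with CURLOPT_BINARYTRANSFER and write it to disk.
$ch = curl_init('http://www.example.com/images/logo.png');   // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$imageData = curl_exec($ch);
curl_close($ch);
file_put_contents('logo.png', $imageData);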

Make cURL behave like exactly like form

I have a form on my site which sends data to some remote site: a simple HTML form.
What I want to do is use the data the user enters into the form for statistical purposes.
So instead of sending the data to the remote page directly, I send it first to my script, which resends it to the remote site.
The thing is, I need it to behave exactly the way the usual form would behave, taking the user to the remote site and displaying its resources.
When I use this code it kinda works but not in the way I want it to:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $action);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$result = curl_exec($ch);
curl_close($ch);
The problem is that it displays the response in the same script. For example, if $action is
somesite.com/processform.php and my script name is myscript.php, it displays the response of "somesite.com/processform.php" inside "myscript.php", so all the relative links don't work.
How do I make it send the user to "somesite.com/processform.php", the same thing that pressing the submit button would do?
Leonti
I think you will have to do this on your end, as translating relative paths is the client's job. It should be simple: Just take the base directory of the request you made
http://otherdomain.com/my/request/path.php
and add it in front of every outgoing link that does not begin with "/" or a protocol ("http://", "ftp://").
Detecting all the outgoing links is hard, but I am 100% sure there are ready-made PHP classes that do that. Check, for example, this article and the getLinks() function in the user comments. I am not 100% sure whether this is what you need, but it certainly goes in the right direction.
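If you'd rather not hunt for a ready-made class, a rough sketch with PHP's built-in DOMDocument does much the same job (assuming the fetched page HTML is in $result and using the example base path from above):
// Sketch: prefix the base path onto every link that is not absolute and does not start with "/".
$base = 'http://otherdomain.com/my/request/';
$dom = new DOMDocument();
libxml_use_internal_errors(true);            // tolerate real-world HTML warnings
$dom->loadHTML($result);
libxml_clear_errors();
foreach ($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    if ($href !== '' && !preg_match('#^(/|[a-z][a-z0-9+.-]*:)#i', $href)) {
        $link->setAttribute('href', $base . $href);
    }
}
$result = $dom->saveHTML();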
Here are a couple of possible solutions, which I post separately so they don't get mixed up with the one I recommend:
1 - keep using cURL, parse the response and add a <base/> tag to it. It should work for pretty much everything on that page.
<base href="http://realsite.com/form_url.php" />
2 - do not alter the submit URL. Submit the form to the real URL, but capture its content using some Javascript library (YUI does that) and send it to your script via XHR. It's still kind of hacky though.
There are several ways to do that. Here's one of the easiest: just use a 307 redirect.
header('Location: http://realsite.com/form_url.php', true, 307);
You can do your logging and stuff either before or after header(), but if you do it after calling header() you will need to start your script with
ignore_user_abort(true);
Note that browsers are supposed to notify the user that their form is being redirected.
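Putting that together, a minimal sketch (the log file name is a placeholder):
<?php
// Sketch: record the submitted fields, then forward the POST with a 307 redirect.
ignore_user_abort(true);   // needed because the logging happens after header()
header('Location: http://realsite.com/form_url.php', true, 307);
file_put_contents('form_log.txt', json_encode($_POST) . PHP_EOL, FILE_APPEND);
A 307 keeps the request method and body intact, so the remote site receives the same POST the user submitted.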
