When scraping a page, I would like the images included with the text.
Currently I'm only able to scrape the text. For example, as a test script, I scraped Google's homepage and it only displayed the text, with no images (such as the Google logo).
I also created another test script using Redbox, with no success, same result.
Here's my attempt at scraping the Redbox 'Find a Movie' page:
<?php
$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result= curl_exec ($ch);
curl_close ($ch);
echo $result;
?>
The page was broken: the box art, scripts, etc. were missing.
Looking at the 'Net' panel of Firefox's Firebug extension (which lets me check headers and file paths), I discovered that Redbox's images and CSS files were not loading (404 Not Found). The reason was that my browser was looking for Redbox's images and CSS files in the wrong place.
Apparently the Redbox images and CSS files are referenced relative to the domain, as is Google's logo. So if the page served by my script above uses my own domain as the base for those file paths, how can I change this?
I tried altering the host and referer request headers with the script below, and I've googled extensively, but no luck.
My fix attempt:
<?php
$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$referer = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Host: www.redbox.com") );
curl_setopt ($ch, CURLOPT_REFERER, $referer);
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result= curl_exec ($ch);
curl_close ($ch);
echo $result;
?>
I hope I made sense, if not, let me know and I'll try to explain it better.
Any help would be great! Thanks.
UPDATE
Thanks to everyone (especially Marc and Wyatt), your answers helped me figure out a method to implement.
I was able to test successfully by following the steps below:
1. Download the page and its requisites via Wget.
2. Add <base href="..." /> to the head section of the downloaded page.
3. Upload the revised page and its original requisites to a temporary server via Wput.
4. Test the uploaded page on the temporary server in a browser.
5. If the uploaded page is not displayed properly, some of the requisites (CSS, JS, etc.) might still be missing. See which ones with a tool that shows response headers (e.g. the 'Net' panel of Firefox's Firebug add-on). Then visit the original page that the uploaded page is based on, note the proper locations of the missing requisites, revise the downloaded page from step 1 to point to those locations, and start again at step 3. Otherwise, if the page renders properly, success!
Note: When revising the downloaded page I edited the code manually. I'm sure you could use a regex or a parsing library on cURL's response to automate the process.
When you scrape a URL, you're retrieving a single file, be it HTML, an image, CSS, JavaScript, etc. The document you see displayed in a browser is almost always the result of MULTIPLE files: the original HTML, each separate image, each CSS file, each JavaScript file. You enter only a single address, but fully building and displaying the page requires many HTTP requests.
When you scrape the Google home page via curl and output that HTML to the user, there's no way for the user's browser to know that it is actually viewing Google-sourced HTML. It appears as if the HTML came from your server, and your server only. The browser will happily suck in this HTML, find the image references, and request the images from YOUR server, not Google's. Since you're not hosting any of Google's images, your server responds with a proper 404 "not found" error.
To make the page work properly, you've got a few choices. The easiest is to parse the HTML of the page and insert a <base href="..." /> tag into the document's head block. This tells any viewing browser that relative links within the document should be fetched from this 'base' source (e.g. Google).
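As a rough sketch of that first option, the function below inserts a <base> tag right after the opening <head> tag of the scraped HTML. The function name insertBaseTag and the fallback behaviour when no <head> is found are my own assumptions, not part of any library, and a real page may need a proper HTML parser instead of a regex.

```php
<?php
/**
 * Insert a <base> tag directly after the opening <head> tag.
 * $baseUrl is assumed to be the site the HTML was scraped from.
 * (Hypothetical helper for illustration only.)
 */
function insertBaseTag(string $html, string $baseUrl): string
{
    $tag = '<base href="' . htmlspecialchars($baseUrl, ENT_QUOTES) . '" />';

    // Match only the first opening <head ...> tag, case-insensitively.
    // A callback is used so "$" or "\" in the URL is never treated as a
    // replacement reference.
    $patched = preg_replace_callback(
        '/<head\b[^>]*>/i',
        function (array $m) use ($tag): string {
            return $m[0] . $tag;
        },
        $html,
        1,
        $count
    );

    // No <head> found (or a regex error): fall back to prepending the tag.
    return ($patched !== null && $count > 0) ? $patched : $tag . $html;
}
```

With the script from the question, this would be used as `echo insertBaseTag($result, 'http://www.redbox.com/');` in place of `echo $result;`.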
A harder option is to parse the document, rewrite every reference to an external file (images, CSS, JS, etc.), and put in the URL of the originating server, so the user's browser goes to the original site and fetches the files from there.
The hardest option is to essentially set up a proxy server, and if a request comes in for a file that doesn't exist on your server, to try and fetch the corresponding file from Google via curl and output it to the user.
If the site you're loading uses relative paths for its resource URLs (i.e. /images/whatever.gif instead of http://www.site.com/images/whatever.gif), you're going to need to rewrite those URLs in the source you get back, since cURL won't do that itself. Wget (official site seems to be down) does, and will even download and mirror the resources for you, but it does not provide PHP bindings.
So, you need to come up with a methodology to scrape through the resulting source and change relative paths into absolute paths. A naive way would be something like this:
// Prefix every src="..." value that does not already start with http:// or https://
$result = preg_replace('/src="(?!https?:\/\/)([^"]*)"/i', 'src="' . $MY_BASE_URL . '$1"', $result);
where $MY_BASE_URL is the base URL you want to rewrite, i.e. http://www.mydomain.com. That won't work for everything, but it should get you started. It's not an easy thing to do, and you might be better off just spawning off a wget command in the background and letting it mirror or rewrite the HTML for you.
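If you'd rather not rely on a regex, here is a minimal sketch of the same idea using PHP's DOMDocument. The function name absolutizeResources is hypothetical, the tag list (img, script, link) is just a starting point, and it deliberately simplifies by treating every relative path as relative to the site root, which is only right for root-relative URLs like /images/whatever.gif.

```php
<?php
/**
 * Rewrite relative src/href attributes of common resource tags to absolute URLs.
 * $baseUrl is assumed to be the origin of the scraped site, e.g. http://www.site.com
 * Simplification: all relative paths are resolved against the site root.
 */
function absolutizeResources(string $html, string $baseUrl): string
{
    $doc = new DOMDocument();
    // Scraped HTML is rarely valid, so collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $targets = ['img' => 'src', 'script' => 'src', 'link' => 'href'];
    foreach ($targets as $tagName => $attr) {
        foreach ($doc->getElementsByTagName($tagName) as $node) {
            $value = $node->getAttribute($attr);
            // Skip empty values, URLs with a scheme (http:, data:, ...) and protocol-relative URLs.
            if ($value === '' || preg_match('#^(?:[a-z][a-z0-9+.-]*:|//)#i', $value)) {
                continue;
            }
            $node->setAttribute($attr, rtrim($baseUrl, '/') . '/' . ltrim($value, '/'));
        }
    }
    // Note: saveHTML() returns a normalised document, so whitespace and markup may differ from the input.
    return $doc->saveHTML();
}
```

This needs the DOM extension, which is enabled in most PHP builds.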
Try obtaining the images by having the raw output returned, with the CURLOPT_BINARYTRANSFER option set to true, as below:
curl_setopt($ch,CURLOPT_BINARYTRANSFER, true);
I've used this successfully to obtain images and audio from a webpage.
Related
Facts: I run a simple website that contains articles, articles dynamically acquired by scraping third-party websites/blogs etc (new articles arrive to my website every half an hour or so), articles which I wish to post on my facebook page. Each article typically includes an image, a title and some text.
Problem: Most (almost all) of the articles that I post on Facebook are not posted correctly - their images are missing.
Inefficient Solution: I submit an article's URL (the URL from my website, not the original source's URL) to Facebook's debugger, and Facebook then scans/scrapes the URL and correctly extracts the needed information (image, title, text etc). After this action, the article can be posted on Facebook correctly, with no missing images or anything.
Goal: What I am after is a way to create a process which will submit a URL to Facebook's debugger, thus forcing Facebook to scan/scrape the URL so that it can then be posted correctly. I believe that what I need to do is to create an HTTP POST request containing the URL and submit it to Facebook's debugger. Is this the correct way to go? And if yes, as I have no previous experience with cURL, what is the correct way to do it using cURL in PHP?
Side Notes: As a side note, I should mention that I am using short URLs for my articles, although I do not think that this is the cause of the problem because the problem persists even when I use the canonical URLs.
Also, the Open Graph meta tags are correctly set (og:image, og:description, etc).
You can debug a graph object using the Facebook Graph API with PHP cURL by doing a POST to
https://graph.facebook.com/v1.0/?id={Object_URL}&scrape=1
To make things easier, we can wrap our debugger in a function:
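function facebookDebugger($url) {
    $ch = curl_init();
    // scrape=1 makes Facebook re-scrape the URL and refresh its cached data
    curl_setopt($ch, CURLOPT_URL, 'https://graph.facebook.com/v1.0/?id=' . urlencode($url) . '&scrape=1');
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Disables certificate verification; remove this line if your server has a working CA bundle
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    $r = curl_exec($ch);
    curl_close($ch);
    return $r;
}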
Although this updates and clears Facebook's cache for the passed URL, it's a bit hard to print out each key and its content while avoiding errors at the same time, so I recommend using var_dump(), print_r(), or PHP-ref.
Usage with PHP-ref:
r( facebookDebugger('http://retrogramexplore.tumblr.com/') );
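If you'd rather print the keys yourself, one possible approach is to decode the response with json_decode() and flatten it defensively. This is only a sketch: the function name summarizeGraphResponse is made up, and I'm assuming the endpoint answers with a JSON object without assuming which keys it contains.

```php
<?php
/**
 * Turn the raw string returned by facebookDebugger() into "key: value" lines.
 * Makes no assumption about which keys are present; non-string values are
 * re-encoded as JSON so nothing is silently dropped.
 * (Hypothetical helper for illustration only.)
 */
function summarizeGraphResponse($raw): array
{
    // curl_exec() returns false on failure, so handle that before decoding.
    if ($raw === false || $raw === '') {
        return ['error: empty response (the cURL request itself failed)'];
    }

    $data = json_decode($raw, true);
    if (!is_array($data)) {
        return ['error: could not decode the response as a JSON object (' . json_last_error_msg() . ')'];
    }

    $lines = [];
    foreach ($data as $key => $value) {
        $printable = is_string($value) ? $value : json_encode($value, JSON_UNESCAPED_SLASHES);
        $lines[] = $key . ': ' . $printable;
    }
    return $lines;
}
```

For example, `echo implode("\n", summarizeGraphResponse(facebookDebugger($url)));` prints one line per top-level key.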
Yeah, I'm stumped. I'm getting nothing. curl_exec is returning no content. I've tried file_get_contents, but that completely times out. I'm attempting to get an API XML from my Subsonic media server and display it on my web server (different servers). The end result would be that I can have people log in to my web server with the media server account. I can deal with the actual parsing later, but I can't even grab the XML right now. I've tried their forums, but haven't gotten much help since they're not really PHP inclined. Figure I'd ask here.
$url = "http://{$subserver}/rest/getUser.view?u={$username}&p={$password}&username={$username}&v=1.8.0&c={$appID}";
$c = curl_init($url);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_HEADER, 0);
$result = curl_exec($c);
curl_close($c);
echo $result;
This returns nothing. The variables are defined correctly, and I get the same response as if I typed in the whole URL. Here is their API page: http://www.subsonic.org/pages/api.jsp I've even tried with their "ping" function - still empty
The url itself looks fine. In the web browser, it returns:
This XML file does not appear to have any style information associated with it. The document tree is shown below.
<subsonic-response xmlns="http://subsonic.org/restapi" status="ok" version="1.8.0">
<user username="xxxxxx" email="xxxxxx#xxxxxx.com" scrobblingEnabled="false" adminRole="true" settingsRole="true" downloadRole="true" uploadRole="true" playlistRole="true" coverArtRole="true" commentRole="true" podcastRole="true" streamRole="true" jukeboxRole="true" shareRole="true"/>
</subsonic-response>
I admit I've never used XML, but according to everything I've read... this should work. And it does work, with other random XML files I found on the web.
It might have something to do with the fact that it's not an ".xml" file but XML generated from a URL, as this exact same code works with a random XML file I found ( http://www.w3schools.com/xml/note.xml ).
Any thoughts?
We are developing a web app that enables the end user to place a bit of code on their webpage, so that anyone who visits their page will see a little pop up button on the edge of the browser window. When clicked, it opens a small panel where they can enter a telephone number for the website owner to call them back.
I am running up against a security issue. I am attempting to use a server-side include to place the button on the client's website. However, because the included file is on a different domain than the client's website, it is not allowed.
I have tried these two methods that I got from online forums, neither of which worked for me.
Use the file_get_contents handler, like this:
$includeFile =
file_get_contents("http://bizzocall.com/subdomains/dev/httpdocs/slideouttesttopDATA.php");
echo $includeFile;
ERROR failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found.
Use curl, like this:
function curl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$feed = 'http://bizzocall.com/subdomains/dev/httpdocs/slideouttesttopDATA.php';
$bizzobtn = curl($feed);
echo $bizzobtn;
This didn't work either.
ERROR The requested URL /subdomains/dev/httpdocs/slideouttesttopDATA.php was not found on this server.
It looks like it's truncating the bizzocall.com/ off the URL. Perhaps if I knew how to write this chunk of code correctly, this would work.
Any help here would be welcome!
The 404 error is coming from the remote server, not your own. Try to reach that URL - You'll get the same 404.
When I visit http://bizzocall.com/subdomains/dev/httpdocs/slideouttesttopDATA.php, I get the same 404 Not Found response.
Therefore, the issue is that you're using the wrong URL. Your code is sound, however.
Your curl code is working fine, the problem is the URL isn't found.
Put it in your browser to see.
If you own bizzocall.com, you'll want to put slideouttesttopDATA.php in a web accessible area, then you'll be able to access it with curl.
Visit your link in a browser and you'll see that the resource you're trying to request clearly doesn't exist. More info on the HTTP 404 status code can be found here: http://en.wikipedia.org/wiki/HTTP_404.
Perhaps it should be http://dev.bizzocall.com/slideouttesttopDATA.php as your URL does not work when pasted into a browser...
From within the HTML code in one of my server pages I need to address a search of a specific item on a database placed in another remote server that I don’t own myself.
Example of the search type that performs my request: http://www.remoteserver.com/items/search.php?search_size=XXL
The remote server provides to me - as client - the response displaying a page with several items that match my search criteria.
I don’t want to have this page displayed. What I want is to collect into a string (or local file) the full contents of the remote server's HTML response (the code we have access to when we click on ‘View Source’ in my IE browser client).
If I collect that data (it could easily reach 50000 bytes) I can then filter out the part in which I am interested (substrings) and assemble a new request to the remote server for only one of the specific items in the response provided.
Is there any way through which I can get HTML from the response provided by the remote server with Javascript or PHP, and also avoid the display of the response in the browser itself?
I hope I have not confused your minds …
Thanks for any help you may provide.
As @mario mentioned, there are several different ways to do it.
Using file_get_contents():
$txt = file_get_contents('http://www.example.com/');
echo $txt;
Using php's curl functions:
$url = 'http://www.mysite.com';
$ch = curl_init($url);
// Tell curl_exec to return the text instead of sending it to STDOUT
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
// Don't include return header in output
curl_setopt($ch, CURLOPT_HEADER, 0);
$txt = curl_exec($ch);
curl_close($ch);
echo $txt;
cURL is probably the most robust option, because it gives you more control over the exact request parameters and lets you handle errors when things don't go as planned.
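To illustrate the error-handling side, here is a small sketch that wraps the cURL calls above and throws when the transfer fails or the server answers with an error status. The function name fetchOrFail, the 10-second default timeout and the choice to treat every status of 400 or above as a failure are assumptions made for this example.

```php
<?php
/**
 * Fetch a URL and return the response body as a string.
 * Throws a RuntimeException describing what went wrong otherwise.
 * (Hypothetical helper; timeout and status policy are illustrative choices.)
 */
function fetchOrFail(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);

    $body   = curl_exec($ch);
    // Read the diagnostics while the handle is still alive; it is released
    // automatically when it goes out of scope at the end of the function.
    $errno  = curl_errno($ch);
    $error  = curl_error($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false) {
        // Transport-level failure: DNS, connection refused, timeout, bad scheme, ...
        throw new RuntimeException("cURL error $errno: $error");
    }
    if ($status >= 400) {
        // The request went through, but the server reported an error (e.g. 404).
        throw new RuntimeException("HTTP $status returned for $url");
    }
    return $body;
}
```

Wrap the call in try/catch and log or display `$e->getMessage()` to see why a request came back empty.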
I have a form on my site which sends data to some remote site - simple html form.
What I want to do is use the data the user enters into the form for statistical purposes.
So instead of sending the data to the remote page, I send it first to my script, which resends it to the remote site.
The thing is, I need it to behave exactly the way the usual form would, taking the user to the remote site and displaying its resources.
When I use this code it kind of works, but not in the way I want it to:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $action);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$result = curl_exec($ch);
curl_close($ch);
The problem is that it displays the response in the same script. For example, if $action is
somesite.com/processform.php and my script name is myscript.php, it displays the response of "somesite.com/processform.php" inside "myscript.php", so all the relative links stop working.
How do I make it send the user to "somesite.com/processform.php", the same thing that pressing the button would do?
Leonti
I think you will have to do this on your end, as translating relative paths is the client's job. It should be simple: Just take the base directory of the request you made
http://otherdomain.com/my/request/path.php
and add it in front of every outgoing link that does not begin with "/" or a protocol ("http://", "ftp://").
Detecting all the outgoing links is hard, but I am 100% sure there are ready-made PHP classes that do that. Check for example this article and the getLinks() function in the user comments. I am not 100% sure whether this is what you need, but it certainly goes in the right direction.
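For the prefixing part itself, a minimal sketch could look like the following. The function name absolutizeLink is made up for this example. It goes slightly beyond the description above by also resolving links that begin with "/" against the remote host, and it does not handle "../" segments or links consisting only of a query string or fragment.

```php
<?php
/**
 * Turn a link found in a fetched page into an absolute URL.
 * $requestUrl is the URL that was fetched, e.g. http://otherdomain.com/my/request/path.php
 * and is assumed to be a well-formed absolute URL.
 * (Hypothetical helper for illustration only.)
 */
function absolutizeLink(string $link, string $requestUrl): string
{
    // Already has a protocol (http://, ftp://, mailto:, ...): leave it untouched.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return $link;
    }

    $parts  = parse_url($requestUrl);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative link (//host/path): only the scheme is missing.
    if (strpos($link, '//') === 0) {
        return $parts['scheme'] . ':' . $link;
    }
    // Root-relative link (/path): resolve against the remote host.
    if (strpos($link, '/') === 0) {
        return $origin . $link;
    }
    // Path-relative link: prepend the base directory of the requested file.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $link;
}
```

You would call this for every link you detect in the response before echoing the page.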
Here are a couple of possible solutions, which I post separately so they don't get mixed up with the one I recommend:
1 - keep using cURL, parse the response and add a <base/> tag to it. It should work for pretty much everything on that page.
<base href="http://realsite.com/form_url.php" />
2 - do not alter the submit URL. Submit the form to the real URL, but capture its content using some Javascript library (YUI does that) and send it to your script via XHR. It's still kind of hacky though.
There are several ways to do that. Here's one of the easiest: just use a 307 redirect.
header('Location: http://realsite.com/form_url.php', true, 307);
You can do your logging and stuff either before or after header() but if you do it after calling header() you will need to start your script with
ignore_user_abort(true);
Note that browsers are supposed to notify the user that their form is being redirected.
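Put together, the intermediate script could look roughly like this sketch, which logs first and redirects afterwards, so ignore_user_abort() is not needed. The function names, the log file location and the JSON-lines log format are all assumptions made for illustration.

```php
<?php
/**
 * Append one form submission to a log file as a single JSON line.
 * $logFile and the logged structure are hypothetical choices for this sketch.
 */
function logSubmission(array $fields, string $logFile): bool
{
    $entry = json_encode(['time' => date('c'), 'fields' => $fields]);
    // LOCK_EX avoids interleaved writes when two requests arrive together.
    return file_put_contents($logFile, $entry . "\n", FILE_APPEND | LOCK_EX) !== false;
}

/**
 * Log the POSTed fields, then send the browser on to the real form handler.
 */
function logAndRedirect(array $post, string $target, string $logFile): void
{
    logSubmission($post, $logFile);
    // 307 asks the browser to repeat the same POST, body included, at $target.
    header('Location: ' . $target, true, 307);
}

// Usage inside the intermediate script (commented out so the sketch has no side effects):
// logAndRedirect($_POST, 'http://realsite.com/form_url.php', __DIR__ . '/submissions.log');
// exit;
```

Because the browser itself resubmits the form to the real site, the user ends up on somesite.com and relative links work as usual.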