Is it possible to save all resources downloaded from an HTTP request with PHP?
For example: using curl, wget or similar to get all the files necessary to render the page in a browser, instead of only the HTML content of the page.
I don't want to extract all the links and then download each one with a separate curl call; I would like a way to do it only once. I assume it's possible, since in a browser I also only enter one URL to get all the resources.
Edit:
The point here is to simulate the browser behavior: how can I save an entire page with PHP? If it must be done in several steps, what logic should I follow?
I'm having a lot of trouble getting all the files from a page even after extracting the links, because I find it very hard to store the session data and reuse it for the follow-up curl requests.
You could use Ajax for that, save the resources as objects and show them when needed. Does that work for you?
Related
I'm trying to capture some images from an old database.
When writing scrapers, I use ruby (but am comfortable with php as well) to directly open() a website and read its contents. I sometimes also use the script to call the appropriate curl ... command.
However, the database I'm scraping some pieces out of returns a page and then embeds the target image with a name made up of a series of random numbers, generated, I assume, by the server-side script. For example:
<img ... show_image.jsp?343523.jpg
However, I cannot call this show_image script directly (denied); it only works when embedded in the website as a whole.
Can I use curl, or do something within ruby or php, to download the entire page, for example 1929.2.14.aspx, in such a way that it includes the embedded image generated by show_image.jsp?343523.jpg?
If I simply curl the aspx file directly, I naturally just get the HTML. How might one save both the HTML and the embedded image via scripting, the way a browser's "web archive" feature works manually?
Any tips, links to tutorials, etc. appreciated...
You should probably be using Mechanize to scrape websites in Ruby. When you do, it will set cookies and the referer for you, so getting the image will be as easy as:
agent.get(image_url).save_as 'local_filename.jpg'
If the script (show_image.jsp, for example) is doing a simple referrer check, you may be able to work around it by writing your PHP (or Ruby) scraper so that it sets the referrer before making the GET:
curl --referer http://www.example.com http://www.example.com/show_image.jsp?bar.jpg
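In PHP, the same idea might look something like this (a rough sketch; the URLs and output filename are placeholders):

<?php
// Sketch: fetch the image while sending a Referer header, assuming the
// server only checks the referring page. URLs below are placeholders.
$ch = curl_init('http://www.example.com/show_image.jsp?bar.jpg');
curl_setopt($ch, CURLOPT_REFERER, 'http://www.example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
curl_close($ch);
file_put_contents('local_filename.jpg', $data);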
Hi,
I download a large number of files for data mining. I used to use PHP for this, but I'm finding it too slow. Also, I only want a small part of each web page. I want to achieve two things:
Curl should be able to utilize all my download bandwidth
Is there any way to download only the part of the web page where my data resides?
I am not confined to PHP. If curl works better in terminal I would use that.
Yes, you can download only a part of the page by using the CURLOPT_RANGE option, and you can also provide a write callback function that simply returns an error when you've received "enough" data and you want to stop and move on.
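Roughly something like this (a sketch, assuming the server honours Range requests; the URL and the 4 KB limit are placeholders):

<?php
// Sketch: request only the first 4096 bytes via a Range header, and abort
// the transfer from the write callback once "enough" data has arrived.
$buffer = '';
$ch = curl_init('http://www.example.com/big-page.html'); // placeholder URL
curl_setopt($ch, CURLOPT_RANGE, '0-4095'); // first 4096 bytes only
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $data) use (&$buffer) {
    $buffer .= $data;
    if (strlen($buffer) >= 4096) {
        return -1; // returning anything other than strlen($data) aborts the transfer
    }
    return strlen($data);
});
curl_exec($ch); // returns false when the callback aborts; $buffer holds the data
curl_close($ch);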
Are you downloading HTML? Your comment leads me to believe that you are. If that's the case, simply load the HTML with PHP Simple HTML DOM and take only the part that you want. That said, I find it hard to believe that grabbing just the HTML is slowing you down. Are you downloading any files or media as well?
Link: http://simplehtmldom.sourceforge.net/
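For example, something along these lines (a sketch using that library; the URL and selector are placeholders):

<?php
// Sketch: parse the downloaded HTML and keep only the fragment you care about.
include 'simple_html_dom.php'; // the library linked above
$html = file_get_html('http://www.example.com/page.html'); // placeholder URL
$part = $html->find('div#content', 0); // placeholder selector for the data you need
echo $part ? $part->plaintext : '';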
There is no way to download only part of a page. When you request a URL, the server response is what it is.
Utilize more of your bandwidth by using cURL's ability to make multiple connections at once.
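A minimal sketch of that with curl_multi (the URLs are placeholders):

<?php
// Sketch: fetch several URLs in parallel to make better use of the bandwidth.
$urls = ['http://example.com/a', 'http://example.com/b']; // placeholder URLs
$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running && $status == CURLM_OK);
$pages = [];
foreach ($handles as $url => $ch) {
    $pages[$url] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);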
I'm working on a site in Drupal which will reside at http://sub.clientdomain.com, pulling in a static html fragment from a server at http://clientdomain.com/path/to/fragment.html, which I need to display in a page to provide a parent menu for the site.
I'm working on a VPS, so I have access to curl, wget and pretty much anything I want to install.
What's the optimal way to do this?
My first approach here would be to use curl on a cron job every n minutes to pull in the html fragment, and then store it somewhere like /var/www/html/site, to save making the http request on each load.
I'd then use the file_get_contents() function to pull the content into the page, with a fallback in case the file isn't there.
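Something like this is what I have in mind (a rough sketch; the path and fallback markup are just placeholders):

<?php
// Sketch: include the cron-fetched fragment, with a fallback if it's missing.
$fragment_path = '/var/www/html/site/fragment.html'; // written by the cron job
if (is_readable($fragment_path)) {
    echo file_get_contents($fragment_path);
} else {
    echo '<!-- fallback: fragment not available -->'; // placeholder fallback markup
}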
I'm adding this here mainly as a sanity check - would love to see other approaches to it.
Your first approach sounds like a good idea. By using a cronjob, the pulling of the content is separated from your site, so the user doesn't have to wait.
If you have access to the page that generates the fragment, you could reduce net traffic by actively sending the new content instead of requesting it every couple of minutes.
I'd store the fragment in a database, though, as I believe database-queries are faster than access to the filesystem.
If that main domain is on the same server, you will probably have direct access to it, so just use a simple include and point to the static html file by its absolute path.
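For example (the path below is just a placeholder for wherever the main site keeps the fragment):

<?php
// Sketch: if both sites live on the same server, include the file directly.
include '/var/www/clientdomain/path/to/fragment.html'; // placeholder path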
Currently I've got a web app that retrieves the URL of an mp3 on an external server, but to conserve data I'd like to check first whether the response my server is retrieving is actually a redirect rather than the content itself (so I can grab the URL of the mp3 and NOT the actual mp3 itself).
The external PHP script requires JSON data to be POSTed to it, which makes it hard to have the client do the request itself.
The problem is that although the external PHP script usually redirects me to a standard URL to GET from, sometimes it returns the actual mp3 itself in the body, using up my bandwidth rather than the user's.
What would be the best solution to fix this to make me not waste my bandwidth?
Thanks.
The best solution would be to use the HTTP verb HEAD.
From RFC 2616: "The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response."
However, the question is, does the remote server support HEAD?
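If it does, a HEAD request from PHP might look roughly like this (a sketch; the URL is a placeholder, and CURLINFO_REDIRECT_URL needs a reasonably recent PHP/cURL):

<?php
// Sketch: send a HEAD request and inspect the redirect target without
// downloading the body. The URL is a placeholder.
$ch = curl_init('http://external.example.com/get_mp3.php'); // placeholder URL
curl_setopt($ch, CURLOPT_NOBODY, true);          // makes cURL issue a HEAD request
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // keep the 30x response
curl_exec($ch);
$code     = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$location = curl_getinfo($ch, CURLINFO_REDIRECT_URL); // PHP >= 5.3.7
curl_close($ch);
if ($code >= 300 && $code < 400) {
    // $location holds the mp3's URL; hand that to the client instead of proxying the file
}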
You have a forum (vBulletin) that has a bunch of images. How easy would it be to have a page that visits a thread, steps through each page and forwards the images to the user (via Ajax or whatever)? I'm not asking about filtering (that's easy, of course).
doable in a day? :)
I have a site that uses CodeIgniter as well - would it be even simpler using that?
Assuming this is to be carried out on the server, curl + regexp are your friends... and yes, doable in a day.
There are also some open-source HTML parsers that might make this cleaner.
It depends on where your scraping script runs.
If it runs on the same server as the forum software, you might want to access the database directly and check for image links there. I'm not familiar with vbulletin, but probably it offers a plugin api that allows for high level database access. That would simplify querying all posts in a thread.
If, however, your script runs on a different machine (or, in other words, is unrelated to the forum software), it would have to act as a http client. It could fetch all pages of a thread (either automatically by searching for a NEXT link in a page or manually by having all pages specified as parameters) and search the html source code for image tags (<img .../>).
Then a regular expression could be used to extract the image urls. Finally, the script could use these image urls to construct another page displaying all these images, or it could download them and create a package.
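The extraction step might look something like this (a sketch; the regex is deliberately simple, and an HTML parser would be more robust):

<?php
// Sketch: pull image URLs out of a fetched thread page with a simple regex.
$html = file_get_contents('http://forum.example.com/thread?page=1'); // placeholder URL
preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches);
$image_urls = $matches[1]; // every src attribute found in the page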
In the second case the script actually acts as a "spider", so it should respect things like robots.txt or meta tags.
When doing this, make sure to rate-limit your fetching. You don't want to overload the forum server by requesting many pages per second. Simplest way to do this is probably just to sleep for X seconds between each fetch.
Yes, doable in a day.
Since you already have a working CI setup, I would use it.
I would use the following approach:
1) Make a model in CI capable of:
logging in to vbulletin (images are often added as attachments and you need to be logged in before you can download them). Use something like snoopy.
collecting the URL for the "last page" button using preg_match(), parsing the URL with parse_url() and parse_str(), and generating links from page 1 to the last page (see the sketch after this list)
collecting html from all generated links. Still using snoopy.
finding all images in html using preg_match_all()
downloading all images. Still using snoopy.
moving the downloaded image from a tmp directory into another directory, renaming it imagename_01, imagename_02, etc. if the same image name already exists.
saving the image name and precise bytesize in a db table. Then you can avoid downloading the same image more than once.
2) Make a method in a controller that collects all images
3) Set up a cronjob that collects images at regular intervals. wget -o /tmp/useless.html http://localhost/imageminer/collect should do nicely.
4) Write the code that outputs pretty HTML for the end user, using the db table to get the images.
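For the pagination step in 1), a sketch of the parse_url() / parse_str() idea (the URL format below is only a guess at how vbulletin builds its page links):

<?php
// Sketch: take the URL behind the "last page" button and generate every page URL.
$last_url = 'http://forum.example.com/showthread.php?t=123&page=17'; // placeholder format
$parts = parse_url($last_url);
parse_str($parts['query'], $query);
$last_page = (int) $query['page'];
$page_urls = [];
for ($i = 1; $i <= $last_page; $i++) {
    $query['page'] = $i;
    $page_urls[] = $parts['scheme'] . '://' . $parts['host'] . $parts['path']
                 . '?' . http_build_query($query);
}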