I'm trying to capture some images from an old database.
When writing scrapers, I use Ruby (though I'm comfortable with PHP as well) to open() a website directly and read its contents. Sometimes I also have the script call the appropriate curl ... command.
However, the database I'm scraping pieces out of returns a page and then embeds the target image under a file name made of what I assume are random numbers generated by the server-side script. For example:
<img ... show_image.jsp?343523.jpg
However, I cannot call this show_image script directly (access is denied); it only works when the image is embedded in the page as a whole.
Can I use curl, or do something within Ruby or PHP, to download the entire page (for example, 1929.2.14.aspx) in such a way that it includes the embedded image generated by show_image.jsp?343523.jpg?
If I simply curl the .aspx file directly, I naturally just get the HTML. How might one save both the HTML and the embedded image via scripting, the way a browser's "web archive" feature does manually?
Any tips, links to tutorials, etc. appreciated...
You should probably be using mechanize to scrape websites in Ruby. When you do, it will set cookies and the referer for you, so getting the image will be as easy as:
agent.get(image_url).save_as 'local_filename.jpg'
If the script (show_image.jsp, for example) is doing a simple referrer check, you may be able to work around it by writing your PHP (or Ruby) scraper so that it sets the referrer before making the GET:
curl --referer http://www.example.com http://www.example.com/show_image.jsp?bar.jpg
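The same idea in PHP with cURL could look roughly like this; a minimal sketch, assuming the script only checks the Referer header, with placeholder URLs and an optional cookie jar in case a session cookie is also required:
<?php
// Minimal sketch: fetch the protected image with a spoofed referer.
// The URLs are placeholders; adjust them to the real page and image.
$ch = curl_init('http://www.example.com/show_image.jsp?343523.jpg');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_REFERER, 'http://www.example.com/1929.2.14.aspx');
// Only needed if the site also checks a session cookie; reuse a cookie jar
// from an earlier request to the page itself.
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
$image = curl_exec($ch);
curl_close($ch);
if ($image !== false) {
    file_put_contents('local_filename.jpg', $image);
}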
I want the option of converting HTML to an image and showing the result to the user. I would be creating an $html variable with PHP, and instead of displaying it with echo $html, I want to display it as an image so the user can save the file if they need to.
I was hoping there would be something as simple as $image = convertHTML2Image($html); :p if that exists?!
Thanks!!
As @Pekka says, turning HTML code into an image is the job of a full-blown web browser.
If you want to do this sort of thing, you therefore need to have a script that does the following:
Opens the page in a browser.
Captures the rendered page from the browser as a graphic.
Outputs that graphic to your user.
Traditionally, this would have been a tough task, because web browsers are typically driven by the user and not easy to automate in this way.
Fortunately, there is now a solution, in the form of PhantomJS.
PhantomJS is a headless browser, designed for exactly this kind of thing -- automated tasks that require a full-blown rendering engine.
It's basically a full browser, but without the user interface. It renders page content exactly as another browser would (it's based on WebKit, so results are similar to Chrome's), and it can be controlled by a script.
As it says on the PhantomJS homepage, one of its target use-cases is for taking screenshots or thumbnail images of websites.
(another good use for it is automated testing of your site, where it is also a great tool)
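For instance, a rough PHP wrapper might shell out to PhantomJS and stream the screenshot back; this is only a sketch and assumes the phantomjs binary and its bundled rasterize.js example script are available on the server:
<?php
// Sketch: render a URL to a PNG with PhantomJS and send it to the user.
// Assumes phantomjs is installed and rasterize.js (shipped with PhantomJS's
// examples) is reachable from this directory.
$url    = 'http://www.example.com/page-to-capture';
$output = tempnam(sys_get_temp_dir(), 'shot') . '.png';

$cmd = sprintf('phantomjs rasterize.js %s %s', escapeshellarg($url), escapeshellarg($output));
exec($cmd, $ignored, $status);

if ($status === 0 && is_file($output)) {
    header('Content-Type: image/png');
    readfile($output);
    unlink($output);
}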
Hope that helps.
This is not possible in pure PHP.
What you call "converting" is in fact a huge, non-trivial task: the HTML page has to be rendered. To do this in PHP, you'd have to rewrite an entire web browser.
You'll either have to use an external tool (which usually taps into a browser's rendering engine) or a web service (which does the same).
It is possible to convert HTML to an image. However, you must first convert it to PDF; see the link.
You may have a look at dompdf, which is a PHP framework for converting an HTML file to a PDF.
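A minimal sketch of how that looks with dompdf's Composer-based API (older releases use load_html()/stream() on a DOMPDF class instead, so adjust to your version):
<?php
// Minimal dompdf sketch, assuming installation via Composer.
require 'vendor/autoload.php';

use Dompdf\Dompdf;

$html = '<h1>Hello</h1><p>Rendered by dompdf.</p>';

$dompdf = new Dompdf();
$dompdf->loadHtml($html);
$dompdf->setPaper('A4');
$dompdf->render();
$dompdf->stream('document.pdf');   // sends the generated PDF to the browser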
Use wkhtmltopdf. It works like a charm and converts any page to PDF;
a JPEG can then be obtained with a further conversion step.
http://code.google.com/p/wkhtmltopdf/
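From PHP you can simply shell out to it; the ImageMagick convert call below is one possible "later operation" for getting a JPEG, and the sketch assumes both binaries are installed:
<?php
// Sketch: convert a page to PDF with wkhtmltopdf, then to JPEG with
// ImageMagick's convert (needs Ghostscript; multi-page PDFs produce
// page-0.jpg, page-1.jpg, ...).
$url = 'http://www.example.com/';

exec(sprintf('wkhtmltopdf %s page.pdf', escapeshellarg($url)));
exec('convert -density 150 page.pdf page.jpg');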
I'm building a web app where users can build custom web pages that pull content from other web pages. I know of a few options for doing this, and I'm not sure which is best, and if there are better solutions out there. Right now, I could:
Use iframes, which will (sort of) accomplish what I want, but will force the client to download and render all the web content, which seems slow. I've also heard a lot of people say iframes are passé and should not be used.
Use a library like wkhtmltopdf, which will render the html on the server side and generate a pdf image of it. This would work nicely, but the result is just an image, so text won't be selectable, links won't be clickable, etc. Also, I've heard that you can get in legal trouble for hosting other people's web content on your site without permission.
Use something like phpquery to literally scrape content off of other sites. This option could have the same legal issues as the above option.
Has anyone done anything like this, or does anyone have any thoughts?
The cleanest solution would be to send off an HTTP request server side, then render the HTML into your page as you require. This will also require changing all the URLs of content and links to be absolute.
e.g.:
<img src="\images\banner.png">
will work on the remote server, but once inside your page the image will not exist. The most workable solution would be to limit the functionality to images and links, then do a find/replace with a regex to match relative URLs and prepend the source address to them.
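A rough sketch of that fetch-and-rewrite step in PHP; the regex is deliberately simple (it only handles quoted src/href attributes and treats every relative URL as root-relative), so treat it as a starting point rather than a robust solution:
<?php
// Fetch a remote page and rewrite relative src/href attributes to absolute
// URLs pointing back at the source site. Placeholder URL below.
$base = 'http://www.example.com';
$html = file_get_contents($base . '/some/page.html');

$html = preg_replace_callback(
    '/(src|href)=(["\'])(?!https?:|\/\/)([^"\']+)\2/i',   // skip already-absolute URLs
    function ($m) use ($base) {
        return $m[1] . '=' . $m[2] . $base . '/' . ltrim($m[3], '/') . $m[2];
    },
    $html
);

echo $html;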
You will, however, run into legal issues if you are re-serving other people's content from your server, even just HTML.
Using an iframe would be the quick and dirty solution and probably has the fewest legal ramifications, as the browser sends a normal request to the site for the content.
I'd recommend DocRaptor for generating PDF files from HTML. It works in a similar fashion to wkhtmltopdf, but produces fully functional PDF files.
Here's a link to its homepage:
http://docraptor.com/
And a link to its API documentation:
http://docraptor.com/documentation
Hi,
I download a large number of files for data mining. I used to use PHP for this purpose, but I am finding it to be too slow. Also, I only want a small part of each web page. I want to achieve two things:
curl should be able to utilize all of my download bandwidth
Is there any way to download only the part of the web page where my data resides?
I am not confined to PHP. If curl works better in terminal I would use that.
Yes, you can download only a part of the page by using the CURLOPT_RANGE option, and you can also provide a write callback function that simply returns an error when you've received "enough" data and you want to stop and move on.
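A minimal sketch combining both ideas; the URL and the 64 KB cutoff are arbitrary examples, and CURLOPT_RANGE only helps if the server honours byte-range requests:
<?php
$limit  = 64 * 1024;   // stop after roughly 64 KB
$buffer = '';

$ch = curl_init('http://www.example.com/big-page.html');
curl_setopt($ch, CURLOPT_RANGE, '0-' . ($limit - 1));   // ask only for the first bytes
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $data) use (&$buffer, $limit) {
    $buffer .= $data;
    // Returning a value different from strlen($data) makes cURL abort the transfer.
    return strlen($buffer) < $limit ? strlen($data) : -1;
});
curl_exec($ch);   // reports an error if we aborted on purpose; that's expected
curl_close($ch);

echo substr($buffer, 0, 200);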
Are you downloading HTML? Your comment leads me to believe that you are. If that's the case, simply load up the HTML with PHP Simple HTML DOM and grab only the part that you want. Although, I find it hard to believe that grabbing just the HTML is slowing you down. Are you downloading any files or media as well?
Link : http://simplehtmldom.sourceforge.net/
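For example, a small sketch using the parser from that link; the file name and the div#content selector are assumptions, so adjust them to your page:
<?php
require_once 'simple_html_dom.php';

// Load the page and keep only the element that holds your data.
$html = file_get_html('http://www.example.com/page.html');
$content = $html->find('div#content', 0);

echo $content ? $content->innertext : 'not found';

$html->clear();   // free the memory used by the DOM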
There is no way to download only part of a page. When you request a URL, the server response is what it is.
Utilize more of your bandwidth by using cURL's ability to make multiple connections at once.
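A sketch of that with curl_multi; the URL list is illustrative, and you should keep the batch size modest so you don't hammer any single server:
<?php
$urls = array(
    'http://www.example.com/file1.html',
    'http://www.example.com/file2.html',
    'http://www.example.com/file3.html',
);

$mh = curl_multi_init();
$handles = array();

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Run all transfers in parallel until every handle has finished.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);
    }
} while ($running && $status === CURLM_OK);

foreach ($handles as $i => $ch) {
    file_put_contents("download_$i.html", curl_multi_getcontent($ch));
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);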
We download images to our computers when we open new web pages. For example: if a web page has an image (image.jpg), our computer downloads it while we are browsing that page.
Some web pages use AJAX methods. For example: you don't see an image in the page's source code, but your computer downloads an image anyway, because if you click a link on that page, AJAX will show that image...
Let me show an example:
<div id="ajax_will_load_image_here"></div>
Okay, how can PHP curl see (or download) that image? curl can't see that image when I try to use the preg_match function. The image is actually there, and I want to download it with PHP curl. Any advice?
If I understand the question correctly, there is no convenient way of doing that.
Your crawler/spider would have to parse the website and evaluate javascript.
There are libraries for that but support is very limited.
There are, however, methods where an actual browser is used to evaluate the page (without displaying it, but with a proper environment set up, such as screen resolution).
The generated source, including JavaScript DOM modifications, is then available.
This is for example how the google search previews are generated.
But if you require user interaction it gets pretty specific and complicated.
I am sorry to disappoint you, but using curl and preg_match the old-school way we did when JavaScript was not yet so common won't work.
However, for most legitimate use cases this is more than sufficient, and websites today are increasingly designed to work without JavaScript, especially content meant for crawling. That is a must for search engine optimization, and which website doesn't want that?
You have a forum (vBulletin) that has a bunch of images. How easy would it be to have a page that visits a thread, steps through each page, and forwards the images to the user (via AJAX or whatever)? I'm not asking about filtering (that's easy, of course).
Doable in a day? :)
I have a site that uses CodeIgniter as well - would it be even simpler using that?
Assuming this is to be carried out on the server, curl + regexp are your friends... and yes, doable in a day.
There are also some open-source HTML parsers that might make this cleaner.
It depends on where your scraping script runs.
If it runs on the same server as the forum software, you might want to access the database directly and check for image links there. I'm not familiar with vbulletin, but probably it offers a plugin api that allows for high level database access. That would simplify querying all posts in a thread.
If, however, your script runs on a different machine (or, in other words, is unrelated to the forum software), it would have to act as a http client. It could fetch all pages of a thread (either automatically by searching for a NEXT link in a page or manually by having all pages specified as parameters) and search the html source code for image tags (<img .../>).
Then a regular expression could be used to extract the image urls. Finally, the script could use these image urls to construct another page displaying all these images, or it could download them and create a package.
In the second case the script actually acts as a "spider", so it should respect things like robots.txt or meta tags.
When doing this, make sure to rate-limit your fetching. You don't want to overload the forum server by requesting many pages per second. Simplest way to do this is probably just to sleep for X seconds between each fetch.
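A small PHP sketch of that fetch-and-extract step for a single thread page; the URL and the regex are illustrative, and in a real loop you would sleep between page fetches as described above:
<?php
// Fetch one thread page and pull out the image URLs with a regex.
$page = file_get_contents('http://forum.example.com/showthread.php?t=12345');

preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $page, $matches);

foreach (array_unique($matches[1]) as $imageUrl) {
    echo $imageUrl, PHP_EOL;   // or download it, or add it to a gallery page
}

sleep(2);   // rate-limit before fetching the next page of the thread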
Yes, doable in a day.
Since you already have a working CI setup I would use it.
I would use the following approach:
1) Make a model in CI capable of:
logging in to vBulletin (images are often added as attachments, and you need to be logged in before you can download them). Use something like Snoopy.
collecting the URL for the "last" button using preg_match(), parsing the URL with parse_url() and parse_str(), and generating links from page 1 to the last page
collecting the HTML from all generated links. Still using Snoopy.
finding all images in the HTML using preg_match_all() (see the sketch after this list)
downloading all images. Still using Snoopy.
moving each downloaded image from a tmp directory into another directory, renaming it imagename_01, imagename_02, etc. if the same image name already exists
saving the image name and exact byte size in a db table, so you can avoid downloading the same image more than once
2) Make a method in a controller that collects all images
3) Set up a cronjob that collects images at regular intervals. wget -O /tmp/useless.html http://localhost/imageminer/collect should do nicely
4) Write the code that outputs pretty HTML for the end user, using the db table to get the images.
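As a rough illustration of the image-finding and downloading sub-steps from 1), here is a plain-cURL sketch (Snoopy would work the same way); the URLs, the target directory and the duplicate check are assumptions for the example:
<?php
// Fetch a page reusing the cookies from the vBulletin login.
function fetch($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

$html = fetch('http://forum.example.com/showthread.php?t=12345&page=1');

// Find all images in the HTML (the preg_match_all() step above).
preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $m);

foreach (array_unique($m[1]) as $src) {
    $data = fetch($src);
    $name = basename(parse_url($src, PHP_URL_PATH));

    // Here you would check the db table for an image with the same name and
    // byte size (strlen($data)) and skip it if it already exists.

    file_put_contents('/tmp/imageminer/' . $name, $data);
    sleep(1);   // be polite to the forum server
}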