Curl preg_match - php

Our computers download images whenever we open new web pages. For example, if a web page contains an image (image.jpg), our computer downloads it while we are browsing that page.
Some web pages use AJAX, though. For example, you don't see an image anywhere in the page's source code, yet your computer still downloads one, because if you click a link on that page, AJAX will display that image...
Let me show an example:
<div id="ajax_will_load_image_here"></div>
Okay, so how can PHP cURL see (or download) that image? cURL can't see it when I try to use the preg_match function, yet the image is actually there. I want to download that image using PHP cURL. Any advice?

If I understand the question correctly, there is no convenient way of doing that.
Your crawler/spider would have to parse the website and evaluate the JavaScript.
There are libraries for that, but support is very limited.
There are, however, methods where an actual browser is used to evaluate the page (without displaying it, but with a proper environment set up: resolution and so on).
The generated source, including JavaScript DOM modifications, is then available.
This is, for example, how the Google search previews are generated.
But if you require user interaction it gets pretty specific and complicated.
I am sorry to disappoint you, but using cURL and preg_match the old-school way, like we did when JavaScript was not yet so common, won't work.
However, for most legitimate use cases this is more than sufficient, and websites today are increasingly designed to work without JavaScript, especially the content meant for crawling. That is a must in search engine optimization, and which website doesn't want that?
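For completeness, here is a minimal sketch of that old-school cURL + preg_match approach, assuming a placeholder URL; it will only find images that appear in the raw HTML, so anything injected later by AJAX (like the div above) simply isn't there to match:

<?php
// Minimal sketch: fetch the raw HTML with cURL and pull out <img> sources.
// This only sees images present in the initial markup, not ones added by AJAX.
$url = 'http://example.com/some-page'; // placeholder URL

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

// Naive <img src="..."> regex: fine for a quick script, fragile on messy real-world HTML.
if (preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches)) {
    foreach ($matches[1] as $src) {
        echo $src, "\n"; // each image URL found in the static source
    }
}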

Related

Convert HTML to Image in PHP without shell

I want the option of converting HTML to an image and showing the result to the user. I would be creating an $html variable with PHP, and instead of displaying it with echo $html, I want to display it as an image so the user can save the file if they need to.
I was hoping there would be something as simple as $image = convertHTML2Image($html); :p if that exists?!
Thanks!!
As @Pekka says, turning HTML code into an image is the job of a full-blown web browser.
If you want to do this sort of thing, you therefore need to have a script that does the following:
Opens the page in a browser.
Captures the rendered page from the browser as a graphic.
Outputs that graphic to your user.
Traditionally, this would have been a tough task, because web browsers are typically driven by the user and not easy to automate in this way.
Fortunately, there is now a solution, in the form of PhantomJS.
PhantomJS is a headless browser, designed for exactly this kind of thing -- automated tasks that require a full-blown rendering engine.
It's basically a full browser, but without the user interface. It renders the page content exactly as another browser would (it's based on Webkit, so results are similar to Chrome), and it can be controlled by a script.
As it says on the PhantomJS homepage, one of its target use-cases is for taking screenshots or thumbnail images of websites.
(another good use for it is automated testing of your site, where it is also a great tool)
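If your stack is PHP, the usual way to drive it is to shell out to the phantomjs binary with a tiny capture script (note that this does involve running an external program, which the original question hoped to avoid). A rough sketch, assuming phantomjs is installed and on the PATH:

<?php
// Sketch: render an HTML string to a PNG by handing it to PhantomJS.
// Assumes the phantomjs binary is installed and available on the PATH.
$html = '<h1>Hello</h1><p>Rendered server-side.</p>';

$htmlFile = tempnam(sys_get_temp_dir(), 'page') . '.html';
$jsFile   = tempnam(sys_get_temp_dir(), 'cap') . '.js';
$pngFile  = tempnam(sys_get_temp_dir(), 'shot') . '.png';
file_put_contents($htmlFile, $html);

// Tiny PhantomJS script: open the local file and render it to an image.
$script = <<<JS
var page = require('webpage').create();
page.viewportSize = { width: 1024, height: 768 };
page.open('file://$htmlFile', function () {
    page.render('$pngFile');
    phantom.exit();
});
JS;
file_put_contents($jsFile, $script);

shell_exec('phantomjs ' . escapeshellarg($jsFile));

header('Content-Type: image/png');
readfile($pngFile); // send the rendered screenshot to the user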
Hope that helps.
This is not possible in pure PHP.
What you call "converting" is in fact a huge, non-trivial task: the HTML page has to be rendered. To do this in PHP, you'd have to rewrite an entire web browser.
You'll either have to use an external tool (which usually taps into a browser's rendering engine) or a web service (which does the same).
It is possible to convert HTML to an image; however, you must first convert it to PDF. See the link.
You may have a look at dompdf, which is a PHP library for converting an HTML file to a PDF.
Use wkhtmltopdf. It works like a charm and converts any page to PDF;
a JPEG can then be obtained from the PDF in a later step.
http://code.google.com/p/wkhtmltopdf/
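A rough sketch of that two-step route from PHP, assuming the wkhtmltopdf and ImageMagick convert binaries are installed (the URL and file paths are placeholders). The same project also ships wkhtmltoimage, which can skip the PDF step entirely:

<?php
// Sketch: HTML -> PDF with wkhtmltopdf, then first PDF page -> JPEG with ImageMagick.
// Assumes wkhtmltopdf and convert (plus Ghostscript for PDF input) are on the PATH.
$url = 'http://example.com/';          // page to capture (placeholder)
$pdf = '/tmp/page.pdf';
$jpg = '/tmp/page.jpg';

shell_exec('wkhtmltopdf ' . escapeshellarg($url) . ' ' . escapeshellarg($pdf));
shell_exec('convert -density 150 ' . escapeshellarg($pdf . '[0]') . ' ' . escapeshellarg($jpg));

// $jpg now holds a JPEG of the first page of the generated PDF.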

Capturing image generated dynamically when scraping a page

I'm trying to capture some images from an old database.
When writing scrapers, I use ruby (but am comfortable with php as well) to directly open() a website and read its contents. I sometimes also use the script to call the appropriate curl ... command.
However, the database I'm scraping pieces out of returns a page and then embeds the target image under a name made up of a series of random numbers, generated (I assume) by a server-side script. For example:
<img ... show_image.jsp?343523.jpg
However, I cannot call this show_image script directly (the request is denied); it only works when embedded in the website as a whole.
Can I use curl, or do something within Ruby or PHP, to download the entire page (for example, 1929.2.14.aspx) in such a way that it includes the embedded image generated by show_image.jsp?343523.jpg?
If I simply curl the aspx file directly, I naturally just get the HTML. How might one save both the HTML and the embedded image via scripting, the way a browser's "web archive" feature does manually?
Any tips, links to tutorials, etc. appreciated...
You should probably be using Mechanize to scrape websites in Ruby. When you do, it will set cookies and the referer for you, so getting the image will be as easy as:
agent.get(image_url).save_as 'local_filename.jpg'
If the script (show_image.jsp, for example) is doing a simple referer check, you may be able to work around it by writing your PHP (or Ruby) scraper so that it sets the referer before making the GET request:
curl --referer http://www.example.com http://www.example.com/show_image.jsp?bar.jpg
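The PHP cURL equivalent is just a matter of setting CURLOPT_REFERER, plus a cookie jar in case the script also checks the session. A sketch using the URLs from the question as placeholders:

<?php
// Sketch: fetch a referer-protected image with PHP cURL.
// Hit the page first so any session cookies are set, then request the image
// with the page itself as the referer.
$pageUrl  = 'http://www.example.com/1929.2.14.aspx';
$imageUrl = 'http://www.example.com/show_image.jsp?343523.jpg';
$cookies  = tempnam(sys_get_temp_dir(), 'cookies');

$ch = curl_init($pageUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookies);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookies);
curl_exec($ch);                      // establishes the session

curl_setopt($ch, CURLOPT_URL, $imageUrl);
curl_setopt($ch, CURLOPT_REFERER, $pageUrl);
$image = curl_exec($ch);             // raw image bytes
curl_close($ch);

file_put_contents('local_filename.jpg', $image);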

Best way to serve third party html on your site?

I'm building a web app where users can build custom web pages that pull content from other web pages. I know of a few options for doing this, but I'm not sure which is best, or whether there are better solutions out there. Right now, I could:
Use iframes, which will (sort of) accomplish what I want, but will force the client to download and render all the web content, which seems slow. I've also heard a lot of people say iframes are passé and should not be used, etc.
Use a library like wkhtmltopdf, which will render the html on the server side and generate a pdf image of it. This would work nicely, but the result is just an image, so text won't be selectable, links won't be clickable, etc. Also, I've heard that you can get in legal trouble for hosting other people's web content on your site without permission.
Use something like phpquery to literally scrape content off of other sites. This option could have the same legal issues as the above option.
Has anyone done anything like this, or does anyone have any thoughts?
The cleanest solution would be to send off an HTTP request server-side, then render the HTML into your page as you require. This will also require changing all the URLs of content and links to be absolute,
e.g.:
<img src="/images/banner.png">
will work on the remote server, but once inside your page the image will not exist. The most workable solution would be to limit the functionality to images and links, then do a find/replace with a regex that matches relative URLs and prepends the source address to them (see the sketch below).
You will, however, run into legal issues if you are re-serving other people's content from your server, even just the HTML.
Using an iframe would be the quick and dirty solution and probably has the fewest legal ramifications, as the browser sends a normal request to the site for the content.
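A minimal sketch of that fetch-and-rewrite idea in PHP, assuming you only care about src and href attributes with relative paths (a real implementation would be better served by an HTML parser than a regex):

<?php
// Sketch: pull a remote page server-side and absolutize relative src/href URLs
// so images and links keep working once the markup is embedded in your own page.
$base = 'http://www.example.com/';                 // remote site (placeholder)
$html = file_get_contents($base . 'some-page.html');

// Rewrite src="..." and href="..." values that don't already start with a scheme.
$html = preg_replace_callback(
    '/\b(src|href)=["\'](?!https?:\/\/|\/\/|#|mailto:)([^"\']+)["\']/i',
    function ($m) use ($base) {
        return $m[1] . '="' . $base . ltrim($m[2], '/\\') . '"';
    },
    $html
);

echo $html; // embed the rewritten markup in your page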
I'd recommend DocRaptor for generating PDF files from HTML. It works in a similar fashion as wkhtmltopdf, but produces fully functional PDF files.
Here's a link to its homepage:
http://docraptor.com/
And a link to its API documentation:
http://docraptor.com/documentation

How to take page screenshots like Google+

I noticed after using Google+'s feedback tool that it takes a screenshot and also lets you do things such as highlight and black out sections. I'm wondering how this can be achieved. Based on the fact that you can modify the DOM with highlights and black-outs, I assume it's taking the entire DOM and turning that into an image; however, I'm not sure how they're doing that part.
I know that PHP has a couple of functions, imagegrabscreen and imagegrabwindow, but they only work on Windows, so I have my doubts that this is what they're using.
So, my question is how are they turning the DOM into an image?
Google+ doesn't take the screenshot entirely on the client side. It sends the local (rendered) DOM to the server, renders it to an image, and returns the created image.
You can test this by adding a local image to the page (using Firebug), and then trying to create a feedback. That image won't be present.
JavaScript can read the DOM and render a fairly accurate representation of it using canvas. I have been working on a script which converts HTML into a canvas image, and today I decided to turn it into an implementation for sending feedback like you described.
The script allows you to create feedback forms which include a screenshot, created in the client's browser, along with the form. The screenshot is based on the DOM and as such may not be 100% accurate to the real representation, as it does not make an actual screenshot but builds one from the information available on the page.
It does not require any rendering from the server, as the whole image is created in the client's browser. The HTML2Canvas script itself is still in a very experimental state, as it does not parse nearly as many of the CSS3 attributes as I would want it to, nor does it have any support for loading CORS images even if a proxy were available.
Browser compatibility is still quite limited (not because more couldn't be supported; I just haven't had time to make it more cross-browser compatible).
For more information, have a look at the examples here:
http://hertzen.com/experiments/jsfeedback/
My guess is that they gather information about the page in question (highlighted blocks, etc.), render that page in an in-memory version of a web browser, and take a screenshot of it.
Edit
To clarify:
If the client is viewing http://some.page?someArg=someValue,
the server renders http://some.page?someArg=someValue in an in-memory browser, takes a screenshot, and sends the image to the client.

How do I extract images from Google Images based on user input and place them on a page?

I'm new to web programming, but I had an idea I could use as an instructional tool, and I was hoping I could get some guidance.
Here's my idea: I want to have a form that takes the data entered by the user, submits each word in the form to Google Images, and retrieves the first image returned by Google Image Search. Each image should then be pasted into the current document.
What language would I need for this (I read that doing it in JavaScript is disallowed due to cross-site scripting restrictions?), what kinds of topics would I need to learn, and what would the basic template look like for doing such a task?
Thanks.
Personally, I would have a PHP interface with a Python script that does the work. Have a look at urllib in Python. You could also try cURL within PHP and just construct the URL yourself. If memory serves, however, Google likes you to actually be on their website so you see their adverts; I'm pretty sure they block requests that look like they come from scripts / web hosts.
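For what it's worth, here is a bare-bones PHP cURL sketch of that idea. The query URL and the regex are illustrative only: Google's markup changes frequently and automated requests of this kind are often blocked, so for anything real you would want an official search API instead.

<?php
// Illustration only: fetch the Google Images results page for a query with cURL.
// Automated requests are often blocked and the markup changes, so the regex
// below is a placeholder rather than something to rely on.
$query = urlencode('kittens');
$url   = 'https://www.google.com/search?tbm=isch&q=' . $query;

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; example-bot)');
$html = curl_exec($ch);
curl_close($ch);

// Grab the first image URL we can find in the raw HTML (very fragile).
if (preg_match('/https?:\/\/[^"\']+\.(?:jpg|jpeg|png|gif)/i', $html, $m)) {
    echo '<img src="' . htmlspecialchars($m[0]) . '">';
}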
