How to scrape websites when cURL and allow_url_fopen are disabled - php

I know the question regarding PHP web page scrapers has been asked time and time again, and using this, I discovered SimpleHTMLDOM. After working seamlessly on my local server, I uploaded everything to my online server only to find out something wasn't working right. A quick look at the FAQ led me to this. I'm currently using a free hosting service, so I can't edit any php.ini settings. Following the FAQ's suggestion, I tried using cURL, only to find out that this too is turned off by my hosting service. Are there any other simple solutions to scrape the contents of another web page without using cURL or SimpleHTMLDOM?

If cURL and allow_url_fopen are not enabled, you can try to fetch the content via
fsockopen — Open Internet or Unix domain socket connection
In other words, you have to make HTTP requests manually. See the example in the manual for how to do a GET request. The returned content can then be further processed. If sockets are enabled, you can also use any third-party lib utilizing them, for instance Zend_Http_Client.
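A minimal sketch of such a manual GET request, closely following the manual's fsockopen example (host and path are placeholders):

    <?php
    // Open a plain TCP connection to the web server (port 80, 30 second timeout).
    $fp = fsockopen('www.example.com', 80, $errno, $errstr, 30);
    if (!$fp) {
        die("$errstr ($errno)");
    }

    // Write a raw GET request by hand. HTTP/1.0 is used here to avoid having
    // to decode a chunked HTTP/1.1 response.
    $out  = "GET /some/page.html HTTP/1.0\r\n";
    $out .= "Host: www.example.com\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);

    // Read the full response, then split the headers from the body.
    $response = '';
    while (!feof($fp)) {
        $response .= fgets($fp, 1024);
    }
    fclose($fp);

    list($headers, $body) = explode("\r\n\r\n", $response, 2);
    echo $body; // this is the HTML you can hand to your parser

For an HTTPS page you would connect with fsockopen('ssl://www.example.com', 443, ...) instead, provided the OpenSSL extension is available.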
On a sidenote, check out Best Methods to Parse HTML for alternatives to SimpleHTMLDom.

cURL is a specialty API. It's not the HTTP library it's often made out to be, but a generic data transfer library for FTP, SFTP, SCP, HTTP PUT, SMTP, TELNET, etc. If you want to use just HTTP, there is a corresponding PEAR library for that. Or check if your PHP version has the official http extension enabled.
For scraping, try phpQuery or querypath. Both come with built-in HTTP support.
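For illustration, a rough phpQuery sketch (the include path and the jQuery-style calls follow phpQuery's documented API; adjust them to the version you actually install):

    <?php
    require_once 'phpQuery/phpQuery.php';

    // Load the remote page; phpQuery builds a DOM document you can query.
    $doc = phpQuery::newDocumentFile('http://example.com/');

    // jQuery-like selectors: collect every link's href attribute.
    foreach (pq('a') as $link) {
        echo pq($link)->attr('href'), "\n";
    }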

Here's a simple way to grab images when allow_url_fopen is set to false, without studying up on esoteric tools.
Create a web page on your dev environment that loads all the images you're scraping. You can then use your browser to save the images. File -> "Save Page As".
This is handy if you need a one time solution for downloading a bunch of images from a remote server that has allow_url_fopen set to 0.
This worked for me after file_get_contents and curl failed.

file_get_contents() is the simplest method to grab a page without installing extra libraries, but note that it requires allow_url_fopen to be enabled.
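For completeness, a minimal example (this is exactly the setting that is disabled in the question, so it only helps where allow_url_fopen is on):

    <?php
    // Requires allow_url_fopen = On in php.ini.
    $html = file_get_contents('http://example.com/');
    if ($html === false) {
        die('Request failed - is allow_url_fopen enabled?');
    }
    // $html now holds the page source, ready for parsing.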

Related

Take screenshot of client windows using PHP script

Is it possible to take screenshots of client users' screens using a PHP application?
If you want to capture the current client web page, you should look into JS solutions (see this related SO answer):
html2canvas
npm packages with keyword ‘screenshot’
You may also be able to programmatically load and render HTML fetched with a GET request, though you would still need to run JavaScript to render the full web page on the server side.
If you mean a screenshot of the actual desktop, I fear only the browser would be able to do it and this is still at an experimental stage, probably due to security concerns:
Chrome desktopCapture extension
MDN Media Capture and Streams API
Can I use media capture?
Edit: Possible duplicate of this question. Some back-end libraries to render a web page (a short dompdf sketch follows the list):
dompdf (PHP5)
wkhtmltopdf (C++)
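As a rough illustration of the dompdf route (method names follow current dompdf releases; the older PHP5-era versions used new DOMPDF() and load_html() instead):

    <?php
    require_once 'vendor/autoload.php'; // assumes dompdf was installed via Composer

    use Dompdf\Dompdf;

    // Renders fetched HTML to a PDF "snapshot". JavaScript is not executed,
    // so dynamically generated content will be missing.
    $dompdf = new Dompdf();
    $dompdf->loadHtml(file_get_contents('http://example.com/'));
    $dompdf->setPaper('A4', 'landscape');
    $dompdf->render();
    $dompdf->stream('snapshot.pdf');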

authorize.net response_url https/http issue

Authorize.net consumes my response_url, which is on HTTP, into their HTTPS hosted dll. How can I specify that their dll should be on HTTP, so that my CSS and JS files get pulled in correctly?
I don't have a way of getting access to an SSL host at the moment.
Edit: First, we send from HTTP to their HTTPS hosted form, on their server. Then, their server consumes our HTTP page and displays it in their HTTPS response dll.
I only want their response_dll to be on HTTP. I don't see a security issue with that, and imagine there is a way to do this, as their service offering is meant for people without SSL enabled.
Edit2: I'm using their Simple Checkout API.
Answering my own question, based on some advice to contact tech support directly. I did, and their response was:
"We can only relay the html content. We do not offer to relay other content such as images. It is necessary that any images you want to include are hosted using https."
Edit --
I attempted to include a single logo image as a base64 string inside a static HTML page. That failed with a "script timeout". I speculate this is due to the size of the HTML file after embedding that string.
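For reference, the embedding approach looks roughly like this (the file name is just an example):

    <?php
    // Inline the logo as a data URI so no separate HTTP image request is made.
    // Base64 inflates the data by roughly a third, which may be what triggered the timeout.
    $logo = base64_encode(file_get_contents('logo.png'));
    echo '<img src="data:image/png;base64,' . $logo . '" alt="Logo">';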
This is a total failure on the part of Authorize.Net

Why does Chrome not trust SSL?

SSL is installed on my VPS correctly. I want to use SSL on some pages of my website. Every form on these pages starts with "https://", too. But browsers don't accept it.
What are the possible reasons?
There may be a number of reasons. The last time I got it on my site was when I was using an iframe with external content and a Flash widget loaded via an external JavaScript. Both were accessed via HTTP and messed up my site's trustworthiness.
So. Check all your external content: javascripts, widgets, iframes, images, stylesheets... You may be loading them via HTTP, which in turn may make Chrome claim the SSL certificate has a problem.
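If the page is large, a quick-and-dirty scan for insecure resource URLs can help narrow it down (a rough sketch only; plain <a href> links are harmless, and a real HTML parser is more reliable than this regex):

    <?php
    // List http:// URLs referenced by a page that is served over https.
    $html = file_get_contents('https://yourlink.com');
    preg_match_all('~(?:src|href)=["\'](http://[^"\']+)["\']~i', $html, $matches);
    print_r(array_unique($matches[1]));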
I would try checking it via something like this first: http://www.sslshopper.com/ssl-checker.html
You might also try running curl --verbose https://yourlink.com (or curl -v) from the console in order to get a detailed printout of where the hiccups are.

Problem with Digital Certificates using OpenLayers and Javascript

I'm developing a project using JavaScript, PHP and OpenLayers. A lot of maps are loaded using an HTTPS connection against an external OGC server.
When I try to load the maps using HTTPS, they don't load (instead, I get an "Error loading the map, try again later").
I think the problem is the Digital Certificate. If I load a map directly from the server (using a WMS call) like this (look at the last parameter):
https://serverurl/ogc/wms?service=WMS&version=1.1.0&request=GetMap&layers=ms1:lp_anual_250&styles=&bbox=205125.0,3150125.0,234875.0,3199875.0&width=306&height=512&srs=EPSG:4326&format=application/openlayers
The browser asks me for my authorization to see it. If I accept the Digital Certificate, I can see the map. After that, and because my browser now accepts the certificate, I can see my own map from my own application.
So, the question is: is there any way to ask for the Digital Certificate manually when the user accesses my web app?
Thanks in advance!
PS: solutions using PHP are welcome too because I'm using CodeIgniter to load views
You could try opening the WMS URL in a div or perhaps a hidden iframe - that may cause the browser to pop up its 'Unknown cert' dialogue.
I'm going to quote another user (geographika) from gis.stackexchange. I hope it can help someone with my issue (a bare-bones PHP sketch of the proxy idea follows the quote):
You can use a proxy on your server so all client requests are made to your server, which deals with the certificate, gets the request and passes it back to the client. For PHP have a look at http://tr.php.net/manual/en/function.openssl-verify.php
If you are also using WMS software (MapServer, GeoServer) you could implement the same technique using a cascading WMS server. For details on how to do this in MapServer see http://geographika.co.uk/setting-up-a-secure-cascading-wms-on-mapserver
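A bare-bones version of such a proxy might look like this (proxy.php is a hypothetical name; the host check and the disabled peer verification are assumptions to adapt to your setup):

    <?php
    // proxy.php?url=... : the browser talks only to our own server, and our
    // server makes the HTTPS request to the WMS and relays the tile back.
    $url = isset($_GET['url']) ? $_GET['url'] : '';

    // Only relay requests to the known WMS endpoint, otherwise this is an open proxy.
    if (strpos($url, 'https://serverurl/ogc/wms') !== 0) {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }

    // The certificate is handled (here: simply not verified) on the server side,
    // so the browser never sees the untrusted-certificate prompt.
    $context = stream_context_create(array(
        'ssl' => array('verify_peer' => false, 'verify_peer_name' => false),
    ));
    $tile = file_get_contents($url, false, $context);

    header('Content-Type: image/png'); // assumes format=image/png in the WMS request
    echo $tile;

OpenLayers can then be pointed at the script, e.g. via its OpenLayers.ProxyHost setting, instead of at the HTTPS server directly.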

What are the right uses for cURL?

I have already heard about the cURL library, and it has caught my interest.
I've read that there are many uses for it; can you provide me with some?
Are there any security problems with it?
One of the many useful features of cURL is interacting with web pages, which means that you can send and receive HTTP requests and manipulate the data. That means you can log in to web sites and actually send commands as if you were interacting from your web browser.
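As a sketch of that "log in and keep the session" pattern (the URLs and form fields are made up):

    <?php
    // Log in by POSTing the form fields, storing the session cookie in a jar.
    $ch = curl_init('https://example.com/login');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
        'username' => 'me',
        'password' => 'secret',
    )));
    curl_setopt($ch, CURLOPT_COOKIEJAR,  '/tmp/cookies.txt');
    curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');
    curl_exec($ch);

    // Reuse the same handle: the stored cookie makes the next request "logged in".
    curl_setopt($ch, CURLOPT_URL, 'https://example.com/dashboard');
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    $html = curl_exec($ch);
    curl_close($ch);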
I found a very good web page titled 10 awesome things to do with curl. It's at http://www.catswhocode.com/blog/10-awesome-things-to-do-with-curl
One of its big use cases is automating activities such as getting content from other websites. It can also be used to post data to another website and download files via FTP or HTTP. In other words, it allows your application or script to act as a user accessing a website, just as they would when browsing manually.
There are no inherent security problems with it, but it should be used appropriately, e.g. use HTTPS where required.
cURL Features
It's for spamming comment forms. ;)
cURL is great for working with APIs, especially when you need to POST data. I've heard that it's quicker to use file_get_contents() for basic GET requests (e.g. grabbing an RSS feed that doesn't require authentication), but I haven't tried it myself.
If you're using it in a publicly distributed script, such as a WordPress plugin, be sure to check for it with function_exists('curl_init'), as some hosts don't install it...
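Roughly, the difference looks like this (the API endpoint is hypothetical):

    <?php
    // Basic GET: file_get_contents is enough (needs allow_url_fopen).
    $feed = file_get_contents('http://example.com/feed.rss');

    // POSTing to an API: cURL territory. Check for the extension first,
    // since some hosts don't enable it.
    if (!function_exists('curl_init')) {
        die('cURL extension not available');
    }
    $ch = curl_init('https://api.example.com/items');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array('title' => 'Hello')));
    $response = curl_exec($ch);
    curl_close($ch);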
In addition to the uses suggested in the other answers, I find it quite useful for testing web-service calls. Especially on *nix servers where I can't install other tools and want to test the connection to a 3rd party webservice (ensuring network connectivity / firewall rules etc.) in advance of installing the actual application that will be communicating with the web-services. That way if there are problems, the usual response of 'something must be wrong with your application' can be avoided and I can focus on diagnosing the network / other issues that are preventing the connection from being made.
It can certainly simplify small programs you need to write that require higher-level protocols for communication.
I do recall a contractor, however, attempting to use it with a high load Apache web server module and it was simply too heavy-weight for that particular application.
