I know how to load a page from another website and analyze it, but the website I'm trying to load some pages from doesn't let unregistered users visit those pages. I do have a username and password that let me load those pages normally in my browser, but I'm wondering whether I can do the same in PHP.
I'm not sure what information I should give you about the website, but if what I've already told you is incomplete, just ask.
Thanks.
Most websites use cookies to store information related to your authentication status.
To get past this programmatically, you'll have to send this information every time you make a request. Here is how you can get it:
Using a tool like Firebug, inspect the cookies the site sends when you log in manually.
Write code to capture this information and send the cookie with subsequent requests.
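The second step can be sketched roughly as follows, assuming the cookie names and values were copied from the browser's inspector. The cookie name `SESSION`, its value and the URL are placeholders, and `buildCookieHeader` and `fetchWithCookies` are hypothetical helper names, not part of any library. The cURL extension is assumed to be installed.

```php
<?php
// Hypothetical helper: turn name => value pairs (copied as-is from the
// browser's cookie inspector) into a single Cookie request header line.
function buildCookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

// Hypothetical helper: request a page with those cookies attached.
// Returns the response body, or false on failure.
function fetchWithCookies(string $url, array $cookies)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, [buildCookieHeader($cookies)]);
    return curl_exec($ch);
}

// Example with placeholder values:
// $html = fetchWithCookies('https://example.com/members', ['SESSION' => 'abc123']);
```

A cookie copied by hand stops working when the session expires, so this is mostly useful for a quick test before automating the login itself.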
Note: Do check the ToS of the site you are trying to scrape. Some sites do not permit you to scrape or use their content without prior permission.
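Research PHP's HTTP client tools, such as the cURL extension (PHP SOAP only helps if the site exposes a SOAP web service). They let your PHP script log in to another site and give it access to the HTML which would normally be presented to the browser. By the way, this is called scraping.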
I have to download a page to parse some values in it. I would like to use PHP to download the page, parse the data and return HTML with the results. But I have to log in to the site first to get the target page. How can I do this with PHP?
The best option is to follow the login process while logging in manually with a browser like Google Chrome. You need to enable the network monitor: press F12 to open the developer tools, go to the 'Network' tab and tick 'Preserve log'. You can optionally tick 'Disable cache' as well.
Then clear all history and cookies so you're sure the site doesn't log you in automatically.
Now you're set to log in manually through the website. Type the URL of the site's login page and watch the items in the developer tools roll by. When your login is complete, go to the top of the list of network items and look for a POST entry in the second column. This usually indicates the browser posting the login information to the website.
Most sites respond with a 30x redirect and set a cookie. Now you know how the site operates.
Have a look at my answer to a similar question, "PHP curl login couldn't pass login page", and use the cURL library to first log in, receive the cookie and then, while keeping the connection open, get the page you need after the login.
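A minimal sketch of that flow, assuming the login is a plain form POST that answers with a cookie. The URLs, field names and cookie file path are placeholders to be replaced with what the network monitor showed, and `sessionOptions` and `loginAndFetch` are hypothetical helper names.

```php
<?php
// Hypothetical helper: cURL options shared by every request in the session.
// The same file is used to save (COOKIEJAR) and to send (COOKIEFILE) cookies.
function sessionOptions(string $cookieFile): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // follow the 30x redirect after login
        CURLOPT_COOKIEJAR      => $cookieFile,
        CURLOPT_COOKIEFILE     => $cookieFile,
    ];
}

// Log in with a POST, then fetch a protected page on the same handle.
// Returns the page body, or false if either request fails.
function loginAndFetch(string $loginUrl, array $fields, string $pageUrl, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, sessionOptions($cookieFile));

    // Step 1: post the credentials, mirroring the POST entry seen in the browser.
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    if (curl_exec($ch) === false) {
        return false;
    }

    // Step 2: reuse the handle (and the cookies it now holds) for the target page.
    curl_setopt($ch, CURLOPT_URL, $pageUrl);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    return curl_exec($ch);
}

// Example with placeholder values:
// $html = loginAndFetch('https://example.com/login', ['user' => 'me', 'pass' => 'secret'],
//                       'https://example.com/members', sys_get_temp_dir() . '/cookies.txt');
```

Some login forms also carry hidden fields (for example a token) that must be read from the login page and posted back; the network monitor will show whether that applies to your site.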
I posted this question earlier but it was misinterpreted by those reading it and was closed before I had time to clarify. If you don't understand what I mean, please ask!
I have a site, let's call it "site A". On "site A", there is a log in page. On this page, you POST a username and password to a PHP script. If the login details are correct, the PHP script sets a cookie on the browser. This cookie is called "SESSION".
When you view the site, it checks whether "SESSION" is valid, and displays either the information or the login page.
I want to connect to the page via PHP and POST the login details. I then want to store the "SESSION" cookie via PHP, and display the contents of the page (again, via PHP).
How would I do this?
You can use PHP as a web client as well. You can use the cURL library to make requests from PHP.
You can use curl_setopt to set all kinds of options for your cURL session, including POST (CURLOPT_POST) and the POST variables (CURLOPT_POSTFIELDS), but you can also choose a kind of authentication (CURLOPT_HTTPAUTH) in case the site doesn't use a normal POST for this.
I found an example that might be useful here: http://davidwalsh.name/curl-post, although you can find many other examples by Googling for something like 'php curl post'.
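To illustrate the difference between the two kinds of options mentioned above, here is a small sketch. The field names, credentials and URL are assumptions, and `formLoginOptions` and `basicAuthOptions` are hypothetical helpers that only build option arrays for curl_setopt_array.

```php
<?php
// Options for a login form that expects POSTed fields.
function formLoginOptions(array $fields): array
{
    return [
        CURLOPT_POST       => true,
        CURLOPT_POSTFIELDS => http_build_query($fields),
    ];
}

// Options for a site that uses HTTP Basic authentication instead of a form.
function basicAuthOptions(string $user, string $password): array
{
    return [
        CURLOPT_HTTPAUTH => CURLAUTH_BASIC,
        CURLOPT_USERPWD  => $user . ':' . $password,
    ];
}

// Example with placeholder values:
// $ch = curl_init('https://site-a.example/login.php');
// curl_setopt_array($ch, [CURLOPT_RETURNTRANSFER => true]
//     + formLoginOptions(['username' => 'bob', 'password' => 'secret']));
// $response = curl_exec($ch);
```

For "site A" as described in the question, the form variant is the relevant one; the Basic authentication variant only applies when the browser shows its own username/password prompt.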
I have a Google Apps domain that I'd like to create a custom login page for, but I am having problems.
Google provides documentation for SSO/OpenID/userApi that will do this. The implementation in these docs that I can understand states that once users hit your site, they are sent to the regular Gmail login and then sent back to your site once logged in. I'm trying to have them log in on a custom page and not be sent over to Google's default Gmail login. There is other documentation that seems to require SSO and a lot of integration that I am too incompetent to understand which would let you do that, but as I said it's way over my head.
Then I thought I could just copy the form element and create custom CSS, seeing as the action value on the form would authenticate via Google. This worked sporadically until I figured out that when you go to https://accounts.google.com/ServiceLoginAuth (the default Gmail login), it creates a value (name="GALX" value=Randomletters) in the HTML form that must match a cookie's name and value before the form can be submitted and authenticated by Google.
From here I thought, no problem, I'll create a hidden iframe pointing to the Google login so the cookie populates (it does get the cookie) and then read the value and insert it in the HTML form. That is, until I discovered you cannot alter or read another domain's cookies for security reasons, which makes perfect sense.
Then I thought I could just use PHP's file_get_contents on the Gmail login URL to get the cookie and the right HTML, and just insert the HTML into my custom page. I received the HTML but no cookie this time.
Is there any way to send a request that would return the HTML/cookie pair with something like PHP's file_get_contents('url') or cURL? That way I could traverse the result of file_get_contents and insert the HTML into the page via the DOM. Or am I barking up the wrong tree because security measures specifically prevent this?
If the above isn't possible, could someone explain how I could log my users in via a custom login screen?
The Google docs for such a project are:
https://developers.google.com/appengine/docs/python/gettingstartedpython27/usingusers
https://developers.google.com/appengine/docs/python/users/#Python_Signing_in_and_out
https://developers.google.com/google-apps/sso/saml_reference_implementation
I believe this is what you're looking for...
http://curl.haxx.se/docs/http-cookies.html
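As a rough sketch of getting the HTML and the cookies from one request with cURL: `extractCookies` and `fetchHtmlAndCookies` are hypothetical helper names, the URL is a placeholder, and the cookie parsing is deliberately simple (it keeps only the name=value part of each Set-Cookie header).

```php
<?php
// Hypothetical helper: pull name => value pairs out of raw response header
// lines, ignoring attributes such as Path, Expires or HttpOnly.
function extractCookies(array $headerLines): array
{
    $cookies = [];
    foreach ($headerLines as $line) {
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue; // not a Set-Cookie header
        }
        $pair  = trim(explode(';', substr($line, strlen('Set-Cookie:')), 2)[0]);
        $parts = explode('=', $pair, 2);
        if (count($parts) === 2) {
            $cookies[$parts[0]] = $parts[1];
        }
    }
    return $cookies;
}

// Fetch a page and return its HTML together with the cookies it set.
function fetchHtmlAndCookies(string $url): array
{
    $headers = [];
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // cURL calls this once per header line; it must return the line's length.
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $line) use (&$headers) {
        $headers[] = $line;
        return strlen($line);
    });
    $html = curl_exec($ch);
    return ['html' => $html, 'cookies' => extractCookies($headers)];
}

// Example with a placeholder URL:
// $result = fetchHtmlAndCookies('https://example.com/login');
```

Keep in mind that cookies obtained this way belong to the request your server made, not to the visitor's browser, and your server cannot set cookies for another domain in that browser.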
I want to track the URL of the site from which a user reached my site.
That is, where they came from: Google, Gmail, Facebook, etc.
I tried $_SERVER['HTTP_REFERER'], but it is empty when a user clicks a link to my site from an external site. It only holds a value when I navigate between pages of my own site, and it cannot be trusted anyway.
So, what can I do from here?
Is there any other way to track the external URL through PHP?
Any ideas?
EDIT: Now HTTP_REFERER is able to get the URL from most sites, but not if the user came through Gmail or AOL. What could be the cause?
HTTP_REFERER is the only way to get any information about the previous site.
Whether that information is supplied is also up to the browser; most do by default.
It's a header that is set by the browser in the request to your server. If it is not present, you will never know where the user came from.
If the browser is sending it and you still do not get anything on the server, check whether any of your code interferes with the $_SERVER variable.
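Because the header may be missing or malformed, it helps to read it defensively. A minimal sketch, in which `refererHost` is a hypothetical helper name and the fallback label is an arbitrary choice:

```php
<?php
// Hypothetical helper: return the referring host in lower case, or null when
// the browser sent no Referer header (or sent something that is not a URL).
function refererHost(array $server): ?string
{
    if (empty($server['HTTP_REFERER'])) {
        return null;
    }
    $host = parse_url($server['HTTP_REFERER'], PHP_URL_HOST);
    return is_string($host) ? strtolower($host) : null;
}

// Usage:
// $source = refererHost($_SERVER) ?? 'direct or unknown';
```

The value still comes from the client, so it is only suitable for statistics, not for anything security related.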
Try this URL; it's a Google search result that goes to a page that just dumps the HTTP_REFERER.
As the page indicates, if the box lists (none), then your browser is not sending HTTP_REFERER, but if you get a result then the problem is on your server.
http://www.google.com/url?sa=t&source=web&cd=1&sqi=2&ved=0CBIQFjAA&url=http%3A%2F%2Fkarmak.org%2F2004%2Freftest%2Ftest&rct=j&q=http_referer%20test&ei=cNQ2TdGYGsmUOp_ExPoD&usg=AFQjCNFVSmYmQBUcL2l3_ZpmZzVWZztjWg&cad=rja
You can compare it to what you get when you load the page without Google redirecting you:
http://karmak.org/2004/reftest/test
Here is their own start page with link:
http://karmak.org/2004/reftest/
Have you tried it in a variety of browsers? It's down to the browser (as far as I'm aware) to set HTTP_REFERER, and sometimes privacy settings can prevent this.
Visitors coming from Google can be tracked using Google Analytics, which gives you the search query terms they used.
This solution also tracks a lot of other things about your visitors. I understand it's not PHP based, but it's the only other kind of solution I know of if HTTP_REFERER is not enough for you, and as you mentioned Google...
While I was trying to refresh page contents dynamically using Ajax/jQuery, I learned about the same-origin policy (S-O-P) and its restrictions. However, I was wondering if there could be a way to solve my little problem.
To make it easier to understand I will first explain the workflow.
I receive web pages via email, that is, HTML emails. The web pages contain HTML forms; once a form is complete, it is sent to the proper web server (PHP) to store the data.
I mostly use Outlook 2007 as my email client (don't say anything here, I know!!!), but because of security restrictions, IFRAMES are disabled when "opening" the email. I have circumvented this problem using a VBA script that copies the whole page content, saves it on the filesystem as a stand-alone web page and loads it into the browser (Firefox).
Once the page is loaded into the browser, the address bar shows a local/filesystem URL, such as
file:///C:/Users/Bob/Desktop/outlookpage.htm
Up to here there is no problem and it works fine. Now the problem:
I wanted to update the page contents dynamically using Ajax via jQuery's .load(), but that's where the same-origin policy comes in. The PHP page being loaded to update the web page is seen as running on another domain, and is therefore blocked.
I was wondering how to circumvent this.
That's not going to work, because in order to bypass the same-origin policy you would need to use a proxy on the same domain as the page, which would then communicate with the page that handles the data on a different domain. There's no way to generate a proxy script on another user's computer (or at least, there SHOULDN'T BE A WAY). I would either just post the form normally, which will open the user's default browser, or provide a link to an online form in the email. The link should be provided anyway, in case their email client doesn't support HTML email.
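For reference, this is roughly what such a same-domain proxy looks like when the page is served from a web server you control (which is not the case for a file:/// page). It is only a sketch: the target URL is a placeholder, and `PROXY_TARGETS`, `resolveTarget` and `proxy` are hypothetical names.

```php
<?php
// Hypothetical allow-list: short keys the page may request, mapped to the
// remote URLs the proxy is willing to fetch. Never proxy arbitrary URLs.
const PROXY_TARGETS = [
    'form-data' => 'https://data.example.com/update.php',
];

// Look up the remote URL for a key, or null if the key is not allowed.
function resolveTarget(string $key): ?string
{
    return PROXY_TARGETS[$key] ?? null;
}

// Fetch the remote content on behalf of the page; returns '' on any failure.
function proxy(string $key): string
{
    $url = resolveTarget($key);
    if ($url === null) {
        http_response_code(404);
        return '';
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    return $body === false ? '' : $body;
}

// In proxy.php, on the same domain as the page:
// echo proxy($_GET['target'] ?? '');
```

The page would then call something like $('#box').load('proxy.php?target=form-data'), which the browser treats as a same-origin request.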