There's a PHP-based website whose data I'd like to replicate.
The problem is that the website's data is only accessible via a company name search page - www.example.com/companynamesearch.php
The results are displayed under the same URL, so there are no separate per-company URLs to crawl for data.
Can anyone suggest an easy way to extract the data from the site?
Thanks
First, you need to query the data. Figure out whether the data is truly in this page's HTML or whether it comes in via AJAX, as suggested by @JonathanM. You can use a tool like Fiddler or your browser's developer tools to monitor for this.
If you find the data comes in via AJAX, you're all set. It's probably JSON, but it can be any format, so watch for that.
If the data is in the page itself and the page is queried via POST data, then you are going to have to make those POST requests and then parse the page. Don't write that parsing by hand; use DOMDocument to dig at the page for you. See this question for details: How do you parse and process HTML/XML in PHP?
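For example, a minimal sketch of that combination (the form field name, the example URL, and the result markup are assumptions; inspect the real form and results page to find the actual ones):

```php
<?php
// Hedged sketch: POST the search form, then parse the response with
// DOMDocument. The field name and markup below are guesses, not the
// real site's values.

$ch = curl_init('http://www.example.com/companynamesearch.php');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
    'companyname' => 'Acme Ltd',   // hypothetical field name
]));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// Suppress warnings from real-world (often invalid) HTML.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Pull out whatever elements hold the results; here we simply
// dump every table cell as an illustration.
foreach ($dom->getElementsByTagName('td') as $cell) {
    echo trim($cell->textContent), "\n";
}
```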
If your chosen language is PHP, you should look at cURL's automated form-submission capabilities, which will enable you to drive the site's internal search form.
There is a useful Stack Overflow answer here:
fill out a form automaticly using curl and php
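That answer boils down to something like this reusable helper (a sketch only; the usage URL and field name are placeholders):

```php
<?php
// Submit an arbitrary form by POSTing its fields with cURL. The field
// names must be discovered by inspecting the target form's HTML.
function curl_post_form($url, array $fields)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,  // some forms redirect on submit
    ]);
    $response = curl_exec($ch);
    curl_close($ch);
    return $response;
}

// Hypothetical usage against the search form from the question:
$html = curl_post_form('http://www.example.com/companynamesearch.php', [
    'companyname' => 'Acme Ltd',
]);
```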
Or you can look at these basic tutorials to get you started:
http://phpsense.com/2007/php-curl-functions/
http://devzone.zend.com/160/using-curl-and-libcurl-with-php/
Using cURL with PHP will save you plenty of time, but be warned: if the site's owners don't want you scraping their site, you could be in for a tough time. And of course there are copyright issues to think of, etc.
Have you tried searching Google for site:www.example.com? You may get back a list of all the pages.
They might have submitted a sitemap, or Google might have found the pages another way.
I'd like some help with taking input from a user, using that input to complete a form on a different site, and then collecting the results that the site outputs. Would it be possible to do this with PHP? If it helps as additional info, the target site is in JSP.
The site in question is a results site for my university. It's done in JSP, and there's no API through which I can fetch the data. I'd like to be able to take user input (a unique student ID, etc.), submit that to the results website, and fetch the results onto my own site to do some calculations. Would it be possible to do this?
Any help would be appreciated.
Well, technically you can. If you create an HTTP POST request to that site, you can send it data, but the response will be the HTML code of the page. As a security measure, though, most sites protect themselves from such actions by putting a token inside their forms, so that they can tell whether the form data is coming from their own site or was submitted by an external source.
You can try sending the POST over HTTP using cURL.
Here is a sample tutorial:
https://davidwalsh.name/curl-post
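Putting those two points together, here is a hedged sketch: fetch the form page first to pick up any hidden anti-CSRF token, then POST the student ID along with it. Every URL and field name below is invented; inspect the real JSP form to find the actual ones.

```php
<?php
// Step 1: GET the form page, keeping any session cookie it sets.
$base = 'http://results.example-university.edu/results.jsp'; // placeholder
$jar  = tempnam(sys_get_temp_dir(), 'cookies');

$ch = curl_init($base);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);
$formHtml = curl_exec($ch);
curl_close($ch);

// Step 2: pull the hidden token out of the form, if there is one.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($formHtml);
libxml_clear_errors();

$token = '';
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//input[@name="token"]') as $input) { // guessed name
    $token = $input->getAttribute('value');
}

// Step 3: POST the student ID (and token) back, reusing the cookie.
$ch = curl_init($base);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
    'studentId' => '12345',   // would come from your own user's input
    'token'     => $token,
]));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);
$resultHtml = curl_exec($ch);
curl_close($ch);
// $resultHtml now holds the results page, ready to parse for marks.
```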
It is not possible if the external site does not provide an API for the data. Normally a cross-origin HTTP request would be disallowed for a web application. Read more about CORS here.
I'm terrible at keeping track of my bills, so I wanted to create something automated. I also wanted the challenge of making it myself.
My questions:
Is it possible to have a webpage connect to another domain (any utility website, e.g. timewarnercable.com) with the proper login credentials, retrieve the dollar amount I owe, and then email it to me or even just display it on the webpage?
I've already got a webpage set up that has all my account info stored in it (don't worry, it's only a local site!). I can click a button, and the info I have stored sends a POST request to the utility's login site. This logs me in to my account page, and then I can view the bill. But I don't want it to open another page; I'd rather load the content of that page in the background, scan the markup for the spot where it says the amount I owe, capture that somehow, and then return the dollar amount to my webpage.
If so, is it possible to design this with Ruby (Rails) or PHP, plus JavaScript/AJAX?
Thanks!
What you're basically asking about is "page scraping", but your scenario is more complicated. You would have to fake the login POST, capture and store any cookie/session info returned in the response, and use that in subsequent requests to the site. You may also have to deal with redirects, depending on the site.
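As a rough illustration of that flow with PHP and cURL (the URLs, field names, and bill format are placeholders for whatever the utility site actually uses):

```php
<?php
// CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE persist the session cookie
// between requests; CURLOPT_FOLLOWLOCATION handles login redirects.
$jar = tempnam(sys_get_temp_dir(), 'cookies');

// 1. Fake the login POST; the site sets a session cookie in response.
$ch = curl_init('https://utility.example.com/login'); // placeholder URL
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query([
        'username' => 'me@example.com',  // your stored credentials
        'password' => 'secret',
    ]),
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_COOKIEJAR      => $jar,   // write cookies here
    CURLOPT_COOKIEFILE     => $jar,   // and send them back
]);
curl_exec($ch);
curl_close($ch);

// 2. Fetch the billing page with the same cookie jar.
$ch = curl_init('https://utility.example.com/account/bill');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEFILE     => $jar,
]);
$billHtml = curl_exec($ch);
curl_close($ch);

// 3. Scrape the amount owed. A crude regex stands in here for a
//    proper DOM lookup against the site's real markup.
if (preg_match('/\$\s*([\d,]+\.\d{2})/', $billHtml, $m)) {
    echo "Amount owed: \${$m[1]}\n";
}
```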
I have found Node.js actually quite useful for scraping pages, since it has plugins that provide DOM selectors (there is a jQuery plugin), and you're using JavaScript for server-side programming.
Check if the site has an API; if it provides one, that will make your life a ton easier.
Some banks, like Bank of America, have applications that already do this: they aggregate your accounts and bills from other sites. See if your bank can do this.
OK, this is quite complicated, and I'm not even sure it is possible. I need some insight from knowledgeable people to advise me on how to proceed.
I need to process a form on a remote site, screen scrape the results (on the fly), parse the information and display it back to the end user.
--More clearly explained by example--
[1] my site is -> sitea.com
[2] the form is on -> somebodyelseswebsite.com (no DB access, but form is public)
Here's my logic:
I can replicate the form from site [2] and make an exact copy on my site [1].
When the user submits the form, I need some kind of object in the POST (JavaScript?) that will assign the user's input to ... and process the form on site [2], screen-scrape the results, and return the data in an array, which I can display on my site [1].
key points:
The user must not be aware of the transaction with site [2].
This must happen in real time, and fast.
So, can this be done? If yes, how? I know about PHP cURL; can I use only PHP, or do I need to use something else?
--further clarification--
Yes, this can be done. cURL is one way to do it, yes. You need some pretty heavy error checking and validation for any sort of reliability, though. You'd use a cURL POST (assuming the remote host doesn't have any sort of form key, IP block, referer checking, etc.) to replicate the behavior of that form's fields. Then you'd need to scrape the return, and I think that's the difficult part.
For me, I'd use a DOM parser to get very specific. Here is a post on how to do that.
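For instance, a sketch of that targeted scrape with PHP's DOMXPath, assuming, purely hypothetically, that each result sits in a <tr class="result"> row in site [2]'s response:

```php
<?php
// $html would be the response body from the cURL POST to site [2].
$html = '<html>...</html>';  // placeholder

$dom = new DOMDocument();
libxml_use_internal_errors(true);  // tolerate messy real-world HTML
$dom->loadHTML($html);
libxml_clear_errors();

$xpath   = new DOMXPath($dom);
$results = [];

// Hypothetical markup: each result row is <tr class="result"> with <td> cells.
foreach ($xpath->query('//tr[@class="result"]') as $row) {
    $cells = [];
    foreach ($row->getElementsByTagName('td') as $cell) {
        $cells[] = trim($cell->textContent);
    }
    $results[] = $cells;
}

print_r($results);  // the array you can render on site [1]
```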
Lately I have been working on a PHP browser-like program. The goal of this program is to use this PHP browser platform to browse only 'safe' websites; it will be able to detect an adult site and not display it.
Unfortunately, there are two major problems:
Cookies - users can't log in to different sites while using this platform.
Security redirects - some sites check the URL, in either PHP or JS, and then redirect to their own page.
So, simply, I thought about plan B:
I was thinking about using an iframe and building the whole program in JavaScript and AJAX! But unfortunately, an iframe is heavily locked down and I can't touch anything inside it!
- and there goes plan B.
My question is: is there anything you can think of, or any advice you can give, that would help in building a PHP/JavaScript+AJAX browser-like program?
For the PHP side you'll need to use cURL. You'd probably want to change the HTML on the server side; take a look at this: Is there a PHP HTML tag library?
For checking whether the site is adult, you should just check the domain against a database of adult sites.
For JavaScript, I don't know of any pre-made browsers. You'll probably have to build the blocking in yourself; it shouldn't be too hard.
Update
basic structure:
1. The JS client makes an AJAX request to the PHP server using GET or POST (e.g. "url=site.com/page/foo.html").
2. PHP reads the URL from GET or POST.
3. PHP uses cURL to get the page contents.
4. PHP parses the HTML and rewrites the URLs, or JS intercepts the link press and sends the href="" to the server via AJAX (back to step 1): Is it possible to stop redirection to another link page?
5. PHP echoes out the page.
6. JavaScript places it in the display.
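A rough sketch of that structure on the PHP side, with the adult-domain check from the earlier answer folded in. The blocklist and URL handling here are toy placeholders; a real version needs proper URL validation, relative-link resolution, and handling for CSS/JS assets and cookies.

```php
<?php
// Receive the target URL from the AJAX caller.
$url  = $_GET['url'] ?? '';
$host = parse_url($url, PHP_URL_HOST);

// Hypothetical blocklist; in practice, query a database of adult domains.
$blocked = ['adult-site.example'];
if (!$host || in_array($host, $blocked, true)) {
    exit('Blocked or invalid URL.');
}

// Fetch the page contents with cURL.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

// Rewrite every link so it points back through this proxy script.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    $a->setAttribute('href', '?url=' . urlencode($href));
}

// Echo the rewritten page; the JavaScript caller places it in the display.
echo $dom->saveHTML();
```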
I know my answer is too late; I'm posting it so that anyone else can get help. There is a simple solution for creating a complete PHP browser. Here is the link: http://sourceforge.net/projects/snoopy/
Howdy folks, I want to build a script to take a single row from my MySQL database and use that data to pre-populate form fields on one of several sites that aren't mine. What I'd like to do is take information a user has entered on my site and, when they click a link to one of the sites in my system, have the external site load with certain pre-mapped fields populated with the info they entered. But I can't seem to get my head around a way to do this, seeing as I can't add anything to those pages. Do you have any suggestions?
The flow you described is not possible due to cross-site scripting constraints. This post is relevant: Browser Automation and Cross Site Scripting
The closest thing I can think of is Greasemonkey, which would force the user to download the plugin from Mozilla, plus a new userscript from your website.
Another option would be reproducing the form on your own web server, and hoping the form action doesn't perform referrer checks.
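A minimal sketch of that last option (the external action URL, the field names, and the database row are all placeholders; the field names must match the target form's exactly):

```php
<?php
// In practice this row would be fetched from your MySQL table.
$row = [
    'first_name' => 'Jane',
    'email'      => 'jane@example.com',
];
?>
<!-- A reproduction of the external form, pre-filled from the database.
     The action still points at the external site, so submitting sends
     the user (and the data) straight there. -->
<form method="post" action="https://external-site.example/signup">
  <input type="text" name="first_name"
         value="<?= htmlspecialchars($row['first_name']) ?>">
  <input type="email" name="email"
         value="<?= htmlspecialchars($row['email']) ?>">
  <button type="submit">Continue on the external site</button>
</form>
```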
I am not very sure, but you can use wget and pass XML data; i.e., you can build an XML string with the data you want to send across and then do a wget to the other site. Hope this helps.