Scrape a website (javascript website) using php - php

I am trying to scrape a website (I believe it is rendered with JavaScript) using a simple PHP script. I am a beginner, so any help would be greatly appreciated. The URL of the webpage is:
http://www.indiainfoline.com/Markets/Company/Fundamentals/Balance-Sheet/Yes-Bank-Ltd/532648
For example, I would like to pass the company name (Yes-Bank-Ltd) and code (532648) to file_get_contents. I am not sure how to do it, so can somebody please help?
Thanks,
Nidhi

Why not just append the company name and code to the URL? One idea: fill two arrays, one with company names and one with codes (they need to be the same size), then loop over them to scrape the data you want.
for ($i = 0; $i < count($listOfCie); $i++)
{
    $cie = $listOfCie[$i];
    $code = $listOfCode[$i];
    $urlToScrape = "http://www.indiainfoline.com/Markets/Company/Fundamentals/Balance-Sheet/" . $cie . "/" . $code;
    //... = file_get_contents($urlToScrape....
}
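A variation on the loop above, as a minimal sketch: it keeps each company and its code together in one associative array, so the two lists cannot get out of step. The helper name buildBalanceSheetUrls is made up for illustration, and only the Yes-Bank-Ltd pair comes from the question.

```php
<?php
// Hypothetical helper: builds one balance-sheet URL per company.
function buildBalanceSheetUrls(array $companies): array
{
    $base = 'http://www.indiainfoline.com/Markets/Company/Fundamentals/Balance-Sheet/';
    $urls = [];
    foreach ($companies as $name => $code) {
        // rawurlencode() keeps each value a single, safe path segment.
        $urls[] = $base . rawurlencode((string) $name) . '/' . rawurlencode((string) $code);
    }
    return $urls;
}

// Company name => code; add more pairs as needed.
$urls = buildBalanceSheetUrls(['Yes-Bank-Ltd' => 532648]);
echo $urls[0], "\n";
```

Each resulting URL can then be passed to file_get_contents or cURL.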

Use the data.html table in YQL! http://developer.yahoo.com/yql/console

The simplest way to scrape a site in PHP is to use curl (http://php.net/manual/en/book.curl.php)
For some examples look at http://php.net/manual/en/curl.examples-basic.php or google :)
If the website relies on JavaScript, though, it's going to be difficult to get the data you want. You might look at a "headless browser" like http://phantomjs.org/

Related

PHP Variables not passing to url correctly from API

I'm trying to use the Premium URL Shortener script from CodeCanyon. I have asked for support, but they seem to be a little busy, so the response time is not too quick.
The issue appears when the API sends a request to the URL shortener script. For example purposes, here is a shortened version of the query:
$short = "http://myurl/api?api=MYAPI&format=text&url=http://myfullwebsite.com/email/quote.php?fullname=$fullname&address=$address&emailaddress=$emailaddress";
The variables are inserted correctly: echoing the URL at the end of the script, after the API request is sent, shows this:
http://myurl/api?api=MYAPI&format=text&url=http://myfullwebsite.com/email/quote.php?fullname=Dan Smith&address=12 Main Street, London&emailaddress=dan@smith.com
However if I click the shortened url provided to me from the script I only get the following url string appear in the browser:
http://myfullwebsite.com/email/quote.php?fullname=Dan
It seems the URL is cut off as soon as there is a space. Even if there is no second name (only Dan instead of Dan Smith), the parameters after the & sign are still dropped.
I have tried to use urlencode() but still no joy, and I've been pulling my hair out for the last 3 days!
As a beginner I have found it difficult to get the end result, and it seems unreachable, so I would appreciate any help or advice. Maybe I'm missing something simple?
I've thought of building the URL query from an array of variables, but I tried one way and failed, so I'm not sure whether I did it wrong.
Here is my full API code. I have tried both SESSION and GET, but that is not the problem, as the URL echoed to the browser contains the variables. It's only when you follow the shortened URL that you see they're missing.
<?php
session_start();
$fullname = htmlspecialchars($_GET["fullname"]);
$address = htmlspecialchars($_GET["address"]);
$postcode = htmlspecialchars($_GET["postcode"]);
$emailaddress = htmlspecialchars($_GET["emailaddress"]);
$short = "http://myurl/api?api=MYAPI&format=text&url=http://ukhomesurveys.co.uk/email/quote.php?fullname=$fullname&address=$address&emailaddress=$emailaddress";
echo $short;
// Using Plain Text Response
$api_url = $short;
$res = @file_get_contents($api_url);
if($res){
echo $res;
}
?>
Hope I covered everything and hope I have not confused anyone. Thanks.
I think a good choice here is to encode your query with base64 and then pass it to the shortener. In your http://myfullwebsite.com/email/quote.php you just decode the query and use it. The standard PHP functions are base64_encode and base64_decode.
Did you try to encode the URI using rawurlencode?
$url = rawurlencode('http://myfullwebsite.com/email/quote.php?fullname=Dan Smith&address=12 Main Street, London&emailaddress=dan@smith.com');
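To make the encoding idea concrete, here is a rough sketch that encodes in two layers: first the inner query string, then the whole long URL as a single value of the url parameter. The helper name buildShortenerRequest is made up, "myurl" and "MYAPI" are the placeholders from the question, and it assumes the shortener decodes the url parameter exactly once.

```php
<?php
// Hypothetical helper; host, API key and parameter names are placeholders from the question.
function buildShortenerRequest(string $apiKey, string $target, array $params): string
{
    // Encode the inner query string first (spaces become %20, "@" becomes %40).
    $longUrl = $target . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);

    // Then encode the whole long URL as ONE value of the outer "url" parameter,
    // so its "?" and "&" are not read as part of the API request itself.
    return 'http://myurl/api?' . http_build_query(
        ['api' => $apiKey, 'format' => 'text', 'url' => $longUrl],
        '',
        '&',
        PHP_QUERY_RFC3986
    );
}

$request = buildShortenerRequest('MYAPI', 'http://myfullwebsite.com/email/quote.php', [
    'fullname'     => 'Dan Smith',
    'address'      => '12 Main Street, London',
    'emailaddress' => 'dan@smith.com',
]);
echo $request, "\n";
```

Note that htmlspecialchars (used in the question's script) escapes for HTML output, not for URLs, so it does not help with spaces or ampersands here.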

Possible to use PHP to simulate an iFrame so I can access the DOM of a non hosted page?

I'm working on a project where I would like to load the contents of one webpage (that I'm not hosting) into a webpage that I am hosting with the ability to access the DOM of the non-hosted page.
If anyone has any advice as to whether it's possible to achieve this, I'd love to hear some feedback. Maybe PHP isn't even the answer. Maybe I'm going about this all wrong. I'm definitely open to any suggestions at this point!
Thanks for reading,
DJS
You can use curl in PHP to load the webpage into a variable instead of an IFrame and then output the contents of the variable using PHP wrapped in your layout. In this way, the DOM for all of the content should be accessible with JavaScript.
As ronnied has answered, you can use cURL to load the page. You can update all the links by running a simple regex over the loaded page. The following code should point you in the right direction; in particular, look up preg_replace and preg_replace_callback:
//Regular expression to deal with links...
function replaceCallback($match){
$url = $match[3];
...
return $match[1].$match[2].$replacement.$match[4];
}
//$html is curl'd page contents
$pattern = "/(<a.*?href\s*=\s*)('|\")(.*?)('|\")/i";
$html = preg_replace_callback($pattern,'replaceCallback',$html);
Regular expressions are hard to get your head around. But when you do you will be highly rewarded as they are very powerful...
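For a version you can run as-is, here is a rough sketch of the same preg_replace_callback idea. It only prefixes links that have no scheme with a base URL, which is a simplification of real relative-URL resolution. BASE_URL, the function name absolutizeLinks and the sample markup are all made up for illustration.

```php
<?php
// Hypothetical base URL of the page that was fetched.
const BASE_URL = 'http://example.com/';

function absolutizeLinks(string $html): string
{
    // Groups: 1 = '<a ... href=', 2 = opening quote, 3 = URL; \2 matches the same quote again.
    $pattern = '/(<a\b[^>]*?\bhref\s*=\s*)([\'"])(.*?)\2/i';

    $result = preg_replace_callback($pattern, function (array $match): string {
        $url = $match[3];
        // Leave absolute URLs, protocol-relative URLs and in-page anchors untouched.
        if (!preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $url)) {
            $url = BASE_URL . ltrim($url, '/');
        }
        return $match[1] . $match[2] . $url . $match[2];
    }, $html);

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

$html = '<a href="/about">About</a> <a href=\'http://other.test/\'>Other</a>';
$result = absolutizeLinks($html);
echo $result, "\n";
```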

Parsing webpage from php

I'm working on getting my new website up and I cannot figure out the best way to do some parsing.
What I'm trying to do is parse this webpage for the comments (last 3), the "what's new" page, the permissions page, and the right-bar (the one with the ratings etc).
I have looked at parse_url and a few other methods, but nothing is really working at all.
Any help is appreciated, and examples are even better! Thanks in advance.
I recommend using the DOM for this job. Here is an example that fetches all the URLs in a web page:
$doc = new DOMDocument();
$doc->loadHTMLFile('http://www.theurlyouwanttoscrape.com');
foreach( $doc->getElementsByTagName('a') as $item){
$href = $item->getAttribute('href');
var_dump($href);
}
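The same DOM approach can pick out specific elements with XPath. The sketch below works on an inline HTML string so it runs without a network connection. The "comment" class name and the function name lastComments are invented for the example, so adjust the XPath query to whatever the real page uses.

```php
<?php
// Inline sample markup; the "comment" class name is a made-up example.
$html = '<html><body>
  <div class="comment">first</div>
  <div class="comment">second</div>
  <div class="comment">third</div>
  <div class="comment">fourth</div>
</body></html>';

function lastComments(string $html, int $count): array
{
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid; collect warnings quietly
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $texts = [];
    foreach ($xpath->query('//div[@class="comment"]') as $node) {
        $texts[] = trim($node->textContent);
    }
    // A negative offset takes the last $count entries.
    return array_slice($texts, -$count);
}

$latest = lastComments($html, 3);
print_r($latest);
```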
Simple HTML DOM
I use it and it works great. Samples at the link provided.
parse_url parses the actual URL (not the page the URL points to).
What you want to do is scrape the webpage it points to and pick up content from there. You would need to use fopen (or file_get_contents) to read the HTML source of the page, and then parse the HTML and pick up what you need.
Disclaimer: Scraping pages is not always allowed.
PHP SimpleXML extension is your friend here: http://php.net/manual/en/book.simplexml.php

Simple HTML DOM only returns partial html of website

I had a big PHP script written out to scrape images from this site: "http://www.mcso.us/paid/", but when it didn't work I butchered my code to simply echo the whole page.
I found that the table with the image links I want doesn't show up. I believe it's because the remote site uses ASP to generate the table. Is there a way around this? Am I wrong? Please help.
<?php
include("simple_html_dom.php");
set_time_limit(0);
$baseURL = "http://www.mcso.us/paid/";
$html = file_get_html($baseURL);
echo $html;
?>
There's no obvious reason why them using ASP would cause this. Have you tried navigating the page with JavaScript turned off? It's more likely that the tables are generated through JS.
Note that the search results are retrieved through Ajax by making a POST request to http://www.mcso.us/paid/default.aspx. You can reproduce that request with cURL (http://php.net/manual/en/book.curl.php). In Chrome, right-click, choose Inspect Element, open the Network tab and run a search; you will see all the details there (POST variables etc.).
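A minimal sketch of such a POST request with cURL follows. The helper name postForm and the field name in the commented-out call are placeholders, since the real POST variables have to be copied from the browser's network tab.

```php
<?php
// Hypothetical helper: POSTs form fields and returns the response body, or null on failure.
function postForm(string $url, array $fields): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields), // sent as application/x-www-form-urlencoded
        CURLOPT_RETURNTRANSFER => true,                      // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch);

    return $body === false ? null : $body;
}

// Field names are placeholders; copy the real ones from the browser's network tab.
// $html = postForm('http://www.mcso.us/paid/default.aspx', ['lastName' => 'Smith']);
```

The returned HTML can then be handed to str_get_html from Simple HTML DOM instead of calling file_get_html on the URL.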

How to pass part of url in new link? Using only HTML & PHP

I have been trying to use the Facebook share function on my website, but I can't seem to get the right result.
Say:
I have a page called http://www.example.com/product.php?prod=lpd026n&cat=43
and I am using Facebook's share function to let visitors share the page on their FB wall.
I tried writing the link this way, but it doesn't seem to work:
href="http://www.facebook.com/share.php?u=www.example.com/proddetail.php?<?php print urlencode(@$_SERVER['QUERY_STRING']!=''?'?'.$_SERVER['QUERY_STRING']:'')?>"
As a result, the arguments in the URL came out encoded as %26, %3D, etc.
I.e.: example.com/proddetail.php?prod%3Dlpd026n%26cat%3D43
As some of you may know, the data after the '?' is dynamic, and I am planning to use the code above in the frame of the page, so a different query will be passed to the share link for every item.
The end result that I want should look like this:
http://www.facebook.com/sharer.php?u=http://www.example.com/proddetail.php?prod=lpd026n&cat=43
Not
http://www.facebook.com/share.php?u=http://www.example.com/proddetail.php?prod%3Dlpd026n%26cat%3D43
Can anyone help me solve this problem?
Thanks in advance!
PS: If anything is unclear, please ask me to clarify.
This URL:
http://www.facebook.com/share.php?u=http://www.example.com/proddetail.php?prod%3Dlpd026n%26cat%3D43
is only partially-encoded. You actually need to fully URL-encode it before passing to FB, so that it won't interfere with FB's URL structure. I'm sure that their script will know how to parse it properly.
The correct method is:
$url = 'http://www.facebook.com/sharer.php?u='.urlencode('http://www.example.com/proddetail.php?prod=lpd026n&cat=43');
// evaluates to:
// http://www.facebook.com/sharer.php?u=http%3A%2F%2Fwww.example.com%2Fproddetail.php%3Fprod%3Dlpd026n%26cat%3D43
Update: build your dynamic query
// Original URL
$url = 'http://www.example.com/proddetail.php';
if ($_SERVER['QUERY_STRING'])
$url .= '?'.$_SERVER['QUERY_STRING'];
// Final URL for FB
$fb_url = 'http://www.facebook.com/share.php?u='.urlencode($url);
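Wrapped up as a small function, the approach above might look like the sketch below. The function name facebookShareUrl is made up; in a real page you would pass $_SERVER['QUERY_STRING'] as the second argument.

```php
<?php
// Hypothetical helper: wraps a page URL (plus its current query string) in a sharer link.
function facebookShareUrl(string $pageUrl, string $queryString = ''): string
{
    if ($queryString !== '') {
        $pageUrl .= '?' . $queryString;
    }
    // Encode the complete page URL so its "?", "&" and "=" stay inside the "u" value.
    return 'http://www.facebook.com/sharer.php?u=' . urlencode($pageUrl);
}

$share = facebookShareUrl('http://www.example.com/proddetail.php', 'prod=lpd026n&cat=43');
// htmlspecialchars() is for safe output inside an href attribute.
echo htmlspecialchars($share), "\n";
```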
This is what urlencode does. What is the problem with the link written this way?
Edit: I do not use PHP, but I think the following will do the trick (omitted the urlencode):
href="http://www.facebook.com/share.php?u=www.example.com/proddetail.php?<?php print $_SERVER['QUERY_STRING']?>"
I guess K Prime is right.
You need to encode the whole URL because the slashes and ":" are still causing problems in this link ;)
$url = 'http://www.facebook.com/sharer.php?u='.urlencode('http://www.example.com/proddetail.php?prod=lpd026n&cat=43');
should be fine for your purposes.
