I'm using this code to get data using cURL
$url='http://example.com/'; //URL to get content from..
print_r(get_data($url)); //Dumps the content
/* Gets the data from a URL */
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
However, this code returns data with relative URLs. How can I get rid of these relative URLs and print absolute URLs instead? Maybe with preg_replace, but how?
Have a look at the HTML base tag. You should find it helpful if you want to let the browser do all the relative-to-absolute conversion:
$data = get_data($url);
// Note: ideally you should use DOM manipulation to inject the <base>
// tag inside the <head> section
$data = str_replace("<head>", "<head><base href=\"$url\">", $data);
echo $data;
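Since the note in the code above mentions DOM manipulation, here is a minimal sketch of injecting the <base> tag with PHP's built-in DOMDocument instead of str_replace (assuming $data holds the fetched HTML and $url is the page's base URL):
$dom = new DOMDocument;
@$dom->loadHTML($data); // suppress warnings from messy real-world HTML
$base = $dom->createElement('base');
$base->setAttribute('href', $url);
$head = $dom->getElementsByTagName('head')->item(0);
if ($head !== null) {
    $head->insertBefore($base, $head->firstChild); // put <base> at the top of <head>
}
echo $dom->saveHTML();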
I think you should use an HTML parser like http://simplehtmldom.sourceforge.net/ and replace all links with the correct path.
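A rough sketch of that idea, using PHP's built-in DOMDocument rather than simplehtmldom (and assuming only root-relative links such as /page.html need fixing):
$dom = new DOMDocument;
@$dom->loadHTML(get_data($url));
foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if ($href !== '' && $href[0] === '/') {
        // Prefix root-relative links with the site's base URL; absolute links are left alone
        $a->setAttribute('href', rtrim($url, '/') . $href);
    }
}
echo $dom->saveHTML();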
Related
I have a PHP script that loads this webpage to extract some data from its tables.
The following methods failed to get its table contents:
Using file_get_contents:
$document = file_get_contents("http://www.webpage.com/");
print_r($document);
Using cURL:
$document = curl_init('http://www.webpage.com/');
curl_setopt($document, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($document);
print_r($html);
Using loadHTMLFile:
$document = new DOMDocument();
$document->loadHTMLFile('http://www.webpage.com/');
print_r($document);
I'm not an expert in PHP; except for the first method, the others are copied from StackOverflow answers.
What am I doing wrong?
And how do they block some contents from loading?
Not the answer you're likely to want to hear, but none of the methods you describe will evaluate JavaScript and other browser resources as a normal browser client would. Instead, each of those methods retrieves the contents of only the file you've specified. A quick glance at the site you're targeting clearly shows this table in question being populated as the result of an AJAX call, which none of the methods you've tried are able to evaluate.
You'll need to lean on a library or script that has the capability for this type of emulation, such as laravel/dusk, the PHP bindings for Selenium WebDriver, or something similar.
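For example, with the php-webdriver Selenium bindings (a rough sketch, and an assumption on my part that this is the binding you would pick; it also assumes a Selenium or ChromeDriver server is already listening on localhost:4444):
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;
// Connect to a running Selenium server and load the page in a real browser
$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::chrome());
$driver->get('http://www.webpage.com/');
// The page source now includes content rendered by JavaScript/AJAX (a wait may be needed for slow calls)
$html = $driver->getPageSource();
$driver->quit();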
This is what I did to scrape data from a webpage using php curl:
// Defining the basic cURL function
function curl($url) {
$ch = curl_init(); // Initialising cURL
curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
$data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch); // Closing cURL
return $data; // Returning the data from the function
}
// Defining the basic scraping function
function scrape_between($data, $start, $end){
$data = stristr($data, $start); // Stripping all data from before $start
$data = substr($data, strlen($start)); // Stripping $start
$stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
$data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
return $data; // Returning the scraped data from the function
}
$target_url = "https://www.somesite.com";
$scraped_website = curl($target_url);
$data_set_1 = scrape_between($scraped_website, "%before%", "%after%");
$data_set_2 = scrape_between($scraped_website, "%before%", "%after%");
%before% and %after% are placeholders for text that always shows up on the webpage before and after the data you wish to grab. They could be div tags or some other HTML tags that are unique to the data you want.
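For instance, a hypothetical call that pulls out whatever sits between the page's <title> tags:
$page_title = scrape_between($scraped_website, "<title>", "</title>");
echo $page_title;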
So maybe look into using cURL and imitating the same AJAX request that the site is using? When I searched for that, this is what I found:
Mimicking an ajax call with Curl PHP
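If you can spot the AJAX endpoint in your browser's network tab, a minimal sketch of calling it directly with cURL (the endpoint URL below is a placeholder; the X-Requested-With header is what many back-ends check to decide whether a request "looks like" AJAX):
$ch = curl_init('http://www.webpage.com/path/to/ajax-endpoint'); // placeholder endpoint
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-Requested-With: XMLHttpRequest'));
$response = curl_exec($ch);
curl_close($ch);
print_r($response);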
This is my first ever question here.
I have following code:
<?php
set_time_limit(3000);
$url= "https://telenorcsms.com.pk:27677/corporate_sms2/api/auth.jsp? msisdn=xxxx&password=xxxx";
print_r(get_data($url));
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
?>
But unfortunately it's not giving any response, just a blank page.
I want to get the response in XML and find the <data> value only.
When accessing the URL directly in a browser, it gives the following response:
This XML file does not appear to have any style information associated with it. The document tree is shown below.
<corpsms>
<command>Auth_request</command>
<data>e8570f2482444d0ca50b76727d01a846</data>
<response>OK</response>
</corpsms>
Please respond with some solution.
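For what it's worth, once a response like the XML above does come back, the <data> value can be pulled out with PHP's built-in SimpleXML; a minimal sketch, assuming get_data() returns the XML shown:
$xml = simplexml_load_string(get_data($url));
if ($xml !== false && isset($xml->data)) {
    echo (string) $xml->data; // e.g. e8570f2482444d0ca50b76727d01a846
}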
I'm trying to request information about a domain, without success. Code:
<?php
echo file_get_contents('https://sb-ssl.google.com/safebrowsing/api/lookup?client=asasd&apikey=MYKEY&appver=1.5.2&pver=3.0&url=http%3A%2F%2Fwww.onet.pl%2F');
?>
Why isn't it working?
//function for getting the data from url
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
Then get the content using the function call:
$returned_content = get_data('your url');
file_get_contents() on remote URLs is a security threat, and many servers have this feature (allow_url_fopen) disabled in PHP.
Why isn't it working?
Because the URL is wrong:
http://www.google.com/safebrowsing/diagnostic?site=http://example.com/
Just take a look at the documentation:
With URLs you should use urlencode().
The fopen wrapper has to be enabled (same as for fopen()).
Maybe the URL is wrong; when I copy your URL and try to open it, I get a page-load failure.
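A sketch of building that lookup URL with every parameter value properly encoded (still assuming the endpoint from the question is the right one):
// http_build_query URL-encodes each value, including the nested URL
$params = http_build_query(array(
    'client' => 'asasd',
    'apikey' => 'MYKEY', // placeholder key from the question
    'appver' => '1.5.2',
    'pver'   => '3.0',
    'url'    => 'http://www.onet.pl/',
));
echo file_get_contents('https://sb-ssl.google.com/safebrowsing/api/lookup?' . $params);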
I need any link that has an "a href=" tag, when clicked, to be fetched via cURL. I can't hard-code these links as they come from a dynamic site, so they could be anything. How would I achieve this?
Thanks
Edit: Let me explain more. I have an app on my PC that uses a web front end. It catalogs files and gives you options to rename, delete, etc. I want to add a public view; however, if I put it online as-is, then anyone can delete or rename files. If I cURL the pages, I can remove the menu bars and editing options through the use of a different CSS. That part all works. The only part that isn't working is that if I click on a link on the page, it directs me back to the original link address, which defeats the point as the menu bars are back. I need it to cURL the clicked links. Hope that makes more sense.
Here is my code that fetches the original link, cURLs it, and changes the CSS to point to my own CSS. It points the JavaScript to the original, as I don't need to change that. I now need to make the "a href" links on the page, when clicked, be fetched by cURL instead of going to the original destination.
<?php
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, 'http://192.168.0.14:8081/home/');
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
$curl_response = curl_exec($ch);
curl_close($ch);
//Change link url
$link = $curl_response;
$linkgo = '/sickbeard_public';
$linkfind = 'href="';
$linkreplace = 'href="' . $linkgo ;
$link = str_replace($linkfind, $linkreplace, $link);
//Change js url
$js = $link;
$jsgo = 'http://192.168.0.14:8081';
$jsfind = 'src="';
$jsreplace = 'src="' . $jsgo ;
$js = str_replace($jsfind, $jsreplace, $js);
//Fix on page link errors
$alink = $js;
$alinkgo = 'http://192.168.0.14:8081/';
$alinkfind = 'a href="/sickbeard_public/';
$alinkreplace = 'a href="' . $alinkgo ;
$alink = str_replace($alinkfind, $alinkreplace, $alink);
//Echo page back
echo $alink;
?>
You could grab all the URLs using a regular expression
// insert general warning about how parsing HTML using regex is evil :-)
preg_match_all('/href="([^"]+)"/', $html, $matches);
$urls = $matches[1]; // all captured href values
// Now just loop through the array and fetch the URLs with cURL...
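A minimal sketch of that loop, reusing a get_data()-style cURL helper like the ones shown further up this page:
foreach ($urls as $page_url) {
    echo get_data($page_url); // fetch each discovered link through cURL
}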
While I can't imagine why you would do that, I think you should use AJAX.
Attach an event to every a tag and send it to a script on your server where the magic of cURL would happen.
Anyway, you should explain why you need to fetch data with cURL.
As far as I can understand your question, you need to get the contents of a URL via cURL... so here is the solution.
<a href="..." id="my_link">Click here to get via curl</a>
Then attach an event to the above <a> tag, e.g. in jQuery:
$("#my_link").click(function(){
var target_url = $(this).attr("href");
//Send an ajax call to some of your page like cURL_wrapper.php with target_url as parameter in get
});
Then in cURL_wrapper.php do the following:
<?php
// Get the target URL here from $_GET
$target_url = $_GET['target_url'];
$ch = curl_init($target_url);
$fp = fopen("php://output", "w"); // write the response straight to the output stream
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);
?>
I am looking to display the HTML of another webpage inside my website.
Take this scenario:
I have a website that checks the availability of a hotel. But instead of hosting that hotel's images on my server, I simply cURL a specific page on the hotel's website that contains their images.
Can I grab anything from the HTML and display it on my website, using their HTML code but only the div(s) or images that I want to display?
I'm using this code, sourced from:
http://davidwalsh.name/download-urls-content-php-curl
For practice and argument's sake, let's try to display Google's logo from their homepage.
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$returned_content = get_data('http://www.google.com');
echo '<base href="http://www.google.com/" />';
echo $returned_content;
Thanks to @alex, I have started to play with PHP's DOMDocument library. However, I have hit a snag.
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$url = "www.abc.net.au";
$html = get_data($url);
$dom = new DOMDocument;
@$dom->loadHTML($html);
$logo = $dom->getElementById("abcLogo");
var_dump($logo);
Returns: object(DOMElement)[2]
How do I parse this further? Or simply print/echo the contents of the div with that ID?
Yes, run the resulting HTML through something like DOMDocument to extract the portions you require.
Once you have found a DOM element, it can be a bit tricky to get the HTML of the element itself (rather than just its contents).
You can get the XML value of a single element very easily with DOMDocument::saveXML:
echo $dom->saveXML($logo);
This may be good enough for you. I believe there is a change coming that will add this functionality to saveHTML as well.
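Indeed, on PHP 5.3.6 and later saveHTML() accepts a node argument, so the same thing works without going through the XML serializer:
echo $dom->saveHTML($logo); // PHP >= 5.3.6 only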
echo $logo->nodeValue should work, because you can only have one element with a given id!