How to download a webpage in PHP

I was wondering how I could download a webpage in PHP for parsing?

You can use something like this:
$homepage = file_get_contents('http://www.example.com/');
echo $homepage;

Since you will likely want to parse the page with DOM, you can load it directly (loadHTMLFile() for an HTML page, or load() for well-formed XML):
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.example.com');
This works when allow_url_fopen is enabled in your PHP configuration.
But basically, any function that supports HTTP stream wrappers can be used to download a page.
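For instance, fopen() combined with stream_get_contents() works the same way; a minimal sketch, assuming allow_url_fopen is on:
// Illustrative only: any stream-wrapper-aware function can fetch the page
$handle = fopen('http://www.example.com/', 'r');
if ($handle !== false) {
    $homepage = stream_get_contents($handle);
    fclose($handle);
    echo $homepage;
}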

You can also do it with the cURL library.
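A minimal sketch (the URL is only a placeholder):
// Illustrative only: fetch a page into a string with cURL
$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$homepage = curl_exec($ch);
curl_close($ch);
echo $homepage;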

Just to add another option, because it is there: while not the best choice, you can simply use file(). It's another option that I don't see anyone else listing here.
$array = file("http://www.stackoverflow.com");
It's nice if you want the page as an array of lines, whereas the already mentioned file_get_contents() puts it in a string.
Just another thing you can do.
Then you can loop through each line, if that matches what you're trying to do:
foreach ($array as $line) {
    echo $line;
    // do other stuff here
}
This comes in handy sometimes when certain APIs spit out plain text or html with a new entry on each line.

You can use this code:
$url = 'your url';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
curl_close($ch);
// you can do something with $data like explode(); or a preg match regex to get the exact information you need
//$data = strip_tags($data);
echo $data;
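For instance, a rough sketch of pulling the page title out of $data with a regular expression (the pattern is only illustrative):
// Illustrative only: grab the contents of the <title> tag from $data
if (preg_match('/<title>(.*?)<\/title>/is', $data, $matches)) {
    echo $matches[1]; // the page title, if the tag was found
}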

Related

Get Content from Web Pages with PHP

I am working on a small project to get information from several webpages based on the HTML markup of the page, and I do not know where to start at all.
The basic idea is to get the title from the <h1></h1> tags, the content from the <p></p> tags, and other important information that is required.
I would have to set up each case for each source for it to work the way it needs to. I believe the right method is using the $_GET method with PHP. The goal of the project is to build a database of information.
What is the best method to grab the information which I need?
First of all: PHP's $_GET is not a method. As you can see in the documentation, $_GET is simply an array initialized with the GET parameters your web server received during the current request. As such, it is not what you want to use for this kind of thing.
What you should look into is cURL, which allows you to compose even fairly complex requests, send them to the destination server, and retrieve the response. For example, for a POST request you could do something like:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.mysite.com/tester.phtml");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS,
    "postvar1=value1&postvar2=value2&postvar3=value3");
// in real life you should use something like:
// curl_setopt($ch, CURLOPT_POSTFIELDS,
//     http_build_query(array('postvar1' => 'value1')));
// receive server response ...
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec($ch);
curl_close($ch);
Of course, if you don't need to do any complex queries but only simple GET requests, you can go with the PHP function file_get_contents.
After you have received the web page content, you have to parse it. IMHO the best way to do this is by using PHP's DOM functions. How to use them should really be another question, but you can find tons of examples without much effort.
<?php
$remote = file_get_contents('http://www.remote_website.html');
$doc = new DomDocument();
@$doc->loadHTML($remote);   // @ suppresses warnings from malformed HTML

$titles = array();
$cells = $doc->getElementsByTagName('h1');
foreach ($cells as $cell) {
    $titles[] = $cell->nodeValue;
}

$content = array();
$cells = $doc->getElementsByTagName('p');
foreach ($cells as $cell) {
    $content[] = $cell->nodeValue;
}
?>
You can get the HTML source of a page with:
<?php
$html= file_get_contents('http://www.example.com/');
echo $html;
?>
Then, once you have the structure of the page, you can get the requested tag with substr() and strpos().
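A rough sketch of that approach, assuming the page has a single <h1> tag whose text you want (the tag choice is just an example):
// Illustrative only: extract the first <h1>...</h1> from $html with strpos()/substr()
$start = strpos($html, '<h1>');
if ($start !== false) {
    $start += strlen('<h1>');
    $end = strpos($html, '</h1>', $start);
    if ($end !== false) {
        $title = substr($html, $start, $end - $start);
        echo $title;
    }
}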

Using PHP to open a URL ending in .rss

I'm trying to load the data stored at
https://www.nextbigsound.com/charts/social50.rss using PHP.
I've tried using cURL and the simplexml_load_file function, but neither of these is working for me. I'm pretty new to the language, but the data displayed when I follow this link looks like nothing more than an XML document. However, whenever I try to load the URL, I get an error saying that it failed to load. Any ideas what I'm doing wrong?
Edit:
This is the line I tried with simplexml_load_file:
$rss = simplexml_load_file('https://www.nextbigsound.com/charts/social50.xml');
When I tried the curl method I used this:
$url = 'https://www.nextbigsound.com/charts/social50.xml';
$ch = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$rss = curl_exec($curl);
curl_close($curl);
To check if results were returned I just ran a simple var_dump($rss) but it always showed up as a boolean set to false.
A simple file_get_contents will work:
$content = file_get_contents('https://www.nextbigsound.com/charts/social50.rss');
echo $content;
or with the use of simplexml_load_string:
$data = file_get_contents('https://www.nextbigsound.com/charts/social50.rss');
$xml = simplexml_load_string($data);
The above code assumes that the https stream wrapper is available; you can check that like this:
echo 'https wrapper: ', in_array('https', stream_get_wrappers()) ? 'yes' : 'no', "\n";
If the output is 'no', you need to enable the openssl extension. This can be done by uncommenting the corresponding line in your php.ini:
extension=php_openssl.dll ; on a Windows machine
extension=php_openssl.so  ; on a Unix machine

Load a div's content from another domain into my page's PHP variable

I'm trying to figure out how to take information from another site (with a different domain name) and place it in my PHP program.
Explanation:
The user inputs a URL from another site.
jQuery or PHP takes the information from the entered URL. I know where the information is (I know its div's ID).
And that value is put into my PHP program as a variable, $kaina for example.
EX:
User enters the URL: http://www.sportsdirect.com/lee-cooper-bud-mens-boots-118358
And I want to get the price (27,99).
What language should I use? PHP? jQuery? Or anything else?
What function should I use?
What should the program look like?
Thank you for your answers :)
I'd say you have to use PHP (cURL or file_get_contents) to download the page onto your server, then parse it or use a regular expression to get the price. In this case it will be even trickier, because it looks like this link leads to a page that uses JavaScript.
But you have to know the format of the data you are going to extract. So PHP will do the job.
PHP's cURL library should do the trick for you: http://php.net/manual/en/book.curl.php
<?php
$ch = curl_init("http://www.example.com/");
$fp = fopen("example_homepage.txt", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);
?>
You need to research each of the steps mentioned below.
One thing you can do is post the message entered by the user to the server (i.e. a PHP file), where you can extract the URL the user entered.
In order to extract the URL from the user's post, you can use a regex search.
Check this link out:
Extract URLs from text in PHP
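For instance, a deliberately loose sketch (the variable $user_post is hypothetical, and the pattern is only illustrative); it produces the $extracted_url used in the cURL call below:
// Illustrative only: pull the first http(s) URL out of the posted text
// ($user_post is a hypothetical variable holding the user's message)
if (preg_match('#https?://\S+#i', $user_post, $matches)) {
    $extracted_url = $matches[0];
}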
Now you can cURL to the URL extracted from the user input.
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $extracted_url);
$html = curl_exec($ch);
curl_close($ch);
The cURL output will contain the complete HTML of the page; you can then use an HTML parser such as
$DOM = new DOMDocument;
$DOM->loadHTML($html);
to parse it until the required div is found and read its value.
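For example, continuing from $DOM above: if the div's ID is known (the ID 'price' here is purely hypothetical), something like this may be enough:
// Hypothetical ID; getElementById works on documents loaded with loadHTML()
$div = $DOM->getElementById('price');
if ($div !== null) {
    $kaina = trim($div->nodeValue);  // the div's text content
}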
I would probably do something like this:
get the contents of the file: $contents = file_get_contents("http://www.sportsdirect.com/lee-cooper-bud-mens-boots-118358");
convert the contents you just got to XML: $xml = new SimpleXMLElement($contents);
search the XML for the node with the attribute itemprop="price" using an XPath query
read the contents of that node, et voilà, you have your price (a rough sketch of these steps follows)
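A rough sketch of those steps, assuming (not verified) that the page marks the price with itemprop="price". It uses DOMDocument/DOMXPath rather than SimpleXMLElement, since real-world HTML is rarely well-formed XML:
<?php
// Rough sketch only; the itemprop="price" markup on the target page is assumed.
$contents = file_get_contents('http://www.sportsdirect.com/lee-cooper-bud-mens-boots-118358');
$doc = new DOMDocument();
@$doc->loadHTML($contents);                    // suppress warnings from sloppy HTML
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//*[@itemprop="price"]');
if ($nodes->length > 0) {
    $kaina = trim($nodes->item(0)->nodeValue); // e.g. "27,99"
    echo $kaina;
}
?>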

Use of the PHP function file_get_contents

I have to crawl some values from a website. Should I use cURL for that, or file_get_contents?
I am getting some warnings with file_get_contents on my localhost.
Any help will be appreciated
If you have basic requirements, I would favor file_get_contents. If you need to set headers, the request method, etc., I would recommend using cURL.
In your case, I think file_get_contents is enough.
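A minimal sketch with a failure check, which also keeps the warning you mention from going unnoticed (the URL is just a placeholder):
// Illustrative only: check the return value instead of ignoring the warning
$html = file_get_contents('http://example.com/page-to-crawl'); // placeholder URL
if ($html === false) {
    // handle the failure: check allow_url_fopen, the URL, connectivity, ...
    exit('Could not fetch the page');
}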
Alternatively, you can use file(), which returns an array of lines from the retrieved file. It works with locally accessible files, and also with remote URLs. I often find it more convenient to loop over an array of lines than to deal with the whole file in one block, so this might be your best option.
<?php
foreach (file('http://example.com/the-file.ext') as $line) {
    // do something with $line
}
?>
I think cURL is preferable to file_get_contents, as you can set headers, use request methods like POST or GET, follow redirects, etc. So cURL is advisable:
<?php
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
echo $data;
?>
file_get_contents should be just fine for that purpose

Loading a remote xml page with file_get_contents()

I have seen some questions similar to this on the internet, none with an answer.
I want to return the source of a remote XML page into a string. The remote XML page, for the purposes of this question, is:
http://www.test.com/foo.xml
In a regular web browser, I can view the page and the source is an XML document. When I use file_get_contents('http://www.test.com/foo.xml'), however, it returns a string with the corresponding URL.
Is there a way to retrieve the XML content? I don't care if it uses file_get_contents or not, just something that will work.
You need to have allow_url_fopen enabled on your server for this to work.
If you don't, then you can use this function as a replacement:
<?php
function curl_get_file_contents($URL)
{
    $c = curl_init();
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c, CURLOPT_URL, $URL);
    $contents = curl_exec($c);
    curl_close($c);
    if ($contents) return $contents;
    else return FALSE;
}
?>
That seems odd. Does file_get_contents() return any valid data for other sites (not only XML)? A URL can only be used as the filename parameter if the fopen wrappers have been enabled (which they are by default).
I'm guessing you're going to process the retrieved XML later on; then you should be able to load it into SimpleXML directly using simplexml_load_file().
try {
$xml = simplexml_load_file('http://www.test.com/foo.xml');
print_r($xml);
} ...
I recommend using SimpleXML for reading XML files; it's very easy to use.
