Possible Duplicate: How do I save a web page, programmatically?
Closed 10 years ago.
I'm just starting with curl and I've managed to pull an external website:
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    $userAgent = 'Mozilla/5.0 (compatible; MyScraper/1.0)'; // define a user agent string
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
$test = get_data("http://www.selfridges.com");
echo $test;
However, the CSS and images are not included. I also need to retrieve the CSS and images, basically the whole website. Can someone please post a brief way for me to get started on parsing out the CSS, image and URL references?
There are better tools to do this than PHP, e.g. wget with the --page-requisites parameter.
Note however that automatic scraping is often a violation of the site's TOS.
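A rough sketch of that approach driven from PHP (assumes the wget binary is installed and on the PATH; the mirror directory name is just an example):

<?php
// --page-requisites pulls the CSS, images and scripts the page needs;
// --convert-links rewrites the saved HTML so the local copy references them
$url = 'http://www.selfridges.com';
shell_exec('wget --page-requisites --convert-links -P mirror ' . escapeshellarg($url));
?>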
There are HTML parsers for PHP. There are quite a few available; here's a post that discusses them: How do you parse and process HTML/XML in PHP?
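To give an idea of what the parsing step looks like, here is a minimal sketch (one possible approach, reusing the get_data() function from the question) that lists the stylesheet and image URLs found in the fetched HTML:

$html = get_data("http://www.selfridges.com");

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings caused by real-world markup

// Stylesheets: <link rel="stylesheet" href="...">
foreach ($doc->getElementsByTagName('link') as $link) {
    if (strtolower($link->getAttribute('rel')) == 'stylesheet') {
        echo $link->getAttribute('href') . "\n";
    }
}

// Images: <img src="...">
foreach ($doc->getElementsByTagName('img') as $img) {
    echo $img->getAttribute('src') . "\n";
}
// You would still need to resolve relative URLs and download each file separately.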
This question already has answers here:
How do you parse and process HTML/XML in PHP? (31 answers)
Closed 9 years ago.
Is it possible to scrape a webpage with PHP without downloading some sort of PHP library or extension?
Right now, I can grab meta tags from a website with PHP like this:
$tags = get_meta_tags('www.example.com/');
echo $tags['author']; // name
echo $tags['description']; // description
Is there a similar way to grab info like the href from this tag, from any given website:
<link rel="img_src" href="image.png"/>
I would like to be able to do it with just PHP.
Thanks!
Try the file_get_contents function. For example:
<?php
// Fetch the raw HTML (allow_url_fopen must be enabled)
$data = file_get_contents('http://www.example.com');

// Example pattern: capture the href value of the <link rel="img_src"> tag
$regex = '/<link\s+rel="img_src"\s+href="([^"]+)"/i';

if (preg_match($regex, $data, $match)) {
    var_dump($match);
    echo $match[1]; // the captured href
}
?>
You could also use the cURL library - http://php.net/manual/en/book.curl.php
Use cURL for more advanced functionality. You'll be able to access headers, handle redirections, etc. See the PHP cURL manual.
<?php
$c = curl_init();
// set some options
curl_setopt($c, CURLOPT_URL, "http://google.com");
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1); // return the response as a string
$data = curl_exec($c); // $data now holds the page HTML
curl_close($c);
?>
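For instance, a small sketch of following redirects and reading response metadata (these options and curl_getinfo() are part of the standard cURL extension):

<?php
$c = curl_init();
curl_setopt($c, CURLOPT_URL, "http://google.com");
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, 1); // follow 301/302 redirects
curl_setopt($c, CURLOPT_HEADER, 1);         // include response headers in $data
$data = curl_exec($c);

$info = curl_getinfo($c); // final URL, HTTP status code, content type, ...
curl_close($c);

echo $info['url'] . ' returned HTTP ' . $info['http_code'];
?>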
This question already has answers here:
Scrape web page data generated by javascript (2 answers)
Web-scraping JavaScript page with Python (18 answers)
Closed 3 years ago.
I am trying to create a basic web crawler that specifically looks for links from adverts.
I have managed to find a script that uses cURL to get the contents of the target webpage.
I also found one that uses DOM.
<?php
$ch = curl_init("http://www.nbcnews.com");
$fp = fopen("source_code.txt", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);  // write the response straight to the file
curl_setopt($ch, CURLOPT_HEADER, 0);  // don't include response headers
curl_exec($ch);
curl_close($ch);
fclose($fp);
?>
These are great and I certainly feel like I'm heading in the right direction, except that quite a few adverts are inserted with JavaScript. Since that runs client-side, it obviously isn't executed when I fetch the page, so I only see the JS code and not the ads.
Basically, is there any way of getting the JS to execute before I start trying to extract the links?
Thanks
I am working on an application that needs to scrape part of a website the user submits. I want to collect useful and readable content from the website, definitely not the whole site. If I look at applications that also do this (thinkery, for example), I notice that they somehow managed to create a way to scrape the website, guess what the useful content is, show it in a readable format, and do all of that pretty fast.
I've been playing with cURL and I am getting pretty near the result I want but I have some issues and was wondering if someone has some more insights.
$ch = curl_init('http://www.example.org');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// $content contains the whole website
$content = curl_exec($ch);
curl_close($ch);
With the very simple code above I can scrape the whole website, and with preg_match() I can try to find divs whose class, id or other attributes contain strings like 'content', 'summary', et cetera.
If preg_match() finds a match, I can fairly safely guess that I have found relevant content and save it as the summary of the saved page. The problem I have is that cURL keeps the WHOLE page in memory, so this can take up a lot of time and resources. And I think that running preg_match() over such a large result can also take a lot of time.
Is there a better way to achieve this?
I tried the DOMDocument::loadHTMLFile approach as One Trick Pony suggested (thanks!):
$ch = curl_init('http://stackoverflow.com/questions/17180043/extracting-useful-readable-content-from-a-website');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
@$doc->loadHTML($content); // suppress warnings from malformed HTML
$div_elements = $doc->getElementsByTagName('div');

if ($div_elements->length != 0) {
    foreach ($div_elements as $div_element) {
        if ($div_element->getAttribute('itemprop') == 'description') {
            var_dump($div_element->nodeValue);
        }
    }
}
The result of the above code is my question here on this page! The only thing left to do is find a good and consistent way to loop through or query the divs and determine whether they are useful content.
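For what it's worth, a sketch of a more direct way to query those divs with DOMXPath (reusing $content from the snippet above; the id/class substrings to test for are just guesses):

$doc = new DOMDocument();
@$doc->loadHTML($content);
$xpath = new DOMXPath($doc);

// Grab divs whose id or class hints at main content
$query = '//div[contains(@id, "content") or contains(@class, "content")'
       . ' or contains(@id, "summary") or contains(@class, "summary")]';

foreach ($xpath->query($query) as $div) {
    var_dump(trim($div->nodeValue));
}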
Possible Duplicate: Fetching data from another website
Closed 10 years ago.
I want to create a webpage that displays another webpage on it at the user's request. The user enters a URL and sees the webpage they want on my website. The request to the other page has to come from my server, not from the user, otherwise I could just use an iframe.
I'm willing to write it in PHP because I know some of it. Can anyone tell me what subjects one must know to do this?
You need some kind of "PHP proxy" for this, which means getting the website contents via cURL or file_get_contents(). Have a look at this here: http://davidwalsh.name/curl-download
Your proxy script may look like this:
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
echo get_data($_GET["url"]);
Please note that you may have to pay attention to headers for images etc., and there may also be some security flaws (the script will fetch any URL it is given), but that is the basic idea.
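For example, to get CSS and images through with the right headers you could pass the upstream Content-Type along. A rough sketch (get_data_with_type is just an illustrative name, and this still fetches whatever URL the user supplies, so validate it):

function get_data_with_type($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    $data = curl_exec($ch);
    $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE); // e.g. "text/css" or "image/png"
    curl_close($ch);
    return array($data, $type);
}

list($data, $type) = get_data_with_type($_GET["url"]);
if ($type) {
    header('Content-Type: ' . $type);
}
echo $data;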
Now you have to parse the contents of the initial website you just got and change all links from this format:
http://example.com/thecss.css
to
http://yoursite.com/proxy.php?url=http://example.com/thecss.css
Some regexes or a PHP HTML parser may work here.
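A minimal sketch of the rewriting step using DOMDocument (assumes your proxy lives at http://yoursite.com/proxy.php; it only handles absolute URLs in href/src attributes, and rewrite_links is just an illustrative name):

function rewrite_links($html, $proxy = 'http://yoursite.com/proxy.php?url=') {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);

    // Route every absolute href/src (stylesheets, scripts, images, links)
    // back through the proxy script
    foreach ($doc->getElementsByTagName('*') as $node) {
        foreach (array('href', 'src') as $attr) {
            $value = $node->getAttribute($attr);
            if (strpos($value, 'http') === 0) {
                $node->setAttribute($attr, $proxy . urlencode($value));
            }
        }
    }
    return $doc->saveHTML();
}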
You could just use
echo file_get_contents('http://google.com')
But why not just download a ready-made PHP web proxy package like http://sourceforge.net/projects/poxy/
Possible Duplicate: How to extract img src, title and alt from html using php?
Closed 10 years ago.
I want to replicate some functionality from Digg.com whereby, when you post a new address, it automatically scans the URL and finds the page title.
Please tell me how this is done in PHP. Also, is there a content management system available with which you can make a website like Digg?
You can use file_get_contents() to get the data from the page, then use preg_match() along with a regex pattern to get the data between <title> and </title>:
'/<title>(.*?)<\/title>/'
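Put together, a minimal sketch of that approach (the i and s modifiers make the match case-insensitive and tolerant of multi-line titles):

$html = file_get_contents('http://www.example.com');
if (preg_match('/<title>(.*?)<\/title>/is', $html, $match)) {
    echo trim($match[1]); // the page title
}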
You can achieve this with an Ajax call to the server, where you cURL the URL and send back all the details you want. You might be interested in the title, description, keywords, etc.
function get_title($url) {
    $ch = curl_init();
    $timeout = 5;
    $titleName = '';
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    // $data contains the whole page; parse out the text
    // between <title> and </title>, e.g. <title>Google</title>
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $data, $match)) {
        $titleName = trim($match[1]);
    }
    return $titleName;
}
You need to implement a smarter way of parsing, because a naive search will not find variants like <title > Google < /title>.