This question already has answers here:
Scrape web page data generated by javascript
(2 answers)
Web-scraping JavaScript page with Python
(18 answers)
Closed 3 years ago.
I am trying to create a basic web crawler that specifically looks for links from adverts.
I have managed to find a script that uses cURL to get the contents of the target webpage.
I also found one that uses DOM.
<?php
$ch = curl_init("http://www.nbcnews.com");
$fp = fopen("source_code.txt", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);
?>
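The DOM-based script mentioned above is not shown. A minimal sketch of that idea, assuming PHP's bundled DOM extension is available, might look like the following. The function name `extract_links` and the sample markup are made up for illustration. It only sees links present in the static HTML, so anything injected later by JavaScript is still missing.

```php
<?php
// Hypothetical helper: collect the href of every <a> element in an HTML string.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Sample markup (an assumption, not taken from the target site).
$sample = '<html><body><a href="http://example.com/ad1">Ad</a><a name="x">no href</a></body></html>';
print_r(extract_links($sample));
```

In practice the `$html` argument would be the string returned by cURL rather than a literal.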
These are great, and I feel like I'm heading in the right direction. However, quite a few adverts are inserted by JavaScript, and because that runs client-side, cURL never executes it: I only see the JS code and not the ads.
Basically, is there any way of getting the JS to execute before I start trying to extract the links?
Thanks
This question already has answers here:
how to get the cookies from a php curl into a variable
(8 answers)
Closed 8 years ago.
When I request a page I get the response data, but how can I also get the session cookie that the page sets, using PHP cURL?
There are two ways (maybe more) you can do this.
Using the cookie file:
$cookie_file = 'e:/demo/cookies.txt';
curl_setopt($ch,CURLOPT_COOKIEJAR, $cookie_file);
curl_setopt($ch,CURLOPT_COOKIEFILE, $cookie_file);
Using the response headers that cURL returns along with the HTML source:
curl_setopt($curl_connection, CURLOPT_HEADER, true);
// this is returning the http response header along with html
In the second example, you'll find the cookies in the Set-Cookie: headers.
By the way, I assume you already know how to use cURL.
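For the header approach, one possible way to get the cookies into a variable is sketched below. It assumes both CURLOPT_HEADER and CURLOPT_RETURNTRANSFER are enabled, so that curl_exec() returns the headers followed by the body as one string. The helper name `parse_set_cookies` and the sample header block are hypothetical.

```php
<?php
// Hypothetical helper: pull name => value pairs out of Set-Cookie response headers.
function parse_set_cookies(string $rawHeaders): array
{
    $cookies = [];
    // One match per Set-Cookie line; capture only the "name=value" part before the first ';'.
    preg_match_all(
        '/^Set-Cookie:\s*([^=;\r\n]+)=([^;\r\n]*)/mi',
        $rawHeaders,
        $matches,
        PREG_SET_ORDER
    );
    foreach ($matches as $m) {
        $cookies[trim($m[1])] = $m[2];
    }
    return $cookies;
}

// Sample header block (made up for illustration).
$headers = "HTTP/1.1 200 OK\r\nSet-Cookie: PHPSESSID=abc123; path=/\r\nSet-Cookie: lang=en\r\n\r\n";
print_r(parse_set_cookies($headers));
```

Attributes such as path or expires are ignored here; only the cookie name and value are kept.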
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 9 years ago.
Is it possible to scrape a webpage with PHP without downloading some sort of PHP library or extension?
Right now, I can grab meta tags from a website with PHP like this:
$tags = get_meta_tags('http://www.example.com/'); // include the scheme, otherwise PHP looks for a local file
echo $tags['author']; // name
echo $tags['description']; // description
Is there a similar way to grab info like the href from this tag, from any given website:
<link rel="img_src" href="image.png"/>
I would like to be able to do it with just PHP.
Thanks!
Try the file_get_contents function. For example:
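<?php
// A URL needs its scheme; without "http://" PHP looks for a local file.
$data = file_get_contents('http://www.example.com');
// Placeholder pattern; wrap the part you want in (...) so it lands in $match[1].
$regex = '/(Search Pattern)/';
if ($data !== false && preg_match($regex, $data, $match)) {
    var_dump($match);
    echo $match[1]; // first capture group
}
?>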
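To make the "Search Pattern" placeholder above concrete for the tag in the question, here is a rough sketch. The helper name `find_link_href` is hypothetical, and the pattern assumes that rel appears before href inside the tag, as it does in the question's example. A regex like this is fragile against other attribute orders.

```php
<?php
// Hypothetical helper: return the href of the first <link> tag whose rel matches $rel.
function find_link_href(string $html, string $rel): ?string
{
    // Assumes rel comes before href inside the tag, as in the question's example.
    $pattern = '/<link\b[^>]*\brel=["\']' . preg_quote($rel, '/')
             . '["\'][^>]*\bhref=["\']([^"\']*)["\']/i';
    if (preg_match($pattern, $html, $match)) {
        return $match[1]; // first capture group: the href value
    }
    return null;
}

// Literal sample; in practice $html would come from file_get_contents() or cURL.
$html = '<head><link rel="img_src" href="image.png"/></head>';
echo find_link_href($html, 'img_src'), "\n";
```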
You could also use the cURL library - http://php.net/manual/en/book.curl.php
Use cURL for more advanced functionality; you'll be able to access headers, follow redirects, etc.
<?php
$c = curl_init();
// set some options
curl_setopt($c, CURLOPT_URL, "google.com");
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($c);
curl_close($c);
?>
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Fetching data from another website
I want to create a webpage that displays another webpage at the user's request. The user enters a URL and sees the page they want on my website. The request to the other page has to come from my server, not from the user; otherwise I could just use an iframe.
I'd like to write it in PHP because I know some of it. Can anyone tell me what topics I need to know to do this?
You need some kind of "PHP proxy" for this, meaning you fetch the website's contents via cURL or file_get_contents(). Have a look at this: http://davidwalsh.name/curl-download
Your proxy script may look like this:
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
echo get_data($_GET["url"]);
Please note that you may have to forward the right headers (for example Content-Type for images), and that fetching whatever URL the user supplies is a security risk, but that is the basic idea.
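As one small step against the security concerns mentioned above, the URL could be checked before it is passed to get_data(). The sketch below is only an illustration: `is_fetchable_url` is a hypothetical name, and the check does not stop requests to hosts inside your own network.

```php
<?php
// Hypothetical check: accept only absolute http(s) URLs that name a host.
function is_fetchable_url($url): bool
{
    if (!is_string($url)) {
        return false;
    }
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    return in_array(strtolower($parts['scheme']), ['http', 'https'], true);
}

var_dump(is_fetchable_url('http://example.com/page'));
var_dump(is_fetchable_url('file:///etc/passwd'));
```

The proxy script would then call get_data() only when this check passes, and return an error otherwise.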
Now you have to parse the contents of the initial website you just got and change all links from this format:
http://example.com/thecss.css
to
http://yoursite.com/proxy.php?url=http://example.com/thecss.css
A few regexes or a PHP HTML parser may work here.
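A rough sketch of the regex variant follows. The function name `rewrite_links` and its `$proxyBase` parameter are hypothetical, and only absolute http(s) URLs in quoted href or src attributes are rewritten. Relative URLs and URLs inside CSS are left untouched. Unlike the plain format shown above, the target URL is passed through urlencode() so that its own query string survives; PHP decodes it again when filling $_GET["url"].

```php
<?php
// Hypothetical helper: point absolute href/src URLs at the proxy script.
function rewrite_links(string $html, string $proxyBase): string
{
    return preg_replace_callback(
        // group 1: attribute name, group 2: quote character, group 3: absolute URL
        '/\b(href|src)=(["\'])(https?:\/\/[^"\']+)\2/i',
        function (array $m) use ($proxyBase): string {
            return $m[1] . '=' . $m[2] . $proxyBase . '?url=' . urlencode($m[3]) . $m[2];
        },
        $html
    );
}

$page = '<link rel="stylesheet" href="http://example.com/thecss.css">';
echo rewrite_links($page, 'http://yoursite.com/proxy.php'), "\n";
```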
You could just use
echo file_get_contents('http://google.com');
But why not just download a PHP web proxy package like http://sourceforge.net/projects/poxy/?
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How do I save a web page, programatically?
I'm just starting with curl and I've managed to pull an external website:
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    // $userAgent was undefined inside the function; this value is only a placeholder.
    $userAgent = 'Mozilla/5.0 (compatible; ExampleBot/1.0)';
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
$test = get_data("http://www.selfridges.com");
echo $test;
However, the CSS and images are not included. I also need to retrieve the CSS and images, basically the whole website. Can someone please post a brief way to get me started on parsing the CSS, images and URLs?
There are better tools for this than PHP, e.g. wget with the --page-requisites parameter.
Note however that automatic scraping is often a violation of the site's TOS.
There are HTML parsers for PHP. There are quite a few available; here's a post that discusses them: How do you parse and process HTML/XML in PHP?
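As a starting point with one such parser, the sketch below uses PHP's bundled DOMDocument and DOMXPath classes to list the stylesheets and images a page refers to. The function name `find_page_assets` is hypothetical and the sample markup is made up. The URLs it returns may be relative and would still need to be resolved against the page's address before downloading.

```php
<?php
// Hypothetical helper: list stylesheet and image URLs referenced by an HTML string.
function find_page_assets(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $assets = ['css' => [], 'images' => []];
    // href attributes of <link rel="stylesheet"> elements
    foreach ($xpath->query('//link[@rel="stylesheet"]/@href') as $attr) {
        $assets['css'][] = $attr->nodeValue;
    }
    // src attributes of <img> elements
    foreach ($xpath->query('//img/@src') as $attr) {
        $assets['images'][] = $attr->nodeValue;
    }
    return $assets;
}

$sample = '<html><head><link rel="stylesheet" href="style.css"></head>'
        . '<body><img src="logo.png"></body></html>';
print_r(find_page_assets($sample));
```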
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to extract img src, title and alt from html using php?
I want to replicate some functionality from Digg.com whereby, when you post a new address, it automatically scans the URL and finds the page title.
Please tell me how this is done in PHP. Also, is there any other management system available with which you can build a website like Digg?
You can use file_get_contents() to get the data from the page, then use preg_match() with a regex pattern to get the text between <title> and </title>:
'/<title>(.*?)<\/title>/'
You can achieve this with an Ajax call to your server, where you fetch the URL with cURL and send back all the details you want. You might be interested in the title, description, keywords, etc.
function get_title($url) {
    $ch = curl_init();
    $timeout = 5; // seconds; was undefined in the original
    $titleName = '';
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    if ($data === false) {
        return $titleName; // request failed
    }
    // $data contains the whole page; look for a string like <title>Google</title>
    $start = stripos($data, '<title>');
    $end = stripos($data, '</title>');
    if ($start !== false && $end !== false && $end > $start) {
        $start += 7; // 7 = strlen('<title>')
        // substr() takes a length, not an end position
        $titleName = substr($data, $start, $end - $start);
    }
    return $titleName;
}
You need to implement a smarter way of parsing, because this will not find a title written as <title > Google < /title>.
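One possible step in that direction is a more tolerant regex, sketched below with a hypothetical helper name `extract_title`. It accepts any letter case, whitespace or attributes inside the opening tag, whitespace before the closing bracket, and titles that span several lines. A malformed closing tag with a space right after the "<", as in the example above, is still not matched; a real HTML parser would be the next step.

```php
<?php
// Hypothetical helper: extract the page title with a whitespace- and case-tolerant regex.
function extract_title(string $html): string
{
    // i: case-insensitive, s: let '.' match newlines inside the title
    if (preg_match('/<title\b[^>]*>(.*?)<\/title\s*>/is', $html, $match)) {
        // Decode entities such as &amp; and strip surrounding whitespace.
        return trim(html_entity_decode($match[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    }
    return '';
}

// Literal sample; in practice $html would be the string returned by cURL.
echo extract_title("<TITLE >\n  Google &amp; friends </title >"), "\n";
```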