Grabbing information from a webpage using PHP [duplicate]

Is it possible to scrape a webpage with PHP without downloading some sort of PHP library or extension?
Right now, I can grab meta tags from a website with PHP like this:
$tags = get_meta_tags('http://www.example.com/');
echo $tags['author']; // name
echo $tags['description']; // description
Is there a similar way to grab information such as the href of this tag from any given website?
<link rel="img_src" href="image.png"/>
I would like to be able to do it with just PHP.
Thanks!

Try the file_get_contents function. For example:
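<?php
// The URL needs a scheme, otherwise PHP looks for a local file
$data = file_get_contents('http://www.example.com');
// Capture the href of the <link rel="img_src"> tag in group 1
$regex = '/<link\s+rel="img_src"\s+href="([^"]*)"/i';
if (preg_match($regex, $data, $match)) {
    var_dump($match);
    echo $match[1];
}
?>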
You could also use the cURL library - http://php.net/manual/en/book.curl.php
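Since the question asks for an approach that needs no extra download, one option is PHP's bundled DOM extension, which tends to be more forgiving than a regex about attribute order and quoting. The following is a minimal sketch, assuming the DOM extension is enabled (it is in most default builds). `get_link_hrefs` is a hypothetical helper name, and the HTML is an inline sample standing in for a fetched page.

```php
<?php
// Hypothetical helper: return the href of every <link> whose rel matches $rel.
function get_link_hrefs(string $html, string $rel): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('link') as $link) {
        // Compare rel case-insensitively.
        if (strcasecmp($link->getAttribute('rel'), $rel) === 0) {
            $hrefs[] = $link->getAttribute('href');
        }
    }
    return $hrefs;
}

// Inline sample so the sketch runs without network access.
$html = '<html><head><link rel="img_src" href="image.png"/></head><body></body></html>';
print_r(get_link_hrefs($html, 'img_src'));
```

In practice `$html` would come from file_get_contents() or cURL, as in the answers here.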

Use cURL for more advanced functionality. You'll be able to access headers, follow redirects, and so on.
<?php
$c = curl_init();
// set some options
curl_setopt($c, CURLOPT_URL, "google.com");
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($c);
curl_close($c);
?>
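To illustrate the "headers, redirections etc." part, here is a rough sketch that follows redirects and reports the final status code and URL. `fetch_with_info` and the keys of the returned array are hypothetical names chosen for this example. The cURL options and `curl_getinfo` constants are standard ones from the cURL extension.

```php
<?php
// Hypothetical helper: fetch a URL and report what happened along the way.
function fetch_with_info(string $url): array
{
    $c = curl_init();
    curl_setopt($c, CURLOPT_URL, $url);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($c, CURLOPT_MAXREDIRS, 5);         // but give up after 5 hops
    curl_setopt($c, CURLOPT_TIMEOUT, 10);          // overall timeout in seconds

    $body = curl_exec($c);
    $info = [
        'ok'        => $body !== false,
        'error'     => curl_error($c),
        'status'    => curl_getinfo($c, CURLINFO_HTTP_CODE),
        'final_url' => curl_getinfo($c, CURLINFO_EFFECTIVE_URL),
        'type'      => curl_getinfo($c, CURLINFO_CONTENT_TYPE),
        'body'      => $body === false ? '' : $body,
    ];
    curl_close($c);
    return $info;
}
```

A call such as `fetch_with_info('http://example.com/')` would let you check `$info['status']` before trying to parse `$info['body']`.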


Parsing Javascript on server [duplicate]

I am trying to create a basic web crawler that specifically looks for links from adverts.
I have managed to find a script that uses cURL to get the contents of the target webpage.
I also found one that uses DOM.
<?php
$ch = curl_init("http://www.nbcnews.com");
$fp = fopen("source_code.txt", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);
?>
These are great, and I feel like I'm heading in the right direction. However, quite a few adverts are displayed using JS. Because that JS runs client-side, it isn't executed when I fetch the page, so I only see the JS code and not the ads.
Basically, is there any way of getting the JS to execute before I start trying to extract the links?
Thanks

Displaying on my website another webpage at users request. [duplicate]

I want to create a webpage that displays another webpage at the user's request. The user enters a URL and sees the page they want on my website. The request to the other page has to come from my server, not from the user; otherwise I could just use an iframe.
I'd like to write it in PHP because I know some of it. Can anyone tell me what topics I need to know to do this?
You need some kind of "PHP proxy" for this, meaning a script that fetches the website's contents via cURL or file_get_contents(). Have a look at this: http://davidwalsh.name/curl-download
Your proxy script may look like this:
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
echo get_data($_GET["url"]);
Please note that you may have to pay attention to headers for images etc. and there may also be some security flaws, but that is the basic idea.
Now you have to parse the contents of the initial website you just got and change all links from this format:
http://example.com/thecss.css
to
http://yoursite.com/proxy.php?url=http://example.com/thecss.css
A regex or a PHP HTML parser may work here.
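As a rough sketch of the link-rewriting step using the parser route: `rewrite_links` is a hypothetical helper, and the proxy address is passed in by the caller. This version only touches absolute http(s) URLs in a few common tags and leaves relative URLs alone. It also URL-encodes the target, which differs slightly from the literal format shown above but keeps the query string unambiguous; PHP decodes it again when the proxy reads `$_GET["url"]`.

```php
<?php
// Hypothetical helper: route absolute http(s) URLs in $html through $proxy.
function rewrite_links(string $html, string $proxy): string
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Tag name => attribute that holds the URL.
    $targets = ['a' => 'href', 'link' => 'href', 'img' => 'src', 'script' => 'src'];
    foreach ($targets as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $url = $node->getAttribute($attr);
            if (preg_match('#^https?://#i', $url)) {
                $node->setAttribute($attr, $proxy . '?url=' . urlencode($url));
            }
        }
    }
    return $doc->saveHTML();
}

// Inline sample; in the proxy this would be the page returned by get_data().
$page = '<html><head><link rel="stylesheet" href="http://example.com/thecss.css"></head>'
      . '<body><a href="local.html">local</a></body></html>';
echo rewrite_links($page, 'http://yoursite.com/proxy.php');
```

URLs inside CSS files or inline scripts are not covered by this sketch and would need separate handling.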
You could just use
echo file_get_contents('http://google.com');
Alternatively, you could download a ready-made PHP web proxy package such as http://sourceforge.net/projects/poxy/

How to parse curl URL, CSS and images? [duplicate]

I'm just starting with curl and I've managed to pull an external website:
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    // Placeholder value: $userAgent was never defined in the original snippet
    $userAgent = 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)';
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
$test = get_data("http://www.selfridges.com");
echo $test;
However, the CSS and images are not included. I also need to retrieve the CSS and images, basically the whole website. Can someone please post a brief way for me to get started in understanding how to parse the CSS, images and URLs?
There are better tools for this than PHP, e.g. wget with the --page-requisites parameter.
Note, however, that automatic scraping is often a violation of the site's TOS.
There are HTML parsers for PHP. There are quite a few available; here's a post that discusses them: How do you parse and process HTML/XML in PHP?
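If you do want to stay in PHP, a first step is usually to list the assets a page refers to, so that each one can be fetched separately. Below is a minimal sketch using the built-in DOMDocument and DOMXPath classes. `list_page_assets` is a hypothetical helper name, the XPath query only covers stylesheets, images and external scripts, and relative URLs are returned as-is rather than resolved against the page address.

```php
<?php
// Hypothetical helper: list the stylesheet, image and script URLs a page refers to.
function list_page_assets(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Note: the rel comparison here is exact and case-sensitive.
    $query = '//link[@rel="stylesheet"]/@href | //img/@src | //script/@src';
    $urls = [];
    foreach ($xpath->query($query) as $attr) {
        $urls[] = $attr->nodeValue;
    }
    // Drop duplicates and reindex.
    return array_values(array_unique($urls));
}

// Inline sample; in practice this would be the string returned by get_data().
$page = '<html><head><link rel="stylesheet" href="/css/site.css"></head>'
      . '<body><img src="logo.png"><script src="app.js"></script></body></html>';
print_r(list_page_assets($page));
```

Each returned URL could then be passed to the same cURL function used for the page itself.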

Extract Title from web page using the url of that page [duplicate]

I want to replicate some functionality from Digg.com: when you post a new address, it automatically scans the URL and finds the page title.
Please tell me how this is done in PHP. Also, is there any management system available for building a website like Digg?
You can use file_get_contents() to get the data from the page, then use preg_match() with a regex pattern to get the text between <title> and </title>:
'/<title>(.*?)<\/title>/'
You can achieve this with an Ajax call to your server, where you fetch the URL with cURL and send back the details you want. You might be interested in the title, description, keywords, etc.
function get_title($url) {
    $ch = curl_init();
    $timeout = 5; // connection timeout in seconds (was undefined)
    $titleName = '';
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    if ($data === false) {
        return $titleName; // the request failed
    }
    // $data contains the whole page; look for a string like <title>Google</title>
    $start = stripos($data, '<title>');
    if ($start !== false) {
        $start += 7; // 7 is the length of '<title>'
        $end = stripos($data, '</title>', $start);
        if ($end !== false) {
            // substr() takes a length, not an end position
            $titleName = trim(substr($data, $start, $end - $start));
        }
    }
    return $titleName;
}
You will need a smarter way of parsing, because this will not find variants such as <title > Google < /title>.
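One possible way to make the matching more tolerant is a regular expression that allows whitespace and attributes inside the tags. This is only a sketch: `extract_title` is a hypothetical helper name, and a real HTML parser would still be more robust for badly broken markup.

```php
<?php
// Hypothetical helper: pull the <title> text out of an HTML string.
function extract_title(string $html): string
{
    // [^>]* allows attributes or stray spaces inside the opening tag,
    // \s* allows spaces around the slash and tag name of the closing tag,
    // "i" ignores case and "s" lets "." match newlines inside the title.
    $pattern = '/<\s*title\b[^>]*>(.*?)<\s*\/\s*title\s*>/is';
    if (preg_match($pattern, $html, $m)) {
        // Decode entities such as &amp; and collapse runs of whitespace.
        $text = html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        return trim(preg_replace('/\s+/', ' ', $text));
    }
    return '';
}

echo extract_title('<title > Google < /title>'), "\n"; // prints "Google"
```

The string passed in would normally be the `$data` returned by cURL in the function above.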

What's wrong with this php json script? [duplicate]

I can't get this php json script to work. I'm trying to get the screen name from twitter, using their api.
Here's what I did.
$send_request = file_get_contents('https://api.twitter.com/1/users/lookup.json?screen_name=frankmeacey');
$request_contents = json_decode($send_request);
echo $request_contents->screen_name;
Why is this returning a blank value every time? I've tried changing things here and there and it's just not working...
Because the data structure you get back is an array of objects, not an object.
echo $request_contents[0]->screen_name;
That data looks to be an object inside an array. Try
echo $request_contents[0]->screen_name;
Best to check first it is an array and to get the first user from it:
if (is_array($request_contents)) {
$user_info = $request_contents[0];
}
if (isset($user_info)) {
echo $user_info->screen_name;
}
It's
$request_contents[0]->screen_name
since $request_contents is an array of objects, not the object itself.
Do a
var_dump($request_contents);
to see the structure of your json.
Your page should not be blank. You should get an error like "Notice: Trying to get property of non-object", since you are calling $request_contents->screen_name, which is not valid on an array.
Try telling PHP to report all errors using
error_reporting(E_ALL);
I also prefer cURL; in my experience it's faster:
$ch = curl_init("https://api.twitter.com/1/users/lookup.json?screen_name=frankmeacey");
// Warning: turning off certificate checks is insecure; only do this for local testing
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
$request_contents = json_decode($result);
var_dump($request_contents[0]->screen_name);
Output
string 'frankmeacey' (length=11)
Try using
print_r($request_contents);
or
var_dump($request_contents);
to inspect the array.
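The checks suggested in the answers above (confirm the decode worked, confirm the result is an array, then read the first element) can be combined into one small function. This is a sketch only: `first_screen_name` is a hypothetical helper, and the JSON string is a made-up sample shaped like the array of objects the answers describe, not real API output.

```php
<?php
// Hypothetical helper: return the screen_name of the first user in a JSON array, or null.
function first_screen_name(string $json): ?string
{
    $decoded = json_decode($json);
    // json_decode() returns null on failure, so check the error code explicitly.
    if (json_last_error() !== JSON_ERROR_NONE) {
        return null;
    }
    // The expected shape is an array of objects, not a single object.
    if (!is_array($decoded) || !isset($decoded[0]->screen_name)) {
        return null;
    }
    return $decoded[0]->screen_name;
}

// Made-up sample data for illustration.
$sample = '[{"id": 1, "screen_name": "frankmeacey"}]';
var_dump(first_screen_name($sample));
var_dump(first_screen_name('{"screen_name": "not-in-an-array"}'));
```

Returning null for unexpected input avoids the "property of non-object" notice and lets the caller decide how to report the problem.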
