I'm using some simple php to scrape information from a website to allow reading it offline. The code seems to be working fine but I am worried about undefined behaviour. The site is a bit poorly coded and some of the elements I'm grabbing share the same id with another element. I'd imagine that getElementById traverses the DOM from top to bottom and the reason I'm not having an issue is because the element I need is the first instance with the id. Is there any way to ensure this behaviour? The element has no other real way of distinguishing it so selecting it by id seems to be the best option. I have included a stripped back example of the code I'm using below.
Thanks.
<?php
$curl_referer = "http://example.com/";
$curl_url = "http://example.com/content.php";
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, 'Scraper/0.9');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_REFERER, "$curl_referer");
curl_setopt($ch, CURLOPT_URL, "$curl_url");
$output = curl_exec($ch);
$dom = new DOMDocument();
#$dom->loadHTML($output);
$content = $dom->getElementById('content');
echo $content->nodeValue;
?>
Try using XPath expression to get the first containing id.
Like that: //*[#id="content"][1]
The PHP code will be like that:
$xpath = new DOMXPath($dom);
$xpath->query('//*[#id="content"][1]')->item(0)->nodeValue;
And an tip: use libxml_use_internal_errors(true), you can catch they latter for logging or try tidying-up the document.
Edit
Hey, in your code you're setting the UA as "Scraper/0.9", most people that write a bad website doesn't look at that and doesn't do logging incoming requests in their pages, but, i don't recommend to put UA like that, just put an browser UA, like chrome's user-agent because if they're monitoring and see requests that contains this user-agent, they will be blacklist you (future).
Related
I have a URL like this https://facebook.com/5 , I want to get HTML of that page, just like view source.
I tried using file_get_contents but that didn't returned me correct stuff.
Am I missing something ?
Is there any other function that I can utilize ?
If I can't get HTML of that page, what special thing did the developer do while coding the site to avoid this thing ?
Warning for being off topic
But does this task have be done using PHP?
Since this sounds like a task of web-scraping, I think you would gain more use in casperjs.
With this, you can target with precision what you would want to retrieved from the fb-page rather than grabbing the whole content, which I assume as of this writing is generated by multiple requests of content and rendered to you through a virtual DOM.
Please note that I haven't tried retrieving content from facebook, but I've done this with multiple services.
Good luck!
You may want to use curl instead: http://php.net/manual/en/curl.examples.php
Edit:
Here is an example of mine:
$url = 'https://facebook.com/5';
$ssl = true;
$ch = curl_init();
$timeout = 3;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, $ssl);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$data = curl_exec($ch);
curl_close($ch);
Note that depending on the websites vhost configuration a slash at the end of the url can make a difference.
Edit: Sorry for the undefined variable.. I copied it out of a helper method i used. Now it should be alright.
Yet another Edit:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
By adding this option you will follow the redirects that are apperently happening in your example. Since you said it was an example I actually didnt run it before. Now I did and it works.
I am working on a small project to get information from several webpages based on the HTML Markup of the page, and I do not know where to start at all.
The basic idea is of getting the Title from <h1></h1>s, and content from the <p></p>s tags and other important information that is required.
I would have to setup each case from each source for it to work the way it needs. I believe the right method is using $_GET method with PHP. The goal of the project is to build a database of information.
What is the best method to grab the information which I need?
First of all: PHP's $_GET is not a method. As you can see in the documentation $_GET is simply an array initialized with the GET's parameters your web server received during the current query. As such it is not what you want to use for this kind of things.
What you should look into is cURL that allows you to compose even fairly complex query, send to the destination server and retrieve the response. For example for a POST request you could do something like:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,"http://www.mysite.com/tester.phtml");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS,
"postvar1=value1&postvar2=value2&postvar3=value3");
// in real life you should use something like:
// curl_setopt($ch, CURLOPT_POSTFIELDS,
// http_build_query(array('postvar1' => 'value1')));
// receive server response ...
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec ($ch);
curl_close ($ch);
Source
Of course if you don't have to do any complex query but simple GET requests you can go with the PHP function file_get_contents
After you received the web page content you have to parse it. IMHO the best way to do this is by using PHP's DOM functions. How to use them should really be another question, but you can find tons of example without much effort.
<?php
$remote = file_get_contents('http://www.remote_website.html');
$doc = new DomDocument();
$file = #$doc->loadHTML($remote);
$cells = #$doc->getElementsByTagName('h1');
foreach($cells AS $cell)
{
$titles[] = $cell->nodeValue ;
}
$cells = #$doc->getElementsByTagName('p');
foreach($cells AS $cell)
{
$content[] = $cell->nodeValue ;
}
?>
You can get the HTML source of a page with:
<?php
$html= file_get_contents('http://www.example.com/');
echo $html;
?>
Then once you ahve the structure of the page you get the request tag with substr() and strpos()
I am working on a application that needs to scrape a part of a website the user submits. I want to collect useful and readable content from the website and definitely not the whole site. If I look at applications that also do this (thinkery for example) I notice that that they somehow managed to create a way to scrape the website, try to guess what useful content is, show it in a readable format and they do that pretty fast.
I've been playing with cURL and I am getting pretty near the result I want but I have some issues and was wondering if someone has some more insights.
$ch = curl_init('http://www.example.org');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// $content contains the whole website
$content = curl_exec($ch);
curl_close($ch);
With the very simple code above I can scrape the whole website and with preg_match() I can try to find div's with the class, id or properties which contains the string 'content', 'summary' et cetera.
If preg_match() has result I can fairly guess that I have found relevant content and save this as the summary of the saved page. The problem I have is that cURL saves the WHOLE page in memory so this can take up a lot of time and resources. And I think that doing a preg_match() over such a large result can also take up a lot of time.
Is there a better way to achieve this?
I tried the DomDocument::loadHTMLFile as One Trick Pony suggested (Thanks!)
$ch = curl_init('http://stackoverflow.com/questions/17180043/extracting-useful-readable-content-from-a-website');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
#$doc->loadHTML($content);
$div_elements = $doc->getElementsByTagName('div');
if ($div_elements->length <> 0)
{
foreach ($div_elements as $div_element)
{
if ($div_element->getAttribute('itemprop') == 'description')
{
var_dump($div_element->nodeValue);
}
}
}
The result for above code is my question here on this page! Only thing left to do is find a good and consistent way to loop through or query the divs and determine if it is useful content.
I'm trying to figure it out how to take information from another site(with different domain name) and place it in my php program.
Explanation:
User inputs URL from another site.
jQuery or PHP takes information from entered URL. I know where the information is (i know its' divs ID)
And that var is put into my php program as a variable $kaina, for example.
EX:
User enters URL:http://www.sportsdirect.com/lee-cooper-bud-mens-boots-118358
And I want to get the Price. (27,99)
What lang should I use? PHP? or jquery? or anything else?
What function should I use?
How should the program look like?
Thank you for your answers :)
I'd say you have to use php (curl or file_get_contents) to download the page on to your server, parse it or use regular expression to get the price. But in this case it will be even trickier because it looks like this link leads to a page that uses javascript.
But you have to know the format of how you are going to extract the data. So PHP will do the job.
PHP's cURL library should do the trick for you: http://php.net/manual/en/book.curl.php
<?php
$ch = curl_init("http://www.example.com/");
$fp = fopen("example_homepage.txt", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);
?>
You need to research on each of the step mentioned below,
One Thing That you can do is, post the message entered by the user to the server means PHP file, there you can extract the url entered by the user,
In order to extract the URL from the user post, you can use regex search:-
Check this link out:-
Extract URLs from text in PHP
Know you can curl to the url extracted from the user input.
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_URL, $extracted_url );
$html = curl_exec ( $ch );
curl_close($ch);
The curl output will contain the complete html of the page, you can then use a HTML parser
$DOM = new DOMDocument;
$DOM->loadHTML($str);
to parse till the required div is found, to have its value.
I would proabaly do something like this:
get the contents of the file: $contents = file_get_contents("http://www.sportsdirect.com/lee-cooper-bud-mens-boots-118358")
convert the contents you just got to xml: $xml = new SimpleXMLElement($contents);
search the xml for the node with attribute itemprop="price" using xpath query
read the contents of that node, et voila, you have your price
I am grabbing a page and then converting it into an xml format, the function im using is below
public function getXML($url){
$ch = curl_init();
//curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
//curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
$xml = simplexml_load_string($response);
return $xml;
}
print_r($curl->getXML("http://www.amazon.co.uk/gp/offer-listing/0292783760/ref=tmm_pap_new_olp_sr?ie=UTF8&condition=used"));
After trying different urls nothing is returned, the page loads fine so the problem is with the line $xml = simplexml_load_string($response);
What could be wrong with this code?
Not understanding exactly what you're up to, it looks like you're trying to scrape the Amazon web page? If I pull up that URL in my browser, it's not listed as XHTML in the headers or document itself--I suspect it's not. I don't think simplexml can handle that.
(Does CURL do the conversion to XML for you? I don't think so but I'm not a master of all things CURL. If so, it might be an incompatability between CURL's output and what simplxml--which is fairly limited--will take in).
You might try working with DOMDocument instead, although my PHP could be a bit out of date--there may be better utilities these days.
A quick googling brought up this tutorial
<?php
$doc = new DOMDocument();
$doc->strictErrorChecking = FALSE;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
?>
I don't think this is a complete answer, but it was a bit much for a comment; so take it with a grain of salt and a healthy serving of doubt. I hope it inspires some ideas.