I am working on a small project to get information from several webpages based on the HTML markup of each page, and I do not know where to start at all.
The basic idea is to get the title from the <h1></h1> tags, the content from the <p></p> tags, and whatever other important information is required.
I would have to set up each case for each source to get it to work the way it needs to. I believe the right method is using PHP's $_GET method. The goal of the project is to build a database of information.
What is the best method to grab the information I need?
First of all: PHP's $_GET is not a method. As you can see in the documentation, $_GET is simply an array initialized with the GET parameters your web server received during the current request. As such, it is not what you want to use for this kind of thing.
What you should look into is cURL, which allows you to compose even fairly complex queries, send them to the destination server and retrieve the response. For example, for a POST request you could do something like:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.mysite.com/tester.phtml");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, "postvar1=value1&postvar2=value2&postvar3=value3");
// in real life you should use something like:
// curl_setopt($ch, CURLOPT_POSTFIELDS,
//             http_build_query(array('postvar1' => 'value1')));

// receive server response ...
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$server_output = curl_exec($ch);
curl_close($ch);
Of course, if you don't need to do any complex queries but only simple GET requests, you can go with the PHP function file_get_contents.
After you have received the web page content, you have to parse it. IMHO the best way to do this is with PHP's DOM functions. How to use them should really be another question, but you can find tons of examples without much effort.
<?php
$remote = file_get_contents('http://www.remote_website.html');
$doc = new DOMDocument();
@$doc->loadHTML($remote);   // @ suppresses warnings from sloppy real-world markup

$titles = array();
$cells = $doc->getElementsByTagName('h1');
foreach ($cells as $cell)
{
    $titles[] = $cell->nodeValue;
}

$content = array();
$cells = $doc->getElementsByTagName('p');
foreach ($cells as $cell)
{
    $content[] = $cell->nodeValue;
}
?>
You can get the HTML source of a page with:
<?php
$html= file_get_contents('http://www.example.com/');
echo $html;
?>
Then, once you have the structure of the page, you can pull out the requested tag with substr() and strpos().
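For example, a minimal sketch of pulling the text of the first <h1> out of that $html string with strpos() and substr() (this only works when the tag is written exactly as <h1> with no attributes; for anything messier the DOM approach above is more robust):
$start = strpos($html, '<h1>');
if ($start !== false) {
    $start += strlen('<h1>');                  // skip past the opening tag itself
    $end = strpos($html, '</h1>', $start);     // find the matching closing tag
    if ($end !== false) {
        $title = substr($html, $start, $end - $start);
        echo trim($title);
    }
}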
I am building a PHP script that needs to use content loaded from an external webpage, but I don't know how to send/receive the data.
The external webpage is http://packer.50x.eu/ .
Basically, I want to send a script (which is done manually in the first form) and receive the output (from the second form).
I want to learn how to do this because it can surely be a useful thing in the future, but I have no clue where to start.
Can anyone help me? Thanks.
You can use cURL to receive data from an external page. Look at this example:
$url = "http://packer.50x.eu/";
$ch = curl_init($url);
// curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // you can use some options for this request
// curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // or not to use them
// you can set many others options. read about them in manual
$data = curl_exec($ch);
curl_close($ch);
var_dump($data); // < -- here is received page data
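For example, a few other options that are often useful (just a sketch with arbitrary values; they all have to be set before curl_exec()):
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // follow HTTP redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 10);               // give up on the whole request after 10 seconds
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');  // send a browser-like user agent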
See the curl_setopt manual for the full list of options.
Hope this helps.
You may want to look at file_get_contents($url), as it is very simple to use, simpler than cURL (though more limited), so your code could look like:
$url = "http://packer.50x.eu/";
$url_content = file_get_contents($url);
echo $url_content;
Look at the documentation, as you can use an offset and other tricks.
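For instance, you can pass a stream context to tweak the request without switching to cURL. A minimal sketch (the timeout and user agent values are arbitrary):
$context = stream_context_create(array(
    'http' => array(
        'timeout'    => 5,              // give up after 5 seconds
        'user_agent' => 'Mozilla/5.0',  // send a browser-like user agent
    ),
));
$url_content = file_get_contents($url, false, $context);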
I'm using some simple PHP to scrape information from a website so it can be read offline. The code seems to be working fine, but I am worried about undefined behaviour. The site is a bit poorly coded and some of the elements I'm grabbing share the same id with another element. I'd imagine that getElementById traverses the DOM from top to bottom, and the reason I'm not having an issue is that the element I need is the first instance with that id. Is there any way to ensure this behaviour? The element has no other real way of distinguishing it, so selecting it by id seems to be the best option. I have included a stripped-back example of the code I'm using below.
Thanks.
<?php
$curl_referer = "http://example.com/";
$curl_url = "http://example.com/content.php";

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, 'Scraper/0.9');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_REFERER, $curl_referer);
curl_setopt($ch, CURLOPT_URL, $curl_url);
$output = curl_exec($ch);

$dom = new DOMDocument();
@$dom->loadHTML($output);
$content = $dom->getElementById('content');
echo $content->nodeValue;
?>
Try using an XPath expression to get the first element with that id.
Like this: //*[@id="content"][1]
The PHP code would look like this:
$xpath = new DOMXPath($dom);
echo $xpath->query('//*[@id="content"][1]')->item(0)->nodeValue;
And a tip: use libxml_use_internal_errors(true); you can fetch the parse errors later for logging, or try tidying up the document.
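A sketch of what that looks like, reusing the $output string fetched above:
libxml_use_internal_errors(true);        // collect parse errors instead of emitting warnings
$dom = new DOMDocument();
$dom->loadHTML($output);                 // malformed markup no longer triggers PHP warnings
foreach (libxml_get_errors() as $error) {
    error_log(trim($error->message));    // log each parse problem for later inspection
}
libxml_clear_errors();                   // reset the error buffer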
Edit
Also, in your code you're setting the user agent to "Scraper/0.9". Most people who run a poorly built website don't look at that and don't log incoming requests, but I still don't recommend a user agent like that. Use a normal browser user agent, such as Chrome's, because if they are monitoring their traffic and see requests with a scraper user agent, they may blacklist you in the future.
I am working on an application that needs to scrape a part of a website the user submits. I want to collect useful and readable content from the website, and definitely not the whole site. If I look at applications that also do this (thinkery, for example) I notice that they have somehow managed to create a way to scrape the website, guess what the useful content is, and show it in a readable format, and they do all that pretty fast.
I've been playing with cURL and I am getting pretty near the result I want, but I have some issues and was wondering if someone has some more insights.
$ch = curl_init('http://www.example.org');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// $content contains the whole website
$content = curl_exec($ch);
curl_close($ch);
With the very simple code above I can scrape the whole website, and with preg_match() I can try to find divs whose class, id or other attributes contain the string 'content', 'summary' et cetera.
If preg_match() finds a match, I can fairly safely guess that I have found relevant content and save it as the summary of the saved page. The problem I have is that cURL holds the WHOLE page in memory, so this can take up a lot of time and resources. And I think that running preg_match() over such a large result can also take a lot of time.
Is there a better way to achieve this?
I tried the DOMDocument::loadHTMLFile approach that One Trick Pony suggested (thanks!):
$ch = curl_init('http://stackoverflow.com/questions/17180043/extracting-useful-readable-content-from-a-website');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
@$doc->loadHTML($content);
$div_elements = $doc->getElementsByTagName('div');

if ($div_elements->length != 0)
{
    foreach ($div_elements as $div_element)
    {
        if ($div_element->getAttribute('itemprop') == 'description')
        {
            var_dump($div_element->nodeValue);
        }
    }
}
The result of the code above is my question here on this page! The only thing left to do is to find a good and consistent way to loop through or query the divs and determine whether they contain useful content.
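One possible direction, continuing from the $doc object above, is a DOMXPath query instead of a plain tag loop. This is only a heuristic sketch; the class/id substrings and the length threshold are assumptions, not something the target sites guarantee:
$xpath = new DOMXPath($doc);
$query = '//div[contains(@class, "content") or contains(@id, "content") or contains(@class, "summary")]';
foreach ($xpath->query($query) as $div) {
    $text = trim($div->nodeValue);
    if (strlen($text) > 200) {   // arbitrary threshold to skip short navigation blocks
        var_dump($text);
    }
}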
I'm trying to figure out how to take information from another site (with a different domain name) and place it in my PHP program.
Explanation:
The user inputs a URL from another site.
jQuery or PHP takes information from the entered URL. I know where the information is (I know its div's ID).
And that value is put into my PHP program as a variable, $kaina for example.
EX:
The user enters the URL: http://www.sportsdirect.com/lee-cooper-bud-mens-boots-118358
And I want to get the price (27,99).
What language should I use? PHP? jQuery? Or anything else?
What function should I use?
What should the program look like?
Thank you for your answers :)
I'd say you have to use PHP (cURL or file_get_contents) to download the page onto your server, then parse it or use regular expressions to get the price. In this case it will be even trickier, because it looks like this link leads to a page that uses JavaScript.
But you have to know the format of the data you are going to extract. So PHP will do the job.
PHP's cURL library should do the trick for you: http://php.net/manual/en/book.curl.php
<?php
$ch = curl_init("http://www.example.com/");
$fp = fopen("example_homepage.txt", "w");

curl_setopt($ch, CURLOPT_FILE, $fp);   // write the response straight into the file
curl_setopt($ch, CURLOPT_HEADER, 0);   // leave the HTTP headers out of the output

curl_exec($ch);
curl_close($ch);
fclose($fp);
?>
You need to research each of the steps mentioned below.
One thing that you can do is post the message entered by the user to the server (i.e. a PHP file), where you can extract the URL the user entered.
To extract the URL from the user's post, you can use a regex search.
Check this link out:
Extract URLs from text in PHP
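A minimal sketch of such a search (the pattern is a deliberate simplification that only catches plain http/https links):
$user_post = 'Check this out: http://www.sportsdirect.com/lee-cooper-bud-mens-boots-118358 please';
if (preg_match('~https?://\S+~i', $user_post, $matches)) {
    $extracted_url = $matches[0];   // first URL found in the text
}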
Now you can cURL to the URL extracted from the user input.
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $extracted_url);
$html = curl_exec($ch);
curl_close($ch);
The cURL output will contain the complete HTML of the page. You can then use an HTML parser:
$DOM = new DOMDocument;
@$DOM->loadHTML($html);   // @ suppresses warnings from sloppy real-world markup
and walk the DOM until the required div is found, then read its value.
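For example, with a DOMXPath query on that $DOM object. This is only a sketch: the id "ProductPrice" is made up, so swap in the real id of the element you need:
$xpath = new DOMXPath($DOM);
$nodes = $xpath->query('//*[@id="ProductPrice"]');   // hypothetical id
if ($nodes->length > 0) {
    $kaina = trim($nodes->item(0)->nodeValue);       // the value you wanted in $kaina
}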
I would probably do something like this:
Get the contents of the file: $contents = file_get_contents("http://www.sportsdirect.com/lee-cooper-bud-mens-boots-118358")
Convert the contents you just got to XML: $xml = new SimpleXMLElement($contents);
Search the XML for the node with the attribute itemprop="price" using an XPath query.
Read the contents of that node, et voilà, you have your price.
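A sketch of those steps. Note that real product pages are rarely well-formed XML, so instead of feeding the raw HTML straight to SimpleXMLElement this version loads it with DOMDocument first and converts it with simplexml_import_dom(); whether the page really marks the price up with itemprop="price" is an assumption:
$contents = file_get_contents("http://www.sportsdirect.com/lee-cooper-bud-mens-boots-118358");
$doc = new DOMDocument();
@$doc->loadHTML($contents);                     // tolerate sloppy markup
$xml = simplexml_import_dom($doc);              // gives a SimpleXMLElement to run XPath on
$nodes = $xml->xpath('//*[@itemprop="price"]');
if (!empty($nodes)) {
    echo trim((string) $nodes[0]);              // the price text, e.g. "27,99"
}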
I was wondering how I could download a webpage in PHP for parsing?
You can use something like this
$homepage = file_get_contents('http://www.example.com/');
echo $homepage;
Since you will likely want to parse the page with DOM, you can load the page directly with:
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.example.com');
when your PHP has allow_url_fopen enabled. (DOMDocument::load expects well-formed XML, so loadHTMLFile is the safer choice for an ordinary web page.)
But basically, any function that supports HTTP stream wrappers can be used to download a page.
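For example, plain fopen() with stream_get_contents() goes through the same HTTP stream wrapper (a sketch, again relying on allow_url_fopen):
$handle = fopen('http://www.example.com/', 'r');   // open the URL via the HTTP stream wrapper
$html = stream_get_contents($handle);              // read the whole response body into a string
fclose($handle);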
With the cURL library.
Just to add another option because it is there: while not the best, you can simply use file(). It's another option that I don't see anyone else has listed here.
$array = file("http://www.stackoverflow.com");
It's nice if you want the page as an array of lines, whereas the already mentioned file_get_contents will put it in a string.
Just another thing you can do.
Then you can loop through each line, if that matches your goal:
foreach ($array as $line) {
    echo $line;
    // do other stuff here
}
This comes in handy sometimes when certain APIs spit out plain text or HTML with a new entry on each line.
You can use this code:
$url = 'your url';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
curl_close($ch);

// you can do something with $data, like explode() or a preg_match() regex, to get the exact information you need
//$data = strip_tags($data);
echo $data;
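For example, a quick sketch of pulling the page title out of $data with preg_match() (a regex is fine for something this small; a DOM parser is usually safer for anything more involved):
if (preg_match('~<title>(.*?)</title>~is', $data, $matches)) {
    echo trim($matches[1]);   // the page's <title> text
}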