I'm not sure whether this is possible or not.
I want a PHP script that, when executed, goes to a page (on a different domain), gets its HTML contents, and extracts the href of each link inside that HTML.
HTML code:
<div id="somediv">
<a href="http://yahoo.com">Yahoo</a>
<a href="http://google.com">Google</a>
<a href="http://facebook.com">Facebook</a>
</div>
The output (which PHP will echo out) will be:
http://yahoo.com
http://google.com
http://facebook.com
I have heard that cURL in PHP can do something like this, but not exactly this. I'm a bit confused; I hope someone can guide me on this.
Thanks.
Have a look at something like http://simplehtmldom.sourceforge.net/
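For example, a minimal sketch with Simple HTML DOM (the URL is a placeholder for the page you actually want to scrape):
<?php
include('simple_html_dom.php');

// Placeholder URL; point this at the real page.
$html = file_get_html('http://www.example.com/');

// Find every <a> inside the div with id "somediv" and print its href.
foreach ($html->find('div#somediv a') as $a) {
    echo $a->href, "\n";
}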
Using DOM and XPath:
<?php
$doc = new DOMDocument();
$doc->loadHTMLFile("http://www.example.com/"); // or you could load from a string using loadHTML();
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//div[@id='somediv']//a"); // note: @id selects the id attribute
foreach($elements as $elem){
echo $elem->getAttribute('href'), "\n";
}
BTW: you should read up on DOM and XPath.
I am studying parsing HTML in PHP and I am using DOM for this.
I wrote this code inside my PHP file:
<?php
$site = new DOMDocument();
$div = $site->createElement("div");
$class = $site->createAttribute("class");
$class->nodeValue = "wrapper";
$div->appendChild($class);
$site->appendChild($div);
$html = $site->saveHTML();
echo $html;
?>
And when I run this in the browser and view the page source, only this code comes out:
<div class="wrapper"></div>
I don't know why it is not showing the whole HTML document that it supposedly should. I am using XAMPP v3.2.1.
Please tell me where I went wrong with this. Thanks.
It's showing the whole HTML you created: a div node with a wrapper class attribute.
See the example in the docs. There the html, head, etc. nodes are explicitly created.
PHP only adds missing DOCTYPE, html and body elements when loading HTML, not when saving.
Adding $site->loadHTML($site->saveHTML()); before $html = $site->saveHTML(); will demonstrate this.
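A minimal sketch of that round trip, reusing the question's document-building code:
<?php
$site = new DOMDocument();
$div = $site->createElement("div");
$class = $site->createAttribute("class");
$class->nodeValue = "wrapper";
$div->appendChild($class);
$site->appendChild($div);

// Round-trip through loadHTML() so DOMDocument adds the missing
// DOCTYPE, html and body elements around the div.
$site->loadHTML($site->saveHTML());

echo $site->saveHTML();
// Now the output is a full document: <!DOCTYPE ...><html><body><div class="wrapper"></div></body></html>
?>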
I've tried to grab all the comments from a website (The text between <!-- and -->), but without luck.
Here is my current code:
include('simple_html_dom.php');
$html = file_get_html('THE URL');
foreach($html->find('comment') as $element)
echo $element->plaintext;
Does anyone have any ideas how to grab the comments? At the moment it's only giving me a blank page.
I know regex is not supposed to be used to parse HTML, but you can use a regex like <!--(.*?)--> to find and fetch the comments...
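A minimal sketch of that approach, assuming the page is fetched with file_get_contents() (the URL is a placeholder):
<?php
// Placeholder URL; use the page you want to scrape.
$html = file_get_contents('http://www.example.com/');

// The /s modifier lets . match newlines, so multi-line comments are captured too.
preg_match_all('/<!--(.*?)-->/s', $html, $matches);

foreach ($matches[1] as $comment) {
    echo trim($comment), "\n";
}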
With PHP's file_get_contents() I want to get only the post and image, but it fetches the whole page. (I know there are other ways to do this.)
Example:
$homepage = file_get_contents('http://www.bdnews24.com/details.php?cid=2&id=221107&hb=5', true);
echo $homepage;
It shows the full page. Is there any way to show only the post where cid=2&id=221107&hb=5?
Thanks a lot.
Use PHP's DomDocument to parse the page. You can filter it more if you wish, but this is the general idea.
$url = 'http://www.bdnews24.com/details.php?cid=2&id=221107&hb=5';
// Create new DomDocument
$doc = new DomDocument();
$doc->loadHTMLFile($url);
// Get the post container
$post = $doc->getElementById('opage_mid_left');
var_dump($post); // or echo $doc->saveHTML($post); to output just that node's markup
Update:
Unless the image is a requirement, I'd use the printer-friendly version: http://www.bdnews24.com/pdetails.php?id=221107, it's much cleaner.
You will need to parse the resulting HTML using a DOM parser to get the HTML of only the part you want. I like PHP Simple HTML DOM Parser, but as Paul pointed out, PHP also has its own.
You can extract the
<div id="page">
//POST AND IMAGE EXIST HERE
</div>
part from the fetched contents using a regex and push it into your page...
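A rough sketch of that idea; it assumes the target <div id="page"> contains no nested <div> elements, otherwise the lazy match stops too early:
<?php
$homepage = file_get_contents('http://www.bdnews24.com/details.php?cid=2&id=221107&hb=5');

// Lazy match up to the first closing </div>; only safe if the div has no nested divs.
if (preg_match('/<div id="page">(.*?)<\/div>/s', $homepage, $match)) {
    echo $match[1]; // the post and image markup
}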
I want to make a news site that gets its content from other news sites:
open the RSS feed and fetch each URL, then open the HTML DOM of the page and
get just the text of the news.
I think I have to use the DOMDocument class of PHP?
<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br></body></html>");
echo $doc->saveHTML();
?>
http://www.php.net/manual/en/class.domdocument.php
RSS feeds are XML. To get the links here I would use simpleXML. To load the page you can use cURL or HttpRequest.
To analyse the returned code I would use DOMDocument, too! Alternatively you could use simpleHtmlDom.
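A rough sketch of that pipeline; the feed URL and the XPath query for the article body are placeholders and depend on the sites you scrape:
<?php
// Placeholder feed URL.
$feed = simplexml_load_file('http://example.com/rss.xml');

foreach ($feed->channel->item as $item) {
    $url  = (string) $item->link;
    $html = file_get_contents($url); // cURL would work here as well

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world pages are rarely valid HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Hypothetical container id; adjust the query to the markup of each source site.
    $xpath = new DOMXpath($doc);
    $nodes = $xpath->query("//div[@id='article-body']");

    if ($nodes->length > 0) {
        echo trim($nodes->item(0)->textContent), "\n\n";
    }
}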
First, I know that I can get the HTML of a webpage with:
file_get_contents($url);
What I am trying to do is get a specific link element in the page (found in the head).
e.g:
<link type="text/plain" rel="service" href="/service.txt" /> (the element could close with just >)
My question is: How can I get that specific element with the "rel" attribute equal to "service" so I can get the href?
My second question is: Should I also get the "base" element? Does it apply to the "link" element? I am trying to follow the standard.
Also, the HTML might have errors. I don't have control over how my users code their stuff.
Using PHP's DOMDocument, this should do it (untested):
$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate markup errors, since the HTML may not be valid
$doc->loadHTML($file);
$head = $doc->getElementsByTagName('head')->item(0);
$links = $head->getElementsByTagName("link");
foreach($links as $l) {
if($l->getAttribute("rel") == "service") {
echo $l->getAttribute("href");
}
}
You should get the Base element, but know how it works and its scope.
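As a rough, hypothetical illustration of why the base element matters: a relative href from the link element would be resolved against the base href roughly like this (simplified resolution, with no ../ handling):
<?php
// Hypothetical helper, not part of the answer's code above.
function resolveHref($baseHref, $href) {
    // Already absolute: nothing to resolve.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    $parts = parse_url($baseHref);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $href; // no usable base, leave the href untouched
    }
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (substr($href, 0, 1) === '/') {
        return $root . $href; // root-relative: keep only scheme + host of the base
    }
    // Path-relative: append to the base's directory (simplified).
    $dir = isset($parts['path']) ? preg_replace('#/[^/]*$#', '/', $parts['path']) : '/';
    return $root . $dir . $href;
}

echo resolveHref('http://example.com/app/index.html', '/service.txt');
// http://example.com/service.txt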
In truth, when I have to screen-scrape, I use phpquery. This is an older PHP port of jQuery... and while that may sound like something of a dumb concept, it is awesome for document traversal... and doesn't require well-formed XHTML.
http://code.google.com/p/phpquery/
I'm working with Selenium under Java for web application testing. It provides very nice features for document traversal using CSS selectors.
Have a look at How to use Selenium with PHP.
But this setup might be too complex for your needs if you only want to extract this one link.