DOM Manipulation with PHP - php

I would like to make a simple but non trivial manipulation of DOM Elements with PHP but I am lost.
Assume a page like Wikipedia where you have paragraphs and titles (<p>, <h2>). They are siblings. I would like to take both elements, in sequential order.
I have tried GetElementbyName but then you have no possibility to organize information.
I have tried DOMXPath->query() but I found it really confusing.
Just parsing something like:
<html>
<head></head>
<body>
<h2>Title1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Title2</h2>
<p>Paragraph3</p>
</body>
</html>
into:
Title1
Paragraph1
Paragraph2
Title2
Paragraph3
With a few bits of HTML code I do not need between all.
Thank you. I hope question does not look like homework.

I think DOMXPath->query() is the right approach. This XPath expression will return all nodes that are either a <h2> or a <p> on the same level (since you said they were siblings).
/html/body/*[name() = 'p' or name() = 'h2']
The nodes will be returned as a node list in the right order (document order). You can then construct a foreach loop over the result.

I have uased a few times simple html dom by S.C.Chen.
Perfect class for access dom elements.
Example:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Check it out here. simplehtmldom
May help with future projects

Try having a look at this library and corresponding project:
Simple HTML DOM
This allows you to open up an online webpage or a html page from filesystem and access its items via class names, tag names and IDs. If you are familiar with jQuery and its syntax you need no time in getting used to this library.

Related

How to get desire innertext from html tag in simple html dom

I have some text in which there is codes. I want to get last text from the link. here is an example
Some textBeezfeed.cu.ma<br>
another textGoogle.com<br>
I want to get Google.com text from the above code. I have tried and use Simple html dom. Anyway Here is my code
<?PHP
require_once('simple_html_dom.php');
$html = new simple_html_dom();
function tags($ddd){
$bbb=$ddd->find('a',1);
foreach($bbb as $bs){
echo $bs->innertext;
}
}
$html = str_get_html('Some textBeezfeed.cu.ma<br>
another textGoogle.com<br>');
echo tags($html);
?>
I want to get Google.com how to get. Please help me
I strongly recommend you use some external library to parse HTML. Any HTML you need. As you need today or in future needs.
Some very good tools are named inside these stackoverflow post.
I personally use simplehtmldom.sourceforge.net since ages with very good results.

How can we get specific links using simple html dom

I have used this script which i found in the official simple html dom site to find hyperlinks in a website
foreach($html->find('a') as $element)
echo $element->href . '<br>';
it returned all the links found in the website but i want only specific links in that website.
is there a way of doing it in simple html dom. This is the html code for that specific links
<a class="z" href="http://www.bbc.co.uk/news/world-middle-east-16893609" target="_blank" rel="follow">middle east</a>
where this is the html tag which is different from other hyperlinks
<a class="z"
and also there is any way i can get the link text ("middle east") together with the link.
I understand you'd like all a elements with the class z? You can do that like this:
foreach($html->find('a.z') as $element)
You can get an element's value (which for links will be the link text) with the plaintext property:
$element->plaintext
Please note that this can all be found in the manual.

How do I get the link element in a html page with PHP

First, I know that I can get the HTML of a webpage with:
file_get_contents($url);
What I am trying to do is get a specific link element in the page (found in the head).
e.g:
<link type="text/plain" rel="service" href="/service.txt" /> (the element could close with just >)
My question is: How can I get that specific element with the "rel" attribute equal to "service" so I can get the href?
My second question is: Should I also get the "base" element? Does it apply to the "link" element? I am trying to follow the standard.
Also, the html might have errors. I don't have control on how my users code there stuff.
Using PHP's DOMDocument, this should do it (untested):
$doc = new DOMDocument();
$doc->loadHTML($file);
$head = $doc->getElementsByTagName('head')->item(0);
$links = $head->getElementsByTagName("link");
foreach($links as $l) {
if($l->getAttribute("rel") == "service") {
echo $l->getAttribute("href");
}
}
You should get the Base element, but know how it works and its scope.
In truth, when I have to screen-scrape, I use phpquery. This is an older PHP port of jQuery... and what that may sound like something of a dumb concept, it is awesome for document traversal... and doesn't require well-formed XHTMl.
http://code.google.com/p/phpquery/
I'm working with Selenium under Java for Web-Application-Testing. It provides very nice features for document traversal using CSS-Selectors.
Have a look at How to use Selenium with PHP.
But this setup might be to complex for your needs if you only want to extract this one link.

How to write this crawler in php?

I need to create a php script.
The idea is very simple:
When I send a link of a blogpost to this php script, then the webpage is crawled and the first image with the title page are saved on my server.
What PHP function I have to use for this crawler ?
Use PHP Simple HTML DOM Parser
// Create DOM from URL
$html = file_get_html('http://www.example.com/');
// Find all images
$images = array();
foreach($html->find('img') as $element) {
$images[] = $element->src;
}
Now $images array have images links of given webpage. Now you can store your desired image in database.
HTML Parser: HTMLSQL
Features: you can get external html file, http or ftp link and parse content.
Well, you'll have to use quite a few functions :)
But I'm going to assume that you're asking specifically about finding the image, and say that you should use a DOM parser like Simple HTML DOM Parser, then curl to grab the src of the first img element.
I would user file_get_contents() and a regular expression to extract the first image tags src attribute.
CURL or a HTML Parser seem overkill in this case, but you are welcome to check it out.

PHP SimpleXML: How can I load an HTML file?

When I try to load an HTML file as XML using simplexml_load_string I get many errors and warnings regarding the HTML and it fails, it there a way to properly load an html file using SimpleXML?
This HTML file may have unneeded spaces and maybe some other errors that I would like SimpleXML to ignore.
Use DomDocument::loadHtmlFile together with simplexml_import_dom to load non-wellformed HTML pages into SimpleXML.
I would suggest using PHP Simple HTML DOM. I've used it myself for anything from page scraping to manipulating HTML template files and its very simple and quite powerful and should suit your needs just fine.
Here's a few examples from their docs that show the kind of things you can do:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Here's some quick code to load an external html page, then parse it with simple xml.
//suppresses errors generated by poorly-formed xml
libxml_use_internal_errors(true);
//create the html object
$html = new DOMDocument();
//load the external html file
$html->loadHtmlFile('http://blahwhatever.com/');
//import the HTML object into simple xml
$shtml = simplexml_import_dom($html);
//print the result
echo "<pre>";
print_r($shtml);
echo "</pre>";
check this man page, one of those options (LIBXML_NOERROR for example) might help you.. but keep in mind that a html is not necessarily a valid xml, so parsing it as xml might not work.

Categories