PHP SimpleXML: How can I load an HTML file?

When I try to load an HTML file as XML using simplexml_load_string, I get many errors and warnings regarding the HTML and it fails. Is there a way to properly load an HTML file using SimpleXML?
The HTML file may have unneeded spaces and perhaps some other errors that I would like SimpleXML to ignore.

Use DOMDocument::loadHTMLFile together with simplexml_import_dom to load non-wellformed HTML pages into SimpleXML.
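For example, a minimal sketch of that approach (the local file name and the link query are placeholders, not from the original question):
// Collect parser errors instead of emitting warnings for the malformed HTML.
libxml_use_internal_errors(true);
// DOMDocument's HTML parser is tolerant of badly formed markup.
$dom = new DOMDocument();
$dom->loadHTMLFile('page.html');   // placeholder file name
// Hand the repaired tree over to SimpleXML.
$xml = simplexml_import_dom($dom);
// The usual SimpleXML API now works, e.g. an XPath query for all links.
foreach ($xml->xpath('//a') as $link) {
    echo (string) $link['href'], "\n";
}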

I would suggest using PHP Simple HTML DOM. I've used it myself for everything from page scraping to manipulating HTML template files; it's very simple, quite powerful, and should suit your needs just fine.
Here's a few examples from their docs that show the kind of things you can do:
// Create DOM from URL or file (requires the library's simple_html_dom.php)
include('simple_html_dom.php');
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';

Here's some quick code to load an external HTML page, then parse it with SimpleXML.
// suppress the warnings generated by the poorly-formed HTML
libxml_use_internal_errors(true);
//create the html object
$html = new DOMDocument();
//load the external html file
$html->loadHtmlFile('http://blahwhatever.com/');
//import the HTML object into simple xml
$shtml = simplexml_import_dom($html);
//print the result
echo "<pre>";
print_r($shtml);
echo "</pre>";

Check the libxml constants manual page; one of those options (LIBXML_NOERROR, for example) might help you. But keep in mind that HTML is not necessarily valid XML, so parsing it as XML might not work.
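For example, passing libxml option flags straight into SimpleXML might look like this (just a sketch; $htmlString is assumed to hold the fetched page, and badly broken markup may still fail to parse):
// LIBXML_NOERROR and LIBXML_NOWARNING silence the parser's error reports.
$xml = simplexml_load_string($htmlString, 'SimpleXMLElement',
                             LIBXML_NOERROR | LIBXML_NOWARNING);
if ($xml === false) {
    // The markup was still too far from valid XML to be parsed.
    echo 'Could not parse the page as XML.';
}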

Related

PHP DOMDocument is not working

I am studying HTML parsing in PHP and I am using DOM for this.
I wrote this code inside my PHP file:
<?php
$site = new DOMDocument();
$div = $site->createElement("div");
$class = $site->createAttribute("class");
$class->nodeValue = "wrapper";
$div->appendChild($class);
$site->appendChild($div);
$html = $site->saveHTML();
echo $html;
?>
And when I run this in the browser and view the page source, only this code comes out:
<div class="wrapper"></div>
I don't know why it is not showing the whole HTML document that it supposedly should contain. I am using XAMPP v3.2.1.
Please tell me where I went wrong with this. Thanks.
It's showing the whole HTML you created. A div node with a wrapper class attribute.
See the example in the docs. There the html, head, etc. nodes are explicitly created.
PHP only adds missing DOCTYPE, html and body elements when loading HTML, not when saving.
Adding $site->loadHTML($site->saveHTML()); before $html = $site->saveHTML(); will demonstrate this.
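A short sketch to illustrate that (the extra loadHTML() round trip is what adds the wrapper elements):
$site = new DOMDocument();
$div = $site->createElement("div");
$div->setAttribute("class", "wrapper");
$site->appendChild($div);
// Re-parsing the generated markup as HTML makes the parser add the
// missing DOCTYPE, <html> and <body> elements around the fragment.
$site->loadHTML($site->saveHTML());
echo $site->saveHTML();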

Creating a personalization engine with php

I am new to PHP and I want to create a PHP engine which changes the web content of a webpage using data in MySQL, for example changing the order of navigation links on a webpage according to their click counts (highest first). I am not sure how PHP will read the HTML file, change the elements in it, and output the HTML file with the changes. Is this possible?
I am not quite sure why you would want to generate the HTML, read it, change it and then output it. It seems a lot easier to just generate it the way you want in the first place.
I am not sure how PHP will read the HTML file, change the elements in it, and output the HTML file with the changes. Is this possible?
You could use file_get_contents:
$html = file_get_contents($url);
Then use an HTML parser like Simple HTML DOM Parser, change what you want, and output it.
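As a rough sketch of the reordering idea with Simple HTML DOM (the file name, the ul#nav id and the $clickCounts array are made-up placeholders; the real counts would come from MySQL):
include('simple_html_dom.php');
// Placeholder click counts; in the real engine these would come from MySQL.
$clickCounts = array('/blog' => 300, '/about' => 120, '/contact' => 45);
$html = file_get_html('page.html');   // placeholder file name
$nav  = $html->find('ul#nav', 0);     // placeholder navigation list id
// Sort the list items by click count, highest first
// (assumes every <li> contains exactly one <a>).
$items = $nav->find('li');
usort($items, function ($a, $b) use ($clickCounts) {
    $ca = isset($clickCounts[$a->find('a', 0)->href]) ? $clickCounts[$a->find('a', 0)->href] : 0;
    $cb = isset($clickCounts[$b->find('a', 0)->href]) ? $clickCounts[$b->find('a', 0)->href] : 0;
    return $cb - $ca;
});
// Rebuild the list in the new order and output the modified page.
$sorted = '';
foreach ($items as $li) {
    $sorted .= $li->outertext;
}
$nav->innertext = $sorted;
echo $html;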
If you want to modify HTML structure, use ganon - HTML DOM parser for PHP
include('path/ganon.php');
// Parse the google code website into a DOM
$html = file_get_dom('http://code.google.com/');
foreach($html('p[class]') as $element) {
echo $element->class, "<br>\n";
}

PHP replace text within a <h1> </h1> tag

I'm using AJAX to call a PHP file which will effectively edit particular bits of content within another HTML file. My problem is that I'm not sure of the best way of targeting these particular areas.
I figured some sort of unique identifier would need to be attached to the tag that needs to be edited, or put in a comment perhaps, and then PHP simply searches for this before doing the replacing?
Use Simple HTML DOM for this.
You can change all <h1> elements to foo like this:
$html = file_get_html('http://www.google.com/');
foreach($html->find('h1') as $element)
{
$element->innertext = 'foo';
}
echo $html;
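If you only want to touch one particular element, giving it a unique id and targeting it directly keeps the rest of the file untouched. A rough sketch (the file name, the id and the replacement text are placeholders):
include('simple_html_dom.php');
// Load the HTML file that the AJAX-called script should edit.
$html = file_get_html('content.html');           // placeholder file name
// Target one specific heading, e.g. <h1 id="page-title">...</h1>.
$title = $html->find('h1#page-title', 0);         // placeholder id
if ($title !== null) {
    $title->innertext = 'New heading text';        // placeholder text
    file_put_contents('content.html', (string) $html);  // write the change back
}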
The simplehtmldom framework allows you to search and modify the DOM of an HTML file or URL.
http://simplehtmldom.sourceforge.net/
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Another nice library is QueryPath. It is very similar to jQuery:
qp($html_code)->find('body')->text('Hello World')->writeHTML();
https://fedorahosted.org/querypath/wiki/QueryPathTutorial

How to write this crawler in php?

I need to create a php script.
The idea is very simple:
When I send the link of a blog post to this PHP script, the webpage is crawled and the first image and the page title are saved on my server.
Which PHP functions do I have to use for this crawler?
Use PHP Simple HTML DOM Parser
// Create DOM from URL
$html = file_get_html('http://www.example.com/');
// Find all images
$images = array();
foreach($html->find('img') as $element) {
$images[] = $element->src;
}
Now the $images array holds the image links of the given webpage, and you can store the image you want in your database.
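To also grab the page title and keep only the first image, a rough sketch (the URL and the save path are placeholders, and error handling is left out):
include('simple_html_dom.php');
$html = file_get_html('http://www.example.com/post');   // placeholder blog post URL
// The page title.
$title = $html->find('title', 0)->plaintext;
// The first image on the page.
$firstImg = $html->find('img', 0);
if ($firstImg !== null) {
    // A relative src would still need to be resolved against the page URL.
    $imageData = file_get_contents($firstImg->src);
    file_put_contents('/tmp/first-image.jpg', $imageData); // placeholder save path
}
echo $title;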
HTML Parser: HTMLSQL
Features: you can fetch an external HTML file via an HTTP or FTP link and parse its content.
Well, you'll have to use quite a few functions :)
But I'm going to assume that you're asking specifically about finding the image, and say that you should use a DOM parser like Simple HTML DOM Parser to find the first img element, then cURL to grab whatever its src points to.
I would use file_get_contents() and a regular expression to extract the first image tag's src attribute.
cURL or an HTML parser seems like overkill in this case, but you are welcome to check them out.
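A quick sketch of that regex approach (the pattern is deliberately simple and will miss some markup variants):
$page = file_get_contents('http://www.example.com/');   // placeholder URL
// Grab the src attribute of the first <img> tag; crude, but often enough.
if (preg_match('/<img[^>]+src=["\']([^"\']+)["\']/i', $page, $match)) {
    $firstImageSrc = $match[1];
    echo $firstImageSrc;
}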

DOM Manipulation with PHP

I would like to make a simple but non-trivial manipulation of DOM elements with PHP, but I am lost.
Assume a page like Wikipedia where you have paragraphs and titles (<p>, <h2>); they are siblings. I would like to take both kinds of elements, in sequential order.
I have tried getElementsByTagName, but then you have no way to keep the information organized.
I have tried DOMXPath->query(), but I found it really confusing.
Just parsing something like:
<html>
<head></head>
<body>
<h2>Title1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Title2</h2>
<p>Paragraph3</p>
</body>
</html>
into:
Title1
Paragraph1
Paragraph2
Title2
Paragraph3
With a few bits of HTML I do not need scattered in between.
Thank you. I hope this question does not look like homework.
I think DOMXPath->query() is the right approach. This XPath expression will return all nodes that are either an <h2> or a <p> on the same level (since you said they were siblings):
/html/body/*[name() = 'p' or name() = 'h2']
The nodes will be returned as a node list in the right order (document order). You can then construct a foreach loop over the result.
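Putting that expression into code could look like this (a sketch assuming the markup from the question is in $htmlSource):
// Load the markup; DOMDocument's HTML parser tolerates the missing bits.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($htmlSource);
// Run the XPath expression from above.
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("/html/body/*[name() = 'p' or name() = 'h2']");
// The nodes come back in document order: Title1, Paragraph1, Paragraph2, ...
foreach ($nodes as $node) {
    echo $node->nodeValue, "\n";
}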
I have used Simple HTML DOM by S.C. Chen a few times.
It is a perfect class for accessing DOM elements.
Example:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Check it out here: simplehtmldom
It may help with future projects.
Try having a look at this library and the corresponding project:
Simple HTML DOM
This allows you to open an online webpage or an HTML page from the filesystem and access its items via class names, tag names and IDs. If you are familiar with jQuery and its syntax, you will need almost no time to get used to this library.
