Parse HTML with PHP's HTML DOMDocument

I was trying to do this with getElementsByTagName, but it wasn't working. I'm new to using DOMDocument to parse HTML; I used regex until yesterday, when some kind folks here told me that DOMDocument would be better for the job, so I'm giving it a try :)
I googled around for a while looking for explanations but didn't find anything that helped (not with the class attribute, anyway).
So I want to capture "Capture this text 1", "Capture this text 2", and so on.
It doesn't look too hard, but I can't figure it out :(
<div class="main">
<div class="text">
Capture this text 1
</div>
</div>
<div class="main">
<div class="text">
Capture this text 2
</div>
</div>

If you want to get :
The text
that's inside a <div> tag with class="text"
that's, itself, inside a <div> with class="main"
I would say the easiest way is not to use DOMDocument::getElementsByTagName -- which will return all tags that have a specific name (while you only want some of them).
Instead, I would use an XPath query on your document, using the DOMXpath class.
For example, something like this should load the HTML string into a DOM object and instantiate the DOMXPath class:
$html = <<<HTML
<div class="main">
<div class="text">
Capture this text 1
</div>
</div>
<div class="main">
<div class="text">
Capture this text 2
</div>
</div>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
And, then, you can use XPath queries, with the DOMXPath::query method, that returns the list of elements you were searching for :
$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
var_dump(trim($tag->nodeValue));
}
And executing this gives me the following output :
string 'Capture this text 1' (length=19)
string 'Capture this text 2' (length=19)
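One caveat with the query above: @class="text" compares the whole attribute value, so it would miss an element written as class="text highlight". The following is a minimal sketch of a token-based class match, assuming a hypothetical extra class ("highlight") that is not in the original question. It also suppresses libxml parser warnings, which is optional.

```php
<?php
// Hypothetical input: the first inner div carries an extra class ("highlight").
$html = '<div class="main"><div class="text highlight"> Capture this text 1 </div></div>'
      . '<div class="main"><div class="text"> Capture this text 2 </div></div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Pad the class attribute with spaces and look for " text " as a whole token,
// so "text highlight" matches but "textarea" would not.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " main ")]'
       . '/div[contains(concat(" ", normalize-space(@class), " "), " text ")]';

$captured = [];
foreach ($xpath->query($query) as $node) {
    $captured[] = trim($node->nodeValue);
}

print_r($captured);
```

With a single-class attribute, the simpler @class="text" form from the answer is enough.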

You can use Simple HTML DOM: http://simplehtmldom.sourceforge.net/
It is a simple, easy-to-use DOM parser written in PHP, with which you can easily fetch the content of a div tag.
Something like this:
// Find all <div> elements whose class attribute is "text"
// ($html is the parsed document object, e.g. the result of str_get_html())
$ret = $html->find('div[class=text]');
See its documentation for more help.

Related

unable to correctly apply XPath from php

I'm using XPath to extract data from a web site, but I have a problem with the XPath selector. Assuming I have this HTML code:
<div id="_parent">
<div class="my">
Hi!
<p>I am a child!</p>
<span class="someclass">I am a <b>span</b> child!</span>
</div>
</div>
what I get:
Hi!
I am a child!
I am a span child!
what I should get:
<div class="my">
Hi!
<p>I am a child!</p>
<span class="someclass">I am a <b>span</b> child!</span>
</div>
My current xpath php code
$xpath = new DOMXPath($doc);
$entries = $xpath->query("//div[@class='my']");

When in Chrome I open the console and enter this in it:
document.evaluate( "//div[@class='my']", document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null ).singleNodeValue;
then what I get is:
<div class="my">
Hi!
<p>I am a child!</p>
<span class="someclass">I am a <b>span</b> child!</span>
</div>
So the XPath expression actually works as intended, and I infer that the way you use its result must be wrong. However, you did not show us the code that handles the result of the XPath query.
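A likely cause, assuming the matched node is printed through nodeValue (the question does not show that part), is that nodeValue returns only the concatenated text of the element. The sketch below contrasts it with DOMDocument::saveHTML() called on the node, which serializes the element together with its tags. The variable names $text and $markup are illustrative.

```php
<?php
$html = '<div id="_parent"><div class="my">Hi!<p>I am a child!</p>'
      . '<span class="someclass">I am a <b>span</b> child!</span></div></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$entries = $xpath->query("//div[@class='my']");
$node = $entries->item(0);

// nodeValue: text content only, all tags dropped.
$text = $node->nodeValue;

// saveHTML($node): the element itself, serialized with its markup.
$markup = $doc->saveHTML($node);

echo $text, "\n";
echo $markup, "\n";
```

The serialized whitespace may differ from the original source, since the parser does not keep the document's formatting.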

DOMDocument Remove div and it content by identifier with PHP

Hi, I want to remove a line from an HTML file with PHP,
like this:
<div id="buttons">
<div id="buttonid_4">Button 4</div>
<div id="buttonid_3">Button 3</div>
<div id="buttonid_2">Button 2</div>
<div id="buttonid_1">Button 1</div>
</div>
So I want to remove buttonid_4 and its content,
so that the result looks like this:
<div id="buttons">
<div id="buttonid_3">Button 3</div>
<div id="buttonid_2">Button 2</div>
<div id="buttonid_1">Button 1</div>
</div>
At first I thought it would be easy, but I can't find the answer :|
I tried the simple way:
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTMLFile($The_Path_For_File);
$element = $dom->getElementById('buttonid_'. $Button_Id);
$element->parentNode->removeChild($element);
$dom->saveHTMLFile($The_Path_For_File);
I got
Call to a member function removeChild() on a non-object
every time I tried getElementById, so I continued with XPath:
$xpath = new DOMXpath($dom);
$nodeList = $xpath->query('//div[@id="buttonid'.$Button_Id.'"]');
foreach($nodeList as $element){
$dom->$element->removeChild($element);
}
$dom->saveHTMLFile($The_Path_For_File);
I didn't get an error, and Notepad asked to reload the file, but nothing had changed.
Does anyone know how to do this?

The use of getElementById requires a Document Type Declaration (DTD) or explicitly declared ID attributes.
Notice that your HTML fails validation with $dom->validate().
Just add <!DOCTYPE html> to your HTML and it will work.
From the PHP documentation:
For this function to work, you will need either to set some ID
attributes with DOMElement::setIdAttribute or a DTD which defines an
attribute to be of type ID. In the later case, you will need to
validate your document with DOMDocument::validate or
DOMDocument::$validateOnParse before using this function.
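The XPath attempt in the question can also be made to work. Two details in it stand out: the id in the query is missing the underscore ("buttonid" instead of "buttonid_"), and removeChild() has to be called on the node's parent rather than through $dom->$element. Below is a minimal sketch that works on an in-memory string instead of a file; $buttonId and $result are illustrative names.

```php
<?php
$html = '<!DOCTYPE html><html><body><div id="buttons">'
      . '<div id="buttonid_4">Button 4</div>'
      . '<div id="buttonid_3">Button 3</div>'
      . '<div id="buttonid_2">Button 2</div>'
      . '</div></body></html>';
$buttonId = 4; // hypothetical id suffix to remove

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Note the underscore in "buttonid_" and the @ before the attribute name.
$nodeList = $xpath->query('//div[@id="buttonid_' . $buttonId . '"]');

foreach ($nodeList as $element) {
    // A node is removed by asking its parent to drop it.
    $element->parentNode->removeChild($element);
}

$result = $dom->saveHTML();
echo $result;
```

When working with a file, loadHTMLFile() and saveHTMLFile() would take the place of loadHTML() and saveHTML().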

How to get content from Div which have other HTML tags using Regexp

I have a div which contains other HTML tags along with text.
I want to extract only the text from this div, including the text inside all of its nested HTML tags.
<div class="rpr-help m-chm">
<div class="header">
<h2 class="h6">Repair Help</h2>
</div><!-- /end .header -->
<div class="inner m-bsc">
<ul>
<li>Repair Video</li>
<li>Repair Q&A</li>
</ul>
</div>
<div>
<br>
<span class="h4">Cross Reference Information</span><br>
<p>Part Number 285753A (AP3963893) replaces 1195967, 280152, 285140, 285743, 285753, 3352470, 3363664, 3364002, 3364003, 62672, 62693, 661560, 80008, 8559748, AH1485646, EA1485646, PS1485646.
<br>
</p>
</div>
</div>
Here is my Regexp
preg_match_all("/<div class=\"rpr-help m-chm\">(.*)<\/.*>/s", $urlcontent, $description);
It works fine whenever I assign this complete div to the $urlcontent variable.
But when I fetch data from a real URL, like $urlcontent = "www.test.com/test.html";
it returns the complete web page source.
How can I get the inner content of <div class="rpr-help m-chm">?
Is any correction required in my regexp?
Any help would be appreciated. Thanks

It's not possible to reliably parse HTML/XHTML with regex. To quote the source:
You can't parse [X]HTML with regex. Because HTML can't be parsed by
regex. Regex is not a tool that can be used to correctly parse HTML
Depending on the language you use, please consider using a third-party library for HTML parsing.
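In PHP, the built-in DOM extension can do this without a third-party library. The following is a minimal sketch, assuming the page has already been downloaded into $urlcontent (for example with file_get_contents() on a full http:// URL). The markup here is a shortened, hypothetical version of the snippet in the question, and $innerText is an illustrative name.

```php
<?php
// Shortened stand-in for the downloaded page (hypothetical content).
$urlcontent = '<html><body><div class="rpr-help m-chm">'
            . '<div class="header"><h2 class="h6">Repair Help</h2></div>'
            . '<div><span class="h4">Cross Reference Information</span>'
            . '<p>Part Number 285753A (AP3963893) replaces 1195967.</p></div>'
            . '</div><div class="other">Unrelated</div></body></html>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($urlcontent);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Exact match on the full class attribute, as written in the question.
$div = $xpath->query('//div[@class="rpr-help m-chm"]')->item(0);

$innerText = '';
if ($div !== null) {
    // textContent gathers the text of all descendants; collapse whitespace runs.
    $innerText = trim(preg_replace('/\s+/', ' ', $div->textContent));
}

echo $innerText, "\n";
```

Because the parser tracks nesting, the inner div elements do not cut the match short the way a pattern ending at the first closing tag would.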

Use this function:
function GetclassContent($tagStart,$tagEnd,$content)
{
$first_step = explode( $tagStart,$content );
$second_step = explode($tagEnd,$first_step[1] );
return $second_step[0];
}
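Steps to use the above function:
$website = "http://www.test.com/test.html"; // include the scheme so the URL is fetched rather than read as a local path
$content = file_get_contents($website);
$tagStart = '<div class="rpr-help m-chm">';
$tagEnd = "</div>"; // note: matching stops at the first closing </div>, so nested divs cut the result short
$RequiredContent = GetclassContent($tagStart, $tagEnd, $content);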

Getting and echo element including content by ID using PHP

I am trying to get an element (a div tag including some content) from an external page by its ID and print it on another page of a site. I am trying to use the code below, but I get tag errors for elements inside the included element (figcaption, figure). Is there any way to include only a single div by its ID from another page?
PHP
$doc = new DOMDocument();
$doc->loadHTMLFile($_SERVER['DOCUMENT_ROOT'].'/assets/includes/example.html');
$example = $doc->getElementById('test');
echo $example->nodeValue;
?>
HTML
<div id="test">
<figure>
<img src="img1.jpg" alt="img" />
<figcaption></figcaption>
</figure>
</div>

DOMDocument reports errors on HTML5 markup even when there are none, because it cannot check HTML5 against a DTD.
To avoid this, change your code this way:
libxml_use_internal_errors(true);
$doc->loadHTMLFile($_SERVER['DOCUMENT_ROOT'].'/assets/includes/example.html');
In any case, even if some errors are displayed, your code loads the HTML document correctly. You can't display the <div> because you read the wrong property: replace echo $example->nodeValue
with:
echo $doc->saveHTML( $example );
The right way to print DOM HTML is DOMDocument->saveHTML(), or, if you want to print only part of the document, DOMDocument->saveHTML( DOMElement ).
Also note that DOMDocument does not try to preserve the formatting of the original document, so you probably won't get this:
<div id="test">
<figure>
<img src="img1.jpg" alt="img" />
<figcaption></figcaption>
</figure>
</div>
but this:
<div id="test">
<figure><img src="img1.jpg" alt="img"><figcaption></figcaption></figure>
</div>

You are currently only echoing the node value, which is the text content. Since there is no text inside #test, nothing is output.
You have to print it as HTML:
echo $doc->saveHTML($example);

php - regex to get contents in DIV tags

Hello, and thanks for looking at my question.
I need to grab some data from an HTML snippet.
This source is trusted and structured, so I think it's OK to use regex on this HTML. DOM and other advanced PHP features are overkill, I guess.
Here is the format of the HTML snippet.
<div id="d-container">
<div id="row-custom_1">
<div class="label">Type</div>
<div class="content">John Smith</div>
<div class="clear"></div>
</div>
</div>
In the above, please note that the first two DIV tags have IDs set. There could be several div tags like row-custom_1, so I will need to tell them apart.
I'm actually very poor at regex, so I'm hoping for help grabbing John Smith from the above HTML snippet.
It could be something like
<div * id="row-custom_1" * > * <div * class="content" * >GRAB THIS </div>
but I don't know how to do it in regex.
The John Smith part won't contain any HTML for sure; it comes from a trusted source that strips all HTML and gives the data in the above format.
I understand that regex is never a good idea for processing HTML anyway.
Thank you very much for any assistance.
Edit just after 30 minutes:
Many of the awesome people suggested using an HTML parser, so I did; it worked like a charm. So if anyone comes here with a similar question, as the stupid question author, I'd recommend using DOM for the job.

Here is a simple DOM based code to get your value from the given HTML:
$html = <<< EOF
<div id="d-container">
<div id="row-custom_1">
<div class="label">Type</div>
<div class="content">John Smith</div>
<div class="clear"></div>
</div>
</div>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$value = $xpath->evaluate("string(//div[@id='d-container']
/div[@id='row-custom_1']/div[@class='content']/text())");
echo "User Name: [$value]\n"; // prints your user name
OUTPUT:
User Name: [John Smith]
