I was trying to do it with "getElementsByTagName", but it wasn't working. I'm new to using DOMDocument to parse HTML; I used regex until yesterday, when some kind folks here told me that DOMDocument would be better for the job, so I'm giving it a try :)
I googled around for a while looking for explanations but didn't find anything that helped (not with the class anyway).
So I want to capture "Capture this text 1" and "Capture this text 2" and so on.
Doesn't look too hard, but I can't figure it out :(
<div class="main">
<div class="text">
Capture this text 1
</div>
</div>
<div class="main">
<div class="text">
Capture this text 2
</div>
</div>
If you want to get :
The text
that's inside a <div> tag with class="text"
that's, itself, inside a <div> with class="main"
I would say the easiest way is not to use DOMDocument::getElementsByTagName, which returns all tags that have a specific name (while you only want some of them).
Instead, I would use an XPath query on your document, using the DOMXPath class.
For example, something like this should do to load the HTML string into a DOM object and instantiate the DOMXPath class:
$html = <<<HTML
<div class="main">
<div class="text">
Capture this text 1
</div>
</div>
<div class="main">
<div class="text">
Capture this text 2
</div>
</div>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
And, then, you can use XPath queries, with the DOMXPath::query method, that returns the list of elements you were searching for :
$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
var_dump(trim($tag->nodeValue));
}
And executing this gives me the following output :
string 'Capture this text 1' (length=19)
string 'Capture this text 2' (length=19)
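One caveat with the query above: @class="main" only matches when the attribute is exactly "main". If the real page uses several classes on one element (for example class="main wide", which is an assumption here and not in the question), a token-based test is more forgiving. A minimal sketch, where hasClass is a hypothetical helper name introduced for illustration:

```php
<?php
// Sample markup with extra classes added on purpose (an assumption, not from the question).
$html = <<<HTML
<div class="main wide">
<div class="text intro">Capture this text 1</div>
</div>
<div class="main">
<div class="text">Capture this text 2</div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: builds an XPath 1.0 predicate that matches one
// whitespace-separated class token instead of the whole attribute value.
function hasClass(string $name): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

$query = '//div[' . hasClass('main') . ']/div[' . hasClass('text') . ']';

$texts = [];
foreach ($xpath->query($query) as $tag) {
    $texts[] = trim($tag->nodeValue);
}

print_r($texts);
```

The exact-match form from the answer is fine when the markup is as simple as in the question; the token form only matters once elements carry more than one class.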
You can use http://simplehtmldom.sourceforge.net/
It is a very simple, easy-to-use DOM parser written in PHP, with which you can easily fetch the content of a div tag.
Something like this:
// Find all <div> which have attribute class=text
$ret = $html->find('div[class=text]');
See the documentation of it for more help.
I'm using XPath to extract data from a web site, but I have a problem with the XPath selector. Assuming I have this HTML code:
<div id="_parent">
<div class="my">
Hi!
<p>I am a child!</p>
<span class="someclass">I am a <b>span</b> child!</span>
</div>
</div>
what I get:
Hi!
I am a child!
I am a span child!
what I should get:
<div class="my">
Hi!
<p>I am a child!</p>
<span class="someclass">I am a <b>span</b> child!</span>
</div>
My current xpath php code
$xpath = new DOMXPath($doc);
$entries = $xpath->query("//div[@class='my']");
When in Chrome I open the console and enter this in it:
document.evaluate( "//div[@class='my']", document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null ).singleNodeValue;
then what I get is:
<div class="my">
Hi!
<p>I am a child!</p>
<span class="someclass">I am a <b>span</b> child!</span>
</div>
so the XPath expression actually works as intended. So I infer that the way you use the result of the XPath expression must be wrong. However, you did not show us the code that reads the matched nodes.
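Since the question does not show how the matched nodes are read, the following is only a guess at the missing step: reading nodeValue gives the concatenated text, while passing the node to DOMDocument::saveHTML gives its markup. A minimal sketch using the HTML from the question:

```php
<?php
$html = <<<HTML
<div id="_parent">
<div class="my">
Hi!
<p>I am a child!</p>
<span class="someclass">I am a <b>span</b> child!</span>
</div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$entries = $xpath->query("//div[@class='my']");

$markup = [];
foreach ($entries as $entry) {
    // saveHTML($node) serializes the node itself, tags included,
    // whereas $entry->nodeValue would return only the text content.
    $markup[] = $doc->saveHTML($entry);
}

echo $markup[0], "\n";
```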
Hi, I want to remove a line from an HTML file with PHP,
like this:
<div id="buttons">
<div id="buttonid_4">Button 4</div>
<div id="buttonid_3">Button 3</div>
<div id="buttonid_2">Button 2</div>
<div id="buttonid_1">Button 1</div>
</div>
So, I want to remove buttonid_4 and its content,
so that it looks like this:
<div id="buttons">
<div id="buttonid_3">Button 3</div>
<div id="buttonid_2">Button 2</div>
<div id="buttonid_1">Button 1</div>
</div>
At first I thought it would be easy, but I can't find the answer :|
I tried:
"as simple"
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTMLFile($The_Path_For_File);
$element = $dom->getElementById('buttonid_'. $Button_Id);
$element->parentNode->removeChild($element);
$dom->saveHTMLFile($The_Path_For_File);
I got
Call to a member function removeChild() on a non-object
every time I tried with getElementById, so I continued with XPath:
$xpath = new DOMXpath($dom);
$nodeList = $xpath->query('//div[@id="buttonid'.$Button_Id.'"]');
foreach($nodeList as $element){
$dom->$element->removeChild($element);
}
$dom->saveHTMLFile($The_Path_For_File);
I didn't get an error, and Notepad asked to reload the file, but nothing had changed.
Does anyone know how to do this?
The use of getElementById requires a Document Type Declaration (DTD).
Notice that your HTML fails validation with $dom->validate().
Just add <!DOCTYPE html> to your HTML and it will work.
From the PHP documentation:
For this function to work, you will need either to set some ID
attributes with DOMElement::setIdAttribute or a DTD which defines an
attribute to be of type ID. In the later case, you will need to
validate your document with DOMDocument::validate or
DOMDocument::$validateOnParse before using this function.
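For the XPath attempt in the question, the query was also missing the underscore in "buttonid_", and removal has to go through the node's parent. A minimal sketch of a working variant, assuming an in-memory HTML string in place of the file and a hypothetical $buttonId variable:

```php
<?php
// In-memory stand-in for the file from the question (an assumption for this sketch).
$html = <<<HTML
<!DOCTYPE html>
<html><body>
<div id="buttons">
<div id="buttonid_4">Button 4</div>
<div id="buttonid_3">Button 3</div>
<div id="buttonid_2">Button 2</div>
<div id="buttonid_1">Button 1</div>
</div>
</body></html>
HTML;

$buttonId = 4; // hypothetical name for the id suffix to remove

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodeList = $xpath->query('//div[@id="buttonid_' . $buttonId . '"]');

// Copy the list to an array first so removal cannot disturb the iteration.
foreach (iterator_to_array($nodeList) as $element) {
    $element->parentNode->removeChild($element);
}

$result = $dom->saveHTML();
echo $result;
```

To write the change back to disk, $dom->saveHTMLFile($The_Path_For_File) from the question would replace the final saveHTML() call.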
I have a div which contains other HTML tags along with text.
I want to extract only the text from this div, including the text inside all of its HTML tags.
<div class="rpr-help m-chm">
<div class="header">
<h2 class="h6">Repair Help</h2>
</div><!-- /end .header -->
<div class="inner m-bsc">
<ul>
<li>Repair Video</li>
<li>Repair Q&A</li>
</ul>
</div>
<div>
<br>
<span class="h4">Cross Reference Information</span><br>
<p>Part Number 285753A (AP3963893) replaces 1195967, 280152, 285140, 285743, 285753, 3352470, 3363664, 3364002, 3364003, 62672, 62693, 661560, 80008, 8559748, AH1485646, EA1485646, PS1485646.
<br>
</p>
</div>
</div>
Here is my regexp:
preg_match_all("/<div class=\"rpr-help m-chm\">(.*)<\/.*>/s", $urlcontent, $description);
It works fine whenever I assign this complete div to the $urlcontent variable.
But when I fetch data from a real URL, like $urlcontent = "www.test.com/test.html";
it returns the complete web page source.
How can I get the content inside <div class="rpr-help m-chm">?
Does my regexp need any correction?
Any help would be appreciated. Thanks.
It's not possible to reliably parse HTML/XHTML with regex. To quote the source:
You can't parse [X]HTML with regex. Because HTML can't be parsed by
regex. Regex is not a tool that can be used to correctly parse HTML
Depending on the language you use, please consider using a third-party library for HTML parsing.
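Use this function:
// Returns the text between the first $tagStart and the next $tagEnd,
// or an empty string if $tagStart does not occur in $content.
function GetclassContent($tagStart, $tagEnd, $content)
{
    $first_step = explode($tagStart, $content, 2);
    if (!isset($first_step[1])) {
        return '';
    }
    $second_step = explode($tagEnd, $first_step[1], 2);
    return $second_step[0];
}
Steps to use the above function:
// file_get_contents needs the scheme to fetch a URL; without it the string is treated as a local path.
$website = "http://www.test.com/test.html";
$content = file_get_contents($website);
$tagStart = '<div class="rpr-help m-chm">';
// Note: matching stops at the first closing </div>, so content after a nested <div> is cut off.
$tagEnd = "</div>";
$RequiredContent = GetclassContent($tagStart, $tagEnd, $content);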
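Because the target div in the question contains nested <div> tags, a DOM-based version may be easier to get right than string splitting. A minimal sketch, assuming a shortened inline copy of the markup in place of the fetched page:

```php
<?php
// Shortened copy of the question's markup (an assumption; the real page would
// be fetched first, e.g. with file_get_contents on the full URL).
$html = <<<HTML
<div class="rpr-help m-chm">
<div class="header">
<h2 class="h6">Repair Help</h2>
</div>
<div class="inner m-bsc">
<ul>
<li>Repair Video</li>
</ul>
</div>
<div>
<span class="h4">Cross Reference Information</span>
</div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world pages often trigger parser warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$node = $xpath->query('//div[@class="rpr-help m-chm"]')->item(0);

// textContent is the text of the node and all its descendants, without tags;
// runs of whitespace are collapsed to single spaces afterwards.
$text = $node !== null ? trim(preg_replace('/\s+/', ' ', $node->textContent)) : '';

echo $text, "\n";
```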
I am trying to get an element from an external page (a div tag including some content) by its ID and print it on another page of the site. I am trying to use the code below, but I get errors for the tags inside the included element (figcaption, figure). Is there any way to include only a single div, by its ID, from another page?
PHP
<?php
$doc = new DOMDocument();
$doc->loadHTMLFile($_SERVER['DOCUMENT_ROOT'].'/assets/includes/example.html');
$example = $doc->getElementById('test');
echo $example->nodeValue;
?>
HTML
<div id="test">
<figure>
<img src="img1.jpg" alt="img" />
<figcaption></figcaption>
</figure>
</div>
DOMDocument reports errors on HTML5 markup even when the markup is valid, because its parser does not know HTML5 elements such as <figure> and <figcaption> and has no DTD to check them against.
To avoid this, simply change your code in this way:
libxml_use_internal_errors( True );
$doc->loadHTMLFile( $_SERVER['DOCUMENT_ROOT'].'/assets/includes/example.html' );
Anyway, even if some errors are displayed, your code loads the HTML document correctly. You can't display the <div> because you use the wrong call: replace echo $example->nodeValue
with:
echo $doc->saveHTML( $example );
The right way to print DOM HTML is DOMDocument->saveHTML() or, if you want to print only part of the document, DOMDocument->saveHTML( DOMElement ).
Also note that DOMDocument does not try to preserve the formatting of the original document, so you probably won't get this:
<div id="test">
<figure>
<img src="img1.jpg" alt="img" />
<figcaption></figcaption>
</figure>
</div>
but this:
<div id="test">
<figure><img src="img1.jpg" alt="img"><figcaption></figcaption></figure>
</div>
You are currently only echoing the node value, which is the text content. Since you have no text in #test, nothing is output.
You have to print it as HTML:
echo $doc->saveHTML($example);
Hello, and thanks for looking at my question.
I need to grab some data from an HTML snippet.
The source is trusted and structured, so I think it's OK to use regex on this HTML. DOM and other advanced features in PHP are overkill, I guess.
Here is the format of the HTML snippet.
<div id="d-container">
<div id="row-custom_1">
<div class="label">Type</div>
<div class="content">John Smith</div>
<div class="clear"></div>
</div>
</div>
In the above, please note that the first 2 DIV tags have IDs set. There could be several div tags like row-custom_1, so I will need to escape them.
I'm actually very poor at regex, so I'm hoping for your help to grab the John Smith from the above HTML snippet.
It could be something like
<div * id="row-custom_1" * > * <div * class="content" * >GRAB THIS </div>
but I don't know how to do it in regex.
The John Smith part won't contain any HTML for sure. It comes from a trusted source that strips all HTML and gives the data in the above format.
I understand that regex is never a good idea for processing HTML anyway.
Thank you very much for any assistance.
Edit just after 30 minutes:
Many awesome people suggested using an HTML parser, so I did, and it worked like a charm. So if anyone comes here with a similar question, as the author of this stupid question, I'd recommend using DOM for the job.
Here is a simple DOM based code to get your value from the given HTML:
$html = <<< EOF
<div id="d-container">
<div id="row-custom_1">
<div class="label">Type</div>
<div class="content">John Smith</div>
<div class="clear"></div>
</div>
</div>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$value = $xpath->evaluate("string(//div[@id='d-container']
/div[@id='row-custom_1']/div[@class='content']/text())");
echo "User Name: [$value]\n"; // prints your user name
OUTPUT:
User Name: [John Smith]