I have the following content:
<div class="item">
<a href="ONE">
<img src="TWO">
</a>
</div>
I want to use XPath to pull out "ONE" and "TWO" from there.
The code I have right now is:
$html = file_get_contents($_POST['url']);
$document = new DOMDocument();
$document->loadHTML ($html);
$selector = new DOMXPath($document);
$query = '//div[@class="item"]';
$anchors = $selector->query($query);
foreach ($anchors as $node) {
// print ONE;
// print TWO;
}
Here is an example:
$html = <<<EOF
<div class="item">
<a href="ONE">
<img src="TWO">
</a>
</div>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);
$links = $selector->query(
'//div[@class="item"]//@href | //div[@class="item"]//@src'
);
foreach($links as $link) {
echo $link->nodeValue . PHP_EOL;
}
If you want to break it down by <div class="item"> you can use the following code:
foreach($selector->query('//div[@class="item"]') as $div) {
foreach($selector->query('.//@href | .//@src', $div) as $link) {
echo $link->nodeValue . PHP_EOL;
}
}
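If you would rather keep each href together with its matching src instead of getting one flat list of values, a sketch along these lines may help. It assumes every link of interest is an <a> inside a div with class "item" and that the image sits inside that link, as in the question; the $pairs array is just an illustrative name.

```php
<?php
$html = '<div class="item"><a href="ONE"><img src="TWO"></a></div>';

// Collect parser warnings quietly instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$pairs = [];
foreach ($xpath->query('//div[@class="item"]//a[@href]') as $a) {
    // Look for the first image inside this particular link.
    $img = $xpath->query('.//img[@src]', $a)->item(0);
    $pairs[] = [
        'href' => $a->getAttribute('href'),
        'src'  => $img ? $img->getAttribute('src') : null,
    ];
}

print_r($pairs);
```

Reading the attributes with getAttribute() on the element keeps the link and its image in one record, which is handy if you later need to store or print them side by side.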
Related
I have a result from a curl request from a page like this:
$result =
<div class="c-wrapper">
<a href="link-to-a-page.php">
<div class="c-content-img">
<img src="...">
</div>
<div class="c-link-data">
<div class="c-link-data-title">
<h4>TITLE</h4>
</div>
</div>
</a>
</div>
<div class="c-wrapper">
<div class="c-content-img">
<img src="...">
</div>
<div class="c-link-data">
<div class="c-link-data-title">
<h4>TITLE 2</h4>
</div>
</div>
</div>
Now I have to count how many c-wrapper elements are present.
This works correctly:
$doc = new DOMDocument();
@$doc->loadHTML($result);
$xpath = new DOMXPath($doc);
$divs = $xpath->query("//div[contains(@class, 'c-wrapper')]");
echo $divs->length; //<--- printed: 2
Then I have to print all titles:
This also works correctly:
$titles = $xpath->query("//div[contains(@class, 'c-link-data-title')]/h4");
foreach ($titles as $title) {
echo $title->textContent . "<br>";
}
Now the part I don't know: the first div contains a link, the second one does not. I'd like to change the way I print the titles to something like this:
foreach ($titles as $title) {
if ( $link_extracted !="" )
echo "<a href='" . $link_extracted . "'>" . $title->textContent . "</a><br>";
else
echo $title->textContent . "<br>";
}
How can I edit $titles = $xpath->query("//div[contains(@class, 'c-link-data-title')]/h4"); to achieve this?
Rather than doing this in separate stages, the code below finds the c-wrapper elements and then uses XPath relative to each one to find the parts you want inside that particular element. So in
$link_extracted = $xpath->evaluate("a/@href", $div)[0];
the expression looks for the href attribute of an <a> child relative to the $div element, and [0] takes only the first match.
$doc = new DOMDocument();
@$doc->loadHTML($result);
$xpath = new DOMXPath($doc);
$divs = $xpath->query("//div[contains(@class, 'c-wrapper')]");
echo $divs->length;
foreach ( $divs as $div ) {
$link_extracted = $xpath->evaluate("a/@href", $div)[0];
$title = $xpath->evaluate("descendant::div[contains(@class, 'c-link-data-title')]/h4/text()"
, $div)[0];
if ( !empty($link_extracted->nodeValue) ) {
echo "<a href='" . $link_extracted->nodeValue . "'>" . $title->textContent . "</a><br>";
}
else {
echo $title->textContent . "<br>";
}
}
which for your test HTML gives...
2<a href='link-to-a-page.php'>TITLE</a><br>TITLE 2<br>
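A possible variation, offered only as a sketch: wrapping the expressions in XPath's string() function makes evaluate() return a plain PHP string, which is empty when nothing matches, so there is no node to null-check. The trimmed-down $result below is an assumption based on the question's markup (with the wrappers properly closed), and $lines is an illustrative name.

```php
<?php
$result = '
<div class="c-wrapper">
  <a href="link-to-a-page.php">
    <div class="c-link-data"><div class="c-link-data-title"><h4>TITLE</h4></div></div>
  </a>
</div>
<div class="c-wrapper">
  <div class="c-link-data"><div class="c-link-data-title"><h4>TITLE 2</h4></div></div>
</div>';

// Collect parser warnings quietly instead of using the @ operator.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($result);
$xpath = new DOMXPath($doc);

$lines = [];
foreach ($xpath->query("//div[contains(@class, 'c-wrapper')]") as $div) {
    // string(...) yields '' when the wrapper has no direct <a> child.
    $href  = $xpath->evaluate("string(a/@href)", $div);
    $title = trim($xpath->evaluate(
        "string(.//div[contains(@class, 'c-link-data-title')]/h4)",
        $div
    ));

    // Escape before echoing scraped values back into HTML.
    $safeTitle = htmlspecialchars($title);
    $lines[] = $href !== ''
        ? '<a href="' . htmlspecialchars($href) . '">' . $safeTitle . '</a>'
        : $safeTitle;
}

echo implode("<br>\n", $lines), "<br>\n";
```

Escaping with htmlspecialchars() is a precaution I added because the values come from a remote page; drop it if you know the input is trusted.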
I have the following:
<div id="content">
<div class="content-top">bla</div>
<div class="inner text-inner">
bla bla bla
</div>
</div>
and the PHP:
$page = file_get_contents('http://www.example.com/test');
$doc = new DOMDocument();
@$doc->loadHTML($page);
$node = $doc->getElementById('content');
How can I modify $node = $doc->getElementById('content'); so I can target <div class="inner text-inner">?
You can use XPath to easily achieve it.
$page = file_get_contents('http://www.example.com/test');
$doc = new DOMDocument();
$doc->loadHTML($page);
$xpath = new DomXPath($doc);
$nodeList = $xpath->query("//div[@class='inner text-inner']");
$node = $nodeList->item(0);
// To check the result:
echo "<p>" . $node->nodeValue . "</p>";
This will output:
bla bla bla
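Note that @class='inner text-inner' compares the whole attribute string, so it stops matching if the classes appear in a different order or another class is added. A common workaround, sketched below, tests each class name as a whitespace-separated token. The classTest() helper is a hypothetical name of my own, and the sample markup deliberately swaps the class order to show the difference.

```php
<?php
// Hypothetical helper: builds an XPath test for a single class token.
function classTest(string $name): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

$page = '<div id="content">'
      . '<div class="content-top">bla</div>'
      . '<div class="text-inner inner"> bla bla bla </div>'
      . '</div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($page);
$xpath = new DOMXPath($doc);

// Require both tokens, in any order, on a child div of #content.
$query = "//div[@id='content']/div["
       . classTest('inner') . " and " . classTest('text-inner') . "]";

$node = $xpath->query($query)->item(0);
$text = $node ? trim($node->nodeValue) : null;

echo $text, "\n";
```

Padding the attribute with spaces on both sides is what keeps 'inner' from accidentally matching inside 'text-inner'.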
I have this html structure:
<html>
<body>
<section>
<div>
<div>
<section>
<div>
<table>
<tbody>
<tr></tr>
<tr>
<td></td>
<td></td>
<td>
<i></i>
<div class="first-div class-one">
<div class="second-div"> soft </div>
130 cm / 15cm
</div>
</td>
</tr>
<tr></tr>
</tbody>
</table>
</div>
</section>
</div>
</div>
</section>
</body>
</html>
Now, I have this XPath code:
$doc = new DOMDocument();
@$doc->loadHtmlFile('http://www.whatever.com');
$doc->preserveWhiteSpace = false;
$xpath = new DOMXPath( $doc );
$nodelist = $xpath->query( '/html/body/section/div[2]/section/div/table/tbody/tr[2]/td[3]/div' );
foreach ( $nodelist as $node ) {
$result = $node->nodeValue."\n";
}
This gets me 'soft 130 cm / 15cm' as a result.
But I want to know how to get only '15', so I need:
1. To know how to get rid of the childNode->nodeValue
2. Once I have '130 cm / 15cm', to know how to get only '15' as the nodeValue of a variable in PHP.
Can you guys help?
Thanks in advance
Text within a tag is also a node (a child), more particularly a DOMText.
By looking at the children of that div, you can find the DOMText and get its nodeValue. An example below:
$doc = new DOMDocument();
$doc->loadHTML("<html><body><p>bah</p>Test</body></html>");
echo $doc->saveHTML();
$xpath = new DOMXPath( $doc );
$nodelist = $xpath->query( '/html/body' );
foreach ( $nodelist as $node ) {
if ($node->childNodes)
foreach ($node->childNodes as $child) {
if($child instanceof DOMText)
echo $child->nodeValue."\n"; // should output "Test".
}
}
Your second point can easily be done with regular expressions:
$string = "130 cm / 15cm";
$matches = array();
preg_match('|/ ([0-9]+) ?cm$|', $string, $matches);
echo $matches[1];
Full Solution:
<?php
$strhtml = '
<html>
<body>
<section>
<div>
<div>
<section>
<div>
<table>
<tbody>
<tr></tr>
<tr>
<td></td>
<td></td>
<td>
<i></i>
<div class="first-div class-one">
<div class="second-div"> soft </div>
130 cm / 15cm
</div>
</td>
</tr>
<tr></tr>
</tbody>
</table>
</div>
</section>
</div>
</div>
</section>
</body>
</html>';
$doc = new DOMDocument();
@$doc->loadHTML($strhtml);
echo $doc->saveHTML();
$xpath = new DOMXPath( $doc );
$nodelist = $xpath->query( '/html/body/section/div/div/section/div/table/tbody/tr[2]/td[3]/div' );
foreach ( $nodelist as $node ) {
if ($node->childNodes)
foreach ($node->childNodes as $child) {
if($child instanceof DOMText && trim($child->nodeValue) != "")
{
echo 'Raw: '.trim($child->nodeValue)."\n";
$matches = array();
preg_match('|/ ([0-9]+) ?cm$|', trim($child->nodeValue), $matches);
echo 'Value: '.$matches[1]."\n";
}
}
}
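As an alternative sketch, XPath can select the text node directly with text(), which avoids looping over childNodes by hand. The shortened markup and the contains(@class, "first-div") test are assumptions made for the example; the regular expression is a slightly more whitespace-tolerant variant of the one above.

```php
<?php
$html = '<div class="first-div class-one">'
      . '<div class="second-div"> soft </div>'
      . ' 130 cm / 15cm '
      . '</div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// text() selects only the div's own text children, so " soft " inside the
// nested div is skipped; [normalize-space()] drops whitespace-only nodes.
$query = '//div[contains(@class, "first-div")]/text()[normalize-space()]';

$value = null;
foreach ($xpath->query($query) as $text) {
    // Capture the number between the slash and the trailing "cm".
    if (preg_match('~/\s*(\d+)\s*cm\s*$~', $text->nodeValue, $m)) {
        $value = (int) $m[1];
    }
}

var_dump($value);
```

Checking the return value of preg_match() before reading $m[1] avoids an undefined-index notice when a text node does not have the expected shape.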
Let's say we have a nodeValue '92/100' or 'some/all', or ' 92 / 100' and 'some /all ' (with those spaces around the letters/numbers).
For example:
<?php
$strhtml='
<div>
<div>
<p>92/100</p>
</div>
<div>
<p>some/all</p>
</div>
<div>
<p> 92 / 100</p>
</div>
<div>
<p>some /all </p>
</div>
</div>';
$doc = new DOMDocument();
@$doc->loadHtml($strhtml);
$doc->preserveWhiteSpace = false;
$xpath = new DOMXPath( $doc );
$nodelist = $xpath->query('//div/div/p[1]');
foreach( $nodelist as $node ) {
$result = $node->nodeValue;
}
echo $result;
How to select only '92' and/or 'some'?
Thanks.
Is this more or less what you were asking for? Return the portion to the left of the / ?
$strhtml='
<div>
<div>
<p>92/100</p>
</div>
<div>
<p>some/all</p>
</div>
<div>
<p> 92 / 100</p>
</div>
<div>
<p>some /all </p>
</div>
</div>';
$doc = new DOMDocument();
$doc->loadHtml($strhtml);
$doc->preserveWhiteSpace = false;
$xpath = new DOMXPath( $doc );
$nodelist = $xpath->query('//div/div/p[1]');
foreach( $nodelist as $node ) {
list( $keep, $junk )=explode('/',$node->nodeValue );
$result = trim( $keep );
}
echo $result;
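One thing to watch: $result is overwritten on every pass, so only the last paragraph's value gets echoed. If you want the left-hand part of every paragraph, a sketch like the following collects them in an array instead ($results is an illustrative name, and the markup is condensed from the question).

```php
<?php
$strhtml = '<div>'
         . '<div><p>92/100</p></div>'
         . '<div><p>some/all</p></div>'
         . '<div><p> 92 / 100</p></div>'
         . '<div><p>some /all </p></div>'
         . '</div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($strhtml);
$xpath = new DOMXPath($doc);

$results = [];
foreach ($xpath->query('//div/div/p[1]') as $node) {
    // Limit of 2 splits on the first slash only; index 0 always exists,
    // even when the text contains no slash at all.
    $results[] = trim(explode('/', $node->nodeValue, 2)[0]);
}

print_r($results);
```

Taking index 0 of explode() rather than using list() also avoids a notice for paragraphs that happen to contain no slash.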
I want to remove all the anchor tags whose href starts with '/'. This is my code:
$html = <<<HTML
<ul>
<li><a href="/foo/bar1">link1</li>
<li><a href="/foo/bar2">link2</li>
<li><a href="/foo/bar3">link3</li>
</ul>
HTML;
$dom = new DOMDocument;
@$dom->loadHTML($html);
$tags = $dom->getElementsByTagName('a');
echo 'removed nodes:<br />';
foreach ($tags as $tag)
{
$href = $tag->getAttribute('href');
if($href[0] == '/')
{
echo $tag->nodeValue.'<br />';
$tag->parentNode->removeChild($tag);
}
}
echo 'remined content:<br />';
echo $dom->saveXML($dom);
but the problem is that some of them remain.
removed nodes:<br>
link1<br>
link3<br>
remined content:<br>
<ul><li>
</li><li>link2</li>
<li>
</li></ul>
any idea on how to do that?
thanks.
You can't remove DOMNodes from a DOMNodeList as you're iterating over them in a foreach loop (http://php.net/manual/en/domnode.removechild.php#90292). Though, making a queue of items to remove seems to work:
<?php
$html = <<<HTML
<ul>
<li><a href="/foo/bar1">link1</a></li>
<li><a href="/foo/bar2">link2</a></li>
<li><a href="/foo/bar3">link3</a></li>
</ul>
HTML;
$dom = new DOMDocument;
@$dom->loadHTML($html);
$domNodeList = $dom->getElementsByTagName('a');
$domElemsToRemove = array();
foreach ($domNodeList as $domElement ) {
$domElemsToRemove[] = $domElement;
}
echo 'removed nodes:<br />';
foreach ($domElemsToRemove as $tag)
{
$href = $tag->getAttribute('href');
if($href[0] == '/')
{
echo $tag->nodeValue.'<br />';
$tag->parentNode->removeChild($tag);
}
}
echo 'remined content:<br />';
echo $dom->saveXML($dom);
EDIT
Also, you forgot the closing </a> tags in your HTML.
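A more compact way to build the same queue, sketched here as an option, is iterator_to_array(), which copies the live node list into a plain PHP array before anything is removed. The sample markup, including the absolute link that should survive, is made up for the example, as are the $removed and $remaining names.

```php
<?php
$html = '<ul>'
      . '<li><a href="/foo/bar1">link1</a></li>'
      . '<li><a href="http://example.com/">link2</a></li>'
      . '<li><a href="/foo/bar3">link3</a></li>'
      . '</ul>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

$removed = [];
// Snapshot the live list first, so removals cannot disturb the iteration.
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
    // strncmp() is also safe for an empty href, unlike $href[0].
    if (strncmp($a->getAttribute('href'), '/', 1) === 0) {
        $removed[] = $a->nodeValue;
        $a->parentNode->removeChild($a);
    }
}

$remaining = $dom->getElementsByTagName('a')->length;

print_r($removed);
echo $remaining, "\n";
```

Looping backwards over the list by index with item($i) is another common approach; the snapshot version is simply the closest match to the queue idea above.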