I have this HTML structure:
<html>
<body>
<section>
<div>
<div>
<section>
<div>
<table>
<tbody>
<tr></tr>
<tr>
<td></td>
<td></td>
<td>
<i></i>
<div class="first-div class-one">
<div class="second-div"> soft </div>
130 cm / 15cm
</div>
</td>
</tr>
<tr></tr>
</tbody>
</table>
</div>
</section>
</div>
</div>
</section>
</body>
</html>
Now, I have this XPath code:
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.whatever.com');
$doc->preserveWhiteSpace = false;
$xpath = new DOMXPath( $doc );
$nodelist = $xpath->query( '/html/body/section/div[2]/section/div/table/tbody/tr[2]/td[3]/div' );
foreach ( $nodelist as $node ) {
$result = $node->nodeValue."\n";
}
This gets me 'soft 130 cm / 15cm' as a result.
But I want to know how to get only '15', so I need:
1. To know how to get rid of the childNode->nodeValue
2. Once I have '130 cm / 15cm', to know how to get only '15' as the nodeValue of a variable in PHP.
Can you guys help?
Thanks in advance
Text within a tag is also a node (a child), more particularly a DOMText.
By looking at the children of that div, you can find the DOMText and get its nodeValue. An example below:
$doc = new DOMDocument();
$doc->loadHTML("<html><body><p>bah</p>Test</body></html>");
echo $doc->saveHTML();
$xpath = new DOMXPath( $doc );
$nodelist = $xpath->query( '/html/body' );
foreach ( $nodelist as $node ) {
if ($node->childNodes)
foreach ($node->childNodes as $child) {
if($child instanceof DOMText)
echo $child->nodeValue."\n"; // should output "Test".
}
}
Your second point can easily be done with regular expressions:
$string = "130 cm / 15cm";
$matches = array();
preg_match('|/ ([0-9]+) ?cm$|', $string, $matches);
echo $matches[1];
Full Solution:
<?php
$strhtml = '
<html>
<body>
<section>
<div>
<div>
<section>
<div>
<table>
<tbody>
<tr></tr>
<tr>
<td></td>
<td></td>
<td>
<i></i>
<div class="first-div class-one">
<div class="second-div"> soft </div>
130 cm / 15cm
</div>
</td>
</tr>
<tr></tr>
</tbody>
</table>
</div>
</section>
</div>
</div>
</section>
</body>
</html>';
$doc = new DOMDocument();
@$doc->loadHTML($strhtml);
echo $doc->saveHTML();
$xpath = new DOMXPath( $doc );
$nodelist = $xpath->query( '/html/body/section/div/div/section/div/table/tbody/tr[2]/td[3]/div' );
foreach ( $nodelist as $node ) {
if ($node->childNodes)
foreach ($node->childNodes as $child) {
if($child instanceof DOMText && trim($child->nodeValue) != "")
{
echo 'Raw: '.trim($child->nodeValue)."\n";
$matches = array();
preg_match('|/ ([0-9]+) ?cm$|', trim($child->nodeValue), $matches);
echo 'Value: '.$matches[1]."\n";
}
}
}
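As an alternative sketch (not part of the answer above), the XPath query itself can select the div's own text nodes with text(), which avoids the instanceof DOMText check. The trimmed-down markup and the contains(@class, "first-div") selector are assumptions made for illustration; the absolute path from the full solution would work as well.

```php
<?php
// Hypothetical, trimmed-down markup that mirrors the nested divs from the question.
$html = '<html><body><div class="first-div class-one">'
      . '<div class="second-div"> soft </div> 130 cm / 15cm </div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// text() matches only the div's own text nodes, so the nested "soft" div is skipped;
// [normalize-space()] drops whitespace-only text nodes.
$texts = $xpath->query('//div[contains(@class, "first-div")]/text()[normalize-space()]');

$value = null;
foreach ($texts as $text) {
    // Capture the digits that follow the slash, e.g. "15" in "130 cm / 15cm".
    if (preg_match('~/\s*(\d+)\s*cm\s*$~', trim($text->nodeValue), $matches)) {
        $value = $matches[1];
    }
}

echo $value, "\n";
```

If no text node matches the pattern, $value stays null, so it is worth checking before using it.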
Related
I'm trying to scrape the site in the code below, but I would like the output in table format.
$url='http://www.arbworld.net/en/moneyway';
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->recover=true;
$dom->strictErrorChecking=false;
$dom->loadHTMLFile( $url );
libxml_clear_errors();
$xp=new DOMXPath( $dom );
$col=$xp->query('//table[@class="grid"]/tr[@class="belowHeader"]/td');
if( $col->length > 0 ){
foreach( $col as $node )echo $node->textContent;
}
Now the output is this:
Romanian Liga I22.Dec 18:00:00 FCSBUniversitat2.063.33.999.9 %€ 2070.1
%€ 00 %€ 0€ 207 22.Dec 18:00:00 Italian Serie A22.Dec 11:30:00
AtalantaAC Milan1.8844.499.7 %€ 21 5580.1 %€ 170.2 %€ 46€ 21 622
22.Dec 11:30:00 English League 221.Dec 15:00:00
You should retrieve the rows instead of the columns (without the /td at the end), then simply put everything into an HTML table, with one <tr> for each row:
<?php
// your current code
$xp = new DOMXPath($dom);
$rows = $xp->query('//table[@class="grid"]/tr[@class="belowHeader"]');
?>
<table>
<tbody>
<?php foreach ($rows as $row): ?>
<tr>
<?php foreach ($row->childNodes as $col): ?>
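<?php // Skip whitespace text nodes between cells: only elements have getAttribute() ?>
<?php if ($col instanceof DOMElement && $col->getAttribute('style') !== 'display:none'): ?>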
<?php foreach ($col->childNodes as $colPart): ?>
<?php if ($colText = trim($colPart->textContent)): ?>
<td><?= htmlspecialchars($colText) ?></td>
<?php elseif ($colPart instanceof DOMElement && $colPart->tagName === 'a'): ?>
<?php
$href = $colPart->getAttribute('href');
if (strpos($href, 'javascript') !== 0):
?>
<td><?= htmlspecialchars($href) ?></td>
<?php endif ?>
<?php endif ?>
<?php endforeach ?>
<?php endif ?>
<?php endforeach ?>
</tr>
<?php endforeach ?>
</tbody>
</table>
I have the following:
<div id="content">
<div class="content-top">bla</div>
<div class="inner text-inner">
bla bla bla
</div>
</div>
and the PHP:
$page = file_get_contents('http://www.example.com/test');
$doc = new DOMDocument();
@$doc->loadHTML($page);
$node = $doc->getElementById('content');
How can I modify $node = $doc->getElementById('content'); so I can target <div class="inner text-inner">?
You can use XPath to easily achieve it.
$page = file_get_contents('http://www.example.com/test');
$doc = new DOMDocument();
$doc->loadHTML($page);
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query("//div[@class='inner text-inner']");
$node = $nodeList->item(0);
// To check the result:
echo "<p>" . $node->nodeValue . "</p>";
This will output:
bla bla bla
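Note that a predicate comparing @class to 'inner text-inner' matches the attribute as an exact string, so it stops matching if the classes are reordered or another class is added. A possible sketch of a token-based match follows; restricting the search to the #content div is an assumption based on the question's markup, which is loaded from a string here so the snippet runs offline.

```php
<?php
// Markup copied from the question; loaded from a string so the sketch runs offline.
$page = '<div id="content">
<div class="content-top">bla</div>
<div class="inner text-inner">
bla bla bla
</div>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($page);
$xpath = new DOMXPath($doc);

// Pad the class attribute with spaces so "inner" matches as a whole token,
// regardless of the order or number of other classes on the element.
$query = '//div[@id="content"]/div[contains(concat(" ", normalize-space(@class), " "), " inner ")]';
$node = $xpath->query($query)->item(0);

// item(0) returns null when nothing matched, so guard before reading the value.
$text = $node !== null ? trim($node->nodeValue) : '';
echo $text, "\n";
```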
Let's say we have a nodeValue '92/100' or 'some/all', or ' 92 / 100' and 'some /all ' (with those spaces around the letters/numbers).
For example:
<?php
$strhtml='
<div>
<div>
<p>92/100</p>
</div>
<div>
<p>some/all</p>
</div>
<div>
<p> 92 / 100</p>
</div>
<div>
<p>some /all </p>
</div>
</div>';
$doc = new DOMDocument();
@$doc->loadHTML($strhtml);
$doc->preserveWhiteSpace = false;
$xpath = new DOMXPath( $doc );
$nodelist = $xpath->query('//div/div/p[1]');
foreach( $nodelist as $node ) {
$result = $node->nodeValue;
}
echo $result;
How to select only '92' and/or 'some'?
Thanks.
Is this more or less what you were asking for? It returns the portion to the left of the /:
$strhtml='
<div>
<div>
<p>92/100</p>
</div>
<div>
<p>some/all</p>
</div>
<div>
<p> 92 / 100</p>
</div>
<div>
<p>some /all </p>
</div>
</div>';
$doc = new DOMDocument();
$doc->loadHtml($strhtml);
$doc->preserveWhiteSpace = false;
$xpath = new DOMXPath( $doc );
$nodelist = $xpath->query('//div/div/p[1]');
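foreach( $nodelist as $node ) {
// Keep only the part to the left of the first slash.
list( $keep ) = explode( '/', $node->nodeValue, 2 );
$result = trim( $keep );
// Echo inside the loop so every match is printed, not just the last one.
echo $result . "\n";
}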
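The same split can also be done inside XPath, as a sketch rather than a replacement for the explode() approach: substring-before() keeps the text left of the first slash and normalize-space() trims it. The $results array is a name introduced here for illustration.

```php
<?php
$strhtml = '<div>
<div><p>92/100</p></div>
<div><p>some/all</p></div>
<div><p> 92 / 100</p></div>
<div><p>some /all </p></div>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($strhtml);
$xpath = new DOMXPath($doc);

$results = [];
foreach ($xpath->query('//div/div/p[1]') as $node) {
    // evaluate() returns a plain string for a string-valued expression;
    // the second argument makes "." refer to the current <p> node.
    $results[] = $xpath->evaluate('normalize-space(substring-before(., "/"))', $node);
}

print_r($results);
```

One caveat: substring-before() returns an empty string when the text contains no slash, so values without a separator would need separate handling.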
I have the following content:
<div class="item">
<a href="ONE">
<img src="TWO">
</a>
</div>
I want to use XPath to pull out "ONE" and "TWO" from there.
The code I have right now is:
$html = file_get_contents($_POST['url']);
$document = new DOMDocument();
$document->loadHTML ($html);
$selector = new DOMXPath($document);
$query = '//div[@class="item"]';
$anchors = $selector->query($query);
foreach ($anchors as $node) {
// print ONE;
// print TWO;
}
Here comes an example:
$html = <<<EOF
<div class="item">
<a href="ONE">
<img src="TWO">
</a>
</div>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);
$links = $selector->query(
'//div[@class="item"]//@href | //div[@class="item"]//@src'
);
foreach($links as $link) {
echo $link->nodeValue . PHP_EOL;
}
If you want to break it down by <div class="item"> you can use the following code:
foreach($selector->query('//div[@class="item"]') as $div) {
foreach($selector->query('.//@href | .//@src', $div) as $link) {
echo $link->nodeValue . PHP_EOL;
}
}
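If the two values need to stay paired per item, one possible sketch is to read each attribute with evaluate() and string(), which yields the first matching attribute value or an empty string when there is none. The $items array and its 'href'/'src' keys are names chosen here for illustration.

```php
<?php
$html = '<div class="item"><a href="ONE"><img src="TWO"></a></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);

$items = [];
foreach ($selector->query('//div[@class="item"]') as $div) {
    // The paths are relative to the current div, passed as the context node.
    $items[] = [
        'href' => $selector->evaluate('string(.//a/@href)', $div),
        'src'  => $selector->evaluate('string(.//img/@src)', $div),
    ];
}

print_r($items);
```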
I want to remove all the anchor tags whose href starts with '/'. This is my code:
$html = <<<HTML
<ul>
<li><a href="/foo/bar1">link1</li>
<li><a href="/foo/bar2">link2</li>
<li><a href="/foo/bar3">link3</li>
</ul>
HTML;
$dom = new DOMDocument;
@$dom->loadHTML($html);
$tags = $dom->getElementsByTagName('a');
echo 'removed nodes:<br />';
foreach ($tags as $tag)
{
$href = $tag->getAttribute('href');
if($href[0] == '/')
{
echo $tag->nodeValue.'<br />';
$tag->parentNode->removeChild($tag);
}
}
echo 'remaining content:<br />';
echo $dom->saveXML($dom);
But the problem is that it leaves some of them behind:
removed nodes:<br>
link1<br>
link3<br>
remaining content:<br>
<ul><li>
</li><li>link2</li>
<li>
</li></ul>
Any idea how to do that?
Thanks.
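You can't reliably remove DOMNodes from a DOMNodeList while you're iterating over it in a foreach loop (http://php.net/manual/en/domnode.removechild.php#90292). The list returned by getElementsByTagName is live, so each removal shifts the remaining items and the loop skips some of them. Copying the items into a queue first and removing them from there seems to work: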
<?php
$html = <<<HTML
<ul>
<li><a href="/foo/bar1">link1</a></li>
<li><a href="/foo/bar2">link2</a></li>
<li><a href="/foo/bar3">link3</a></li>
</ul>
HTML;
$dom = new DOMDocument;
@$dom->loadHTML($html);
$domNodeList = $dom->getElementsByTagName('a');
$domElemsToRemove = array();
foreach ($domNodeList as $domElement ) {
$domElemsToRemove[] = $domElement;
}
echo 'removed nodes:<br />';
foreach ($domElemsToRemove as $tag)
{
$href = $tag->getAttribute('href');
if(strpos($href, '/') === 0) // also safe when href is empty or missing
{
echo $tag->nodeValue.'<br />';
$tag->parentNode->removeChild($tag);
}
}
echo 'remaining content:<br />';
echo $dom->saveXML($dom);
EDIT
Also, you forgot the closing </a> tags in your HTML.
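A shorter variant of the same idea, as a sketch only: let XPath pick the anchors whose href starts with "/", then copy the result into a plain array with iterator_to_array() before removing anything. The second link's absolute URL is a hypothetical value added here so that one anchor survives.

```php
<?php
$html = '<ul>
<li><a href="/foo/bar1">link1</a></li>
<li><a href="http://example.com/">link2</a></li>
<li><a href="/foo/bar3">link3</a></li>
</ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select only the anchors whose href starts with "/", then copy the result
// into a plain array so removals cannot disturb the iteration.
$targets = iterator_to_array($xpath->query('//a[starts-with(@href, "/")]'));

$removed = [];
foreach ($targets as $tag) {
    $removed[] = $tag->nodeValue;
    $tag->parentNode->removeChild($tag);
}

// Count the anchors that are still in the document after the removals.
$remaining = $dom->getElementsByTagName('a')->length;

echo 'removed: ', implode(', ', $removed), "\n";
echo 'anchors left: ', $remaining, "\n";
```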