parsing using DOM Element - php

Suppose I have some HTML and I want to parse something from it.
In my HTML I know the value A, and I want to find out what is in C.
I start by getting all td elements, but what do I do next?
I need to check something like "if this td has A as its value, then read what is written in the third td after it". How can I write that?
$some_code = ('
....
<tr><td>A</td><td>...</td><td>c</td></tr>
.....
');
$doc = new DOMDocument();
$doc->loadHTML($some_code);
$just_td = $doc->getElementsByTagName('td');
foreach ($just_td as $t) {
some code....
}

With XPath:
/html/body//tr/td[text()="A"]/following-sibling::td[3]
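will find the third following sibling of a td element whose text content is A and that is a child of a tr element anywhere below the html body element. In the sample row above, C is only two cells after A, so there you would use following-sibling::td[2] instead.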
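The XPath approach can be sketched end to end with DOMXPath. This is a minimal example under assumptions: the table markup below is hypothetical (the real page is elided in the question), and it uses td[2] because in this sample C sits two cells after A.

```php
<?php
// Hypothetical sample markup; the asker's real HTML is not shown.
$some_code = '<table>
<tr><td>A</td><td>B</td><td>C</td></tr>
<tr><td>X</td><td>Y</td><td>Z</td></tr>
</table>';

$doc = new DOMDocument();
$doc->loadHTML($some_code);
$xpath = new DOMXPath($doc);

// Second following sibling of the td whose text is exactly "A".
$cells = $xpath->query('//tr/td[text()="A"]/following-sibling::td[2]');

// query() returns a DOMNodeList; it is empty when nothing matches.
$result = $cells->length > 0 ? $cells->item(0)->textContent : null;

echo $result, "\n";
```

Change the index in td[2] to match how many cells lie between the known cell and the one you want.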

Related

Traverse HTML file elements with PHP

I need to read an HTML file (whose structure I don't know in advance) and go through all of its elements. For those elements that contain text of their own, I'd like to grab or modify that text. I've searched exhaustively but can't find anything that does what I need.
Here's an example HTML file:
<!DOCTYPE html>
<html lang="en">
<body>
<p> 1st text I need</p>
2nd text I need
<table>
<tr>
<td>3rd text I need</td>
</tr>
</table>
</body>
</html>
Here's what I need to accomplish:
Traverse file
Find which elements have innerhtml
Grab or modify the text
Save the file
In the file above almost all elements have text but complex files won't.
I can use DOMDocument() to loop through specific types of nodes but I don't know what I'm going to encounter until a file is selected.
I thought the code below would do it but it prints just the file name during the loop.
<?php
include 'functions.php';
$doc = new DOMDocument();
$doc->loadHTMLFile('test.html');
showDOMNode($doc);
function showDOMNode($domNode) {
foreach ($domNode->childNodes as $node)
{
if($node->nodeName !="#text") {
echo $node->nodeName . ' ';
echo $node->nodeType . ' ';
echo $node->textContent . '<br>';
if($node->hasChildNodes()) {
showDOMNode($node);
}
}
}
}
?>
Here's what I get:
html 10
html 1 1st text I need 2nd text I need 3rd text I need
body 1 1st text I need 2nd text I need 3rd text I need
p 1 1st text I need
a 1 2nd text I need
table 1 3rd text I need
tr 1 3rd text I need
td 1 3rd text I need
As you can see, textContent shows the combined text of all child nodes, while I need only the text that belongs to each node itself. Any help is much appreciated.
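One way to get each node's own text, rather than the combined text of its descendants, is to work on the text nodes directly. The following is only a sketch under assumptions: it inlines the sample HTML instead of reading test.html, and it uppercases each text node purely to illustrate a modification.

```php
<?php
$html = '<!DOCTYPE html>
<html lang="en">
<body>
<p> 1st text I need</p>
2nd text I need
<table>
<tr>
<td>3rd text I need</td>
</tr>
</table>
</body>
</html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$found = [];
// Select every text node under body that is not whitespace-only.
foreach ($xpath->query('//body//text()[normalize-space()]') as $textNode) {
    // Grab the text together with the name of the element that owns it.
    echo $textNode->parentNode->nodeName, ': ', trim($textNode->nodeValue), "\n";
    $found[] = trim($textNode->nodeValue);

    // Modify only this text node; sibling and child elements are untouched.
    $textNode->nodeValue = strtoupper($textNode->nodeValue);
}

// Serialize the modified document; write this string to a file to save it.
$output = $doc->saveHTML();
echo $output;
```

Because the loop visits text nodes rather than elements, it does not need to know in advance which tags the file contains.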

php dom parser return parent and child

I think this is a simple question, but I can't work it out. I am trying to get all heading tags with the Simple HTML DOM parser, and my code works only one way. For example:
$heading['h2']=$html->find('h2 a');//works fine
I have found some sites wrap the h2 within the a tag like this
<a href='#'><h2> my heading</h2></a>
The problem is trying to get both tags so I can display the link with it. So when I do this
$heading['h2']=$html->find('a h2');
I get the h2 fine, but the link tag is not wrapped around it. That makes sense, since the selector finds all h2 tags that are descendants of an a, but how do I get the entire parent tag? What I want it to return is
<a href='#'><h2>My Headings</h2></a>
so that I can just print the output with
echo $heading['h2']; // and the link will be there
If the <a href="[..]"> is just the outer element, i.e. the direct parent of the h2, you can do it like this:
$heading['h2']=$html->find('a h2');
foreach ($heading['h2'] as $h2) {
echo $h2->parent(), "\n";
}
You could also go up the DOM tree until you reach an <a> tag:
$heading['h2']=$html->find('a h2');
foreach ($heading['h2'] as $h2) {
$a = $h2;
while ($a && $a->tag != "a") $a = $a->parent(); // climb until an <a> is reached
if (!$a) continue; // no <a> above <h2>
echo $a, "\n";
}
Well, my first thought would be to use
$html->find('a');
But I'm guessing you have multiple links on your page. So the correct practice would then be to use an ID (or a class) to identify your link
<a id="titleLink" href="#"><h2> my heading</h2></a>
And then search for that specific ID:
$html->find('a#titleLink');
I don't know what library you're using and what syntax it supports, but I hope you get the idea anyway.
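According to the docs, find('a > h2') selects each h2 that is a direct child of an anchor, and calling parent() on one of those elements returns the anchor wrapping it. Note that find() without an index returns an array, so $html->find('a > h2')->parent() will not work as written: pass an index, as in $html->find('a > h2', 0)->parent(), or loop over the results with foreach. Another option: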
$info = $html->find('a,h2');
echo '<a href='.$info[0]->href.'>'.$info[1]->innertext.'</a>';
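For comparison, the same "give me the anchor that wraps the h2" lookup can be sketched with PHP's built-in DOM extension instead of Simple HTML DOM. This swaps in a different library from the one used in the question, and the markup and URLs below are hypothetical.

```php
<?php
// Hypothetical markup: one anchor wrapping an h2, one h2 wrapping an anchor.
$html = '<div><a href="/first"><h2>My Heading</h2></a>'
      . '<h2><a href="/second">Other Heading</a></h2></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$links = [];
// Select anchors that have an h2 as a direct child.
foreach ($xpath->query('//a[h2]') as $a) {
    // saveHTML($node) serializes the anchor together with its contents.
    $links[] = $doc->saveHTML($a);
}

foreach ($links as $link) {
    echo $link, "\n";
}
```

The predicate [h2] keeps only anchors with an h2 directly inside them, so the second heading (an h2 that contains the link) is not matched.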

Stripping span tag from simple html dom parser

Hi, I don't want to include the span tag, which is a child of the tag I am extracting my data from.
Ex:- <a class="imp">
Some data 1 2 3
<span>
Unwanted Data
</span>
</a>
Code i am using:-
foreach($html->find('a.imp') as $value)
{
echo $value->innertext;
}
Output:-
Some data 1 2 3
Unwanted Data...
Desired output:-
Some data 1 2 3
I really don't know whether there is a function or some other way to exclude the child tags.
I believe you would have to loop through your first set of results, find all span elements and set each span element's outertext to an empty string, thus removing the entire HTML for that element.
foreach($html->find('a.imp') as $value)
{
foreach($value->find('span') as $e)
{
$e->outertext = '';
}
echo $value->innertext;
}
Simple HTML DOM Parser will work:
$content = file_get_html($link);
$stuffiwant = $content->find("//a/text()");
var_dump($stuffiwant);
I don't believe Simple HTML DOM has a clean way to remove elements. In phpQuery you can:
$doc->find('a.imp span')->remove();
echo $doc->find('a.imp')->text();
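If the built-in DOM extension is an option, the anchor's own text can be read without touching the span at all. This is a sketch using DOMXPath rather than Simple HTML DOM or phpQuery, and it assumes the class attribute is exactly "imp".

```php
<?php
$html = '<a class="imp">
Some data 1 2 3
<span>
Unwanted Data
</span>
</a>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$ownText = '';
// text() selects only text nodes that are direct children of the anchor,
// so everything inside the nested <span> is skipped.
foreach ($xpath->query('//a[@class="imp"]/text()') as $textNode) {
    $ownText .= $textNode->nodeValue;
}

$ownText = trim($ownText);
echo $ownText, "\n";
```

Unlike the outertext approach, this leaves the parsed document unchanged.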

XPath - Get text from parent using php xpath

I am trying to get the text from a specific node's parent. For example:
<td colspan="1" rowspan="1">
<span>
<a class="info" shape="rect"
rel="empLinkData" href="/employee.htm?id=8468524">
Jack Johnson
</a>
</span>
(*)
</td>
I am able to successfully process the anchor tag by using:
$xNodes = $xpath->query('//a[@class="info"][@rel="empLinkData"]');
// $xNodes contains employee ids and names
foreach ($xNodes as $xNode)
{
$sLinktext = @$xNode->firstChild->data;
$sLinkurl = 'http://www.company.com' . $xNode->getAttribute('href');
if ($sLinktext != '' && $sLinkurl != '')
{
echo '<li><a href="' . $sLinkurl . '">' .
$sLinktext . '</a></li>';
}
}
Now, I need to retrieve the text from the <td> tag (in this case, the (*) appearing right after the span tag closes), but I can't seem to refer to it properly.
The xpath for this that seems to make the most sense to me is:
$xNodes = $xpath->query('//a[@class="info"]
[@rel="empLinkData"]/ancestor::*');
but it is retrieving the wrong data from elsewhere nested above this code.
It's not necessary to retreat back up the tree. Instead, directly select the td that contains the relevant element:
//td[descendant::a[@class="info"][@rel="empLinkData"]]/text()
Edit: As @Dimitre rightly pointed out, this selects all text children. Your td has two such nodes: the whitespace-only text node that precedes the span and the text node that follows it. If you only want the second text node, then use:
//td[descendant::a[@class="info"][@rel="empLinkData"]]/text()[2]
Or:
//td[descendant::a[@class="info"][@rel="empLinkData"]]/text()[last()]
As you can see, the resulting expressions are essentially the same, but you do need to target the correct text node (if you want only one). Note also that if the target text is truly in a td then it's safer to target that element type directly (without wildcards). As this is HTML, your actual document almost certainly contains several other elements, including multiple other anchors that you may not want to target.
Sample PHP:
$nodes = $xpath->query(
'//td[descendant::a[@class="info"][@rel="empLinkData"]]/text()[last()]');
echo "[". $nodes->item(0)->nodeValue . "]";
Deepest td ancestor:
//a[@class="info"][@rel="empLinkData"]/ancestor::td[1]
Use:
//*[a[@class="info"][@rel="empLinkData"]]/following-sibling::text()[1]
This selects a single text node -- exactly the wanted one.
Do note that an XPath expression like:
//td[descendant::a[@class="info"][@rel="empLinkData"]]/text()
selects more than one text node, not only the wanted one.
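The two single-node expressions discussed above can be tried side by side on the td from the question. This is a self-contained sketch: the surrounding table and tr tags are added here as an assumption so the fragment parses as a table cell.

```php
<?php
$html = '<table><tr><td colspan="1" rowspan="1">
<span>
<a class="info" shape="rect"
rel="empLinkData" href="/employee.htm?id=8468524">
Jack Johnson
</a>
</span>
(*)
</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Shared step that identifies the anchor of interest.
$anchor = 'a[@class="info"][@rel="empLinkData"]';

// Last text child of the td that contains the anchor.
$viaTd = $xpath->query('//td[descendant::' . $anchor . ']/text()[last()]');

// First text node following the element that directly wraps the anchor.
$viaSibling = $xpath->query('//*[' . $anchor . ']/following-sibling::text()[1]');

$fromTd      = trim($viaTd->item(0)->nodeValue);
$fromSibling = trim($viaSibling->item(0)->nodeValue);

echo "[", $fromTd, "] [", $fromSibling, "]\n";
```

Both expressions reach the same text node here; the sibling form does not depend on how many text nodes precede the span inside the td.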

search for element name using PHP simple HTML dom parser

I'm hoping someone can help me. I'm using PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/manual.htm) successfully, but I am now trying to find elements based on a certain name. For example, in the fetched HTML, there might be tags such as:
<p class="mattFacer">Matt Facer</p>
<p class="mattJones">Matt Jones</p>
<p class="daveSmith">DaveS Smith</p>
What I need to do is to read in this HTML and capture any HTML elements which match anything beginning with the word, "matt"
I've tried
$html = str_get_html("http://www.testsite.com");
foreach($html->find('matt*') as $element) {
echo $element;
}
but this doesn't work. It returns nothing.
Is it possible to do this? I basically want to search for any HTML element which contains the word "matt". It could be a span, div or p.
I'm at a dead end here!
$html = file_get_html("http://www.testsite.com"); // str_get_html() expects an HTML string, not a URL
foreach($html->find('[class*=matt]') as $element) {
echo $element;
}
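Try that. [class*=matt] matches any class that contains "matt"; use [class^=matt] to match only classes that begin with it.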
Maybe this?
foreach(array_merge($html->find('[class*=matt]'),$html->find('[id*=matt]')) as $element) {
echo $element;
}
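The same "class begins with matt" match can also be sketched with the built-in DOM extension and XPath's starts-with() function. This uses DOMXPath rather than Simple HTML DOM, and it parses the sample markup from the question as an inline string instead of fetching a URL.

```php
<?php
$html = '<p class="mattFacer">Matt Facer</p>
<p class="mattJones">Matt Jones</p>
<p class="daveSmith">DaveS Smith</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$names = [];
// Any element (p, div, span, ...) whose class attribute begins with "matt".
// starts-with() is case-sensitive; use contains() to match a substring instead.
foreach ($xpath->query('//*[starts-with(@class, "matt")]') as $element) {
    $names[] = $element->textContent;
}

print_r($names);
```

To cover id attributes as well, the predicate can be extended to [starts-with(@class, "matt") or starts-with(@id, "matt")].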
