Step through DOMDocument tag by tag - php

Strangely I can't find an answer for this, though it seems like it must have been asked before. I have a DOMDocument in PHP and I want to step through each HTML tag as if it were a flat document. I need to inspect each element, looking at the tag name and at specific attribute values. I don't think I can use XPath in this instance because, although the structure of the HTML remains the same, the attributes can be different depending on when the document is parsed.
My document is a little unusual like this
<tr class='THIS COULD BE ONE OF THREE DIFFERENT CLASSES' id='UNIQUE ID'>
<td class='statistics show' >
<button class="js-hide">Show</button>
</td>
<td class='details'>
<p>
<span class='home'>
<a href='LINK'>TEAM 1</a> </span>
<span class='COULD BE ONE OF TWO DIFFERENT CLASSES'> VARIABLE CONTENT </span> <span class='away'>
<a href='LINK'>TEAM 2</a> </span>
</p>
</td>
<td class='COULD BE ONE OF THREE CLASS TYPES'>
VARIABLE CONTENT</td>
<td class='status'>
</td>
</tr>
There are other tags around the document, but there are a number of duplicated sections like that one which I would like to pull out. I can't see how XPath would allow me to parse this sensibly, so going tag by tag seems to be my only option, but I can't find the correct way to do it. Any suggestions?

You could use getElementsByTagName('*') to get all elements in document order and loop through those.
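A minimal sketch of that loop, assuming the markup is loaded with loadHTML. The sample HTML is a shortened, hypothetical version of the row in the question, and the check for the 'home' class is only one example of the kind of inspection the question describes.

```php
<?php
// Hypothetical sample modelled on the question; real input would be the parsed page.
$html = <<<HTML
<table>
<tr class='some-class' id='row-1'>
<td class='details'><p>
<span class='home'><a href='/team1'>TEAM 1</a></span>
<span class='away'><a href='/team2'>TEAM 2</a></span>
</p></td>
</tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep warnings from imperfect real-world HTML quiet
$doc->loadHTML($html);
libxml_clear_errors();

$visited = [];
// '*' matches every element; the list is in document order, so this reads like a flat walk.
foreach ($doc->getElementsByTagName('*') as $element) {
    $visited[] = $element->nodeName;
    if ($element->nodeName === 'span' && $element->getAttribute('class') === 'home') {
        echo 'Home team: ', trim($element->textContent), PHP_EOL;
    }
}
```

Because each element is visited exactly once, state such as "the id of the current row" can be kept in ordinary variables between iterations.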

Related

Returning specific DIV via specific TAG - PHP XPATH [duplicate]

I'm using Html Agility Pack to run xpath queries on a web page. I want to find the rows in a table which contain a certain interesting element. In the example below, I want to fetch the second row.
<table name="important">
<tr>
<td>Stuff I'm NOT interested in</td>
</tr>
<tr>
<td>Stuff I'm interested in</td>
<td><interestingtag/></td>
<td>More stuff I'm interested in</td>
</tr>
<tr>
<td>Stuff I'm NOT interested in</td>
</tr>
<tr>
<td>Stuff I'm NOT interested in</td>
</tr>
</table>
I'm looking to do something like this:
//table[@name='important']/tr[has a descendant named interestingtag]
Except with valid xpath syntax. ;-)
I suppose I could just find the interesting element itself and then work my way up the parent chain from the node that's returned, but it seemed like there ought to be a way to do this in one step and I'm just being dense.
"has a descendant named interestingtag" is spelled .//interestingtag in XPath, so the expression you are looking for is:
//table[@name='important']/tr[.//interestingtag]
Actually, you need to look for a descendant, not a child:
//table[@name='important']/tr[descendant::interestingtag]
I know this isn't what the OP was asking, but if you wanted to find an element that had a descendant with a particular attribute, you could do something like this:
//table[@name='important']/tr[.//*[@attr='value']]
I know it is a late answer, but why not go the other way around? Find all <interestingtag/> tags and then select the ancestor <tr> tag.
//interestingtag/ancestor::tr
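The question is about Html Agility Pack, but the expressions themselves are plain XPath 1.0. As a rough sketch, here they are run through PHP's DOMXPath instead. The sample is loaded with loadXML because <interestingtag/> is not a real HTML element; that choice, and the variable names, are assumptions made for illustration.

```php
<?php
// Shortened copy of the table from the question (well-formed, so loadXML accepts it).
$xml = <<<XML
<table name="important">
<tr><td>Stuff I'm NOT interested in</td></tr>
<tr><td>Stuff I'm interested in</td><td><interestingtag/></td><td>More stuff I'm interested in</td></tr>
<tr><td>Stuff I'm NOT interested in</td></tr>
</table>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Rows that have an <interestingtag> anywhere below them.
$byPredicate = $xpath->query("//table[@name='important']/tr[.//interestingtag]");

// The reverse direction: start at the tag and walk up to its enclosing row.
$byAncestor = $xpath->query("//interestingtag/ancestor::tr");

foreach ($byPredicate as $row) {
    echo trim($row->firstChild->textContent), PHP_EOL;
}
```

One difference to keep in mind: with nested tables, ancestor::tr returns every enclosing row, while the predicate form only returns rows that are direct children of the named table.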

DOMNode->childNodes->length returns incorrect value

I have a PHP script that parses a webpage and navigates through it using the DOMDocument and DOMXPath libraries. When running $tr->childNodes->length to count the 3 <td> elements, the instruction returns 6, where 0 returns the first <td>, 1 is a blank string(19), 2 is the second <td>, 3 is again the blank string(19), 4 is the third <td>, 5 is once more the blank string(19) and 6 is the entire HTML of the page (tested using $dom->saveHTML($tr->childNodes->item(0)) etc.).
How do I make ->length return the correct number? Why does it behave so strangely?
<tr>
<td>
<span>...</span>
</td>
<td>
<img ...>
</td>
<td>
<div>
<span>
...
<br>
...
</span>
<span>...</span>
<br><br>
..., ...
</div>
<div>
... | ...
</div>
</td>
</tr>
Please note that i omitted some attributes like style, class, data, etc.
This behaviour is not really "strange". In the DOM, the whitespace between tags (including line breaks) is stored as text nodes, so childNodes contains those text nodes alongside the <td> elements. To get the "correct" number of children, you have to either remove the whitespace between tags from the document before parsing it, or loop over the children and skip every node that is not an element.
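A small sketch of the second option, filtering the child list by node type. The three-cell row is a made-up stand-in for the markup in the question.

```php
<?php
// Hypothetical row with line breaks between the cells, as in the question.
$html = "<table><tr>\n<td>one</td>\n<td>two</td>\n<td>three</td>\n</tr></table>";

$doc = new DOMDocument();
$doc->loadHTML($html);
$tr = $doc->getElementsByTagName('tr')->item(0);

// Keep only element nodes; whitespace shows up as text nodes (XML_TEXT_NODE).
$cells = [];
foreach ($tr->childNodes as $child) {
    if ($child->nodeType === XML_ELEMENT_NODE) {
        $cells[] = $child;
    }
}

echo count($cells), PHP_EOL; // prints 3
```

An XPath query relative to the row, such as $xpath->query('td', $tr), would give the same three cells without the manual filter.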

Php_simple_html_dom on a table

I would like to extract data from a website, whose code is written like this:
...
<tr>
<td class="something1"><a class="whatever" href="#">NAME</a> </td>
<td class="something2">DATA</td>
<td class="something3">NUMERIC DATA</td>
</tr>
...
In particular, I have my NAME list in my MySQL database, and if my NAME is equal to a NAME on this website, I want to print the corresponding NUMERIC DATA on my website.
I know I can do something with php_simple_html_dom but I cannot really achieve this action. Can you please help me?
Thanks!
So you want to read NAME first and, if it is relevant, then read the rest? You can fetch the page's HTML as explained here: How do I get the HTML code of a web page in PHP?
$html = file_get_contents('http://pathToTheWebsite.com/thePage');
Now let's parse $html with a regex (you can use that library too; its documentation tells you how to do it).
preg_match('~<td class="something1"><a class="whatever" href="#">(?<name>[^<]+)</a>\s*</td>~', $html, $matches);
Now $matches['name'] will contain the NAME. Note the ~ delimiter, which avoids clashing with the slashes in the closing tags, and [^<]+, which captures the whole name rather than a single character. You can do the same for the rest and maybe clean up that regex a little; this is just an example.
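As an alternative to the regex, here is a hedged sketch using PHP's built-in DOMDocument and DOMXPath rather than php_simple_html_dom. The function name findNumericData, the sample names and the sample numbers are all hypothetical; only the class names come from the question.

```php
<?php
// Hypothetical page fragment using the class names from the question.
$html = <<<HTML
<table>
<tr>
<td class="something1"><a class="whatever" href="#">Alice</a> </td>
<td class="something2">DATA</td>
<td class="something3">42</td>
</tr>
<tr>
<td class="something1"><a class="whatever" href="#">Bob</a> </td>
<td class="something2">DATA</td>
<td class="something3">7</td>
</tr>
</table>
HTML;

// Returns the text of the "something3" cell in the row whose name matches, or null.
function findNumericData(string $html, string $name): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query("//tr[td[@class='something1']]") as $row) {
        $nameCell = $xpath->query("td[@class='something1']", $row)->item(0);
        if (trim($nameCell->textContent) !== $name) {
            continue;
        }
        $dataCell = $xpath->query("td[@class='something3']", $row)->item(0);
        return $dataCell === null ? null : trim($dataCell->textContent);
    }
    return null;
}

echo findNumericData($html, 'Bob'), PHP_EOL;
```

In the real script, $html would come from file_get_contents and the names from the MySQL query, with one call per name.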

TCPDF: Writing multiple images after eachother using writeHTML renders images in a "staircase" shape instead of in a straight line

I'm using the PHP class TCPDF's writeHTML method for creating a PDF document. It works great and seems to cover what I need, but when I try to create multiple images in a straight line using HTML elements in a sequence the images are not rendered in the straight line that I expect. Instead, the position of every sequential image is increased (or decreased in some cases) by a few pixels on the y axis, hence making the sequence of images look like a "staircase":
What I expect (every x is a picture):
x x x
What I get:
x
  x
    x
Sometimes I get it the other way around:
    x
  x
x
The HTML markup looks like this:
<img src="x.png"><img src="x.png"><img src="x.png"><img src="x.png">
HTML does not normally behave this way and I have not found any solutions by Googling. Any help would be appreciated! Thanks.
Have you considered trying to put it in a table?
<table>
<tr>
<td>
<img src="x.png">
</td>
<td>
<img src="x.png">
</td>
<td>
<img src="x.png">
</td>
<td>
<img src="x.png">
</td>
</tr>
</table>

Using regex in php to add a cell in a row

As usual I have trouble writing a good regex.
I am trying to make a plugin for Joomla to add a button to the optional print, email and PDF buttons produced by the core on the right of article titles. If I succeed I will distribute it under the GPL. None of the examples I found seem to work and I would like to create a php-only solution.
The idea is to use the unique pattern of the Joomla output for article titles and buttons for one or more regex. One regex would find the right table by looking for a table with class "contentpaneopen" (of which there are several in a page) and containing a cell with class "contentheading". A second regex could check if in that table there is a cell with class "buttonheading". The number of these cells could be from zero to three but I could use this check if the first regex returns more than one match. With this, I would like to replace the table by the same table but with an extra cell holding the button I want to add. I could do that by taking off the last row and table closing tags and inserting my button cell before adding those closing tags again.
The normal Joomla output looks like this:
<table class="contentpaneopen">
<tbody>
<tr>
<td width="100%" class="contentheading">
<a class="contentpagetitle" href="url">Title Here</a>
</td>
<td width="100%" align="right" class="buttonheading">
<a rel="nofollow" onclick="etc" title="PDF" href="url"><img alt="PDF" src="/templates/neutral/images/pdf_button.png"/></a>
</td>
<td width="100%" align="right" class="buttonheading">
<a rel="nofollow" onclick="etc" title="Print" href="url"><img alt="Print" src="/templates/neutral/images/printButton.png" ></a>
</td>
</tr>
</tbody>
</table>
The code would very roughly be something like this:
$subject = $article;
$pattern1 = '[regex1]'; // should match <table class="contentpaneopen">etc</table>
preg_match($pattern1, $subject, $match);
$pattern2 = '[regex2]'; // should match </tr></tbody></table>
$replacement = '[mybutton]';
echo preg_replace($pattern2, $replacement, $match[0]);
Without a good regex there is little point doing the rest of the code, so I hope someone can help with that!
This is a common question on SO and the answer is always the same: regular expressions are a poor choice for parsing or processing HTML or XML. There are many ways they can break down. PHP comes with at least three built-in HTML parsers that will be far more robust.
Take a look at Parse HTML With PHP And DOM and use something like:
$html = new DOMDocument();
// set before loading; changing it afterwards does not affect an already parsed document
$html->preserveWhiteSpace = false;
$html->loadHTML($source);
$tables = $html->getElementsByTagName('table');
foreach ($tables as $table) {
    if ($table->getAttribute('class') === 'contentpaneopen') {
        // replace it with something else
    }
}
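To make the "replace it with something else" step concrete, here is a rough sketch that appends an extra button cell to the title row. The sample markup is a trimmed version of the Joomla output in the question; the button text and href are placeholders, not anything Joomla provides.

```php
<?php
// Trimmed, hypothetical version of the Joomla output shown in the question.
$source = <<<HTML
<table class="contentpaneopen"><tbody><tr>
<td class="contentheading"><a class="contentpagetitle" href="url">Title Here</a></td>
<td class="buttonheading"><a title="PDF" href="url">PDF</a></td>
</tr></tbody></table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Only rows of a "contentpaneopen" table that also hold the title cell.
$rows = $xpath->query("//table[@class='contentpaneopen']//tr[td[@class='contentheading']]");

foreach ($rows as $row) {
    $cell = $doc->createElement('td');
    $cell->setAttribute('class', 'buttonheading');
    $link = $doc->createElement('a', 'My button'); // placeholder button
    $link->setAttribute('href', '#');
    $cell->appendChild($link);
    $row->appendChild($cell); // becomes the last cell in the row
}

// Serialize just the modified table rather than the whole wrapped document.
$table = $xpath->query("//table[@class='contentpaneopen']")->item(0);
$result = $doc->saveHTML($table);
echo $result;
```

The XPath predicate plays the role of the first regex from the question, and no second regex is needed because the new cell is appended to the row directly.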
Is there a reason that you need to use regex for this? DOM parsing would be much more straightforward.
Since a plugin in the scenario you provided is called every time you load a page, a regex approach is faster than a DOM call, which is why a lot of people use this approach. Joomla's documentation also explains why a regex is better than a DOM approach in the scenario you describe.
The problem with your solution is that it's tied with Joomla's default template. I don't remember if it uses the same class="contentheading" structure in all templates. If you plan to GPL such an extension, you should be careful about that.
What you're trying to do looks to me like a template override, explained in more detail here. It is a much simpler solution. For example, this is the PHP that creates your article titles:
<div class="componentheading<?php echo $this->params->get('pageclass_sfx')?>">
<h2><?php echo $this->escape($this->params->get('page_title')); ?></h2>
</div>
You just need to override the com_content article template and echo the HTML for the PDF buttons after the ->get('page_title') call. If you don't want to echo the HTML, you can create a module or a component, import it in the template and, after the ->get('page_title') call, call the methods in your component that output the HTML.
This component could have various checkboxes "show pdf (yes/no)" and other interesting actions.
