Parsing HTML and removing specific td

I have html content like the following...
<table>
<tr>
<td>xyx...</td>
<td>abc....</td>
<td><span><h3>Downloads</h3></span><br>blah blah blah...</td>
</tr>
<tr>
<td><h3>Downloads</h3>again some content.</td>
<td>dddd</td>
<td>kkkl...</td>
</tr>
</table>
I am trying to delete every <td> that has the word 'Downloads' anywhere in its content. After some research on the internet I got the following code to run:
$res_text = 'MY HTML';
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($res_text);
$selector = new DOMXPath($dom);
$results = $selector->query('//*[text()[contains(.,"Downloads")]]');
if($results->length){
foreach($results as $res){
$res->parentNode->removeChild($res);
}
}
This deletes the word 'Downloads' along with its immediate parent node (<span> or <p>), but I want the whole <td> to be deleted along with its content.
I tried...
$results = $selector->query('//td[text()[contains(.,"Downloads")]]');
but it's not working. Can someone tell me how to do this?

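You don't need text() in your query. text() selects only the text nodes that are direct children of the <td>, while 'Downloads' sits inside a nested element; testing . instead checks the string value of the whole cell, descendants included. It should be: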
$results = $selector->query('//td[contains(.,"Downloads")]');
The whole code:
$dom = new DOMDocument();
$dom->loadHTML($res_text);
$selector = new DOMXPath($dom);
$results = $selector->query('//td[contains(.,"Downloads")]');
if($results->length){
foreach($results as $res){
$res->parentNode->removeChild($res);
}
}
echo htmlentities($dom->saveHTML());
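For reference, here is a self-contained sketch of the same approach run against the sample table from the question. The function name removeCellsContaining is hypothetical, and libxml_use_internal_errors() is swapped in for the @ operator to keep parser warnings quiet; treat it as an illustration under those assumptions, not a definitive implementation.

```php
<?php
// Hypothetical helper: removes every <td> whose text (at any depth) contains $needle.
function removeCellsContaining(string $html, string $needle): string
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // "." is the string value of the whole <td>, so text in nested tags counts.
    // Assumption: $needle contains no double quote.
    $cells = $xpath->query('//td[contains(., "' . $needle . '")]');
    foreach ($cells as $cell) {
        $cell->parentNode->removeChild($cell);
    }
    return $dom->saveHTML();
}

$sample = '<table><tr><td>xyx</td><td>abc</td>'
        . '<td><span><h3>Downloads</h3></span><br>blah blah blah</td></tr>'
        . '<tr><td><h3>Downloads</h3>again some content.</td><td>dddd</td><td>kkkl</td></tr></table>';
$cleaned = removeCellsContaining($sample, 'Downloads');
echo $cleaned;
```

Because the XPath result is a snapshot rather than a live list, removing nodes inside the loop does not skip any matches.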

Related

Get specific output using php DOMElement and beginning of a value?

Here is the HTML code:
<table>
<tr>
<td>value1</td>
<td>value2</td>
<td>mailto:example@mail.com</td>
</tr>
</table>
And, the php:
$html = 'http://www.example.com'; // target path
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link){
$linkatt = $link->getAttribute('href');
$linkval = substr($linkatt, 0, 5 );
if($linkval == "mailto"){
echo $link->nodeValue;
}
}
I tried to output all <a> elements whose href attribute starts with "mailto" but got no results, so I'm not sure what is wrong with my code.
How can I output all the href values that start with mailto?

If you want to load HTML from a filename/URL, you need to use DOMDocument::loadHTMLFile(), not DOMDocument::loadHTML(). The latter expects a string of HTML, not a filename or URL.
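Note also that substr($linkatt, 0, 5) returns only five characters ("mailt"), so the comparison with "mailto" can never succeed. A minimal sketch that sidesteps the substring check with XPath's starts-with() follows. The helper name extractMailtoLinks and the sample markup are assumptions made for illustration; the question's real page is not shown.

```php
<?php
// Hypothetical helper: returns the href of every <a> whose href starts with "mailto:".
function extractMailtoLinks(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // For a file path or URL, call $dom->loadHTMLFile($pathOrUrl) here instead.
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $hrefs = [];
    foreach ($xpath->query('//a[starts-with(@href, "mailto:")]') as $link) {
        $hrefs[] = $link->getAttribute('href');
    }
    return $hrefs;
}

// Assumed sample markup; the address is a placeholder.
$html = '<table><tr><td>value1</td>'
      . '<td><a href="https://www.example.com">site</a></td>'
      . '<td><a href="mailto:example@mail.com">mail</a></td></tr></table>';
$mailtoLinks = extractMailtoLinks($html);
print_r($mailtoLinks);
```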

PHP DOM/xpath check element span class value

Within a curl request I have an HTML table with the structure below. I want to extract only the table rows that contain a span element with an empty class, and not the ones with class="subcomponent".
I successfully used XPath to find the elements with the empty class, but how do I get the entire <tr>, or even better the specific <td> nodes that contain Version and Partnumber?
Thanks in advance.
<table>
...
<tbody>
<tr>
<td></td>
<td></td>
<td>
<span class="">Product</span>
</td>
<td>Version</td>
<td>Partnumber</td>
</tr>
<tr>
<td></td>
<td></td>
<td>
<span class="subcomponent">Component</span>
</td>
<td>Version</td>
<td>Partnumber</td>
</tr>
</tbody>
My PHP code
$doc = new DOMdocument();
libxml_use_internal_errors(true);
$doc->loadHTML($page);
$doc->saveHTML();
$xpath = new DOMXpath($doc);
$query ='//span[@class=""]';
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo $entry->C14N();
}

To access the table rows themselves using SimpleXML, you can use the following:
$sxml = simplexml_load_string('<table>...</table>');
$rows = $sxml->xpath('//tr[td/span[@class=""]]');
foreach ($rows as $row) {
echo "Version: ", $row->td[3], ", Partnumber: ", $row->td[4];
}
The XPath works by selecting all <tr> tags that have a child <td>, which itself has a child <span> with a blank class.
In the loop, you need to access the child cells of each row by number, since your sample doesn't indicate that they're labelled any other way. I'm assuming a table structure won't change too often though, so that should be fine.
See https://eval.in/860169 for an example.
Alternative DOMDocument Version
If you're fetching a full webpage, which won't necessarily be well-formed, you might need to use DOMDocument as you have in your first example. It's a bit less clean to access the child-elements, but something like the following will work:
$doc = new DOMdocument;
libxml_use_internal_errors(true);
$doc->loadHTML($page);
$xpath = new DOMXpath($doc);
$rows = $xpath->query('//tr[td/span[@class=""]]');
foreach ($rows as $row) {
$cells = $row->getElementsByTagName('td');
$version = $cells->item(3)->nodeValue;
$partNumber = $cells->item(4)->nodeValue;
echo "Version: {$version}, Part Number: {$partNumber}", PHP_EOL;
}
See https://eval.in/860217
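As a variation, the cells can also be picked with XPath queries relative to each row instead of getElementsByTagName(). The sketch below assumes the five-column layout from the question; the helper name extractProductRows and the version and part-number values are made up for illustration.

```php
<?php
// Hypothetical helper: returns version/part-number pairs for rows whose <span> has an empty class.
function extractProductRows(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($xpath->query('//tr[td/span[@class=""]]') as $row) {
        // The second argument makes the row the context node; XPath positions are 1-based.
        $version = $xpath->query('td[4]', $row)->item(0);
        $partNumber = $xpath->query('td[5]', $row)->item(0);
        $result[] = [
            'version' => $version ? trim($version->textContent) : null,
            'partNumber' => $partNumber ? trim($partNumber->textContent) : null,
        ];
    }
    return $result;
}

// Assumed sample data with placeholder values.
$page = '<table><tbody>'
      . '<tr><td></td><td></td><td><span class="">Product</span></td><td>1.0</td><td>PN-100</td></tr>'
      . '<tr><td></td><td></td><td><span class="subcomponent">Component</span></td><td>2.0</td><td>PN-200</td></tr>'
      . '</tbody></table>';
$productRows = extractProductRows($page);
print_r($productRows);
```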

I would use the following XPath expression:
//td[text()="Version"] | //td[text()="Partnumber"]
Which gives me:
Element='<td>Version</td>'
Element='<td>Partnumber</td>'
Element='<td>Version</td>'
Element='<td>Partnumber</td>'

Trying to retrieve text only from a div with xpath

I'm trying to write a script that will go through a poorly coded webpage and return the title element. However, the person who made the website I plan on scraping did not use ANY classes, just plain div elements. Here's the source of the webpage I'm trying to scrape:
<tbody>
<tr>
<td style = "...">
<div style = "...">
<div style = "...">TEXT I WANT</div>
</div>
</td>
</tr>
</tbody>
and when I copy the xpath in chrome I get this string:
/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]
I'm having trouble figuring out where I put that string in an xpath query.
If not an xpath query maybe I should do a preg_match?
I tried this:
$location = '/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]';
$html = file_get_contents($URL);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query($location) as $node) {
echo $node, "\n";
}
but nothing is printed to the page.
Thanks.
EDIT: Full source code here:
http://pastebin.com/K5tZ4dFH
EDIT2: Cleaner code screen shot: http://i.imgur.com/lWKheBy.png

From looking at your source, try the following:
$html = file_get_contents($URL);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[contains(@style, 'left:20px')]");
foreach ($nodes as $node) {
echo $node->textContent;
}
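One reason the loop in the question prints nothing useful is that it echoes the node object itself, and a DOMElement cannot be converted to a string; reading textContent avoids that. Here is a minimal, self-contained sketch of the style-based query. The helper name textOfStyledDivs and the sample style values are assumptions, since the real styles are elided in the question.

```php
<?php
// Hypothetical helper: returns the text of every <div> whose style attribute contains $styleFragment.
function textOfStyledDivs(string $html, string $styleFragment): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: $styleFragment contains no double quote.
    $nodes = $xpath->query('//div[contains(@style, "' . $styleFragment . '")]');
    $texts = [];
    foreach ($nodes as $node) {
        // textContent is a string; the node object itself cannot be echoed.
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Assumed sample markup standing in for the real page.
$html = '<table><tbody><tr><td style="width:100px"><div style="position:relative">'
      . '<div style="left:20px">TEXT I WANT</div></div></td></tr></tbody></table>';
$texts = textOfStyledDivs($html, 'left:20px');
print_r($texts);
```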

It looks like you want the text just before the first </div>, so this regex will find that:
[^<>]+(?=<\/div>)

Using regex to get the value from a tag in PHP

Using regex in PHP how can I get the 108 from this tag?
<td class="registration">108</td>

Regex isn't a good solution for parsing HTML. Use a DOM Parser instead:
$str = '<td class="registration">108</td>';
$dom = new DOMDocument();
$dom->loadHTML($str);
$tds = $dom->getElementsByTagName('td');
foreach($tds as $td) {
echo $td->nodeValue;
}
Output:
108
The above code loads your HTML string using the loadHTML() method, finds all the <td> tags, loops through them, and echoes each node value.
If you want to get only the specific class name, you can use an XPath:
$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DomXPath($dom);
// get the td tag with 'registration' class
$tds = $xpath->query("//*[contains(@class, 'registration')]");
foreach($tds as $td) {
echo $td->nodeValue;
}
This is similar to the above code, except that it uses XPath to find the required tag. You can find more information about XPaths in the PHP manual documentation. This post should get you started.

If you wish to force regex, use the <td class=["']?registration["']?>(.*)</td> expression
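If regex is used anyway, a non-greedy capture is worth considering: with (.*) the match runs to the last </td> on the line when several cells follow each other. The sketch below is a variation on the expression above, not the exact pattern from this answer, and the second cell in the sample string is an assumption added to show the difference.

```php
<?php
// Sample input; the second cell is added only to show why a non-greedy capture matters.
$str = '<td class="registration">108</td><td class="other">7</td>';

// '#' delimiters avoid escaping the slash in </td>; (.*?) stops at the first closing tag.
$pattern = '#<td class=["\']?registration["\']?>(.*?)</td>#';

$value = preg_match($pattern, $str, $matches) === 1 ? $matches[1] : null;
echo $value;
```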

preg_match() find all values inside of table?

Hey guys,
A curl function returns a string $widget that contains regular HTML: two divs, where the first div holds a table with various values inside <td> elements.
I wonder what the easiest and best way is to extract only the values inside the <td> elements, so that I have the bare values without the surrounding HTML.
Any idea what the pattern for preg_match should look like?
Thank you.

Regex is not a suitable solution. You're better off loading it up in a DOMDocument and parsing it.

You're better off using a DOM parser for that task:
$html = <<<HTML
<div>
<table>
<tr>
<td>foo</td>
<td>bar</td>
</tr>
<tr>
<td>hello</td>
<td>world</td>
</tr>
</table>
</div>
<div>
Something irrelevant
</div>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tds = $xpath->query('//div/table/tr/td');
foreach ($tds as $cell) {
echo "{$cell->textContent}\n";
}
Would output:
foo
bar
hello
world

You shouldn't use regexps to parse HTML. Use DOM and XPath instead. Here's an example:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td');
$result = array();
foreach ($nodes as $node) {
$result[] = $node->nodeValue;
}
// $result holds the values of the tds
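If the row structure matters, the same idea extends to a nested array with one entry per <tr>. This is a sketch under assumptions: the helper name tableToArray is hypothetical, and the sample markup is borrowed from the other answer to this question.

```php
<?php
// Hypothetical helper: returns the table as an array of rows, each an array of cell texts.
function tableToArray(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $table = [];
    foreach ($xpath->query('//tr') as $row) {
        $cells = [];
        // Relative query: only the <td> children of the current row.
        foreach ($xpath->query('td', $row) as $cell) {
            $cells[] = trim($cell->textContent);
        }
        $table[] = $cells;
    }
    return $table;
}

$widget = '<div><table><tr><td>foo</td><td>bar</td></tr>'
        . '<tr><td>hello</td><td>world</td></tr></table></div>'
        . '<div>Something irrelevant</div>';
$table = tableToArray($widget);
print_r($table);
```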

Only if you have very limited, well-defined HTML can you expect to parse it with regular expressions. The highest ranked SO answer of all time addresses this issue.
He comes ...
