I have a DOMDocument that looks like this:
<font size="6" face="Arial">
CONTENT
<font size="5" face="Arial">...</font>
<br>
<table cellspacing="1" cellpadding="1" border="3" bgcolor="#E7E7E7" rules="all">...</table>
<table cellspacing="1" cellpadding="1">...</table>
<font size="3" face="Arial" color="#000000">...</font>
</font>
Now I want to get just CONTENT and not all the other child-elements.
How can I do that?
What you can do is grab the first DOMText node that's a child of the first <font> tag.
// Get the first <font> tag
$font = $doc->getElementsByTagName( 'font')->item(0);
// Find the first DOMText element
$first_text = null;
foreach( $font->childNodes as $child) {
if( $child->nodeType === XML_TEXT_NODE) {
$first_text = $child;
break;
}
}
if( $first_text != null) {
echo 'OUTPUT: ' . $first_text->textContent;
}
This prints:
OUTPUT: CONTENT
Shorter:
$output = $xml->getElementsByTagName("font")->item(0)->firstChild->textContent;
nickb's solution works too, and is even better if the CONTENT comes after one of the child elements. But since it doesn't in my case, this one is shorter.
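The same idea can also be sketched with XPath, which selects only the text nodes that are direct children of the first <font> element. This is a minimal sketch, assuming a shortened version of the markup from the question; the variable names are illustrative only.

```php
<?php
// Shortened version of the markup from the question (an assumption for this demo).
$html = '<font size="6" face="Arial">CONTENT<font size="5" face="Arial">inner</font><br></font>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// text() matches direct text children only; normalize-space() skips whitespace-only nodes.
$node = $xpath->query('(//font)[1]/text()[normalize-space()]')->item(0);
$content = $node !== null ? trim($node->nodeValue) : null;

echo 'OUTPUT: ', $content, "\n";
```

Unlike firstChild, this does not depend on the text being the very first child node, as long as it is a direct child of the <font> element.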
Hello, I am having a problem with DOMDocument. I need to write a script that extracts all the information from the tables with a certain id.
So I did:
$link = "WEBSITE URL";
$html = file_get_contents($link);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$context_nodes = $xpath->query('//table[@id="news"]/tr[position()>0]/td');
So I get all the <td>s and information, but the problem is that the <img> tags haven't been extracted by the script. How can I extract all the information of the tables either text or image html tags?
The html code from which I want to extract the info is:
<table id="news" width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="539" height="35"><span><strong>Info to Extract</strong></span></td>
</tr>
<tr>
<td height="35" class="texto10">Martes, 02 de Octubre de 2012 | Autor: Trovert" rel="author"></a></td>
</tr>
<tr>
<td height="35" class="texto12Gris"><p><strong>Info To extract</strong></p>
<p><strong> </strong></p>
<p><strong>Casa de Gobierno: (a 9 cuadras del hostel)</strong></p>
<img title="title" src="../images/theimage.jpg" width="400" height="266" />
</td>
</tr>
</table>
This is how I am iterating the extracted elements:
foreach ($context_nodes as $node) {
echo $node->nodeValue . '<br/>';
}
Thanks
If you need more than text, nodeValue/textContent is not enough; you'll have to walk through the target node's DOM branch:
function walkNode($node)
{
$str="";
if($node->nodeType==XML_TEXT_NODE)
{
$str.=$node->nodeValue;
}
elseif(strtolower($node->nodeName)=="img")
{
/* This is just a demonstration;
* You'll have to extract the info in the way you want
* */
$str.='<img src="'.$node->attributes->getNamedItem("src")->nodeValue.'" />';
}
if($node->firstChild) $str.=walkNode($node->firstChild);
if($node->nextSibling) $str.=walkNode($node->nextSibling);
return $str;
}
This is a simple, straightforward recursive function. So now you can do this:
$dom=new DOMDocument();
$dom->loadHTML($html);
$xpath=new DOMXPath($dom);
$tds=$xpath->query('//table[@id="news"]//tr[position()>0]/td');
foreach($tds as $td)
{
echo walkNode($td->firstChild);
echo "\n";
}
(Please note that I "fixed" your HTML a little, as it doesn't seem valid, and also re-indented it a bit.)
This outputs something like this:
Info to Extract
Martes, 02 de Octubre de 2012 | Autor: Trovert
Info To extract
Casa de Gobierno: (a 9 cuadras del hostel)
<img src="../images/theimage.jpg" />
Try saveHTML() on each node; it keeps the markup (including the <img> tags) instead of only the text:
foreach ($context_nodes as $node) {
echo $doc->saveHTML($node) . '<br/>';
}
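If only the contents of each cell are wanted, without the surrounding <td> tag itself, the children of the cell can be serialized one by one. This is a minimal sketch; innerHtml() is a hypothetical helper name, not part of the DOM extension, and the markup is a trimmed-down version of the table from the question.

```php
<?php
// Hypothetical helper: serialize the children of a node, i.e. its "inner HTML".
function innerHtml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveHTML($child);
    }
    return $out;
}

// Trimmed-down version of the table from the question.
$html = '<table id="news"><tr><td><p><strong>Info To extract</strong></p>'
      . '<img title="title" src="../images/theimage.jpg" width="400" height="266" />'
      . '</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$cells = [];
// Note the @ in front of the attribute name in the XPath predicate.
foreach ($xpath->query('//table[@id="news"]//td') as $td) {
    $cells[] = innerHtml($td);
}
print_r($cells);
```

Each entry of $cells then holds both the text markup and the <img> tag of one cell.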
I want to extract the multiple <a> tags from this html markup:
<table align="center" width="100%" cellpadding="0" cellspacing="0"><tr>
<td align="left" width="60%" valign="top">
<font size="5" color="#939390">title</font><br><font size="5" color="#939390">LINKS</font>
<a style="color:#000000;" title="title Link 1" href="/page/242808/1/44643.html"> <b style="background:#ff6633">Link 1</b></a>
<a style="color:#000000;" title="title Link 2" href="/page/242808/2/erewe.html"> <b style="background:#ff6633">Link 2</b></a>
</td>
</tr>
</table>
and here is my code :
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->validateOnParse = true;
$doc->preserveWhiteSpace = false;
$doc->loadHTML($html); // load HTML you can add $html
$selector = new DOMXPath($doc);
$a = $selector->query('//table[1]//a')->item(0);
echo $doc->saveHTML($a);
This gets the first <a>, but what I want is to get all the <a> tags in the document.
To get more than one, you'll need to loop through the results instead of just printing the first one:
$a = $selector->query('//table[1]//a');
foreach($a as $current) {
echo $doc->saveHTML($current);
}
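If the goal is the link targets and labels rather than the raw markup, the same loop can collect them into an array. This is a minimal sketch, assuming a shortened version of the markup from the question; the $links structure is an illustrative choice.

```php
<?php
// Shortened version of the markup from the question.
$html = '<table><tr><td>'
      . '<a title="title Link 1" href="/page/242808/1/44643.html"> <b>Link 1</b></a>'
      . '<a title="title Link 2" href="/page/242808/2/erewe.html"> <b>Link 2</b></a>'
      . '</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);

$links = [];
// [@href] restricts the match to anchors that actually carry an href attribute.
foreach ($selector->query('//table[1]//a[@href]') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}
print_r($links);
```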
I have searched online and thought this would work, but for some reason it doesn't. I'm trying to extract a hyperlink that displays only its URL from an HTML document. I only want the URL inside the td with align="center". Here is a sample of the HTML document I'm trying to extract from:
<td>
Aug 17
</td>
<td>
FT
</td>
<td align="right">
Arsenal ruby
</td>
**<td align="center">**
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
And here is my PHP code to extract it from the td align="center":
<?php
//$searchURL = "site";
include 'simple_html_dom.php';
$site = 'website';
$html = file_get_html($site);
$tabledata = array();
// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
echo $e->href . '<br>';
?>
I know the code works because it can extract everything when the selector inside the brackets is just td.
So you have identified the <td> elements themselves, but you did not go down to the next nesting level to grab the href from the <a> elements. You might do that like this:
foreach($html->find('td[align=center]') as $e)
echo $e->children(0)->href . '<br>';
Use the DOM and XPath:
Select all td elements in the document
//td
Only if the align attribute equals "center"
//td[@align="center"]
Get the a sub elements
//td[@align="center"]//a
Get the href attribute nodes of those a elements
//td[@align="center"]//a/@href
Source example:
$html = <<<'HTML'
<td>
FT
</td>
<td align="right">
Arsenal ruby
</td>
**<td align="center">**
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$nodes = $xpath->evaluate('//td[@align="center"]//a/@href');
foreach ($nodes as $node) {
var_dump($node->value);
}
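The sample markup as shown contains no <a> element, so the expression has nothing to match there. The following is a minimal sketch with a link added to the centered cell; the anchor and its example.com URL are assumptions made purely for the demo.

```php
<?php
// The <a> element and its URL are hypothetical; the sample markup does not show one.
$html = '<table><tr>'
      . '<td align="right">Arsenal</td>'
      . '<td align="center"><a href="http://example.com/match/1">1-3</a></td>'
      . '<td>Aston Villa</td>'
      . '</tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$hrefs = [];
// @align and @href address attributes; the result is a list of DOMAttr nodes.
foreach ($xpath->evaluate('//td[@align="center"]//a/@href') as $attr) {
    $hrefs[] = $attr->value;
}
print_r($hrefs);
```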
You selected the td element. The anchor element is the child of the td element.
// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
echo $e->firstChild()->getAttribute('href') . '<br>';
I have a html code:
<table id="table1" border="0" cellspacing="0" cellpadding="3" width="1" align="center">
<tr>
<td>
<img src="http://vnexpress.net/Files/Subject/3b/bd/ac/f9/cuongbibat.jpg" width="330" height="441" border="1" alt="Cường">
</td>
</tr>
<tr>
<td class="Image">Everything
</td>
</tr>
</table>
<table id="table2" border="0" cellspacing="0" cellpadding="3" width="1" align="center">
<tr>
<td>
Someone
</td>
</tr>
<tr>
<td class="Image">Everything
</td>
</tr>
</table>
I have 2 tables. I want to remove all table, tr, and td tags if the table contains an img tag (table 1).
I need to get a result like:
<img src="http://vnexpress.net/Files/Subject/3b/bd/ac/f9/cuongbibat.jpg" width="330" height="441" border="1" alt="Cường">
Everything
<table id="table2" border="0" cellspacing="0" cellpadding="3" width="1" align="center">
<tr>
<td>
Someone
</td>
</tr>
<tr>
<td class="text">Everything
</td>
</tr>
</table>
Please help me. Thank you.
HTML Purifier can be used to strip either all tags or a certain set of tags from a document. It's the go-to solution for basically any HTML tag stripping in PHP - don't ever use regexes for this or the sun will burn out and we will all freeze to death in the suffocating darkness.
Try something like:
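$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'img[src|alt|width|height]');
$purifier = new HTMLPurifier($config);
$output = $purifier->purify($YOUR_HTML);
HTML.Allowed takes a single comma-separated list, so every tag you don't want scrubbed away (with its permitted attributes in brackets) has to go into that one string; a second $config->set('HTML.Allowed', ...) call replaces the first rather than adding to it. It's a price worth paying for the continued lifegiving warmth of the day-star. And also for not leaving your site open to XSS attacks and content-eating glitches, I guess.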
Check out:
http://simplehtmldom.sourceforge.net/
Lets you find tags on an HTML page with selectors just like jQuery and extract contents from HTML in a single line.
In theory, it's possible to do this with a single highly complex regexp. It's always easier to do the search-and-replace in separate steps: search for the outer container first, then work on what it contains.
<?php
header("Content-type: text/plain");
$html = '<table id="table1" border="0" cellspacing="0" cellpadding="3" width="1" align="center">
<tr>
<td>
<img src="http://vnexpress.net/Files/Subject/3b/bd/ac/f9/cuongbibat.jpg" width="330" height="441" border="1" alt="Cường">
</td>
</tr>
<tr>
<td class="Image">Everything
</td>
</tr>
</table>
<table id="table2" border="0" cellspacing="0" cellpadding="3" width="1" align="center">
<tr>
<td>
Someone
</td>
</tr>
<tr>
<td class="Image">Everything
</td>
</tr>
</table> ';
$html = preg_replace_callback('/<table\b[^>]*>.*?<\/table>/si', 'removeTableIfImg', $html);
function removeTableIfImg($matches) {
$table = $matches[0];
return preg_match('/<img\b[^>]*>/i', $table, $img)
? preg_replace('/<\/?(?:table|td|tr)\b[^>]*>\s*/i', '', $table)
: $table;
}
echo $html;
?>
The first pattern finds the tables. The second pattern (in the callback) checks if there's an image tag. The third removes the table, td, and tr tags.
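For comparison, the same transformation can be sketched with DOMDocument and XPath instead of regular expressions. This is a rough sketch, not a drop-in replacement: unwrapImageTables() is a hypothetical helper name, the sample markup is a shortened ASCII-only version of the question's tables, and nested tables are not handled. Note also that loadHTML() assumes ISO-8859-1 unless the markup declares a charset, so non-ASCII text such as the alt attribute in the question would need an encoding declaration first.

```php
<?php
// Hypothetical helper: unwrap every table that contains an <img>, keeping the cell contents.
function unwrapImageTables(string $html): string
{
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    // Wrap the fragment so the result can be read back from one known element.
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//table[.//img]') as $table) {
        $fragment = $doc->createDocumentFragment();
        // Move the children of each cell into the fragment, dropping table/tr/td.
        foreach ($xpath->query('.//td/node()', $table) as $child) {
            $fragment->appendChild($child);
        }
        $table->parentNode->replaceChild($fragment, $table);
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($xpath->query('//div[@id="wrap"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Shortened, ASCII-only version of the two tables from the question.
$html = '<table id="table1"><tr><td><img src="photo.jpg" alt="photo"></td></tr>'
      . '<tr><td class="Image">Everything</td></tr></table>'
      . '<table id="table2"><tr><td>Someone</td></tr></table>';

$result = unwrapImageTables($html);
echo $result, "\n";
```

The table without an image is left untouched, while the table with an image is reduced to its cell contents.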
I needed something like this.
Here is my solution:
(<\/?tr.*?>)|(<\/?td.*?>)|(<\/?table.*?>)
This regex matches all tr, td, and table tags (opening and closing), non-greedily.
You can see it in action here:
http://regexr.com/3fslh
As sudowned said, do not use regex for this; it will drive you crazy. Searching for a library usually takes about as much time as writing your own small parser for this. I have done this several times in different languages. You learn a lot, and you can often reuse the code :-)
Since you are not interested in attributes, this should be quite easy: loop over the input character by character. Check out this Java code; it's one of my earlier, smaller approaches to sanitizing HTML:
public static String sanatize(String body, String[] whiteList, String tagSeperator, String seperate) {
StringBuilder out = new StringBuilder();
StringBuilder tag = new StringBuilder();
boolean quoteOpen = false;
boolean tagOpen = false;
for(int i=0;i<body.length();i++) {
char c = body.charAt(i);
if(i<body.length()-1 && c == '<' && !quoteOpen && body.charAt(i+1) != '!') {
tagOpen = true;
tag.append(c);
} else if(c == '>' && !quoteOpen && tagOpen) {
tag.append(c);
for (String tagName : whiteList) {
String stag = tag.toString().toLowerCase();
if (stag.startsWith("</"+tagName+" ") || stag.startsWith("</"+tagName+">") || stag.startsWith("<"+tagName+" ") || stag.startsWith("<"+tagName+">")) {
out.append(tag);
} else if (stag.startsWith("</") && tagSeperator != null) {
if (seperate.length()>2) {
if (seperate.contains("," + stag.replaceAll("[</]+(\\w+)[\\s>].*", "$1") + ",")) {
out.append(tagSeperator);
}
} else {
if (!out.toString().endsWith(tagSeperator)) {
out.append(tagSeperator);
}
}
}
}
tag = new StringBuilder();
tagOpen = false;
} else if (c == '"' && !quoteOpen) {
quoteOpen = true;
if (tagOpen)
tag.append(c);
else
out.append(c);
} else if (i>1 && c == '"' && quoteOpen && body.charAt(i-1) != '\\' ) {
quoteOpen = false;
if (tagOpen)
tag.append(c);
else
out.append(c);
} else {
if (tagOpen)
tag.append(c);
else
out.append(c);
}
}
return out.toString();
}
You can ignore tagSeperator and seperate; I used them to sanitise tags and convert the result to CSV.
My first time here.
I got these lines as a response from the server and saved them in a file. They look like XML, right? My task is to read the content of those td tags and put them into another structured file (Excel). The problem is I don't know how to do that.
At the moment, I think I will strip the first and last lines of the file and then parse the rest as XML. But do you know other ways? Thanks.
<CallbackContent><![CDATA[
<table cellspacing="0" border="0" cellpadding="0" width="100%">
<tr class="rowcolor2">
<td align="left" style="padding:5px;">22/02/2010</td>
<td align="right" style="padding:5px;">510,02</td>
</tr>
</table>
]]></CallbackContent>
Btw, I'm using PHP.
Use an XML parser such as SimpleXML. It will allow you to extract the CDATA safely.
Then if the HTML is XML-compliant (in other words, it's XHTML) you can use SimpleXML to extract data from it. For example:
$xml='<CallbackContent><![CDATA[
<table cellspacing="0" border="0" cellpadding="0" width="100%">
<tr class="rowcolor2">
<td align="left" style="padding:5px;">22/02/2010</td>
<td align="right" style="padding:5px;">510,02</td>
</tr>
</table>
]]></CallbackContent>';
$CallbackContent = simplexml_load_string($xml);
$html = (string) $CallbackContent;
// if XHTML
$table = simplexml_load_string($html);
// otherwise, use
$dom = new DOMDocument;
$dom->loadHTML($html);
$table = simplexml_import_dom($dom)->body->table;
foreach ($table->tr as $tr)
{
echo 'tr class=', $tr['class'], "\n";
foreach ($tr->td as $td)
{
echo 'td align=', $td['align'], ' - value: ', (string) $td, "\n";
}
}
You cannot read the table with an XML parser directly, because it is delivered as a CDATA block, which is equivalent to a string literal.
First, read the whole thing using an XML parser so that you can pull out the contents of the CDATA section. Then take that and stuff it through an HTML parser.
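The two steps, plus writing the cells out in a format Excel can open, can be sketched as follows. This is a minimal sketch under a few assumptions: CSV is used as the target format, the output goes to php://memory so that the demo needs no file, and the variable names are illustrative.

```php
<?php
$xml = '<CallbackContent><![CDATA[
<table cellspacing="0" border="0" cellpadding="0" width="100%">
<tr class="rowcolor2">
<td align="left" style="padding:5px;">22/02/2010</td>
<td align="right" style="padding:5px;">510,02</td>
</tr>
</table>
]]></CallbackContent>';

// Step 1: the XML parser hands back the CDATA section as a plain string.
$html = (string) simplexml_load_string($xml);

// Step 2: an HTML parser turns that string into a tree.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$rows = [];
foreach ($xpath->query('//tr') as $tr) {
    $row = [];
    foreach ($xpath->query('./td', $tr) as $td) {
        $row[] = trim($td->textContent);
    }
    $rows[] = $row;
}

// Step 3: write the rows as CSV; swap php://memory for a real path to get a file.
$fh = fopen('php://memory', 'w+');
foreach ($rows as $row) {
    fputcsv($fh, $row, ',', '"', '\\');
}
rewind($fh);
$csv = stream_get_contents($fh);
fclose($fh);

echo $csv;
```

Because the second value contains a comma, fputcsv() wraps it in quotes, so the decimal comma does not split the field.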