I have a question about simplehtmldom.
How can I get the text of an element that is the next sibling of another element containing a certain text?
For example, I have this HTML:
<div>
<table>
<tr>
<td>prova</td>
<td>pippo</td>
</tr>
</table>
</div>
and I need to extract the text of the second "td". The value "prova" is fixed and known in advance. I thought I could use this code:
echo $html->find("td:contains('prova')",0)->next_sibling();
but ":contains" doesn't exist in simplehtmldom.
How can I do that?
Thanks a lot
Thanks for your answer, but I need to extract the text of the td next to the td that contains the text "prova".
For example, I need to extract the value "pippo" with code similar to this:
echo $html->find("td:contains('prova')",0)->next_sibling()->innertext;
because I know the value of the first column. Unfortunately the :contains selector doesn't exist in simplehtmldom.
The code
echo $html->find("td:innertext('prova')",0)->next_sibling();
isn't the right way either.
Do you have any other suggestions?
Thanks
Try this code:
<?php
include_once "simple_html_dom.php";
// the html code loaded (in this case in string mode)
$html = '<div>
<table>
<tr>
<td>prova</td>
<td>pippo</td>
</tr>
</table>
</div>';
$dom = str_get_html($html);
// the :contains selector isn't implemented yet, so filter the cells manually
$tds = $dom->find("td");
foreach ($tds as $td) {
    if ($td->innertext == "prova") {
        echo $td->next_sibling()->innertext;
    }
}
?>
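If switching libraries is an option, the same lookup can be sketched with PHP's built-in DOMDocument and DOMXPath, where the text test and the sibling step both live in one XPath expression. The function name nextCellText is made up for this example, and it assumes the search text contains no double quotes.

```php
<?php
// Hypothetical helper (not part of simplehtmldom): returns the text of the
// <td> that directly follows the <td> whose trimmed text equals $needle.
// Assumption: $needle contains no double quotes, since it is placed
// inside a double-quoted XPath string literal.
function nextCellText(string $html, string $needle): ?string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $query = sprintf('//td[normalize-space(.)="%s"]/following-sibling::td[1]', $needle);
    $cell = $xpath->query($query)->item(0);

    return $cell !== null ? $cell->textContent : null;
}

$html = '<div><table><tr><td>prova</td><td>pippo</td></tr></table></div>';
echo nextCellText($html, 'prova'); // prints "pippo"
```

For a substring match instead of an exact one, the predicate could be changed to contains(., "prova").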
Related
I need to read an HTML file (whose structure I don't know in advance) and go through all its elements. For those elements that have text of their own, I'd like to grab or modify it. I've searched exhaustively but can't find something that does what I need.
Here's an example HTML file:
<!DOCTYPE html>
<html lang="en">
<body>
<p> 1st text I need</p>
2nd text I need
<table>
<tr>
<td>3rd text I need</td>
</tr>
</table>
</body>
</html>
Here's what I need to accomplish:
Traverse file
Find which elements have innerhtml
Grab or modify the text
Save the file
In the file above almost all elements have text but complex files won't.
I can use DOMDocument() to loop through specific types of nodes but I don't know what I'm going to encounter until a file is selected.
I thought the code below would do it but it prints just the file name during the loop.
<?php
include 'functions.php';
$doc = new DOMDocument();
$doc->loadHTMLFile('test.html');
showDOMNode($doc);
function showDOMNode($domNode) {
foreach ($domNode->childNodes as $node)
{
if($node->nodeName !="#text") {
echo $node->nodeName . ' ';
echo $node->nodeType . ' ';
echo $node->textContent . '<br>';
if($node->hasChildNodes()) {
showDOMNode($node);
}
}
}
}
?>
Here's what I get:
html 10
html 1 1st text I need 2nd text I need 3rd text I need
body 1 1st text I need 2nd text I need 3rd text I need
p 1 1st text I need
a 1 2nd text I need
table 1 3rd text I need
tr 1 3rd text I need
td 1 3rd text I need
As you can see, textContent shows the text of all descendant nodes, while I need only the text that belongs to each node itself. Any help is much appreciated.
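One possible way to get only each node's own text is to visit the text nodes themselves rather than the elements, since textContent on an element always includes its descendants. The sketch below uses an XPath query for non-blank text nodes; the function name collectOwnText and the inline test document are assumptions made for the example.

```php
<?php
// Hypothetical helper: lists every non-blank text node together with the
// name of the element that directly contains it.
function collectOwnText(DOMDocument $doc): array
{
    $xpath = new DOMXPath($doc);
    $found = [];
    foreach ($xpath->query('//text()[normalize-space(.) != ""]') as $textNode) {
        $found[] = [
            'parent' => $textNode->parentNode->nodeName,
            'text'   => trim($textNode->nodeValue),
        ];
    }
    return $found;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><p> 1st text I need</p>2nd text I need'
    . '<table><tr><td>3rd text I need</td></tr></table></body></html>');

foreach (collectOwnText($doc) as $entry) {
    echo $entry['parent'] . ': ' . $entry['text'] . "\n";
}
```

To modify the text, assign to $textNode->nodeValue inside the loop and write the result back with $doc->saveHTMLFile().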
I have a question regarding the domdocument.
My $html contains something like
texts …paragraph..
<table class='test'>
tr and td...
</table>
texts and more texts
I want to detect whether my HTML variable contains a table element. If so, wrap the other text in <p> tags,
so it becomes
<p>texts …paragraph..</p>
<table class='test'>
tr and td...
</table>
<p>texts and more texts</p>
My code looks like this:
$doc = new DOMDocument();
$doc->loadHTML($htmlString);
$tables = $doc->getElementsByTagName('table');
foreach ($tables as $table) {
//I am not sure what to do next...
}
Can someone help me out about this? Thanks so much!
I didn't test this, but:
$html = preg_replace('~(<table.*</table>)~is','<p>$1</p>', $html);
Hope it helps...
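Since the question asks for the text around the table to be wrapped, not the table itself, a DOM-based sketch might look like the following. The function name wrapLooseText and the temporary wrapper div are assumptions made for the example; it only handles text sitting directly next to the table, and non-ASCII input would need extra encoding handling that is left out here.

```php
<?php
// Hypothetical helper: if the fragment contains a <table>, wrap every
// top-level piece of loose text in its own <p>. Otherwise return it as-is.
function wrapLooseText(string $htmlString): string
{
    $doc = new DOMDocument();
    // A temporary <div> keeps the fragment's top-level nodes together.
    $doc->loadHTML('<div>' . $htmlString . '</div>');

    if ($doc->getElementsByTagName('table')->length === 0) {
        return $htmlString;
    }

    // The first <div> in document order is the temporary wrapper.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    // Copy the child list first, because the loop modifies the tree.
    foreach (iterator_to_array($wrapper->childNodes) as $child) {
        if ($child->nodeType === XML_TEXT_NODE && trim($child->nodeValue) !== '') {
            $p = $doc->createElement('p');
            $p->appendChild($doc->createTextNode(trim($child->nodeValue)));
            $wrapper->replaceChild($p, $child);
        }
    }

    // Serialize the wrapper's children, leaving the wrapper itself out.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo wrapLooseText("some text<table class='test'><tr><td>cell</td></tr></table>more text");
```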
I am trying to parse a folder full of .htm files. All of these files contain one specific element that needs to be removed.
It's a td element with class="hide". So far, this is my code.
$dir . $entry is the full path to the file.
$page = ($dir . $entry);
$this->domDoc->loadHTMLFile($page);
// Use xpath query to find the menu and remove it
$nodeList = $xpath->query('//td[@class="hide"]');
Unfortunately, this is where things already go wrong. If I do a var_dump of the node list, I get the following:
object(DOMNodeList)#5 (0) { }
Just so you folks get an idea of what I'm trying to select, here's an excerpt:
<td width="160" align="left" valign="top" class="hide">
lots of other TD's and content here
</td>
Does anybody see anything wrong with what I've come up with so far?
Is your initial file XHTML (i.e. with <html xmlns="http://www.w3.org/1999/xhtml">)? If so, your elements will be namespaced and you'll need to set up a prefix mapping using $xpath->registerNamespace and then use this prefix in the expression:
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
$nodeList = $xpath->query('//xhtml:td[@class="hide"]');
A var_dump of an XPath node list object doesn't show its contents. Dump the node list's length instead:
var_dump($nodeList->length);
If the value is over 0, then you can iterate over it using foreach:
foreach($nodeList as $node)var_dump($node->tagName);
Hope this helps.
For further clarification, here is a full working code snippet:
<?php
$html = <<<END
<html>
<body>
<td>
</td>
<td class="hide"></td>
<td class="hide"></td>
</body>
</html>
END;
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodeList = $xpath->query('//td[@class="hide"]');
// Shows a blank object
var_dump($nodeList);
// Shows 2
var_dump($nodeList->length);
// Echo out all the tag names.
foreach($nodeList as $node){
echo $node->tagName . "\n";
}
?>
Maybe you have more than one class in the class attribute of your td element:
<td class="hide anotherclass">
So '//td[@class="hide"]' would only match:
<td class="hide">
Try it like this to see if it contains the hide class you are looking for:
$nodeList = $xpath->query('//td[contains(@class,"hide")]');
Check out this blog post: XPath: Select element by class
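For the removal step the question was originally after, a sketch under a few assumptions could look like this. Note that contains(@class,"hide") would also match a class such as "hidden", so the example uses the common concat/normalize-space idiom to match a whole class token. The function name removeCellsByClass is made up, and $class is assumed to be a plain class name without quotes.

```php
<?php
// Hypothetical helper: removes every <td> whose class attribute contains
// $class as a whole token, and returns how many nodes were removed.
// Assumption: $class is a simple class name with no quote characters.
function removeCellsByClass(DOMDocument $doc, string $class): int
{
    $xpath = new DOMXPath($doc);
    $query = sprintf(
        '//td[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    // Copy the matches into an array before modifying the tree.
    $nodes = iterator_to_array($xpath->query($query));
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
    return count($nodes);
}

$doc = new DOMDocument();
$doc->loadHTML('<table><tr><td class="hide menu">x</td><td class="hidden">y</td><td>z</td></tr></table>');
echo removeCellsByClass($doc, 'hide'); // prints 1
```

Afterwards the cleaned document can be written back with $doc->saveHTMLFile($page).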
I've stripped the tags from the content of a URL like this:
$url='http://abcd.com';
$d=stripslashes(file_get_contents($url));
echo strip_tags($d);
but unfortunately all the tag values are run together, like user14036100 9.00user23034003 11.33user32028000 14.00, where user1, user2 and user3 are attributes with their stored values. It is hard to analyse the values because strip_tags() joins them all together.
Can someone help me strip each tag and store the values in an array, or place a delimiter at the end of each stripped tag's data?
Thanks in advance :)
You cannot achieve this with strip_tags(), since it just removes the tags. You want to replace them with, e.g., a whitespace character (newline, space, ...).
You could do this with a regex call that replaces all tags.
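A rough sketch of the regex idea: instead of replacing tags with a delimiter and splitting afterwards, preg_split can cut the string at every tag directly. The function name splitOnTags is made up for the example, and the pattern is a simplification that will misbehave on markup with ">" inside attribute values or on script and style blocks.

```php
<?php
// Hypothetical helper: splits markup at every tag and returns the
// non-empty, trimmed pieces of text in document order.
function splitOnTags(string $html): array
{
    $parts = preg_split('/<[^>]*>/', $html, -1, PREG_SPLIT_NO_EMPTY);
    $parts = array_map('trim', $parts);

    // Drop pieces that were whitespace only.
    $parts = array_filter($parts, static function (string $part): bool {
        return $part !== '';
    });
    return array_values($parts);
}

print_r(splitOnTags('<tr><td>user1</td><td>4036100</td> <td>9.00</td></tr>'));
```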
A better way would be to parse the fetched page with DOMDocument, so that you can derive the structure directly from the HTML structure.
Example of usage of DOMDocument
You have the following example html page:
<!DOCTYPE html>
<html>
<head>
<title>This is my title</title>
</head>
<body>
<table id="someDataHere">
<tr>
<th>Country</th>
<th>Population</th>
</tr>
<tr>
<td>Germany</td>
<td>81,779,600</td>
</tr>
<tr>
<td>Belgium</td>
<td>11,007,020</td>
</tr>
<tr>
<td>Netherlands</td>
<td>16,847,007</td>
</tr>
</table>
</body>
</html>
You can use DOMDocument to fetch the entries in the table:
$url = "...";
$dom = new DOMDocument("1.0", "UTF-8");
$dom->loadHTML(file_get_contents($url));
$preparedData = array();
$table = $dom->getElementById("someDataHere");
$tableRows = $table->getElementsByTagName('tr');
foreach ($tableRows as $tableRow)
{
$columns = $tableRow->getElementsByTagName('td');
// skip the header row of the table - it has no <td>, just <th>
if (0 == $columns->length)
{
continue;
}
$preparedData[ $columns->item(0)->nodeValue ] = $columns->item(1)->nodeValue;
}
$preparedData will now hold the following data:
Array
(
[Germany] => 81,779,600
[Belgium] => 11,007,020
[Netherlands] => 16,847,007
)
Some notes
Since you are developing a crawler (spider), you are highly dependent on the HTML structure of the target webpage. You may have to adjust your crawler every time they change something in their templates.
This is just a simple example, but it should make clear how you can use it to produce more advanced results.
Since DOMDocument implements the DOM methods, you have to work your way through the HTML structure with the possibilities they provide.
For very large HTML pages, DOMDocument can become quite expensive in terms of memory.
This is the code I have so far:
<?php
$url=" SOME HTML URL ";
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
echo $tag->getAttribute('href');
}
?>
I have HTML pages with tables, and I want the link, the title and the date. Example of the HTML code:
<TR>
<TD align="center" vAlign="top" bgColor="#ffffff" class="smalltext">3</TD>
<TD class="plaintext" >THIS IS THE TITLE </TD>
<TD align="center" class="plaintext" >THIS IS DATE</TD>
</TR>
It works fine for the link, but I don't know how to get the others.
Thanks.
Where you are doing this:
$tags = $doc->getElementsByTagName('a');
You are getting back all the A tags. There only happens to be one.
If you want to get the text "THIS IS DATE", you aren't going to get it by looking in A tags because the text is not inside an A tag - it is in a TD tag.
$tds = $doc->getElementsByTagName('td');
... would work to get all the TD elements, or you could assign an ID to the element you want to target and use getElementById instead.
Basically, though, this information is all in the documentation, which you absolutely should read before asking questions. Happy reading!
Once again, that's: http://php.net/manual/en/class.domdocument.php
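To make the TD suggestion concrete, here is a sketch that walks the rows and reads the cells by position. It assumes each data row has at least three cells, with the title in the second and the date in the third, and that the link (if any) is an A element somewhere in the row. The function name extractRows and the sample markup are made up for the example.

```php
<?php
// Hypothetical helper: returns one entry per table row that has at least
// three <td> cells. Assumes no nested tables inside the rows.
function extractRows(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep markup warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = $tr->getElementsByTagName('td');
        if ($cells->length < 3) {
            continue; // header or layout row
        }
        $link = $tr->getElementsByTagName('a')->item(0);
        $rows[] = [
            'href'  => $link !== null ? $link->getAttribute('href') : null,
            'title' => trim($cells->item(1)->textContent),
            'date'  => trim($cells->item(2)->textContent),
        ];
    }
    return $rows;
}

$sample = '<table><TR><TD>3</TD>'
    . '<TD class="plaintext"><a href="page.html">THIS IS THE TITLE</a> </TD>'
    . '<TD class="plaintext">THIS IS DATE</TD></TR></table>';
print_r(extractRows($sample));
```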