I have the following (PHP) code that traverses an entire DOM document to get all of the text nodes. It's a bit of an ugly solution, and I'm sure there must be a better way... so, is there?
$skip = false;
$node = $document;
$nodes = array();
while ($node) {
if ($node->nodeType == 3) {
$nodes[] = $node;
}
if (!$skip && $node->firstChild) {
$node = $node->firstChild;
} elseif ($node->nextSibling) {
$node = $node->nextSibling;
$skip = false;
} else {
$node = $node->parentNode;
$skip = true;
}
}
Thanks.
The XPath expression you need is //text(). Try using it with DOMXPath::query. For example:
$xpath = new DOMXPath($doc);
$textnodes = $xpath->query('//text()');
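As a slightly fuller sketch of the same idea, the query result can be iterated like any other node list. The sample markup and the $texts variable below are made up for illustration; they are not from the question.

```php
<?php
// Hypothetical sample markup; any loaded DOMDocument works the same way.
$doc = new DOMDocument();
$doc->loadXML('<root><p>Hello <b>bold</b> world</p><p>Second</p></root>');

$xpath = new DOMXPath($doc);

// Collect the value of every text node, in document order.
$texts = array();
foreach ($xpath->query('//text()') as $textNode) {
    $texts[] = $textNode->nodeValue;
}

print_r($texts);
```

This replaces the manual firstChild / nextSibling / parentNode walk with a single query.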
I have HTML code something like this:
<p><i>i_text</i>,p_text</p>
i_text,p_text
I want to change all node values in this DOM element and keep all tags:
i_changed_text,p_changed_text
My attempts:
$html = '<p><i>i_text</i> p_text</p>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$dom->validateOnParse = true;
$elements = $dom->getElementsByTagName('*');
foreach ($elements as $element) {
$element->nodeValue = str_replace('_','_changed_',$element->nodeValue);
}
echo($dom->saveHTML());
Output:
i_changed_text,p_changed_text
This code returns the correct text but does not preserve the child nodes.
$html = '<p><i>i_text</i>,p_text</p>';
$dom = new DOMDocument();
$dom->loadXML($html);
$dom->preserveWhiteSpace = false;
$dom->validateOnParse = true;
$elements = $dom->getElementsByTagName('*');
$elem = $dom->createElement('dfn', 'tag');
$attr = $dom->createAttribute('text');
$attr->value = 'element';
$elem->appendChild($attr);
$elements = $dom->getElementsByTagName('*');
foreach ($elements as $element) {
while ($element->hasChildnodes()) {
$element = $element->childNodes->item(0);
}
$changed_value = str_replace('_','_changed_',$element->nodeValue);
$element->nodeValue = str_replace("tag", $dom->saveXML($elem), $changed_value);
}
echo ($dom->saveXML());
Output:
i_changed_text,p_text
This code keeps the child nodes and changes their values, but it does not change the text in the parent node.
My solution:
i_text,p_text,a_text,another one_text
$html = '<p><i>i_text</i>,p_text<b>,a_text</b>,another one_text</p>';
$dom = new DOMDocument();
$dom->loadXML($html);
$dom->preserveWhiteSpace = false;
$dom->validateOnParse = true;
$elements = $dom->getElementsByTagName('*');
foreach ($elements as $element) {
if($element->hasChildnodes()==true && $element->parentNode->nodeName == '#document'){
foreach($element->childNodes as $element_child){
$element_child->nodeValue = str_replace('_','_changed_', $element_child->nodeValue);
}
}
}
echo ($dom->saveXML());
Output:
i_changed_text,p_changed_text,a_changed_text,another one_changed_text
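An alternative sketch, assuming the goal is simply "change every piece of text, keep every tag": select the text nodes themselves with the XPath expression //text() and modify only those. This is not the asker's method, and the $result variable is introduced here for illustration.

```php
<?php
$html = '<p><i>i_text</i>,p_text<b>,a_text</b>,another one_text</p>';

$dom = new DOMDocument();
$dom->loadXML($html);

$xpath = new DOMXPath($dom);

// Only text nodes are modified, so the surrounding tags stay in place.
foreach ($xpath->query('//text()') as $textNode) {
    $textNode->nodeValue = str_replace('_', '_changed_', $textNode->nodeValue);
}

// Serialize just the root element (no XML declaration).
$result = $dom->saveXML($dom->documentElement);
echo $result;
// <p><i>i_changed_text</i>,p_changed_text<b>,a_changed_text</b>,another one_changed_text</p>
```

Because the query matches text nodes at any depth, this does not depend on the element being a direct child of the document.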
I'm using a small script to convert absolute links to relative ones. It works, but it needs improvement and I'm not sure how to proceed. Please have a look at the relevant part of the script.
Script:
public function links($path) {
$old_url = 'http://test.dev/';
$dir_handle = opendir($path);
while($item = readdir($dir_handle)) {
$new_path = $path."/".$item;
if(is_dir($new_path) && $item != '.' && $item != '..') {
$this->links($new_path);
}
// it is a file
else{
if($item != '.' && $item != '..')
{
$new_url = '';
$depth_count = 1;
$folder_depth = substr_count($new_path, '/');
while($depth_count < $folder_depth){
$new_url .= '../';
$depth_count++;
}
$file_contents = file_get_contents($new_path);
$doc = new DOMDocument;
@$doc->loadHTML($file_contents);
foreach ($doc->getElementsByTagName('a') as $link) {
if (substr($link, -1) == "/"){
$link->setAttribute('href', $link->getAttribute('href').'/index.html');
}
}
$doc->saveHTML();
$file_contents = str_replace($old_url,$new_url,$file_contents);
file_put_contents($new_path,$file_contents);
}
}
}
}
As you can see, I've added a DOMDocument inside the while loop, but it doesn't work. What I'm trying to achieve is to append index.html to every link whose last character is /.
What am I doing wrong?
Thank you.
Is this what you want?
$file_contents = file_get_contents($new_path);
$dom = new DOMDocument();
$dom->loadHTML($file_contents);
$xpath = new DOMXPath($dom);
$links = $xpath->query("//a");
foreach ($links as $link) {
$href = $link->getAttribute('href');
if (substr($href, -1) === '/') {
$link->setAttribute('href', $href."index.html");
}
}
$new_file_content = $dom->saveHTML();
# save this wherever you want
See a demo on ideone.com.
Hint: Your call to $dom->saveHTML() leads nowhere (i.e. no variable captures the output), so the modified markup is never written back to the file.
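The same logic can be wrapped in a small helper and tried on an inline string. This is only a sketch: the function name appendIndexToLinks and the sample markup are hypothetical, not part of the original script.

```php
<?php
// Hypothetical helper: appends "index.html" to every href that ends in "/".
function appendIndexToLinks($html)
{
    $dom = new DOMDocument();
    // Suppress the warnings libxml raises for imperfect real-world HTML.
    @$dom->loadHTML($html);

    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if (substr($href, -1) === '/') {
            $link->setAttribute('href', $href . 'index.html');
        }
    }

    // Return the serialized document; saveHTML() changes nothing in place.
    return $dom->saveHTML();
}

$out = appendIndexToLinks('<a href="http://test.dev/docs/">Docs</a> <a href="page.html">Page</a>');
echo $out;
```

In the script from the question, the returned string (rather than the original $file_contents) would then be the one passed on to str_replace and file_put_contents.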
I have an HTML string. I want to traverse it and extract some information. My code is as follows:
$str = '<p>aaa</p><img src="http://stackoverflow.com/questions/ask"/><p>sss</p><img src="http://stackoverflow.com/"/>';
function parseContent($str) {
$contents = array();
$dom = new DOMDocument('1.0', 'UTF-8');
if (!$dom->loadHTML($str)) {
return $contents;
}
$stack = array($dom);
while (count($stack) > 0) {
$node = array_shift($stack);
foreach ($node->childNodes as $node) {
if ($node->hasChildNodes()) {
$stack[] = $node;
} else {
switch ($node->nodeType) {
case XML_ELEMENT_NODE:
if ('img' == $node->tagName) {
$contents[] = $node->attributes->getNamedItem('src')->nodeValue;
}
break;
case XML_TEXT_NODE:
$contents[] = $node->textContent;
break;
}
}
}
}
return $contents;
}
The problem is: When I dumped the return value of this function, it was something like this:
array(
'http://stackoverflow.com/questions/ask',
'http://stackoverflow.com/',
'aaa',
'sss',
)
Could someone point out why the order was lost?
Extending from comment:
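That's because each <p> also has a child node (a text node), so it takes the first if ($node->hasChildNodes()) branch and is pushed onto the stack again. Its text is therefore collected only later, after the <img> elements at the same level have already been handled.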
To avoid this, one way is to add one more condition:
/* ... */
if ($node->hasChildNodes()) {
if ($node->childNodes->length==1 && $node->childNodes->item(0)->nodeType==XML_TEXT_NODE) {
$contents[] = $node->childNodes->item(0)->textContent;
} else {
$stack[] = $node;
}
} else {
/* ... */
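Another way to keep document order, sketched here as an alternative rather than as the answer's method, is to recurse depth-first instead of queueing nodes. The function name collectContent is hypothetical.

```php
<?php
// Hypothetical recursive variant of parseContent(): nodes are visited
// depth-first, so results come out in document order.
function collectContent(DOMNode $node, array &$contents)
{
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $contents[] = $child->textContent;
        } elseif ($child->nodeType === XML_ELEMENT_NODE) {
            if ($child->tagName === 'img') {
                $contents[] = $child->getAttribute('src');
            }
            // Descend immediately instead of queueing the node for later.
            collectContent($child, $contents);
        }
    }
}

$str = '<p>aaa</p><img src="http://stackoverflow.com/questions/ask"/><p>sss</p><img src="http://stackoverflow.com/"/>';

$dom = new DOMDocument('1.0', 'UTF-8');
$contents = array();
if (@$dom->loadHTML($str)) {
    collectContent($dom, $contents);
}
print_r($contents);
```

Unlike the single-text-child check, this also handles elements that mix text with nested tags.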
I tried to concatenate the innerHTML of a div into a string variable:
games variable:
$games = '';
DOMinnerHTML function:
function DOMinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
ExtractFromType function:
function ExtractFromType($type)
{
$html = file_get_contents('www.site.com/' .$type);
$dom = new domDocument;
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
if (strpos($div->getAttribute('style'),'MyString') !== false) {
//////
$games = $games.DOMinnerHTML($div);
//////
}
}
}
code:
ExtractFromType('MyType');
echo $games; // = Nothing.
This code returns nothing.
$games is defined in the global scope, and it's not available inside ExtractFromType. Define it inside the function, then return the value:
function ExtractFromType($type) {
    $html = file_get_contents('www.site.com/' .$type);
    $dom = new domDocument;
    @$dom->loadHTML($html);
    $dom->preserveWhiteSpace = false;
    $divs = $dom->getElementsByTagName('div');
    // Local accumulator; the global $games is not visible in here.
    $games = '';
    foreach ($divs as $div) {
        if (strpos($div->getAttribute('style'),'MyString') !== false) {
            $games = $games.DOMinnerHTML($div);
        }
    }
    // Hand the result back to the caller.
    return $games;
}
echo ExtractFromType('MyType');
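To try the "build locally, then return" pattern without a network request, here is a minimal sketch that takes the HTML as a string. The function name extractInnerHtmlByStyle, its parameters, and the sample markup are assumptions made for illustration. It also serializes each child with saveHTML($child) instead of the DOMinnerHTML helper from the question.

```php
<?php
// Hypothetical variant that takes the HTML as a string, so it runs offline.
function extractInnerHtmlByStyle($html, $needle)
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    $games = ''; // local to the function, so it has to be returned
    foreach ($dom->getElementsByTagName('div') as $div) {
        if (strpos($div->getAttribute('style'), $needle) !== false) {
            foreach ($div->childNodes as $child) {
                // saveHTML() with a node argument serializes only that node.
                $games .= $dom->saveHTML($child);
            }
        }
    }
    return $games;
}

$page = '<div style="color:red">skip</div><div style="MyString"><b>one</b> two</div>';
$games = extractInnerHtmlByStyle($page, 'MyString');
echo $games;
```

The caller receives the string as a return value, so no global variable is needed.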
I'm trying to add the results of a script to an array, but when I look into it there is only one item in it. I'm probably being silly with the placement.
function crawl_page($url, $depth)
{
static $seen = array();
$Linklist = array();
if (isset($seen[$url]) || $depth === 0) {
return;
}
$seen[$url] = true;
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href');
if (0 !== strpos($href, 'http')) {
$href = rtrim($url, '/') . '/' . ltrim($href, '/');
}
if(shouldScrape($href)==true)
{
crawl_page($href, $depth - 1);
}
}
echo "URL:",$url;
echo http_response($url);
echo "<br/>";
$Linklist[] = $url;
$XML = new DOMDocument('1.0');
$XML->formatOutput = true;
$root = $XML->createElement('Links');
$root = $XML->appendChild($root);
foreach ($Linklist as $value)
{
$child = $XML->createElement('Linkdetails');
$child = $root->appendChild($child);
$text = $XML->createTextNode($value);
$text = $child->appendChild($text);
}
$XML->save("linkList.xml");
}
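$Linklist[] = $url; adds a single item to the $Linklist array, and because $Linklist = array(); runs at the start of every call, each call starts with an empty array. I think this line needs to be in a loop.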
I think you need static $Linklist = array(); so that the array persists between calls, but the code is awful.
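One way to collect every visited URL, sketched under assumptions: pass a single array by reference through the recursion and write the XML once at the end. The $site array below is a made-up in-memory stand-in for real pages so the example runs offline, and the function name crawl is hypothetical. The result is echoed here rather than saved to linkList.xml.

```php
<?php
// Hypothetical in-memory "site": each key is a URL, each value the URLs it links to.
$site = array(
    'http://example.test/'  => array('http://example.test/a', 'http://example.test/b'),
    'http://example.test/a' => array('http://example.test/b'),
    'http://example.test/b' => array('http://example.test/'),
);

// The list is passed by reference, so every recursive call appends to the same array.
function crawl(array $site, $url, $depth, array &$linkList)
{
    if ($depth === 0 || in_array($url, $linkList, true)) {
        return; // depth exhausted or URL already visited
    }
    $linkList[] = $url;
    $links = isset($site[$url]) ? $site[$url] : array();
    foreach ($links as $href) {
        crawl($site, $href, $depth - 1, $linkList);
    }
}

$linkList = array();
crawl($site, 'http://example.test/', 3, $linkList);

// Build the XML once, after the crawl has finished.
$xml = new DOMDocument('1.0');
$xml->formatOutput = true;
$root = $xml->appendChild($xml->createElement('Links'));
foreach ($linkList as $value) {
    $child = $root->appendChild($xml->createElement('Linkdetails'));
    $child->appendChild($xml->createTextNode($value));
}
echo $xml->saveXML();
```

Writing the file inside the recursive function, as in the question, would overwrite it on every call; building it once afterwards avoids that.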