Can I use str_replace in all XML nodes with PHP?

I think the problem is with my logic and I am probably going about this the wrong way. What I want is to
open an xml document with php
get elements by Tag name
then for each node that has child nodes
replace every letter a with ა, every b with ბ, and so on.
Here is the code I have so far, but it doesn't work.
xmlDoc=loadXMLDoc("temp/word/document.xml");
$nodes = xmlDoc.getElementsByTagName("w:t");
foreach ($nodes as $node) {
while( $node->hasChildNodes() ) {
$node = $node->childNodes->item(0);
}
$node->nodeValue = str_replace("a","ა",$node->nodeValue);
$node->nodeValue = str_replace("b","ბ",$node->nodeValue);
$node->nodeValue = str_replace("g","გ",$node->nodeValue);
$node->nodeValue = str_replace("d","დ",$node->nodeValue);
// More replacements for each letter in the alphabet.
}
I thought it might be because of the multiple str_replace() calls but it doesn't work with even just one. Am I going about this the wrong way or have I missed something?

The $node variable gets overwritten on each iteration, so only the last $node will get modified (if at all). You need to do the replacement inside the loop and then use the saveXML() method to return the modified XML markup.
Your code (with some improvements):
$xmlDoc = new DOMDocument();
$xmlDoc->load('temp/word/document.xml');
// Define the search/replace pairs once, outside the loops
$search = array('a', 'b', 'g', 'd');
$replace = array('ა', 'ბ', 'გ', 'დ');
foreach ($xmlDoc->getElementsByTagName("w:t") as $node) {
    while ($node->hasChildNodes()) {
        $node = $node->childNodes->item(0);
        $node->nodeValue = str_replace($search, $replace, $node->nodeValue);
    }
}
echo $xmlDoc->saveXML();
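A minimal sketch of the same idea, assuming the file is WordprocessingML with the w prefix bound to the standard namespace URI. The sample XML string and the function name transliterateWordText are made up for illustration. It selects the elements by namespace instead of by prefixed name and only touches their text nodes:

```php
<?php
// Hypothetical helper: replaces letters in the text of every <w:t> element.
function transliterateWordText(string $xml, array $map): string
{
    // Namespace URI used by WordprocessingML documents for the "w" prefix
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    foreach ($doc->getElementsByTagNameNS($ns, 't') as $element) {
        foreach ($element->childNodes as $child) {
            // Only rewrite text nodes; nested elements are left alone
            if ($child instanceof DOMText) {
                $child->nodeValue = strtr($child->nodeValue, $map);
            }
        }
    }
    return $doc->saveXML();
}

// Made-up sample input standing in for temp/word/document.xml
$sample = '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        . '<w:body><w:p><w:r><w:t>bad</w:t></w:r></w:p></w:body></w:document>';
$map = ['a' => 'ა', 'b' => 'ბ', 'g' => 'გ', 'd' => 'დ'];
echo transliterateWordText($sample, $map);
```

strtr() with an array applies all pairs in a single pass, so a character produced by one replacement is never replaced again by a later pair, which can happen with chained str_replace() calls.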

Related

Removing every li tag before reaching the first p tag in string

Suppose I have a string containing some HTML. I want to remove every li tag before reaching the first p tag.
How do I achieve something like that?
Example string:
$str = "<img src='something.png'/>some_text_here<li>needs_to_be_removed</li>
<li>also_needs_to_be_removed</li>some_other_text<p>finally</p>more_text_here
<li>this_should_not_be_removed</li>";
The first two li tags need to be removed.
Here is a simple approach. Note that it removes everything between the first <li> and the first <p>, including any text that sits between the li tags:
$mystring = "mystringwith<li>toberemovedstring</li><li>againremove</li><p>do not remove me</p>"; // the string you provide
$findme = '<li>'; // where the removal starts
$findpee = '<p>'; // where the removal ends
$pos = strpos($mystring, $findme); // position of the first <li>
$pospee = strpos($mystring, $findpee); // position of the first <p>
// Remove the span between the two positions, but only if both were found in that order
if ($pos !== false && $pospee !== false && $pos < $pospee) {
    $result = substr_replace($mystring, "", $pos, $pospee - $pos);
} else {
    $result = $mystring;
}
echo $result;
Edit: PHP sandbox
http://sandbox.onlinephpfunctions.com/code/e534259e2312682a04b64c6e3aae1521422aacd2
you can check the result here as well
You can do it with PHP's DOMDocument using the traversal function below:
$doc = new DOMDocument();
$doc->loadHTML($str);
$foundp = false;
showDOMNode($doc);
// now $doc contains the markup you want
$newstr = $doc->saveHTML();
function showDOMNode(DOMNode $domNode) {
    global $foundp;
    // Loop over a copy: childNodes is a live list, and removing nodes while iterating it directly skips siblings
    foreach (iterator_to_array($domNode->childNodes) as $node) {
        if ($node->nodeName == "li" && $foundp == false) {
            // delete this node
            $domNode->removeChild($node);
        } else if ($node->nodeName == "p") {
            // stop here
            $foundp = true;
            return;
        } else if ($node->hasChildNodes() && $foundp == false) {
            // recurse into the child nodes
            showDOMNode($node);
        }
    }
}
With XPath:
$str = "<img src='something.png'/>some_text_here<li>needs_to_be_removed</li>
<li>also_needs_to_be_removed</li>some_other_text<p>finally</p>more_text_here
<li>this_should_not_be_removed</li>";
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML('<div>' . $str .'</div>', LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
// ^---------------^----- add a root element
$xp = new DOMXPath($dom);
$lis = $xp->query('//p[1]/preceding-sibling::li');
foreach ($lis as $li) {
    $li->parentNode->removeChild($li);
}
$result = '';
// add each child node of the root element to the result
foreach ($dom->getElementsByTagName('div')->item(0)->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
I would suggest using a PHP parser library, which is a much more convenient approach. I personally use https://github.com/paquettg/php-html-parser in my projects. It provides APIs like
$child->nextSibling()
$content->innerHtml
$content->firstChild()
and more, which can come in handy.
You can loop over all elements with foreach, keep track of the li tags you pass, and once you reach the first p tag, delete the li tags that came before it (reachable via $child->previousSibling()).
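The loop described above can be sketched with PHP's built-in DOM extension instead of the library, so no library API is assumed here. The function name removeLiBeforeFirstP is hypothetical, and the sketch assumes the li and p tags are siblings at the top level of the string, as in the question's example:

```php
<?php
// Hypothetical helper: drops top-level <li> elements that come before the first <p>.
function removeLiBeforeFirstP(string $html): string
{
    libxml_use_internal_errors(true); // tolerate loose HTML fragments
    $dom = new DOMDocument();
    // Wrap the fragment in a <div> so it has a single root element
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $root = $dom->documentElement;
    // Iterate over a copy because childNodes is a live list
    foreach (iterator_to_array($root->childNodes) as $child) {
        if ($child->nodeName === 'p') {
            break; // first <p> reached: stop removing
        }
        if ($child->nodeName === 'li') {
            $root->removeChild($child);
        }
    }
    // Serialize the children of the wrapper, not the wrapper itself
    $result = '';
    foreach ($root->childNodes as $child) {
        $result .= $dom->saveHTML($child);
    }
    return $result;
}

$str = "<img src='something.png'/>some_text_here<li>needs_to_be_removed</li>
<li>also_needs_to_be_removed</li>some_other_text<p>finally</p>more_text_here
<li>this_should_not_be_removed</li>";
echo removeLiBeforeFirstP($str);
```

Unlike the substr_replace approach, this keeps the text that sits between the removed li tags.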

Website Scraping Using Regex trying to extract integers

I'm having trouble extracting the integers between the brackets from this website.
Part of markup from the website:
<span class="b-label b-link-number" data-num="(322206)">Music & Video</span>
<span class="b-label b-link-number" data-num="(954218)">Toys, Hobbies & Games</span>
<span class="b-label b-link-number" data-num="(502981)">Kids, Baby & Maternity</span>
How do I extract the integers between the brackets?
Desired output:
322206
954218
502981
Should I use a regex, since the elements share the same class name? (A regex that only matches text between brackets won't do, because the source code contains other, unwanted bracketed elements as well.)
Normally, this is how I extract information:
<?php
//header('Content-Type: text/html; charset=utf-8');
$grep = new DOMDocument();
@$grep->loadHTMLFile("http://global.rakuten.com/en/search/?tl=&k=");
$finder = new DOMXPath($grep);
$class = "b-list-item";
$nodes = $finder->query("//*[contains(@class, '$class')]");
foreach ($nodes as $node) {
    $span = $node->childNodes;
    $search = array(0,1,2,3,4,5,6,7,8,9,'(',')');
    $categories = str_replace($search, '', $span->item(0)->nodeValue);
    echo '<br>' . '<font color="green">' . $categories . ' ' . '</font>' ;
}
?>
but since the data I want is inside the tag, how do I extract them?
Building on your current code, it's straightforward: just change $class to the class you want and use ->getAttribute() to get the data-num values:
$grep = new DOMDocument();
@$grep->loadHTMLFile("http://global.rakuten.com/en/search/?tl=&k=");
$finder = new DOMXPath($grep);
$class = "b-link-number"; // change the span class
$nodes = $finder->query("//*[contains(@class, '$class')]"); // target those
$numbers = array();
foreach ($nodes as $node) { // for every element found
    $link_num = $node->getAttribute('data-num'); // get the attribute `data-num`
    $link_num = str_replace(['(', ')'], '', $link_num); // simply remove the parentheses
    $numbers[] = $link_num; // push it into the container
}
echo '<pre>';
print_r($numbers);
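To try the getAttribute() approach without fetching the live page, here is a small self-contained sketch that runs on the markup from the question. The function name extractDataNums is hypothetical, and loading the markup from a string instead of the URL is an assumption made for the example:

```php
<?php
// Hypothetical helper: collects the digits from data-num on spans with the given class.
function extractDataNums(string $html, string $class): array
{
    libxml_use_internal_errors(true); // the sample contains bare "&" characters
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    // Match the class as a whole word inside the class attribute
    $query = "//span[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    $numbers = [];
    foreach ($xpath->query($query) as $span) {
        // Strip the surrounding parentheses from values like "(322206)"
        $numbers[] = trim($span->getAttribute('data-num'), '()');
    }
    return $numbers;
}

$html = <<<'HTML'
<span class="b-label b-link-number" data-num="(322206)">Music & Video</span>
<span class="b-label b-link-number" data-num="(954218)">Toys, Hobbies & Games</span>
<span class="b-label b-link-number" data-num="(502981)">Kids, Baby & Maternity</span>
HTML;
print_r(extractDataNums($html, 'b-link-number'));
```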
<span[^>)()]*\((\d+)\)[^>]*>
Try this regex and grab the first capture group. See the demo:
http://regex101.com/r/iM2wF9/10
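As a rough illustration of how that pattern could be used from PHP, the sketch below applies it with preg_match_all() to the markup from the question. The variable names are made up for the example:

```php
<?php
$html = <<<'HTML'
<span class="b-label b-link-number" data-num="(322206)">Music & Video</span>
<span class="b-label b-link-number" data-num="(954218)">Toys, Hobbies & Games</span>
<span class="b-label b-link-number" data-num="(502981)">Kids, Baby & Maternity</span>
HTML;

// Group 1 captures the digits between the parentheses inside the opening span tag
preg_match_all('/<span[^>)()]*\((\d+)\)[^>]*>/', $html, $matches);
$numbers = $matches[1];
print_r($numbers);
```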

URL decode all values in xml document in php

I'm after a way of making simplexml_load_string return a document where all the text values are urldecoded. For example:
$xmlstring = "<my_element>2013-06-19+07%3A20%3A51</my_element>";
$xml = simplexml_load_string($xmlstring);
$value = $xml->my_element;
//and value would contain: "2013-06-19 07:20:51"
Is it possible to do this? I'm not concerned about attribute values, although that would be fine if they were also decoded.
Thanks!
You can run
$value = urldecode($value);
which will decode your string.
See: http://www.php.net/manual/en/function.urldecode.php
As long as each value is inside an element of its own, this is possible. (In SimpleXML you cannot process text nodes on their own; compare with the table in "Which DOMNodes can be represented by SimpleXMLElement?")
As others have outlined, this works by applying the urldecode function to each of these elements.
To do that, you need to change and add some lines of code:
$xml = simplexml_load_string($xmlstring, 'SimpleXMLIterator');
if (!$xml->children()->count()) {
    $nodes = [$xml];
} else {
    $nodes = new RecursiveIteratorIterator($xml, RecursiveIteratorIterator::LEAVES_ONLY);
}
foreach ($nodes as $node) {
    $node[0] = urldecode($node);
}
This code example makes sure that each leaf is processed and, in case the document consists of only the root element, that the root itself is processed. Afterwards, the whole document is changed, so you can access it as usual. Demo:
<?php
/**
 * URL decode all values in XML document in PHP
 * @link https://stackoverflow.com/q/17805643/367456
 */
$xmlstring = "<root><my_element>2013-06-19+07%3A20%3A51</my_element></root>";
$xml = simplexml_load_string($xmlstring, 'SimpleXMLIterator');
$nodes = $xml->children()->count()
    ? new RecursiveIteratorIterator(
        $xml, RecursiveIteratorIterator::LEAVES_ONLY
    )
    : [$xml];
foreach ($nodes as $node) {
    $node[0] = urldecode($node);
}
echo $value = $xml->my_element; # prints "2013-06-19 07:20:51"
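If the values are not each wrapped in an element of their own (mixed content), one possible alternative is to decode the text nodes directly with the DOM extension. This is a sketch that swaps SimpleXML for DOMDocument and DOMXPath; the function name urldecodeTextNodes and the sample string are made up for illustration:

```php
<?php
// Hypothetical helper: applies urldecode() to every text node in the document.
function urldecodeTextNodes(string $xmlString): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xmlString);
    $xpath = new DOMXPath($doc);
    // //text() selects all text nodes, including text mixed in between elements
    foreach ($xpath->query('//text()') as $textNode) {
        $textNode->nodeValue = urldecode($textNode->nodeValue);
    }
    // Serialize only the root element, without the XML declaration
    return $doc->saveXML($doc->documentElement);
}

echo urldecodeTextNodes(
    '<root><my_element>2013-06-19+07%3A20%3A51</my_element> tail%21</root>'
);
```

The result can be passed to simplexml_load_string() afterwards if the rest of the code expects a SimpleXMLElement.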

Modify html attribute with php

I have a html string that contains exactly one a-element in it. Example:
test
In PHP I have to test if rel contains external and, if so, modify href and save the string.
I have looked at DOM nodes and objects, but they seem like too much for a single a-element: I have to iterate to get the HTML nodes, and I am not sure how to test whether rel exists and contains external.
$html = new DOMDocument();
$html->loadHtml($txt);
$a = $html->getElementsByTagName('a');
$attr = $a->item(0)->attributes();
...
At this point I get a DOMNamedNodeMap, which seems like overhead. Is there a simpler way to do this, or should I do it with DOM?
Is there a simpler way to do this, or should I do it with DOM?
Do it with DOM.
Here's an example:
<?php
$html = 'test';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]");
foreach ($nodes as $node) {
    $node->setAttribute('href', 'http://example.org');
}
echo $dom->saveHTML();
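Because the anchor markup did not survive in the example string above, here is a self-contained sketch of the same XPath technique on a made-up input. The function name rewriteExternalHref, the sample anchors, and the div wrapper used to get the fragment back without html/body tags are all assumptions for illustration:

```php
<?php
// Hypothetical helper: sets href on every <a> whose rel contains the token "external".
function rewriteExternalHref(string $html, string $newHref): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // Wrap the fragment so it has a single root and no implied html/body
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);
    // Token match: "external" must be a whole word within rel
    $query = "//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]";
    foreach ($xpath->query($query) as $a) {
        $a->setAttribute('href', $newHref);
    }
    // Serialize the wrapper's children only
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Made-up input: one anchor with rel="nofollow external"
echo rewriteExternalHref(
    '<a href="http://old.example/" rel="nofollow external">test</a>',
    'http://example.org'
);
```

The concat/normalize-space wrapping makes the query match external only as a separate token, so a value such as rel="externally" is not matched.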
I went ahead and did the modification with DOM. This is what I ended up with:
$html = new DOMDocument();
$html->loadHTML('<?xml encoding="utf-8" ?>' . $txt);
$nodes = $html->getElementsByTagName('a');
foreach ($nodes as $node) {
    foreach ($node->attributes as $att) {
        if ($att->name == 'rel') {
            // Compare with false explicitly: strpos() returns 0 when the value starts with "external"
            if (strpos($att->value, 'external') !== false) {
                $node->setAttribute('href','modified_url_goes_here');
            }
        }
    }
}
$txt = $html->saveHTML();
I did not want to load any other library for just this one string.
The best way is to use an HTML parser/DOM, but here's a regex solution:
$html = 'test<br>
<p> Some text</p>
test2<br>
<a rel="external">test3</a> <-- This won\'t work since there is no href in it.
';
$new = preg_replace_callback('/<a.+?rel\s*=\s*"([^"]*)"[^>]*>/i', function($m){
    if(strpos($m[1], 'external') !== false){
        $m[0] = preg_replace('/href\s*=\s*(("[^"]*")|(\'[^\']*\'))/i', 'href="http://example.com"', $m[0]);
    }
    return $m[0];
}, $html);
echo $new;
Online demo.
You could use a regular expression: if the tag matches
/\s+rel\s*=\s*".*external.*"/
then do a regex replace like
/(<a.*href\s*=\s*")([^"]*)("[^>]*>)/\1[your new href here]\3/
That said, using a library that handles this kind of thing for you is much easier (like jQuery for JavaScript).

Dom replace entire node

Right now, I have this:
$text = $row->text;
$dom = new DOMDocument();
$dom->loadHTML($text);
$tags = $dom->getElementsByTagName('img');
foreach ($tags as $tag) {
$eg = $tag->getAttribute('data-easygal');
$src = $tag->getAttribute('src');
$values = explode("_",$eg);
$display = $this->prepareAlbum($values[0],$values[1],$src);
}
$row->text = $text;
Is there a way to replace the whole node $tag with what's in the $display string? I can't seem to find out how to str_replace the node, for example.
I used to use preg_replace, but that doesn't work properly on the client's server even though it works at home (and using preg on HTML draws some instant anger from the PHP community).
I tried searching the board, but had no luck finding what I need.
Something like:
foreach($tags as &$tag) {
...
$tag = new DomNode();
}
Try
$tag->parentNode->replaceChild($newNode, $tag);
This should replace the $tag node with $newNode, a DOM node that you create in the usual way (for example with $dom->createElement()).
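Since $display in the question is a markup string rather than a node, one way to apply replaceChild() is to turn the string into a document fragment first. This is a sketch under assumptions: the function name replaceImages and the render callback are hypothetical stand-ins for the question's prepareAlbum(), and appendXML() requires the string to be well-formed XML:

```php
<?php
// Hypothetical helper: replaces every <img> with the markup returned by $render.
function replaceImages(string $html, callable $render): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    // Copy the live node list first; replacing nodes while iterating it skips elements
    foreach (iterator_to_array($dom->getElementsByTagName('img')) as $img) {
        $fragment = $dom->createDocumentFragment();
        // appendXML() expects well-formed XML (assumption about the rendered markup)
        $fragment->appendXML($render($img));
        $img->parentNode->replaceChild($fragment, $img);
    }
    // Serialize the wrapper's children only
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Made-up input and renderer, for illustration only
echo replaceImages(
    '<p>before <img src="a.png" data-easygal="1_2"> after</p>',
    function (DOMElement $img): string {
        [$album, $size] = explode('_', $img->getAttribute('data-easygal'));
        return '<span class="album">' . $album . '-' . $size . '</span>';
    }
);
```

The returned string can then be assigned back to $row->text, which the original loop never updates.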
