I am using DomDocument to pull content from a specific div on a page.
I would then like to replace all instances of links with a path equal to http://example.com/test/ with http://example.com/test.php.
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$doc = new DomDocument('1.0', 'UTF-8');
$doc->loadHtml(file_get_contents($url));
$div = $doc->getElementById('upcoming_league_dates');
foreach ($div->getElementsByTagName('a') as $item) {
$item->setAttribute('href', 'http://example.com/test.php');
}
echo $doc->saveHTML($div);
As you can see in the example above, str_replace causes problems after I target the upcoming_league_dates div with getElementById. I understand this but unfortunately I don't know where to go from here!
I've tried several different ways including executing the str_replace above the getElementById function (I figured I could replace the strings first and then target the specific div), with no luck.
What am I missing here?
EDIT: UPDATED CODE TO SHOW WORKING SOLUTION
You can't just use str_replace on that node. You need to access it properly first. Through the DOMElement class you can use the ->setAttribute() method to make the replacement.
Example:
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTMLFile($url);
$xpath = new DOMXpath($dom); // use xpath
$needle = 'http://example.com/test/';
$replacement = 'http://example.com/test.php';
// target the link
$links = $xpath->query("//div[@id='upcoming_league_dates']/a[contains(@href, '$needle')]");
foreach($links as $anchor) {
// replacement of those href values
$anchor->setAttribute('href', $replacement);
}
echo $dom->saveHTML();
Update: After your revision, your code is now working anyway. This is just to answer your logic replacement (ala str_replace search/replace) on your previous question.
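A minimal sketch of the same idea that runs without fetching the remote page. It assumes an inline HTML string (the markup and link texts are made up for illustration), rewrites only the links inside the target div whose href contains the needle, and outputs just that div. The `//a` step (any descendant, not only direct children) is an assumption about the page layout.

```php
<?php
// Hypothetical sample markup standing in for the remote page.
$html = '<html><body>'
      . '<div id="upcoming_league_dates">'
      . '<a href="http://example.com/test/">League A</a>'
      . '<a href="http://example.com/other/">League B</a>'
      . '</div>'
      . '<a href="http://example.com/test/">Outside the div</a>'
      . '</body></html>';

$needle      = 'http://example.com/test/';
$replacement = 'http://example.com/test.php';

$dom = new DOMDocument('1.0', 'UTF-8');
libxml_use_internal_errors(true); // keep parser warnings quiet
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
// Any <a> below the div whose href contains the needle.
$links = $xpath->query("//div[@id='upcoming_league_dates']//a[contains(@href, '$needle')]");

foreach ($links as $anchor) {
    // str_replace-style swap, applied to the attribute value only.
    $anchor->setAttribute(
        'href',
        str_replace($needle, $replacement, $anchor->getAttribute('href'))
    );
}

// Serialize only the target div, not the whole document.
$div    = $dom->getElementById('upcoming_league_dates');
$result = $dom->saveHTML($div);
echo $result;
```

Because the search/replace runs on the attribute value rather than on the serialized HTML, link text or other markup that happens to contain the same URL is left alone.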
Related
I need to be able to parse some text and find all the instances where an <a> tag has target="_blank", and for each match add (for example) "This link opens in a new window" before the closing tag.
For example:
Before:
Go here now
After:
Go here now<span>(This link opens in a new window)</span>
This is for a PHP site, so I assume preg_replace() will be the method. I just don't have the skills to write the regex properly.
Thanks in advance for any help anyone can offer.
You should never use a regex to parse HTML, except maybe in extremely well-defined and controlled circumstances.
Instead, try a built-in parser:
$dom = new DOMDocument();
$dom->loadHTML($your_html_source);
$xpath = new DOMXPath($dom);
$links = $xpath->query("//a[@target='_blank']");
foreach($links as $link) {
$link->appendChild($dom->createTextNode(" (This link opens in a new window)"));
}
$output = $dom->saveHTML();
Alternatively, if this is being output to the browser, you can just use CSS:
a[target='_blank']:after {
content: ' (This link opens in a new window)';
}
This will work for anchor tag replacement....
$string = str_replace('<a ','<a target="_blank" ',$string);
Well, @Kolink is right, but here's my RegExp version.
$string = '<p>mess</p>Google<p>mess</p>';
echo preg_replace("/(\<a.*?target=\"_blank\".*?>)(.*?)(\<\/a\>)/miU","$1$2(This link opens in a new window)$3",$string);
This does the job:
$newText = '<span>(This link opens in a new window)</span>';
$pattern = '~<a\s[^>]*?\btarget\s*=(?:\s*([\'"])_blank\1|_blank\b)[^>]*>[^<]*(?:<(?!/a>)[^<]*)*\K~i';
echo preg_replace($pattern, $newText, $html);
However, this direct string approach may also replace commented-out HTML, strings or comments in CSS or JavaScript code, and even JavaScript regex literals, which is unneeded at best and unwanted at worst. That's why you should use a DOM approach if you want to avoid these pitfalls. All you have to do is append a new node to each link that has the desired attribute:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$nodeList = $xp->query('//a[@target="_blank"]');
foreach($nodeList as $node) {
$newNode = $dom->createElement('span', '(This link opens in a new window)');
$node->appendChild($newNode);
}
$html = $dom->saveHTML();
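To make the pitfall above concrete, here is a small sketch comparing the two approaches on input that contains a commented-out link. The sample HTML is made up, and the regex is a deliberately simplified stand-in (it assumes double-quoted attributes and text-only links), not the pattern from this answer.

```php
<?php
// Hypothetical input: one live link and one commented-out link.
$html = '<p><a href="/live" target="_blank">Live</a></p>'
      . '<!-- <a href="/old" target="_blank">Old</a> -->';
$note = '<span>(This link opens in a new window)</span>';

// String approach: a simplified pattern that cannot tell a comment from live markup.
$viaRegex = preg_replace(
    '~(<a\s[^>]*target="_blank"[^>]*>[^<]*)(</a>)~i',
    '$1' . $note . '$2',
    $html
);

// DOM approach: the commented-out link is a comment node, so the XPath query never sees it.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
foreach ($xp->query('//a[@target="_blank"]') as $node) {
    $node->appendChild($dom->createElement('span', '(This link opens in a new window)'));
}
$viaDom = $dom->saveHTML();

// Count how many notes each approach inserted.
echo substr_count($viaRegex, $note), "\n";
echo substr_count($viaDom, $note), "\n";
```

The regex version also edits the link inside the comment, while the DOM version only touches the live anchor.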
Finally, a last alternative is to leave the HTML unchanged and do it with CSS:
a[target="_blank"]::after {
content: " (This link opens in a new window)";
font-style: italic;
color: red;
}
You won't be able to write a regex that will evaluate an infinitely long string. I suggest:
$h = explode('>', $html);
This will give you the chance to traverse it like any other array and then do:
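foreach ($h as $i => $k) {
    // Skip pieces that are not an opening <a ...> tag with target="_blank".
    if (!preg_match('/<a\s[^>]*href=/i', $k)) {
        continue;
    } elseif (!preg_match('/target="_blank"/', $k)) {
        continue;
    } elseif (isset($h[$i + 1])) {
        // The next piece holds the link text up to "</a"; insert the note before it.
        $h[$i + 1] = str_replace('</a', '(open in new window)</a', $h[$i + 1]);
    }
}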
$html = implode('>', $h);
This is how I would approach such a problem. Of course, I just threw this out off the top of my head and it is not guaranteed to work as is, but with a few tweaks to fit your exact logic you will have what you need.
I'm using DOMDocument to retrieve a specific div from an HTML page.
I just want to retrieve the content of this div, without the div tag itself.
For example :
$dom = new DOMDocument;
$dom->loadHTML($webtext['content']);
$main = $dom->getElementById('inter');
$dom->saveHTML();
Here is the result I get:
<div id="inter">
//SOME THINGS IN MY DIV
</div>
And I just want to have:
//SOME THINGS IN MY DIV
Ideas ? Thanks !
I'm going to go with simple does it. You already have:
$dom = new DOMDocument;
$dom->loadHTML($webtext['content']);
$main = $dom->getElementById('inter');
$dom->saveHTML();
Now, DOMDocument::getElementById() returns one DOMElement, which extends DOMNode, which has the public string property nodeValue. Since you don't specify whether you are expecting anything but text within that div, I'm going to assume that you want anything stored in there as plain text. For that, we are going to remove $dom->saveHTML(); and replace it with:
$divString = $main->nodeValue;
With that, $divString will contain //SOME THINGS IN MY DIV, which, from your example, is the desired output.
If, however, you want the HTML of the inside of it and not just a String representation - replace it with the following instead:
$divString = "";
foreach($main->childNodes as $c)
$divString .= $c->ownerDocument->saveXML($c);
What that does is takes advantage of the inherited DOMNode::childNodes which contains a DOMNodeList each containing its own DOMNode (for reference, see above), and we loop through each one getting the ownerDocument which is a DOMDocument and we call the DOMDocument::saveXML() function. The reason we pass the current $c node in to the function is to prevent an entire valid document from being outputted, and because the ownerDocument is what we are looping through - we need to get one child at a time, with no children left behind. (sorry, it's late, couldn't resist.)
Now, after either option, you can do with $divString what you will. I hope this has helped explain the process to you and hopefully you walk away with a better understanding of what is going on instead of rote copying of code just because it works. ^^
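The two options can be compared side by side in a short sketch. The helper name `innerHtml` and the sample markup are made up for illustration, and it uses saveHTML() with a node argument rather than saveXML(), which is a swapped-in choice for HTML output.

```php
<?php
// Hypothetical helper: returns the markup inside $node without its own tag.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // Passing a node to saveHTML() serializes only that node.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML('<div id="inter">Hello <b>world</b>!</div>');
$main = $dom->getElementById('inter');

$text  = $main->nodeValue;  // option 1: text only, tags dropped
$inner = innerHtml($main);  // option 2: markup of the children
echo $text, "\n", $inner, "\n";
```

Use the first form when only the text matters and the second when nested tags inside the div must survive.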
You can use my custom function to strip the outer div from the content:
$html_string = '<div id="inter">
SOME THINGS IN MY DIV
</div>';
// custom function
function DOMgetinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
Your code will look like:
$dom = new DOMDocument;
$dom->loadHTML($html_string);
$divs = $dom->getElementsByTagName('div');
$innerHTML_contents = DOMgetinnerHTML($divs->item(0));
echo $innerHTML_contents;
and your output will be
SOME THINGS IN MY DIV
You can use XPath:
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//div[@id="inter"]/*') as $node)
{
    echo $node->nodeValue;
}
Or you can simply edit your code:
$main = $dom->getElementById('inter');
echo $main->nodeValue;
Is it possible to delete element from loaded DOM without creating a new one? For example something like this:
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $href)
if($href->nodeValue == 'First')
//delete
You remove the node by telling the parent node to remove the child:
$href->parentNode->removeChild($href);
See DOMNode::$parentNode and DOMNode::removeChild().
See as well:
How to remove attributes using PHP DOMDocument?
How to remove an HTML element using the DOMDocument class
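Applied to the loop from the question, a minimal sketch might look like this. The sample HTML is made up, and copying the node list into an array first is a precaution added here, since the list returned by getElementsByTagName() is live and changes as nodes are removed.

```php
<?php
// Hypothetical sample markup with two links whose text is "First".
$html = '<p><a href="/1">First</a> <a href="/2">Second</a> <a href="/3">First</a></p>';

$dom = new DOMDocument('1.0', 'utf-8');
libxml_use_internal_errors(true);
$dom->loadHTML($html);

// Copy the live node list into a plain array so removals cannot disturb the loop.
$links = iterator_to_array($dom->getElementsByTagName('a'));

foreach ($links as $href) {
    if ($href->nodeValue === 'First') {
        // Ask the parent to drop this child.
        $href->parentNode->removeChild($href);
    }
}

$remaining = $dom->getElementsByTagName('a');
echo $remaining->length, "\n";
```

No new document is created: the nodes are removed in place and `$dom->saveHTML()` afterwards reflects the change.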
This took me a while to figure out, so here's some clarification:
If you're deleting elements from within a loop (as in the OP's example), you need to loop backwards:
$elements = $completePage->getElementsByTagName('a');
for ($i = $elements->length; --$i >= 0; ) {
$href = $elements->item($i);
$href->parentNode->removeChild($href);
}
DOMNodeList documentation: You can modify, and even delete, nodes from a DOMNodeList if you iterate backwards
Easily:
$href->parentNode->removeChild($href);
I know this has already been answered but I wanted to add to it.
In case someone faces the same problem I have faced.
Looping through the domnode list and removing items directly can cause issues.
I just read this and, based on that, created a method in my own code base which works: https://www.php.net/manual/en/domnode.removechild.php
Here is what I would do:
$links = $dom->getElementsByTagName('a');
$links_to_remove = [];
foreach($links as $link){
$links_to_remove[] = $link;
}
foreach($links_to_remove as $link){
$link->parentNode->removeChild($link);
}
$dom->saveHTML();
To remove a tag or something similar:
removeChild($element->id());
full example:
$dom = new Dom;
$dom->loadFromUrl('URL');
$html = $dom->find('main')[0];
$html2 = $html->find('p')[0];
$span = $html2->find('span')[0];
$html2->removeChild($span->id());
echo $html2;
I'm trying to grab all the links and their content from a text, but my problem is that the links might also have other attributes like class or id. What would be the pattern for this?
What i tried so far is:
/<a href="(.*)">(.*)<\/a\>/
Thank You,
Radu
As the comment to your question states, avoid using regex for HTML. The correct way to do it is using DOMDocument:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$links = $xpath->query('//*/a');
foreach ($links as $link) {
/* do something with this */
$href = $link->getAttribute('href');
$text = $link->nodeValue;
}
Edit:
An even better answer on the subject
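As a self-contained sketch of the DOM approach, the following collects the href and text of every link regardless of which other attributes it carries. The sample HTML and the `$found` array are made up for illustration.

```php
<?php
// Hypothetical sample text; the links carry extra attributes such as class, id and title.
$html = '<p>See <a class="ext" href="http://example.com/a" id="l1">Page A</a> and '
      . '<a href="/b" title="B"><em>Page</em> B</a>.</p>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$found = [];
// Only anchors that actually have an href attribute.
foreach ($xpath->query('//a[@href]') as $link) {
    $found[] = [
        'href' => $link->getAttribute('href'),
        'text' => $link->textContent, // link text with nested tags flattened
    ];
}
print_r($found);
```

Attribute order and extra attributes do not matter here, which is exactly what the original regex struggled with.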
This should do it:
/<a .*?href="(.*?)"[^>]*>([^<]*)<\/a>/i
Read this and see if you still want to use it.
I'm working on using htmlpurifier to create a text-only version of my site.
I now need to replace all the a hrefs with the text only url i.e. 'www.example.com/aboutus' becomes 'www.example.com/text/aboutus'
Initially I tried a simple str_replace on the domain (I use a global variable for the domain), but the problem is links to files also get replaced i.e.
'www.example.com/document.pdf' becomes 'www.example.com/text/document.pdf' and therefore fails.
Is there a regular expression where I can say replace domain with domain/text where the url does not include string?
Thanks for any pointers you might be able to give me :)
Use a negative lookahead:
$output = preg_replace(
'#www\.example\.com(?!/text/)#',
'www.example.com/text',
$input
);
Better yet, use DOM with it:
$html = 'foo
<p>hello</p>
bar';
libxml_use_internal_errors(true); // suppresses DOM errors
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->query('//a/@href');
foreach ($hrefs as $href) {
$href->value = preg_replace(
'#^www\.example\.com(?!/text/)(.*?)(?<!\.pdf)$#',
'www.example.com/text\\1',
$href->value
);
}
This should give you:
foo
<p>hello</p>
bar
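For comparison, here is a sketch of the same rewrite with the rule written as plain string checks instead of lookarounds. The helper name `toTextOnlyUrl`, the sample links, and the choice to treat only `.pdf` as a file extension are assumptions for illustration.

```php
<?php
// Hypothetical helper: add the /text prefix unless the URL is off-domain,
// already prefixed, or points to a PDF file.
function toTextOnlyUrl(string $href, string $domain = 'www.example.com'): string
{
    $prefix = $domain . '/';
    if (strpos($href, $prefix) !== 0) {
        return $href; // not on our domain
    }
    $path = substr($href, strlen($prefix));
    if (strpos($path, 'text/') === 0 || preg_match('/\.pdf$/i', $path)) {
        return $href; // already text-only, or a file link
    }
    return $prefix . 'text/' . $path;
}

// Hypothetical sample markup.
$html = '<p><a href="www.example.com/aboutus">About</a> '
      . '<a href="www.example.com/document.pdf">PDF</a> '
      . '<a href="www.example.com/text/contact">Contact</a></p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$rewritten = [];
foreach ($xpath->query('//a/@href') as $href) {
    // Each result is a DOMAttr; assigning to ->value updates the document.
    $href->value = toTextOnlyUrl($href->value);
    $rewritten[] = $href->value;
}
print_r($rewritten);
```

Keeping the rule in a small function makes it easy to add more excluded extensions later without growing the regex.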