manipulate PHP domdocument string - php

I want to remove the <a> tags from the HTML in my DOMDocument while keeping their text.
I have something like
this is the <a href='#'>test link</a> here and <a href='#'>there</a>.
I want to change my html to
this is the test link here and there.
My code
$dom = new DomDocument();
$dom->loadHTML($html);
$atags=$dom->getElementsByTagName('a');
foreach($atags as $atag){
$value = $atag->nodeValue;
// I can get the "test link" and "there" values, but I don't know how to remove the <a> tag.
}
Thanks for the help!

You are looking for a method called DOMNode::replaceChild().
To use it, create a DOMText node from $value with DOMDocument::createTextNode() and replace the <a> element with it. Note that getElementsByTagName() returns a live, self-updating list: once you replace the first element, the former second element becomes the first, so a foreach that moves on to the second position skips it.
Instead, loop with a while on the first item:
$atags = $dom->getElementsByTagName('a');
while ($atag = $atags->item(0))
{
$node = $dom->createTextNode($atag->nodeValue);
$atag->parentNode->replaceChild($node, $atag);
}
Something along those lines should do it.
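If you prefer a foreach, one alternative is to copy the live list into a plain array first. The following is only a sketch under that assumption: the wrapping <p> and the variable names are made up for illustration, and passing a node to saveHTML() assumes PHP 5.3.6 or later.

```php
<?php
// Sketch: unwrap every <a> element, keeping only its text.
$html = "<p>this is the <a href='#'>test link</a> here and <a href='#'>there</a>.</p>";

$dom = new DOMDocument();
$dom->loadHTML($html);

// iterator_to_array() copies the live node list into a plain array,
// so replacing nodes inside the loop cannot shift the remaining items.
$anchors = iterator_to_array($dom->getElementsByTagName('a'));

foreach ($anchors as $anchor) {
    $text = $dom->createTextNode($anchor->nodeValue);
    $anchor->parentNode->replaceChild($text, $anchor);
}

// Serialize only the <p> element instead of the whole document.
$result = $dom->saveHTML($dom->getElementsByTagName('p')->item(0));
echo $result, "\n";
```

The snapshot costs one extra array, but it lets the loop body stay the same as in the question.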

You could just use strip_tags - it should do what you've asked.
<?php
$string = "this is the <a href='#'>test link</a> here and <a href='#'>there</a>.";
echo strip_tags($string);
// output: this is the test link here and there.

Related

Extract entire url content using Regex

Okay, I am using (PHP) file_get_contents to read some websites. Each of these sites has only one Facebook link, and after I fetch the entire page I would like to find the complete Facebook URL.
So in some part there will be:
<a href="http://facebook.com/username" >
I want to get http://facebook.com/username, that is, everything from the first (") to the last ("). The username is variable (it could be username.somethingelse), and there could be other attributes before or after "href".
Just in case I am not being very clear:
<a href="http://facebook.com/username" > //I want http://facebook.com/username
<a href="http://www.facebook.com/username" > //I want http://www.facebook.com/username
<a class="value" href="http://facebook.com/username. some" attr="value" > //I want http://facebook.com/username. some
Any of the examples above could also use single quotes:
<a href='http://facebook.com/username' > //I want http://facebook.com/username
Thanks to all
Don't use regex on HTML. It's a shotgun that'll blow off your leg at some point. Use DOM instead:
$dom = new DOMDocument;
$dom->loadHTML(...);
$xp = new DOMXPath($dom);
$a_tags = $xp->query("//a");
foreach($a_tags as $a) {
echo $a->getAttribute('href');
}
I would suggest using DOMDocument for this very purpose rather than using regex. Here is a quick code sample for your case:
$dom = new DOMDocument();
$dom->loadHTML($content);
// To hold all your links...
$links = array();
$hrefTags = $dom->getElementsByTagName("a");
foreach ($hrefTags as $hrefTag)
$links[] = $hrefTag->getAttribute("href");
print_r($links); // dump all links
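Both snippets above collect every link on the page, while the question only wants the Facebook one. A possible way to narrow the result is to check each href's host with parse_url(). This is a rough sketch: the function name facebookLinks() and the sample markup are hypothetical, and accepting only facebook.com and www.facebook.com is an assumption.

```php
<?php
// Hypothetical helper: return the href of every <a> that points at Facebook.
function facebookLinks($content)
{
    $dom = new DOMDocument();
    // Real-world pages are rarely valid HTML; collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $dom->loadHTML($content);
    libxml_clear_errors();

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        $host = parse_url($href, PHP_URL_HOST);
        // Assumption: only facebook.com and www.facebook.com count as matches.
        if (is_string($host) && preg_match('/^(www\.)?facebook\.com$/i', $host)) {
            $links[] = $href;
        }
    }
    return $links;
}

$content = '<a class="value" href="http://facebook.com/username" attr="value">fb</a>'
         . "<a href='http://www.facebook.com/other'>fb2</a>"
         . '<a href="http://example.com/">elsewhere</a>';
print_r(facebookLinks($content));
```

Because the parser handles quoting, the same code works for single-quoted and double-quoted attributes and for any attribute order.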

Add a nofollow attribute to link if no title tag present using PHP

I have a bunch of text with HTML in it. For every link found in this text, I want to add rel="nofollow", but only if the title attribute is not present.
For example, if a link looks like this:
<a href="test.html">test</a>
I want it to look like:
<a rel="nofollow" href="test.html">test</a>
But if the link looks like this:
<a title="test title" href="test.html">test</a>
I don't want to add the rel="nofollow" attribute to that. How can I do that in PHP?
EDIT:
I'm sorry I didn't mention this, but I am using PHP4. Yes, I know, but I'm stuck with PHP4.
Quite simply with DOMDocument:
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
if (!$link->hasAttribute('title')) {
$link->setAttribute('rel', 'nofollow');
}
}
$yourHTML = $dom->saveHTML();
This is far more stable and reliable than mucking about with regex.
First use preg_match to check whether a title attribute is present.
$str = '<a href="test.html">test</a>';
if(!preg_match('/title=/', $str))
{
$str = str_replace('href=', 'rel="nofollow" href=', $str);
}

Add an attribute to an HTML element

I can't quite figure it out, I'm looking for some code that will add an attribute to an HTML element.
For example lets say I have a string with an <a> in it, and that <a> needs an attribute added to it, so <a> gets added style="xxxx:yyyy;". How would you go about doing this?
Ideally it would add any attribute to any tag.
It's been said a million times: don't use regexes for HTML parsing.
$dom = new DOMDocument();
// @ suppresses the warnings loadHTML() emits for sloppy markup
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
foreach($x->query("//a") as $node)
{
$node->setAttribute("style","xxxx:yyyy;");
}
$newHtml = $dom->saveHTML();
Here it is using regex:
$result = preg_replace('/(<a\b[^><]*)>/i', '$1 style="xxxx:yyyy;">', $str);
but a regex cannot reliably parse malformed HTML documents.
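Since the question ideally wants to add any attribute to any tag, the DOM approach can be generalized. The following is a minimal sketch, not a definitive implementation: the function name addAttribute() and its parameters are made up for illustration.

```php
<?php
// Hypothetical helper: set $name="$value" on every <$tag> element in $html.
function addAttribute($html, $tag, $name, $value)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName($tag) as $node) {
        $node->setAttribute($name, $value);
    }
    // saveHTML() returns a full document, including the <html> and <body>
    // wrappers that loadHTML() added around the fragment.
    return $dom->saveHTML();
}

$out = addAttribute('<p>See <a href="#">this</a> link.</p>', 'a', 'style', 'xxxx:yyyy;');
echo $out;
```

Note that setAttribute() overwrites an existing attribute of the same name, so merging with an existing style value would need extra handling.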

PHP - How to replace a phrase with another?

How can I replace this <p><span class="headline"> with this <p class="headline"><span> in the easiest way with PHP?
$data = file_get_contents("http://www.ihr-apotheker.de/cs1.html");
$clean1 = strstr($data, '<p>');
$str = preg_replace('#(<a.*>).*?(</a>)#', '$1$2', $clean1);
$ausgabe = strip_tags($str, '<p>');
echo $ausgabe;
Before I alter the HTML from the site, I want to move the class declaration from the <span> to the <p> tag.
Don't parse HTML with regex!
This class should provide what you need:
http://simplehtmldom.sourceforge.net/
The reason not to parse HTML with regex is that you usually can't guarantee the format. If you already know the exact format of the string, you don't need a complete parser.
In your case, if you know that's the format, you can use str_replace:
str_replace('<p><span class="headline">', '<p class="headline"><span>', $data);
Well, answer was accepted already, but anyway, here is how to do it with native DOM:
$dom = new DOMDocument;
$dom->loadHTMLFile("http://www.ihr-apotheker.de/cs1.html");
$xPath = new DOMXpath($dom);
// remove links but keep link text
foreach($xPath->query('//a') as $link) {
$link->parentNode->replaceChild(
$dom->createTextNode($link->nodeValue), $link);
}
// switch classes
foreach($xPath->query('//p/span[@class="headline"]') as $node) {
$node->removeAttribute('class');
$node->parentNode->setAttribute('class', 'headline');
}
echo $dom->saveHTML();
On a side note, HTML has elements for headings, so why not use an <h*> element instead of the semantically superfluous "headline" class?
Have you tried using str_replace?
If the placement of the <p> and <span> tags is consistent, you can simply replace one with the other:
str_replace("part to replace", "replacement", $string);

Regex match HTML tag NOT containing another tag

I am writing a regex find/replace that will insert a <span> into every <a href> in a file where a <span> does not already exist. It will allow other tags to be in the <a href> like <img>, <b>, etc.
Currently I have this regex:
Find: (<a[^>]+?style=".*?color:#(\w{6}).*?".*?>)(.+?)(<\/a>)
Replace: '$1<span style="color:#$2;">$3</span>$4'
It works great, except that if I run it over the same file again, it inserts a <span> inside the existing <span> and it gets messy.
Target Example:
We want it to ignore this:
<span style="color:#bfbcba;">Howdy</span>
But not this:
Howdy
Or this:
<img src="myimg.gif" />Howdy
--EDIT--
Using the PHP DOM library as suggested in the comments, this is what I have so far:
$doc = new DOMDocument();
$doc->loadHTML($input);
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
$spancount = $tag->getElementsByTagName("span")->length;
if($spancount == 0){
$element = $doc->createElement('span');
$tag->appendChild($element);
}
}
echo $doc->saveHTML();
Currently it detects whether there is a span inside an anchor, and if there isn't, it appends an empty span to the inside of the anchor. However, I have yet to figure out how to get the original contents of the anchor inside the span.
Don't use regex for this; it's not ideal for HTML.
Use a DOM library and getElementsByTagName('a'), then iterate through each anchor and check whether it contains a nested span element with getElementsByTagName('span'), using the length property. If it doesn't, create a new span with $doc->createElement('span'), move the anchor's child nodes (starting from firstChild) into it, and appendChild the span to the anchor.
EDIT: As for grabbing the inner HTML of the anchor, if there are lots of nodes inside, try using this:
<?php
function innerHTML($node){
$doc = new DOMDocument();
foreach ($node->childNodes as $child)
$doc->appendChild($doc->importNode($child, true));
return $doc->saveHTML();
}
$html = innerHTML( $anchorRef );
This may also help you out: Change innerHTML of a php DOMElement
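To actually get the anchor's original contents inside the new span, one option is to move the child nodes rather than serialize them. This is a minimal sketch built on the asker's loop; the sample input is invented, and it relies on appendChild() moving a node that already belongs to the document.

```php
<?php
// Sample input (an assumption): one anchor without a span, one with.
$input = '<p><a href="#"><img src="myimg.gif">Howdy</a> '
       . '<a href="#"><span>Done</span></a></p>';

$doc = new DOMDocument();
$doc->loadHTML($input);

foreach ($doc->getElementsByTagName('a') as $tag) {
    if ($tag->getElementsByTagName('span')->length > 0) {
        continue; // already wrapped, leave it alone
    }
    $span = $doc->createElement('span');
    // appendChild() moves a node that is already in the document, so this
    // loop drains the anchor's children into the span in their original order.
    while ($tag->firstChild) {
        $span->appendChild($tag->firstChild);
    }
    $tag->appendChild($span);
}

$result = $doc->saveHTML();
echo $result;
```

Running it a second time over its own output changes nothing, because every anchor then already contains a span.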
