How can I search and replace a specific string (text + html tags) in a web page using the native PHP DOM Parser?
For example, search for
<p> Check this site </p>
This string is somewhere inside an HTML tree.
I would like to find it and replace it with another string. For example,
<span class="highlighter"><p> Check this site </p></span>
Bear in mind that there is no ID to the <p> or <a> nodes. There can be many of those identical nodes, holding different pieces of text.
I tried str_replace, however it fails with complex html markup, so I have turned to HTML Parsers now.
EDIT:
The string to be found and replaced might contain a variety of HTML tags, such as divs, headlines, bold text, etc. So I am looking for a solution that can construct a regex or DOM XPath query depending on the contents of the string being searched.
Thanks!
Is this what you wanted?
<?php
// load
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
// search p elements
$p_elements = $doc->getElementsByTagName('p');
// process these elements, if available
if (!is_null($p_elements))
{
foreach ($p_elements as $p_element)
{
// get p element nodes
$nodes = $p_element->childNodes;
// check for "a" nodes in these nodes
foreach ($nodes as $node) {
// found an a node - check must be defined better!
if(strtolower($node->nodeName) === 'a')
{
// create the new span element
$span_element = $doc->createElement('span');
$span_element->setAttribute('class', 'highlighter');
// replace the "p" element with the span
$p_element->parentNode->replaceChild($span_element, $p_element);
// append the "p" element to the span
$span_element->appendChild($p_element);
}
}
}
}
// output
echo '<pre>';
echo htmlentities($doc->saveHTML());
echo '</pre>';
This HTML is the basis for conversion:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<p> Check this site </p>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><p> Check this site </p>
</body></html>
The output looks like this; it wraps the elements you mentioned:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<span class="highlighter"><p> Check this site </p></span>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><span class="highlighter"><p> Check this site </p></span>
</body></html>
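The same wrapping can also be written with an XPath query instead of nested loops. This is only a sketch: the sample markup and the example.com link are my assumptions, since the original page is not shown in full. The query //p[a] selects every p element that has an a child, and DOMXPath::query() returns a static list, so the tree can be changed while looping.

```php
<?php
// Assumed sample markup (hypothetical link target).
$html = '<html><body>'
      . '<p><a href="http://example.com">Check this site</a></p>'
      . '<p>This is a new paragraph!</p>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Every <p> with an <a> child; the result list is static, not live.
foreach ($xpath->query('//p[a]') as $p) {
    $span = $doc->createElement('span');
    $span->setAttribute('class', 'highlighter');
    // Put the span where the <p> was, then move the <p> inside it.
    $p->parentNode->replaceChild($span, $p);
    $span->appendChild($p);
}

$result = $doc->saveHTML();
echo $result;
```

To match on something more specific (for example a particular href or link text), only the XPath expression needs to change.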
You could use a regular expression with preg_replace.
preg_replace("/<\s*p[^>]*>(.*?)<\s*\/\s*p>/", '<span class="highlighter"><p>$1</p></span>', '<p> Check this site</p>');
The fourth parameter of preg_replace ($limit) can be used to restrict the number of replacements
http://php.net/manual/en/function.preg-replace.php
http://www.pagecolumn.com/tool/all_about_html_tags.htm - for more examples on regular expressions for HTML
You will need to edit the regular expression to only capture the p tags with the google href
EDIT
preg_replace("/<\s*\w.*?><a href\s*=\s*\"?\s*(.*)(google\.com)\s*\">(.*?)<\/a>\s*<\/\s*\w.*?>/", '<span class="highlighter"><p>$3</p></span>', $string);
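To illustrate the limit argument mentioned above, here is a small sketch with a made-up input string. Passing 1 as the fourth argument stops after the first match. Note that this simple pattern would also match tags such as <pre>, so treat it as a demonstration rather than a robust HTML matcher.

```php
<?php
// Hypothetical input with three paragraphs.
$string = '<p>one</p><p>two</p><p>three</p>';

// Fourth argument = 1: only the first match is replaced.
$limited = preg_replace(
    '/<\s*p[^>]*>(.*?)<\s*\/\s*p>/',
    '<span class="highlighter"><p>$1</p></span>',
    $string,
    1
);

echo $limited;
// <span class="highlighter"><p>one</p></span><p>two</p><p>three</p>
```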
Related
I need to remove all html codes from a php string except:
<p>
<em>
<small>
The strip_tags() function is good, but it strips all HTML tags. How can I tell it to remove all HTML except the tags above?
You should check out the manual: Example #1 strip_tags() example
Syntax: strip_tags ( Your-string, Allowable-Tags )
If you pass the second parameter, these tags will not be stripped.
strip_tags($string, '<p><em><small>');
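A quick sketch of what that call does, using a made-up input string: the allowed tags stay, and every other tag is removed while its text content is kept.

```php
<?php
// Hypothetical input mixing allowed and disallowed tags.
$string = '<div><p>Keep <em>this</em> and <small>this</small>, drop <b>bold</b>.</p></div>';

$clean = strip_tags($string, '<p><em><small>');

echo $clean;
// <p>Keep <em>this</em> and <small>this</small>, drop bold.</p>
```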
According to your comment, you want to remove HTML elements only if they have some class or attribute. You'll need to build up a DOM then:
<?php
$data = <<<DATA
<div>
<p>These line shall stay</p>
<p class="myclass">Remove this one</p>
<p>I will be deleted as well</p>
<p>But keep this</p>
</div>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($dom);
$elements_to_be_removed = $xpath->query("//*[count(@*)>0]");
foreach ($elements_to_be_removed as $element) {
$element->parentNode->removeChild($element);
}
// just to check
echo $dom->saveHTML();
?>
To change which elements are removed, you'll need to change the query. For example, to remove all elements with the class myclass, it must read "//*[@class='myclass']".
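One caveat: @class='myclass' only matches when the attribute is exactly that string. If elements can carry several classes, a common XPath 1.0 idiom is to pad the attribute with spaces and look for the padded token. A minimal sketch, with sample markup that I made up for the illustration:

```php
<?php
// Hypothetical markup: one element has two classes.
$data = '<div><p>Keep me</p><p class="myclass">Remove me</p>'
      . '<p class="other myclass">Remove me too</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

// Matches "myclass" as a whole token anywhere in the class attribute.
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' myclass ')]";

foreach ($xpath->query($query) as $element) {
    $element->parentNode->removeChild($element);
}

$result = trim($dom->saveHTML());
echo $result;
// <div><p>Keep me</p></div>
```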
I'm trying to get full accurate img tags from a html code using DOM:
$content=new DOMDocument();
$content->loadHTML($htmlcontent);
$imgTags=$content->getElementsByTagName('img');
foreach($imgTags as $tag) {
echo $content->saveXML($tag); }
If I had the original <img src="img">, the result would be <img src="img"/>. But I need the exact value corresponding to the original.
Is it possible to get the exact img tag using DOM, without regular expressions or third-party libraries (Simple HTML DOM)?
No. It isn't possible to do this.
However, you can achieve your goal of removing the <img> elements from an HTML document if they meet specific conditions using DOMDocument. Here's some sample code which removes images which contain the class attribute "removeme".
$htmlcontent =
'<!DOCTYPE html><html><head><title>Example</title></head><body>'
. '<img src="1"><img src="2" class="removeme"><img src="3"><img class="removeme" src="4">'
. '</body></html>';
$content=new DOMDocument();
$content->loadHTML($htmlcontent);
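// copy the live node list to an array first, so removing nodes does not skip elements
foreach (iterator_to_array($content->getElementsByTagName('img')) as $image) {
if ($image->getAttribute("class") == "removeme") {
$image->parentNode->removeChild($image);
}
}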
echo $content->saveHTML();
Output:
<!DOCTYPE html> <html><head><title>Example</title></head><body><img src="1"><img src="3"></body></html>
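On the serialization point from the question: the trailing slash comes from saveXML(), which writes XML syntax. Passing the node to saveHTML() instead (supported since PHP 5.3.6) writes HTML syntax. This is still a normalized re-serialization of the parsed tree, not the original bytes, so attribute quoting or spacing in the source may differ. A small sketch:

```php
<?php
$content = new DOMDocument();
$content->loadHTML('<html><body><img src="img"></body></html>');

$tag = $content->getElementsByTagName('img')->item(0);

$asXml  = $content->saveXML($tag);   // XML syntax: <img src="img"/>
$asHtml = $content->saveHTML($tag);  // HTML syntax: <img src="img">

echo $asXml, PHP_EOL, $asHtml, PHP_EOL;
```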
I tried all the solutions posted on this question. Although it is similar to my question, its solutions aren't working for me.
I am trying to get the plain text that is outside of <b> but inside the <div id="maindiv">.
<div id=maindiv>
<b>I don't want this text</b>
I want this text
</div>
$part is the object that contains <div id="maindiv">.
Now I tried this:
$part->find('!b')->innertext;
The code above is not working. When I tried this
$part->plaintext;
it returned all of the plain text like this
I don't want this text I want this text
I read the official documentation, but I didn't find anything to resolve this.
Query:
$selector->query('//div[@id="maindiv"]/text()[2]')
Explanation:
// - selects nodes regardless of their position in tree
div - selects elements whose node name is 'div'
[@id="maindiv"] - selects only those divs having the attribute id="maindiv"
/ - sets focus to the div element
text() - selects only text nodes
[2] - selects the second text node (the first is whitespace)
Note! The actual position of the text element may depend on
your preserveWhitespace setting.
Manual: http://www.php.net/manual/de/class.domdocument.php#domdocument.props.preservewhitespace
Example:
$html = <<<EOF
<div id="maindiv">
<b>I dont want this text</b>
I want this text
</div>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXpath($doc);
$node = $selector->query('//div[@id="maindiv"]/text()[2]')->item(0);
echo trim($node->nodeValue); // I want this text
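If you would rather not depend on the position of the text node, a variation (my own sketch, not part of the query above) is to select every non-blank text node that is a direct child of the div. The sample markup here deliberately has no whitespace before the <b>, a case where text()[2] would not find anything.

```php
<?php
// Assumed markup without leading whitespace inside the div.
$html = '<div id="maindiv"><b>I dont want this text</b> I want this text </div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);

// text()[normalize-space()] keeps only text nodes that are not pure whitespace.
$parts = array();
foreach ($selector->query('//div[@id="maindiv"]/text()[normalize-space()]') as $textNode) {
    $parts[] = trim($textNode->nodeValue);
}

$text = implode(' ', $parts);
echo $text; // I want this text
```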
Remove the <b> first:
$part->find('b', 0)->outertext = '';
echo $part->innertext; // I want this text
I tried using preg_match_all to get all the contents between a given html tag but it produces an empty result and I'm not good at php.
Is there a way to get the contents between tags? Like this -
<span class="st"> EVERYTHING IN HERE INCLUDING TAGS<B></B><EM></EM><DIV></DIV>&+++ TEXT </span>
preg_match is not very good at HTML parsing, especially in your case, which is a bit more complex.
Instead, use an HTML parser and obtain the elements you're looking for. The following is a simple example selecting the first span element. It can be made more specific, for example by also checking the class attribute; this is just to give you some pointers to start with:
$html = '<span class="st"> EVERYTHING IN HERE INCLUDING TAGS<B></B><EM></EM><DIV></DIV>&+++ TEXT </span>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$span = $doc->getElementsByTagName('span')->item(0);
echo $doc->saveHTML($span);
Output:
<span class="st"> EVERYTHING IN HERE INCLUDING TAGS<b></b><em></em><div></div>&amp;+++ TEXT </span>
If you look closely, you can see that even HTML errors have been fixed on the fly: the bare & in &+++ was not valid HTML and is serialized as &amp;.
If you only need the inner HTML, you need to iterate over the children of the span element:
foreach($span->childNodes as $child)
{
echo $doc->saveHTML($child);
}
Which gives you:
EVERYTHING IN HERE INCLUDING TAGS<b></b><em></em><div></div>&amp;+++ TEXT
I hope this is helpful.
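The child-node loop above can be wrapped in a small helper. The function name innerHtml is my own, not part of the DOM extension, and the sample markup is made up; libxml_use_internal_errors() is only there to keep parser warnings quiet when the input is not clean HTML.

```php
<?php
// Hypothetical helper: concatenates the HTML of all child nodes.
function innerHtml(DOMNode $node)
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML('<span class="st">Some <b>bold</b> and <em>emphasised</em> text</span>');
libxml_clear_errors();

$span  = $doc->getElementsByTagName('span')->item(0);
$inner = innerHtml($span);

echo $inner; // Some <b>bold</b> and <em>emphasised</em> text
```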
Try this with preg_match
$str = "<span class=\"st\"> EVERYTHING IN HERE INCLUDING TAGS<B></B><EM></EM><DIV></DIV>&+++ TEXT </span>";
preg_match("/<span class=\"st\">(.*?)<\/span>/is", $str, $matches);
print_r($matches);
I want to retrieve the data of the next element tag in a document, for example:
I would like to retrieve <blockquote> Content 1 </blockquote> for every different span only.
<html>
<body>
<span id=12341></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<!-- misc html in between including other spans w/ no relative blockquotes-->
<span id=12342></span>
<blockquote>Content 1</blockquote>
<!-- misc html in between including other spans w/ no relative blockquotes-->
<span id=12343></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<blockquote>Content 3</blockquote>
<blockquote>Content 4</blockquote>
<!-- misc html in between including other spans w/ no relative blockquotes-->
<span id=12344></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<blockquote>Content 3</blockquote>
</body>
</html>
Now two things I'm wondering:
1.) How can I write an expression that matches and only outputs a blockquote that's followed right after a closed element (<span></span>)?
2.) If I wanted, how could I get Content 2, Content 3, etc. if I ever have a need to output them in the future while still applying the rules of the previous question?
Now two things I'm wondering:
1.)How can I write an expression that matches and only outputs a blockquote
that's followed right after a closed
element (<span></span>)?
Assuming that the provided text is converted to a well-formed XML document (you need to enclose the values of the id attributes in quotes),
Use:
/*/*/span/following-sibling::*[1][self::blockquote]
This means in English: Select all blockquote elements each of which is the first, immediate following sibling of a span element that is a grand-child of the top element of the document.
2.)If I wanted, how could I get Content 2, Content 3, etc if I ever
have a need to output them in the
future while still applying to the
rules of the previous question?
Yes.
You can get all sets of contiguous blockquote elements following a span:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*[not(self::blockquote)][1][self::span]]
You can get the contiguous set of blockquote elements following the (N+1)-st span by:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=$vN]
]
where $vN should be substituted by the number N.
Thus, the contiguous set of blockquote elements following the first span is selected by:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=0]
]
The contiguous set of blockquote elements following the second span is selected by:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=1]
]
etc. ...
See in the XPath Visualizer the nodes selected by the following expression :
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=3]
]
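For anyone wanting to run these expressions from PHP, here is a rough sketch. PHP's DOMXPath has no variable binding, so the number N is concatenated into the query (cast to int first). The function name blockquotesAfterSpan and the shortened sample markup are my own assumptions for the illustration.

```php
<?php
// Hypothetical helper: texts of the blockquotes directly following the (N+1)-st span.
function blockquotesAfterSpan(DOMXPath $xp, $n)
{
    $query = '/*/*/span/following-sibling::blockquote'
           . '[preceding-sibling::*[not(self::blockquote)][1]'
           . '[self::span and count(preceding-sibling::span)=' . (int) $n . ']]';

    $texts = array();
    foreach ($xp->query($query) as $blockquote) {
        $texts[] = $blockquote->textContent;
    }
    return $texts;
}

// Shortened sample markup (assumed), with quoted id values.
$html = '<html><body>'
      . '<span id="1"></span><blockquote>A1</blockquote><blockquote>A2</blockquote>'
      . '<p>other markup</p>'
      . '<span id="2"></span><blockquote>B1</blockquote>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$first  = blockquotesAfterSpan($xp, 0); // blockquotes after the first span
$second = blockquotesAfterSpan($xp, 1); // blockquotes after the second span

print_r($first);
print_r($second);
```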
Short answer: Load your HTML into DOMDocument, and select the nodes you want with XPath.
http://www.php.net/DOM
Long answer ($body is assumed to be the <body> element, e.g. $dom->getElementsByTagName('body')->item(0)):
$flag = false;
$TEXT = array();
foreach ($body->childNodes as $el) {
if ($el->nodeName === '#text') continue;
if ($el->nodeName === 'span') {
$flag = true;
continue;
}
if ($flag && $el->nodeName === 'blockquote') {
$TEXT[] = $el->firstChild->nodeValue;
$flag = false;
continue;
}
}
Try the following *
/html/body/span/following-sibling::*[1][self::blockquote]
to match any first blockquotes after a span element that are direct children of body or
//span/following-sibling::*[1][self::blockquote]
to match any first blockquotes following a span element anywhere in the document
* edit: fixed Xpath. Credits to Dimitre. My initial version would match any first blockquote after the span, e.g. it would match span p blockquote, which is not what you wanted.
Both of the above would match the "Content 1" blockquotes. If you want to match the other blockquotes following the span as well (siblings, not descendants), remove the [1].
Example:
$dom = new DOMDocument;
$dom->load('yourFile.xml');
$xp = new DOMXPath($dom);
$query = '/html/body/span/following-sibling::*[1][self::blockquote]';
foreach($xp->query($query) as $blockquote) {
echo $dom->saveXml($blockquote), PHP_EOL;
}
If you want to do that without XPath, you can do
$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->load('yourFile.xml');
$body = $dom->getElementsByTagName('body')->item(0);
foreach($body->getElementsByTagName('span') as $span) {
if($span->nextSibling !== NULL &&
$span->nextSibling->nodeName === 'blockquote')
{
echo $dom->saveXml($span->nextSibling), PHP_EOL;
}
}
If the HTML you scrape is not valid XHTML, use loadHtmlFile() instead to load the markup. You can suppress errors with libxml_use_internal_errors(TRUE) and libxml_clear_errors().
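A minimal sketch of that error handling, using loadHTML() on a string so it is self-contained (loadHTMLFile() works the same way with a path). The sample markup is assumed; whether the parser actually reports anything for it depends on the libxml version, so the collected errors are not inspected here.

```php
<?php
// Assumed sample: unquoted id and a tag older libxml versions may complain about.
$html = '<html><body><span id=12341></span><blockquote>Content 1</blockquote>'
      . '<section>newer tag</section></body></html>';

$previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();                        // discard whatever was collected
libxml_use_internal_errors($previous);        // restore the previous setting

$xp = new DOMXPath($dom);
$found = array();
foreach ($xp->query('//span/following-sibling::*[1][self::blockquote]') as $blockquote) {
    $found[] = $blockquote->textContent;
}

print_r($found);
```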
Also see Best methods to parse HTML for alternatives to DOM (though I find DOM a good choice).
Besides @Dimitre's good answer, you could also use:
/html
/body
/blockquote[preceding-sibling::*[not(self::blockquote)][1]
/self::span[@id='12341']]