How do I remove everything from a page except the text inside <p> tags?
Page:
This is text.
<div class="text">This is text in 'div' tag</div>
<p>This is text in 'p' tag</p>
Expected result:
This is text in 'p' tag
Greetings.
Basically, you'll have to parse the markup. PHP comes with a good parser in the form of the DOMDocument class, so that's really quite easy:
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
Next, get all p tags:
$paragraphs = $dom->getElementsByTagName('p');
This method returns a DOMNodeList object, which implements the Traversable interface, so you can loop over it with foreach as if it were an array of DOMNode instances (DOMElement in this case):
$first = $paragraphs->item(0);//or $paragraphs[0] even
foreach ($paragraphs as $p) {
echo $p->textContent;//echo the inner text
}
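If you only want the paragraph elements that do not contain child elements, check for element children specifically. hasChildNodes() also counts text nodes, so it returns true for any paragraph that contains text:
foreach ($paragraphs as $p) {
    // no descendant elements means the paragraph holds only text
    if ($p->getElementsByTagName('*')->length === 0) {
        echo $p->textContent; // or $p->nodeValue
    }
}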
A closely related answer with some more links/info: How to split an HTML string into chunks in PHP?
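Putting the DOMDocument steps above together for the page in the question, a minimal sketch could look like this. The helper name extract_paragraph_text and the explicit <html><body> wrapper are assumptions for illustration; the wrapper is there because libxml may wrap bare top-level text in an implied <p> when it is handed a fragment.

```php
<?php
// Hypothetical helper: returns the text of every <p> element in $html.
function extract_paragraph_text(string $html): array
{
    libxml_use_internal_errors(true); // real-world markup is rarely clean
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $texts = [];
    foreach ($dom->getElementsByTagName('p') as $p) {
        $texts[] = $p->textContent;
    }
    return $texts;
}

// The page from the question, wrapped in an explicit body (assumption).
$page = <<<HTML
<html><body>
This is text.
<div class="text">This is text in 'div' tag</div>
<p>This is text in 'p' tag</p>
</body></html>
HTML;

echo implode("\n", extract_paragraph_text($page)), "\n";
```

Returning an array instead of echoing inside the loop keeps the extraction reusable, for example if you later want to join the paragraphs with a different separator.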
You can also do this with PHP's native strip_tags function:
strip_tags("<p>This is text in 'p' tag</p>");
This returns "This is text in 'p' tag", as you expected. NOTE: strip_tags removes only the tags and keeps all the text, so on its own it is useful only once you have isolated the markup you want, for example with an outer container div and a little bit of dirty RegExp that drops the other elements (such as the div) together with their content.
The function takes one required argument and one optional argument. The first is the string to strip the tags from; the second is a string listing allowable tags, which are left in place. For more information, see the strip_tags entry in the PHP manual. I hope you got the idea :)
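To make the behaviour of the two arguments concrete, here is a small sketch run against the page from the question. The variable names are only for illustration. It shows why strip_tags alone does not give the expected result: the text outside the <p> tag stays in the output.

```php
<?php
$page = "This is text.\n"
      . "<div class=\"text\">This is text in 'div' tag</div>\n"
      . "<p>This is text in 'p' tag</p>";

// Without a second argument every tag is removed, but all text stays.
$plain = strip_tags($page);

// With '<p>' as the allowable tag, <p> survives and <div> is stripped.
$keepP = strip_tags($page, '<p>');

echo $plain, "\n---\n", $keepP, "\n";
```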
I have:
<span>something or other</span>
<b>blarg</b>
<b>blarg and stuff</b>
<span>blarg</span>
<em>wakka wakka</em>
<em>wakka blarg</em>
<em>blarg</em>
and I just want to get the elements that ONLY contain "blarg" and no other text, so:
<b>blarg</b>
<span>blarg</span>
<em>blarg</em>
The important issue here is that I'm trying to check if blarg exists within one element alone on the page or not. I've had some general luck with regex but I'd rather do it with simple_html_dom so that I can look at child and sibling elements as well.
Does anyone know what is the simplest way to do this with simple_html_dom?
One way to do it is to go through every element and test whether its text is exactly 'blarg'.
Here's a working example:
$text = '<span>something or other</span>
<b>blarg</b>
<b>blarg and stuff</b>
<span>blarg</span>
<em>wakka wakka</em>
<em>wakka blarg</em>
<em>blarg</em>';
echo "<div>Original Text: <xmp>$text</xmp></div>";
$html = str_get_html($text);
// Find all elements
$tags = $html->find('*');
foreach ($tags as $key => $tag) {
// If the trimmed text of the tag is exactly 'blarg'
if (strcmp(trim($tag->plaintext),'blarg') == 0) {
echo "<div> 'blarg' found in \$tags[$key]: <xmp>".$tag->outertext."</xmp></div>";
}
}
I don't know what you want to do with the result, but this may be a start :)
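If simple_html_dom is not a hard requirement, the same test can be sketched with PHP's built-in DOMDocument and DOMXPath. This swaps in a different library from the one the question asks about. The query below is my own assumption of what "only contains blarg" means: elements with no element children whose whitespace-normalized text equals "blarg".

```php
<?php
$text = '<span>something or other</span>
<b>blarg</b>
<b>blarg and stuff</b>
<span>blarg</span>
<em>wakka wakka</em>
<em>wakka blarg</em>
<em>blarg</em>';

$dom = new DOMDocument();
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);

// Leaf elements (no element children) whose trimmed text is exactly "blarg".
$query = '//*[not(*) and normalize-space(.) = "blarg"]';

$matches = [];
foreach ($xpath->query($query) as $element) {
    // saveHTML() with a node argument serializes just that element.
    $matches[] = trim($dom->saveHTML($element));
}

print_r($matches);
```

Because each result is a DOMElement, you can still reach its neighbours through parentNode, previousSibling and nextSibling.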
I want a preg_match code that will detect a given string and get its wrapping element.
I have a string and a html code like:
$string = "My text";
$html = "<div><p class='text'>My text</p><span>My text</span></div>";
So I need to create a function that will return the element wrapping the string, like:
$element = get_wrapper($string, $html);
function get_wrapper($str, $code){
//code here that has preg_match and return the wrapper element
}
The returned value will be an array, since there are 2 possible wrappers here: <p class='text'></p> and <span></span>
Can anyone give me a regex pattern to get the HTML element that wraps the given string?
Thanks! Answers are greatly appreciated.
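It's a bad idea to use regex for this task. You can use DOMDocument instead:
$oDom = new DOMDocument('1.0', 'UTF-8');
$oDom->loadXML("<div>" . $sHtml . "</div>"); // $sHtml is the markup to search
$aFound = array();
get_wrapper($s, $oDom, $aFound); // $s is the string to look for
where get_wrapper walks the tree recursively:
function get_wrapper($s, $oNode, array &$aFound) {
    foreach ($oNode->childNodes as $oItem) {
        if ($oItem->nodeType !== XML_ELEMENT_NODE) {
            continue; // skip text nodes, comments, etc.
        }
        if ($oItem->nodeValue == $s) {
            $aFound[] = $oItem; // needed tag - $oItem->nodeName
        } else {
            get_wrapper($s, $oItem, $aFound);
        }
    }
}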
The simple pattern would be the following, but it assumes a lot of things. Regexes shouldn't be used for this. You should look at something like the Simple HTML DOM parser, which is more intelligent.
Anyway, a regex that matches the string together with its wrapper tags is as follows. (The trailing g flag is JavaScript-style; in PHP, drop it and use preg_match_all.)
/[A-Za-z'= <]*>My text<[A-Za-z\/>]*/g
Even if regex is never the correct answer in the domain of DOM parsing, I came up with another (quite simple) solution:
<[^>/]+?>My String</.+?>
It works if the HTML is good (i.e. it has closing tags, < is replaced with &lt;, and so on). If you wrap the two tag parts in parentheses, the first regex group holds the opening tag and the second the closing one.
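For completeness, here is a hedged sketch of how the second pattern could be dropped into the get_wrapper function from the question. The capture groups, the ~ delimiter and the 'open'/'close' array keys are my additions, and the same caveat applies: it only behaves on simple, well-formed markup where the string sits directly inside its wrapper.

```php
<?php
// Hypothetical get_wrapper(): returns the tags that directly wrap $str.
function get_wrapper(string $str, string $code): array
{
    // preg_quote() keeps characters in $str from being read as regex syntax.
    $pattern = '~(<[^>/]+?>)' . preg_quote($str, '~') . '(</.+?>)~';
    preg_match_all($pattern, $code, $matches, PREG_SET_ORDER);

    $wrappers = [];
    foreach ($matches as $m) {
        // Group 1 is the opening tag, group 2 the closing tag.
        $wrappers[] = ['open' => $m[1], 'close' => $m[2]];
    }
    return $wrappers;
}

$string = "My text";
$html = "<div><p class='text'>My text</p><span>My text</span></div>";

print_r(get_wrapper($string, $html));
```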
How do I ignore HTML tags in this preg_replace?
I have a foreach function for a search, so if someone searches for "apple span" the preg_replace also applies a span to the span and the html breaks:
preg_replace("/($keyword)/i","<span class=\"search_hightlight\">$1</span>",$str);
Thanks in advance!
I suggest basing your function on DOMDocument and DOMXPath rather than on regular expressions. Even though regular expressions are quite powerful, you run into problems like the one you describe, which are not (always) easy to solve robustly with them.
The general saying is: Don't parse HTML with regular expressions.
It's a good rule to keep in mind. As with any rule, it does not always apply, but it's worth making up your own mind about it.
XPath allows you to find all text that contains the search terms, looking at the text only and ignoring all XML elements.
Then you only need to wrap those texts in a <span> and you're done.
Edit: Finally some code ;)
First, it uses XPath to locate elements that contain the search text. My query looks like this; it could probably be written better, as I'm not a super XPath pro:
'//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..'
$search contains the text to search for and must not contain any " (quote) character (this would break the query; see Cleaning/sanitizing xpath attributes for a workaround if you need quotes).
This query returns all parent elements whose text nodes, put together, form a string that contains your search term.
As such a list is not easy to process further as-is, I created a TextRange class that represents a list of DOMText nodes. It lets you perform string operations on a list of text nodes as if they were one string.
This is the base skeleton of the routine:
$str = '...'; # some XML
$search = 'text that span';
printf("Searching for: (%d) '%s'\n", strlen($search), $search);
$doc = new DOMDocument;
$doc->loadXML($str);
$xp = new DOMXPath($doc);
$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor)
{
throw new Exception('Anchor element not found.');
}
// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r)
{
throw new Exception('XPath failed.');
}
// process search results
foreach($r as $i => $node)
{
$textNodes = $xp->query('.//child::text()', $node);
// extract $search textnode ranges, create fitting nodes if necessary
$range = new TextRange($textNodes);
$ranges = array();
while(FALSE !== $start = strpos($range, $search))
{
$base = $range->split($start);
$range = $base->split(strlen($search));
$ranges[] = $base;
};
// wrap each matching textnode
foreach($ranges as $range)
{
foreach($range->getNodes() as $node)
{
$span = $doc->createElement('span');
$span->setAttribute('class', 'search_hightlight');
$node = $node->parentNode->replaceChild($span, $node);
$span->appendChild($node);
}
}
}
For my example XML:
<html>
<body>
This is some <span>text</span> that span across a page to search in.
and more text that span</body>
</html>
It produces the following result:
<html>
<body>
This is some <span><span class="search_hightlight">text</span></span><span class="search_hightlight"> that span</span> across a page to search in.
and more <span class="search_hightlight">text that span</span></body>
</html>
This shows that the approach even finds text that is distributed across multiple tags. That is not easily possible with regular expressions at all.
You can find the full code here: http://codepad.viper-7.com/U4bxbe (including the TextRange class that I have left out of the example in this answer).
It does not work properly on the viper codepad because that site uses an older LIBXML version. It works fine with my LIBXML version 20707. I created a related question about this issue: XPath query result order.
A note of warning: This example uses a binary string search (strpos) and the resulting byte offsets for splitting text nodes with the DOMText::splitText function. That can lead to wrong offsets, as the function needs the UTF-8 character offset. The correct method is to use mb_strpos to obtain the UTF-8 based value.
The example works anyway because the example data only uses US-ASCII, where byte offsets and UTF-8 character offsets are the same.
For a real-life situation, the $search string should be UTF-8 encoded and mb_strpos should be used instead of strpos:
while(FALSE !== $start = mb_strpos($range, $search, 0, 'UTF-8'))
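If you do not need matches that run across several tags, a much smaller variant is possible. The sketch below is not the TextRange approach from this answer: it looks at each text node on its own and wraps the keyword with DOMText::splitText, so a term such as "apple span" that is split over two elements will not be found. The function name highlight_text_nodes is made up for the example, and the input is assumed to be well-formed XML.

```php
<?php
// Hypothetical helper: wraps $keyword in a span, text nodes only.
function highlight_text_nodes(string $xml, string $keyword): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    if ($keyword === '') {
        return $doc->saveXML($doc->documentElement);
    }

    $xp = new DOMXPath($doc);
    // Copy the result list into an array before modifying the tree.
    $textNodes = iterator_to_array($xp->query('//text()'));

    foreach ($textNodes as $node) {
        // Case-insensitive, like the /i flag in the question.
        while (($pos = mb_stripos($node->data, $keyword, 0, 'UTF-8')) !== false) {
            $match = $node->splitText($pos); // starts at the keyword
            $rest  = $match->splitText(mb_strlen($keyword, 'UTF-8')); // text after it

            $span = $doc->createElement('span');
            $span->setAttribute('class', 'search_hightlight');
            $match->parentNode->replaceChild($span, $match);
            $span->appendChild($match);

            $node = $rest; // keep searching in the remainder
        }
    }
    return $doc->saveXML($doc->documentElement);
}

echo highlight_text_nodes('<p>An apple <span>span</span></p>', 'span'), "\n";
```

Because only text nodes are touched, the existing <span> element itself is never rewritten, which is the breakage described in the question.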
I don't know RegExp well, and I haven't managed to split a string into an array.
I have string like:
<h5>some text in header</h5>
some other content, that belongs to header <p> or <a> or <img> inside.. not important...
<h5>Second text header</h5>
So what I am trying to do is split the string into an array where the KEY is the text from a header and the CONTENT is everything that follows up to the next header, like:
array("some text in header" => "some other content, that belongs to header...", ...)
I would suggest looking at the PHP DOM http://php.net/manual/en/book.dom.php. You can read / create DOM from a document.
I've used this one and enjoyed it:
http://simplehtmldom.sourceforge.net/
You could do it with a regex as well, something like this:
/<h5>(.*)<\/h5>(.*)<h5>/s
But this just finds the first occurrence; you'll have to cut the string to get the next one.
Any way you cut it, I don't see a one-liner for you. Sorry.
Here's a crummy few-liner based on explode:
$res = array();
foreach (explode("<h5>", $html) as $chunk) {
    if (strpos($chunk, "</h5>") === false) {
        continue; // anything before the first header has no key
    }
    list($key, $val) = explode("</h5>", $chunk, 2);
    $res[$key] = $val;
}
Don't parse HTML with preg_match. Use PHP's DOMDocument class instead.
Example:
<?php
$html = "<h5>some text in header</h5>
some other content, that belongs to header <p> or <a> or <img> inside.. not important...
<h5>Second text header</h5>";
// a new DOM object
$dom = new DOMDocument('1.0', 'utf-8');
// discard white space (must be set before loading to have an effect)
$dom->preserveWhiteSpace = false;
// load the HTML into the object
$dom->loadHTML($html);
$hFive = $dom->getElementsByTagName('h5');
echo $hFive->item(0)->nodeValue; // change the index to get the other h5 elements
?>
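The example above only reads the header text. To get the array the question asks for (header text as key, everything up to the next header as content), one possible sketch is to walk the siblings that follow each h5. The function name split_by_h5 and the wrapping <div> are assumptions: the wrapper simply gives the loose text a parent so the parser does not rearrange it.

```php
<?php
// Hypothetical helper: maps each <h5> text to the markup that follows it.
function split_by_h5(string $html): array
{
    libxml_use_internal_errors(true); // tolerate sloppy markup
    $dom = new DOMDocument();
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $result = [];
    foreach ($dom->getElementsByTagName('h5') as $header) {
        $content = '';
        // Collect following siblings until the next <h5> or the end.
        for ($n = $header->nextSibling; $n !== null && $n->nodeName !== 'h5'; $n = $n->nextSibling) {
            $content .= $dom->saveHTML($n);
        }
        $result[trim($header->textContent)] = trim($content);
    }
    return $result;
}

$html = '<h5>First</h5>some <b>bold</b> content<h5>Second</h5>more';
print_r(split_by_h5($html));
```

Note that two headers with identical text would overwrite each other, since the header text is used as the array key.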
I'm trying to perform a preg_replace on the text in an HTML string. I want to avoid replacing the text within tags, so I'm loading the string as a DOM element and grabbing the text within each node. For example, I have this list:
<ul>
<li>Boxes 1-3: 1925 - 1928 <em>(A-Ma)</em></li>
<li>Boxes 4-6: 1928 <em>(Mb-Z)</em> - 1930 <em>(A-Wi)</em></li>
<li>Boxes 7-9: 1930 <em>(Wo-Z)</em>- 1932 <em>(A-Fl)</em></li>
</ul>
I want to be able to highlight the character "1", or the letter "i", without disturbing the links or list item tag. So I grab each list item and get its value to perform the replace on:
$invfile = [string of the unordered list above]
$invcontents = new DOMDocument;
$invcontents->loadHTML($invfile);
$inv_listitems = $invcontents->getElementsByTagName('li');
foreach ($inv_listitems as $f) {
$f->nodeValue = preg_replace($to_highlight, "<span class=\"highlight\">$0</span>", $f->nodeValue);
}
echo html_entity_decode($invcontents->saveHTML());
The problem is, when I grab the node values, the child nodes inside the list item are lost. If I print out the original string as-is, the <a>, <em>, etc. tags are all there. But when I run the script, it prints out without the links or any formatting tags. For example, if my $to_highlight pattern matches the string "Boxes", the list becomes:
<ul>
<li><span class="highlight">Boxes</span> 1-3: 1925 - 1928 (A-Ma)</li>
<li><span class="highlight">Boxes</span> 4-6: 1928 (Mb-Z) - 1930 (A-Wi)</li>
<li><span class="highlight">Boxes</span> 7-9: 1930 (Wo-Z)- 1932 (A-Fl)</li>
</ul>
How can I get the text without losing the tags inside?
The problem here is that you're operating on the entire element. Boxes is part of the nodeValue of an anchor tag.
If the structure above is always the same you can do something like
$new_html = preg_replace("##", "", $f->item(0)->nodeValue);
In reality, the best way to go about it is to unset the anchor's node value, create an entirely new element, and append it.
(Consider this pseudo code)
$inv_listitems = $invcontents->getElementsByTagName('li');
foreach ($inv_listitems as $f) {
    $span = $invcontents->createElement("span");
    $span->setAttribute("class", "highlight");
    // a DOMElement has no item() method; go through childNodes
    $span->nodeValue = $f->childNodes->item(0)->nodeValue;
    $f->appendChild($span);
}
echo $invcontents->saveHTML();
You'll have to do some matching in there, as well as unsetting the nodeValue of $f, but hopefully this makes it a little more clear.
Also, don't set HTML in nodeValue directly, because the markup will be escaped as if it had been run through htmlentities(). That is why I create a new element above. If you absolutely have to set HTML in nodeValue, then you should create a DocumentFragment object.
You're better off operating only on the text nodes:
$x = new DOMXPath($invcontents);
foreach ($x->query('//li/text()') as $textnode) {
    // replace text node with list of plain text nodes & your highlighting span.
}
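The placeholder comment in that loop can be filled in roughly as follows. This is only a sketch under a few assumptions: the function name highlight_list_text is made up, the pattern must contain exactly one capture group around the whole match (so preg_split alternates between plain text and matches), and I query '//li//text()' rather than '//li/text()' so that text inside <em> is covered too.

```php
<?php
// Hypothetical helper: highlights $pattern matches inside <li> text nodes.
function highlight_list_text(string $html, string $pattern): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    foreach (iterator_to_array($xpath->query('//li//text()')) as $textNode) {
        // With one capture group, odd indexes hold the matched text.
        $pieces = preg_split($pattern, $textNode->data, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($pieces) < 2) {
            continue; // no match in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($pieces as $i => $piece) {
            if ($i % 2 === 1) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'highlight');
                $span->appendChild($doc->createTextNode($piece));
                $fragment->appendChild($span);
            } elseif ($piece !== '') {
                $fragment->appendChild($doc->createTextNode($piece));
            }
        }
        // Swap the old text node for the text/span mix.
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }
    return $doc->saveHTML($doc->getElementsByTagName('ul')->item(0));
}

$list = '<ul><li>Boxes 1-3: 1925 - 1928 <em>(A-Ma)</em></li></ul>';
echo highlight_list_text($list, '/(Boxes)/'), "\n";
```

Since the replacement text is added through createTextNode and createElement, nothing gets entity-escaped by accident and the html_entity_decode call from the question is no longer needed.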
I always use XPath for this kind of task. It gives you more flexibility.
This example handles
<mainlevel>
  <toplevel>
    <detaillevel key=...>
      <xmlvalue1></xmlvalue1>
      <xmlvalue2></xmlvalue2>
      <sublevel key=...>
        <xmlsubvalue1></xmlsubvalue1>
        <xmlsubvalue2></xmlsubvalue2>
      </sublevel>
    </detaillevel>
  </toplevel>
</mainlevel>
To parse this:
$xpath = new DOMXPath($xmlDoc);
$mainNodes = $xpath->query("/mainlevel/toplevel/detaillevel");
foreach ($mainNodes as $subNode) {
    $parameter1 = $subNode->getAttribute('key');
    $parameter2 = $subNode->getElementsByTagName("xmlvalue1")->item(0)->nodeValue;
    $parameter3 = $subNode->getElementsByTagName("xmlvalue2")->item(0)->nodeValue;
    foreach ($subNode->getElementsByTagName("sublevel") as $detailNode) {
        // xmlsubvalue1 and xmlsubvalue2 are child elements, not attributes
        $subParameter1 = $detailNode->getAttribute('key');
        $subParameter2 = $detailNode->getElementsByTagName("xmlsubvalue1")->item(0)->nodeValue;
        $subParameter3 = $detailNode->getElementsByTagName("xmlsubvalue2")->item(0)->nodeValue;
    }
}
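The same structure can also be read with relative XPath expressions by passing a context node as the second argument to query() and evaluate(). The sketch below is self-contained; the key values ("d1", "s1") and the element contents are invented sample data, since the original leaves them out.

```php
<?php
// Sample document; attribute values and contents are made up.
$xml = <<<XML
<mainlevel>
  <toplevel>
    <detaillevel key="d1">
      <xmlvalue1>a</xmlvalue1>
      <xmlvalue2>b</xmlvalue2>
      <sublevel key="s1">
        <xmlsubvalue1>c</xmlsubvalue1>
        <xmlsubvalue2>d</xmlsubvalue2>
      </sublevel>
    </detaillevel>
  </toplevel>
</mainlevel>
XML;

$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xml);
$xpath = new DOMXPath($xmlDoc);

$rows = [];
foreach ($xpath->query('/mainlevel/toplevel/detaillevel') as $detail) {
    $row = [
        'key'       => $detail->getAttribute('key'),
        // string(...) is evaluated relative to $detail
        'xmlvalue1' => $xpath->evaluate('string(xmlvalue1)', $detail),
        'xmlvalue2' => $xpath->evaluate('string(xmlvalue2)', $detail),
        'sublevels' => [],
    ];
    foreach ($xpath->query('sublevel', $detail) as $sub) {
        $row['sublevels'][] = [
            'key'          => $sub->getAttribute('key'),
            'xmlsubvalue1' => $xpath->evaluate('string(xmlsubvalue1)', $sub),
            'xmlsubvalue2' => $xpath->evaluate('string(xmlsubvalue2)', $sub),
        ];
    }
    $rows[] = $row;
}

print_r($rows);
```

Unlike item(0)->nodeValue, string(...) yields an empty string when the child element is missing, so the loop does not fail on incomplete records.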