PHP Strip all content around text - php

I have text that looks like this, or one of a billion variants of it, for example:
<div>content goes here... </div><div style="some style..."><span style="some styles..."><strong>[END_CONTACT]</strong></span></div><div>content goes here... </div>
<div>content goes here... </div><div style="other style..."><span style="other styles..."><strong>[END_CONTACT]</strong></span></div><div>content goes here... </div>
<div>content goes here... </div><div style="random stuff..."><span style="random stuff..."><strong>[END_CONTACT]</strong></span></div><div>content goes here... </div>
and a billion variations of this...
I want to be able to remove any variation of the markup surrounding [END_CONTACT] so that all I am left with is this:
<div>content goes here... </div><div>[END_CONTACT]</div><div>content goes here... </div>
How do I strip the content between the opening div tag and [END_CONTACT] and the content between [END_CONTACT] and the ending div tag?
Thanks

Use regular expressions! The following example using preg_replace will work as long as your content doesn't contain angle brackets, which you should not put in HTML.
$result = preg_replace('#<div\b[^>]*><span\b[^>]*><strong\b[^>]*>([^<]*)</strong></span></div>#i', '<div>$1</div>', $html);
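As a quick check, here is a minimal sketch that runs the pattern above against one of the sample lines from the question. The second input is a made-up variation (an <em> instead of the span/strong pair) included only to show that the pattern is tied to the exact div > span > strong nesting and leaves anything else unchanged.

```php
<?php
$pattern = '#<div\b[^>]*><span\b[^>]*><strong\b[^>]*>([^<]*)</strong></span></div>#i';

// Sample line taken from the question.
$html = '<div>content goes here... </div>'
      . '<div style="some style..."><span style="some styles..."><strong>[END_CONTACT]</strong></span></div>'
      . '<div>content goes here... </div>';
$result = preg_replace($pattern, '<div>$1</div>', $html);
echo $result, "\n";
// <div>content goes here... </div><div>[END_CONTACT]</div><div>content goes here... </div>

// Hypothetical variation with different nesting: the pattern does not match,
// so the input comes back unchanged.
$other = '<div style="x"><em>[END_CONTACT]</em></div>';
$unchanged = preg_replace($pattern, '<div>$1</div>', $other);
echo $unchanged, "\n";
```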

How do I strip the content between the opening div tag and [END_CONTACT] and the content between [END_CONTACT] and ending div tag?
If the terms [END_CONTACT] and the <div> tag are always present, you can use PCRE REGEX in preg_replace():
$string = preg_replace('/<div[^>]*>.*\[END_CONTACT\].*<\/div>/i','<div>[END_CONTACT]</div>',$string);
Example:
$data = [];
$data[] = 'some text <div style="some style..."><span style="some styles..."><strong>[END_CONTACT]</strong></span></div>';
$data[] = 'something else etc.<div style="other style..."><span style="other styles..."><strong>[END_CONTACT]</strong></span></div>';
$data[] = '<div style="random stuff..."><span style="random stuff..."><strong>[END_CONTACT]</strong></span></div>';
$data[] = 'and a billion variations of this...';
foreach ($data as $row) {
    // The match starts at the first <div, so text before it is kept
    $string = preg_replace('/<div[^>]*>.*\[END_CONTACT\].*<\/div>/i', '<div>[END_CONTACT]</div>', $row);
    print $string . "<BR>";
}
Output:
some text <div>[END_CONTACT]</div>
something else etc.<div>[END_CONTACT]</div>
<div>[END_CONTACT]</div>
and a billion variations of this...
UPDATE:
Sorry, wasn't clear about that in my original post. Is there any way to keep text or code outside of the string in question but still do the operation as you've suggested?
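Try this regex in the above PHP code. It only matches a <div> whose contents include [END_CONTACT] with no other <div> or </div> tag in between, so neighbouring text and elements are left alone:
/<div[^>]*>(?:(?!<\/?div\b).)*\[END_CONTACT\](?:(?!<\/?div\b).)*<\/div>/is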
Example:
content content content... <div style="random stuff..."><span style="random stuff..."><strong>[END_CONTACT]</strong></span></div> content content content
Output:
content content content... <div>[END_CONTACT]</div> content content content
NOTE:
You should use a DOM parser rather than regex to work with HTML elements in complex compositions. The regex above does what is desired for the examples shown, but for multilayered, complex HTML a proper PHP DOM parser is the right tool.
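For comparison, here is a minimal sketch of the DOM-based route mentioned in the note, using PHP's bundled DOMDocument and DOMXPath. It assumes the input is a plain ASCII HTML fragment like the one in the question and that the marker sits inside a <div> that contains no further <div>; the variable names are illustrative only.

```php
<?php
// Sample fragment from the question.
$html = '<div>content goes here... </div>'
      . '<div style="some style..."><span style="some styles..."><strong>[END_CONTACT]</strong></span></div>'
      . '<div>content goes here... </div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);              // libxml wraps the fragment in <html><body>
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Innermost <div> elements whose text content includes the marker.
foreach ($xpath->query('//div[not(.//div) and contains(., "[END_CONTACT]")]') as $div) {
    // Remove every attribute (style and so on).
    while ($div->attributes->length > 0) {
        $div->removeAttributeNode($div->attributes->item(0));
    }
    // Replace all child nodes with the bare marker text.
    $div->nodeValue = '[END_CONTACT]';
}

// Serialize only the children of <body>, i.e. the original fragment.
$result = '';
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result, "\n";
```

Unlike the regex, this does not care how many wrappers (span, strong, or anything else) sit between the <div> and the marker.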

Related

How can i match and replace every character of (or between) different Nodes that has similar tagName? [duplicate]

I am trying to replace every character (including newlines, tabs, whitespace, etc.) between nodes that have the same tag name. The problem is that the regex treats the separate nodes as a single match, spanning from the first opening tag to the last closing tag, and so outputs a single result.
For Example:
$html_string = "
<div> Below are object Node with the html code </div>
<script> alert('i want this to be replaced. it has no newline'); </script>
<div> I don't want this to be replaced </div>
<script>
console.log('i also want this to be replaced. It has newline');
</script>
<div> This is a div tag and not a script, so it should not be replaced </div>
<script> console.warn('Finally, this should be replaced, it also has newline');
</script>
<div> The above is the final result of the replacements </div> ";
$regex = '/(?:\<script\>)(.*)?(?:\<\/script\>)/ims';
$result = preg_replace($regex, '<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->', $html_string);
echo $result;
Expected Result:
<div> Below are object Node with the html code </div>
<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->
<div> I don't want this to be replaced </div>
<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->
<div> This is a div tag and not a script, so it should not be replaced </div>
<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->
<div> The above is the final result of the replacements </div>
Actual Output:
<div> Below are object Node with the html code </div>
<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->
<div> The above is the final result of the replacements </div>
How can I sort this out? Thanks in advance.
Using DOMDocument is generally preferable to trying to parse HTML with regex. Based on your question, this will give you the results you want. It finds each script node in the HTML and replaces it with the comment you specified:
$doc = new DOMDocument();
$doc->loadHTML("<html>$html_string</html>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//script') as $script) {
    $comment = $doc->createComment('THIS SCRIPT CONTENT HERE HAS BEEN ALTERED');
    $script->parentNode->replaceChild($comment, $script);
}
echo substr($doc->saveHTML(), 6, -8);
Note that because you don't have a top-level element in the HTML, one (<html>) has to be added on read and then removed on output (using substr).
Output:
<div> Below are object Node with the html code </div>
<!--THIS SCRIPT CONTENT HERE HAS BEEN ALTERED-->
<div> I don't want this to be replaced </div>
<!--THIS SCRIPT CONTENT HERE HAS BEEN ALTERED-->
<div> This is a div tag and not a script, so it should not be replaced </div>
<!--THIS SCRIPT CONTENT HERE HAS BEEN ALTERED-->
<div> The above is the final result of the replacements </div>
If you insist on using regex (but you should read this before you do), the problem with your regex lies in this part:
(.*)?
This looks for an optional string of as many characters as possible, leading up to </script>. So it basically absorbs all the characters between the first <script> and the last </script> (because all the characters in </script> match .). What you actually wanted was (.*?) which is non-greedy and so matches only up to the first </script> i.e.
$regex = '/(?:\<script\>)(.*?)(?:\<\/script\>)/ims';
$result = preg_replace($regex, '<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->', $html_string);
echo $result;
The output from this is as you require.
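The difference between the two quantifiers is easy to see on a shortened input. This is a small sketch with a made-up one-line string and a placeholder comment; only the quantifier differs between the two calls.

```php
<?php
// Hypothetical shortened input: two script blocks with a div between them.
$html = '<script>a</script><div>keep</div><script>b</script>';

// Greedy: (.*) runs to the LAST </script>, so everything is swallowed.
$greedy = preg_replace('/<script>(.*)<\/script>/is', '<!-- X -->', $html);

// Lazy: (.*?) stops at the FIRST </script> after each <script>.
$lazy = preg_replace('/<script>(.*?)<\/script>/is', '<!-- X -->', $html);

echo $greedy, "\n"; // <!-- X -->
echo $lazy, "\n";   // <!-- X --><div>keep</div><!-- X -->
```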

PHP Regex to find matching div tags

Example: https://regex101.com/r/nHiyU3/1
CODE:
<div id="content">
<div>
<div class="col-image"></div> <!-- STOPS HERE -->
THIS CONTENT HERE DOES NOT GET CAPTURED
</div>
</div>
REGEX:
/<div id=[\'|"]content[\'|"][^>]*>(.*)<\/div>/sUi
The match stops where I've added the "STOPS HERE" note.
Any reason why? I have followed other topics on SO but can't get it to grab the whole lot.
I know how to match across multiple lines; the problem is finding the matching closing tag across those lines.
As mentioned in the comments, it's easier to achieve this using DOMDocument. For example:
<?php
$html = '<div id="content"><div><div class="col-image"></div> <!--
STOPS HERE -->THIS CONTENT HERE DOES NOT GET CAPTURED</div></div>';
// loadHTML() is an instance method, so create the document first
$domDocument = new DOMDocument();
$domDocument->loadHTML($html);
$divList = $domDocument->getElementsByTagName('div');
foreach ($divList as $div) {
    var_dump($div);
}
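As for why the regex stops early: the U modifier inverts greediness, so (.*) becomes lazy and ends at the first </div> it meets, and a regex has no notion of which closing tag belongs to which opening tag. A DOM query does. The following is a minimal sketch, assuming the markup from the question, that selects the div by its id and rebuilds its inner HTML; the variable names are illustrative.

```php
<?php
$html = '<div id="content"><div><div class="col-image"></div> <!-- STOPS HERE -->'
      . 'THIS CONTENT HERE DOES NOT GET CAPTURED</div></div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select the element by id; the parser has already paired every tag.
$xpath = new DOMXPath($doc);
$content = $xpath->query('//div[@id="content"]')->item(0);

// Inner HTML = each child node serialized in order.
$inner = '';
foreach ($content->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}
echo $inner, "\n";
```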

putting paragraph tags on content paragraphs

I'm trying to put paragraph tags around the paragraphs. This code pulls the blockquotes out of my WordPress post and outputs everything else:
<?php
$block2 = get_the_content();
$block2 = preg_replace('~<blockquote>([\s\S]+?)</blockquote>~', '', $block2);
echo '<p>'.$block2.'</p>';
?>
But it only puts <p> tags around the first paragraph and not the others.
If I've understood this correctly, you could try splitting $block2 by newlines, looping through the resulting array and wrapping each element of the array in <p> tags as you have done.
Currently, your code wraps the entire content of $block2 in <p> tags, where I assume you wanted it to wrap the sections separated by newlines.
Example:
// Split on runs of line breaks and skip empty pieces
$split_block = preg_split('/\R+/', $block2, -1, PREG_SPLIT_NO_EMPTY);
foreach ($split_block as $i => $paragraph) {
    $split_block[$i] = '<p>' . $paragraph . '</p>';
}
echo implode("\n", $split_block);

How to extract HTML element from a source file

Using PHP, I need to replace an HTML section, identified by a tag id, in source code that is a combination of HTML and PHP. If it were pure HTML, a DOM parser could be used; if there were no DIV inside the DIV, I can imagine how to use preg_match. This is what I am trying to do. I have code (loaded into a string) like:
<div>
<img >
</div>
<? include(); ?>
<div id="mydiv">
<div>
<div>
<img >
</div>
</div>
</div>
and my task is to replace content of "mydiv" DIV with a new one e.g.
<div id="newdiv">
some text
</div>
so the string will look like this after the change:
<div>
<img >
</div>
<? include(); ?>
<div id="mydiv">
<div id="newdiv">
some text
</div>
</div>
I have already tried:
1) Parsing the code using DOMDocument's loadHTML, which produces a lot of errors when PHP code is included.
2) Regexes like preg_match_all('/<div id="myid"([^<]*)<\/div>/', $src, $matches), which fail when child divs are present.
The best approach I have found so far is:
1) Find the id="mydiv" string.
2) Search for '<' and '>' characters and count them, e.g. '<' = 1 and '>' = -1 (not exactly, but it gives the idea).
3) Once the sum reaches 0, I should be at the position of the closing tag, so I know which portion of the string to replace.
This is quite a "heavy" solution, and it can stop working when the code is different (e.g. the on-page PHP code itself contains those characters rather than just a simple "include"). So I am looking for a better solution.
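The counting idea described above can be sketched as follows, counting whole <div ...> and </div> tags rather than single characters. replaceDivContent is a hypothetical helper written for this illustration, not an existing function. It assumes the id attribute is written with double quotes and that no text resembling a div tag appears inside PHP blocks, comments, or attribute values; it treats everything else, including PHP code, as opaque text.

```php
<?php
// Hypothetical helper: swap the inner content of the <div> with the given id
// by tracking the nesting depth of <div ...> and </div> tags.
function replaceDivContent(string $src, string $id, string $newInner): ?string
{
    // Locate the opening tag of the target div.
    $open = '/<div\b[^>]*\bid="' . preg_quote($id, '/') . '"[^>]*>/i';
    if (!preg_match($open, $src, $m, PREG_OFFSET_CAPTURE)) {
        return null; // target div not found
    }
    $innerStart = $m[0][1] + strlen($m[0][0]);

    // Walk over every later div tag: +1 for an opening tag, -1 for a closing one.
    $depth = 1;
    $offset = $innerStart;
    while (preg_match('/<(\/?)div\b[^>]*>/i', $src, $t, PREG_OFFSET_CAPTURE, $offset)) {
        $depth += ($t[1][0] === '/') ? -1 : 1;
        if ($depth === 0) {
            // $t[0][1] is the position where the matching </div> begins.
            return substr($src, 0, $innerStart) . $newInner . substr($src, $t[0][1]);
        }
        $offset = $t[0][1] + strlen($t[0][0]);
    }
    return null; // unbalanced markup
}

$src = "<div>\n<img >\n</div>\n<? include(); ?>\n<div id=\"mydiv\">\n"
     . "<div>\n<div>\n<img >\n</div>\n</div>\n</div>";
$out = replaceDivContent($src, 'mydiv', "\n<div id=\"newdiv\">\nsome text\n</div>\n");
echo $out, "\n";
```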
You could try something like this:
$file = 'filename.php';
$content = file_get_contents($file);
$array_one = explode( '<div id="mydiv">' , $content );
$my_div_content = explode("</div>" , $array_one[1] )[0];
Or use preg_match like you said:
preg_match('/<div id="mydiv"(.*?)<\/div>/s', $content, $matches);
Yes, there is. First you need to use a function that will get the content of the file. Let's call the file homepage.php:
$homepageString = file_get_contents('homepage.php');
Now you have a string with all the content. The next thing you would do is use the preg_replace() function to take out the part of code that you want to take out:
$newHomepageString = preg_replace('/id="mydiv"/',"", $homepageString);
Now you overwrite the existing homepage.php file with the new source code:
file_put_contents("homepage.php", $newHomepageString);
Let me know if it worked for you! :)

convert DIV to SPAN using str_replace

I have some data that is provided to me as $data, an example of some of the data is...
<div class="widget_output">
<div id="test1">
Some Content
</div>
<ul>
<li>
<p>
<div>768hh</div>
<div>2308d</div>
<div>237ds</div>
<div>23ljk</div>
</p>
</li>
<div id="temp3">
Some more content
</div>
<li>
<p>
<div>lkgh322</div>
<div>32khhg</div>
<div>987dhgk</div>
<div>23lkjh</div>
</p>
</li>
</div>
I am attempting to change the invalid HTML DIVs inside the paragraphs so I end up with this instead...
<div class="widget_output">
<div id="test1">
Some Content
</div>
<ul>
<li>
<p>
<span>768hh</span>
<span>2308d</span>
<span>237ds</span>
<span>23ljk</span>
</p>
</li>
<div id="temp3">
Some more content
</div>
<li>
<p>
<span>lkgh322</span>
<span>32khhg</span>
<span>987dhgk</span>
<span>23lkjh</span>
</p>
</li>
</div>
I am trying to do this using str_replace with something like...
$data = str_replace('<div>', '<span>', $data);
$data = str_replace('</div>', '</span>', $data);
Is there a way I can combine these two statements and also make it so that they only affect the 'This is a random item' and not the other occurrences?
$data = str_replace(array('<div>', '</div>'), array('<span>', '</span>'), $data);
Since you didn't give any other details and only asked:
Is there a way I can combine these two statements and also make it so that they only affect the 'This is a random item' and not the other occurrences?
Here you go:
$data = str_replace('<div>This is a random item</div>', '<span>This is a random item</span>', $data);
You'll need to use a regular expression to do what you are looking to do, or to actually parse the string as XML and modify it that way. The XML parsing is almost surely the "safest," since as long as the string is valid XML, it will work in a predictable way. Regexes can at times fall prey to strings not being in exactly the expected format, but if your input is predictable enough, they can be OK. To do what you want with regular expressions, you'd do something like
$parsed_string = preg_replace('~<div>(?=This is a random item)(.*?)</div>~', '<span>$1</span>', $input_string);
What's happening here is the regex is looking for a <div> tag which is followed by (using a lookahead assertion) This is a random item. It then captures any text between that tag and the next </div> tag. Finally, it replaces the match with <span>, followed by the captured text from inside the div tags, followed by </span>. This will work fine on the example you posted, but will have problems if, for example, the <div> tag has a class attribute. If you are expecting things like that, either a more complex regular expression would be needed, or full XML parsing might be the best way to go.
I'm a little surprised by the other answers, I thought someone would post a good one, but that hasn't happened. str_replace is not powerful enough in this case, and regular expressions are hit-and-miss, you need to write a parser.
You don't have to write a full HTML-parser, you can cheat a bit.
$in = '<div class="widget_output">
(..)
</div>';
$lines = explode("\n", $in);
$in_paragraph = false;
foreach ($lines as $nr => $line) {
    if (strstr($line, "<p>")) {
        $in_paragraph = true;
    } elseif (strstr($line, "</p>")) {
        $in_paragraph = false;
    } else {
        if ($in_paragraph) {
            $lines[$nr] = str_replace(array('<div>', '</div>'), array('<span>', '</span>'), $line);
        }
    }
}
echo implode("\n", $lines);
The critical part here is detecting whether you're in a paragraph or not. And only when you're in a paragraph, do the string replacement.
Note: I'm splitting on newlines (\n) which is not perfect, but works in this case. You might want to improve this part.
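One way to drop the dependency on newlines is to let preg_replace_callback pick out each <p>...</p> block and run the same str_replace inside the callback. This is a minimal sketch on a shortened copy of the question's data; it assumes plain <p> tags without attributes and no nested paragraphs.

```php
<?php
// Shortened sample: one div outside the paragraph, two inside it.
$data = "<div id=\"test1\">\nSome Content\n</div>\n<p>\n<div>768hh</div>\n<div>2308d</div>\n</p>";

$result = preg_replace_callback(
    '~<p>.*?</p>~s',                       // each paragraph block, newlines included
    function (array $m): string {
        // Only the text inside this paragraph is touched.
        return str_replace(['<div>', '</div>'], ['<span>', '</span>'], $m[0]);
    },
    $data
);
echo $result, "\n";
```

The div with id="test1" sits outside any paragraph, so neither its opening nor its closing tag is changed.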
