Conditional search and replace using PHP and regex

I need to hide all "p" tags in an HTML file that have an inline style with a "left" offset of 400 or more.
I'm hoping some clever regex will replace "left:XXX" with "display:none" whenever XXX is 400 or more.
For example, this:
<p style="position:absolute;top:98px;left:472px;white-space:nowrap">
...would need to be replaced with this:
<p style="position:absolute;top:98px;display:none;white-space:nowrap">
It seems like simple enough logic, but the regex and PHP are mind-boggling for me.
Here is what I've been trying, but I can only get it to work line by line:
$width = preg_match("left:(.*?)px",$contents);
if ($width >399)
{
$contents = preg_replace('/left:(.*?)px/', "display:none", $contents);
}
Any suggestions greatly appreciated! :)
Wonko

Don't expect regex to solve every problem in the world.
Use DOMDocument to extract the p tags that have a style attribute, pull the "left" value out of the style attribute with a regex pattern, and do the replacement when the value is greater than or equal to 400 (a simple comparison is enough).
$dom = new DOMDocument;
$dom->loadHTML($html);
$pTags = $dom->getElementsByTagName('p');
foreach($pTags as $pTag) {
if ($pTag->hasAttribute('style')) {
$style = $pTag->getAttribute('style');
$style = preg_replace_callback(
'~(?<=[\s;]|^)left\s*:\s*(\d+)\s*px\s*(?:;|$)~i',
function ($m) {
return ($m[1] > 399) ? 'display:none;' : $m[0];
},
$style
);
$pTag->setAttribute('style', $style);
}
}
$result = $dom->saveHTML();
EDIT: in the worst case, the style attribute may contain display:block; (or display with any value other than none) after the left value. To avoid any problem, it is better to put display:none at the end.
$style = preg_replace_callback(
'~(?<=[\s;]|^)left\s*:\s*(\d+)\s*px\s*(;.*|$)~is',
function ($m) {
if ($m[1] < 400) return $m[0];
// keep the declarations that followed "left", then append display:none last
$rest = trim($m[2], "; \t\r\n");
return ($rest === '' ? '' : $rest . ';') . 'display:none;';
},
$style
);
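Putting the steps above together, here is a minimal runnable sketch. The function name hide_far_left_paragraphs and the $minLeft parameter are made up for illustration, and it assumes integer px values as in the question. Unlike the snippets above, it leaves the left value in place and only appends display:none at the end.

```php
<?php
// Hypothetical helper; the default of 400 comes from the question.
function hide_far_left_paragraphs($html, $minLeft = 400) {
    $dom = new DOMDocument;
    // Wrap in <body> so the fragment's nodes can be read back from one known parent
    $dom->loadHTML('<body>' . $html . '</body>');
    foreach ($dom->getElementsByTagName('p') as $p) {
        $style = $p->getAttribute('style'); // '' when there is no style attribute
        if (preg_match('~(?:^|[\s;])left\s*:\s*(\d+)\s*px~i', $style, $m)
            && (int) $m[1] >= $minLeft) {
            // Appending last lets display:none win over any earlier display value
            $p->setAttribute('style', rtrim($style, "; \t\r\n") . ';display:none');
        }
    }
    // Serialize only the children of <body>, not the implied html/body wrapper
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo hide_far_left_paragraphs(
    '<p style="position:absolute;top:98px;left:472px;white-space:nowrap">a</p>'
    . '<p style="left:100px">b</p>'
);
```

The (?:^|[\s;]) prefix keeps properties such as margin-left from being mistaken for left.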

I've tested it and it works correctly:
$string = '<p style="position:absolute;top:98px;left:472px;white-space:nowrap">';
$test = str_replace('left:', 'display:none;[', $string );
$test = str_replace('white-space', ']white-space', $test );
$out = delete_all_between('[', ']', $test);
print($out); // output
function delete_all_between($beginning, $end, $string) {
$beginningPos = strpos($string, $beginning);
$endPos = strpos($string, $end);
if ($beginningPos === false || $endPos === false) {
return $string;
}
$textToDelete = substr($string, $beginningPos, ($endPos + strlen($end)) - $beginningPos);
return str_replace($textToDelete, '', $string);
}
output:
<p style="position:absolute;top:98px;display:none;white-space:nowrap">
enjoy it ... !
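Note that the snippet above replaces the left value whatever its size. A minimal sketch that applies the "400 or more" condition directly on the raw string with preg_replace_callback; hide_left_over_threshold is a made-up name, and it assumes double-quoted style attributes and integer px values as in the question's example:

```php
<?php
// Hypothetical helper; not robust against arbitrary HTML, only the simple shape shown in the question.
function hide_left_over_threshold($contents, $minLeft = 400) {
    return preg_replace_callback(
        // group 1: the tag up to (not including) "left"; group 2: the numeric offset
        '~(<p\b[^>]*\bstyle="[^"]*?)(?<![\w-])left\s*:\s*(\d+)\s*px~i',
        function ($m) use ($minLeft) {
            // Swap in display:none only when the offset reaches the threshold
            return ((int) $m[2] >= $minLeft) ? $m[1] . 'display:none' : $m[0];
        },
        $contents
    );
}

$contents = '<p style="position:absolute;top:98px;left:472px;white-space:nowrap">';
echo hide_left_over_threshold($contents);
// prints: <p style="position:absolute;top:98px;display:none;white-space:nowrap">
```

Because the callback runs once per match, this works on a whole file at once rather than line by line.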

Related

Multiple occurrences of delimiters within an HTML template

I am facing a problem that I can't get my head around, so I thought I would turn to the experts once again to shed some light.
I have an HTML template, and within the template I have delimiters like:
[has_image]<p>The image is <img src="" /></p>[/has_image]
These delimiters may occur multiple times within the template. This is what I am trying to achieve:
Find all occurrences of these delimiters and replace the content between them with an image source, or with an empty string if the image doesn't exist, while keeping the rest of the template intact.
Below is my code. It works for one occurrence, but I am struggling to make it work for multiple occurrences.
function replace_text_template($template_body, $start_tag, $end_tag, $replacement = ''){
$occurances = substr_count($template_body, $start_tag);
$x = 1;
while($x <= $occurances) {
$start = strpos($template_body, $start_tag);
$stop = strpos($template_body, $end_tag);
$template_body = substr($template_body, 0, $start) . $start_tag . $replacement . substr($template_body, $stop);
$x++;
}
return $template_body;
}
$template_body will contain HTML code with delimiters:
replace_text_template($template_body, "[has_image]", "[/has_image]");
Whether or not I remove the while loop, it only works for a single delimiter.

I have managed to solve the problem. If anybody finds this useful please feel free to use the code. However, if anyone finds a better way please do share it.
function replace_text_template($template_body, $start_tag, $end_tag, $replacement = ''){
$occurances = substr_count($template_body, $start_tag);
$x = 1;
while($x <= $occurances) {
$start = strpos($template_body, $start_tag);
$stop = strpos($template_body, $end_tag);
$template_body = substr($template_body, 0, $start) . $start_tag . $replacement . substr($template_body, $stop);
$template_body = str_replace($start_tag.''.$end_tag, '', $template_body); // replace the tags so on next loop the position will be correct
$x++;
}
return $template_body;
}
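One possible alternative, as a sketch only: keep a search offset so each pass starts after the previous replacement. This also copes with a non-empty replacement. The name replace_all_delimited is made up, and unlike the function above it removes the delimiters together with their content.

```php
<?php
// Hypothetical helper: replaces every $start_tag ... $end_tag block (tags included) with $replacement.
function replace_all_delimited($template_body, $start_tag, $end_tag, $replacement = '') {
    $offset = 0;
    while (($start = strpos($template_body, $start_tag, $offset)) !== false) {
        $stop = strpos($template_body, $end_tag, $start + strlen($start_tag));
        if ($stop === false) {
            break; // unbalanced template: leave the rest untouched
        }
        $length = $stop + strlen($end_tag) - $start;
        $template_body = substr_replace($template_body, $replacement, $start, $length);
        // Continue after the inserted text so the replacement itself is never rescanned
        $offset = $start + strlen($replacement);
    }
    return $template_body;
}

$template = "a [has_image]<p>one</p>[/has_image] b [has_image]<p>two</p>[/has_image] c";
echo replace_all_delimited($template, "[has_image]", "[/has_image]", "X");
// prints: a X b X c
```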

function replace_text_template($template_body, $start_tag, $replacement = '') {
return preg_replace_callback("~\[".preg_quote($start_tag)."\].*?\[\/".preg_quote($start_tag)."\]~i", function ($matches) use ($replacement) {
if(preg_match('~<img.*?src="([^"]+)"~i', $matches[0], $match)) {
if (is_array(getimagesize($match[1]))) return $match[1];
}
return $replacement;
}, $template_body);
}
$template_body = <<<EOL
text
[has_image]<p>The image is <img src="" /></p>[/has_image]
abc [has_image]<p>The image is <img src="http://blog.stackoverflow.com/wp-content/themes/se-company/images/logo.png" /></p>[/has_image]xyz
EOL;
echo replace_text_template($template_body, "has_image", "replacement");
Returns:
text
replacement
abc http://blog.stackoverflow.com/wp-content/themes/se-company/images/logo.pngxyz

Preg Replace in PHP for Heading Tags

I have Markdown text that I have to convert without using library functions, so I used preg_replace for this. It works fine for some cases, but not for headings.
For example:
Heading
=======
should be converted to <h1>Heading</h1>, and also
##Sub heading should be converted to <h2>Sub heading</h2>
###Sub heading should be converted to <h3>Sub heading</h3>
I have tried:
$text = preg_replace('/##(.+?)\n/s', '<h2>$1</h2>', $text);
The above code works, but I need to count the hash symbols and choose the heading tag based on that count.
Can anyone help me, please?

Try using preg_replace_callback.
Something like this -
$regex = '/(#+)(.+?)\n/s';
$line = "##Sub heading\n ###sub-sub heading\n";
$line = preg_replace_callback(
$regex,
function($matches){
$h_num = strlen($matches[1]);
return "<h$h_num>".$matches[2]."</h$h_num>";
},
$line
);
echo $line;
The output would be something like this -
<h2>Sub heading</h2> <h3>sub-sub heading</h3>
EDIT
For the combined problem of using = for headings and # for sub-headings, the regex gets a bit more complicated, but the principle remains the same using preg_replace_callback.
Try this -
$regex = '/(?:(#+)(.+?)\n)|(?:(.+?)\n\s*=+\s*\n)/';
$line = "Heading\n=======\n##Sub heading\n ###sub-sub heading\n";
$line = preg_replace_callback(
$regex,
function($matches){
//var_dump($matches);
if($matches[1] == ""){
return "<h1>".$matches[3]."</h1>";
}else{
$h_num = strlen($matches[1]);
return "<h$h_num>".$matches[2]."</h$h_num>";
}
},
$line
);
echo $line;
The output is -
<h1>Heading</h1><h2>Sub heading</h2> <h3>sub-sub heading</h3>
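A variation on the same idea, as a sketch: with the m (multiline) flag, ^ and $ anchor to each line, so the newlines survive and a # in the middle of a line is left alone. The function name markdown_headings is made up, and the cap at six # characters is an assumption based on HTML only having h1 to h6.

```php
<?php
// Hypothetical helper covering only the two heading styles from the question.
function markdown_headings($text) {
    // A line followed by a line of "=" becomes <h1>
    $text = preg_replace('/^(.+)\n=+[ \t]*$/m', '<h1>$1</h1>', $text);
    // 1 to 6 leading "#" give the heading level
    return preg_replace_callback('/^(#{1,6})[ \t]*(.+?)[ \t]*$/m', function ($m) {
        $level = strlen($m[1]);
        return "<h$level>" . $m[2] . "</h$level>";
    }, $text);
}

echo markdown_headings("Heading\n=======\n##Sub heading\n###Sub sub heading\nplain text\n");
// prints:
// <h1>Heading</h1>
// <h2>Sub heading</h2>
// <h3>Sub sub heading</h3>
// plain text
```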

Do a preg_match_all like this:
$string = "#####asdsadsad";
preg_match_all('/\G#/', $string, $matches); // \G matches each consecutive leading #
var_dump(count($matches[0])); // int(5)
And based on the count of matches you can do whatever you want.
Or, use the preg_replace_callback function.
$input = "#This is my text";
$pattern = '/^(#+)(.+)/';
$mytext = preg_replace_callback($pattern, 'parseHashes', $input);
var_dump($mytext);
function parseHashes($input) {
$cnt = strlen($input[1]); // number of leading # characters
if ($cnt <= 6) {
return '<h' . $cnt . ' class="if you want class here">' . $input[2] . '</h' . $cnt . '>';
} else {
// This is not a valid h tag. Return the matched text unchanged (or do whatever you want).
return $input[0];
}
}

Convert lib_string to string w/o Regex

I need to convert lib_someString to someString inside a block of text using str_replace [not regex].
Here's an example to give an exact sense what I mean: lib_12345 => 12345. I need to do this for a bunch of instances in a block of text.
Below is my attempt. Problem I'm getting is that my function is not doing anything (I just get lib_id returned).
function extractLibId($val){ // function to get the "12345" in the above example
$lclRetVal = substr($val, 5, strlen($val));
return $lclRetVal;
}
function Lib($text){ // does the replace for all lib_ instances in the text
$lclVar = "lib_";
$text = str_replace($lclVar, "<a href='".extractLibId($lclVar)."'>".extractLibId($lclVar)."</a>", $text);
return $text;
}

A regexp is going to be faster and clearer, and there is no need to call your function for every possible 'lib_' string:
function Lib($text) {
$count = null;
return preg_replace('/lib_([0-9]+)/', '$1', $text, -1, $count);
}
$text = 'some text lib_123123 goes here lib_111';
$text = Lib($text);
Without regexp (but every time Lib2 is called, a cute kitten dies somewhere):
function extractLibId($val) {
$lclRetVal = substr($val, 4);
return $lclRetVal;
}
function Lib2($text) {
$count = null;
while (($pos = strpos($text, 'lib_')) !== false) {
$end = $pos;
// check the bound first so $text[$end] is never read past the end of the string
while ($end < strlen($text) && !in_array($text[$end], array(' ', ',', '.')))
$end++;
$sub = substr($text, $pos, $end - $pos);
$text = str_replace($sub, ''.extractLibId($sub).'', $text);
}
return $text;
}
$text = 'some text lib_123123 goes here lib_111';
$text = Lib2($text);
Use preg_replace.
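If regex really is off the table, here is a minimal sketch that walks the text once with strpos and strspn instead of rescanning from the start after every replacement. It assumes the ids are plain digits, as in the lib_12345 example, and strip_lib_prefix is a made-up name.

```php
<?php
// Hypothetical helper: turns lib_12345 into 12345 wherever it appears in $text.
function strip_lib_prefix($text) {
    $prefix = 'lib_';
    $offset = 0;
    while (($pos = strpos($text, $prefix, $offset)) !== false) {
        // Count the digits that directly follow the prefix
        $digits = strspn($text, '0123456789', $pos + strlen($prefix));
        if ($digits > 0) {
            // Drop only the prefix; the id itself stays in place
            $text = substr_replace($text, '', $pos, strlen($prefix));
            $offset = $pos + $digits;
        } else {
            $offset = $pos + strlen($prefix); // "lib_" without an id: skip it
        }
    }
    return $text;
}

echo strip_lib_prefix('some text lib_123123 goes here lib_111');
// prints: some text 123123 goes here 111
```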

Although it is possible to do what you need without regular expressions, you say you don't want to use them because of performance reasons. I doubt the other solution will be faster, so here is a simple regex to benchmark against:
echo preg_replace("/lib_(\w+)/", '$1', $str);
As shown here: http://codepad.org/xGj78r9r

Setting aside how ridiculous it is to optimize this at all, even the simplest implementation with minimal validation takes only about 33% less time than a regex:
<?php
function uselessFunction( $val ) {
if( strpos( $val, "lib_" ) !== 0 ) {
return $val;
}
$str = substr( $val, 4 );
return "{$str}";
}
$l = 100000;
$now = microtime(TRUE);
while( $l-- ) {
preg_replace( '/^lib_(.*)$/', "$1", 'lib_someString' );
}
echo (microtime(TRUE)-$now)."\n";
//0.191093
$l = 100000;
$now = microtime(TRUE);
while( $l-- ) {
uselessFunction( "lib_someString" );
}
echo (microtime(TRUE)-$now);
//0.127598
?>

If you're restricted from using a regex, you're going to have a difficult time searching for a string you describe as "someString", i.e. one that is not precisely known in advance. If you know the string is exactly lib_12345, for example, then set $lclVar to that string. On the other hand, if you don't know the exact string in advance, you'll have to use a regex via preg_replace() or a similar function.

Strip tag with class in PHP

So I need to strip the span tags of class tip.
So that would be <span class="tip"> and the corresponding </span>, and everything inside it...
I suspect a regular expression is needed but I terribly suck at this.
Laugh...
<?php
$string = 'April 15, 2003';
$pattern = '/(\w+) (\d+), (\d+)/i';
$replacement = '${1}1,$3';
echo preg_replace($pattern, $replacement, $string);
?>
Gives no error... But
<?php
$str = preg_replace('<span class="tip">.+</span>', "", '<span class="rss-title"></span><span class="rss-link">linkylink</span><span class="rss-id"></span><span class="rss-content"></span><span class=\"rss-newpost\"></span>');
echo $str;
?>
Gives me the error:
Warning: preg_replace() [function.preg-replace]: Unknown modifier '.' in <A FILE> on line 4
previously, the error was at the ); in the 2nd line, but now.... >.>

This is the "proper" method (adapted from this answer).
Input:
<?php
$str = '<div>lol wut <span class="tip">remove!</span><span>don\'t remove!</span></div>';
?>
Code:
<?php
function recurse($doc, $parent) {
if (!$parent->hasChildNodes())
return;
for ($i = 0; $i < $parent->childNodes->length; ) {
$elm = $parent->childNodes->item($i);
// getAttribute() returns '' when the attribute is missing, so spans without a class are safe
if ($elm->nodeName == "span" && $elm->getAttribute("class") == "tip") {
$parent->removeChild($elm);
continue;
}
recurse($doc, $elm);
$i++;
}
}
// Load in the DOM (remembering that XML requires one root node)
$doc = new DOMDocument();
$doc->loadXML("<document>" . $str . "</document>");
// Iterate the DOM
recurse($doc, $doc->documentElement);
// Output the result
foreach ($doc->childNodes->item(0)->childNodes as $node) {
echo $doc->saveXML($node);
}
?>
Output:
<div>lol wut <span>don't remove!</span></div>
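The same removal can also be sketched with DOMXPath instead of a hand-written recursion. This assumes, like the code above, that the fragment is well-formed XML and that the class attribute is exactly "tip" (an element with several classes would not match this query).

```php
<?php
$str = '<div>lol wut <span class="tip">remove!</span><span>don\'t remove!</span></div>';

$doc = new DOMDocument();
$doc->loadXML('<document>' . $str . '</document>');

$xpath = new DOMXPath($doc);
// Collect the matches first, then remove them
$tips = iterator_to_array($xpath->query('//span[@class="tip"]'));
foreach ($tips as $tip) {
    $tip->parentNode->removeChild($tip);
}

// Serialize the children of the artificial <document> root
$out = '';
foreach ($doc->documentElement->childNodes as $node) {
    $out .= $doc->saveXML($node);
}
echo $out;
// prints: <div>lol wut <span>don't remove!</span></div>
```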

A simple regular expression like:
<span class="tip">.+</span>
won't work reliably. If another span is opened and closed inside the tip span, a lazy match ends at the inner closing tag rather than the tip's own, and a greedy match runs on to the last </span> in the string. DOM-based tools like the one linked in the comments will give a more reliable answer.
As per my comment below, you need to add pattern delimiters when working with regular expressions in PHP.
<?php
$str = preg_replace('#<span class="tip">.+</span>#', "", '<span class="rss-title"></span><span class="rss-link">linkylink</span><span class="rss-id"></span><span class="rss-content"></span><span class=\"rss-newpost\"></span>');
echo $str;
?>
may be moderately more successful. Please take a look at the documentation page for the function in question.
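For context on the warning in the question: without explicit delimiters, PHP treats the leading < and its matching > as the delimiter pair, so everything after <span class="tip"> is read as modifiers, hence "Unknown modifier '.'". A minimal sketch with explicit # delimiters and a lazy quantifier, which only suits tip spans that contain no nested span (the sample string is made up):

```php
<?php
$html = '<span class="rss-title"></span><span class="tip">hint</span><span class="rss-link">linkylink</span>';

// "#" delimits the pattern; ".*?" is lazy, so the match stops at the first </span>;
// the s modifier lets "." cross line breaks inside the span
$clean = preg_replace('#<span class="tip">.*?</span>#s', '', $html);

echo $clean;
// prints: <span class="rss-title"></span><span class="rss-link">linkylink</span>
```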

Now without regexp, and without heavy XML parsing:
$html = ' ... <span class="tip"> hello <span id="x"> man </span> </span> ... ';
$tag = '<span class="tip">';
$tag_close = '</span>';
$tag_familly = '<span';
$tag_len = strlen($tag);
$p1 = -1;
$p2 = 0;
while ( ($p2!==false) && (($p1=strpos($html, $tag, $p1+1))!==false) ) {
// the tag is found, now we will search for its corresponding closing tag
$level = 1;
$p2 = $p1;
$continue = true;
while ($continue) {
$p2 = strpos($html, $tag_close, $p2+1);
if ($p2===false) {
// error in the html contents, the analysis cannot continue
echo "ERROR in html contents";
$continue = false;
$p2 = false; // will stop the loop
} else {
$level = $level -1;
$x = substr($html, $p1+$tag_len, $p2-$p1-$tag_len);
$n = substr_count($x, $tag_familly);
if ($level+$n<=0) $continue = false;
}
}
if ($p2!==false) {
// delete the pair of tags, the farthest one first
$html = substr_replace($html, '', $p2, strlen($tag_close));
$html = substr_replace($html, '', $p1, $tag_len);
}
}

Close open HTML tags in a string

I have a string that ends up looking something like this:
<p>This is some text and here is a <strong>bold text then the post stop here....</p>
Because the function returns a teaser (summary) of the text, it stops after a certain number of words. In this case the strong tag is left unclosed, but the whole string is still wrapped in a paragraph.
Is it possible to convert the above result/output to the following:
<p>This is some text and here is a <strong>bold text then the post stop here....</strong></p>
I do not know where to begin. I found a function on the web that does this with regex, but it puts the closing tag after the end of the string, so the result won't validate: I want all open/close tags inside the paragraph tags. The function I found produces this, which is also wrong:
<p>This is some text and here is a <strong>bold text then the post stop here....</p></strong>
The tag can be strong, italic, anything, which is why I cannot simply close it manually inside the function. Is there a pattern that can do this for me?

Here is a function I've used before, which works pretty well:
function closetags($html) {
preg_match_all('#<(?!meta|img|br|hr|input\b)\b([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
$openedtags = $result[1];
preg_match_all('#</([a-z]+)>#iU', $html, $result);
$closedtags = $result[1];
$len_opened = count($openedtags);
if (count($closedtags) == $len_opened) {
return $html;
}
$openedtags = array_reverse($openedtags);
for ($i=0; $i < $len_opened; $i++) {
if (!in_array($openedtags[$i], $closedtags)) {
$html .= '</'.$openedtags[$i].'>';
} else {
unset($closedtags[array_search($openedtags[$i], $closedtags)]);
}
}
return $html;
}
Personally though, I would not do it using regexp but a library such as Tidy. This would be something like the following:
$str = '<p>This is some text and here is a <strong>bold text then the post stop here....</p>';
$tidy = new Tidy();
$clean = $tidy->repairString($str, array(
'output-xml' => true,
'input-xml' => true
));
echo $clean;

A small modification to the original answer: while the original answer handled tags correctly, I found that my truncation could end in the middle of a tag. For example:
This text has some <b>in it</b>
Truncating at character 21 results in:
This text has some <
The following code builds on the next best answer and fixes this.
function truncateHTML($html, $length)
{
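$cut = substr($html, 0, $length);
$rest = (string) substr($html, $length);
// The cut is inside a tag when its last "<" has no ">" after it
$lastOpen = strrpos($cut, "<");
$lastClose = strrpos($cut, ">");
$insideTag = $lastOpen !== false && ($lastClose === false || $lastClose < $lastOpen);
// Only then extend the cut to the end of that tag
$pos = $insideTag ? strpos($rest, ">") : false;
if($pos !== false)
{
$html = substr($html, 0,$length + $pos + 1);
}
else
{
$html = $cut;
}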
preg_match_all('#<(?!meta|img|br|hr|input\b)\b([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
$openedtags = $result[1];
preg_match_all('#</([a-z]+)>#iU', $html, $result);
$closedtags = $result[1];
$len_opened = count($openedtags);
if (count($closedtags) == $len_opened)
{
return $html;
}
$openedtags = array_reverse($openedtags);
for ($i=0; $i < $len_opened; $i++)
{
if (!in_array($openedtags[$i], $closedtags))
{
$html .= '</'.$openedtags[$i].'>';
}
else
{
unset($closedtags[array_search($openedtags[$i], $closedtags)]);
}
}
return $html;
}
$str = "This text has <b>bold</b> in it</b>";
print "Test 1 - Truncate with no tag: " . truncateHTML($str, 5) . "<br>\n";
print "Test 2 - Truncate at start of tag: " . truncateHTML($str, 20) . "<br>\n";
print "Test 3 - Truncate in the middle of a tag: " . truncateHTML($str, 16) . "<br>\n";
print "Test 4: - Truncate with less text: " . truncateHTML($str, 300) . "<br>\n";
Hope it helps someone out there.

And what about using PHP's native DOMDocument class? It inherently parses HTML and corrects syntax errors...
E.g.:
$fragment = "<article><h3>Title</h3><p>Unclosed";
$doc = new DOMDocument();
$doc->loadHTML($fragment);
$correctFragment = $doc->getElementsByTagName('body')->item(0)->C14N();
echo $correctFragment;
However, there are several disadvantages of this approach.
Firstly, it wraps the original fragment within the <body> tag. You can get rid of it easily by something like (preg_)replace() or by substituting the ...->C14N() function by some custom innerHTML() function, as suggested for example at http://php.net/manual/en/book.dom.php#89718.
The second pitfall is that PHP throws an 'invalid tag in Entity' warning if HTML5 or custom tags are used (nevertheless, it will still proceed correctly).
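Both drawbacks can be worked around. A minimal sketch, assuming a made-up helper name repair_fragment: it serializes only the children of <body>, and it uses libxml_use_internal_errors() to collect parser warnings instead of printing them.

```php
<?php
// Hypothetical helper: returns the fragment with unclosed tags closed by the parser.
function repair_fragment($fragment) {
    // Collect parser warnings (e.g. for HTML5 tags) instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML('<body>' . $fragment . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialize the children of <body> so the wrapper itself is not part of the result
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo repair_fragment('<p>This is some text and here is a <strong>bold text then the post stop here....</p>');
```

The fragment is wrapped in an explicit <body> so that there is always a body element to read the children from.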

This PHP method always worked for me. It will close all un-closed HTML tags.
function closetags($html) {
preg_match_all('#<([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
$openedtags = $result[1];
preg_match_all('#</([a-z]+)>#iU', $html, $result);
$closedtags = $result[1];
$len_opened = count($openedtags);
if (count($closedtags) == $len_opened) {
return $html;
}
$openedtags = array_reverse($openedtags);
for ($i=0; $i < $len_opened; $i++) {
if (!in_array($openedtags[$i], $closedtags)){
$html .= '</'.$openedtags[$i].'>';
} else {
unset($closedtags[array_search($openedtags[$i], $closedtags)]);
}
}
return $html;
}

There are numerous other variables that need to be addressed to give a full solution, but are not covered by your question.
However, I would suggest using something like HTML Tidy, in particular the repairFile or repairString methods.

If the tidy module is installed, use the PHP tidy extension:
tidy_repair_string($html)
reference

Using a regular expression isn't an ideal approach for this. You should use an html parser instead to create a valid document object model.
As a second option, depending on what you want, you could use a regex to remove any and all html tags from your string before you put it in the <p> tag.

I've written this code, which does the job quite well...
It's old school but efficient, and I've added a flag to remove unfinished tags such as " blah blah http://stackoverfl"
public function getOpennedTags(&$string, $removeInclompleteTagEndTagIfExists = true) {
$tags = array();
$tagOpened = false;
$tagName = '';
$tagNameLogged = false;
$closingTag = false;
foreach (str_split($string) as $c) {
if ($tagOpened && $c == '>') {
$tagOpened = false;
if ($closingTag) {
array_pop($tags);
$closingTag = false;
$tagName = '';
}
if ($tagName) {
array_push($tags, $tagName);
}
}
if ($tagOpened && $c == ' ') {
$tagNameLogged = true;
}
if ($tagOpened && $c == '/') {
if ($tagName) {
//orphan tag
$tagOpened = false;
$tagName = '';
} else {
//closingTag
$closingTag = true;
}
}
if ($tagOpened && !$tagNameLogged) {
$tagName .= $c;
}
if (!$tagOpened && $c == '<') {
$tagNameLogged = false;
$tagName = '';
$tagOpened = true;
$closingTag = false;
}
}
if ($removeInclompleteTagEndTagIfExists && $tagOpened) {
// a tag has been cut off, for example ' blabh blah <a href="sdfoefzofk', so closing the tag will not help...
// let's remove this ugly piece of tag
$pos = strrpos($string, '<');
$string = substr($string, 0, $pos);
}
return $tags;
}
Usage example :
$tagsToClose = $stringHelper->getOpennedTags($val);
$tagsToClose = array_reverse($tagsToClose);
foreach ($tagsToClose as $tag) {
$val .= "</$tag>";
}

This works for me to close any open HTML tags in a script.
<?php
function closetags($html) {
preg_match_all('#<([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
$openedtags = $result[1];
preg_match_all('#</([a-z]+)>#iU', $html, $result);
$closedtags = $result[1];
$len_opened = count($openedtags);
if (count($closedtags) == $len_opened) {
return $html;
}
$openedtags = array_reverse($openedtags);
for ($i=0; $i < $len_opened; $i++) {
if (!in_array($openedtags[$i], $closedtags)) {
$html .= '</'.$openedtags[$i].'>';
} else {
unset($closedtags[array_search($openedtags[$i], $closedtags)]);
}
}
return $html;
}

An up-to-date solution with parsing HTML would be:
function fix_html($html) {
$dom = new DOMDocument();
$dom->loadHTML( mb_convert_encoding( $html, 'HTML-ENTITIES', 'UTF-8' ), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
return $dom->saveHTML();
}
LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD is needed to keep the parser from adding a doctype, html and body; the rest looks pretty obvious :)
UPDATE:
After some testing I noticed that the solution above breaks a correct layout from time to time. The following works well, though:
function fix_html($html) {
$dom = new DOMDocument();
$dom->loadHTML( mb_convert_encoding( $html, 'HTML-ENTITIES', 'UTF-8' ) );
$return = '';
foreach ( $dom->getElementsByTagName( 'body' )->item(0)->childNodes as $v ) {
$return .= $dom->saveHTML( $v );
}
return $return;
}
