Regular Expression: Converting non-block elements with <br /> to <p> in PHP

Someone has asked a similar question, but the accepted answer doesn't meet my requirements.
Input:
<strong>bold <br /><br /> text</strong><br /><br /><br />
link<br /><br />
<pre>some code</pre>
I'm a single br, <br /> leave me alone.
Expected output:
<p><strong>bold <br /> text</strong><br /></p>
<p>link<br /></p>
<pre>some code</pre>
<p>I'm a single br, <br /> leave me alone.</p>
The accepted answer I mentioned above converts runs of <br /> to <p> and finally wraps the whole input in another <p>. That doesn't work in my case, because a <pre> can't be wrapped inside a <p> tag. Can anyone help?
Update:
The expected output before this edit was a little confusing. The whole point is:
convert multiple <br /> to a single one (achieved with preg_replace('~(<br />)+~', '<br />', $str); note that the delimiter must not be /, because the pattern itself contains a /)
check for inline elements and unwrapped text (there is no parent element in this case; the input comes from $_POST) and wrap them with <p>, leaving block-level elements alone.

Do not use regex. Why? See: RegEx match open tags except XHTML self-contained tags
Use proper DOM manipulators. See: http://php.net/manual/en/book.dom.php
EDIT:
I'm not really a fan of giving cookbook-recipes, so here's a solution for changing double <br />'s to text wrapped in <p></p>:
script.php:
<?php
// Returns true if the given tag name counts as a block-level element.
function isBlockElement($nodeName) {
    $blockElementsArray = array("pre", "div"); // edit to suit your needs
    return in_array($nodeName, $blockElementsArray);
}

// Walks up the tree and reports whether any ancestor is a block element.
function hasBlockParent($node) {
    if (!($node instanceof DOMNode)) {
        // return whatever you wish to return on error; here we throw
        throw new InvalidArgumentException("Expected a DOMNode");
    }
    if (is_null($node->parentNode))
        return false;
    // compare the parent's tag name, not the node object itself
    if (isBlockElement($node->parentNode->nodeName))
        return true;
    return hasBlockParent($node->parentNode);
}

$myDom = new DOMDocument;
$myDom->loadHTMLFile("in-file");
$myDom->normalizeDocument();
$elems = $myDom->getElementsByTagName("*");
for ($i = 0; $i < $elems->length; $i++) {
    $element = $elems->item($i);
    $first = $element->nextSibling;
    $second = is_null($first) ? null : $first->nextSibling;
    // an element followed by two <br>s, outside of any block element
    if (!is_null($second) && $first->nodeName == "br" && $second->nodeName == "br" && !hasBlockParent($element)) {
        $parent = $element->parentNode;
        $parent->removeChild($second);
        $parent->removeChild($first);
        // remember the next node on the same level, if there is one
        $nSibling = $element->nextSibling;
        // detach the old node and wrap it in a new <p>
        $saved = $parent->removeChild($element);
        $newNode = $myDom->createElement("p");
        $newNode->appendChild($saved);
        if (is_null($nSibling))
            $parent->appendChild($newNode);
        else
            $parent->insertBefore($newNode, $nSibling);
    }
}
$myDom->saveHTMLFile("out-file");
?>
This is not a full solution, only a starting point. It is the best I could write during my lunch break, and please bear in mind that the last time I coded in PHP was about two years ago (I have been doing mostly C++ since then).
So anyways, the input file:
[dare2be@schroedinger dom-php]$ cat in-file
<strong>bold <br /><br /> text</strong><br /><br /><br />
link<br /><br />
<pre>some code</pre>
I'm a single br, <br /> leave me alone.
And the output file:
[dare2be@schroedinger dom-php]$ cat out-file
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p><strong>bold <br><br> text</strong></p><br><p>link</p><pre>some code</pre>
I'm a single br, <br> leave me alone.</body></html>
The DOCTYPE and the <html><body> wrapper are a side effect of how DOMDocument parses and saves HTML. The code doesn't do the rest of the things you asked for, like changing <strong><br><br></strong> to <strong><br></strong>. Also, this whole script is a quick draft, but you'll get the idea.
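For the second half of the question (wrap inline elements and bare text in <p>, leave block elements alone), here is a minimal sketch, not a finished solution. The function name wrapInlineRuns and the default list of block tags are my own assumptions. It collapses runs of <br /> with a regex first, then groups the top-level nodes with DOMDocument. It does not start a new paragraph at a double <br />, so the result is coarser than the expected output in the question.

```php
<?php
// Hypothetical helper: wraps top-level inline nodes and text in <p>,
// and leaves the listed block-level elements where they are.
function wrapInlineRuns(string $html, array $blockTags = ['pre', 'div', 'p']): string
{
    // Step 1: collapse runs of <br>, <br/> or <br /> into a single <br />.
    $html = preg_replace('~(?:<br\s*/?>)+~i', '<br />', $html);

    $doc = new DOMDocument();
    // A wrapper <div> gives the fragment one known parent element.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_NOERROR | LIBXML_NOWARNING);
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    $paragraph = null;
    // iterator_to_array() takes a snapshot, because moving nodes changes the live list.
    foreach (iterator_to_array($wrapper->childNodes) as $node) {
        $isBlock = $node instanceof DOMElement
            && in_array($node->nodeName, $blockTags, true);
        if ($isBlock) {
            $paragraph = null; // a block element ends the current paragraph
            continue;          // and stays in place
        }
        if ($paragraph === null) {
            if ($node instanceof DOMText && trim($node->data) === '') {
                continue;      // skip whitespace between blocks
            }
            $paragraph = $doc->createElement('p');
            $wrapper->insertBefore($paragraph, $node);
        }
        $paragraph->appendChild($node); // appendChild() moves the node into the <p>
    }

    // Serialize only the wrapper's children, so no DOCTYPE/<html>/<body> is emitted.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Called on the input from the question, this should give one <p> around everything before the <pre>, the untouched <pre>, and a second <p> around the trailing text. Note that saveHTML() writes <br> rather than <br />.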

All right, I've got myself an answer, and I believe it is going to work really well.
It's from WordPress: the wpautop function.
I've tested it with the input from my question, and the output is almost the same as I expected; I just need to modify it a bit to fit my needs.
Thanks dare2be, but I'm not very familiar with DOM manipulation in PHP.

Related

Match tags inside tag

I want to modify:
<ins><br/> <b>bold</b> <br/><br/> <br/> <br/></ins> <br/> <ins> <br/> </ins>
to:
<ins><br/>NL: <b>bold</b> <br/>NL:<br/>NL: <br/>NL: <br/>NL:</ins> <br/> <ins> <br/>NL: </ins>
(inside every <ins> and </ins> tag find and change <br/> to <br/>NL:. Ignore <br/> outside <ins>. Also, <ins> might contain various other tags)
To do this, I have this piece of code:
$string= preg_replace('~(?:<ins>|(?!^)\G)(.*?)<br\/>~', '$0NL:', $string);
https://regex101.com/r/xI8mW9/4
It would work just fine, but the problem is that matching doesn't stop after the </ins> tag: it modifies every <br/> after the first <ins>. How do I replace <br/> with <br/>NL: only within <ins> and </ins> tags?
I have also tried pattern:
~(<ins>.*?)(?<my_br><br/>)(?!NL:)(.*?</ins>)~
https://regex101.com/r/xI8mW9/15
(in this case for each my_br changed as $1$2NL:$3) Problem: In case <ins><br/></ins><br/><ins><br/></ins> middle <br/> is affected.
Tried doing it with DOMDocument as suggested in comment:
$rendered_diff = "Some<ins>a<br/></ins><br/><ins>b<br/></ins>text";
$doc = new \DOMDocument();
$doc->loadHTML($rendered_diff);
$items = $doc->getElementsByTagName('ins');
for ($i = 0; $i < $items->length; $i++) {
foreach ($items->item($i)->childNodes as $node) {
if ($node->nodeName == 'br') {
$node->appendData('NL:');
}
}
}
$doc->saveHTML();
dd($rendered_diff);
Got an error:
ERROR: Call to undefined method DOMElement::appendData()
Have no idea why this approach is bad.
You can try the following code:
<?php
$rendered_diff = "<br/>Some<ins>a<br/><div>blablaa</div></ins><br/><ins>b<br/></ins>text";
$doc = new \DOMDocument();
$doc->loadHTML($rendered_diff);
$xpath = new \DOMXPath($doc);
// select every <br> that is a direct child of an <ins>
foreach ($xpath->query("//ins/br") as $br) {
    $text = $doc->createTextNode('NS:');
    // inserting before the next sibling places the text right after the <br>
    $br->parentNode->insertBefore($text, $br->nextSibling);
}
echo $doc->saveXML();
It outputs the following:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><br/>Some<ins>a<br/>NS:<div>blablaa</div></ins><br/><ins>b<br/>NS:</ins>text</body></html>
Which seems to solve the problem.
Note that I modified your initial markup a bit to test your
Ignore <br/> outside <ins>
condition. See the first <br/>.
Answering your question
Have no idea why this approach is bad.
Your approach fails because appendData() is defined on DOMCharacterData (text and comment nodes), not on DOMElement, so it cannot be called on a <br> element. Compare it with the code I placed above: doesn't the latter look cleaner? Moreover, it uses XPath, so you can write more complicated queries to match other elements, not only the <br>s inside <ins>.
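If the result has to be an HTML fragment again (no XML declaration, DOCTYPE or <html><body>), something along these lines might do. It is only a sketch: the function name markBreaksInIns and the wrapper <div> are my own assumptions. It uses //ins//br (descendant axis) so that a <br> nested in another tag inside <ins> is matched too, and it writes the NL: marker from the question.

```php
<?php
// Hypothetical helper: adds the text "NL:" after every <br> found anywhere inside an <ins>.
function markBreaksInIns(string $html): string
{
    $doc = new DOMDocument();
    // The wrapper <div> gives the fragment a single parent that is easy to find again.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_NOERROR | LIBXML_NOWARNING);
    $xpath = new DOMXPath($doc);

    // "//ins//br" matches <br> at any depth below an <ins>, not only direct children.
    foreach ($xpath->query('//ins//br') as $br) {
        // With a null reference node, insertBefore() appends at the end of the parent.
        $br->parentNode->insertBefore($doc->createTextNode('NL:'), $br->nextSibling);
    }

    // Serialize the wrapper's children only, so no <html>/<body> shell is returned.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo markBreaksInIns('Some<ins>a<br/><b>x<br/></b></ins><br/><ins>b<br/></ins>text');
```

The <br/> between the two <ins> elements is left alone. Keep in mind that saveHTML() serializes the tag as <br>, not <br/>.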

Replace new line breaks only between two tags in php

I have content managed in TinyMCE. I have a class called code and I need to replace line breaks in PHP but only between two tags.
So my html will look something like this
<p>hello see my css below</p>
<p class="code">
h1 {
font-size:10px
font-color:#FFF
}
h2 {
font-size:10px
font-color:#FFF
}
</p>
In CSS, I have a <code></code> tag that will put line numbers by each line automatically. I therefore want to replace the above to something like this:
<p>hello see my css below</p>
<p class="code">
<code>h1 {</code>
<code>font-size:10px</code>
<code>font-color:#FFF</code>
<code>}</code>
<code></code>
<code>h2 { </code>
<code>font-size:10px</code>
<code>font-color:#FFF</code>
<code>}</code>
<code></code>
</p>
I take it I want to str_replace or preg_replace every line break with </code>, a line break, and <code>. However, I can't work out how to do it only between <p class="code"> and </p>.
As always, help is appreciated.
I threw together a quick and dirty solution to your problem. It doesn't cover additional attributes in your .code paragraph (you would most likely need to change the first explode to preg_split and use some regex to get that to work).
It may not be the best way to solve your problem, but it seemed to work for your stated requirements.
function codify($string){
    $final = '';
    // split the input at every opening <p class="code"> tag
    $array = explode('<p class="code">', $string);
    foreach ($array as $k => $str) {
        if ($k == 0) {
            // everything before the first code paragraph is kept unchanged
            $final .= $str;
        } else {
            // $tmp[0] is the body of the code paragraph; the rest follows its </p>
            $tmp = explode("</p>", $str);
            $tmp2 = explode("\n", $tmp[0]);
            $final .= "<p class=\"code\">\n<code>" . implode("</code>\n<code>", $tmp2) . "</code>\n</p>" . implode("</p>", array_slice($tmp, 1));
        }
    }
    return $final;
}
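If the paragraph can carry other attributes besides class="code", a preg_replace_callback variant may be easier to adapt. This is only a sketch under a few assumptions of mine: the name codifyParagraphs is made up, the class attribute is written exactly as class="code", and a code paragraph never contains a nested </p>.

```php
<?php
// Hypothetical variant: wraps each line inside <p ... class="code" ...> in <code> tags.
function codifyParagraphs(string $html): string
{
    // Group 1: opening tag, group 2: paragraph body, group 3: closing tag.
    // The s modifier lets . match line breaks inside the paragraph.
    $pattern = '~(<p\b[^>]*\bclass="code"[^>]*>)(.*?)(</p>)~s';

    return preg_replace_callback($pattern, function (array $m): string {
        // Drop the line breaks right after <p ...> and right before </p>.
        $body = trim($m[2], "\r\n");
        // \R matches \n, \r\n and \r alike.
        $lines = preg_split('/\R/', $body);
        $wrapped = array_map(function (string $line): string {
            return '<code>' . $line . '</code>';
        }, $lines);
        return $m[1] . "\n" . implode("\n", $wrapped) . "\n" . $m[3];
    }, $html);
}

echo codifyParagraphs("<p>hello</p>\n<p class=\"code\">\nh1 {\nfont-size:10px\n}\n</p>");
```

Blank lines inside the paragraph still become an empty <code></code>, which keeps the line numbering in step with the original text.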

Strlen is not giving correct output

I have a string that I retrieve from a database, and I want to calculate its length without spaces. However, the reported length is larger than the actual count (by 21 characters). I have removed tab and newline characters as well as the PHP and HTML tags, but the result is the same. I have tried almost every function in the w3schools PHP reference without success. I have also observed that if I don't retrieve the value from the database and instead assign it like this:
$string = "my string";
I get the correct length. Please help me. Here is the code:
if ($res_tutor[0]['tutor_experience']) {
    $str = trim(strip_tags($res_tutor[0]['tutor_experience']));
    $str = $this->real_string($str);
    $space = substr_count($str, ' ');
    $experience = strlen($str) - $space;
}

// strips tabs, line breaks, NUL bytes and vertical tabs
function real_string($str)
{
    $search = array("\t", "\n", "\r\n", "\0", "\v");
    $replace = array('', '', '', '', '');
    $str = str_replace($search, $replace, $str);
    return $str;
}
And this is the string from the database but as you can see above I have removed all php and html tags using strip_tags() :
<span class=\"experience_font\">You are encouraged to write a short description of yourself, teaching experience and teaching method. You may use the guidelines below to assist you in your writing.<br />
<br />
.Years of teaching experience<br />
.Total number of students taught<br />
.Levels & subjects that you have taught<br />
.The improvements that your students have made<br />
.Other achievements/experience (Relief teaching, a tutor in a tuition centre, Dean's list, scholarship, public speaking etc.)<br />
.For Music (Gigs at Esplanade, Your performances in various locations etc.)</span><br />
</p>
and when I print it, it displays as:
<span class=\"experience_font\">You are encouraged to write a short description of yourself, teaching experience and teaching method. You may use the guidelines below to assist you in your writing.<br />
<br />
.Years of teaching experience<br />
.Total number of students taught<br />
.Levels & subjects that you have taught<br />
.The improvements that your students have made<br />
.Other achievements/experience (Relief teaching, a tutor in a tuition centre, Dean's list, scholarship, public speaking etc.)<br />
.For Music (Gigs at Esplanade, Your performances in various locations etc.)</span><br />
</p>
@Svetilo, not to be rude, I just wanted to post my findings. Your str_replace worked wonderfully, except that I was still getting incorrect values with the search strings in the order you currently have them. I found that the following worked flawlessly:
$string = str_replace(array("\t","\r\n","\n","\0","\v"," "),'', $string);
mb_strlen($string, "UTF-8");
Putting \r\n before \n matters: with \n first, str_replace strips the \n out of each \r\n and leaves a stray \r behind, so the later \r\n entry never matches.
Cheers.
Try using mb_strlen. http://php.net/manual/en/function.mb-strlen.php
It counts characters rather than bytes, so it is more precise for multibyte strings.
mb_strlen($str,"UTF-8")
Where UTF-8 is your default encoding.
To remove all whitespace, try something like this:
$string = str_replace(array("\t","\n","\r\n","\0","\v"," "),"",$string);
mb_strlen($string, "UTF-8");
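The two suggestions can be combined into one small helper. This is a sketch, not a diagnosis of where the 21 extra characters come from: the name visibleLength is made up, and the html_entity_decode step assumes the stored text may contain entities such as &amp;, which the question does not confirm.

```php
<?php
// Hypothetical helper: counts the characters of a string, ignoring tags and whitespace.
function visibleLength(string $html): int
{
    // Remove tags, then turn entities such as &amp; back into single characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // \s+ removes spaces, tabs, \r and \n in one pass, so the order of \r\n and \n no longer matters.
    // preg_replace() returns null on invalid UTF-8, hence the fallback to an empty string.
    $text = preg_replace('/\s+/u', '', $text) ?? '';
    // Count characters, not bytes.
    return mb_strlen($text, 'UTF-8');
}

echo visibleLength("<b>a b</b>\r\n\tc &amp; d"); // counts "abc&d"
```

A single whitespace pattern avoids the stray \r that the ordering of search strings can leave behind.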

Which regular expression should I use for getting image link from text?

I have following code:
<center><img src="http://trustedyouautocorrect.com/wp-content/uploads/2012/02/ixxx66134057.jpg" alt="daniel7531sarah" /></center><input id="gwProxy" type="hidden" />
<!--Session data--><input id="jsProxy" onclick="if(typeof(jsCall)=='function'){jsCall();}else{setTimeout('jsCall()',500);}" type="hidden" />
<div id="refHTML"></div>
I need to write a script that gets the link from the image's src attribute. How can I do it? I hope you can help me. Thank you.
I'm assuming you mean with JavaScript. The easiest way would be to put an id attribute on the img; then you can extract the src easily.
<script type="text/javascript">
function getSRC()
{
var imgID = document.getElementById("imgID");
alert( imgID.getAttribute('src') );
}
</script>
<img id="imgID" src="someIMG.png" /><br />
get
Using regex to extract the img src...
<?php
$str = '<center><img src="http://trustedyouautocorrect.com/wp-content/uploads/2012/02/ixxx66134057.jpg" alt="daniel7531sarah" /></center><input id="gwProxy" type="hidden" />
<!--Session data--><input id="jsProxy" onclick="if(typeof(jsCall)==\'function\'){jsCall();}else{setTimeout(\'jsCall()\',500);}" type="hidden" />
<div id="refHTML"></div>';
// regex to match all src attributes of image tags
preg_match_all("/<img[^>]+src=(['\"])(.+?)\\1/",$str,$matches);
// $matches{
// [0] -> the whole matched string
// [1] -> the quotation char (could be ' or ") captured so it can be used
// to match the closing quote (of the same type)
// [2] -> the src attr value
// }
// loop through each src attr value we captured
foreach($matches[2] as $m){
echo "This is what you are after ~~> " . $m;
}
?>
Regex means...
<img followed by
one or more not > ([^>]+) followed by
src= followed by
' or " captured for later use and escaped ((['\"]))
followed by a bunch of stuff un-greedily (.+?)
followed by same quote as captured before escaped (\\1)
This is, however, a bad way to approach the problem, and it comes with a few issues. My regex does not capture src attributes that are unquoted. There might also be unusual circumstances where it matches false positives or fails to match real URLs.
Regex is great in many circumstances as a Swiss Army knife for matching patterns, but it makes a poor parser. When you need to parse HTML, you should use the appropriate methods.
A much better (less error-prone, faster, easier to understand and maintain) way to do it...
<?php
$str = '<center><img src="http://trustedyouautocorrect.com/wp-content/uploads/2012/02/ixxx66134057.jpg" alt="daniel7531sarah" /></center><input id="gwProxy" type="hidden" />
<!--Session data--><input id="jsProxy" onclick="if(typeof(jsCall)==\'function\'){jsCall();}else{setTimeout(\'jsCall()\',500);}" type="hidden" />
<div id="refHTML"></div>';
$DOM = new DOMDocument;
$DOM->loadHTML($str);
// get img tags
$items = $DOM->getElementsByTagName('img');
// loop through found image tags
for ($i = 0; $i < $items->length; $i++) {
    $node = $items->item($i);
    // start with a fresh array so attributes of a previous tag do not leak in
    $array = array();
    if ($node->hasAttributes()) {
        // copy all attributes of the tag into the array
        foreach ($node->attributes as $attr) {
            $array[$attr->nodeName] = $attr->nodeValue;
        }
    }
    // print out just the src attribute, if the tag has one
    if (isset($array['src'])) {
        echo "This is what you want ~~> " . $array['src'];
    }
}
?>
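If only the src values are needed, the attribute loop can be skipped by asking each element for the attribute directly. A minimal sketch, with extractImageSources as a made-up name; the libxml flags are there only to silence warnings about sloppy markup.

```php
<?php
// Hypothetical helper: returns the src value of every <img> in an HTML string.
function extractImageSources(string $html): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

    $sources = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        // hasAttribute() guards against <img> tags without a src.
        if ($img->hasAttribute('src')) {
            $sources[] = $img->getAttribute('src');
        }
    }
    return $sources;
}

print_r(extractImageSources('<center><img src="http://example.com/a.jpg" alt="x" /></center><img alt="no src" />'));
```

Because the parser deals with the quoting, single-quoted and unquoted src attributes are handled the same way, which is the case the regex above misses.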

Use XPath with PHP's SimpleXML to find nodes containing a String

I try to use SimpleXML in combination with XPath to find nodes which contain a certain string.
<?php
$xhtml = <<<EOC
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>Test</title>
</head>
<body>
<p>Find me!</p>
<p>
<br />
Find me!
<br />
</p>
</body>
</html>
EOC;
$xml = simplexml_load_string($xhtml);
$xml->registerXPathNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
$nodes = $xml->xpath("//*[contains(text(), 'Find me')]");
echo count($nodes);
Expected output: 2
Actual output: 1
When I change the xhtml of the second paragraph to
<p>
Find me!
<br />
</p>
then it works as expected. What does my XPath expression have to look like to match all nodes containing 'Find me', no matter where they are?
Using PHP's DOM-XML is an option, but not desired.
Thanks in advance!
It depends on what you want to do. You could select all the <p/> elements that contain "Find me" in any of their descendants with
//xhtml:p[contains(., 'Find me')]
This may return duplicates, and if you don't specify the kind of node (for example, by using * instead of xhtml:p), it will return <body/> and <html/> as well.
Or perhaps you want any node which has a child (not a descendant) text node that contains "Find me"
//*[text()[contains(., 'Find me')]]
This one will not return <html/> or <body/>.
I forgot to mention that . represents the whole text content of a node. text() is used to retrieve [a nodeset of] text nodes. The problem with your expression contains(text(), 'Find me') is that contains() only works on strings, not nodesets and therefore it converts text() to the value of the first node, which is why removing the first <br/> makes it work.
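To see the difference between these expressions side by side, here is a small sketch with SimpleXML. It uses a trimmed-down copy of the document from the question (no DOCTYPE or <head>), and the variable names are mine.

```php
<?php
$xhtml = <<<EOC
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<p>Find me!</p>
<p>
<br />
Find me!
<br />
</p>
</body>
</html>
EOC;

$xml = simplexml_load_string($xhtml);
// The prefix is needed only for expressions that name an element, such as xhtml:p.
$xml->registerXPathNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

// contains(text(), ...) looks only at the FIRST text node of each element.
$firstTextOnly = $xml->xpath("//*[contains(text(), 'Find me')]");

// The inner predicate tests every child text node separately.
$anyTextChild = $xml->xpath("//*[text()[contains(., 'Find me')]]");

// "." is the whole text content of the element, descendants included.
$paragraphs = $xml->xpath("//xhtml:p[contains(., 'Find me')]");

echo count($firstTextOnly), ' ', count($anyTextChild), ' ', count($paragraphs);
```

The first expression misses the second paragraph because that paragraph's first text node is just the line break before the <br />.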
Err, umm? But thanks @Jordy for the quick answer.
First, that's DOM-XML, which is not desired, since everything else in my script is done with SimpleXML.
Second, why do you translate to uppercase and then search for the unchanged string 'Find me'? Searching for 'FIND ME' would actually give a result.
But you pointed me towards the right direction:
$nodes = $xml->xpath("//text()[contains(., 'Find me')]");
does the trick!
I was looking for a way to find whether a node with exact value "Find Me" exists and this seemed to work.
$node = $xml->xpath("//text()[.='Find Me']");
$doc = new DOMDocument();
$doc->loadHTML($xhtml);
$xPath = new DOMXpath($doc);
$xPathQuery = "//text()[contains(translate(.,'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'), 'Find me')]";
$elements = $xPath->query($xPathQuery);
if ($elements->length > 0) {
    foreach ($elements as $element) {
        print "Found: " . $element->nodeValue . "<br />";
    }
}
