regex: find new line character in string that isn't in textarea - php

heya, so I'm looking for a regex that would allow me to basically replace a newline character with whatever (eg. 'xxx'), but only if the newline character isn't within textarea tags.
For example, the following:
<strong>abcd
efg</strong>
<textarea>curious
george
</textarea>
<span>happy</span>
Would become:
<strong>abcdxxxefg</strong>xxx<textarea>curious
george
</textarea>xxx<span>happy</span>
Anyone have any idea on where I should start? I'm kinda clueless here :(
Thanks for any help possible.

I've got it, but you're not gonna like it. ;)
$result = preg_replace(
'~[\r\n]++(?=(?>[^<]++|<(?!/?textarea\b))*+(?!</textarea\b))~',
'XYZ', $source);
After matching a line break, the lookahead scans ahead, consuming any character that's not a left angle bracket, or any left angle bracket that's not the beginning of a <textarea> or </textarea> tag. When it runs out of those, the next thing it sees has to be one of those tags or the end of the string. If it's a </textarea> tag, that means the line break was found inside a textarea element, so the match fails, and that line break is not replaced.
I've included an expanded version below, and you can see it in action on ideone. You can adapt it to handle those other tags too, if you really want to. But it sounds to me like what you need is an HTML minimizer (or minifier); there are plenty of those available.
$re=<<<EOT
~
[\r\n]++
(?=
(?>
[^<]++ # not left angle brackets, or
|
<(?!/?textarea\b) # bracket if not for TA tag (opening or closing)
)*+
(?!</textarea\b) # first TA tag found must be opening, not closing
)
~x
EOT;
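For a quick check, here is a minimal sketch that runs the one-line pattern from above against the sample in the question. The variable names and the 'xxx' replacement are just the question's example, nothing more.

```php
<?php
// Sample input from the question, written with explicit \n line breaks.
$source = "<strong>abcd\nefg</strong>\n<textarea>curious\ngeorge\n</textarea>\n<span>happy</span>";

// Same pattern as in the answer: a line break is matched only if the next
// textarea tag after it (if any) is an opening tag, not a closing one.
$pattern = '~[\r\n]++(?=(?>[^<]++|<(?!/?textarea\b))*+(?!</textarea\b))~';

$result = preg_replace($pattern, 'xxx', $source);
echo $result, PHP_EOL;
```

The two line breaks inside the textarea element should survive, while the three outside it are replaced.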

If you still want to go with regexp, you may try this - escape newlines inside special tags, delete newlines and then unescape:
<?php // requires PHP 5.3+ (anonymous functions)
// Match each textarea, pre or code element. The .*? is lazy so that
// two separate elements are not swallowed as one big match.
$str = preg_replace_callback('#<(?P<tag>textarea|pre|code)\b[^>]*>.*?</(?P=tag)>#si',
function ($matches) {
// replace every newline inside the element with an escape sequence
return str_replace("\n", "%ESCAPED_NEWLINE%", $matches[0]);
}, $str);
// now the remaining newlines can be removed safely,
// and the escape sequences turned back into newlines
$str = str_replace(array("\n", "%ESCAPED_NEWLINE%"), array('', "\n"), $str);

Why use a regex for this? Why not use a very simple state machine to do it? Work through the string looking for opening <textarea> tags, and when inside them look for the closing tag instead. When you come across a newline, convert it or not based on whether you're currently inside a <textarea> or not.
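A rough sketch of that state machine, assuming textarea elements are never nested and that a case-insensitive search for "<textarea" and "</textarea" is good enough to find the tags. The function name replaceNewlinesOutsideTextarea is made up for this example.

```php
<?php
// Hypothetical helper: replace line breaks everywhere except inside
// <textarea> elements, using a two-state scan instead of one big regex.
function replaceNewlinesOutsideTextarea($html, $replacement)
{
    $out = '';
    $pos = 0;
    $len = strlen($html);
    $inside = false; // state: are we currently inside a textarea?

    while ($pos < $len) {
        // Search for the tag that would flip the current state.
        $needle = $inside ? '</textarea' : '<textarea';
        $next = stripos($html, $needle, $pos);
        if ($next === false) {
            $next = $len; // no more tags: the rest is one chunk
        }

        $chunk = substr($html, $pos, $next - $pos);
        if (!$inside) {
            // Only text outside a textarea gets its line breaks replaced.
            $chunk = str_replace(array("\r\n", "\r", "\n"), $replacement, $chunk);
        }
        $out .= $chunk;

        if ($next < $len) {
            // Copy the tag prefix as written and switch state.
            $out .= substr($html, $next, strlen($needle));
            $next += strlen($needle);
            $inside = !$inside;
        }
        $pos = $next;
    }
    return $out;
}

$source = "<strong>abcd\nefg</strong>\n<textarea>curious\ngeorge\n</textarea>\n<span>happy</span>";
echo replaceNewlinesOutsideTextarea($source, 'xxx'), PHP_EOL;
```

Line breaks inside the opening tag itself (between attributes) are treated as "inside" and left alone, which is a simplification of this sketch rather than a requirement of the approach.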

What you are doing is parsing HTML. You cannot parse HTML with a regular expression.

Related

PHP RegExp: Capture all HTML closing tags followed by new line character

I want to capture any HTML closing tags that are followed by a newline character and replace them by only the HTML tag.
For example I want to turn this:
<ul>\n
<li>element</li>\n
</ul>\n\n
<br/>\n\n
Some text\n
into this:
<ul>
<li>element</li>
</ul>\n
<br/>\n
Some text\n
The problem is that I cannot capture \n characters with regex:
preg_match_all('/(<\/[a-zA-Z]*>|<[a-zA-Z]*\/>)\n/s', $in, $matches);
As soon as I place the \n somewhere in my pattern, the matches array returns empty values.
The interesting thing is that if I try to match the \n character on its own, it finds all of them:
preg_match_all('/\n/s', $in, $matches);
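Try:
preg_match_all('/(<\/[a-zA-Z]*>|<[a-zA-Z]*\/>)\\n/s', $in, $matches);
You have to escape the "\" character. (Note, though, that inside a single-quoted PHP string '\n' and '\\n' are the same two characters, so this change by itself leaves the pattern unchanged.)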
You could use something like the following:
(<[^>]+>)$\R{2}
# capture anything between a pair of < and > at the end of the line
# followed by two newline characters
You'll need to use the multiline mode, see a demo on regex101.com.
In PHP this would be:
$regex = '~(<[^>]+>)$\R{2}~m';
$string = preg_replace($regex, "$1", $your_string_here);
Generally, the DOMDocument parser lets you choose whether to preserve or discard whitespace, so you might be better off using that instead.
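If the goal is to drop exactly one line break after every tag (which is how I read the example in the question, so treat that as an assumption), a variant without the $ anchor and the {2} quantifier might look like this. The sample string spells out the question's input with real newlines.

```php
<?php
// The question's sample, with actual line breaks.
$in = "<ul>\n<li>element</li>\n</ul>\n\n<br/>\n\nSome text\n";

// Capture a tag, then consume a single line break (\R matches \n, \r or \r\n).
// Replacing with $1 keeps the tag and drops that one line break.
$out = preg_replace('~(<[^>]+>)\R~', '$1', $in);

var_dump($out);
```

Single newlines after a tag disappear, and double newlines shrink to one, because only the first line break after each tag is consumed.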

Explode and/or regex text to HTML link in PHP

I have a database of texts that contains this kind of syntax in the middle of English sentences that I need to turn into HTML links using PHP
"text1(text1)":http://www.example.com/mypage
Notes:
text1 is always identical to the text in parentheses
The whole string always has the quotation marks, parentheses and colon, so the syntax is the same for each.
Sometimes there is a space at the end of the string, but other times there is a question mark, comma or other punctuation mark.
I need to turn these into basic links, like
<a href="http://www.example.com/mypage">text1</a>
How do I do this? Do I need explode or regex or both?
"(.*?)\(\1\)":(.*\/[a-zA-Z0-9]+)(?=\?|\,|\.|$)
You can use this.
See Demo.
http://regex101.com/r/zF6xM2/2
You can use this replacement:
$pattern = '~"([^("]+)\(\1\)":(http://\S+)(?=[\s\pP]|\z)~';
$replacement = '<a href="\2">\1</a>';
$result = preg_replace($pattern, $replacement, $text);
pattern details:
([^("]+) this part captures text1 in group 1. Using a negated character class (one that excludes the double quote and the opening parenthesis) has several advantages:
it allows a greedy quantifier, which is faster
since the class excludes the opening parenthesis and is immediately followed by a parenthesis in the pattern, if another part of the text has content between double quotes but without parentheses inside, the regex engine will not go backward to test other possibilities; it will skip this substring without backtracking. (This is because the PCRE regex engine automatically converts [^a]+a into [^a]++a before processing the string)
\S+ means one or more characters that are not whitespace
(?=[\s\pP]|\z) is a lookahead assertion that checks that the URL is followed by whitespace, a punctuation character (\pP) or the end of the string.
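One thing to watch: \S+ is greedy, so a comma or question mark directly after the URL can end up inside the link. Here is a sketch of a variant that requires the URL to end in a character other than common sentence punctuation. The function name textileLinksToHtml and the exact punctuation set are my own assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: turn "text(text)":http://... into an HTML link.
function textileLinksToHtml($text)
{
    // Group 1: link text (repeated in parentheses via \1).
    // Group 2: URL, which must not end in whitespace or . , ; : ! ?
    $pattern = '~"([^("]+)\(\1\)":(https?://\S*[^\s.,;:!?])~';

    return preg_replace_callback($pattern, function ($m) {
        // Escape both parts before putting them into markup.
        return '<a href="' . htmlspecialchars($m[2]) . '">'
             . htmlspecialchars($m[1]) . '</a>';
    }, $text);
}

echo textileLinksToHtml('See "text1(text1)":http://www.example.com/mypage, ok'), PHP_EOL;
```

The trailing comma stays outside the link because the engine backtracks until the last URL character is not in the excluded set.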
You can use this regex:
"(.*?)\(.*?:(.*)
Working demo
An appropriate Regular Expression could be:
$str = '"text1(text1)":http://www.example.com/mypage';
preg_match('#^"([^\(]+)' .
'\(([^\)]+)\)[^"]*":(.+)#', $str, $m);
print '<a href="'.$m[3].'">'.$m[2].'</a>' . PHP_EOL;

Strip leading and trailing whitespaces within a string matched using regex in php

I have a html string
$html = "<p>I'm a para</p><b> I'm bold </b>";
Now I can replace the bold tags with wiki markup(*) using php regex as:
$html = preg_replace('/<b>(.*?)<\/b>/', '*\1*', $html);
Apart from this I also want the leading and trailing whitespaces in the bold tag removed in the same php's preg_replace function.
Can anyone help me on how to do this?
Thanks in advance,
Varun
Try using:
$html = preg_replace('~<b>\s*(.*?)\s*</b>\s*~i', '*\1*', $html);
The \s* between each tag and the captured text strips the spaces you want trimmed. The i flag is just for case insensitivity, and I used ~ as the delimiter so you don't have to escape forward slashes.
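A small runnable check of the idea, using the string from the question. This sketch leaves out the trailing \s* after the closing tag and adds the s flag so bold text spanning several lines is also matched; both are my own choices here.

```php
<?php
$html = "<p>I'm a para</p><b> I'm bold </b>";

// \s* on both sides of the lazy group keeps the padding out of the capture.
$wiki = preg_replace('~<b>\s*(.*?)\s*</b>~is', '*\1*', $html);

echo $wiki, PHP_EOL;
```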
Apart from this I also want the leading and trailing whitespaces in the bold tag removed
Easy enough; this will do it just fine (\s* rather than \s+, so bold text without padding is converted too).
$html = preg_replace('/<b>\s*(.*?)\s*<\/b>/', '*\1*', $html);
See the demo
Use \s symbol for that (\s* means that there could be 0 or more occurrences):
$html = preg_replace('/\<b\>\s*(.*?)\s*\<\/b\>/i', '*\1*', $html);
I also suggest using the i modifier, since HTML tags are case insensitive. And, finally, the symbols < and > should be escaped for more safety (they are part of some regex constructs. Your regex will work without escaping them, but it's a good habit to escape them so you can be sure to avoid errors with that).
(edit): it seems I've misunderstood 'trailing/leading' sense.

Need regex to add spaces in long words but ignore HTML tags and attributes

I need to add spaces to long words within a product description at a user-supplied position (we'll say 25 for example) to allow proper wrapping. I know CSS tricks can be used, but that's not what I'm looking for.
So far I can do this using the syntax below, but the problem I'm having is that it splits things it shouldn't, such as URLs in HTML tag attributes.
$string = 'longwordlongwordlongword someanchortext and title here';
$spacer = 20;
$newtext = preg_replace('/([^\s]{' . $spacer . '})(?=[^\s])/m', '$1 ', $string);
The result is this....
longwordlongwordlong word somean chortext and title here
I need to somehow tell the regex to split everything EXCEPT HTML tags and attributes.
If you're sure that you'll never have angle brackets (<>) inside attribute values or comments of your HTML file, then you could try this:
$result = preg_replace(
'/( # Match and capture...
[^\s<>] # anything except whitespace and angle brackets
{20} # 20 times.
) # End of capturing group.
(?! # Assert that it\'s impossible to match the following:
[^<>]* # any number of characters except angle brackets
> # followed by a closing bracket.
) # End of lookahead assertion.
/x',
'\1 ', $subject);
The idea here is to match a 20-character non-space-string only if the next angle bracket in the text isn't a closing bracket (which would mean that that string is inside a tag). Obviously this breaks if angle brackets could occur elsewhere.
You might also want to use \w instead of [^\s<>], so you really only match alphanumeric strings (if that's what you want).
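Putting that together with the user-supplied width from the question, a minimal sketch might look like this. The function name breakLongWords is invented for this example, and the same caveat applies: it assumes no stray angle brackets inside attribute values or comments.

```php
<?php
// Hypothetical helper: insert a space after every $width non-space
// characters, but never inside an HTML tag.
function breakLongWords($html, $width)
{
    $pattern = '/([^\s<>]{' . (int) $width . '})' // a run of $width characters
             . '(?=[^\s<>])'                      // only if the word continues
             . '(?![^<>]*>)/';                    // and we are not inside a tag
    return preg_replace($pattern, '$1 ', $html);
}

$string = 'longwordlongwordlongword <a href="http://example.com/averyveryverylongpath">text</a>';
echo breakLongWords($string, 20), PHP_EOL;
```

The extra (?=[^\s<>]) lookahead is taken from the pattern in the question, so no space is added when a word is exactly 20 characters long.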

Problem with PHP strip tag

I am trying to get my site feed working.
I need to select some content and display it in my feed. After selecting it, I strip the tags and then display it.
The problem is this:
The data still displays as if the tags were there (although no HTML tag is visible). E.g. after stripping, my source has:
Hello (just illustrating)
----There is a gap in between, as if an HTML character still exists, but I can't see any when I view my source-----
Hi
How can I fix this? Thanks
EDIT:
To make it clearer, after stripping I still get text like this:
This is my first line
This is my second line with a gap in between the first line and second line as if there is a paragraph tag
UPDATE
I am using this:
$body=substr(strip_tags(preg_replace('/\n{2,}/',"\n",$row["post_content"])),0,150);
When I echo $body, it still keeps the new lines.
You may have a \n that was at the end of the paragraphs, after the closing tags you stripped.
preg_replace('/[\p{Z}\s]{2,}/s',' ',$string);
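This collapses every run of two or more whitespace characters (spaces, tabs, newlines) into a single space. A lone newline is left as it is; use + instead of {2,} if those should become spaces too.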
\s Matches any white-space character. Equivalent to the Unicode character categories [\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \s is equivalent to [ \f\n\r\t\v].
strip_tags will literally strip the tags, leaving any other whitespace behind.
You could get rid of extra newlines and whitespace with regular expressions, but depending on your content, you might mangle it.
Collapse repeated newlines:
$string = preg_replace('/\n{2,}/',"\n",$string);
Remove extra spaces:
$string = preg_replace('/ {2,}/',' ',$string);
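For a feed excerpt specifically, it may be simpler to strip the tags first and then collapse all whitespace in one go. A sketch, assuming single-byte text (use mb_substr otherwise); the function name feedExcerpt is made up for illustration, and the 150 comes from the question.

```php
<?php
// Hypothetical helper: plain-text excerpt with no leftover line breaks.
function feedExcerpt($html, $length = 150)
{
    $text = strip_tags($html);                 // tags go, whitespace stays
    $text = preg_replace('/\s+/', ' ', $text); // any whitespace run -> one space
    return substr(trim($text), 0, $length);    // trim the ends, then cut
}

echo feedExcerpt("<p>This is my first line</p>\r\n\r\n<p>This is my second line</p>\n"), PHP_EOL;
```

Doing the whitespace cleanup after strip_tags matters, because the gaps only show up once the tags between the line breaks are gone.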
I was experiencing something annoyingly similar. Solved it with trim:
$body=strip_tags(trim($row["post_content"]));
