preg_match('/<div class="prices">/s<h3>(.+?)<\/h3>/is', $response, $matches);
There's whitespace and potentially new lines between the prices div and the h3 tag. How do I use /s to match that?
You don't use /s, you use \s*.
It's a backslash (\), not a slash (/).
The * afterwards means that it matches zero or more whitespace characters.
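Applied to the pattern from the question, the fix might look like the sketch below. The sample markup in $response is made up for illustration; only the change from /s to \s* comes from the answer above.

```php
<?php
// Hypothetical sample input; the real $response would come from the fetched page.
$response = "<div class=\"prices\">\n    <h3>19.99</h3>\n</div>";

// \s* matches zero or more whitespace characters (spaces, tabs, newlines)
// between the opening div and the h3 tag.
$found = preg_match('/<div class="prices">\s*<h3>(.+?)<\/h3>/is', $response, $matches);

if ($found === 1) {
    $price = $matches[1]; // text inside the <h3>
}
```

The trailing s modifier is kept here so that the dot in (.+?) can also cross line breaks inside the h3.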
Also, please consider using an HTML parser if you are trying to find HTML tags. A proper HTML parser will be able to correctly handle whitespace, HTML comments and other features of HTML that your regular expression cannot handle.
I have a string that contains normal characters, whitespace and newline characters between <div> and </div>.
This regular expression doesn't work: /<div>(.*)<\/div>/. That is because .* doesn't match newline characters. How can I do this?
You need to use the DOTALL modifier (/s).
'/<div>(.*)<\/div>/s'
This might not give you exactly what you want because you are greedy matching. You might instead try a non-greedy match:
'/<div>(.*?)<\/div>/s'
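To see the difference between the two patterns, here is a minimal sketch. The two-div input string is an assumption chosen to make the greedy match overshoot.

```php
<?php
// Hypothetical input with two divs, each spanning two lines.
$html = "<div>first\nblock</div>\n<div>second\nblock</div>";

// Greedy: .* runs to the LAST </div>, so a single match swallows both divs.
preg_match('/<div>(.*)<\/div>/s', $html, $greedy);

// Non-greedy: .*? stops at the FIRST </div>.
preg_match('/<div>(.*?)<\/div>/s', $html, $lazy);

// $greedy[1] holds everything between the first <div> and the final </div>;
// $lazy[1] holds only the content of the first div.
```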
You could also solve this by matching everything except '<' if there aren't other tags:
'/<div>([^<]*)<\/div>/'
Another observation is that you don't need to use / as your regular expression delimiter. Using another character means that you don't have to escape the / in </div>, which improves readability. This applies to all the above regular expressions. Here's how it would look if you use '#' instead of '/':
'#<div>([^<]*)</div>#'
However all these solutions can fail due to nested divs, extra whitespace, HTML comments and various other things. HTML is too complicated to parse with Regex, so you should consider using an HTML parser instead.
To match all characters, you can use this trick:
%\<div\>([\s\S]*)\</div\>%
You can also use the (?s) mode modifier. In PHP it goes inside the delimiters. For example,
'/(?s)<div>(.*?)<\/div>/'
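The sketch below compares three spellings of "dot matches newline" suggested in this thread. The sample string and the array keys are made up for illustration. Note that with PHP's preg functions the inline (?s) has to sit inside the pattern delimiters.

```php
<?php
// Hypothetical input: one div whose content spans two lines.
$html = "<div>line one\nline two</div>";

$patterns = [
    'modifier' => '/<div>(.*?)<\/div>/s',      // trailing s modifier
    'inline'   => '/(?s)<div>(.*?)<\/div>/',   // inline (?s), inside the delimiters
    'class'    => '/<div>([\s\S]*?)<\/div>/',  // [\s\S] matches any character
];

$results = [];
foreach ($patterns as $name => $pattern) {
    $matched = preg_match($pattern, $html, $m);
    $results[$name] = $matched === 1 ? $m[1] : null;
}
// Each entry in $results should hold the same two-line capture.
```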
There shouldn't be any problem with just doing:
(.|\n)
This matches either any character except newline or a newline, so every character. It solved it for me, at least.
An option would be:
'/<div>((?:\n|.)*?)<\/div>/i'
The inner group matches either a newline or whatever the dot matches, and the quantifier repeats it, so the capture can span several lines.
There is usually a flag in the regular expression compiler to tell it that dot should match newline characters.
I have a number of items in a table, formatted like this
<td class="product highlighted">
Item Name
</td>
and I am using the following PHP code
$regex_pattern = "/<td class=\"product highlighted\">(.*)<\/td>/";
preg_match_all($regex_pattern,$buffer,$matches);
print_r($matches);
I am not getting any output, yet I can see the items in the HTML.
Is there something wrong with my regexp?
Apart from your using regex to parse HTML, yes, there is something wrong: The dot doesn't match newlines.
So you need to use
$regex_pattern = "/<td class=\"product highlighted\">(.*?)<\/td>/s";
The /s modifier allows the dot to match any character, including newlines. Note the reluctant quantifier .*? to avoid matching more than one tag at once.
In order to match your example, you will need to add the DOTALL flag, s, so that the . will match newlines.
Try the following.
$regex_pattern = "/<td class=\"product highlighted\">(.*?)<\/td>/s";
Also note that I changed the capture to non-greedy, (.*?). It's best to do so when matching open-ended text.
It's worth noting that regular expressions are not the right tool for HTML parsing; you should look into DOMDocument. However, for such a simple match you can get away with regular expressions, provided your HTML is well-formed.
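As a rough idea of what the DOMDocument route could look like, here is a minimal sketch. The table markup in $buffer is invented for illustration, and the XPath expression assumes the class attribute is exactly "product highlighted", as in the question.

```php
<?php
// Hypothetical sample; $buffer would normally hold the fetched page.
$buffer = '<table><tr>
<td class="product highlighted">
Item Name
</td>
<td class="product">Other</td>
</tr></table>';

// Collect parser warnings internally instead of printing them,
// since real-world markup is rarely perfect.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($buffer);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$items = [];
// Exact match on the class attribute, mirroring the regex in the question.
foreach ($xpath->query('//td[@class="product highlighted"]') as $td) {
    $items[] = trim($td->textContent); // trim drops the surrounding newlines
}
```

Unlike the regex, this does not care whether the cell content sits on one line or several.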
I am trying to get my site feed working.
I need to select some content and display it in my feed. After selecting, I strip the tags and then display it.
The problem is this:
The data still displays as if the tags were there (though no HTML tag is visible). For example, after stripping, my source has:
Hello (just illustrating)
----There is a gap in between, as if an HTML element still exists, but I can't see one when I view the source-----
Hi
How can I fix this? Thanks.
EDIT:
To make it clearer, after stripping I still get text like this:
This is my first line
This is my second line with a gap in between the first line and second line as if there is a paragraph tag
UPDATE
I am using this:
$body=substr(strip_tags(preg_replace('/\n{2,}/',"\n",$row["post_content"])),0,150);
When I echo $body, it still keeps the newlines.
You may have a \n that was at the end of the paragraphs, after the closing tags you stripped.
preg_replace('/[\p{Z}\s]{2,}/s',' ',$string);
will replace every run of two or more whitespace characters (spaces, tabs, newlines) with a single space. A single whitespace character on its own is left as it is.
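\s Matches any white-space character. Equivalent to the Unicode character categories [\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \s is equivalent to [ \f\n\r\t\v]. (This wording comes from the .NET regular expression documentation; PCRE, which PHP uses, has its own definition of \s.)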
strip_tags will literally strip the tags, leaving any other whitespace behind.
You could get rid of extra newlines and whitespace with regular expressions, but depending on your content, you might mangle it.
Collapse consecutive newlines into one:
$string = preg_replace('/\n{2,}/',"\n",$string);
Collapse extra spaces:
$string = preg_replace('/ {2,}/',' ',$string);
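Put together, the two replacements could be applied after strip_tags rather than before it, roughly as in this sketch. The helper name collapse_whitespace and the sample content are hypothetical, and the line-ending normalization is an added assumption: content with Windows line endings (\r\n) would otherwise never match \n{2,}.

```php
<?php
// Hypothetical helper; not part of PHP or of the original answers.
function collapse_whitespace(string $text): string
{
    $text = str_replace(["\r\n", "\r"], "\n", $text); // normalize line endings first
    $text = preg_replace('/\n{2,}/', "\n", $text);    // collapse runs of newlines
    $text = preg_replace('/ {2,}/', ' ', $text);      // collapse runs of spaces
    return trim($text);                               // drop leading/trailing whitespace
}

// Hypothetical post content standing in for $row["post_content"].
$content = "<p>This is my first line</p>\r\n\r\n<p>This is my  second line</p>\n";

// Strip the tags first, then clean up the whitespace they leave behind.
$body = substr(collapse_whitespace(strip_tags($content)), 0, 150);
```

If the feed should have no line breaks at all, the newline replacement string could be a single space instead of "\n".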
I was experiencing something annoyingly similar. I solved it with trim:
$body=strip_tags(trim($row["post_content"]));
heya, so I'm looking for a regex that would allow me to basically replace a newline character with whatever (eg. 'xxx'), but only if the newline character isn't within textarea tags.
For example, the following:
<strong>abcd
efg</strong>
<textarea>curious
george
</textarea>
<span>happy</span>
Would become:
<strong>abcdxxxefg</strong>xxx<textarea>curious
george
</textarea>xxx<span>happy</span>
Anyone have any idea on where I should start? I'm kinda clueless here :(
Thanks for any help possible.
I've got it, but you're not gonna like it. ;)
$result = preg_replace(
'~[\r\n]++(?=(?>[^<]++|<(?!/?textarea\b))*+(?!</textarea\b))~',
'XYZ', $source);
After matching a line break, the lookahead scans ahead, consuming any character that's not a left angle bracket, or any left angle bracket that's not the beginning of a <textarea> or </textarea> tag. When it runs out of those, the next thing it sees has to be one of those tags or the end of the string. If it's a </textarea> tag, that means the line break was found inside a textarea element, so the match fails, and that line break is not replaced.
I've included an expanded version below, and you can see it in action on ideone. You can adapt it to handle those other tags too, if you really want to. But it sounds to me like what you need is an HTML minimizer (or minifier); there are plenty of those available.
$re=<<<EOT
~
  [\r\n]++
  (?=
    (?>
      [^<]++              # not left angle brackets, or
    |
      <(?!/?textarea\b)   # bracket if not for TA tag (opening or closing)
    )*+
    (?!</textarea\b)      # first TA tag found must be opening, not closing
  )
~x
EOT;
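For a quick check, the one-line pattern from this answer can be run against the sample from the question. In this sketch the $source string is typed in by hand from that sample, and 'xxx' is used as the replacement to match the question's expected output.

```php
<?php
// Sample markup from the question, written as one PHP string.
$source = "<strong>abcd\nefg</strong>\n<textarea>curious\ngeorge\n</textarea>\n<span>happy</span>";

// Replace each run of line breaks unless the next textarea tag ahead
// of it is a closing one (meaning we are inside a textarea).
$result = preg_replace(
    '~[\r\n]++(?=(?>[^<]++|<(?!/?textarea\b))*+(?!</textarea\b))~',
    'xxx',
    $source
);
// The line breaks inside the textarea are left alone; the other three are replaced.
```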
If you still want to go with a regex, you may try this: escape newlines inside special tags, delete the remaining newlines, and then unescape:
<?php //5.3 syntax here
//Regex matches everything within textarea, pre or code tags
//(.*? is non-greedy so that each element is matched separately)
$str = preg_replace_callback('#<(?P<tag>textarea|pre|code)[^>]*?>.*?</(?P=tag)>#sim',
    function ($matches) {
        //and then replaces every newline by some escape sequence
        return str_replace("\n", "%ESCAPED_NEWLINE%", $matches[0]);
    }, $str);
//after all we can safely remove newlines
//and then replace escape sequences by newlines
$str = str_replace(array("\n", "%ESCAPED_NEWLINE%"), array('', "\n"), $str);
Why use a regex for this? Why not use a very simple state machine to do it? Work through the string looking for opening <textarea> tags, and when inside them look for the closing tag instead. When you come across a newline, convert it or not based on whether you're currently inside a <textarea> or not.
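The idea described above could be sketched as a two-state scanner like the one below. The function name is hypothetical, and the sketch assumes textarea tags are well-formed and not nested; it is an illustration of the approach, not a full HTML tokenizer.

```php
<?php
// Hypothetical helper: replaces line breaks everywhere except inside <textarea>.
function replace_newlines_outside_textarea(string $html, string $replacement): string
{
    $out = '';
    $inside = false; // current state: are we inside a textarea?
    $pos = 0;
    $len = strlen($html);

    while ($pos < $len) {
        // Look for whichever tag would flip the current state.
        $needle = $inside ? '</textarea' : '<textarea';
        $next = stripos($html, $needle, $pos);
        if ($next === false) {
            $next = $len; // no more state changes; process the rest
        }

        $chunk = substr($html, $pos, $next - $pos);
        // Only rewrite line breaks while outside a textarea.
        $out .= $inside ? $chunk : preg_replace('/[\r\n]+/', $replacement, $chunk);

        if ($next < $len) {
            // Copy the start of the tag through unchanged and flip the state.
            $out .= substr($html, $next, strlen($needle));
            $next += strlen($needle);
            $inside = !$inside;
        }
        $pos = $next;
    }
    return $out;
}

$source = "<strong>abcd\nefg</strong>\n<textarea>curious\ngeorge\n</textarea>\n<span>happy</span>";
$result = replace_newlines_outside_textarea($source, 'xxx');
```

Supporting pre or code as well would mean tracking which tag opened the protected region, which is straightforward to add to the state variable.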
What you are doing is parsing HTML. You cannot parse HTML with a regular expression.