Is there a token for capturing line breaks in multiline regex? - php

I've run into this problem several times before when trying to do some HTML scraping with PHP and the preg_* functions.
Most of the time I have to capture structures like this:
<!-- comment -->
<tag1>lorem ipsum</tag1>
<p>just more text with several html tags in it, sometimes CDATA encapsulated…</p>
<!-- /comment -->
In particular I want something like this:
/<tag1>(.*?)<\/tag1>\n\n<p>(.*?)<\/p>/mi
but the \n\n doesn't look like it would work.
Is there a general line-break switch?

I think you could replace the \n\n with (\r?\n){2}; this way you match the CRLF pair instead of just the LF character.
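A minimal sketch of that suggestion, assuming the two elements are separated by exactly two line breaks. The helper name captureBlocks and the sample input are made up for illustration. Two details differ from the pattern in the question: the s flag is used instead of m, because m only changes how ^ and $ behave while s is what lets . cross line breaks, and the line-break group is non-capturing so the two captures keep their numbers. PCRE's \R, which matches any line-break sequence, could probably stand in for \r?\n.

```php
<?php
// Hypothetical helper: returns [tag1 content, p content], or null when nothing matches.
function captureBlocks(string $html): ?array
{
    // (?:\r?\n){2} accepts two line breaks, each either LF or CRLF.
    // The s flag lets . match line breaks inside the captured text.
    $pattern = '/<tag1>(.*?)<\/tag1>(?:\r?\n){2}<p>(.*?)<\/p>/si';

    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }
    return [$m[1], $m[2]];
}

// Sample input with Windows line endings (invented for this example).
$result = captureBlocks("<tag1>lorem ipsum</tag1>\r\n\r\n<p>more\r\ntext</p>");
```

The same call works on input that uses plain LF line endings, which is the point of making the \r optional.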

Are you sure you want to parse HTML using regexps? HTML isn't regular and there are too many corner cases.
I would investigate some form of HTML parser (perhaps this one?), and then identify the pattern you're interested in via the returned HTML data structure.

Or you could look at the Dom Extension to php. It has a function to load html from a string or a file. You can then use the php dom methods to traverse the dom and find the data you are interested in.
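A rough sketch of the DOM route, under a few assumptions: the placeholder tag1 from the question is replaced by h2 so the parser sees a standard element, and the helper name headingAndParagraph is hypothetical. It uses DOMDocument::loadHTML and DOMXPath from PHP's DOM extension.

```php
<?php
// Hypothetical helper: returns the text of the first <h2> and of the <p> that follows it.
function headingAndParagraph(string $html): ?array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    $heading = $xpath->query('//h2')->item(0);
    if ($heading === null) {
        return null;
    }

    // First <p> sibling after the heading; whitespace between them does not matter.
    $paragraph = $xpath->query('following-sibling::p[1]', $heading)->item(0);
    if ($paragraph === null) {
        return null;
    }

    return [trim($heading->textContent), trim($paragraph->textContent)];
}

$result = headingAndParagraph("<h2>lorem ipsum</h2>\n\n<p>just <b>more</b> text</p>");
```

Because the lookup works on the parsed tree, the number and style of line breaks between the two elements no longer matters.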

Related

PHP Regex : Ignore closing tag of HTML if

I can't seem to get this to work and I was hoping for some help.
I'm trying to capture the contents of a specific div (please save the DOM talk, for this specific purpose it doesn't really come into play.)
The problem is, I can't seem to get it to work if there is another div with attributes before it on the same line. I tried specifying only match if there's no > between <div and class="myClass", but I think I'm doing it wrong.
I'm still pretty mystified by regex.
/<div(?!>).*?class="myClass".*?>(.*?)<\/div>/mi
(semi) Working example: http://regex101.com/r/cW0lW6
Try
/<div(?=\s)(?:(?!>).)+?class="myClass".*?>(.*?)<\/div>/si
You can't parse [X]HTML with regex, because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML.
See: RegEx match open tags except XHTML self-contained tags
I suggest using QueryPath for parsing XML and HTML in PHP. It's basically much the same syntax as jQuery, only it's on the server side.
You can use this (simple way):
~<div[^>]+?class="myClass"[^>]*>(.*?)</div>~si
or this (more efficient way if you have a lot of attributes):
~<div(?>[^>c]++|\Bc|c(?!lass=))+class="myClass"[^>]*+>(.*?)</div>~si
Note that these patterns don't work if your div tag contains another div tag.
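A short sketch of the simple pattern in use, including the nesting limitation just mentioned. The helper name firstMyClassDiv and the sample markup are invented for illustration.

```php
<?php
// Hypothetical helper: returns the content of the first div with class="myClass", or null.
function firstMyClassDiv(string $html): ?string
{
    // [^>]+? keeps the attribute scan inside a single opening tag,
    // so an earlier <div ...> on the same line cannot be skipped over.
    $pattern = '~<div[^>]+?class="myClass"[^>]*>(.*?)</div>~si';

    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

// An unrelated div with attributes precedes the target on the same line.
$sameLine = firstMyClassDiv('<div id="a">first</div><div id="b" class="myClass">hello</div>');

// Nested div: the lazy (.*?) stops at the first </div>, so the capture is cut short.
$nested = firstMyClassDiv('<div class="myClass">a<div>b</div>c</div>');
```

In the nested case the capture ends at the inner closing tag, which is why these patterns are only suitable when the target div contains no other div.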

str_replace inline script code from html in php not working

I have an HTML page stored in a MySQL database. I get the HTML from the database and try to replace some of the inline JavaScript code in the HTML content. I tried using str_replace() but it does not replace the inline JavaScript code. I can replace other HTML content like divs, but not inline JavaScript code.
How can I find and replace the inline JavaScript code?
PHP should be seeing the entire HTML page as a big string, so in theory, it should be able to alter JS and HTML alike. Is it possible the string still has slashes, and your str_replace can't find the search criteria due to the slashes?
Try printing the entirety of the string to the screen to make sure, and if it does still have slashes, use a stripslashes($string) call to get rid of them.
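A small sketch of that failure mode, assuming the page was stored with escaping backslashes still in it. The markup and the track() call are made up, and addslashes() is used only to simulate what the database might contain.

```php
<?php
// The page as it was meant to be stored (sample markup, invented for this example).
$original = '<a href="#" onclick="track(\'home\')">Home</a>';

// Simulate a stored copy that still carries escaping backslashes.
$stored = addslashes($original);

$search  = "track('home')";
$replace = "track('start')";

// The quotes in $stored are preceded by backslashes, so the search string is not found.
$untouched = str_replace($search, $replace, $stored);

// After stripslashes() the same replacement succeeds.
$fixed = str_replace($search, $replace, stripslashes($stored));
```

Plain div markup without quotes would survive the escaping unchanged, which may explain why replacing divs worked while the JavaScript did not.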
You probably want to use a DOM parser to handle your webpage as a DOM structure, not a serialised string of HTML (where things like string replacement and regular expressions can be troublesome).

Skip replace if inside specific tag

I have the following regex to detect text inside <?php (including the tag)
'/<\?php(.*)\?>/isU'
I also have a function called compress. It compresses HTML content by removing new lines, comments, etc. I don't want the function to touch text inside the <?php tag. How can I do that using the regex above?
Thanks!
You would be MUCH safer walking the DOM and operating on the text nodes. Using regex on HTML/XML is typically unsafe (there are numerous SO arguments/discussions on the issue). The essence of the problem is that regex (esp. the Javascript implementation) lacks a means to accurately establish the context and nesting of a pattern.
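For completeness, one regex-based way to leave the <?php blocks alone, which is not the DOM approach recommended above: split the source on the blocks, keep them as captured delimiters, and compress only the pieces in between. This is a sketch under assumptions; compress() here is a hypothetical stand-in for the asker's function, and the pattern uses a lazy .*? in place of the U flag.

```php
<?php
// Hypothetical stand-in for the asker's compress() function.
function compress(string $html): string
{
    $html = preg_replace('/<!--.*?-->/s', '', $html); // drop HTML comments
    return preg_replace('/\s+/', ' ', $html);          // collapse runs of whitespace
}

function compressOutsidePhp(string $source): string
{
    // With one capturing group and PREG_SPLIT_DELIM_CAPTURE, the result alternates:
    // even indexes hold plain markup, odd indexes hold the <?php ... ?> blocks.
    $parts = preg_split('/(<\?php.*?\?>)/is', $source, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = compress($part);
        }
    }
    return implode('', $parts);
}

$result = compressOutsidePhp("<p>a\n\n  b</p><?php echo 1;\n\necho 2; ?><!-- c -->\n<p>d</p>");
```

A ?> that appears inside a PHP string literal would end the block early, so this remains a heuristic rather than a parser.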

Recursive Contents of HTML tag using regex

I'm writing an application for my client that uses a WYSIWYG to allow employees to modify a letter template with certain variables that get parsed out to be information for the customer that the letter is written for.
The WYSIWYG generates HTML that I save to a SQL server database. I then use a PHP class to generate a PDF document with the template text.
Here's my issue. The PDF generation class can translate b,u,i HTML tags. That's it. This is mostly okay, except I need blockquote to be translated too. I figure the best solution would be to write a regex statement that is to take the contents of each blockquote HTML block, and replace each line within the block with five spaces. The trick is that some blockquotes might contain nested blockquotes (double indenting, and what not)
But unfortunately I have never been too well versed with regex, and I spent the last 1.5 hours experimenting with different patterns and got nothing working.
Here are the gotchas:
String may or may not contain a blockquote block
String could contain multiple blockquotes
String could contain potentially any level of nesting of blockquotes blocks
We can rely on the HTML being properly formed
A sample input string would look something like this:
Dear Charlie,<br><br>We are contacting you because blah blah blah blah.<br><br><br>To login, please use this information:<blockquote>Username: someUsername<br>Password: somePassword</blockquote><br><br>Thank you.
To simplify the solution, I need to replace each HTML break inside each blockquote with 5 spaces and then the \n line break character.
You might want to check PHP Simple HTML DOM Parser out. You can use it to parse the input to an HTML DOM tree and use that.
~<blockquote>((?:[^<]++|<(?!/?blockquote>)|(?R))*+)</blockquote>~
You will need to run this regex recursively using preg_replace_callback:
const REGEX_BLOCKQUOTE = '~<blockquote>((?:[^<]++|<(?!/?blockquote>)|(?R))*+)</blockquote>~';
// doIndent() is not defined here: it stands for whatever helper indents one level of text.
function blockquoteCallback($matches) {
    // Handle nested blockquotes first, then indent the current level.
    return doIndent(preg_replace_callback(REGEX_BLOCKQUOTE, __FUNCTION__, $matches[1]));
}
$output = preg_replace_callback(REGEX_BLOCKQUOTE, 'blockquoteCallback', $input);
My regex assumes that there won't be any attributes on the blockquote or anywhere else.
(PS: I'll leave the "Use a DOM parser" comment to someone else.)
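A self-contained sketch of the recursive approach, under stated assumptions. The pattern is a variant that explicitly consumes any < that does not start a blockquote tag, so other tags such as <br> may appear inside a blockquote. doIndent() is a hypothetical helper whose output format (five spaces per line, lines split on <br> or \n) is a guess at what the asker wants. The function names are invented for this example.

```php
<?php
// Variant pattern: text, any "<" that does not open or close a blockquote, or a nested blockquote.
const BLOCKQUOTE_PATTERN = '~<blockquote>((?:[^<]++|<(?!/?blockquote>)|(?R))*+)</blockquote>~';

// Hypothetical helper: split on <br> or \n and prefix every line with five spaces.
function doIndent(string $text): string
{
    $lines = preg_split('~<br\s*/?>|\n~i', $text);
    $indented = array_map(function ($line) {
        return '     ' . $line;
    }, $lines);
    return implode("\n", $indented);
}

// Callback: resolve nested blockquotes first, then indent the current level.
function indentBlockquoteMatch(array $matches): string
{
    $inner = preg_replace_callback(BLOCKQUOTE_PATTERN, __FUNCTION__, $matches[1]);
    return doIndent($inner);
}

function indentBlockquotes(string $input): string
{
    return preg_replace_callback(BLOCKQUOTE_PATTERN, 'indentBlockquoteMatch', $input);
}

$flat   = indentBlockquotes('To login:<blockquote>Username: u<br>Password: p</blockquote>Thanks');
$nested = indentBlockquotes('A<blockquote>x<br><blockquote>y</blockquote></blockquote>B');
```

Because the inner level is indented before the outer one, each extra level of nesting adds another five spaces to its lines.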
Regular expressions have a theory behind them, and even though modern regular expression engines can provide a 'Type 2.5'-level language, some things are still not doable. In your particular case, nesting is not easily achievable.
A simple way to explain this is to say that regular expressions can't keep a count,
i.e. they can't count the nesting level.
What you need is a limited CFG (the paren-counting type).
You need to somehow keep a count, maybe with a stack or a tree.
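The counting idea can be sketched with a plain depth counter: split the string into text and tags, then track how deep the current position is nested. This assumes well-formed input and attribute-free blockquote tags, as the question allows. The function name indentByDepth and the exact output layout (a line break plus five spaces per level) are assumptions.

```php
<?php
function indentByDepth(string $html): string
{
    // Split into text and the tags of interest; captured delimiters stay in the token list.
    $tokens = preg_split('~(</?blockquote>|<br\s*/?>)~i', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $depth = 0; // current blockquote nesting level
    $out = '';

    foreach ($tokens as $token) {
        $tag = strtolower($token);

        if ($tag === '<blockquote>') {
            $depth++;
            $out .= "\n" . str_repeat('     ', $depth);
        } elseif ($tag === '</blockquote>') {
            $depth = max(0, $depth - 1);
            $out .= "\n" . str_repeat('     ', $depth);
        } elseif ($depth > 0 && preg_match('~^<br\s*/?>$~i', $token) === 1) {
            // A break inside a blockquote becomes a new indented line.
            $out .= "\n" . str_repeat('     ', $depth);
        } else {
            // Plain text, or a <br> outside any blockquote, is copied unchanged.
            $out .= $token;
        }
    }
    return $out;
}

$result = indentByDepth('T:<blockquote>u<br>p</blockquote>End');
```

The counter plays the role of the stack mentioned above; a real stack would only be needed if different tag types had to be matched against each other.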

regex for php to find all self-closing tags

I've got a system that uses a DomDocumentFragment which is created based on markup from a database or another area of the system (i.e. other XHTML code).
One such tag that may be included is:
<div class="clear"></div>
Before the string is added to the DomDocumentFragment, the content is correct - the class is closing correctly.
However, the DomDocumentFragment transforms this into:
<div class="clear"/>
This does not display correctly in browsers due to the incorrect closing of the tag.
So my thought is to post-process the XML string that the DomDocument returns (which includes the incorrect div structure, as shown above) and transform self-closing tags back to their correct structure, i.e. turn <div class="clear"/> back to <div class="clear"></div>.
But I'm having trouble with the pattern for preg_match to find these tags - I've seen some patterns that return all tags (i.e. find all tags), but not just those that are self closing.
I've tried something along the lines of this, but my head gets a little confused with regex (and I start over-complicating things)
/<div(["\d\w\s])\/>/
The aim is for a pattern to match <div ..../>, where the "...." could be any valid XHTML attributes.
Any suggestions or pointers to put me back on track?
Limit the problem domain: you need to change <div class="clear"/> to <div class="clear"></div>, so search for the former and replace it with the latter using a straightforward find-and-replace operation. It should be faster and it will definitely be safer.
Whatever you do, do not try to parse HTML with a regular expression (which you're trying to do by building a regex that can detect a <div> with arbitrary attributes.)
Putting
<div></div>
into a DomDocumentFragment doesn't actually change it into
<div/>
it changes it into
A-DOM-Element-Node-with-name-"div"-and-no-content.
It's only when the DomDocumentFragment is serialized that either <div></div> or <div/> is created. In other words, the problem lies not with the DomDocumentFragment, but with the serialization process that you are using.
PHP is not my language, so I can't be much more help, but I would be looking for an HTML-compatible serializer for your DomDocumentFragment, rather than try to patch the output after serialization.
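In PHP, the serializer can be chosen at output time. The sketch below assumes the markup ends up as the document element of a DOMDocument and compares three calls from the DOM extension: saveXML() on the node, saveXML() with the LIBXML_NOEMPTYTAG option, and saveHTML() on the node. The variable names are invented for illustration.

```php
<?php
$doc = new DOMDocument();

// Build the element through a fragment, as in the question.
$fragment = $doc->createDocumentFragment();
$fragment->appendXML('<div class="clear"></div>');
$doc->appendChild($fragment);

$div = $doc->documentElement;

// XML serialization: an element without children is written in self-closing form.
$asXml = $doc->saveXML($div);

// Same serializer, but asked to expand empty elements into open/close pairs.
$asExpanded = $doc->saveXML($div, LIBXML_NOEMPTYTAG);

// HTML serialization of the same node.
$asHtml = $doc->saveHTML($div);
```

Note that LIBXML_NOEMPTYTAG expands every empty element, including ones like br, so the HTML serializer is probably the better fit when the output is meant for a browser.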
