Regular expression to match block of HTML - php

First I'll show you a sample of the code I'm working with:
<div class="entry">
<p>Any HTML content could go here!</p>
</div>
</div><!--/post -->
Normally I'd use a regex rule such as the following to look for a prefix and a suffix and grab everything in between:
(?<=<div class="entry">).*(?=</div><!--/post -->)
However, that doesn't appear to be working: it seems to pull in the whitespace between the following parts instead of the HTML content itself:
<div class="entry">
<p>
Any help/suggestions would be much appreciated as I've been bashing my head with this one for a good few hours now.
Many thanks in advance.

Don't use Regex to parse HTML. You need an Xml Parser or similar.
Search Stackoverflow for the best one, like so: Robust and Mature HTML Parser for PHP
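As a rough illustration of the parser route, here is a minimal sketch using PHP's bundled DOM extension. The outer "post" wrapper in the sample input and the variable names are assumptions made for the example, not part of the question. (As a side note, in PCRE the . metacharacter does not match newlines unless the s modifier is set, which is a likely reason the original pattern stopped short on multi-line input.)

```php
<?php
// Sample input modelled on the question; the outer "post" wrapper is assumed.
$html = <<<HTML
<div class="post">
<div class="entry">
<p>Any HTML content could go here!</p>
</div>
</div><!--/post -->
HTML;

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$entry = $xpath->query('//div[@class="entry"]')->item(0);

// Concatenate the serialized children to get the element's inner HTML.
$inner = '';
if ($entry !== null) {
    foreach ($entry->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}
echo trim($inner), "\n";
```

The XPath test @class="entry" only matches when the class attribute is exactly "entry"; an element with several classes would need a different expression.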

You can also consider php strip_tags().

Related

Make HTML readable again

I have some HTML code in a file created by an online JS editor
<h1>Title</h1><p>Some text</p><p>Some text</p>
that is not easily readable offline.
I'd like to split it like this with php, that is more readable
<h1>Title</h1>
<p>Some text</p>
<p>Some text</p>
I could do a string replace that adds a newline after each closing tag, but then every save would add yet another newline.
Do you have any suggestion?
Thank you.
P.S. the online JS editor is Summernote, maybe there is a config to work around this?
What you are looking for is called "unminifying" HTML. There are some online tools that can do the job, such as:
unminify.com
textfixer.com
Following the suggestions of Mohamed, I found Tidy.
Tidy comes with both a shell command (http://tidy.sourceforge.net/) and a PHP library (http://php.net/manual/en/book.tidy.php). Both work very well and provide several tools to maintain HTML code.
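For the narrower problem raised in the question (newlines piling up on every save), a plain replace can be made repeatable by also consuming any whitespace that already follows a closing tag. This is only a sketch of that idea, not what Tidy does; the function name splitClosingTags and the list of tags are assumptions chosen for the example.

```php
<?php
// Hypothetical helper: puts exactly one newline after each listed closing tag.
function splitClosingTags(string $html): string
{
    // \s* swallows whitespace already present after the closing tag, so a
    // second run replaces "tag + newline" with "tag + newline" again.
    // The lookahead (?=<) leaves the very last closing tag untouched.
    return preg_replace('~(</(?:h[1-6]|p|div|ul|ol|li)>)\s*(?=<)~i', '$1' . "\n", $html);
}

$once  = splitClosingTags('<h1>Title</h1><p>Some text</p><p>Some text</p>');
$twice = splitClosingTags($once); // simulates saving a second time

echo $once, "\n";
var_dump($once === $twice);
```

Because the replacement is idempotent, saving the file several times leaves the output unchanged. It still treats HTML as text, so it is only reasonable for simple, flat markup like the editor output shown above.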

PHP Regex : Ignore closing tag of HTML if

I can't seem to get this to work and I was hoping for some help.
I'm trying to capture the contents of a specific div (please save the DOM talk, for this specific purpose it doesn't really come into play.)
The problem is, I can't seem to get it to work if there is another div with attributes before it on the same line. I tried specifying only match if there's no > between <div and class="myClass", but I think I'm doing it wrong.
I'm still pretty mystified by regex.
/<div(?!>).*?class="myClass".*?>(.*?)<\/div>/mi
(semi) Working example: http://regex101.com/r/cW0lW6
Try
/<div(?=\s)(?:(?!>).)+?class="myClass".*?>(.*?)<\/div>/si
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML.
See: RegEx match open tags except XHTML self-contained tags
I suggest using QueryPath for parsing XML and HTML in PHP. It's basically much the same syntax as jQuery, only it's on the server side.
You can use this (simple way):
~<div[^>]+?class="myClass"[^>]*>(.*?)</div>~si
or this (more efficient way if you have a lot of attributes):
~<div(?>[^>c]++|\Bc|c(?!lass=))+class="myClass"[^>]*+>(.*?)</div>~si
Note that these patterns don't work if your div tag contains another div tag.
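To see how the simple pattern behaves in the situation the question describes (another div with attributes earlier on the same line), here is a small sketch. The sample line and the ids in it are made up for the example.

```php
<?php
// Hypothetical input: a div with attributes precedes the target on the same line.
$line = '<div id="a">first</div><div id="b" class="myClass">wanted</div>';

// [^>]+? cannot cross the ">" that ends the first opening tag, so the match
// has to start at the second <div ...>.
$pattern = '~<div[^>]+?class="myClass"[^>]*>(.*?)</div>~si';

$content = null;
if (preg_match($pattern, $line, $m)) {
    $content = $m[1];
}
echo $content, "\n";
```

As the note above says, the capture stops at the first closing div tag, so a div nested inside the target would cut the result short.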

Replace the content inside a DIV

I have a div called
<div id="form">Content</div>
and I want to replace the content of the div with new content using preg_replace.
What regex should I use?
You shouldn't be using a regex at all. HTML can come in many forms, and you would need to take all of them in account. What if the id/class doesn't come in the place you expect? The regex would have to be really complex to get you reasonable results.
Instead, you should use a DOM parser - or a really cool tool I recently stumbled across, phpQuery. With it, you can access your document in PHP almost exactly as you would with jQuery.
This will work in your case:
$html = '<div id="content">Content</div>';
$html = preg_replace('/(<\s*div[^>]*>)[^<]*(<\s*\/div\s*>)/', '$1New Content$2', $html);
echo $html; // <div id="content">New Content</div>
However note that since HTML is not a regular language it is impossible to handle all cases. The simple regex I provided will produce bad output in the following example:
<div class=">">Content</div>
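For comparison, a DOM-based sketch copes with the awkward attribute value above because the parser, not a pattern, decides where the tag ends. It uses PHP's bundled DOM extension rather than phpQuery; the id "form" is taken from the question, and the variable names are assumptions.

```php
<?php
// The failing case from above, with the id from the question added.
$html = '<div id="form" class=">">Content</div>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$div = $xpath->query('//div[@id="form"]')->item(0);

$result = '';
if ($div !== null) {
    // Remove the existing children, then add the new text.
    while ($div->firstChild !== null) {
        $div->removeChild($div->firstChild);
    }
    $div->appendChild($doc->createTextNode('New Content'));
    $result = $doc->saveHTML($div);
}
echo $result, "\n";
```

The serializer may write the attribute value in escaped form, so the output is equivalent to the input markup but not necessarily byte-for-byte identical apart from the new text.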

Select the first paragraph tag not contained in within another tag using RegEx (Perl-style)

I have this block of html:
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
I'm trying to select the first, non-nested paragraph in that block. I'm using PHP's (perl style) preg_match to find it, but can't seem to figure out how to ignore the p tag contained within the div.
This is what I have so far, but it selects the contents of the first paragraph contained above.
/<p>(.+?)<\/p>/is
Thanks!
EDIT
Unfortunately, I don't have the luxury of a DOM Parser.
I completely appreciate the suggestions to not use RegEx to parse HTML, but that's not really helping my particular use case. I have a very controlled case where an internal application generated structured text. I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text and HTML was the simplest case I could think of to explain. My actual case looks something a little more like this (But a lot more data and minified):
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]
I have to reformat a certain column of certain rows across a ton of rows similar to these. An answer to my first question would help with the actual project.
Your regex won't work. Even if you had only non-nested paragraphs, your capturing parentheses would match First, non-nested ... Last paragraph.
Try:
<([^>]+)>([^<]*<(?!/?\1)[^<]*)*<\1>
and grab \2 if \1 is p.
But an HTML parser would do a better job of that imho.
How about something like this?
<p>([^<>]+)<\/p>(?=(<[^\/]|$))
Does a look-ahead to make sure it is not inside a closing tag; but can be at the end of a string. There is probably a better way to look for what is in the paragraph tags but you need to avoid being too greedy (a .+? will not suffice).
Use a three-step process. First, pray that everything is well formed. Second, remove everything that is nested.
s{<div>.*?</div>}{}gs; # HTML example (/s lets . match the newlines inside the div)
s/#.*?#//g; # 2nd example
Then get your result. Everything that is left is now not nested.
($result) = m{<p>(.*?)</p>}; # HTML example (list context returns the capture)
($result) = m{\[(.*?)\]}; # 2nd example
(this is Perl. Don't know how different it would look in PHP).
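In PHP the same remove-then-match idea might look roughly like the sketch below. The sample data is copied from the question; the variable names are assumptions, and the approach still relies on the input being well formed and not nested more than one level deep.

```php
<?php
$html = <<<'HTML'
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
HTML;

// Remove the nested part first. The s modifier lets . match the newlines
// inside the div.
$flat = preg_replace('~<div>.*?</div>~s', '', $html);
// Whatever is left is not nested, so the first match is the one wanted.
$firstParagraph = preg_match('~<p>(.*?)</p>~s', $flat, $m) ? $m[1] : null;
echo $firstParagraph, "\n";

$data = <<<'TXT'
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
TXT;

// Second example: drop the #...# rows, then take the first [...] row.
$rows = preg_replace('~#.*?#~', '', $data);
$firstItem = preg_match('~\[(.*?)\]~', $rows, $m) ? $m[1] : null;
echo $firstItem, "\n";
```

Nowdoc strings (<<<'HTML') are used so that nothing in the sample data is interpolated.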
"You shouldn't use regex to parse HTML."
It is what everybody says but nobody really offers an example of how to actually do it, they just preach it. Well, thanks to some motivation from Levi Morrison I decided to read into DomDocument and figure out how to do it.
To everybody that says "Oh, it is too hard to learn the parser, I'll just use regex." Well, I've never done anything with DomDocument or XPath before and this took me 10 minutes. Go read the docs on DomDocument and parse HTML the way you're supposed to.
$myHtml = <<<MARKUP
<html>
<head>
<title>something</title></head>
<body>
<div>
<p>not valid</p>
</div>
<p>is valid</p>
<p>is not valid</p>
<p>is not valid either</p>
<div>
<p>definitely not valid</p>
</div>
</body>
</html>
MARKUP;
$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
$nodeList = $DomXPath->query('body/p');
$yourNode = $DomDocument->saveHtml($nodeList->item(0));
var_dump($yourNode);
// output '<p>is valid</p>'
You might want to have a look at this post about parsing HTML with Regex.
Because HTML is not a regular language (and Regular Expressions are), you can't parse out arbitrary chunks of HTML using Regex. Use an HTML parser, it'll get the job done considerably more smoothly than trying to hack together some regex.

regex for php to find all self-closing tags

I've got a system that uses a DomDocumentFragment which is created based on markup from a database or another area of the system (i.e. other XHTML code).
One such tag that may be included is:
<div class="clear"></div>
Before the string is added to the DomDocumentFragment, the content is correct - the class is closing correctly.
However, the DomDocumentFragment transforms this into:
<div class="clear"/>
This does not display correctly in browsers due to the incorrect closing of the tag.
So my thought is to post-process the XML string that the DomDocument returns (which includes the incorrect div structure shown above) and transform self-closing tags back to their correct structure, i.e. turn <div class="clear"/> back into <div class="clear"></div>.
But I'm having trouble with the pattern for preg_match to find these tags - I've seen some patterns that return all tags (i.e. find all tags), but not just those that are self closing.
I've tried something along the lines of this, but my head gets a little confused with regex (and I start over-complicating things)
/<div(["\d\w\s])\/>/
The aim is for a pattern to match <div ..../>, where the "...." could be any valid XHTML attributes.
Any suggestions or pointers to put me back on track?
Limit the problem domain -- you need to change <div class="clear"/> to <div class="clear"></div> ... so search for the former, and replace it with the latter using a straightforward find and replace operation. It should be faster and it will definitely be safer.
Whatever you do, do not try to parse HTML with a regular expression (which you're trying to do by building a regex that can detect a <div> with arbitrary attributes.)
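The literal find-and-replace suggested above needs no regex at all. A minimal sketch, with a made-up input string standing in for the serialized output:

```php
<?php
// Hypothetical serialized output containing the collapsed tag.
$xml = '<p>a</p><div class="clear"/><p>b</p>';

// Plain string replacement: only this exact tag is touched.
$fixed = str_replace('<div class="clear"/>', '<div class="clear"></div>', $xml);
echo $fixed, "\n";
```

This only covers the one known tag; any other empty element would need its own entry.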
Putting
<div></div>
into a DomDocumentFragment doesn't actually change it into
<div/>
it changes it into
A-DOM-Element-Node-with-name-"div"-and-no-content.
It's only when the DomDocumentFragment is serialized that either <div></div> or <div/> is created. In other words, the problem lies not with the DomDocumentFragment, but with the serialization process that you are using.
PHP is not my language, so I can't be much more help, but I would be looking for an HTML-compatible serializer for your DomDocumentFragment, rather than try to patch the output after serialization.
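In PHP, the choice of serializer can be sketched as follows. DOMDocument::saveXML() follows XML rules and collapses empty elements, while passing LIBXML_NOEMPTYTAG or using saveHTML() keeps the explicit closing tag. The body wrapper element and the variable names are assumptions made for the example.

```php
<?php
$doc = new DOMDocument();
// Hypothetical container element so the fragment has somewhere to go.
$body = $doc->appendChild($doc->createElement('body'));

$fragment = $doc->createDocumentFragment();
$fragment->appendXML('<div class="clear"></div>');
$body->appendChild($fragment);

$div = $body->firstChild;

$asXml        = $doc->saveXML($div);                    // XML rules: empty element is collapsed
$asXmlNoEmpty = $doc->saveXML($div, LIBXML_NOEMPTYTAG); // XML rules, but empty tags are expanded
$asHtml       = $doc->saveHTML($div);                   // HTML rules: div keeps its closing tag

echo $asXml, "\n", $asXmlNoEmpty, "\n", $asHtml, "\n";
```

Note that LIBXML_NOEMPTYTAG expands every empty element, including ones such as br that are meant to stay empty, so saveHTML() is probably the better fit when the output is sent to a browser as HTML.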
