I'm trying to get the divs from many of my website files using regexes, but I'm failing.
This is the thing I'm trying to do http://regexr.com/38to9
I need the regex to match divs with class data, and also divs with classes plainText and extData, including everything inside them. There are no extra divs inside the ones I listed.
I've been sitting on this for around 2 hours now and I can't figure it out.
It's the following for anyone who doesn't want to go visit that cool site
<div class="data">
Something
</div>
<div class="data">
Text in here
<a class="data" href="links"><img src="whatever.png"></a>
</div>
With regex
\s*<div class="(data|plainText|extData)">\s*(...)\s*<\/div>
The first div is highlighted, the second one isn't. Nor do I get any results with preg_match_all with php. Does it have anything to do with the fact I'm using tabs in the second div and I'm not using them in the first one?
(Wrote it quickly on the website to see if it works)
Have you tried using a parser instead?
$dom = new DOMDocument();
$dom->loadHTML($input);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
    if (preg_match("/\b(data|plainText|extData)\b/", $div->getAttribute("class"))) {
        // do something to the $div
        $div->setAttribute("title", "I matched!");
    }
}
$out = $dom->saveHTML();
// Because DOMDocument wraps our HTML in a minimal document, we need to extract
// the body contents. In this case, regex is okay because we have a known structure.
// The s modifier lets . match the newlines inside the serialized document.
$out = preg_replace("~.*?<body>(.*)</body>.*~s", "$1", $out);
You have a great non-regex answer, but you should also know that you were really close...
With all disclaimers about parsing html with regex, adding the DOTALL modifier (?s) to your original expression matches what you want:
(?s)<div class="(data|plainText|extData)">\s*(.*?)\s*<\/div>
See demo.
How does this work?
The DOTALL modifier (?s) tells the engine that a dot can match a newline character. This is important for your (.*?) because the content of the divs can span several lines.
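To see the effect, here is a small sketch that runs the pattern against the question's sample with and without (?s). The variable names, the ~ delimiter, and the way the sample is written as a PHP string are illustrative choices, not part of the original answer.

```php
<?php
// Sample input from the question; the second div spans several lines and uses tabs.
$input = "<div class=\"data\">\nSomething\n</div>\n"
       . "<div class=\"data\">\n\tText in here\n"
       . "\t<a class=\"data\" href=\"links\"><img src=\"whatever.png\"></a>\n</div>";

$withoutDotall = '~<div class="(data|plainText|extData)">\s*(.*?)\s*</div>~';
$withDotall    = '~(?s)<div class="(data|plainText|extData)">\s*(.*?)\s*</div>~';

// preg_match_all() returns the number of full matches it found.
$countWithout = preg_match_all($withoutDotall, $input, $plain);
$countWith    = preg_match_all($withDotall, $input, $dotall);

echo "Without (?s): $countWithout match(es)\n";
echo "With (?s): $countWith match(es)\n";

// Group 2 holds the inner content of each matched div.
echo $dotall[2][1], "\n";
```

Without the modifier only the first div matches, because the dot cannot cross the line break between "Text in here" and the link.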
I've written a regular expression to get the first two paragraphs from a database clob which stores its content in HTML formatting.
I've checked with these online RegEx builder/checkers here and here and they both seem to be doing what I want them to do (I've altered the RegEx slightly since these checkers to handle the new line formatting, which I found afterwards).
However when I go to use this in my PHP it doesn't seem to want to get just the group I'm after, and instead matches everything.
Here is my preg_replace line:
$description = preg_replace('/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/', "$2", $description);
And here is my testing content in the format of the content I am getting
<p>
Paragraph 1</p>
<p>
Paragraph 2</p>
<p>
Paragraph 3</p>
I've had a look at this SO Post which didn't help.
Any Ideas?
EDIT
As pointed out in one of the comments you cannot Regex HTML in PHP (Don't know why, I'm not really bothered by that).
Now I'm opening the option for getting it in PL/SQL as well.
select
DBMS_LOB.substr(description, 32000, 1) /* How do I make this into a regular expression? */
from
blog_posts
Your input contains newlines, therefore you have to add the s modifier:
/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/s
Otherwise, .* breaks on newlines and the regex doesn't match.
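This also explains why the call seemed to "match everything": when the pattern does not match at all, preg_replace returns the subject unchanged. A minimal sketch, assuming the test content from the question is held in a string with plain \n line breaks (the variable names are made up for the example):

```php
<?php
$description = "<p>\nParagraph 1</p>\n<p>\nParagraph 2</p>\n<p>\nParagraph 3</p>\n";
$pattern = '/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/';

// No s modifier: . stops at the first newline, nothing matches,
// and preg_replace() hands back the input untouched.
$withoutS = preg_replace($pattern, '$2', $description);

// With the s modifier the whole string matches and only group 2 is kept.
$withS = preg_replace($pattern . 's', '$2', $description);

var_dump($withoutS === $description);
echo $withS;
```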
You could take a look at the PHP Simple DOM Parser. Going by their manual, you could do something like so:
$html = str_get_html('your html string');
foreach($html->find('p') as $element) //This should get all the paragraph elements in your string.
echo $element->plaintext. '<br>';
I'm using regex to match specific divs in a page and replace them with a custom formatted one. I can't use DOMDocument because the pages we process are often malformed, and after running them through DOMDocument the pages are reformatted and don't display the same.
I'm currently using the following which works perfectly:
preg_match('#(\<div id=[\'|"]'.$key.'[\'|"](.*?)\>)(.*?)\<\/div\>#s', $contents, $response);
To match div tags such as:
<div id="test"></div>
<div id="test" style="width: 300px; height: 200px;"></div>
etc...
The problem I'm encountering is tags where the id is after the style or class, example:
<div class="test" id="test"></div>
If I run the following, the regex seems to become greedy and matches a ton of html before the div tag, so I'm not sure how to fix this:
preg_match('#(\<div(.*?)id=[\'|"]'.$key.'[\'|"](.*?)\>)(.*?)\<\/div\>#s', $contents, $response);
Does anyone have any ideas?
You can use the Ungreedy modifier (U), and also: do not use .*, but [^>]* (which means anything that is not >, since > is the end of the tag and you are searching within the tag). You don't need to escape / when it is not your delimiter (you are using # as the delimiter).
preg_match('#(<div[^>]*id=[\'|"]'.$key.'[\'|"][^>]*>)(.*)</div>#isU', $contents, $response);
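A small sketch of the same idea on made-up input. The sample $contents, the use of preg_quote(), the lazy .*? in place of the U modifier, and the ['"] class without the | are my assumptions, not part of the answer above. Like the original, it stops at the first </div>, so nested divs are still not handled.

```php
<?php
$key = 'test';
$contents = '<p>intro</p><div class="test" id="test">Hello</div><div id="other">Bye</div>';

// [^>]* cannot run past the end of the opening tag, so the match has to start
// at the div that actually carries the id. preg_quote() escapes $key for the pattern.
$pattern = '#(<div\b[^>]*\bid=[\'"]' . preg_quote($key, '#') . '[\'"][^>]*>)(.*?)</div>#is';

if (preg_match($pattern, $contents, $response)) {
    echo $response[1], "\n"; // the opening tag
    echo $response[2], "\n"; // the inner HTML up to the first </div>
}
```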
Don't use regex for HTML parsing, there are DOM parsers out there, like PHP DOM: http://www.php.net/manual/en/book.dom.php
I have a div called
<div id="form">Content</div>
and I want to replace the content of the div with new content using preg_replace.
What regex should be used?
You shouldn't be using a regex at all. HTML can come in many forms, and you would need to take all of them into account. What if the id/class doesn't come in the place you expect? The regex would have to be really complex to get you reasonable results.
Instead, you should use a DOM parser - or a really cool tool I recently stumbled across, phpQuery. With it, you can access your document in PHP almost exactly as you would with jQuery.
This will work in your case:
$html = '<div id="content">Content</div>';
$html = preg_replace('/(<\s*div[^>]*>)[^<]*(<\s*\/div\s*>)/', '$1New Content$2', $html);
echo $html; // <div id="content">New Content</div>
However note that since HTML is not a regular language it is impossible to handle all cases. The simple regex I provided will produce bad output in the following example:
<div class=">">Content</div>
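A short sketch that runs the same pattern on both inputs to show the difference (the variable names are only for illustration):

```php
<?php
$pattern = '/(<\s*div[^>]*>)[^<]*(<\s*\/div\s*>)/';

// Well-behaved input: the opening tag ends at the first >.
$good = preg_replace($pattern, '$1New Content$2', '<div id="content">Content</div>');

// Here the first > sits inside the attribute value, so [^>]* stops too early
// and the rest of the tag is treated as content and thrown away.
$bad = preg_replace($pattern, '$1New Content$2', '<div class=">">Content</div>');

echo $good, "\n";
echo $bad, "\n";
```

The second result is cut off in the middle of the class attribute, which is exactly the kind of breakage a real parser avoids.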
I have this block of html:
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
I'm trying to select the first, non-nested paragraph in that block. I'm using PHP's (perl style) preg_match to find it, but can't seem to figure out how to ignore the p tag contained within the div.
This is what I have so far, but it selects the contents of the first paragraph contained above.
/<p>(.+?)<\/p>/is
Thanks!
EDIT
Unfortunately, I don't have the luxury of a DOM Parser.
I completely appreciate the suggestions to not use RegEx to parse HTML, but that's not really helping my particular use case. I have a very controlled case where an internal application generated structured text. I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text and HTML was the simplest case I could think of to explain. My actual case looks something a little more like this (But a lot more data and minified):
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]
I have to reformat a certain column of certain rows across a ton of rows similar to those. Help with my first question would help my actual project.
Your regex won't work. Even if you had only non nested paragraph, your capturing parentheses would match First, non-nested ... Last paragraph..
Try:
<([^>]+)>([^<]*<(?!/?\1)[^<]*)*</\1>
and grab \2 if \1 is p.
But an HTML parser would do a better job of that imho.
How about something like this?
<p>([^<>]+)<\/p>(?=(<[^\/]|$))
Does a look-ahead to make sure it is not inside a closing tag; but can be at the end of a string. There is probably a better way to look for what is in the paragraph tags but you need to avoid being too greedy (a .+? will not suffice).
Use a three-step process. First, pray that everything is well formed. Second, remove everything that is nested.
s{<div>.*?</div>}{}gs; # HTML example (the s flag lets . cross line breaks)
s/#.*?#//g; # 2nd example
Then get your result. Everything that is left is now not nested.
($result) = m{<p>(.*?)</p>}; # HTML example (list context returns the capture)
($result) = m{\[(.*?)\]}; # 2nd example
(this is Perl. Don't know how different it would look in PHP).
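For comparison, a rough PHP translation of the HTML example. This is a sketch under the same assumption as above (the input is well formed and the divs are not nested inside each other); the s modifier is added because the sample spans several lines.

```php
<?php
$html = "<div>\n<p>First, nested paragraph</p>\n</div>\n"
      . "<p>First, non-nested paragraph.</p>\n"
      . "<p>Second paragraph.</p>\n"
      . "<p>Last paragraph.</p>";

// Step 2: remove everything that is nested.
$stripped = preg_replace('~<div>.*?</div>~s', '', $html);

// Step 3: whatever is left is not nested, so the first <p> is the one we want.
if (preg_match('~<p>(.*?)</p>~s', $stripped, $m)) {
    echo $m[1], "\n";
}
```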
"You shouldn't use regex to parse HTML."
It is what everybody says but nobody really offers an example of how to actually do it, they just preach it. Well, thanks to some motivation from Levi Morrison I decided to read into DomDocument and figure out how to do it.
To everybody that says "Oh, it is too hard to learn the parser, I'll just use regex." Well, I've never done anything with DomDocument or XPath before and this took me 10 minutes. Go read the docs on DomDocument and parse HTML the way you're supposed to.
$myHtml = <<<MARKUP
<html>
<head>
<title>something</title></head>
<body>
<div>
<p>not valid</p>
</div>
<p>is valid</p>
<p>is not valid</p>
<p>is not valid either</p>
<div>
<p>definitely not valid</p>
</div>
</body>
</html>
MARKUP;
$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
$nodeList = $DomXPath->query('body/p');
$yourNode = $DomDocument->saveHtml($nodeList->item(0));
var_dump($yourNode);
// output '<p>is valid</p>'
You might want to have a look at this post about parsing HTML with Regex.
Because HTML is not a regular language (and regular expressions can only describe regular languages), you can't parse out arbitrary chunks of HTML using regex. Use an HTML parser; it'll get the job done considerably more smoothly than trying to hack together some regex.
WordPress spits posts in this format:
<h2>Some header</h>
<p>First paragraph of the post</p>
<p>Second paragraph of the post</p>
etc.
To get my cool styling on the first paragraph (it's one of those things that looks good only sparingly) I need to hook into the get_posts function to filter its output with a preg_replace.
The goal is to get the above code to look like:
<h2>Some header</h>
<p class="first">First paragraph of the post</p>
<p>Second paragraph of the post</p>
I have this so far but it's not even working (the error is: "preg_replace() [function.preg-replace]: Unknown modifier ']'")
$output=preg_replace('<p[^>]*>', '<p class="first">', $content);
I can't use CSS3 meta-selectors because I need to support IE6, and I can't apply the :first-line meta-selector (this is one that IE6 supports) on the parent container because it would hit the H2 instead of the first P.
You may find it easier and more reliable to use an HTML parser such as this one. HTML is notoriously difficult to parse reliably (technically, impossible) with regular expressions, and the parser will give you a very simple means to find the nodes you're interested in. The first page of the doc has a tab labelled "How to modify HTML elements".
Two right possibilities:
Do that in JavaScript. Using jQuery, for example, it's a matter of one line: $("h2").next().addClass("first")
Use an HTML parser. Indeed, regexps are not a good tool for what you want to do. Since loading a whole HTML parser for just this purpose is overkill, you'd really be better off using JavaScript.
The wrong way
Of course, in order to answer the question, here is the best way I can think of to make it happen with regexp. Though, I don't recommend it.
preg_replace('#(</h2>\s*<p[^>]*)>#im', '$1 class="first">', '<h2>Some header</h> <p>First paragraph of the post</p> <p>Second paragraph of the post</p> ');
What we do is:
using preg_replace so we can use advanced regexp to replace the code;
using the "i" flag so the regexp ignores case (the "m" flag only changes how ^ and $ match, so it is not needed here; line breaks are already covered by \s* and [^>]*);
using </h2>\s* to match the closing "h2" tag and all the spaces/line breaks after it;
using <p[^>]* to match the "p" tag and its current attributes;
using parentheses to save that;
using "$1" to replace the matched string with the part we saved;
adding the class and closing the ">".
The first drawback I can think of is that it doesn't handle the case where a class already exists.
Oh, and by the way, you have <h2>...</h> instead of <h2>...</h2>. I don't know if it's a typo but I assumed it was. Replace in the regexp accordingly if it's not.
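The steps listed above can be sketched as follows, assuming the heading is closed with a proper </h2> and the post is held in a variable I have called $content:

```php
<?php
$content = "<h2>Some header</h2>\n"
         . "<p>First paragraph of the post</p>\n"
         . "<p>Second paragraph of the post</p>";

// Group 1 captures "</h2>", the whitespace after it, and the opening <p with
// any attributes; the final > is re-added after the new class attribute.
$output = preg_replace('#(</h2>\s*<p[^>]*)>#i', '$1 class="first">', $content);

echo $output, "\n";
```

Only a paragraph that directly follows a closing h2 is touched, so the second paragraph stays as it is.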
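The problem is that the first character of the regex in a preg_* function is taken as the pattern delimiter. With '<p[^>]*>', the < opens the pattern, the first > closes it, and everything after that is read as modifiers, hence the error. What you'd need is something like: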
$output = preg_replace('~<p\b([^>]*)>~', '<p class="first" \1>', $content, 1);
This also puts back any extra attributes the <p> may have.
Overall, though, it's cleaner to do with CSS selectors and a JS fallback for IE.
EDIT: Added replacement limit and word break.
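A quick sketch of what the limit and the word break buy you. The sample markup is made up, and I use $1 directly after the new attribute (instead of a space followed by \1) so that no stray space is left when the tag has no other attributes; that is my variation, not the answer's exact code.

```php
<?php
$content = "<pre>code</pre>\n<p id=\"a\">One</p>\n<p>Two</p>";

// \b stops <p from matching the start of <pre>;
// the fourth argument (1) limits the call to a single replacement.
$output = preg_replace('~<p\b([^>]*)>~', '<p class="first"$1>', $content, 1);

echo $output, "\n";
```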
In this particular case a regexp solution would be fairly easy:
echo preg_replace('~</h2>\s*<p~', "$0 class='first'", $html);
Reading through the answers there are some that will work but all have drawbacks of either using an external parsing library or possibly matching tags other than the P tag or also matching its attributes.
I ended up using this solution with the str_replace_once function from here:
str_replace_once('<p>', '<p class="first">', $content);
Simple enough and it works just as intended. Here's full WordPress code snippet to filter the first paragraph any time the_content() is called:
add_filter('the_content', 'first_p_style');
function first_p_style($content) {
$output=str_replace_once('<p>', '<p class="first">', $content);
return ($output);
}
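str_replace_once is not a built-in PHP function, and the linked implementation is not reproduced here. A minimal sketch of what such a helper could look like, built on strpos and substr_replace (this is my assumption, not necessarily the code behind the link):

```php
<?php
// Hypothetical helper: replace only the first occurrence of $search in $subject.
function str_replace_once(string $search, string $replace, string $subject): string
{
    $pos = strpos($subject, $search);
    if ($pos === false) {
        return $subject; // nothing to replace
    }
    return substr_replace($subject, $replace, $pos, strlen($search));
}

$content = '<p>First paragraph</p><p>Second paragraph</p>';
echo str_replace_once('<p>', '<p class="first">', $content), "\n";
```

Note that this only matches a bare <p>, so a first paragraph that already carries attributes would be skipped and the next bare <p> would get the class instead.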
Thanks for all the answers!