I've written a regular expression to get the first two paragraphs from a database clob which stores its content in HTML formatting.
I've checked with these online RegEx builder/checkers here and here and they both seem to be doing what I want them to do (I've altered the RegEx slightly since these checkers to handle the new line formatting, which I found afterwards).
However when I go to use this in my PHP it doesn't seem to want to get just the group I'm after, and instead matches everything.
Here is my preg_replace line:
$description = preg_replace('/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/', "$2", $description);
And here is my testing content in the format of the content I am getting
<p>
Paragraph 1</p>
<p>
Paragraph 2</p>
<p>
Paragraph 3</p>
I've had a look at this SO Post which didn't help.
Any Ideas?
EDIT
As pointed out in one of the comments you cannot Regex HTML in PHP (Don't know why, I'm not really bothered by that).
Now I'm opening the option for getting it in PL/SQL as well.
select
DBMS_LOB.substr(description, 32000, 1) /* How do I make this into a regular expression? */
from
blog_posts
Your input contains newlines, therefore you have to add the s modifier:
/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/s
Otherwise, .* breaks on newlines and the regex doesn't match.
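A minimal sketch of that fix, assuming the test content from the question sits in $description (the sample string below is typed in from the question, not read from a real CLOB):

```php
<?php
// Sample content copied from the question; real input would come from the database.
$description = "<p>\nParagraph 1</p>\n<p>\nParagraph 2</p>\n<p>\nParagraph 3</p>";

// The trailing "s" (PCRE_DOTALL) lets "." match newlines as well.
$pattern = '/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/s';

// Keep only group 2: the first two paragraphs.
$firstTwo = preg_replace($pattern, '$2', $description);

echo $firstTwo;
```

Without the s modifier the pattern fails to match at all, so preg_replace returns the input unchanged, which looks like "matching everything".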
You could take a look at the PHP Simple DOM Parser. Going by their manual, you could do something like so:
$html = str_get_html('your html string');
foreach($html->find('p') as $element) //This should get all the paragraph elements in your string.
echo $element->plaintext. '<br>';
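If pulling in a third-party library is not an option, PHP's bundled DOM extension can probably do the same job. A rough sketch, assuming the content is a fragment like the test data in the question and that only the first two paragraphs are wanted ($firstTwo is just an illustrative variable name):

```php
<?php
// Sample content copied from the question.
$description = "<p>\nParagraph 1</p>\n<p>\nParagraph 2</p>\n<p>\nParagraph 3</p>";

$dom = new DOMDocument();
// The @ hides warnings that loadHTML() may raise on fragments or sloppy markup.
@$dom->loadHTML($description);

$paragraphs = $dom->getElementsByTagName('p');
$firstTwo = '';

// Serialize at most the first two <p> elements back to HTML.
for ($i = 0; $i < min(2, $paragraphs->length); $i++) {
    $firstTwo .= $dom->saveHTML($paragraphs->item($i));
}

echo $firstTwo;
```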
Related
I'm trying to get the divs from many of my website files using regexes, but I'm failing
This is the thing I'm trying to do http://regexr.com/38to9
I need the following div with class data and more, with classes plainText and extData to actually be fitting the regex, everything inside. There's no extra divs inside the ones I listed.
I'm sitting on this for around 2 hours now and I can't figure it out.
It's the following for anyone who doesn't want to go visit that cool site
<div class="data">
Something
</div>
<div class="data">
Text in here
<a class="data" href="links"><img src="whatever.png"></a>
</div>
With regex
\s*<div class="(data|plainText|extData)">\s*(...)\s*<\/div>
The first div is highlighted, the second one isn't. Nor do I get any results with preg_match_all with php. Does it have anything to do with the fact I'm using tabs in the second div and I'm not using them in the first one?
(Wrote it quickly on the website to see if it works)
Have you tried using a parser instead?
$dom = new DOMDocument();
$dom->loadHTML($input);
$divs = $dom->getElementsByTagName('div');
foreach($divs as $div) {
if( preg_match("/\b(data|plainText|extData)\b/",$div->getAttribute("class"))) {
// do something to the $div
$div->setAttribute("title","I matched!");
}
}
$out = $dom->saveHTML();
// Because DOMDocument wraps our HTML in a minimal document, we need to extract the body;
// in this case, regex is okay because we have a known structure.
// The "s" modifier is needed because the saved HTML spans several lines:
$out = preg_replace("~.*?<body>(.*)</body>.*~s","$1",$out);
You have a great non-regex answer, but you should also know that you were really close...
With all disclaimers about parsing html with regex, adding the DOTALL modifier (?s) to your original expression matches what you want:
(?s)<div class="(data|plainText|extData)">\s*(.*?)\s*<\/div>
See demo.
How does this work?
The DOTALL modifier (?s) tells the engine that a dot can match a newline character. This is important for your (.*?) because the content of the divs can span several lines.
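As a quick check, here is a sketch that runs the pattern through preg_match_all. The ~ delimiters and the sample string (retyped from the question, with tabs in the second div) are my own choices for illustration:

```php
<?php
// Sample markup from the question; the second div is indented with tabs.
$html = "<div class=\"data\">\nSomething\n</div>\n"
      . "<div class=\"data\">\n\tText in here\n"
      . "\t<a class=\"data\" href=\"links\"><img src=\"whatever.png\"></a>\n</div>";

// (?s) turns on DOTALL inline, so (.*?) can cross line breaks.
$pattern = '~(?s)<div class="(data|plainText|extData)">\s*(.*?)\s*</div>~';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds the class names, $matches[2] the trimmed inner content.
print_r($matches[2]);
```

Both divs should now be captured; the tabs were never the problem, the line breaks were.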
I would like to find comment tags in a string that are NOT already inside a <pre> tag, and wrap them in a <pre> tag.
It seems like there's no way of 'finding' comments using the PHP DOM.
I'm using regex to do some of the processing already, however I am very unfamiliar with (have yet to grasp or truly understand) look aheads and look behinds in regex.
For instance I may have the following code;
<!-- Comment 1 -->
<pre>
<div class="some_html"></div>
<!-- Comment 2 -->
</pre>
I would like to wrap Comment 1 in <pre> tags, but obviously not Comment 2 as it already resides in a <pre>.
How would this usually be done in RegEx?
Here's kind of what I've understood about negative look arounds, and my attempt at one, I'm clearly doing something very wrong!
(?<!<pre>.*?)<!--.*-->(?!.*?</pre>)
You should really use a DOM parser if you are planning on re-using this code. Every regex approach will fail horribly sooner rather than later when presented with real-world HTML.
Having said that, here's what you could (but should not, see above) do:
First, identify comments, e.g. using
<!-- (?:(?!-->).)*-->
The negative look-ahead block ensures that the .* does not run out of the comment block.
Now, you need to figure out if this comment is inside a <pre> block. The key observation here, is that there is an even number of either <pre> or </pre> elements following every comment NOT already included in one.
So, run through the rest of your text, always in pairs of <pre>s, and check if you arrive at the end.
This would look like
(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)
So, together this would be
<!-- (?:(?!-->).)*-->(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)
A hurray for write-only code =)
The prominent building block of this expression is (?:(?!</?pre>).) which matches every character that is not the starting bracket of a <pre> or </pre> sequence.
Allowing attributes on the <pre> and proper escaping are left as an exercise for the reader. See this in action at RegExr.
It seems like there's no way of 'finding' comments using the PHP DOM.
Of course you can... Check this code using PHP Simple HTML DOM Parser:
<?php
$text = '<!-- Comment 1 -->
<pre>
<div class="some_html"></div>
<!-- Comment 2 -->
</pre>';
echo "<div>Original Text: <xmp>$text</xmp></div>";
$html = str_get_html($text);
$comments = $html->find('comment');
// if find exists
if ($comments) {
echo '<br>Find function found '. count($comments) . ' results: ';
foreach($comments as $key=>$com){
echo '<br>'.$key . ': ' . $com->tag . ' which contains = <xmp>' . $com->innertext . '</xmp>';
}
}
else
echo "Find() fails !";
?>
$com->innertext will give you the comments like <!-- Comment 1 -->...
You have now just to clean them as you wish. For example using <!--\s*(.*)\s*-->... Try it HERE
Edit:
Just a note concerning the lookbehind: it MUST have a fixed width, so you cannot use repetition (* or +) or optional items (?).
The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.
Therefore, many regex flavors, including those used by Perl and Python, only allow fixed-length strings. You can use any regex of which the length of the match can be predetermined. This means you can use literal text and character classes. You cannot use repetition or optional items. You can use alternation, but only if all options in the alternation have the same length.
Source: http://www.regular-expressions.info/lookaround.html
Xpath is your friend:
$xpath = new DOMXpath($doc);
foreach($xpath->query('//comment()[not(ancestor::pre)]') as $comment){
$pre = $doc->createElement("pre");
$comment->parentNode->insertBefore($pre, $comment);
$pre->appendChild($comment);
}
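For completeness, a self-contained sketch of the same loop including the setup of $doc. The sample markup is an assumption: it is wrapped in a div and uses a span inside the pre so that the HTML parser has no reason to rearrange it, and the two counters at the end exist only to check the result.

```php
<?php
// Assumed sample: one comment outside <pre>, one inside.
$html = '<div id="root"><!-- Comment 1 -->'
      . '<pre><span class="some_html"></span><!-- Comment 2 --></pre></div>';

$doc = new DOMDocument();
// The @ hides warnings that loadHTML() may raise on fragments.
@$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Select every comment node that has no <pre> ancestor and wrap it.
foreach ($xpath->query('//comment()[not(ancestor::pre)]') as $comment) {
    $pre = $doc->createElement('pre');
    $comment->parentNode->insertBefore($pre, $comment);
    $pre->appendChild($comment);
}

// Sanity check: every comment should now sit directly inside a <pre>.
$wrapped = $xpath->query('//pre/comment()')->length;
$unwrapped = $xpath->query('//comment()[not(ancestor::pre)]')->length;

echo $doc->saveHTML();
```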
It's quite easy using a principle called the stack counter: count the number of <pre> tags and the number of </pre> tags up to the point in the HTML where your segment sits. If there are more <pre> than </pre>, you are inside a block (<pre>..--you are here--..</pre>). In that case, simply return the match unmodified.
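One way this counting idea might look in PHP. The function name wrapCommentsOutsidePre is hypothetical, and the sketch assumes PHP 7.4 or later (for the flags argument of preg_replace_callback) and well-balanced <pre> tags:

```php
<?php
// Hypothetical helper: wraps comments that are not already inside <pre>.
function wrapCommentsOutsidePre(string $html): string
{
    return preg_replace_callback(
        '/<!--.*?-->/s',
        function (array $m) use ($html): string {
            // With PREG_OFFSET_CAPTURE each match is [text, byte offset].
            [$comment, $offset] = $m[0];
            $before = substr($html, 0, $offset);

            // Count opening and closing <pre> tags before the comment.
            $opened = preg_match_all('/<pre\b[^>]*>/i', $before);
            $closed = preg_match_all('/<\/pre>/i', $before);

            // More openers than closers means we are inside a <pre>.
            return $opened > $closed ? $comment : "<pre>$comment</pre>";
        },
        $html,
        -1,
        $count,
        PREG_OFFSET_CAPTURE
    );
}

$text = "<!-- Comment 1 -->\n<pre>\n<div class=\"some_html\"></div>\n<!-- Comment 2 -->\n</pre>";
$result = wrapCommentsOutsidePre($text);
echo $result;
```

Like any regex approach, this breaks on unbalanced tags or on the text "<pre>" appearing inside attribute values or scripts.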
I have this block of html:
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
I'm trying to select the first, non-nested paragraph in that block. I'm using PHP's (perl style) preg_match to find it, but can't seem to figure out how to ignore the p tag contained within the div.
This is what I have so far, but it selects the contents of the first paragraph contained above.
/<p>(.+?)<\/p>/is
Thanks!
EDIT
Unfortunately, I don't have the luxury of a DOM Parser.
I completely appreciate the suggestions to not use RegEx to parse HTML, but that's not really helping my particular use case. I have a very controlled case where an internal application generated structured text. I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text and HTML was the simplest case I could think of to explain. My actual case looks something a little more like this (But a lot more data and minified):
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]
I have to reformat a certain column of certain rows to a ton of rows similar to that. Helping my first question would help actual project.
Your regex won't work. Even if you had only non nested paragraph, your capturing parentheses would match First, non-nested ... Last paragraph..
Try:
<([^>]+)>([^<]*<(?!/?\1)[^<]*)*<\1>
and grab \2 if \1 is p.
But an HTML parser would do a better job of that imho.
How about something like this?
<p>([^<>]+)<\/p>(?=(<[^\/]|$))
Does a look-ahead to make sure it is not inside a closing tag; but can be at the end of a string. There is probably a better way to look for what is in the paragraph tags but you need to avoid being too greedy (a .+? will not suffice).
Use a three-step process. First, pray that everything is well formed. Second, remove everything that is nested.
s{<div>.*?</div>}{}gs; # HTML example (/s lets . span line breaks)
s/#.*?#//gs; # 2nd example
Then get your result. Everything that is left is now not nested.
($result) = m{<p>(.*?)</p>}s; # HTML example (list context returns the capture)
($result) = m{\[(.*?)\]}; # 2nd example
(this is Perl. Don't know how different it would look in PHP).
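For what it's worth, a PHP sketch of the same remove-then-match idea. It assumes the nested blocks are plain <div>...</div> pairs that do not nest inside each other, and the $data string is a shortened stand-in for the real rows:

```php
<?php
$html = "<div>\n<p>First, nested paragraph</p>\n</div>\n"
      . "<p>First, non-nested paragraph.</p>\n<p>Second paragraph.</p>\n<p>Last paragraph.</p>";

// Step 1: drop everything that is nested (here: whole <div> blocks).
$flat = preg_replace('~<div>.*?</div>~s', '', $html);

// Step 2: the first remaining <p> is the first non-nested one.
$firstParagraph = preg_match('~<p>(.*?)</p>~s', $flat, $m) ? $m[1] : null;

// Same idea for the bracketed data: remove the #...# rows, then match.
$data = "#[BILLINGCODE|12345]#\n[ITEM1|{{Escaped Description}}|1]\n"
      . "#[{{Additional Details }}]#\n[ITEM2|{{Escaped Description}}|3]";
$items = preg_replace('/#.*?#/s', '', $data);
$firstItem = preg_match('/\[(.*?)\]/', $items, $m) ? $m[1] : null;

echo $firstParagraph, "\n", $firstItem, "\n";
```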
"You shouldn't use regex to parse HTML."
It is what everybody says but nobody really offers an example of how to actually do it, they just preach it. Well, thanks to some motivation from Levi Morrison I decided to read into DomDocument and figure out how to do it.
To everybody that says "Oh, it is too hard to learn the parser, I'll just use regex." Well, I've never done anything with DomDocument or XPath before and this took me 10 minutes. Go read the docs on DomDocument and parse HTML the way you're supposed to.
$myHtml = <<<MARKUP
<html>
<head>
<title>something</title></head>
<body>
<div>
<p>not valid</p>
</div>
<p>is valid</p>
<p>is not valid</p>
<p>is not valid either</p>
<div>
<p>definitely not valid</p>
</div>
</body>
</html>
MARKUP;
$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
$nodeList = $DomXPath->query('body/p');
$yourNode = $DomDocument->saveHtml($nodeList->item(0));
var_dump($yourNode);
// output '<p>is valid</p>'
You might want to have a look at this post about parsing HTML with Regex.
Because HTML is not a regular language (and Regular Expressions are), you can't parse out arbitrary chunks of HTML using Regex. Use an HTML parser; it'll get the job done considerably more smoothly than trying to hack together some regex.
I am trying to index some content from a series of .html's that share the same format.
So I get a lot of lines like this: <a href="meh">[18] blah blah blah < a...
And the idea is to extract the number (18) and the text next to it (blah...). Furthermore, I know that every qualifying line will start with "> and end with either <a or </p. The issue stems from the need to keep all other HTML tags as part of the text (<i>, <u>, etc.).
So then I have something like this:
$docString = file_get_contents("http://whatever.com/some.htm");
$regex="/\">\ [(.*?)\ ] (<\/a>)(.) *?(<)/";
preg_match_all($regex,$docString,$match);
Let's look at $regex for a sec. Ignore its spaces; I just put them here because otherwise some characters disappear. I specify that it will start with ">. Then I do the numbers inside the [] thing. Then I single out the </a>. So far so good.
At the end, I do a (.)*?(<). This is the turning point. By leaving the last bit, (<), like that, the text will be interrupted when an underline or italics tag is found. However, if I put (<a|</p) the resulting array ends up empty. I've tried changing that to only (<a), but it seems that 2 characters mess up the whole thing.
What can I do? I've been struggling with this all day.
PHP Tidy is your friend. Don't use regexes.
Something like /">\[(.*)\](.*)(?:<(?:a|\/p))/ seems to work fine for your given example and description. Perhaps adding non-capturing subpatterns does it? Please provide a counterexample wherein this doesn't work for you.
Though I agree that RegEx isn't a parser, it sounds like what you're looking for is part of a regularly behaved string - which is exactly what RegEx is strong at.
As you've found, using a regex to parse HTML is not very easy. This is because HTML is not particularly regular.
I suggest using an XML parser such as PHP's DomDocument.
Create an object, then use the loadHTMLFile method to open the file. Extract your a tags with getElementsByTagName, and then read the content from each node's nodeValue property.
It might look like
// Create a DomDocument object
$html = new DOMDocument();
// Load the url's contents into the DOM
$html->loadHTMLFile("http://whatever.com/some.htm");
// make an array to hold the text
$anchors = array();
//Loop through the a tags and store them in an array
foreach($html->getElementsByTagName('a') as $link) {
$anchors[] = $link->nodeValue;
}
One alternative to this style of XML/HTML parser is phpquery. The documentation on their page should do a good job of explaining how to extract the tags. If you know jQuery, the interface may seem more natural.
WordPress spits posts in this format:
<h2>Some header</h>
<p>First paragraph of the post</p>
<p>Second paragraph of the post</p>
etc.
To get my cool styling on the first paragraph (it's one of those things that looks good only sparingly) I need to hook into the get_posts function to filter its output with a preg_replace.
The goal is to get the above code to look like:
<h2>Some header</h>
<p class="first">First paragraph of the post</p>
<p>Second paragraph of the post</p>
I have this so far but it's not even working (the error is: "preg_replace() [function.preg-replace]: Unknown modifier ']'")
$output=preg_replace('<p[^>]*>', '<p class="first">', $content);
I can't use CSS3 meta-selectors because I need to support IE6, and I can't apply the :first-line meta-selector (this is one that IE6 supports) on the parent container because it would hit the H2 instead of the first P.
You may find it easier and more reliable to use an HTML parser such as this one. HTML is notoriously difficult to parse reliably (technically, impossible) with regular expressions, and the parser will give you a very simple means to find the nodes you're interested in. The first page of the doc has a tab labelled "How to modify HTML elements".
Two right possibilities :
Do that in Javascript. Using jQuery, for example, it's a matter of one line : $("h2").next().addClass("first")
Use an HTML parser. Indeed, regexp are not a good tool to do what you want to do. Since loading a whole HTML parser for just this purpose is overkill, you'd really better be using Javascript.
The wrong way
Of course, in order to answer the question, here is the best way I can think of to make it happen with regexp, though I don't recommend it.
preg_replace('#(</h2>\s*<p[^>]*)>#im', '$1 class="first">', '<h2>Some header</h2> <p>First paragraph of the post</p> <p>Second paragraph of the post</p> ');
What we do is:
using preg_replace so we can use advanced regexp to replace the code;
using the "i" flag so the regexp ignores case (the "m" flag only changes how ^ and $ anchor, so it has no effect here; \s already matches line breaks);
using </h2>\s* to match the closing "h2" tags and all the spaces/line breaks after;
using <p[^>]* to match the "p" tag and its current attributes;
using parenthesis to save that;
using "$1" to replace the matched string with the part we saved;
adding the class and closing the ">".
The first drawback I can think of is that it doesn't handle the case where a class already exists.
Oh, and by the way, you have <h2>...</h> instead of <h2>...</h2>. I don't know if it's a typo but I assumed it was. Replace in the regexp accordingly if it's not.
The problem is that the first character of the regex in a preg_* function is taken as the pattern delimiter, so '<p[^>]*>' is read as the pattern p[^ followed by the modifiers ]*>. What you'd need is something like:
$output = preg_replace('~<p\b([^>]*)>~', '<p class="first"\1>', $content, 1);
This also puts back any extra attributes the <p> may have.
Overall, though, it's cleaner to do with CSS selectors and a JS fallback for IE.
EDIT: Added replacement limit and word break.
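A small sketch of the delimiter fix in action, assuming $content looks like the WordPress output in the question (with the </h2> typo corrected and an id attribute added to the second paragraph purely for illustration):

```php
<?php
$content = "<h2>Some header</h2>\n"
         . "<p>First paragraph of the post</p>\n"
         . "<p id=\"x\">Second paragraph of the post</p>";

// "~" is the delimiter, \b stops <p from matching <pre> or <param>,
// $1 puts back any existing attributes, and the limit of 1 touches only the first <p>.
$output = preg_replace('~<p\b([^>]*)>~', '<p class="first"$1>', $content, 1);

echo $output;
```

Any character that does not appear in the pattern works as a delimiter; ~ and # are common picks for HTML because / shows up in closing tags.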
in this particular case regexp solution would be fairly easy
echo preg_replace('~</h2>\s*<p~', "$0 class='first'", $html);
Reading through the answers there are some that will work but all have drawbacks of either using an external parsing library or possibly matching tags other than the P tag or also matching its attributes.
I ended up using this solution with the str_replace_once function from here:
str_replace_once('<p>', '<p class="first">', $content);
Simple enough, and it works just as intended. Here's the full WordPress code snippet to filter the first paragraph any time the_content() is called:
add_filter('the_content', 'first_p_style');
function first_p_style($content) {
$output=str_replace_once('<p>', '<p class="first">', $content);
return ($output);
}
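Note that str_replace_once is not a built-in PHP function. The linked helper is not reproduced here, so the following is only an assumed implementation of what such a function typically looks like:

```php
<?php
// Assumed implementation: replaces only the first occurrence of $search.
function str_replace_once(string $search, string $replace, string $subject): string
{
    $pos = strpos($subject, $search);
    if ($pos === false) {
        // Nothing to replace; return the input untouched.
        return $subject;
    }
    return substr_replace($subject, $replace, $pos, strlen($search));
}

$content = "<h2>Some header</h2>\n<p>First</p>\n<p>Second</p>";
$styled = str_replace_once('<p>', '<p class="first">', $content);
echo $styled;
```

Because it matches the literal string <p>, a first paragraph that already carries attributes would be skipped.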
Thanks for all the answers!