I would like to process the html from a webpage and extract the paragraphs that match my criteria. The flavor of regex is PHP.
This is the sample webpage HTML:
<div class="special">
<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>
</div>
The regex looks between the <div class="special"> and </div> tags and puts everything into a capture group or variable for reference in the next step. This next step is what I am having trouble with. I cannot for the life of me write a regex that captures each paragraph of text between <p> and </p>.
I have tried /<p>(.+?)<\/p>/s which returns:
<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>
I would like each paragraph to be returned individually as items in an array. The non greedy ? does not seem to work. Any suggestions?
You have to escape your slash for the p tag.
So it's going to be
/<p>(.+?)<\/p>/s
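For reference, here is a minimal sketch of how that pattern can return each paragraph as its own array item. It assumes well-formed input like the sample in the question; the variable names $html and $paragraphs are just for illustration.

```php
<?php
// Sample input copied from the question (assumed well formed).
$html = '<div class="special">
<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>
</div>';

// preg_match_all collects every match; index 1 holds the first capture group.
preg_match_all('/<p>(.+?)<\/p>/s', $html, $matches);
$paragraphs = $matches[1];

print_r($paragraphs);
```

preg_match would stop after the first hit, so preg_match_all is the call to use when every paragraph is wanted.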
So stupid! The regex works perfectly. All the regexes work perfectly. The problem was with the inputs.
The input HTML file I was processing had the following structure which made the regex not work.
<p>Some interesting text I would like to extract
<p>More interesting text I would like to extract
<p>Even more interesting text I would like to extract</p></p></p>
I used var_dump(htmlfile.html) to see the HTML page I was getting, but my browser rendered it, so I was not seeing the raw data. I was able to get the raw data and find my mistake by using:
include 'filename.php';
file_put_contents('filename.php', $data);
Now I know to not trust my browser to return raw data ever again!
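Another way to inspect raw markup in a browser is to escape it before printing. The sketch below is only an illustration: the $data string is a shortened stand-in for the malformed input above, and it also shows why the regex returned a single match on that input.

```php
<?php
// Shortened stand-in for the malformed input: closing tags bunched at the end.
$data = "<p>Some interesting text\n<p>More interesting text\n<p>Even more interesting text</p></p></p>";

// Escaping makes the browser show the markup itself instead of rendering it.
$escaped = htmlspecialchars($data);
echo '<pre>' . $escaped . '</pre>';

// There is no </p> before the very end, so the lazy match
// has to run from the first <p> to the first </p>: one match in total.
$count = preg_match_all('/<p>(.+?)<\/p>/s', $data, $matches);
```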
I've written a regular expression to get the first two paragraphs from a database clob which stores its content in HTML formatting.
I've checked with these online RegEx builder/checkers here and here and they both seem to be doing what I want them to do (I've altered the RegEx slightly since these checkers to handle the newline formatting, which I found afterwards).
However when I go to use this in my PHP it doesn't seem to want to get just the group I'm after, and instead matches everything.
Here is my preg_replace line:
$description = preg_replace('/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/', "$2", $description);
And here is my testing content in the format of the content I am getting
<p>
Paragraph 1</p>
<p>
Paragraph 2</p>
<p>
Paragraph 3</p>
I've had a look at this SO Post which didn't help.
Any Ideas?
EDIT
As pointed out in one of the comments you cannot Regex HTML in PHP (Don't know why, I'm not really bothered by that).
Now I'm opening the option for getting it in PL/SQL as well.
select
DBMS_LOB.substr(description, 32000, 1) /* How do I make this into a regular expression? */
from
blog_posts
Your input contains newlines, therefore you have to add the s modifier:
/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/s
Otherwise, .* breaks on newlines and the regex doesn't match.
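A small sketch of the difference, using the test content from the question. The variable names $without and $with are just for illustration.

```php
<?php
// Test content from the question: each <p> is followed by a newline.
$description = "<p>\nParagraph 1</p>\n<p>\nParagraph 2</p>\n<p>\nParagraph 3</p>";

// Without /s the dot cannot cross a newline, the pattern never matches,
// and preg_replace hands back the input unchanged.
$without = preg_replace('/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/', '$2', $description);

// With /s the dot matches newlines too, so group 2 holds the first two paragraphs.
$with = preg_replace('/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/s', '$2', $description);

echo $with;
```

The unchanged return value in the first call is why it looked as if the regex "matched everything".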
You could take a look at the PHP Simple DOM Parser. Going by their manual, you could do something like so:
include 'simple_html_dom.php'; // the library file that defines str_get_html()
$html = str_get_html('your html string');
// find('p') should return every paragraph element in your string
foreach ($html->find('p') as $element) {
    echo $element->plaintext . '<br>';
}
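If installing a third-party library is not an option, PHP's built-in DOM extension can do roughly the same thing. This is only a sketch under the assumption that the clob content looks like the test content in the question; the names $firstTwo and $texts are made up for the example.

```php
<?php
// Test content from the question.
$description = "<p>\nParagraph 1</p>\n<p>\nParagraph 2</p>\n<p>\nParagraph 3</p>";

$doc = new DOMDocument();
$doc->loadHTML($description);

$paragraphs = $doc->getElementsByTagName('p');
$firstTwo = ''; // markup of the first two paragraphs
$texts = [];    // their trimmed text content

// Stop after two paragraphs, or earlier if the document has fewer.
for ($i = 0; $i < min(2, $paragraphs->length); $i++) {
    $p = $paragraphs->item($i);
    $firstTwo .= $doc->saveHTML($p);
    $texts[] = trim($p->textContent);
}

echo $firstTwo;
```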
I have a field in the database containing <p>intro content</p><p>rest of content....</p>
I would like to extract the intro, meaning the content between the first <p>...</p>
After that I want to remove the intro <p>intro content</p> from the content part.
Could you help me with that?
If you want to do any kind of parsing on HTML, you really shouldn't use regex. However, your question is fairly simple if all you want is the text content of the first <p> tag in a string:
preg_match('#<p>.*?</p>#', $string, $matches);
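To cover both parts of the question (take the intro, then strip it), something along these lines might work. It is a sketch that assumes the first <p> has no attributes and no nested <p>; $content, $intro, and $rest are illustrative names.

```php
<?php
// Field value as described in the question.
$content = '<p>intro content</p><p>rest of content....</p>';

// Capture the text inside the first <p>...</p>.
$intro = '';
if (preg_match('#<p>(.*?)</p>#s', $content, $m)) {
    $intro = $m[1];
}

// Remove only the first paragraph: the limit of 1 stops after one replacement.
$rest = preg_replace('#<p>.*?</p>#s', '', $content, 1);

echo $intro, "\n", $rest, "\n";
```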
For my customer I wrote a custom web-based WYSIWYG HTML editor. It allows them to format basic HTML text and insert images. When they insert images I insert them with pattern like ##image1##. The produced HTML can be something like this:
<p>some text and some more text</p>
<p>some text and some <b>bold text</b></p>
<div>##image1##</div>
<p>more text can follow here</p>
<div>##image2##</div>
When outputting this HTML I search through it and replace ##image1##, ##image2## and so on with the HTML markup that actually displays the images. My replace code is here:
// first find all occurrences of image string
preg_match_all('|##(.+)##|', $inputHTML, $matches);
// then, for every match in $inputHTML:
$output = preg_replace('|##(.+)##|', $imageHTML, $inputHTML, 1 );
This works most of the time, but some variations of the input HTML produce strange results. One piece of HTML that produces a strange result is:
<div>##image1##</div><p class="align-justify"><strong>Peter Dekleva</strong>, <strong>Damir Lisica</strong>, <strong>Anej Kočevar</strong> in <strong>Gregor Jakac</strong> so glasbeniki, ki v svoji glasbi združujejo silovite instrumentalne vložke, markantne melodije in močna besedila.</p><div>##image2##</div><p class="align-justify">Video dvojček skladbe Brez strahu torej prikazuje oblico sproščenih trenutkov iz zaodrja, veličasnih posnetkov s koncertnega dogajanja, priprav na nastope, nepredvidljive zaključke noči.</p>
If I edit that HTML and add a line break before <div>##image2##</div> then it parses OK. Any idea what is happening here and why I have problems?
I am also open to suggestions for a better way of doing this. I can insert something else instead of ##image1## when inserting an image in my WYSIWYG editor... Thanks
This is because the + quantifier is greedy, so it will match everything up to the last instance of ## on the line. Try adding a ? after the + to make it ungreedy.
|##(.+?)##|
The reason a line break fixes the problem is that by default the . doesn't match line breaks. However, if you had used |##(.+)##|s instead, the line break wouldn't have fixed the problem.
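Here is a short sketch of the greedy versus ungreedy behaviour, plus one possible way to swap in the image markup in a single pass with preg_replace_callback. The $images lookup table and its file names are hypothetical; the question does not say how $imageHTML is built.

```php
<?php
// Single-line input, like the problematic HTML in the question (shortened).
$inputHTML = '<div>##image1##</div><p>text</p><div>##image2##</div>';

// Greedy: .+ runs to the last ## on the line, swallowing both placeholders.
preg_match('|##(.+)##|', $inputHTML, $greedy);

// Ungreedy: .+? stops at the first closing ##.
preg_match('|##(.+?)##|', $inputHTML, $lazy);

// Hypothetical lookup table from placeholder name to image markup.
$images = [
    'image1' => '<img src="one.jpg">',
    'image2' => '<img src="two.jpg">',
];

// Replace every placeholder in one pass; unknown names are left untouched.
$output = preg_replace_callback('|##(.+?)##|', function (array $m) use ($images) {
    return $images[$m[1]] ?? $m[0];
}, $inputHTML);

echo $output;
```

Using a callback avoids the loop of preg_replace calls with a limit of 1, since each match is looked up by its own name.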
Edit I just noticed that churk's answer to your previous question would have also worked correctly.
You should create <img/> directly, but anyway, if you don't use # in your image names, use [^#] instead of .
Also, if you are not sure that ## won't be used elsewhere in the HTML, test for <div> too:
<div>##([^#]+)##</div>
I have this block of html:
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
I'm trying to select the first, non-nested paragraph in that block. I'm using PHP's (perl style) preg_match to find it, but can't seem to figure out how to ignore the p tag contained within the div.
This is what I have so far, but it selects the contents of the first paragraph contained above.
/<p>(.+?)<\/p>/is
Thanks!
EDIT
Unfortunately, I don't have the luxury of a DOM Parser.
I completely appreciate the suggestions to not use RegEx to parse HTML, but that's not really helping my particular use case. I have a very controlled case where an internal application generated structured text. I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text and HTML was the simplest case I could think of to explain. My actual case looks something a little more like this (But a lot more data and minified):
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]
I have to reformat a certain column of certain rows in a ton of rows similar to that. An answer to my first question would help with my actual project.
Your regex won't work. Even if you had only non nested paragraph, your capturing parentheses would match First, non-nested ... Last paragraph..
Try:
<([^>]+)>([^<]*<(?!/?\1)[^<]*)*<\1>
and grab \2 if \1 is p.
But an HTML parser would do a better job of that imho.
How about something like this?
<p>([^<>]+)<\/p>(?=(<[^\/]|$))
Does a look-ahead to make sure it is not inside a closing tag; but can be at the end of a string. There is probably a better way to look for what is in the paragraph tags but you need to avoid being too greedy (a .+? will not suffice).
Use a three-step process. First, pray that everything is well formed. Second, remove everything that is nested.
s{<div>.*?</div>}{}gs; # HTML example (/s lets . cross the newlines inside the div)
s/#.*?#//g; # 2nd example
Third, get your result. Everything that is left is now not nested.
($result) = m{<p>(.*?)</p>}; # HTML example (list context returns the capture)
($result) = m{\[(.*?)\]}; # 2nd example
(This is Perl. I don't know how different it would look in PHP.)
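For what it's worth, a rough PHP translation of the same idea might look like this. It is a sketch that assumes the nested parts are never nested more than one level deep, and the shortened $data string is made up to resemble the second example.

```php
<?php
// HTML example: strip the <div> blocks, then take the first remaining <p>.
$html = "<div>\n<p>First, nested paragraph</p>\n</div>\n"
      . "<p>First, non-nested paragraph.</p>\n<p>Second paragraph.</p>";
$flat = preg_replace('#<div>.*?</div>#s', '', $html);
preg_match('#<p>(.*?)</p>#s', $flat, $m);
$first = $m[1] ?? null;

// 2nd example: strip the #...# wrapped parts, then collect the [...] items.
$data = '#[BILLINGCODE|12345]#[ITEM1|desc|1]#[{{Additional Details }}]#[ITEM2|desc|3]';
$stripped = preg_replace('/#.*?#/s', '', $data);
preg_match_all('/\[(.*?)\]/', $stripped, $items);

echo $first, "\n";
print_r($items[1]);
```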
"You shouldn't use regex to parse HTML."
It is what everybody says but nobody really offers an example of how to actually do it, they just preach it. Well, thanks to some motivation from Levi Morrison I decided to read into DomDocument and figure out how to do it.
To everybody that says "Oh, it is too hard to learn the parser, I'll just use regex." Well, I've never done anything with DomDocument or XPath before and this took me 10 minutes. Go read the docs on DomDocument and parse HTML the way you're supposed to.
$myHtml = <<<MARKUP
<html>
<head>
<title>something</title></head>
<body>
<div>
<p>not valid</p>
</div>
<p>is valid</p>
<p>is not valid</p>
<p>is not valid either</p>
<div>
<p>definitely not valid</p>
</div>
</body>
</html>
MARKUP;
$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
$nodeList = $DomXPath->query('body/p');
$yourNode = $DomDocument->saveHtml($nodeList->item(0));
var_dump($yourNode);
// output '<p>is valid</p>'
You might want to have a look at this post about parsing HTML with Regex.
Because HTML is not a regular language (and Regular Expressions are), you can't parse out arbitrary chunks of HTML using Regex. Use an HTML parser; it'll get the job done considerably more smoothly than trying to hack together some regex.
As seen in the title, here is the HTML code example:
<body>
<!--CODE_START-->
<p>I <strong>Want</strong> this</p>
<p>And this one too</p>
<!--CODE_STOP-->
<p>This sould be go to trash</p>
<!--CODE_START-->
<p>This one should be included too</p>
<!--CODE_STOP-->
The question is, I want everything inside <!--CODE_START--> and <!--CODE_STOP--> so the result should be:
<p>I <strong>Want</strong> this</p>
<p>And this one too</p>
and <p>This one should be included too</p>
I tried using /<!--CODE_START-->([^<]*)<!--CODE_STOP-->/ and /<!--CODE_START-->(.*)<!--CODE_STOP-->/ with combinations of pattern modifiers like su, imu, im. It won't work; it just returns an empty array. Also, it's a full HTML page that I am trying to grab.
Thanks in advance.
[^<] means everything that is not a <, so obviously it's going to fail at <p>. Just catch everything and use the non-greedy option:
preg_match_all('/<!--CODE_START-->(.*)<!--CODE_STOP-->/sU', $foo, $matches);
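A runnable sketch of that call against the sample from the question. The third argument and the names $html and $blocks are additions for the example; trimming is optional and only removes the newlines next to the comment markers.

```php
<?php
// Sample page fragment from the question.
$html = "<body>\n<!--CODE_START-->\n<p>I <strong>Want</strong> this</p>\n"
      . "<p>And this one too</p>\n<!--CODE_STOP-->\n<p>This should go to trash</p>\n"
      . "<!--CODE_START-->\n<p>This one should be included too</p>\n<!--CODE_STOP-->";

// s: the dot also matches newlines. U: quantifiers become ungreedy,
// so (.*) here behaves like (.*?) would without the U modifier.
preg_match_all('/<!--CODE_START-->(.*)<!--CODE_STOP-->/sU', $html, $matches);

// Group 1 holds the content of each START/STOP pair.
$blocks = array_map('trim', $matches[1]);

print_r($blocks);
```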