I have a field in the database containing <p>intro content</p><p>rest of content....</p>
I would like to extract the intro content, meaning the content between the first <p>...</p> pair.
After that, I want to remove the intro paragraph <p>intro content</p> from the content.
Could you help me with that?
If you want to do any kind of parsing on HTML, you really shouldn't use regex. However, your question is fairly simple if all you want is the text content of the first <p> tag in a string:
preg_match('#<p>(.*?)</p>#s', $string, $matches); // $matches[1] holds the text of the first paragraph
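The question also asks to remove the intro paragraph afterwards. A minimal sketch of both steps, assuming the field value has already been loaded into a string (the variable names `$content` and `$intro` are hypothetical) and that the first paragraph has no attributes or nested <p> tags:

```php
<?php
// Hypothetical sample value standing in for the database field.
$content = '<p>intro content</p><p>rest of content....</p>';

$intro = '';
// Capture the inner text of the first <p>...</p>.
// The s modifier lets . match newlines inside the paragraph.
if (preg_match('#<p>(.*?)</p>#s', $content, $matches)) {
    $intro = $matches[1];
    // Remove only the first paragraph (limit = 1) from the content.
    $content = preg_replace('#<p>.*?</p>#s', '', $content, 1);
}

echo $intro, "\n";   // intro content
echo $content, "\n"; // <p>rest of content....</p>
```

The fourth argument of preg_replace limits the number of replacements, so later paragraphs are left alone.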
I would like to process the html from a webpage and extract the paragraphs that match my criteria. The flavor of regex is PHP.
This is the sample webpage HTML:
<div class="special">
<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>
</div>
The regex looks between the <div class="special"> and </div> tags and puts everything into a capture group or variable for reference in the next step. This next step is what I am having trouble with. I cannot for the life of me write a regex that captures each paragraph of text between <p> and </p>.
I have tried /<p>(.+?)<\/p>/s which returns:
<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>
I would like each paragraph to be returned individually as an item in an array. The non-greedy ? does not seem to work. Any suggestions?
You have to escape your slash for the p tag.
So it's going to be
/<p>(.+?)<\/p>/s
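To get each paragraph as a separate array item, the same pattern can be passed to preg_match_all. This is a sketch that assumes well-formed input like the sample from the question; `$html` and `$paragraphs` are hypothetical names:

```php
<?php
// Sample HTML copied from the question.
$html = <<<HTML
<div class="special">
<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>
</div>
HTML;

// preg_match_all collects every match; $matches[1] holds the
// contents of the first capture group for each match.
preg_match_all('/<p>(.+?)<\/p>/s', $html, $matches);
$paragraphs = $matches[1];

print_r($paragraphs);
```

Each element of `$paragraphs` is the text of one paragraph, without the surrounding tags.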
So stupid! The regex works perfectly. All the regexes work perfectly. The problem was with the inputs.
The input HTML file I was processing had the following structure, which broke the regex:
<p>Some interesting text I would like to extract
<p>More interesting text I would like to extract
<p>Even more interesting text I would like to extract</p></p></p>
I used var_dump(htmlfile.html) to see the HTML page I was getting, but my browser rendered it, so I was not seeing the raw data. I was able to get the raw data and find my mistake by using:
include 'filename.php';
file_put_contents('filename.php', $data);
Now I know to not trust my browser to return raw data ever again!
I've written a regular expression to get the first two paragraphs from a database clob which stores its content in HTML formatting.
I've checked with these online RegEx builder/checkers here and here, and they both seem to do what I want (I've altered the RegEx slightly since using these checkers, to handle the newline formatting which I found afterwards).
However, when I use this in my PHP, it doesn't return just the group I'm after and instead matches everything.
Here is my preg_replace line:
$description = preg_replace('/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/', "$2", $description);
And here is my testing content in the format of the content I am getting
<p>
Paragraph 1</p>
<p>
Paragraph 2</p>
<p>
Paragraph 3</p>
I've had a look at this SO Post which didn't help.
Any Ideas?
EDIT
As pointed out in one of the comments you cannot Regex HTML in PHP (Don't know why, I'm not really bothered by that).
Now I'm opening the option for getting it in PL/SQL as well.
select
DBMS_LOB.substr(description, 32000, 1) /* How do I make this into a regular expression? */
from
blog_posts
Your input contains newlines, therefore you have to add the s modifier:
/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/s
Otherwise, .* breaks on newlines and the regex doesn't match.
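A small sketch of the difference, using the test content from the question as a string with explicit newlines (the variable names are hypothetical). When the pattern does not match, preg_replace returns the subject unchanged, which is why the original call appeared to "match everything":

```php
<?php
// Test content from the question, with a newline after each opening tag.
$description = "<p>\nParagraph 1</p>\n<p>\nParagraph 2</p>\n<p>\nParagraph 3</p>";

$pattern = '/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/';

// Without s: . cannot cross a newline, the pattern never matches,
// and the input comes back untouched.
$withoutS = preg_replace($pattern, '$2', $description);

// With s: . also matches newlines, so group 2 holds the first two paragraphs.
$withS = preg_replace($pattern . 's', '$2', $description);

var_dump($withoutS === $description); // bool(true)
echo $withS;
```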
You could take a look at the PHP Simple DOM Parser. Going by their manual, you could do something like so:
$html = str_get_html('your html string');
foreach($html->find('p') as $element) //This should get all the paragraph elements in your string.
echo $element->plaintext. '<br>';
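If adding a third-party library is not an option, a similar result can be sketched with PHP's built-in DOMDocument. This is only an illustration under the assumption that the dom extension is enabled; `$firstTwo` is a hypothetical name, and warnings from the fragment input are suppressed:

```php
<?php
$description = "<p>\nParagraph 1</p>\n<p>\nParagraph 2</p>\n<p>\nParagraph 3</p>";

$doc = new DOMDocument();
// The input is a fragment, not a full document, so silence parser warnings.
@$doc->loadHTML($description);

$firstTwo = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    if (count($firstTwo) >= 2) {
        break; // only the first two paragraphs are wanted
    }
    // textContent is the plain text of the element; trim drops the leading newline.
    $firstTwo[] = trim($p->textContent);
}

print_r($firstTwo);
```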
I have this block of html:
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
I'm trying to select the first, non-nested paragraph in that block. I'm using PHP's (perl style) preg_match to find it, but can't seem to figure out how to ignore the p tag contained within the div.
This is what I have so far, but it selects the contents of the first paragraph contained above.
/<p>(.+?)<\/p>/is
Thanks!
EDIT
Unfortunately, I don't have the luxury of a DOM Parser.
I completely appreciate the suggestions to not use RegEx to parse HTML, but that's not really helping my particular use case. I have a very controlled case where an internal application generated structured text. I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text and HTML was the simplest case I could think of to explain. My actual case looks something a little more like this (But a lot more data and minified):
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]
I have to reformat a certain column of certain rows to a ton of rows similar to that. Helping my first question would help actual project.
Your regex won't work. Even if you had only non-nested paragraphs, your capturing parentheses would match First, non-nested ... Last paragraph.
Try:
<([^>]+)>([^<]*<(?!/?\1)[^<]*)*</\1>
and grab \2 if \1 is p.
But an HTML parser would do a better job of that imho.
How about something like this?
<p>([^<>]+)<\/p>(?=(<[^\/]|$))
Does a look-ahead to make sure it is not inside a closing tag; but can be at the end of a string. There is probably a better way to look for what is in the paragraph tags but you need to avoid being too greedy (a .+? will not suffice).
Use a three-step process. First, pray that everything is well formed. Second, remove everything that is nested.
s{<div>.*?</div>}{}g; # HTML example
s/#.*?#//g; # 2nd example
Then get your result. Everything that is left is now not nested.
($result) = m{<p>(.*?)</p>}; # HTML example; list context returns the captured text
($result) = m{\[(.*?)\]}; # 2nd example
(this is Perl. Don't know how different it would look in PHP).
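For comparison, a rough PHP equivalent of the HTML example might look like the sketch below. It assumes the nesting is only one <div> deep, as in the question's sample, and adds the s modifier because the sample spans several lines; `$flat` and `$result` are hypothetical names:

```php
<?php
// Sample block from the question.
$html = "<div>\n<p>First, nested paragraph</p>\n</div>\n"
      . "<p>First, non-nested paragraph.</p>\n"
      . "<p>Second paragraph.</p>\n"
      . "<p>Last paragraph.</p>";

// Step 1: remove everything that is nested inside a <div>.
$flat = preg_replace('#<div>.*?</div>#s', '', $html);

// Step 2: whatever is left is not nested, so take the first <p>.
$result = preg_match('#<p>(.*?)</p>#s', $flat, $m) ? $m[1] : null;

echo $result; // First, non-nested paragraph.
```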
"You shouldn't use regex to parse HTML."
It is what everybody says but nobody really offers an example of how to actually do it, they just preach it. Well, thanks to some motivation from Levi Morrison I decided to read into DomDocument and figure out how to do it.
To everybody that says "Oh, it is too hard to learn the parser, I'll just use regex." Well, I've never done anything with DomDocument or XPath before and this took me 10 minutes. Go read the docs on DomDocument and parse HTML the way you're supposed to.
$myHtml = <<<MARKUP
<html>
<head>
<title>something</title></head>
<body>
<div>
<p>not valid</p>
</div>
<p>is valid</p>
<p>is not valid</p>
<p>is not valid either</p>
<div>
<p>definitely not valid</p>
</div>
</body>
</html>
MARKUP;
$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
$nodeList = $DomXPath->query('body/p');
$yourNode = $DomDocument->saveHtml($nodeList->item(0));
var_dump($yourNode);
// outputs the string '<p>is valid</p>'
You might want to have a look at this post about parsing HTML with Regex.
Because HTML is not a regular language (and regular expressions can only describe regular languages), you can't parse out arbitrary chunks of HTML using regex. Use an HTML parser; it'll get the job done considerably more smoothly than trying to hack together some regex.
As seen in the title, here is the HTML code example:
<body>
<!--CODE_START-->
<p>I <strong>Want</strong> this</p>
<p>And this one too</p>
<!--CODE_STOP-->
<p>This sould be go to trash</p>
<!--CODE_START-->
<p>This one should be included too</p>
<!--CODE_STOP-->
I want everything between <!--CODE_START--> and <!--CODE_STOP-->, so the result should be:
<p>I <strong>Want</strong> this</p>
<p>And this one too</p>
and <p>This one should be included too</p>
I tried /<!--CODE_START-->([^<]*)<!--CODE_STOP-->/ and /<!--CODE_START-->(.*)<!--CODE_STOP-->/ with combinations of pattern modifiers like su, imu and im, but neither works; they just return an empty array. Also, it is a full HTML page that I am trying to grab.
Thanks in advance.
[^<] means any character that is not a <, so it is obviously going to fail at the first <p>. Just catch everything and use the non-greedy option (the U modifier), passing a third argument to receive the matches:
preg_match_all('/<!--CODE_START-->(.*)<!--CODE_STOP-->/sU', $foo, $matches);
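A minimal sketch of that call against a shortened version of the sample page. The `$matches` and `$blocks` names are hypothetical, and the trim step is an addition to drop the newlines next to the markers:

```php
<?php
// Shortened sample based on the HTML in the question.
$foo = <<<HTML
<body>
<!--CODE_START-->
<p>I <strong>Want</strong> this</p>
<p>And this one too</p>
<!--CODE_STOP-->
<p>Unwanted paragraph</p>
<!--CODE_START-->
<p>This one should be included too</p>
<!--CODE_STOP-->
HTML;

// s: . matches newlines; U: quantifiers become non-greedy by default.
preg_match_all('/<!--CODE_START-->(.*)<!--CODE_STOP-->/sU', $foo, $matches);

// $matches[1] holds the text between each pair of markers.
$blocks = array_map('trim', $matches[1]);

print_r($blocks);
```

Without the U modifier (or a lazy `.*?`), the single greedy `.*` would run from the first CODE_START to the last CODE_STOP and include the unwanted paragraph.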
I know, i know... regex is not the best way to extract HTML text. But I need to extract article text from a lot of pages, I can store regexes in the database for each website. I'm not sure how XML parsers would work with multiple websites. You'd need a separate function for each website.
In any case, I don't know much about regexes, so bear with me.
I've got an HTML page in a format similar to this
<html>
<head>...</head>
<body>
<div class=nav>...</div><p id="someshit" />
<div class=body>....</div>
<div class=footer>...</div>
</body>
I need to extract the contents of the body class container.
I tried this.
$pattern = "/<div class=\"body\">\(.*?\)<\/div>/sui"
$text = $htmlPageAsIs;
if (preg_match($pattern, $text, $matches))
echo "MATCHED!";
else
echo "Sorry gambooka, but your text is in another castle.";
What am I doing wrong? My text ends up in another castle.
*EDIT: ooohh... never mind, I found readability's code
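You are matching class="body", but your document has class=body, so the quotes in your pattern never match. The backslashes before the parentheses also make them match literal ( and ) characters instead of forming a capture group. Use "/<div class=\"?body\"?>(.*?)<\/div>/sui".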
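A short sketch of the pattern with optional quotes, run against a hypothetical one-line page. It assumes the body container holds no nested <div>, since the lazy `(.*?)` stops at the first closing </div>:

```php
<?php
// Hypothetical minimal page using unquoted class attributes, as in the question.
$text = '<div class=nav>...</div><div class=body>article text</div><div class=footer>...</div>';

// \"? makes each quote optional, so class=body and class="body" both match.
$pattern = "/<div class=\"?body\"?>(.*?)<\/div>/sui";

$body = null;
if (preg_match($pattern, $text, $matches)) {
    $body = $matches[1]; // contents of the capture group
}

echo $body; // article text
```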