As seen in the title, here is the HTML code example:
<body>
<!--CODE_START-->
<p>I <strong>Want</strong> this</p>
<p>And this one too</p>
<!--CODE_STOP-->
<p>This should go to trash</p>
<!--CODE_START-->
<p>This one should be included too</p>
<!--CODE_STOP-->
The question is, I want everything inside <!--CODE_START--> and <!--CODE_STOP--> so the result should be:
<p>I <strong>Want</strong> this</p>
<p>And this one too</p>
and <p>This one should be included too</p>
I tried /<!--CODE_START-->([^<]*)<!--CODE_STOP-->/ and /<!--CODE_START-->(.*)<!--CODE_STOP-->/ with combinations of pattern modifiers such as su, imu and im, but neither works; both just return an empty array. Also, it is a full HTML page that I am trying to grab from.
Thanks in advance.
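[^<] means any character that is not a <, so the first pattern is bound to fail as soon as it reaches <p>. Just capture everything and use the non-greedy option (the U modifier), together with s so that . also matches newlines:
preg_match_all('/<!--CODE_START-->(.*)<!--CODE_STOP-->/sU', $foo, $matches);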
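A minimal sketch of that approach applied to the sample from the question. The variable names ($html, $matches, $blocks) are illustrative, and the lazy quantifier .*? is used here in place of the U modifier; for this pattern the two behave the same.

```php
<?php
// Sample input from the question (variable name is illustrative).
$html = <<<'HTML'
<body>
<!--CODE_START-->
<p>I <strong>Want</strong> this</p>
<p>And this one too</p>
<!--CODE_STOP-->
<p>This should go to trash</p>
<!--CODE_START-->
<p>This one should be included too</p>
<!--CODE_STOP-->
HTML;

// s: let . match newlines; .*? stops at the nearest CODE_STOP marker.
preg_match_all('/<!--CODE_START-->(.*?)<!--CODE_STOP-->/s', $html, $matches);

// $matches[1] holds one entry per START/STOP pair; trim the surrounding newlines.
$blocks = array_map('trim', $matches[1]);
print_r($blocks);
```

Each element of $blocks is the markup between one pair of markers, so the "trash" paragraph between the two pairs is left out.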
Related
I would like to process the html from a webpage and extract the paragraphs that match my criteria. The flavor of regex is PHP.
This is the sample webpage HTML:
<div class="special">
<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>
</div>
The regex looks between the <div class="special"> and </div> tags and puts everything into a capture group or variable for reference in the next step. This next step is what I am having trouble with. I cannot for the life of me write a regex that captures each paragraph of text between <p> and </p>.
I have tried /<p>(.+?)<\/p>/s which returns:
<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>
I would like each paragraph to be returned individually as items in an array. The non greedy ? does not seem to work. Any suggestions?
You have to escape your slash for the p tag.
So it's going to be
/<p>(.+?)<\/p>/s
So stupid! The regex works perfectly. All the regexes work perfectly. The problem was with the input.
The input HTML file I was processing had the following structure which made the regex not work.
<p>Some interesting text I would like to extract
<p>More interesting text I would like to extract
<p>Even more interesting text I would like to extract</p></p></p>
I used var_dump(htmlfile.html) to see the HTML page I was getting but my browser processed it so I was not getting the raw data. I was able to get the raw data and find my mistake by using:
include 'filename.php';
file_put_contents('filename.php', $data);
Now I know to not trust my browser to return raw data ever again!
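Another way to look at the raw markup in a browser is to escape it before printing, so the tags are displayed instead of rendered. A minimal sketch, assuming the fetched page is held in a string; the $data content below is a made-up stand-in:

```php
<?php
// Hypothetical raw markup; in practice this would be the fetched page.
$data = "<p>Some text\n<p>More text</p></p>";

// Escape the markup so the browser shows the tags instead of interpreting them.
$escaped = htmlspecialchars($data, ENT_QUOTES, 'UTF-8');
echo '<pre>' . $escaped . '</pre>';
```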
I am using preg_replace to replace HTML comment tags with empty space but it seems to be replacing the whole HTML comment with empty space.
echo preg_replace('/<!--(.*?)-->/','',$r->pageCont);
Where $r->pageCont is a database entry containing HTML, for example:
<div class="col-lg-12">
<p>The year is:</p>
<!-- <?php echo date('Y'); ?> -->
</div>
In the above example, the HTML comment tags would be stripped away leaving only the PHP code to echo the year. Like I said, what is happening is the entire HTML comment is being stripped away.
Can someone recommend a pattern to use? Would appreciate your input.
EDIT: updated question to reflect the code I am using.
It seems you are trying to replace the comment with the PHP code inside it. If so, use $1 as the replacement string so that it refers to capture group 1.
echo preg_replace('/<!--(.*?)-->/', '$1', $r->pageCont);
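A short sketch of what that replacement does to the sample, assuming a plain string $pageCont stands in for $r->pageCont. The s modifier is added here as an assumption so that multi-line comments are handled too.

```php
<?php
// Stand-in for $r->pageCont from the question.
$pageCont = <<<'HTML'
<div class="col-lg-12">
<p>The year is:</p>
<!-- <?php echo date('Y'); ?> -->
</div>
HTML;

// '$1' puts back whatever was captured between <!-- and -->,
// so only the comment markers themselves are removed.
$stripped = preg_replace('/<!--(.*?)-->/s', '$1', $pageCont);
echo $stripped;
```

Note that echoing the result only outputs the PHP source as text; it does not execute it.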
I must be overcomplicating this, but I can't figure it out for the life of me.
I have a standard html document stored as a string, and I need to get the contents of the paragraph.
I'll make an example case.
$stringHTML=
"<html>
<head>
<title>Title</title>
</head>
<body>
<p>This is the first paragraph</p>
<p>This is the second</p>
<p>This is the third</p>
<p>And fourth</p>
</body>
</html>";
If I use
$regex='~(<p>)(.*)(</p>)~i';
preg_match_all($regex, $stringHTML, $newVariable);
I won't get 4 results. Rather, I'll get 10. I get 10 because the regex matches the first <p> and first </p> as well as the first <p> and fourth </p>
How can I search between two words and return only what's between each pair of paragraph tags?
Use an HTML parser such as DOM or XPath to parse HTML. Don't use regex to parse HTML. Here is how easily it can be parsed with DOMDocument.
$doc = new \DOMDocument;
$doc->loadHTML($stringHTML);
$ps = $doc->getElementsByTagName("p");
for($i=0;$i<$ps->length; $i++){
echo $ps->item($i)->textContent. "\n";
}
Using this regex (since you said it's regex practice) you'll get 4 results.
preg_match_all("#<p>(.*)</p>#", $stringHTML, $matches);
print_r($matches[1]);
This works here because, without the s modifier, . does not match newlines, and each paragraph in the sample sits on its own line.
Use .*? to get the shortest match instead of the longest match.
Your regex should be /<p>(.*?)<\/p>/i. It will match only the strings between <p> and </p> and put them in an array.
You don't need a capture group around the literal tags, e.g. (<p>).
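The difference between the greedy and lazy quantifier shows up as soon as the paragraphs share a line. A minimal sketch with a made-up single-line input (the $html string and variable names are illustrative):

```php
<?php
// Illustrative minified input: all paragraphs on a single line.
$html = '<p>This is the first paragraph</p><p>This is the second</p><p>And third</p>';

// Greedy: .* runs to the last </p>, so the whole line becomes one match.
preg_match_all('~<p>(.*)</p>~i', $html, $greedy);

// Lazy: .*? stops at the nearest </p>, giving one match per paragraph.
// The s modifier lets a paragraph span several lines.
preg_match_all('~<p>(.*?)</p>~is', $html, $lazy);

print_r($lazy[1]);
```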
I have a field in a database containing <p>intro content</p><p>rest of content....</p>
I would like to take the intro part, meaning the content between the first <p>...</p>.
After that I want to remove the intro, <p>intro content</p>, from the content.
Could you help me with that?
If you want to do any kind of parsing on HTML, you really shouldn't use regex. However, your question is fairly simple if all you want is the text content of the first <p> tag in a string:
preg_match('#<p>.*?</p>#', $string, $matches);
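To cover both halves of the question (taking the intro and removing it from the rest), something along these lines might do. It is a sketch under the assumption that the field always starts with a simple, non-nested <p>; $content, $intro and $rest are illustrative names.

```php
<?php
// Stand-in for the database field from the question.
$content = '<p>intro content</p><p>rest of content....</p>';

$intro = '';
$rest = $content;

// Capture the inner text of the first <p>...</p> only.
if (preg_match('#<p>(.*?)</p>#s', $content, $m)) {
    $intro = $m[1];
    // A limit of 1 removes just the first paragraph and leaves the others.
    $rest = preg_replace('#<p>.*?</p>#s', '', $content, 1);
}

echo $intro, "\n", $rest, "\n";
```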
I have this block of html:
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
I'm trying to select the first, non-nested paragraph in that block. I'm using PHP's (perl style) preg_match to find it, but can't seem to figure out how to ignore the p tag contained within the div.
This is what I have so far, but it selects the contents of the first paragraph contained above.
/<p>(.+?)<\/p>/is
Thanks!
EDIT
Unfortunately, I don't have the luxury of a DOM Parser.
I completely appreciate the suggestions to not use RegEx to parse HTML, but that's not really helping my particular use case. I have a very controlled case where an internal application generated structured text. I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text and HTML was the simplest case I could think of to explain. My actual case looks something a little more like this (But a lot more data and minified):
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]
I have to reformat a certain column of certain rows in a ton of rows similar to that. An answer to my first question would help with the actual project.
Your regex won't work. Even if you had only non nested paragraph, your capturing parentheses would match First, non-nested ... Last paragraph..
Try:
<([^>]+)>([^<]*(?:<(?!/?\1)[^<]*)*)</\1>
and grab \2 if \1 is p.
But an HTML parser would do a better job of that imho.
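A sketch of how such a backreference pattern could be used from PHP. The pattern below is written so that group 1 is the tag name and group 2 is everything up to the matching close tag, ending in </\1>; it assumes tags without attributes and no element nested inside another of the same name. Variable names are illustrative.

```php
<?php
$html = <<<'HTML'
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
HTML;

// Group 1: tag name. Group 2: content that contains no opening or
// closing tag starting with the same name. Ends at the matching </tag>.
$pattern = '~<([^>]+)>([^<]*(?:<(?!/?\1)[^<]*)*)</\1>~';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

// The <div> block is consumed as one match, so the first match whose
// tag is "p" is the first paragraph that is not nested.
$firstTopLevel = null;
foreach ($matches as $m) {
    if ($m[1] === 'p') {
        $firstTopLevel = $m[2];
        break;
    }
}

echo $firstTopLevel, "\n";
```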
How about something like this?
<p>([^<>]+)<\/p>(?=(<[^\/]|$))
Does a look-ahead to make sure it is not inside a closing tag; but can be at the end of a string. There is probably a better way to look for what is in the paragraph tags but you need to avoid being too greedy (a .+? will not suffice).
Use a three step process. First, pray that everything is well formed. Second, remove everything that is nested.
s{<div>.*?</div>}{}gs; # HTML example; s lets . match newlines
s/#.*?#//g; # 2nd example
Then get your result. Everything that is left is now not nested.
($result) = m{<p>(.*?)</p>}; # HTML example; list context returns the capture
($result) = m{\[(.*?)\]}; # 2nd example
(this is Perl. Don't know how different it would look in PHP).
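For reference, a rough PHP equivalent of the same steps might look like this. It is a sketch rather than a tested port: preg_replace stands in for s///, preg_match for m//, and the variable names are illustrative. The rows in the second example are taken from the question.

```php
<?php
// HTML example: strip the nested block, then take the first remaining paragraph.
$html = <<<'HTML'
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
HTML;

$flat = preg_replace('~<div>.*?</div>~s', '', $html);
preg_match('~<p>(.*?)</p>~s', $flat, $m);
$firstParagraph = $m[1] ?? null;

// Second example: drop the #...# rows, then take the first bracketed row.
$data = <<<'DATA'
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
DATA;

$rows = preg_replace('~#.*?#~s', '', $data);
preg_match('~\[(.*?)\]~', $rows, $r);
$firstRow = $r[1] ?? null;

echo $firstParagraph, "\n", $firstRow, "\n";
```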
"You shouldn't use regex to parse HTML."
It is what everybody says but nobody really offers an example of how to actually do it, they just preach it. Well, thanks to some motivation from Levi Morrison I decided to read into DomDocument and figure out how to do it.
To everybody that says "Oh, it is too hard to learn the parser, I'll just use regex." Well, I've never done anything with DomDocument or XPath before and this took me 10 minutes. Go read the docs on DomDocument and parse HTML the way you're supposed to.
$myHtml = <<<MARKUP
<html>
<head>
<title>something</title></head>
<body>
<div>
<p>not valid</p>
</div>
<p>is valid</p>
<p>is not valid</p>
<p>is not valid either</p>
<div>
<p>definitely not valid</p>
</div>
</body>
</html>
MARKUP;
$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
$nodeList = $DomXPath->query('body/p');
$yourNode = $DomDocument->saveHtml($nodeList->item(0));
var_dump($yourNode);
// output '<p>is valid</p>'
You might want to have a look at this post about parsing HTML with Regex.
Because HTML is not a regular language (and regular expressions describe only regular languages), you can't parse out arbitrary chunks of HTML using regex. Use an HTML parser; it will get the job done considerably more smoothly than trying to hack together some regex.