Ok so I have a regular expression I'm trying to use to match a certain pattern in some html files. Here's the preg_match statement:
preg_match('#<'.$htmlElementType.' id\s*=\s*"{{ALViewElement_'.$this->_elementId.'}}".*>[\s\S]*</'.$htmlElementType.'(>)#i', $htmlString, $newMatches, PREG_OFFSET_CAPTURE)
To be clear, this is attempting to match an html element with an id of {{ALViewElement_.*}} but it also needs to end itself with a closing tag, for example if $htmlElementType was "section" it would end in "/section>".
If my html looked just like this with nothing else in it, it works as expected:
<section id="{{ALViewElement_resume}}">
<!--{{RESUME_ADD_CHANGE_PIECE}}-->
<!--{{RESUME}}-->
</section>
The problem is when we have a section element later in the html and it ALSO has a closing /section>. Example:
<section id="{{ALViewElement_resume}}">
<!--{{RESUME_ADD_CHANGE_PIECE}}-->
<!--{{RESUME}}-->
</section>
<div>
</div>
<section>
HEY THIS IS ME
</section>
In this case the full match is everything above. But I want it to stop at the </section> that closes my first one. This is important because later on in my code I need the location of the last > in that ending tag.
Any ideas how I could change this regular expression a little bit?
Thanks for the help!
Yes, just use an ungreedy quantifier:
preg_match('#<'.$htmlElementType.' id\s*=\s*"{{ALViewElement_'.$this->_elementId.'}}".*?>[\s\S]*?</'.$htmlElementType.'(>)#i', $htmlString, $newMatches, PREG_OFFSET_CAPTURE)
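A minimal sketch of how the lazy version behaves on the sample HTML from the question. The variables `$type` and `$id` are assumed stand-ins for `$htmlElementType` and `$this->_elementId`, and the `preg_quote` calls and escaped braces are additions that are not in the original pattern. With `PREG_OFFSET_CAPTURE`, group 1 holds the final `>` together with its byte offset.

```php
<?php
// Sample input from the question: the target section followed by another section.
$htmlString = <<<'HTML'
<section id="{{ALViewElement_resume}}">
<!--{{RESUME_ADD_CHANGE_PIECE}}-->
<!--{{RESUME}}-->
</section>
<div>
</div>
<section>
HEY THIS IS ME
</section>
HTML;

// Assumed stand-ins for $htmlElementType and $this->_elementId.
$type = 'section';
$id   = 'resume';

// Lazy quantifiers (.*? and [\s\S]*?) stop at the first closing tag they reach.
$pattern = '#<' . preg_quote($type, '#')
    . ' id\s*=\s*"\{\{ALViewElement_' . preg_quote($id, '#') . '\}\}".*?>'
    . '[\s\S]*?</' . preg_quote($type, '#') . '(>)#i';

if (preg_match($pattern, $htmlString, $m, PREG_OFFSET_CAPTURE)) {
    $fullMatch   = $m[0][0]; // matched text
    $closeOffset = $m[1][1]; // byte offset of the final ">" in $htmlString
    echo $fullMatch, "\n", $closeOffset, "\n";
}
```

Note that a lazy match still ends at the first closing tag of that type, so it will cut short if the target element contains a nested element of the same type.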
Another way, with DOMDocument:
$html = <<<LOD
<section id="{{ALViewElement_resume}}">
<!--{{RESUME_ADD_CHANGE_PIECE}}-->
<!--{{RESUME}}-->
</section>
<div>
</div>
<section>
HEY THIS IS ME
</section>
LOD;
$doc= new DOMDocument();
@$doc->loadHTML($html);
$node = $doc->getElementById("{{ALViewElement_resume}}");
$docv = new DOMDocument();
$docv->appendChild($docv->importNode($node, TRUE));
$result = $docv->saveHTML();
echo htmlspecialchars($result);
I am trying to replace every character (including newlines, tabs, whitespace, etc.) between nodes that have the same tag name. The problem is that the regex treats the separate nodes as one, matching from the first opening tag to the last closing tag, and then outputs a single result.
For Example:
$html_string = "
<div> Below are object Node with the html code </div>
<script> alert('i want this to be replaced. it has no newline'); </script>
<div> I don't want this to be replaced </div>
<script>
console.log('i also want this to be replaced. It has newline');
</script>
<div> This is a div tag and not a script, so it should not be replaced </div>
<script> console.warn('Finally, this should be replaced, it also has newline');
</script>
<div> The above is the final result of the replacements </div> ";
$regex = '/(?:\<script\>)(.*)?(?:\<\/script\>)/ims';
$result = preg_replace($regex, '<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->', $html_string);
echo $result;
Expected Result:
<div> Below are object Node with the html code </div>
<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->
<div> I don't want this to be replaced </div>
<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->
<div> This is a div tag and not a script, so it should not be replaced </div>
<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->
<div> The above is the final result of the replacements </div>
Actual Output:
<div> Below are object Node with the html code </div>
<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->
<div> The above is the final result of the replacements </div>
How can I sort this out? Thanks in advance.
Using DOMDocument is generally preferable to trying to parse HTML with regex. Based on your question, this will give you the results you want. It finds each script node in the HTML and replaces it with the comment you specified:
$doc = new DOMDocument();
$doc->loadHTML("<html>$html_string</html>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//script') as $script) {
$comment = $doc->createComment('THIS SCRIPT CONTENT HERE HAS BEEN ALTERED');
$script->parentNode->replaceChild($comment, $script);
}
echo substr($doc->saveHTML(), 6, -8);
Note that because you don't have a top-level element in the HTML, one (<html>) has to be added on read and then removed on output (using substr).
Output:
<div> Below are object Node with the html code </div>
<!--THIS SCRIPT CONTENT HERE HAS BEEN ALTERED-->
<div> I don't want this to be replaced </div>
<!--THIS SCRIPT CONTENT HERE HAS BEEN ALTERED-->
<div> This is a div tag and not a script, so it should not be replaced </div>
<!--THIS SCRIPT CONTENT HERE HAS BEEN ALTERED-->
<div> The above is the final result of the replacements </div>
If you insist on using regex (but you should read this before you do), the problem with your regex lies in this part:
(.*)?
This looks for an optional string of as many characters as possible, leading up to </script>. So it basically absorbs all the characters between the first <script> and the last </script> (because all the characters in </script> match .). What you actually wanted was (.*?) which is non-greedy and so matches only up to the first </script> i.e.
$regex = '/(?:\<script\>)(.*?)(?:\<\/script\>)/ims';
$result = preg_replace($regex, '<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->', $html_string);
echo $result;
The output from this is as you require.
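A small sketch of the difference, using a shortened input that is an assumption for illustration rather than the asker's full string. It counts how many matches each pattern finds.

```php
<?php
// Two separate script elements on different lines (illustrative input).
$html = "<script>a();</script>\n<div>keep</div>\n<script>\nb();\n</script>";

// (.*)? is still greedy: the optional group grabs as much as it can.
$greedy = preg_match_all('/<script>(.*)?<\/script>/is', $html, $g);

// (.*?) is lazy: it stops at the first closing tag it can reach.
$lazy = preg_match_all('/<script>(.*?)<\/script>/is', $html, $l);

echo "greedy: $greedy, lazy: $lazy\n"; // prints "greedy: 1, lazy: 2"
```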
Example: https://regex101.com/r/nHiyU3/1
CODE:
<div id="content">
<div>
<div class="col-image"></div> <!-- STOPS HERE -->
THIS CONTENT HERE DOES NOT GET CAPTURED
</div>
</div>
REGEX:
/<div id=[\'|"]content[\'|"][^>]*>(.*)<\/div>/sUi
So it stops where I've added the STOPS HERE note.
Any reason why? I have followed other topics on SO but can't get it to grab the whole lot.
I know how to match across multiple lines; the problem is finding the matching closing tag.
As mentioned in the comments, it's easier to achieve this using DOMDocument. For example:
<?php
$html = '<div id="content"><div><div class="col-image"></div> <!-- STOPS HERE -->THIS CONTENT HERE DOES NOT GET CAPTURED</div></div>';
// loadHTML() is an instance method; calling it statically fails on PHP 8.
$domDocument = new DOMDocument();
$domDocument->loadHTML($html);
$divList = $domDocument->getElementsByTagName('div');
foreach ($divList as $div) {
    var_dump($div);
}
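To get what the regex was after (everything inside the div with id "content"), one possible sketch is to select that element with XPath and serialize its children. The variable names here are illustrative, and the exact whitespace of the serialized output may differ from the input.

```php
<?php
$html = '<div id="content"><div><div class="col-image"></div> <!-- STOPS HERE -->'
      . 'THIS CONTENT HERE DOES NOT GET CAPTURED</div></div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath   = new DOMXPath($doc);
$content = $xpath->query('//div[@id="content"]')->item(0);

// Serialize each child node to rebuild the element's inner HTML.
$inner = '';
if ($content !== null) {
    foreach ($content->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}
echo $inner, "\n";
```

Because the parser tracks nesting, the nested closing </div> tags no longer matter, which is the part the regex could not handle.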
I have a variable in which I store some HTML code.
Let's say:
<div>
<span> test of {my_string} </span>
{my_string}
test of {my_string}
</div>
<h1> {my_string} </h1>
I would need to remove some lines containing a specific value so the end result looks like:
<div>
</div>
So I was thinking of getting the position of the string with strpos and then get the \n which are before and after. But how can I search backwards with strpos as I already have an offset specified?
$rep_pos = strpos($message, 'my_string');
$line_begining = ????
$line_end = strpos($message, "\n", $rep_pos);
I can't use strip_tags because I don't know in advance what will be the tags around and some other strings can use the same tags.
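You should use DOMDocument to parse HTML strings. Here we use the XPath query //*[text()=" my_string "], which selects every element that has a text node exactly equal to " my_string " (including the surrounding spaces). To select elements whose text merely contains the value, use //*[text()[contains(., "my_string")]] instead.
Try this code snippet: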
<?php
ini_set('display_errors', 1);
$string='<html>
<body>
<div>
<span> my_string </span>
</div>
<h1> my_string </h1>
</body>
</html>';
$domobject= new DOMDocument();
$domobject->loadHTML($string);
$xpath= new DOMXPath($domobject);
$result=$xpath->query('//*[text()=" my_string "]');
foreach ($result as $nodes) {
    $nodes->parentNode->removeChild($nodes);
}
echo $domobject->saveHTML();
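If a purely line-based removal is what is wanted, as the question describes, a sketch that avoids strpos offsets altogether is to split the text into lines and drop the ones containing the marker. This assumes the lines are separated by "\n" and that the marker is the literal text {my_string}.

```php
<?php
// The HTML from the question, with lines separated by "\n" (assumed).
$message = "<div>\n<span> test of {my_string} </span>\n{my_string}\ntest of {my_string}\n</div>\n<h1> {my_string} </h1>";

// Split into lines and keep only the lines that do not contain the marker.
$lines = explode("\n", $message);
$kept  = array_filter($lines, function ($line) {
    return strpos($line, '{my_string}') === false;
});

$result = implode("\n", $kept);
echo $result, "\n";
```

For the literal question about searching backwards, `strrpos(substr($message, 0, $rep_pos), "\n")` returns the position of the last newline before `$rep_pos`, or false if there is none. The newline has to be written in double quotes, since '\n' in single quotes is a backslash followed by the letter n.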
I am trying to get a specific portion of HTML with preg_match_all by matching on its class attribute, but it returns an empty array.
This is the HTML portion I want to get from the complete HTML:
<div class="details">
<div class="title">
<a href="citation.cfm?id=2892225&CFID=598850954&CFTOKEN=15595705"
target="_self">Restrictification of function arguments</a>
</div>
</div>
Where I am using this regex
preg_match_all('~<div class=\'details\'>\s*(<div.*?</div>\s*)?(.*?)</div>~is', $html, $matches );
NOTE: the $html variable holds the whole HTML that I want to search.
Thanks.
You are looking for single quotes in your regex in contrast to the double quotes in $html.
Your regex should look like:
'~<div class="details">\s*(<div.*?</div>\s*)?(.*?)</div>~is'
or better:
'~<div class=[\'"]details[\'"]>\s*(<div.*?</div>\s*)?(.*?)</div>~is'
Better to use a DOM approach:
<?php
$html = '<div class="details">
<div class="title">
<a href="citation.cfm?id=2892225&CFID=598850954&CFTOKEN=15595705"
target="_self">Restrictification of function arguments</a>
</div>
</div>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$divs = $xpath->query('//div[@class="title"]');
print_r($divs);
?>
How would I get content from HTML between h3 tags inside an element that has class pricebox? For example, the following string fragment
<!-- snip a lot of other html content -->
<div class="pricebox">
<div class="misc_info">Some misc info</div>
<h3>599.99</h3>
</div>
<!-- snip a lot of other html content -->
The catch is 599.99 has to be the first match returned, that is if the function call is
preg_match_all($regex,$string,$matches)
the 599.99 has to be in $matches[0][1] (because I use the same script to get numbers from dissimilar looking strings with different $regex - the script looks for the first match).
Try using XPath; definitely NOT RegEx.
Code :
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.path.to/your_html_file_html');
$xpath = new DOMXPath( $html );
$nodes = $xpath->query("//div[@class='pricebox']/h3");
foreach ($nodes as $node)
{
echo $node->nodeValue."";
}
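A self-contained sketch of the same XPath query run against the fragment from the question, loaded from a string rather than a URL. The variable names are illustrative, and taking only the first node is an assumption based on the asker wanting the first match.

```php
<?php
// The pricebox fragment from the question.
$string = '<div class="pricebox">
<div class="misc_info">Some misc info</div>
<h3>599.99</h3>
</div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($string);

$xpath = new DOMXPath($doc);
// h3 elements that are direct children of a div whose class is exactly "pricebox".
$nodes = $xpath->query("//div[@class='pricebox']/h3");

// Take the first match only, or null when nothing matched.
$price = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
echo $price, "\n"; // prints "599.99"
```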