Get specific html portion with regex string matching in php

I am trying to extract a specific portion of HTML with preg_match_all by matching on its class attribute, but it returns an empty array.
This is the HTML portion I want to extract from the complete HTML:
<div class="details">
<div class="title">
<a href="citation.cfm?id=2892225&CFID=598850954&CFTOKEN=15595705"
target="_self">Restrictification of function arguments</a>
</div>
</div>
I am using this regex:
preg_match_all('~<div class=\'details\'>\s*(<div.*?</div>\s*)?(.*?)</div>~is', $html, $matches );
Note: the $html variable holds the complete HTML that I want to search.
Thanks.

Your regex looks for single quotes, but the HTML in $html uses double quotes.
Your regex should look like:
'~<div class="details">\s*(<div.*?</div>\s*)?(.*?)</div>~is'
or, better, accept either quote style:
'~<div class=[\'"]details[\'"]>\s*(<div.*?</div>\s*)?(.*?)</div>~is'
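As a quick check, the corrected pattern can be run against the sample markup from the question. This is only a minimal sketch: the $html string is copied from the question, and the variable names $pattern and $count are illustrative.

```php
<?php
// Sample markup from the question (attributes use double quotes).
$html = '<div class="details">
<div class="title">
<a href="citation.cfm?id=2892225&CFID=598850954&CFTOKEN=15595705"
target="_self">Restrictification of function arguments</a>
</div>
</div>';

// [\'"] accepts either quote style around the class value.
$pattern = '~<div class=[\'"]details[\'"]>\s*(<div.*?</div>\s*)?(.*?)</div>~is';

$count = preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full matches, $matches[1] the optional inner <div> block.
echo $count, " match(es)\n";
echo $matches[1][0];
```

The pattern still assumes the class attribute is exactly "details" and comes first in the tag, so it remains fragile against real-world markup.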

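It is better to use a DOM approach:
<?php
$html = '<div class="details">
<div class="title">
<a href="citation.cfm?id=2892225&CFID=598850954&CFTOKEN=15595705"
target="_self">Restrictification of function arguments</a>
</div>
</div>';
// The bare "&" characters in the href may trigger libxml warnings; collect them instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
// Select every <div> whose class attribute is exactly "title" (note the @ before the attribute name).
$divs = $xpath->query('//div[@class="title"]');
foreach ($divs as $div) {
    // saveHTML($node) serializes only that node.
    echo $doc->saveHTML($div), "\n";
}
?>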

Related

Removing portion between two \n when a specific sub-string is there [PHP]

I have a variable in which I store some HTML code.
Let's say:
<div>
<span> test of {my_string} </span>
{my_string}
test of {my_string}
</div>
<h1> {my_string} </h1>
I would need to remove some lines containing a specific value so the end result looks like:
<div>
</div>
So I was thinking of getting the position of the string with strpos and then get the \n which are before and after. But how can I search backwards with strpos as I already have an offset specified?
$rep_pos = strpos($message, 'my_string');
$line_begining = ????
$line_end = strpos($message, '\n', $rep_pos);
I can't use strip_tags because I don't know in advance what will be the tags around and some other strings can use the same tags.

You should use DOMDocument to parse HTML strings. The XPath query used here, //*[text()=" my_string "], selects every element that has a text node exactly equal to " my_string " (including the surrounding spaces).
Try this code snippet:
<?php
ini_set('display_errors', 1);
$string='<html>
<body>
<div>
<span> my_string </span>
</div>
<h1> my_string </h1>
</body>
</html>';
$domobject= new DOMDocument();
$domobject->loadHTML($string);
$xpath= new DOMXPath($domobject);
$result=$xpath->query('//*[text()=" my_string "]');
foreach ($result as $node) {
    // Detach each matching element from its parent.
    $node->parentNode->removeChild($node);
}
echo $domobject->saveHTML();
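The question also asks how to search backwards from a known position. A minimal sketch of one way to do that, assuming plain line-based text and a needle that contains no newline: strrpos() accepts a negative offset, which restricts the search to the part of the string before that point. The helper name removeLinesContaining is hypothetical.

```php
<?php
// Hypothetical helper: removes every line of $text that contains $needle.
// Assumes $needle itself contains no newline.
function removeLinesContaining(string $text, string $needle): string
{
    if ($needle === '') {
        return $text; // nothing sensible to search for
    }
    while (($pos = strpos($text, $needle)) !== false) {
        // With a negative offset, strrpos() only considers matches that start
        // at or before strlen($text) + offset, i.e. at or before $pos here.
        // Note the double quotes: '\n' in single quotes is a backslash and an "n".
        $prev = strrpos($text, "\n", $pos - strlen($text));
        $lineStart = ($prev === false) ? 0 : $prev + 1;

        // Forward search for the end of the same line.
        $next = strpos($text, "\n", $pos);
        $lineEnd = ($next === false) ? strlen($text) : $next + 1;

        // Cut the whole line, including its trailing newline.
        $text = substr($text, 0, $lineStart) . substr($text, $lineEnd);
    }
    return $text;
}

$message = "<div>\n<span> test of {my_string} </span>\n{my_string}\n"
         . "test of {my_string}\n</div>\n<h1> {my_string} </h1>";
$cleaned = removeLinesContaining($message, '{my_string}');
echo $cleaned;
```

This works on raw lines rather than on the HTML structure, so it only fits input where each element to remove sits on its own line.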

How can I strip html tags except some of them?

I need to remove all html codes from a php string except:
<p>
<em>
<small>
The strip_tags() function is good, but it strips all HTML tags. How can I tell it to remove all HTML except the tags above?

You should check out the manual: Example #1, strip_tags() example.
Syntax: strip_tags ( Your-string, Allowable-Tags )
If you pass the second parameter, the tags listed in it will not be stripped.
strip_tags($string, '<p><em><small>');
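A short sketch of that call on a sample string; the $input value is invented for illustration.

```php
<?php
// Illustrative input: <div> and <b> are not in the allowed list.
$input = '<div><p>Keep <em>this</em> and <small>this</small>, but drop the <b>bold</b> tag.</p></div>';

// Only <p>, <em> and <small> survive; the text inside stripped tags is kept.
$clean = strip_tags($input, '<p><em><small>');

echo $clean;
// <p>Keep <em>this</em> and <small>this</small>, but drop the bold tag.</p>
```

Note that strip_tags() leaves the attributes of allowed tags untouched, so it is not a sanitizer on its own.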

According to your comment, you want to remove HTML elements only if they have some class or attribute. You'll need to build up a DOM then:
<?php
$data = <<<DATA
<div>
<p>This line shall stay</p>
<p class="myclass">Remove this one</p>
<p>I will be deleted as well</p>
<p>But keep this</p>
</div>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($dom);
// Select every element that has at least one attribute (@* means "any attribute").
$elements_to_be_removed = $xpath->query("//*[count(@*)>0]");
foreach ($elements_to_be_removed as $element) {
$element->parentNode->removeChild($element);
}
// just to check
echo $dom->saveHTML();
?>
To change which elements are removed, change the query. For example, to remove all elements with the class myclass, it must read "//*[@class='myclass']".
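A sketch of the class-based variant, under the assumption that an element may carry several classes: an exact @class='myclass' comparison would miss class="note myclass", so the query below uses the common concat/normalize-space idiom instead. The sample markup is invented for illustration.

```php
<?php
$data = '<div>'
      . '<p>This line shall stay</p>'
      . '<p class="myclass">Remove this one</p>'
      . '<p class="note myclass">Remove this one too</p>'
      . '<p>But keep this</p>'
      . '</div>';

$dom = new DOMDocument();
// NOIMPLIED/NODEFDTD keep libxml from wrapping the fragment in <html><body> and a doctype.
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Matches "myclass" as a whole word anywhere in the class attribute.
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' myclass ')]";

// Copy the node list to an array first so removal cannot disturb the iteration.
foreach (iterator_to_array($xpath->query($query)) as $element) {
    $element->parentNode->removeChild($element);
}

$result = $dom->saveHTML();
echo $result;
```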

PHP preg_match - matching html elements

Ok so I have a regular expression I'm trying to use to match a certain pattern in some html files. Here's the preg_match statement:
preg_match('#<'.$htmlElementType.' id\s*=\s*"{{ALViewElement_'.$this->_elementId.'}}".*>[\s\S]*</'.$htmlElementType.'(>)#i', $htmlString, $newMatches, PREG_OFFSET_CAPTURE)
To be clear, this is attempting to match an html element with an id of {{ALViewElement_.*}} but it also needs to end itself with a closing tag, for example if $htmlElementType was "section" it would end in "/section>".
If my html looked just like this with nothing else in it, it works as expected:
<section id="{{ALViewElement_resume}}">
<!--{{RESUME_ADD_CHANGE_PIECE}}-->
<!--{{RESUME}}-->
</section>
The problem is when we have a section element later in the html and it ALSO has a closing /section>. Example:
<section id="{{ALViewElement_resume}}">
<!--{{RESUME_ADD_CHANGE_PIECE}}-->
<!--{{RESUME}}-->
</section>
<div>
</div>
<section>
HEY THIS IS ME
</section>
In this case the full match is everything above. But I want it to stop at the </section> that closes my first one. This is important because later on in my code I need the location of the last > in that ending tag.
Any ideas how I could change this regular expression a little bit?
Thanks for the help!

Yes, just use an ungreedy quantifier:
preg_match('#<'.$htmlElementType.' id\s*=\s*"{{ALViewElement_'.$this->_elementId.'}}".*?>[\s\S]*?</'.$htmlElementType.'(>)#i', $htmlString, $newMatches, PREG_OFFSET_CAPTURE)
Another way is to use DOMDocument:
$html = <<<LOD
<section id="{{ALViewElement_resume}}">
<!--{{RESUME_ADD_CHANGE_PIECE}}-->
<!--{{RESUME}}-->
</section>
<div>
</div>
<section>
HEY THIS IS ME
</section>
LOD;
$doc = new DOMDocument();
// The @ suppresses the warnings libxml raises for markup it does not recognize.
@$doc->loadHTML($html);
$node = $doc->getElementById("{{ALViewElement_resume}}");
$docv = new DOMDocument();
$docv->appendChild($docv->importNode($node, TRUE));
$result = $docv->saveHTML();
echo htmlspecialchars($result);
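A self-contained sketch of the lazy-quantifier version, showing how PREG_OFFSET_CAPTURE yields the position of the final > of the first closing tag. The variable names are illustrative, and the pattern is a slightly tightened variant (preg_quote() for the interpolated parts, [^>]* for the rest of the opening tag, the s flag instead of [\s\S]), not the exact pattern above.

```php
<?php
$htmlString = "<section id=\"{{ALViewElement_resume}}\">\n"
            . "<!--{{RESUME_ADD_CHANGE_PIECE}}-->\n"
            . "<!--{{RESUME}}-->\n"
            . "</section>\n<div>\n</div>\n<section>\nHEY THIS IS ME\n</section>";

$htmlElementType = 'section';
$elementId = 'resume';

// Escape the interpolated parts so characters such as { and } are matched literally.
$tag = preg_quote($htmlElementType, '#');
$id  = preg_quote('{{ALViewElement_' . $elementId . '}}', '#');

// .*? is lazy, so the match ends at the first closing tag, not the last one.
$pattern = '#<' . $tag . '\s+id\s*=\s*"' . $id . '"[^>]*>.*?</' . $tag . '(>)#is';

$closingOffset = null;
$matched = null;
if (preg_match($pattern, $htmlString, $m, PREG_OFFSET_CAPTURE)) {
    $matched = $m[0][0];       // text of the whole match
    $closingOffset = $m[1][1]; // byte offset of the captured ">"
}
var_dump($closingOffset);
```

A lazy match still breaks if the target element contains a nested element of the same type, which is one reason the DOM approach is more robust.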

PHP - extract data from a web page HTML

I need to extract the words FIESTA ERASMUS and /event/83318 from the following HTML code:
<div id="tab-soiree" class=""><div class="soireeagenda cat_1">
<img src="http://www.parisbouge.com/img/fly/resize/100/83318.jpg" alt="fiesta erasmus" class="fly">
<ul>
<li class="nom"><h2>FIESTA ERASMUS </h2></li>
<li class="genre" style="margin-bottom:4px;">
soirée étudiante </li>
<li class="lieu">Duplex</li> <li class="musique">house, electro, r&b chic, latino, disco</li>
<li class="pass-label">pass</li> </ul>
<img src="/img/salles/resize/50/10.jpg" alt="duplex" class="flysalle">
<hr class="clearleft">
</div>
I tested something like this
$PATTERN = "/\<div id="tab-soiree".*(.*)/"
preg_match($PATTERN, $html, $matches);
but it doesn't work.

You don't parse HTML with Regular Expressions. Instead, use the built-in DOM parsing tools within PHP itself: http://php.net/manual/en/book.dom.php
Assuming your HTML is accessible from a variable named $html:
$doc = new DOMDocument();
$doc->loadHTML( $html );
$item = $doc->getElementsByTagName("li")->item(0);
$link = $item->getElementsByTagName("a")->item(0);
echo $link->attributes->getNamedItem('href')->nodeValue;
echo $link->textContent;
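For the name specifically, a minimal XPath sketch against a trimmed copy of the markup from the question. The excerpt shown in the question does not include the element that carries /event/83318, so this sketch only extracts the event name; the variable names are illustrative.

```php
<?php
// Trimmed copy of the markup from the question.
$html = '<div id="tab-soiree" class=""><div class="soireeagenda cat_1">
<ul>
<li class="nom"><h2>FIESTA ERASMUS </h2></li>
<li class="lieu">Duplex</li>
</ul>
</div></div>';

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// The <h2> inside the <li class="nom"> of the tab-soiree block.
$h2 = $xpath->query('//div[@id="tab-soiree"]//li[@class="nom"]/h2')->item(0);

// trim() drops the trailing space inside the <h2>.
$name = ($h2 !== null) ? trim($h2->textContent) : null;
echo $name, "\n";
```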

I suggest the following pattern:
$PATTERN = '%<h2>(.*?)[\s]+</h2>%i';
preg_match($PATTERN, $html, $matches);
The (.*?) part is a non-greedy pattern, which means that the regex engine won't go all the way to the end of the supplied string but will stop before the whitespace that precedes the first </h2> in this case.
You may also want to pre-process the HTML before applying the regex, e.g. remove all line breaks in order to get rid of the [\s]+ part.

PHP preg_match_all - group without returning a match

How would I get content from HTML between h3 tags inside an element that has class pricebox? For example, the following string fragment
<!-- snip a lot of other html content -->
<div class="pricebox">
<div class="misc_info">Some misc info</div>
<h3>599.99</h3>
</div>
<!-- snip a lot of other html content -->
The catch is 599.99 has to be the first match returned, that is if the function call is
preg_match_all($regex,$string,$matches)
the 599.99 has to be in $matches[0][1] (because I use the same script to get numbers from dissimilar looking strings with different $regex - the script looks for the first match).

Try using XPath; definitely NOT RegEx.
Code :
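$html = new DOMDocument();
// The @ suppresses warnings raised for malformed markup.
@$html->loadHTMLFile('http://www.path.to/your_html_file_html');
$xpath = new DOMXPath( $html );
// Select each <h3> that is a direct child of a <div class="pricebox">.
$nodes = $xpath->query("//div[@class='pricebox']/h3");
foreach ($nodes as $node)
{
    echo $node->nodeValue, "\n";
}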
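The same query can be tried on a string instead of a remote file. A minimal sketch, assuming the fragment from the question is held in $string; only the first matching node is read, which mirrors the "first match" requirement.

```php
<?php
// Fragment from the question, without the surrounding page.
$string = '<div class="pricebox">
<div class="misc_info">Some misc info</div>
<h3>599.99</h3>
</div>';

libxml_use_internal_errors(true); // collect parser warnings silently
$doc = new DOMDocument();
$doc->loadHTML($string);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// item(0) returns the first matching <h3>, or null when nothing matches.
$node = $xpath->query("//div[@class='pricebox']/h3")->item(0);
$price = ($node !== null) ? $node->nodeValue : null;

echo $price, "\n";
```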

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
