Regex for selective stripping of HTML


I'm trying to parse some HTML with PHP as an exercise, outputting it as just text, and I've hit a snag. I'd like to remove any tags that are hidden with style="display: none;" - bearing in mind that the tag may contain other attributes and style properties.
The code I have so far is this:
$page = preg_replace("#<([a-z]+).*?style=\".*?display:\s*none[^>]*>.*?</\1>#s","",$page);
This returns NULL with a PREG_BACKTRACK_LIMIT_ERROR.
I tried this instead:
$page = preg_replace("#<([a-z]+)[^>]*?style=\"[^\"]*?display:\s*none[^>]*>.*?</\1>#s","",$page);
But now it's just not replacing any tags.
Any help would be much appreciated. Thanks!

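Using DOMDocument, you can try something like this:
$doc = new DOMDocument;
libxml_use_internal_errors(true); // tolerate imperfect real-world markup
$doc->loadHTMLFile("foo.html");
// Copy the live node list into an array so removals don't disturb the iteration
$nodes = iterator_to_array($doc->getElementsByTagName('*'));
foreach ($nodes as $node) {
    // Match "display:none" with or without whitespace after the colon
    if (preg_match('/display:\s*none/i', $node->getAttribute('style'))) {
        // A node must be removed through its own parent, not through the document
        $node->parentNode->removeChild($node);
    }
}
$doc->saveHTMLFile("foo.html");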
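The same idea can also be expressed as a single XPath query. The following is only a sketch: the function name removeHiddenElements is made up for illustration, and it assumes the hiding is done through an inline style attribute (elements hidden by a stylesheet rule or a class are not detected).

```php
<?php
// Hypothetical helper: removes every element hidden by an inline "display: none".
function removeHiddenElements(string $html): string
{
    libxml_use_internal_errors(true); // silence warnings on imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // translate() lowercases A-Z and drops spaces (the space has no counterpart
    // in the third argument), so "Display : NONE" is matched as "display:none".
    $query = "//*[contains(translate(@style, "
           . "'ABCDEFGHIJKLMNOPQRSTUVWXYZ ', 'abcdefghijklmnopqrstuvwxyz'), "
           . "'display:none')]";

    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $node->parentNode->removeChild($node);
    }
    return $doc->saveHTML();
}
```

Because the match is a plain substring test on the style attribute, it is deliberately loose; a stricter check would need a real CSS parser.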

You should never parse HTML with regex. That makes your eyes bleed. HTML is not a regular language in any form, so it should be parsed with a DOM parser.
See: Parse HTML to DOM with PHP

Related

Get "Text-Only" Text With PHP Strip Tags

I'm using PHP Simple HTML DOM Parser, so you can use it in solutions.
Okay. So, I'm loading a file like this:
$html = file_get_html('http://localhost/seo/testfile.php');
And I output the text with echo strip_tags($html);
So far, so good.
The problem occurs when a user enters inline code like
<script>alert(1)</script>
So I don't want to display anything inside <script>, <style>, etc. tags. How do I do that?
Cheers!

I think PHP's DOM extension will help: you can get the content of any element you need and, indirectly, of the whole page. For example:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($content);
$data = $dom->getElementsByTagName("tr");
foreach($data as $value){
if($value->getAttribute('class')== 'notesRow'){
$aa = $value->nodeValue;
}
}
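For the specific goal in the question (text only, with nothing from <script> or <style>), a DOM-based sketch could look like the following. The function name visibleText is hypothetical, and the list of tags to drop is an assumption you would extend as needed.

```php
<?php
// Hypothetical helper: returns the page text without script/style contents.
function visibleText(string $html): string
{
    libxml_use_internal_errors(true); // silence warnings on imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    foreach (['script', 'style'] as $tag) {
        // Snapshot the live list first, because removing nodes shrinks it
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // textContent concatenates all remaining text nodes, tags stripped
    $root = $doc->documentElement;
    return $root === null ? '' : trim($root->textContent);
}
```

Unlike strip_tags, which removes only the tags and leaves their contents behind, this drops the script and style bodies together with the elements.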

php dom document remove some html tags but keep inner tags and text

I need to remove some tags (e.g. <div></div>) in HTML document and keep inner tags and text.
I managed to do that with Simple HTML Dom Parser. But it can't process big files due to huge memory requirements.
I would prefer to use native PHP tools like DOMDocument because I read that it's more optimized and quicker at processing HTML documents.
But I'm stuck at the first stage: how to remove some tags while keeping their inner text and tags.
Source HTML sample is:
<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>
I tried this code:
$htmltext='<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
$oldnodes = $doc->getElementsByTagName('div');
foreach ($oldnodes as $node) {
$fragment = $doc->createDocumentFragment();
while($node->childNodes->length > 0) {
$fragment->appendChild($node->childNodes->item(0));
}
$node->parentNode->replaceChild($fragment, $node);
}
echo $doc->saveHTML();
It produces the output:
<html><body>00000aaaaa<div>bbbbbbccc<a>link</a>cccdddddd</div>eeeee<div>1111</div></body></html>
I need the following:
<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>
Could someone please help me with proper code for the task?

You can use the strip_tags function in PHP.
$htmltext = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
echo strip_tags($htmltext, '<html><body><a>');
This removes all tags except html, body, and a.
The output is:
<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>
EDIT:
If the input comes from users, it's safer to whitelist the tags you allow than to blacklist the ones you don't.

If your code only contains simple HTML tags without any attributes you can keep it simple like:
$value = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
$pattern = '/<[\/]*(div|h1)>/';
$removedTags = preg_replace($pattern, '', $value);
Since you wrote in your comment that there are more than just div tags you want to remove, I added an h1 tag to the pattern in case you also want to remove h1 tags.
This snippet only handles simple markup (tags without attributes), but it fits your HTML input and output example.

Try this: just replace the foreach loop with the code below.
foreach ($oldnodes as $node) {
$children = $node->childNodes;
$string = "";
foreach($children as $child) {
$childString = $doc->saveXML($child);
$string = $string."".$childString;
}
$fragment = $doc->createDocumentFragment();
$fragment->appendXML($string);
$node->parentNode->insertBefore($fragment,$node);
$node->parentNode->removeChild($node);
}

I found a way to make it work.
The code in the question fails because getElementsByTagName returns a live node list: manipulating the nodes while iterating changes the list itself. As a result, foreach visited only 2 of the 4 items in the list and skipped the other 2.
So I process only the first element of the list, then rebuild the list, and repeat until no items are left.
The code is:
$htmltext='<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
echo "<!--
".$htmltext."
-->
";
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
$oldnodes = $doc->getElementsByTagName('div');
while ($oldnodes->length>0){
$node=$oldnodes->item(0);
$fragment = $doc->createDocumentFragment();
while($node->childNodes->length > 0) {
$fragment->appendChild($node->childNodes->item(0));
}
$node->parentNode->replaceChild($fragment, $node);
$oldnodes = $doc->getElementsByTagName('div');
}
echo $doc->saveHTML();
I hope this helps anyone who runs into the same difficulty.
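An alternative to rebuilding the list on every pass is to copy the live list into a plain array once, before changing anything. This is a sketch under that assumption; the function name unwrapTags is made up for illustration.

```php
<?php
// Hypothetical helper: removes every <$tagName> element but keeps its children.
function unwrapTags(string $html, string $tagName): string
{
    libxml_use_internal_errors(true); // silence warnings on imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // iterator_to_array() takes a snapshot, so later DOM changes
    // cannot make the loop skip elements.
    $nodes = iterator_to_array($doc->getElementsByTagName($tagName));

    foreach ($nodes as $node) {
        // Move each child in front of the element, preserving order...
        while ($node->firstChild !== null) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        // ...then drop the now-empty element.
        $node->parentNode->removeChild($node);
    }
    return $doc->saveHTML();
}

echo unwrapTags('<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>', 'div');
```

Nested elements are handled because an inner div is still in the snapshot after its parent has been unwrapped. Note that saveHTML() also emits a DOCTYPE line before the <html> element.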

How to replace all ul and li tag with div using PHP Simple HTML DOM Parser?

OK, I want to create a "website mobilizer" using PHP Simple HTML DOM Parser. At this stage, I want to:
change all 'ul' and 'li' tags to 'div' tags, and
change all 'table' elements (e.g. table, tr, td, th) to div.
I tried a workaround for the first problem in the following way:
$html = new simple_html_dom();
$html->load_file($sourceurl);
$div="div";
foreach ($html->find('ul') as $element) {
$element=$div;
}
It does seem crude, but I can't find any other solution. I've been advised against using preg_match, though I don't know whether it could give me the desired output. Any help will be appreciated.

It is possible:
$html = new simple_html_dom();
$html->load_file($sourceurl);
foreach ($html->find('ul') as $element) {
    // Assign to outertext so the <ul> tag itself is replaced, not just its contents
    $element->outertext = "<div>".$element->innertext."</div>";
}
Of course you can do the same with table.
More in the documentation: Simple HTML DOM Parser Manual

$html = new simple_html_dom();
$html->load_file($sourceurl);
$replace = "ul,li,table,tr,td,th";
foreach ($html->find($replace) as $key => $element) {
    $html->find($replace, $key)->outertext = "<div>".$element->innertext."</div>";
}
This replaces every element matched by the $replace selector list with a <div> in the $html DOM without changing the contents of those tags. Everything is stored in the $html DOM.
As you can see, you can't use $element to change anything in $html, even when using $element as a reference, so you have to access $html directly.

select tag in php and get href

I want to get all links on a page with the class "page1" in PHP.
The equivalent code in jQuery:
$("a#page1").each(function()
{
});
Can I do that in PHP?
$pattern = '`.*?((http|ftp)://[\w#$&+,\/:;=?@%.-]+)[^\w#$&+,\/:;=?@%.-]*?`i';
preg_match_all($pattern,$page_g,$matches);
This code gets every href in $page_g, but it doesn't filter by class="page1".
I only want the hrefs in $page_g that have class="page1".
Can you help me improve the regular expression, or suggest another way?
For example:
$page_g="the <strong>office</strong> us s01 05 xvid mu asd";
I want it to return only /?s=cache:16001429:office+s01e02
Thanks

You lack the expertise to use a regular expression for that, which is why DOMDocument is the advisable solution here. If you want a simpler API, use one of the jQuery lookalikes, phpQuery or QueryPath:
$link = qp($html)->find("a#page1")->attr("href");
print $link;

Edit: updated since you clarified the question.
To get all <a> links with the class .page1:
// Load the HTML from a file
$your_HTML_string = file_get_contents("html_filename.html");
$doc = new DOMDocument();
$doc->loadHTML($your_HTML_string);
// Then select all <a> tags
$a_links = $doc->getElementsByTagName("a");
foreach ($a_links as $link) {
    // If links can have more than one class,
    // use (strpos($link->getAttribute("class"), "page1") !== false)
    // instead of == "page1" (strpos returns false when there is no match)
    if ($link->getAttribute("class") == "page1") {
        // do something
    }
}

Use DOMDocument to parse the HTML page. Here's a tutorial:
Tutorial

DOM is preferable here: a regex is difficult to maintain if the underlying HTML changes, while DOM can cope with invalid HTML and gives you access to other HTML parsing tools.
So, assuming you have a file that contains HTML and you are searching by class, this could be the way to go:
$doc = new DOMDocument;
$doc->loadHTMLFile(PATH_TO_YOUR_FILE); // load() expects well-formed XML; loadHTMLFile() parses HTML
// We use XPath to find every <a> containing your class, since a tag can have more than one class and XPath makes that easier.
$xpath = new DOMXPath($doc);
$list = $xpath->query("//a[contains(@class, 'page1')]");
foreach ($list as $a_tag) {
    $href = $a_tag->getAttribute('href');
    // do something
}
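One caveat with a plain contains() test on the class attribute is that 'page1' also matches classes such as 'page10'. A common XPath 1.0 idiom pads the attribute with spaces to match whole class names only. The sketch below assumes the HTML is already in a string and that $class is a simple token without quotes; the function name hrefsByClass is hypothetical.

```php
<?php
// Hypothetical helper: collects the href of every <a> carrying exactly the given class.
function hrefsByClass(string $html, string $class): array
{
    libxml_use_internal_errors(true); // silence warnings on imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Surrounding both sides with spaces makes " page1 " match only whole tokens.
    // Assumption: $class contains no quote characters (it is not escaped here).
    $query = sprintf(
        "//a[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );

    $hrefs = [];
    foreach ($xpath->query($query) as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}
```

For example, given links with class="page1 x", class="page10" and class="page1", only the first and third hrefs are returned.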

Add an attribute to an HTML element

I can't quite figure it out, I'm looking for some code that will add an attribute to an HTML element.
For example, let's say I have a string with an <a> in it, and that <a> needs an attribute added to it, so <a> gets style="xxxx:yyyy;" added. How would you go about doing this?
Ideally it would add any attribute to any tag.

It's been said a million times: don't use regexes for HTML parsing.
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed markup
$x = new DOMXPath($dom);
foreach ($x->query("//a") as $node) {
    $node->setAttribute("style", "xxxx:yyyy;");
}
$newHtml = $dom->saveHTML();
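Since the question asks for something that can add any attribute to any tag, the DOM approach can be wrapped in a small function. This is a sketch only: the name addAttribute and its parameters are made up for illustration, and it overwrites an existing attribute of the same name.

```php
<?php
// Hypothetical helper: sets $name="$value" on every <$tag> element in $html.
function addAttribute(string $html, string $tag, string $name, string $value): string
{
    libxml_use_internal_errors(true); // silence warnings on imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // setAttribute() does not add or remove elements, so iterating
    // the live node list directly is safe here.
    foreach ($doc->getElementsByTagName($tag) as $node) {
        $node->setAttribute($name, $value);
    }
    return $doc->saveHTML();
}

echo addAttribute('<p>See <a href="/x">this link</a>.</p>', 'a', 'style', 'color:red;');
```

Note that loadHTML() wraps a fragment in <html> and <body> and saveHTML() adds a DOCTYPE, so the result is a full document rather than the original fragment.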

Here is a regex version:
$result = preg_replace('/(<a\b[^><]*)>/i', '$1 style="xxxx:yyyy;">', $str);
But a regex cannot reliably handle malformed HTML documents.
